High Accuracy Parsing of Name Internal Structure (HAPNIS)

A while ago I wanted a parser for identifying the internal structure of people's names. I couldn't find one, so I started tagging data with the goal of building a tagger to do this. While tagging, I realized that the problem is really quite easy, so I wrote a short rule-based Perl script for it, called HAPNIS. I took the 220 data points I had annotated and randomly chose 100 of them as development data and 120 as test data. After a few rounds of development, I got 100% on the development data. I then ran it on the test data, where I get 99.1% accuracy; the two errors I make are:

Truth:  Queen_Role Latifah_Surname
  Hyp:  Queen_Forename Latifah_Surname

Truth:  Lee_Forename Ann_Continue Womack_Surname
  Hyp:  Lee_Forename Ann_Middle Womack_Surname

The first error would be trivial to fix, but I didn't want to cheat. The second error would be a little harder, but given context (i.e., the document source), you could probably fix errors of this kind, too.
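To make the Truth/Hyp comparison concrete, here is a minimal scoring sketch in Python. It is not the scoring script offered for download; it assumes that each line is a space-separated sequence of Token_Tag items and that accuracy is counted per token, which this page does not actually state. The function names are made up for illustration.

```python
def parse_tagged(line):
    """Split 'Queen_Role Latifah_Surname' into [('Queen', 'Role'), ('Latifah', 'Surname')]."""
    pairs = []
    for item in line.split():
        # Split on the LAST underscore, so a token containing '_' keeps its text.
        token, _, tag = item.rpartition("_")
        pairs.append((token, tag))
    return pairs


def token_accuracy(truth_lines, hyp_lines):
    """Fraction of tokens whose hypothesized tag equals the true tag (assumed metric)."""
    correct = total = 0
    for truth, hyp in zip(truth_lines, hyp_lines):
        t_pairs, h_pairs = parse_tagged(truth), parse_tagged(hyp)
        if len(t_pairs) != len(h_pairs):
            raise ValueError(f"token count mismatch: {truth!r} vs {hyp!r}")
        for (_, t_tag), (_, h_tag) in zip(t_pairs, h_pairs):
            correct += t_tag == h_tag
            total += 1
    return correct / total if total else 0.0


truth = ["Queen_Role Latifah_Surname", "Lee_Forename Ann_Continue Womack_Surname"]
hyp = ["Queen_Forename Latifah_Surname", "Lee_Forename Ann_Middle Womack_Surname"]
print(token_accuracy(truth, hyp))  # 3 of the 5 tokens match -> 0.6
```

Run on just the two error cases above, this gives 0.6, since each of the two names has exactly one wrongly tagged token.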

The script has a single option, "-names", which toggles whether a list of common first (but not last) names is used to help disambiguate single-token names. Without this option, the system scores 96.8% on the test data.
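As a rough illustration of what such a name list buys you, here is a small Python sketch. It is a guess at the idea, not the logic of the Perl script: the `COMMON_FIRST_NAMES` set, the `tag_single_token` function and the `use_names` flag are all hypothetical, and the default of Surname simply mirrors the bias toward surnames for single-word entries.

```python
# Hypothetical stand-in for a list of common first names; the real list is not shown here.
COMMON_FIRST_NAMES = {"john", "mary", "lee", "ann"}


def tag_single_token(token, use_names=False):
    """Tag a one-token name as Forename or Surname.

    With use_names=False every single token is called a Surname.
    With use_names=True a token found in the first-name list becomes a Forename.
    """
    if use_names and token.lower() in COMMON_FIRST_NAMES:
        return "Forename"
    return "Surname"


for name in ["Mary", "Womack"]:
    print(name, tag_single_token(name), tag_single_token(name, use_names=True))
```

With the flag off, both "Mary" and "Womack" come out as Surname; with it on, "Mary" becomes a Forename because it appears in the (made-up) list.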

The tag set I use is:

Surname: Last (family) names.
Forename: Given name.
Middle: Given middle names (i.e., not first names).
Link: A link between two names of the same kind. Used for conjoined names, and Arabic names like "Al - Jones", where the "-" will be tagged as a link.
Role: Mr., Dr., etc.
Suffix: Name suffixes, like on my name, "III", "Jr", etc.
Continue: When a token is part of a multi-word name but is not a link, I use Continue. The only example is in the test data, where "Lee Ann" is really a single first name that just happens to contain a space.
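To show how far a few positional rules get with this tag set, here is a toy tagger in Python. These are not the rules in the HAPNIS script; the `ROLES` and `SUFFIXES` word lists and the `tag_name` function are invented for this sketch, and it never produces Link or Continue.

```python
# Hypothetical word lists, chosen only for this example.
ROLES = {"mr.", "mrs.", "ms.", "dr."}
SUFFIXES = {"jr", "jr.", "sr", "sr.", "ii", "iii", "iv"}


def tag_name(name):
    """Return (token, tag) pairs for a whitespace-tokenized name using positional rules."""
    tokens = name.split()
    tags = [None] * len(tokens)
    start, end = 0, len(tokens)

    # Leading role words (always leave at least one token for the name itself).
    while start < end - 1 and tokens[start].lower() in ROLES:
        tags[start] = "Role"
        start += 1

    # Trailing suffixes (again leaving at least one core token).
    while end - 1 > start and tokens[end - 1].lower() in SUFFIXES:
        end -= 1
        tags[end] = "Suffix"

    # Core tokens: last is the Surname, first is the Forename, the rest are Middle.
    for i in range(start, end):
        if i == end - 1:
            tags[i] = "Surname"
        elif i == start:
            tags[i] = "Forename"
        else:
            tags[i] = "Middle"
    return list(zip(tokens, tags))


def format_tagged(pairs):
    """Render pairs in the Token_Tag format used in the examples above."""
    return " ".join(f"{token}_{tag}" for token, tag in pairs)


print(format_tagged(tag_name("Dr. John Quincy Adams Jr")))
print(format_tagged(tag_name("Lee Ann Womack")))
```

The second call prints "Lee_Forename Ann_Middle Womack_Surname": purely positional rules have no way of knowing that "Lee Ann" is one first name, which is exactly the situation the Continue tag is meant for.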

You can download the development data, the test data, the scoring script and the HAPNIS Perl script (you will have to rename these to *.pl; unfortunately my web server doesn't like to serve up .pl files). If you find this useful, or find serious bugs, please email me at . Note that HAPNIS was developed on the ACE 2004 training data, which is mostly news, so it is biased toward calling single-word entries surnames rather than first names.