High Accuracy Parsing of Name Internal Structure (HAPNIS)

A while ago I wanted a parser for identifying internal structure in people's names. I couldn't find one, so I started tagging data with the goal of building a tagger to do this. While tagging, I realized that the problem is very easy, so I wrote a short rule-based Perl script for it instead, called HAPNIS.

I took the 220 data points I annotated and randomly chose 100 of them for development data and 120 for test data. After a few rounds of development, the script scored 100% on the development data. I then ran it on the test data, where it scores 99.1% accuracy; the two errors it makes are:
Truth:  Queen_Role Latifah_Surname
  Hyp:  Queen_Forename Latifah_Surname

Truth:  Lee_Forename Ann_Continue Womack_Surname
  Hyp:  Lee_Forename Ann_Middle Womack_Surname
The first error would be trivial to fix, but changing the rules after looking at the test data would be cheating. The second error would be a little harder, but given context (i.e., the document source), errors of this kind could probably be fixed, too.
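For readers who want to score their own output, here is a minimal Python sketch of a scorer for the Token_Tag format shown in the error listing above. It is not the scoring script offered for download: the function names are made up for illustration, and counting accuracy per token (rather than per name) is an assumption.

```python
def parse_tagged(line):
    """Split 'Queen_Role Latifah_Surname' into [('Queen', 'Role'), ('Latifah', 'Surname')]."""
    pairs = []
    for item in line.split():
        # Split on the LAST underscore so tokens containing '_' stay intact.
        token, _, tag = item.rpartition("_")
        pairs.append((token, tag))
    return pairs


def token_accuracy(truth_lines, hyp_lines):
    """Fraction of tokens whose hypothesized tag matches the true tag (assumed metric)."""
    correct = 0
    total = 0
    for truth, hyp in zip(truth_lines, hyp_lines):
        t_pairs = parse_tagged(truth)
        h_pairs = parse_tagged(hyp)
        if len(t_pairs) != len(h_pairs):
            raise ValueError("token count mismatch: %r vs %r" % (truth, hyp))
        for (_, t_tag), (_, h_tag) in zip(t_pairs, h_pairs):
            total += 1
            if t_tag == h_tag:
                correct += 1
    return correct / total if total else 0.0


# The two test errors from the listing above, used here as toy input.
truth = ["Queen_Role Latifah_Surname", "Lee_Forename Ann_Continue Womack_Surname"]
hyp = ["Queen_Forename Latifah_Surname", "Lee_Forename Ann_Middle Womack_Surname"]
print(token_accuracy(truth, hyp))  # 3 of 5 tags match -> 0.6
```

On just these two names the sketch reports 0.6, since each name has exactly one wrong tag out of five tokens in total.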

The script has a single option, "-names", which toggles whether a list of common first (but not last) names is used to help disambiguate single-token names. Without this option, the system scores 96.8% on the test data.
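To illustrate the idea behind this option, here is a small Python sketch of how a first-name list might disambiguate a single-token name. It is only a guess at the general approach, not the actual HAPNIS rule: the name list, the function name and the use_names parameter are all hypothetical, and the default of Surname simply mirrors the bias of the news data.

```python
# Hypothetical stand-in for the list of common first names;
# the real list used by the script is not reproduced here.
COMMON_FIRST_NAMES = {"john", "mary", "lee"}


def tag_single_token(token, use_names=False):
    """Tag a one-word name as Forename or Surname.

    Without the name list, every single-token name is called a Surname
    (an assumed default). With it, tokens found in the list become Forename.
    """
    if use_names and token.lower() in COMMON_FIRST_NAMES:
        return "Forename"
    return "Surname"


print(tag_single_token("Mary"))                     # Surname
print(tag_single_token("Mary", use_names=True))     # Forename
print(tag_single_token("Womack", use_names=True))   # Surname
```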

The tag set I use is:

You can download the development data, the test data, the scoring script and the HAPNIS Perl script (you will have to rename these *.pl; unfortunately my web server doesn't like to serve up .pl files). If you find this useful, or find serious bugs, please email me at . Note that this is developed based on the ACE 2004 training data, and is mostly based on news, so it is biased toward calling single-word entries surnames, rather than first names.