Customizing GATE grammars
GATE is a tool for natural language processing of unstructured text. It includes an information extraction system called ANNIE, which provides a basic set of annotations over documents, including word tokens and parts of speech, and which can also recognize entities such as locations, cities and people.
For example, GATE can identify a piece of text as referring to a particular named individual (a Person). That text is then highlighted in the marked-up document to show how GATE has annotated it.
Let’s say that the name Nick appears in a document. To identify that this piece of text refers to a Person, GATE consults its own internal gazetteer of male and female first names. The word token Nick is found in the male first name list, so GATE annotates the text Nick to say that it refers to a particular ‘male’ individual named Nick.
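The lookup described above can be pictured with a small sketch. This is not GATE code: the two sets are hypothetical stand-ins for GATE's first-name lists, the function name is made up for illustration, and the case-insensitive match is an assumption chosen only to keep the example short.

```python
# Hypothetical, tiny stand-ins for GATE's male and female first-name lists.
MALE_FIRST_NAMES = {"nick", "john"}
FEMALE_FIRST_NAMES = {"page", "mary"}


def lookup_first_name(token):
    """Return 'male', 'female', or None for a single word token."""
    key = token.lower()  # assumption: case is ignored when matching
    if key in MALE_FIRST_NAMES:
        return "male"
    if key in FEMALE_FIRST_NAMES:
        return "female"
    return None  # not a known first name, so no Person annotation
```

Calling lookup_first_name("Nick") finds the token in the male list, which is the point at which GATE would attach a Person annotation with a ‘male’ gender.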
But how do we best deal with false positives? A classic example is the word PAGE (which in my case was simply the page number in the footer of a Word document). GATE marked this up as a Person because it identified Page as a woman’s first name.
Here is an idea:
Create a local ‘negative’ list of names (a gazetteer) and an associated JAPE grammar rule that would override (i.e. have a higher priority than) the corresponding GATE rule. This new grammar would still use the existing gazetteer of names supplied by GATE, but if a particular ‘name’ were found on the local ‘negative’ list, the rule would not create a Person annotation for it. Over time we could add to this negative list as we found more false positives (text incorrectly annotated and classified by a GATE grammar).
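To make the intended behaviour concrete, here is a minimal sketch of the negative-list logic in plain Python. It is only a model of what the JAPE rule would need to do, not a JAPE grammar and not GATE's API; the sample names, the function name and the case-insensitive match are all assumptions made for illustration.

```python
# Hypothetical sample data; the real name gazetteer is far larger.
FIRST_NAMES = {"nick": "male", "page": "female"}
# Local 'negative' list, extended as new false positives turn up.
NEGATIVE_LIST = {"page"}


def annotate_persons(tokens):
    """Return (index, token, gender) for tokens that should become Persons."""
    annotations = []
    for i, token in enumerate(tokens):
        key = token.lower()  # assumption: case is ignored when matching
        if key in NEGATIVE_LIST:
            continue  # the negative list takes priority over the name list
        gender = FIRST_NAMES.get(key)
        if gender is not None:
            annotations.append((i, token, gender))
    return annotations
```

With the tokens Nick, wrote, PAGE and 3, only Nick is annotated. PAGE is still present in the name list, but the negative list is checked first, which mirrors giving the local rule a higher priority.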
Another example of a local customization arises when we are dealing with sets of documents in a very confined context within a specific organization. We could create a gazetteer list of names (populated from the organization’s LDAP repository) containing the actual names of people in the organization. We would then create a local grammar (a JAPE file) with a higher priority than the one GATE uses to create the ‘Person’ annotation. This grammar would only add a Person annotation if the name existed in the local gazetteer.
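The same idea can be sketched the other way round, as an allow list. Again this is plain Python rather than JAPE, and everything in it is an assumption for illustration: the names are invented, the set is hard-coded where a real one would be exported from LDAP, and names are assumed to be exactly two words long.

```python
# Hypothetical names; in practice this set would be exported from LDAP.
ORG_NAMES = {"nick jones", "mary smith"}


def annotate_org_persons(text):
    """Return (word_index, name) for each organization name found in text."""
    words = text.split()
    found = []
    for i in range(len(words) - 1):
        # Join each adjacent word pair and trim punctuation from the ends.
        candidate = f"{words[i]} {words[i + 1]}".strip(".,;:")
        if candidate.lower() in ORG_NAMES:
            found.append((i, candidate))
    return found
```

Here a token such as PAGE can never become a Person, because nothing outside the organization list is annotated. The trade-off is that people who are missing from the list are not annotated either.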