Archive for the ‘Natural Language Processing’ Category

Linked Data and the SOA Software Development Process

Thursday, November 17th, 2011

We have quite a rigorous SOA software development process; however, the full value of the collected information is not being realized because the artifacts are stored in disconnected information silos. So far, attempts to introduce tools that could improve the situation (e.g. zAgile Teamwork and Semantic MediaWiki) have been unsuccessful, possibly because the value of a Linked Data approach is not yet fully appreciated.

To provide an example Linked Data view of the SOA services and their associated artifacts, I created a prototype consisting of Sesame running on a Tomcat server, with Pubby providing the Linked Data view via the Sesame SPARQL endpoint. TopBraid was connected directly to the Sesame native store (configured via the Sesame Workbench) to create a subset of services sufficient to demonstrate the value of publishing information as Linked Data. In particular, the prototype showed how easy it became to navigate from the requirements for a SOA service through to the details of its implementation.

The prototype also highlighted that automatic generation of the RDF graph (the data providing the Linked Data view) from the actual source artifacts would be preferable to manual entry, especially if this could be transparently integrated with the current software development process. This has become the focus of the next step: automated knowledge extraction from the source artifacts.


Key artifact types of our process include:

A Graph of Concepts and Instances

There is a rich graph of relationships linking the things described in the artifacts listed above. For example, the business entities defined in the UML analysis model are the subject of the services and service operations defined in the Service Contracts. The services and service operations are mapped to the WSDLs, which use the XML Schemas that provide an XML view of the business entities. The JAX-WS implementations are linked to the WSDLs and XML Schemas and deployed to the Oracle WebLogic Application Server, where the configuration files list the external dependencies. The log files and defects link back to specific parts of the code base (Subversion revisions) within the context of specific service operations. The people associated with the different artifacts can often be determined from artifact metadata.

RDF, OWL and Linked Data are a natural fit for modelling and viewing this graph, since there is a mix of concepts plus a lot of instances, many of which already have an HTTP representation. The graph also contains a number of transitive relationships (for example, a WSDL may import an XML Schema which in turn imports another XML Schema, and so on), which suggests using owl:TransitiveProperty to help obtain a full picture of all the dependencies a component may have.
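
To make the transitive case concrete, here is a minimal Python sketch of what a reasoner would effectively infer for a transitive "imports" property: starting from the direct imports, it collects everything reachable through one or more steps. The artifact names and the `DIRECT_IMPORTS` table are made up for illustration and are not taken from our code base.

```python
from collections import deque

# Hypothetical direct "imports" edges between artifacts (illustrative names only).
DIRECT_IMPORTS = {
    "OrderService.wsdl": {"Order.xsd"},
    "Order.xsd": {"Customer.xsd", "CommonTypes.xsd"},
    "Customer.xsd": {"CommonTypes.xsd"},
    "CommonTypes.xsd": set(),
}


def all_dependencies(artifact, direct_imports):
    """Return every artifact reachable from `artifact` via one or more imports."""
    seen = set()
    queue = deque(direct_imports.get(artifact, ()))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited, which also guards against import cycles
        seen.add(current)
        queue.extend(direct_imports.get(current, ()))
    return seen


if __name__ == "__main__":
    print(sorted(all_dependencies("OrderService.wsdl", DIRECT_IMPORTS)))
```

With an OWL reasoner the same closure would come from declaring the property transitive rather than from hand-written traversal code; the sketch only shows the result we are after.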

Knowledge Extraction

Another advantage of the RDF, OWL and Linked Data approach is the use of unique URIs for identifying concepts and instances. This allows information contained in one artifact, e.g. a WSDL, to be extracted as RDF triples which can later be combined with the RDF triples extracted from the JAX-WS annotations in the Java source code. The combined RDF triples tell us more about the WSDL and its Java implementation than could be derived from just one of the artifacts.
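
A small Python sketch of why shared URIs matter: triples extracted separately from a WSDL and from a Java class are modelled as plain tuples and merged with a set union. All URIs and the `ex:` predicates are invented for the example; a real extraction would use whatever vocabulary the project settles on.

```python
# Illustrative URIs and predicates only; none of these come from the real project.
WSDL = "http://example.org/wsdl/OrderService"
JAVA = "http://example.org/java/OrderServiceImpl"

# Triples as they might be extracted from the WSDL alone.
triples_from_wsdl = {
    (WSDL, "rdf:type", "ex:Wsdl"),
    (WSDL, "ex:definesOperation", "ex:getOrder"),
}

# Triples as they might be extracted from the JAX-WS annotations alone.
triples_from_java = {
    (JAVA, "rdf:type", "ex:JaxWsImplementation"),
    (JAVA, "ex:implements", WSDL),
}


def merge(*graphs):
    """Union several sets of triples; identical URIs line up automatically."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def operations_implemented_by(java_class, graph):
    """Follow ex:implements to the WSDL, then ex:definesOperation to its operations."""
    wsdls = {o for s, p, o in graph if s == java_class and p == "ex:implements"}
    return {o for s, p, o in graph if s in wsdls and p == "ex:definesOperation"}


if __name__ == "__main__":
    combined = merge(triples_from_wsdl, triples_from_java)
    print(operations_implemented_by(JAVA, combined))
```

Neither set of triples can answer "which operations does this Java class implement?" on its own; the answer only appears once the two are combined on the shared WSDL URI.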

We have made some progress with knowledge extraction but this is still definitely a work in progress. Sites such as ConverterToRdf, RDFizers and the Virtuoso Sponger provide tools and information on generating RDF from different artifact types. Part of the current experimentation is around finding tools that can be transparently layered over the top of the current software development process. Finding the best way to extract the full set of desired RDF triples from Microsoft Word documents is also proving problematic since some natural language processing is required.

Tools currently being evaluated include:

The Benefits of Linked Data

The prototype showed the benefits of Linked Data for navigating from the requirements for a SOA service through to details of its implementation. Looking at all the information that could be extracted leads on to a broader view of the benefits Linked Data would bring to the SOA software development process.

One specific use being planned is the creation of a Service Registry application providing the following functionality:

  • Linking the services to the implementations running in a given environment, e.g. dev, test and production. This includes linking the specific versions of the requirement, design or implementation artifacts and detailing the runtime dependencies of each service implementation.
  • Listing the consumers of each service and providing summary statistics on the performance, e.g. daily usage figures derived from audit logs.
  • Providing a list of who to contact when a service is not available. This includes notifying consumers of a service outage and also contacting providers if a service is being affected by an external component being offline, e.g. a database or an external web service.
  • Searching the services by different criteria, e.g. business entity.
  • Tracking the evolution of services and being able to assist with refactoring, e.g. answering questions such as “Are there older versions of the XML Schemas that can be deprecated?”
  • Simplifying the running of a specific soapUI test case for a service operation in a given environment.
  • Providing the equivalent of a class lookup that includes all project classes plus all required infrastructure classes and returns information such as the jar file the class is contained in, together with JIRA and Subversion information.

Customizing GATE grammars

Tuesday, July 19th, 2011

GATE is a tool for natural language processing of unstructured text. It includes an information extraction system called ANNIE, which provides a basic set of annotations over documents, including parts of speech and word tokens. It is also able to recognize locations, cities, people and so on.

For example, GATE can identify a bit of text as referencing a particular named individual (or Person). This bit of text is then highlighted in the marked up document, to show the way in which GATE has annotated it.

Let’s say that the name Nick appears in a document. To identify that this bit of text refers to a Person, GATE refers to its own internal gazetteer of male and female first names. The word token Nick is found in the male first name list, so GATE annotates the text Nick to say that it refers to a particular ‘male’ individual named Nick.

But how do we best deal with false positives? A classic example is the word PAGE (which in my case was simply the PAGE number in the footer of a Word document). This was marked up by GATE as a Person because it identified Page as a woman’s first name.

Here is an idea:

Create a local ‘negative’ list of names (a gazetteer) and an associated JAPE grammar rule that would override (i.e. have a higher priority than) the corresponding GATE rule. This new grammar would use the existing gazetteer of names supplied by GATE, but if a particular ‘name’ was found on the local ‘negative’ list the grammar rule would not create a Person type annotation for it. Over time we could add to this negative list as we found more false positives (text incorrectly annotated and classified by a GATE grammar).

Another example of a local customization would be if we were dealing with sets of documents in a very confined context within a specific organization. We could create a gazetteer list of names (populated from the organization’s LDAP repository) containing the actual names of people in the organization. We would then create a local grammar (JAPE file) with a higher priority than the one used by GATE to create the ‘Person’ annotation. This grammar would only add a Person annotation if the name existed in the local gazetteer.
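
Both customizations boil down to a filter applied on top of the first-name lookup. The sketch below is plain Python rather than a JAPE grammar, so it only illustrates the intended logic and is not how GATE itself is configured; the three name sets are tiny stand-ins for GATE's gazetteer, the local negative list and an LDAP-derived list.

```python
# Stand-ins for the real lists (all hypothetical, lower-cased for matching).
FIRST_NAMES = {"nick", "page", "mary"}   # GATE's supplied first-name gazetteer
NEGATIVE_NAMES = {"page"}                # local list of known false positives
ORG_NAMES = {"nick"}                     # names loaded from the organization's LDAP


def person_annotations(tokens, first_names, negative_names=frozenset(),
                       allowed_names=None):
    """Return (token_index, token) pairs that should be annotated as Person.

    A token qualifies if it is a known first name, is not on the negative
    list and, when an allow-list is given, also appears on that allow-list.
    """
    annotations = []
    for index, token in enumerate(tokens):
        key = token.lower()
        if key not in first_names:
            continue
        if key in negative_names:
            continue  # suppress known false positives such as PAGE
        if allowed_names is not None and key not in allowed_names:
            continue  # confined context: only people known to the organization
        annotations.append((index, token))
    return annotations


if __name__ == "__main__":
    tokens = "Nick wrote PAGE 3 for Mary".split()
    print(person_annotations(tokens, FIRST_NAMES))
    print(person_annotations(tokens, FIRST_NAMES, NEGATIVE_NAMES))
    print(person_annotations(tokens, FIRST_NAMES, allowed_names=ORG_NAMES))
```

In GATE the same effect would come from an extra gazetteer list plus a higher-priority JAPE rule, as described above.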

Using the Neon Toolkit and ANNIE to demonstrate extracting RDF from Natural Language

Sunday, July 10th, 2011

The Neon Toolkit is an open source ontology engineering environment providing an extensive set of plug-ins for various ontology engineering activities.

One such plugin is the GATE web services plugin which adds Natural Language Entity Recognition functionality from the GATE (General Architecture for Text Engineering) framework.

The GATE web services plugin can be quickly added to the Neon Toolkit by

  • opening the Help | Install New Software … menu option
  • selecting “NeOn Toolkit Update Site v2.4 –″ from the Work With drop down combo box.
  • and selecting GATE Web Services as shown below.

The GATE web services plugin includes ANNIE (Ontology Generation Based on Named Entity Recognition), which can be used to demonstrate basic Named Entity Recognition and ontology generation. The main GATE site provides more details on how ANNIE (a Nearly-New Information Extraction System) works.

After the GATE web services plugin has been installed, GATE Services appears as an additional top level menu option. Selecting GATE Services | Call Multi-document Web Service opens the Call GATE web service dialog box below, which provides the option to select ANNIE as the service to call.

Selecting ANNIE and Next invokes an additional dialog box where the Input directory: containing the documents to be processed and the Output ontology: can be specified.

Once the Input directory: and the Output ontology: have been specified and the Finish button selected, ANNIE reads the input and generates a basic ontology according to the concepts, instances and relations found in the text.

When the text below is provided as input ANNIE generates the following RDF output.

Input Text

Nick lives in Toronto and studies at Concordia University. Toronto is six hours from Montreal. Toronto is a nice place to live.

RDF Output

<?xml version="1.0" encoding="UTF-8"?>
<!-- All statement -->

<rdf:Description rdf:about="">
	<rdf:type rdf:resource=""/>
	<rdfs:label xml:lang="en">nick</rdfs:label>
	<rdfs:label xml:lang="en">Nick</rdfs:label>
</rdf:Description>

<rdf:Description rdf:about="">
	<rdf:type rdf:resource=""/>
	<rdfs:label xml:lang="en">montreal</rdfs:label>
	<rdfs:label xml:lang="en">Montreal</rdfs:label>
</rdf:Description>

<rdf:Description rdf:about="">
	<rdf:type rdf:resource=""/>
	<rdfs:label xml:lang="en">Location</rdfs:label>
</rdf:Description>

<rdf:Description rdf:about="">
	<rdf:type rdf:resource=""/>
	<rdfs:label xml:lang="en">concordia university</rdfs:label>
	<rdfs:label xml:lang="en">Concordia University</rdfs:label>
	<rdfs:label xml:lang="en">Concordia_University</rdfs:label>
</rdf:Description>

<rdf:Description rdf:about="">
	<rdf:type rdf:resource=""/>
	<rdfs:label xml:lang="en">Organization</rdfs:label>
</rdf:Description>

<rdf:Description rdf:about="">
	<rdf:type rdf:resource=""/>
	<rdfs:label xml:lang="en">Person</rdfs:label>
</rdf:Description>

<rdf:Description rdf:about="">
	<rdf:type rdf:resource=""/>
	<rdfs:label xml:lang="en">toronto</rdfs:label>
	<rdfs:label xml:lang="en">Toronto</rdfs:label>
</rdf:Description>

The entities recognized are:

  • Nick as a Person
  • Montreal and Toronto as Locations
  • Concordia University as an Organization.

While relatively simplistic, the overall example (the input text, the generated RDF output and the quick setup process for the Neon Toolkit and the GATE web services plugin) helped to demonstrate the potential of Named Entity Recognition and ontology generation.

The input text actually comes from a demo of the OwlExporter, which provides similar functionality for GATE itself. Longer term, GATE is likely to be part of a Natural Language Processing solution for a government department where the sensitivity of the private data would preclude the use of an external web service. Hopefully there will also be time later on to write up the results of using GATE and the OwlExporter with the same input text.

(For this article Neon Toolkit version 2.4.2 was used.)

Understanding the OpenCalais RDF Response

Saturday, September 26th, 2009

I’m using an XML version of an article published by Scoop in February 2000, Senior UN Officials Protest UN Sanctions On Iraq, to understand the OpenCalais RDF response as part of a larger project of linking extracted entities to existing Linked Data datasets.

OpenCalais uses natural language processing (NLP), machine learning and other methods to analyze content and return the entities it finds, such as the cities, countries and people with dereferenceable Linked Data style URIs. The entity types are defined in the OpenCalais RDF Schemas.

When I submit the content to the OpenCalais REST web service (using the default RDF response format), an RDF document is returned. Opened with TopBraid Composer, the response is shown below together with a portion of the input content and some of the entity types OpenCalais can detect. The numbers in brackets indicate how many instances of an entity type have been detected; for example, cle.Person(13) indicates that thirteen people have been detected.

The TopBraid Composer Instances tab contains the URIs of the people detected. Opening the highlighted URI reveals that it is for a person named Saddam Hussein.

Entity Disambiguation

One of the challenges when analyzing content and extracting entities is entity disambiguation. Can the person named Saddam Hussein be uniquely identified? Usually the context is needed in order to disambiguate similar entities. As described in the OpenCalais FAQ, if the “rdf:type rdf:resource” of a given entity contains /er/ the entity has been disambiguated by OpenCalais, while if it contains /em/ it has not.
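
The /er/ versus /em/ rule is easy to apply mechanically once the type URI of an entity is in hand. A minimal Python sketch, using made-up example.org URIs in place of real OpenCalais type URIs:

```python
def disambiguation_status(type_uri):
    """Classify an entity by the path segment in its rdf:type resource URI.

    Follows the rule described above: /er/ means disambiguated,
    /em/ means not disambiguated; anything else is reported as unknown.
    """
    if "/er/" in type_uri:
        return "disambiguated"
    if "/em/" in type_uri:
        return "not disambiguated"
    return "unknown"


if __name__ == "__main__":
    # Placeholder URIs for illustration only, not actual OpenCalais identifiers.
    for uri in ("http://example.org/type/er/Geo/Country",
                "http://example.org/type/em/e/Person"):
        print(uri, "->", disambiguation_status(uri))
```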

In the example above cle.Person is <>. There is no obvious link to an “rdf:type rdf:resource” containing /er/. It looks like OpenCalais has been able to determine that the text “Saddam Hussein” equates to a Person, but has not been able to determine specifically who that person is.

In contrast, Iraq (one of three countries detected) is shown below with the Incoming Reference

Opening the URI with either an HTML browser or an RDF browser (Tabulator, below) shows that the country has been disambiguated with <rdf:type rdf:resource=””/>.

Linked Data

In the RDF response returned by OpenCalais, neither Iraq nor “Saddam Hussein” was linked to other Linked Data datasets. Some OpenCalais entities are. For example, Moscow, Russia is linked via owl:sameAs to

Since I know that the context of the article is international news, I can safely add some owl:sameAs links to DBpedia, such as the following for “Saddam Hussein” (below) and Iraq.
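
Adding such a link amounts to asserting one extra triple per entity. The Python sketch below writes it out in N-Triples form; the OpenCalais entity URI is a placeholder (the real one is not reproduced here), while the owl:sameAs property URI and the DBpedia resource URI pattern are the standard ones.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_ntriple(subject_uri, object_uri):
    """Format an owl:sameAs statement between two URIs as one N-Triples line."""
    return f"<{subject_uri}> <{OWL_SAME_AS}> <{object_uri}> ."


if __name__ == "__main__":
    # Placeholder for the entity URI returned by OpenCalais (hypothetical).
    calais_entity = "http://example.org/opencalais/entity/saddam-hussein"
    print(same_as_ntriple(calais_entity,
                          "http://dbpedia.org/resource/Saddam_Hussein"))
```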

Entity Relevance

For both detected entities, “Saddam Hussein” and “Iraq”, OpenCalais provides an entity relevance score (shown for each in the screen shots below). The relevance capability detects the importance of each unique entity and assigns a relevance score in the range 0-1 (1 being the most relevant and important). From the screen shots it’s clear that “Iraq” has been ranked more relevant.

Detection Information

The RDF response includes the following properties relating to the subject’s detection:

  • c:docId: URI of the document this mention was detected in.
  • c:subject: URI of the unique entity.
  • c:detection: snippet of the input content where the metadata element was identified.
  • c:prefix: snippet of the input content that precedes the current instance.
  • c:exact: snippet of the input content in the matched portion of text.
  • c:suffix: snippet of the input content that follows the current instance.
  • c:offset: the character offset relative to the input content after it has been converted into XML.
  • c:length: length of the instance.

The screen shot below for Saddam Hussein provides an example of how these properties work.
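
The relationship between these properties can also be sketched in a few lines of Python: given the content, an offset and a length, the exact match and its surrounding context are just slices. The sample sentence, the `context` width and the bracketed layout of the combined snippet are assumptions for illustration, not values or formats taken from the actual OpenCalais response.

```python
def describe_mention(content, offset, length, context=10):
    """Rebuild prefix / exact / suffix snippets from an offset and a length."""
    end = offset + length
    prefix = content[max(0, offset - context):offset]
    exact = content[offset:end]
    suffix = content[end:end + context]
    return {
        "prefix": prefix,
        "exact": exact,
        "suffix": suffix,
        # Combined snippet; the bracket layout here is an assumed convention.
        "detection": f"[{prefix}]{exact}[{suffix}]",
    }


if __name__ == "__main__":
    # Made-up sample sentence, not text from the article being analyzed.
    content = "Officials discussed Saddam Hussein at the meeting."
    offset = content.index("Saddam Hussein")
    print(describe_mention(content, offset, len("Saddam Hussein")))
```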


OpenCalais is a very impressive tool. It takes a while, though, to fully understand the RDF response, especially in the areas of entity disambiguation and the linking of OpenCalais entities to other Linked Data datasets. Most likely there are some subtleties that I have missed or misunderstood, so all clarifications are welcome.

For entities extracted from international news sources and not linked to other Linked Data datasets it would be interesting to try some equivalence mining.