Archive for the ‘Semantic Web’ Category

Using Linked Data to provide a different perspective on Software Architecture

Saturday, December 17th, 2011

As outlined in my previous post Linked Data and the SOA Software Development Process, I am interested in using Linked Data to provide a more detailed view of SOA services.

A couple of scenarios during the past week highlighted the value of the approach, and also showed that it would benefit from being extended to include more information about the consumers of the SOA services and about the external data sources (in particular databases) that the services use.

Both scenarios involved setting up environments for the development and testing of new functionality involving a number of different systems, with each system needing to be deployed at a specific version level.

The first scenario related to software versions. The UML diagrams presented to describe the architecture were at too high a level to show the actual dependencies, but adding the level of detail needed would have made the diagrams too busy.

Although not yet complete, the work already done to provide a Linked Data perspective of the SOA services enabled a more fine-grained view of the actual dependencies. Knowing the specific lower-level dependencies resulted in more flexibility with the actual deployment. In particular, work could start on developing the new functionality for one component, since it was not going to be affected by proposed changes in another component. On the original UML diagram both components were shown as requiring changes; the Linked Data perspective provided enough additional detail to see that the changes could happen in parallel.

The second scenario related to finding the owners of external data sources so that we could determine whether those sources were available for use in a given test environment. Adding this ownership information to our Linked Data repository would speed up this part of the process in the future.
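As a rough sketch of the kind of lookup this would enable, the Python/rdflib example below records which data source a service uses and who owns that data source, then queries for the owner. It is a sketch only: the ex: namespace, the property names and the service, database and owner resources are all hypothetical, not taken from our actual repository.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDFS

EX = Namespace("http://example.org/soa#")

g = Graph()
g.bind("ex", EX)

# A service, the database it uses, and that database's owner
# (all hypothetical resources).
g.add((EX.CustomerService, EX.usesDataSource, EX.CustomerDB))
g.add((EX.CustomerDB, EX.ownedBy, EX.JaneSmith))
g.add((EX.JaneSmith, RDFS.label, Literal("Jane Smith")))

# Who do we need to talk to before using this service's data
# sources in a test environment?
query = """
    SELECT ?db ?name WHERE {
        ex:CustomerService ex:usesDataSource ?db .
        ?db ex:ownedBy ?owner .
        ?owner rdfs:label ?name .
    }
"""

for db, name in g.query(query, initNs={"ex": EX, "rdfs": RDFS}):
    print(db, "is owned by", name)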

Using the NeOn Toolkit and ANNIE to demonstrate extracting RDF from Natural Language

Sunday, July 10th, 2011

The NeOn Toolkit is an open source ontology engineering environment providing an extensive set of plug-ins for various ontology engineering activities.

One such plugin is the GATE web services plugin, which adds named entity recognition functionality from the GATE (General Architecture for Text Engineering) framework.

The GATE web services plugin can be quickly added to the NeOn Toolkit by:

  • opening the Help | Install New Software … menu option,
  • selecting “NeOn Toolkit Update Site v2.4 – http://neon-toolkit.org/plugins/2.4” from the Work With drop-down combo box,
  • and selecting GATE Web Services as shown below.

The GATE web services plugin includes ANNIE (Ontology Generation Based on Named Entity Recognition), which can be used to demonstrate basic named entity recognition and ontology generation. The main GATE site provides more details on how ANNIE (a Nearly-New Information Extraction System) works.

After the GATE web services plugin has been installed, GATE Services appears as an additional top-level menu option. Selecting GATE Services | Call Multi-document Web Service opens the Call GATE web service dialog box below, which provides the option to select ANNIE as the service to call.

Selecting ANNIE and then Next invokes an additional dialog box where the Input directory (containing the documents to be processed) and the Output ontology can be specified.

Once the Input directory and the Output ontology have been specified and the Finish button selected, ANNIE reads the input and generates a basic ontology according to the concepts, instances and relations found in the text.

When the text below is provided as input, ANNIE generates the following RDF output.

Input Text

Nick lives in Toronto and studies at Concordia University. Toronto is six hours from Montreal. Toronto is a nice place to live.

RDF Output

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
	xmlns:protons="http://proton.semanticweb.org/2005/04/protons#"
	xmlns:protonu="http://proton.semanticweb.org/2005/04/protonu#"
	xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
	xmlns:owl="http://www.w3.org/2002/07/owl#"
	xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
	xmlns:protonkm="http://proton.semanticweb.org/2005/04/protonkm#"
	xmlns:protont="http://proton.semanticweb.org/2005/04/protont#">
<!-- All statement -->

<rdf:Description rdf:about="http://gate.ac.uk/owlim#Nick">
	<rdf:type rdf:resource="http://gate.ac.uk/owlim#Person"/>
	<rdfs:label xml:lang="en">nick</rdfs:label>
	<rdfs:label xml:lang="en">Nick</rdfs:label>
	<rdfs:label>Nick</rdfs:label>
</rdf:Description>

<rdf:Description rdf:about="http://gate.ac.uk/owlim#Montreal">
	<rdf:type rdf:resource="http://gate.ac.uk/owlim#Location"/>
	<rdfs:label xml:lang="en">montreal</rdfs:label>
	<rdfs:label xml:lang="en">Montreal</rdfs:label>
	<rdfs:label>Montreal</rdfs:label>
</rdf:Description>

<rdf:Description rdf:about="http://gate.ac.uk/owlim#Location">
	<rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Class"/>
	<rdfs:label xml:lang="en">Location</rdfs:label>
	<rdfs:label>Location</rdfs:label>
</rdf:Description>

<rdf:Description rdf:about="http://gate.ac.uk/owlim#Concordia_University">
	<rdf:type rdf:resource="http://gate.ac.uk/owlim#Organization"/>
	<rdfs:label xml:lang="en">concordia university</rdfs:label>
	<rdfs:label xml:lang="en">Concordia University</rdfs:label>
	<rdfs:label xml:lang="en">Concordia_University</rdfs:label>
	<rdfs:label>Concordia_University</rdfs:label>
</rdf:Description>

<rdf:Description rdf:about="http://gate.ac.uk/owlim#Organization">
	<rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Class"/>
	<rdfs:label xml:lang="en">Organization</rdfs:label>
	<rdfs:label>Organization</rdfs:label>
</rdf:Description>

<rdf:Description rdf:about="http://gate.ac.uk/owlim#Person">
	<rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Class"/>
	<rdfs:label xml:lang="en">Person</rdfs:label>
	<rdfs:label>Person</rdfs:label>
</rdf:Description>

<rdf:Description rdf:about="http://gate.ac.uk/owlim#Toronto">
	<rdf:type rdf:resource="http://gate.ac.uk/owlim#Location"/>
	<rdfs:label xml:lang="en">toronto</rdfs:label>
	<rdfs:label xml:lang="en">Toronto</rdfs:label>
	<rdfs:label>Toronto</rdfs:label>
</rdf:Description>

</rdf:RDF>

The entities recognized are:

  • Nick as a Person
  • Montreal and Toronto as Locations
  • Concordia University as an Organization.
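The generated ontology can also be inspected programmatically. Below is a minimal sketch using Python and rdflib; it assumes the RDF output above has been saved to a file named annie-output.rdf (the file name is an assumption for the example).

from rdflib import Graph, URIRef
from rdflib.namespace import RDF

g = Graph()
g.parse("annie-output.rdf", format="xml")

OWL_CLASS = URIRef("http://www.w3.org/2002/07/owl#Class")

# List each instance together with its type, skipping the
# owl:Class declarations themselves.
for subject, rdf_type in g.subject_objects(RDF.type):
    if rdf_type != OWL_CLASS:
        print(subject, "is a", rdf_type)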

While relatively simple, the overall example, comprising the input text, the generated RDF output and the quick setup process for the NeOn Toolkit and the GATE web services plugin, helped to demonstrate the potential of named entity recognition and ontology generation.

The input text actually comes from a demo of the OwlExporter, which provides similar functionality for GATE itself. Longer term, GATE is likely to be part of a natural language processing solution for a government department where the sensitivity of the private data would preclude the use of an external web service. Hopefully there will also be time later on to write up the results of using GATE and the OwlExporter with the same input text.

(For this article NeOn Toolkit version 2.4.2 was used.)

Drupal7 RDFa XMLLiteral content processing

Saturday, March 12th, 2011

Drupal 7 supports RDFa 1.0 as part of the core product. RDFa 1.0 is the current specification, but RDFa 1.1 is due to be released shortly.

RDFa 1.0 metadata can be parsed using the RDFa Distiller and Parser, while the RDFa Distiller and Parser (Test Version for RDFa 1.1) can be used to extract RDFa 1.1.

Creating a simple Drupal 7 test blog and parsing out the RDFa 1.0 metadata with the RDFa Distiller and Parser shows that Drupal 7 uses the SIOC (Semantically-Interlinked Online Communities) ontology to describe blog posts and identifies the Drupal user as the creator of the post using the sioc:has_creator property.

<sioc:Post rdf:about="http://137breakerbay.3kbo.com/test">
  <rdf:type rdf:resource="http://rdfs.org/sioc/types#BlogPost"/>
  ...
  <sioc:has_creator>
    <sioc:UserAccount rdf:about="http://137breakerbay.3kbo.com/user/2">
      <foaf:name>Richard</foaf:name>
    </sioc:UserAccount>
  </sioc:has_creator>
  ...
  <content:encoded rdf:parseType="Literal"><p xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">Test Blog</p>
  </content:encoded>
</sioc:Post>

sioc:UserAccount is a subclass of foaf:OnlineAccount.

What I would like to do is add additional RDFa metadata within the content of the blog to associate the Drupal 7 user (a sioc:UserAccount) with me, the foaf:Person identified by my FOAF file.

Drupal 7 content is wrapped in XHTML elements carrying property="content:encoded" (shown below), and an RDFa parser treats this content as an XMLLiteral.

<div property="content:encoded">
...
</div>

The problem is that RDFa 1.0 parsers don’t extract metadata contained within the XMLLiteral.

This was raised a while back in the issue “XMLLiteral content isn’t processed for RDFa attributes in RDFa 1.0 – should this change in RDFa 1.1?”, with the result that RDFa 1.1 parsers should now also process the XMLLiteral content.

To make sure that RDFa parsers know that I want RDFa 1.1 processing, I need to update Drupal 7 to use the XHTML+RDFa Driver Module defined in the XHTML+RDFa 1.1 spec.

This turns out to be a simple update of one Drupal 7 file, site/modules/system/html.tpl.php.

Near the top of the file the version is changed to 1.1 (in two places) and the DTD changed to "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-2.dtd".

?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.1//EN"
  "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-2.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="<?php print $language->language; ?>" version="XHTML+RDFa 1.1" dir="<?php print $language->dir; ?>"<?php print $rdf_namespaces; ?>>

With these changes made I can create another blog containing the following RDFa metadata

<div about="http://www.3kbo.com/people/richard.hancock/foaf.rdf#i" typeof="foaf:Person">
  <div rel="foaf:account" resource="http://137breakerbay.3kbo.com/user/2">
  ...
  </div>
</div>

knowing that an RDFa 1.1 parser will create the RDF triples below, which link the Drupal 7 user to me, the person identified in my FOAF file.

  <foaf:Person rdf:about="http://www.3kbo.com/people/richard.hancock/foaf.rdf#i">
    <foaf:account>
      <sioc:UserAccount rdf:about="http://137breakerbay.3kbo.com/user/2">
        <foaf:name>Richard</foaf:name>
      </sioc:UserAccount>
    </foaf:account>
  </foaf:Person>
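The same triples can also be built programmatically, for example to merge them into the FOAF file itself. A minimal sketch using Python and rdflib, reusing the URIs from the example above:

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
SIOC = Namespace("http://rdfs.org/sioc/ns#")

me = URIRef("http://www.3kbo.com/people/richard.hancock/foaf.rdf#i")
account = URIRef("http://137breakerbay.3kbo.com/user/2")

g = Graph()
g.bind("foaf", FOAF)
g.bind("sioc", SIOC)

# The person, the user account, and the link between them.
g.add((me, RDF.type, FOAF.Person))
g.add((me, FOAF.account, account))
g.add((account, RDF.type, SIOC.UserAccount))
g.add((account, FOAF.name, Literal("Richard")))

print(g.serialize(format="xml"))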

The differences between the RDF extracted with an RDFa 1.0 parser and an RDFa 1.1 parser can be seen by running the two distillers mentioned above over the same page.

Now that I know that the RDFa 1.1 metadata embedded in the content will be processed accordingly, I can move on to the task of building 137 Breaker Bay, a simple accommodation site where the plan is to use RDFa and ontologies such as GoodRelations to describe both the accommodation services available and the attractions and services of the surrounding area.

Developing a Semantic Web Strategy

Tuesday, August 10th, 2010

In the last chapter of his book “Pull: The Power of the Semantic Web to Transform Your Business”, David Siegel outlines some steps for developing a successful Semantic Web strategy for your business or organization.

One approach that worked for me recently was to organize a meeting titled “Developing a Semantic Web Strategy” and invite along developers, architects, analysts and managers. This was in the context of a government organization and the managers were from the applications development area.

Sharing out books like Semantic Web for the Working Ontologist, Semantic Web For Dummies, Programming the Semantic Web and Semantic Web Programming prior to the meeting helped people get familiar with concepts like URIs as names for things, RDF, RDFS, OWL, SPARQL and RDFa.

To highlight how rapidly the Web of Data is evolving and the amount of information now being published as Linked Open Data, I stepped through Mark Greaves’ excellent presentation The Maturing Semantic Web: Lessons in Web-Scale Knowledge Representation.

During the meeting I took a business strategy first, technology second approach, taking the time to explore how an approach that has worked for someone else might fit with our organization.

Areas explored included:

Enterprise Modeling

I spent some time comparing RDF/OWL modeling with UML modeling, highlighting how URIs enable modeling across distributed information sources without the need to consolidate everything in a central repository as you do with UML tools.

I also touched on OWL features such as owl:TransitiveProperty, which the first prototype described below makes use of.

Because it is a government department, I highlighted the Federal Enterprise Architecture Reference Model Ontology (FEA-RMO) and how such an ontology could be used to map a parliamentary initiative to the software providing its implementation.

Open Government

Given the current trend for governments to make datasets freely available, I presented the Linked Data approaches taken by http://data.gov and http://data.gov.uk as examples to follow in this area.

The business case for Linked Data in this scenario is that Linked Data is seen as the best available approach for publishing data in hugely diverse and distributed environments, in a gradual and sustainable way (see Why Linked Data for data.gov.uk? for details).

RDFa Based Integration

One example that struck a chord was RDFa and Linked Data in UK Government Websites, where job vacancy details from different sites can easily be combined since each site publishes its web pages as HTML with RDFa added to annotate the job vacancy. Using RDFa allows the same page to be read as either HTML or RDF. The end result is that integration can be achieved with minimal changes to the original sites.

Search Engine Optimisation (SEO)

For anyone advertising products and services online, the business strategy to follow is the example set by BestBuy.com, which describes its stores and products using the GoodRelations ontology and embeds these descriptions into its web pages using RDFa, increasing search engine traffic by 30%.

Enterprise Web of Data

Within our software development process, from project inception to production release and subsequent maintenance releases, information is copied and duplicated in a number of different places. Silos abound, in the form of Word documents, spreadsheets and the sticky notes that are part of the “Agile” process. There is some good information on our wiki pages but it is unstructured and not machine readable.

The information that forms our internal processes fails David Siegel’s Semantic Web Acid Test:

  • It’s not semantic and
  • It’s not on the web.

Introducing a semantic wiki such as Semantic MediaWiki to hold project information and link it to other data sources was raised as a candidate for a Semantic Web proof of concept.

Outcomes

Just scheduling the meeting was in itself a successful outcome, since it started discussion around the role Semantic Web technologies could play in our organization. For a number of people, including the Applications Development manager, this is new technology and they need time to absorb it, but the end result was agreement that it was technology that couldn’t be ignored.

In order to gain some practical experience, two internal prototypes were agreed on, both with practical value for the organization.

The first is a small application that will show the full set of runtime dependencies for a given software component, as well as the other components affected when the specified component is changed. The application will be based on a simple ontology that defines dependencies between components using owl:TransitiveProperty, and will use a reasoner (e.g. Pellet) to infer the full set of dependencies for a component.
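As a rough illustration of the idea, here is a minimal sketch using Python with rdflib and the owlrl OWL-RL reasoner standing in for Pellet; the dep: namespace and the component names are hypothetical.

from rdflib import Graph, Namespace
from rdflib.namespace import OWL, RDF
from owlrl import DeductiveClosure, OWLRL_Semantics

DEP = Namespace("http://example.org/dependencies#")

g = Graph()
g.bind("dep", DEP)

# dependsOn is transitive, so A -> B and B -> C implies A -> C.
g.add((DEP.dependsOn, RDF.type, OWL.TransitiveProperty))
g.add((DEP.OrderService, DEP.dependsOn, DEP.CustomerService))
g.add((DEP.CustomerService, DEP.dependsOn, DEP.CustomerDB))

# Materialize the inferred triples.
DeductiveClosure(OWLRL_Semantics).expand(g)

# The full set of runtime dependencies for OrderService, including
# the inferred dependency on CustomerDB.
for dependency in g.objects(DEP.OrderService, DEP.dependsOn):
    print(dependency)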

The second prototype will trial Semantic MediaWiki for project management (potentially using the Teamwork Ontology). The longer-term view is to customize Semantic MediaWiki to include artifacts created as part of the software development process, addressing some of the silo problems found in our current internal enterprise web of data.

Once practical knowledge has been gained from the internal prototypes, a meeting will be scheduled with the Enterprise Architecture team to canvass the establishment of a wider vision for the use of Linked Data and Semantic Web technologies, potentially leading to their use on the public web sites, actively publishing to the Web of Data.

Understanding the OpenCalais RDF Response

Saturday, September 26th, 2009

I’m using an XML version of an article published by Scoop in February 2000, Senior UN Officials Protest UN Sanctions On Iraq, to understand the OpenCalais RDF response, as part of a larger project of linking extracted entities to existing Linked Data datasets.

OpenCalais uses natural language processing (NLP), machine learning and other methods to analyze content and return the entities it finds, such as cities, countries and people, with dereferenceable Linked Data style URIs. The entity types are defined in the OpenCalais RDF Schemas.

When I submit the content to the OpenCalais REST web service (using the default RDF response format), an RDF document is returned. The screenshot below, taken in TopBraid Composer, shows a portion of the input content and some of the entity types OpenCalais can detect. The numbers in brackets indicate how many instances of an entity type have been detected; for example cle.Person(13) indicates that thirteen people have been detected.

The TopBraid Composer Instances tab contains the URIs of the people detected. Opening the highlighted URI reveals that it is for a person named Saddam Hussein.

Entity Disambiguation

One of the challenges when analyzing content and extracting entities is entity disambiguation. Can the person named Saddam Hussein be uniquely identified? Usually context is needed in order to disambiguate similar entities. As described in the OpenCalais FAQ, if the “rdf:type rdf:resource” of a given entity contains /er/ the entity has been disambiguated by OpenCalais, while if it contains /em/ it has not.

In the example above cle.Person is <http://s.opencalais.com/1/type/em/e/Person>. There is no obvious link to an “rdf:type rdf:resource” containing /er/. It looks like OpenCalais has been able to determine that the text “Saddam Hussein” refers to a Person, but has not been able to determine specifically who that person is.
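This /em/ versus /er/ check is easy to automate. A minimal sketch using Python and rdflib, assuming the OpenCalais response has been saved to a file named response.rdf (the file name is an assumption):

from rdflib import Graph
from rdflib.namespace import RDF

g = Graph()
g.parse("response.rdf", format="xml")

# Per the OpenCalais FAQ, /er/ types mark disambiguated entities
# and /em/ types mark entities that have not been disambiguated.
for subject, rdf_type in g.subject_objects(RDF.type):
    if "/er/" in str(rdf_type):
        print("disambiguated:    ", subject)
    elif "/em/" in str(rdf_type):
        print("not disambiguated:", subject)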

In contrast, Iraq (one of three countries detected) is shown below with the Incoming Reference http://d.opencalais.com/er/geo/country/ralg-geo1/d3b1cee2-327c-fa35-7dab-f0289958c024.

Opening the URI http://d.opencalais.com/er/geo/country/ralg-geo1/d3b1cee2-327c-fa35-7dab-f0289958c024 with either an HTML browser as http://d.opencalais.com/er/geo/country/ralg-geo1/d3b1cee2-327c-fa35-7dab-f0289958c024.html or with an RDF browser as http://d.opencalais.com/er/geo/country/ralg-geo1/d3b1cee2-327c-fa35-7dab-f0289958c024.rdf (in Tabulator below) shows that the country has been disambiguated, with <rdf:type rdf:resource="http://s.opencalais.com/1/type/er/Geo/Country"/>.
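Since the URI is dereferenceable, the RDF representation can also be fetched and parsed directly. A minimal sketch using Python and rdflib, assuming the OpenCalais service is reachable:

from rdflib import Graph

uri = ("http://d.opencalais.com/er/geo/country/ralg-geo1/"
       "d3b1cee2-327c-fa35-7dab-f0289958c024")

# Fetch and parse the .rdf representation of the entity.
g = Graph()
g.parse(uri + ".rdf", format="xml")

for subject, predicate, obj in g:
    print(subject, predicate, obj)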

Linked Data

In the RDF response returned by OpenCalais, neither Iraq nor “Saddam Hussein” was linked to other Linked Data datasets. Some OpenCalais entities are; for example Moscow, Russia is linked via owl:sameAs to resources in other datasets.

Since I know that the context of the article is international news, I can safely add some owl:sameAs links myself, such as DBpedia links for “Saddam Hussein” and Iraq.
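A minimal sketch of adding such links with Python and rdflib follows. The two subject URIs are placeholders, since the real /em/ entity URIs come from the OpenCalais response document; the DBpedia URIs are the standard resource URIs for Saddam Hussein and Iraq.

from rdflib import Graph, URIRef
from rdflib.namespace import OWL

g = Graph()
g.parse("response.rdf", format="xml")

# Placeholders: substitute the actual entity URIs taken from
# the OpenCalais response.
saddam = URIRef("http://example.org/opencalais-entity/saddam-hussein")
iraq = URIRef("http://example.org/opencalais-entity/iraq")

g.add((saddam, OWL.sameAs,
       URIRef("http://dbpedia.org/resource/Saddam_Hussein")))
g.add((iraq, OWL.sameAs,
       URIRef("http://dbpedia.org/resource/Iraq")))

print(g.serialize(format="xml"))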

Entity Relevance

For both of the detected entities “Saddam Hussein” and “Iraq”, OpenCalais provides an entity relevance score (shown for each in the screenshots below). The relevance capability detects the importance of each unique entity and assigns a relevance score in the range 0 to 1 (1 being the most relevant and important). From the screenshots it is clear that “Iraq” has been ranked as more relevant.

Detection Information

The RDF response includes the following properties relating to the subject’s detection:

  • c:docId: URI of the document this mention was detected in
  • c:subject: URI of the unique entity
  • c:detection: snippet of the input content where the metadata element was identified
  • c:prefix: snippet of the input content that precedes the current instance
  • c:exact: snippet of the input content in the matched portion of text
  • c:suffix: snippet of the input content that follows the current instance
  • c:offset: the character offset relative to the input content after it has been converted into XML
  • c:length: length of the instance

The screenshot below for Saddam Hussein provides an example of how these properties work.
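The same properties can be pulled out programmatically. A minimal sketch using Python and rdflib, assuming the response has been saved to response.rdf and that the c: prefix expands to http://s.opencalais.com/1/pred/ as in the OpenCalais schemas:

from rdflib import Graph, Namespace

C = Namespace("http://s.opencalais.com/1/pred/")

g = Graph()
g.parse("response.rdf", format="xml")

# Print the matched text and its surrounding snippets for each
# detected mention.
for mention in g.subjects(C.exact, None):
    print("exact: ", g.value(mention, C.exact))
    print("prefix:", g.value(mention, C.prefix))
    print("suffix:", g.value(mention, C.suffix))
    print("offset:", g.value(mention, C.offset))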

Conclusions

OpenCalais is a very impressive tool. It takes a while, though, to fully understand the RDF response, especially in the areas of entity disambiguation and the linking of OpenCalais entities to other Linked Data datasets. Most likely there are some subtleties that I have missed or misunderstood, so all clarifications are welcome.

For entities extracted from international news sources and not linked to other Linked Data datasets, it would be interesting to try some equivalence mining.