Developing a Semantic Web Strategy

August 10th, 2010

In the last chapter of his book “Pull: The Power of the Semantic Web to Transform Your Business” David Siegel outlines some steps for developing a successful Semantic Web strategy for your business or organization.

One approach that worked for me recently was to organize a meeting titled “Developing a Semantic Web Strategy” and invite developers, architects, analysts and managers. This was in the context of a government organization, and the managers were from the applications development area.

Sharing out books like Semantic Web for the Working Ontologist, Semantic Web For Dummies, Programming the Semantic Web and Semantic Web Programming prior to the meeting helped people get familiar with concepts like URIs as names for things, RDF, RDFS, OWL, SPARQL and RDFa.

To highlight how rapidly the Web of Data is evolving and the amount of information now being published as Linked Open Data, I stepped through Mark Greaves’s excellent presentation The Maturing Semantic Web: Lessons in Web-Scale Knowledge Representation.

During the meeting I took a business-strategy-first, technology-second approach, taking the time to explore how an approach that has worked for someone else might fit our organization.

Areas explored included:

Enterprise Modeling

I spent some time comparing RDF/OWL modeling with UML modeling, highlighting how URIs enable modeling across distributed information sources without the need to consolidate everything in a central repository, as UML tools require.

Also touched on OWL features such as:

Because it is a government department I highlighted the Federal Enterprise Architecture Reference Model Ontology (FEA-RMO) and how such an ontology could be used to map a parliamentary initiative to the software providing its implementation.

Open Government

Given the current trend for governments to make datasets freely available I presented the Linked Data approaches taken by http://data.gov and http://data.gov.uk as examples to follow in this area.

The business case for Linked Data in this scenario is that Linked Data is seen as the best available approach for publishing data in hugely diverse and distributed environments, in a gradual and sustainable way (see Why Linked Data for data.gov.uk? for details).

RDFa Based Integration

One example that struck a chord was RDFa and Linked Data in UK Government Websites, where job vacancy details from different sites can easily be combined because each web site publishes its pages as HTML with RDFa added to annotate the job vacancy. Using RDFa allows the same page to be read as either HTML or RDF. The end result is that integration can be achieved with minimal changes to the original sites.

Search Engine Optimisation (SEO)

For anyone advertising products and services online the business strategy to follow is the example set by BestBuy.com which describes its stores and products using the Good Relations ontology and embeds these descriptions into its web pages using RDFa, increasing search engine traffic by 30%.

Enterprise Web of Data

Within our software development process, from project inception to production release and subsequent maintenance release, information is being copied and duplicated in a number of different places. Silos abound, in the form of Word documents, spreadsheets and the sticky notes that are part of the “Agile” process. There is some good information on our wiki pages but it is unstructured and not machine readable.

The information that forms our internal processes fails David Siegel’s Semantic Web Acid Test:

  • It’s not semantic and
  • It’s not on the web.

Introducing a semantic wiki such as Semantic MediaWiki to hold project information and link it to other data sources was raised as a candidate for a Semantic Web proof of concept.

Outcomes

Just scheduling the meeting was in itself a successful outcome since it started discussion around the role Semantic Web technologies could play in our organization. For a number of people, including the Applications Development manager, this is new technology and they need time to absorb it but the end result was agreement that it was technology that couldn’t be ignored.

To gain some practical experience, two internal prototypes were agreed to, both with practical value for the organization.

The first is a small application that will show the full set of runtime dependencies for a given software component, as well as the other components affected when that component is changed. The application will be based on a simple ontology that defines dependencies between components using owl:TransitiveProperty, and will use a reasoner (e.g. Pellet) to infer the full set of dependencies for a component.
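The idea behind this prototype can be sketched without an ontology or a reasoner. The following is a minimal sketch in plain Groovy, not the prototype itself: the dependsOn map and the component names are made up for illustration, and the closure computes by hand the transitive closure that a reasoner would infer from a property declared as an owl:TransitiveProperty.

```groovy
// Hypothetical direct dependencies: component -> components it depends on.
def dependsOn = [
    'web-ui'     : ['order-svc'],
    'order-svc'  : ['billing-lib', 'db-driver'],
    'billing-lib': ['db-driver'],
    'db-driver'  : []
]

// Follow dependsOn links until no new components turn up
// (the same result a transitive property gives a reasoner).
def allDependencies = { String component ->
    def seen = new LinkedHashSet()
    def todo = [component] as LinkedList
    while (todo) {
        def current = todo.removeFirst()
        (dependsOn[current] ?: []).each { dep ->
            if (seen.add(dep)) {
                todo << dep
            }
        }
    }
    seen
}

// Components affected by a change: everything that depends on the component,
// directly or indirectly.
def affectedBy = { String component ->
    dependsOn.keySet().findAll { component in allDependencies(it) }
}

println "web-ui depends on: ${allDependencies('web-ui')}"
println "affected by db-driver: ${affectedBy('db-driver')}"
```

With an ontology the map would be replaced by triples and the loop by the reasoner, but the two questions asked of the data stay the same.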

The second prototype will trial Semantic MediaWiki for project management (potentially using the Teamwork Ontology). The longer term view is to customize Semantic MediaWiki to include artifacts created as part of the software development process, addressing some of the silo problems found in our current internal enterprise web of data.

Once practical knowledge has been gained from the internal prototypes, a meeting will be scheduled with the Enterprise Architecture team to canvass the establishment of a wider vision for the use of Linked Data and Semantic Web technologies, potentially leading to their use on the public web sites, actively publishing to the Web of Data.

Using Groovy to Upload RDF files to the Talis Platform

March 13th, 2010

The Talis Platform provides free stores for developers to host RDF data online. Each store has its own SPARQL endpoint for querying the RDF data.

Options for uploading individual RDF files into a store include:

A nice-to-have option would be to upload all the RDF files found in a directory directly into a store using a simple command like TalisStore.load.

Groovy with its flexible scripting is a good candidate for this type of work. Code like the following makes it easy to traverse directories and list the RDF files

  • in the current directory:
    new File(".").eachFileMatch(~/.*\.rdf/) { println it }
    
  • or in a specific directory:
    new File("/data/rdf").eachFileMatch(~/.*\.rdf/) { println it }
    

Once Groovy is installed the above lines of code can be run directly in both the Groovy Shell (groovysh) and the Groovy Console (groovyConsole). For example, when run in the Groovy Shell (groovysh):

$ groovysh
Groovy Shell (1.6.4, JVM: 1.6.0_15)
Type 'help' or '\h' for help.
-------------------------------------------------------------------------------------
groovy:000> new File(".").eachFileMatch(~/.*\.rdf/) { println it }
./WO0002.rdf
./WO0003.rdf
./WO0004.rdf
./WO0005.rdf

The Groovy RESTClient simplifies REST operations like POSTing (uploading) files to a web site. It is an extension of HTTPBuilder which in turn is a wrapper of Apache’s HttpClient. The main addition required for the RESTClient to upload RDF/XML files to a Talis store is an “application/rdf+xml” encoder. This is easy to create following the example provided in the article Groovy RESTClient and Putting Zip Files.

The result is the encodeRDF method shown below.

import groovyx.net.http.RESTClient
import org.apache.http.entity.FileEntity

// Excerpt: TALIS_USERNAME and TALIS_PASSWORD are assumed to be defined
// elsewhere in the class (not shown here).
class TalisStoreLoader {
    def talis

    TalisStoreLoader() {
        talis = new RESTClient( "http://api.talis.com/" )
        talis.auth.basic TALIS_USERNAME, TALIS_PASSWORD
        // Register the encoder used for "application/rdf+xml" request bodies
        talis.encoder.'application/rdf+xml' = this.&encodeRDF
    }

    // Wrap an RDF/XML file in an HTTP entity that the RESTClient can send
    def encodeRDF( Object data ) throws UnsupportedEncodingException {
        if ( data instanceof File ) {
            def entity = new FileEntity( (File) data, "application/rdf+xml" )
            entity.setContentType( "application/rdf+xml" )
            return entity
        } else {
            throw new IllegalArgumentException(
                "Don't know how to encode ${data.class.name} as application/rdf+xml" )
        }
    }
}

The line talis.encoder.'application/rdf+xml' = this.&encodeRDF registers the encoder with an instance of the RESTClient.

With the RDF encoder in place a file can be uploaded to a store’s metabox as follows.

def res = talis.post( path: metaboxPath, body: file, requestContentType: "application/rdf+xml" )
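Combining the directory listing shown earlier with an upload call gives the rough shape of a directory loader. The following is a minimal sketch rather than the actual TalisStoreLoader code: loadAll is a hypothetical name, and the upload closure is a stand-in for whatever performs the real POST (for example a closure wrapping the talis.post call above), so the sketch runs without the RESTClient installed.

```groovy
// Hypothetical helper: pass every .rdf file in 'dir' to the 'upload' closure
// and return the number of files handled.
def loadAll = { File dir, Closure upload ->
    def files = []
    dir.eachFileMatch(~/.*\.rdf/) { files << it }
    // Sort by name so the files are processed in a predictable order
    files.sort { it.name }.each { upload(it) }
    files.size()
}

// Dry run against the current directory: print instead of uploading.
def count = loadAll(new File('.'), { File f -> println "Would upload ${f.name}" })
println "Found ${count} RDF file(s)"
```

Keeping the upload step as a closure makes it easy to try the traversal first and plug in the real call once the store, user and password are known to be correct.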

This functionality is encapsulated in the class com._3kbo.talis.TalisStoreLoader which is part of a maven project available for download as a zip file. It includes the script TalisStore.groovy which is a simplified wrapper of com._3kbo.talis.TalisStoreLoader.

The jar file created by the project, talis-store-0.2.jar, can be downloaded separately.

The RESTClient is not bundled with the standard Groovy install. Trying to access it from the shell or console without explicitly installing it will result in errors like the following:

groovy:000> import groovyx.net.http.RESTClient
ERROR org.codehaus.groovy.tools.shell.CommandException:
Invalid import definition: 'import groovyx.net.http.RESTClient';
reason: startup failed, script1266050039289.groovy:
1: unable to resolve class groovyx.net.http.RESTClient
 @ line 1, column 1. 1 error at java_lang_Runnable$run.call (Unknown Source)

Installing the RESTClient requires downloading HTTPBuilder and adding it and its dependencies (http-builder-xxx-all.zip) to the ${user.home}/.groovy/lib directory. Also add talis-store-0.2.jar to this directory. The ${user.home}/.groovy/lib directory may need to be created manually but the Groovy install should have created a file named “$GROOVY_HOME/conf/groovy-starter.conf” containing the line

load ${user.home}/.groovy/lib/*

which enables the loading of the additional jar files required by RESTClient, plus the jar containing com._3kbo.talis.TalisStoreLoader, i.e.:

  • http-builder-0.5.0-RC2.jar
  • httpclient-4.0.jar
  • httpcore-4.0.1.jar
  • json-lib-2.3-jdk15.jar
  • xml-resolver-1.2.jar
  • commons-collections-3.2.1.jar
  • commons-logging-1.1.1.jar
  • talis-store-0.2.jar

Using the Groovy Shell to Upload

With the RESTClient and the talis-store-0.2.jar installed the Groovy Shell (groovysh) makes it easy to run the TalisStore.groovy script and upload either individual RDF files or all the RDF files in a directory to a Talis store.

The four options for running the TalisStore.groovy script are:

  1. TalisStore.load "mystore","user","password","file_or_directory"
  2. TalisStore.load "mystore","user","password"
  3. TalisStore.load "file_or_directory"
  4. TalisStore.load()

The first and second options both explicitly set the store, user and password. The first option also nominates either a specific RDF file to upload or a directory to scan and upload all the RDF files found. The second option uploads all the RDF files found in the current directory, i.e. the directory in which the Groovy Shell (groovysh) was invoked.

The third and fourth options read the store, user and password from the configuration file TalisConfig.groovy, updated for a specific store and available on the classpath (see below).

With the configuration file TalisConfig.groovy in place, uploading a specific RDF file or a directory simplifies to TalisStore.load "file_or_directory"

Uploading the RDF files in the current directory is just TalisStore.load(), as shown in the example “Loading all RDF files from the current directory” below.

Using the Script to Upload

Adding the line #!/usr/bin/env groovy to the TalisStore.groovy script and making the script executable allows it to be run independent of the Groovy Shell (groovysh), for example ./TalisStore.groovy /sioc/forum/WO0902.rdf explicitly loads the RDF, using the configuration file to set the store, user and password.

See the TalisStore.groovy javadoc for more details on running as an executable script.

Summary

There is a bit of configuration to set everything up but once in place the combination of Groovy, the RESTClient and the TalisStore loader code described here makes it easy to load RDF files to the Talis Platform.

My preference is to run the Groovy Shell (groovysh) and use simple commands like TalisStore.load().

Possible extensions for the future include commands like TalisStore.sparql.select etc…

Appendix A: Examples

Loading a specific file

$ groovysh
Groovy Shell (1.7.1, JVM: 1.6.0_15)
Type 'help' or '\h' for help.
-------------------------------------------------------------------------------
groovy:000> TalisStore.load "mystore","user","password","/sioc/WO0401.rdf"
Using store: mystore user password
Loading a file or directory: /sioc/WO0401.rdf
Loading /sioc/WO0401.rdf
Loaded 1565688 bytes in 58518 milliseconds. (Status: 204)

Loading all RDF files from the current directory

$ cd /scoop/forum/
$ ls -l
-rw-r--r--  1  3847192  2 Jan 12:11 WO0903.rdf
-rw-r--r--  1  2485605  2 Jan 12:11 WO0904.rdf
-rw-r--r--  1  2321233  2 Jan 12:12 WO0905.rdf
-rw-r--r--  1  2551787  2 Jan 12:12 WO0906.rdf
$ groovysh
Groovy Shell (1.7.1, JVM: 1.6.0_17)
Type 'help' or '\h' for help.
--------------------------------------------
groovy:000> TalisStore.load()
Classpath:
...
Loading RDF files from directory /scoop/forum/.
2010-03-14 11:32:31.477: Loading /scoop/forum/./WO0903.rdf
2010-03-14 11:33:49.289: Loaded 3847192 bytes in 77808 milliseconds. (Status: 204)
2010-03-14 11:33:49.304: Loading /scoop/forum/./WO0904.rdf
2010-03-14 11:34:38.288: Loaded 2485605 bytes in 48984 milliseconds. (Status: 204)
2010-03-14 11:34:38.289: Loading /scoop/forum/./WO0905.rdf
2010-03-14 11:35:25.429: Loaded 2321233 bytes in 47140 milliseconds. (Status: 204)
2010-03-14 11:35:25.43: Loading /scoop/forum/./WO0906.rdf
2010-03-14 11:36:15.952: Loaded 2551787 bytes in 50523 milliseconds. (Status: 204)
Loaded 4 files in 224488 milliseconds.
===> 4
groovy:000>

Appendix B: Adding the Groovy Configuration File to the Classpath

The structure of the config file is:

// TalisConfig.groovy
talis {
    user = "myusername"
    password = "mypassword"
    store = "mystore"
}

Once the values have been updated for a specific store the steps for adding to the classpath and also verifying that it is being read correctly are as follows:

  • Create a directory to hold property files (e.g. ${user.home}/.groovy/conf/) and
  • Add a matching line to “$GROOVY_HOME/conf/groovy-starter.conf” to add the directory to the classpath, e.g. load ${user.home}/.groovy/conf/./
  • Place the Groovy configuration file TalisConfig.groovy in the directory (i.e. ${user.home}/.groovy/conf/)

ConfigSlurper is used to read the configuration file. The shell input below shows how to:

  • Check what is on the classpath using loader.URLs.each{ println it }
  • Get the config file using url = loader.getResource("TalisConfig.groovy")
  • Read the config file using def config = new ConfigSlurper().parse(url)

groovy:000> import groovyx.net.http.RESTClient
===> [import groovyx.net.http.RESTClient]
groovy:000> talis = new RESTClient( "http://api.talis.com/" )
===> groovyx.net.http.RESTClient@1798928
groovy:000> loader = talis.class.classLoader.rootLoader
===> org.codehaus.groovy.tools.RootLoader@4d20a47e
groovy:000> loader.URLs.each{ println it }
file:/Users/richardhancock/./
file:/Users/richardhancock/groovy-1.6.4/lib/ant-1.7.1.jar
file:/Users/richardhancock/groovy-1.6.4/lib/ant-junit-1.7.1.jar
file:/Users/richardhancock/groovy-1.6.4/lib/ant-launcher-1.7.1.jar
file:/Users/richardhancock/groovy-1.6.4/lib/antlr-2.7.7.jar
file:/Users/richardhancock/groovy-1.6.4/lib/asm-2.2.3.jar
file:/Users/richardhancock/groovy-1.6.4/lib/asm-analysis-2.2.3.jar
file:/Users/richardhancock/groovy-1.6.4/lib/asm-tree-2.2.3.jar
file:/Users/richardhancock/groovy-1.6.4/lib/asm-util-2.2.3.jar
file:/Users/richardhancock/groovy-1.6.4/lib/bsf-2.4.0.jar
file:/Users/richardhancock/groovy-1.6.4/lib/commons-cli-1.2.jar
file:/Users/richardhancock/groovy-1.6.4/lib/commons-logging-1.1.jar
file:/Users/richardhancock/groovy-1.6.4/lib/groovy-1.6.4.jar
file:/Users/richardhancock/groovy-1.6.4/lib/ivy-2.1.0-rc2.jar
file:/Users/richardhancock/groovy-1.6.4/lib/jline-0.9.94.jar
file:/Users/richardhancock/groovy-1.6.4/lib/jsp-api-2.0.jar
file:/Users/richardhancock/groovy-1.6.4/lib/junit-3.8.2.jar
file:/Users/richardhancock/groovy-1.6.4/lib/servlet-api-2.4.jar
file:/Users/richardhancock/groovy-1.6.4/lib/xstream-1.3.1.jar
file:/Users/richardhancock/.groovy/lib/http-builder-0.5.0-RC2.jar
file:/Users/richardhancock/.groovy/lib/httpclient-4.0.jar
file:/Users/richardhancock/.groovy/lib/httpcore-4.0.1.jar
file:/Users/richardhancock/.groovy/lib/json-lib-2.3-jdk15.jar
file:/Users/richardhancock/.groovy/lib/xml-resolver-1.2.jar
file:/Users/richardhancock/.groovy/conf/./
===> [Ljava.net.URL;@3ebc312f
groovy:000> url = loader.getResource("TalisConfig.groovy")
===> file:/Users/richardhancock/.groovy/conf/TalisConfig.groovy
groovy:000> def config = new ConfigSlurper().parse(url)
===> {talis={username=myusername, password=mypassword, store=mystore}}
groovy:000>
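
Once parsed, the values are available as nested properties of the config object. A minimal sketch, assuming the same structure as the TalisConfig.groovy file above but parsed from an inline string rather than from the classpath, so that it can be pasted straight into the shell or console:

```groovy
// Same layout as TalisConfig.groovy, held in a string for illustration only.
def source = '''
talis {
    user = "myusername"
    password = "mypassword"
    store = "mystore"
}
'''

// ConfigSlurper can parse a script held in a String as well as a URL.
def config = new ConfigSlurper().parse(source)

// Values are read with ordinary property access.
println "Using store: ${config.talis.store} user: ${config.talis.user}"
```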

Appendix C: Using Maven to run the Groovy Script

The TalisStore script can also be run via Maven. This approach uses the jar file dependencies defined in the Maven project and does not require the standard Groovy install. If a valid “TalisConfig.groovy” configuration file is available on the classpath, the parameters for “store”, “username” and “password” are not required. By default the pom.xml file excludes the dummy configuration file, but once it has been updated with real values it can be included by changing the exclude(s) to include(s). The TalisStore script can be run by executing command lines such as the following, which invoke the TalisStore main method (optionally with parameters).

mvn exec:java -Dexec.mainClass=TalisStore

mvn exec:java -Dexec.mainClass=TalisStore -Dexec.args="/sioc/forum/2007"

Appendix D: Authentication

The method “talis.auth.basic TALIS_USERNAME, TALIS_PASSWORD” is a bit of an anomaly since the Talis Platform uses HTTP Digest Authentication. RESTClient uses the “groovyx.net.http.AuthConfig.basic” method, which works for “digest” authentication as well.

Understanding the OpenCalais RDF Response

September 26th, 2009

I’m using an XML version of an article published by Scoop in February 2000, Senior UN Officials Protest UN Sanctions On Iraq, to understand the OpenCalais RDF response as part of a larger project of linking extracted entities to existing Linked Data datasets.

OpenCalais uses natural language processing (NLP), machine learning and other methods to analyze content and return the entities it finds, such as cities, countries and people, with dereferenceable Linked Data style URIs. The entity types are defined in the OpenCalais RDF Schemas.

When I submit the content to the OpenCalais REST web service (using the default RDF response format) an RDF document is returned. Opened below in TopBraid Composer, it shows a portion of the input content and some of the entity types OpenCalais can detect. The numbers in brackets indicate how many instances of an entity type have been detected, for example cle.Person(13) indicates that thirteen people have been detected.

The TopBraid Composer Instances tab contains the URIs of the people detected. Opening the highlighted URI reveals that it is for a person named Saddam Hussein.

Entity Disambiguation

One of the challenges when analyzing content and extracting entities is entity disambiguation. Can the person named Saddam Hussein be uniquely identified? Usually the context is needed in order to disambiguate similar entities. As described in the OpenCalais FAQ, if the “rdf:type rdf:resource” of a given entity contains /er/ the entity has been disambiguated by OpenCalais, while if it contains /em/ it has not.

In the example above cle.Person is <http://s.opencalais.com/1/type/em/e/Person>. There is no obvious link to an “rdf:type rdf:resource” containing /er/. It looks like OpenCalais has been able to determine that the text “Saddam Hussein” equates to a Person, but has not been able to determine specifically who that person is.

In contrast, Iraq (one of three countries detected) is shown below with the Incoming Reference http://d.opencalais.com/er/geo/country/ralg-geo1/d3b1cee2-327c-fa35-7dab-f0289958c024.

Opening the URI http://d.opencalais.com/er/geo/country/ralg-geo1/d3b1cee2-327c-fa35-7dab-f0289958c024 with either an HTML browser as http://d.opencalais.com/er/geo/country/ralg-geo1/d3b1cee2-327c-fa35-7dab-f0289958c024.html or with an RDF browser as http://d.opencalais.com/er/geo/country/ralg-geo1/d3b1cee2-327c-fa35-7dab-f0289958c024.rdf (in Tabulator below) shows that the country has been disambiguated with <rdf:type rdf:resource="http://s.opencalais.com/1/type/er/Geo/Country"/>.
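
The /er/ versus /em/ rule is simple enough to apply mechanically when processing a response. The following is a minimal sketch based only on the rule as I have described it above; resolutionStatus is a hypothetical helper name, and the two type URIs are the ones quoted in this post.

```groovy
// Classify an rdf:type URI by looking for /er/ or /em/ in its path.
def resolutionStatus = { String typeUri ->
    def path = new URI(typeUri).path
    if (path.contains('/er/')) {
        return 'disambiguated'
    }
    if (path.contains('/em/')) {
        return 'not disambiguated'
    }
    'unknown'
}

println resolutionStatus('http://s.opencalais.com/1/type/em/e/Person')
println resolutionStatus('http://s.opencalais.com/1/type/er/Geo/Country')
```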

Linked Data

In the RDF response returned by OpenCalais neither Iraq nor “Saddam Hussein” was linked to other Linked Data datasets. Some OpenCalais entities are. For example Moscow, Russia is linked via owl:sameAs to

Since I know that the context of the article is international news I can safely add some owl:sameAs links, such as the following DBpedia links for “Saddam Hussein” (below) and Iraq.
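
Written out as a statement, such a link is a single triple. The following is a minimal sketch that prints one in N-Triples form, assuming that the OpenCalais URI for Iraq quoted earlier is the right subject and that http://dbpedia.org/resource/Iraq is the matching DBpedia resource; the sameAs closure is a hypothetical helper.

```groovy
// The owl:sameAs property URI from the OWL vocabulary.
def OWL_SAME_AS = 'http://www.w3.org/2002/07/owl#sameAs'

// Hypothetical helper: format one owl:sameAs statement as an N-Triples line.
def sameAs = { String subject, String object ->
    "<${subject}> <${OWL_SAME_AS}> <${object}> ." as String
}

def iraq = 'http://d.opencalais.com/er/geo/country/ralg-geo1/d3b1cee2-327c-fa35-7dab-f0289958c024'
println sameAs(iraq, 'http://dbpedia.org/resource/Iraq')
```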

Entity Relevance

For both detected entities, “Saddam Hussein” and “Iraq”, OpenCalais provides an entity relevance score (shown for each in the screen shots below). The relevance capability detects the importance of each unique entity and assigns a relevance score in the range 0-1 (1 being the most relevant and important). From the screen shots it’s clear that “Iraq” has been ranked more relevant.

Detection Information

The RDF response includes the following properties relating to the subject’s detection:

  • c:docId: URI of the document this mention was detected in.
  • c:subject: URI of the unique entity.
  • c:detection: snippet of the input content where the metadata element was identified
  • c:prefix: snippet of the input content that precedes the current instance
  • c:exact: snippet of the input content in the matched portion of text
  • c:suffix: snippet of the input content that follows the current instance
  • c:offset: the character offset relative to the input content after it has been converted into XML
  • c:length: length of the instance.

The screen shot below for Saddam Hussein provides an example of how these properties work.
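
The way these properties fit together can also be shown with plain string handling. The following is a minimal sketch using a made-up sentence rather than the real article, and a plain string rather than the XML-converted content that c:offset actually refers to; the ten character window is likewise an arbitrary choice for illustration.

```groovy
// Made-up input content, for illustration only.
def content = 'Officials said sanctions on Iraq had failed.'

// c:exact, c:offset and c:length: the matched text and where it sits in the content.
def exact = 'Iraq'
def offset = content.indexOf(exact)
def length = exact.length()

// c:prefix and c:suffix: the text just before and just after the match.
def window = 10
def prefix = content.substring(Math.max(0, offset - window), offset)
def suffix = content.substring(offset + length,
        Math.min(content.length(), offset + length + window))

// offset and length are enough to recover the exact text from the content.
assert content.substring(offset, offset + length) == exact
println "prefix='${prefix}' exact='${exact}' suffix='${suffix}'"
```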

Conclusions

OpenCalais is a very impressive tool. It takes a while, though, to fully understand the RDF response, especially in the areas of entity disambiguation and the linking of OpenCalais entities to other Linked Data datasets. Most likely there are some subtleties that I have missed or misunderstood, so all clarifications are welcome.

For entities extracted from international news sources and not linked to other Linked Data datasets it would be interesting to try some equivalence mining.

Australia’s Government 2.0 Taskforce commissions Semantic Web Project

September 5th, 2009

The Australian Government initiated the Government 2.0 Taskforce in June 2009.

The launch video features Lindsay Tanner, Minister for Finance and Deregulation and chair Dr Nicholas Gruen in an enthusiastic presentation, outlining two key themes the government is keen for the taskforce to pursue.

These are:

  • Transparency and Openness. Using technology “to maximise the extent to which government information, data, and material can be put out into the public domain that we can be as accountable as possible, as transparent as possible and that this data is available for use in the general community.”
  • Community Engagement. Improving “the ways in which we engage with people in the wider community; in consultation, in discussion, in dialogue, about regulation, about government decisions, about policy generally.”

Examples of early government innovation include:

On 1 September 2009 the taskforce announced that it was Open for business, commissioning six projects and inviting interested parties (individuals or companies) to submit quotes by 9 September 2009.

Early leadership in Semantic Web

Of particular interest is the Early leadership in Semantic Web project. The project deliverable is to be a report which includes:

  • a guide for use by Australian Government agencies that will assist them with proper semantic tagging of datasets;
  • identified Australian Government datasets that could benefit from proper semantic tagging;
  • and a case study on the process of, and any issues arising from, applying proper semantic tagging to an identified agency dataset.

Both this and the fact that government departments such as the Australian Bureau of Statistics are moving to release data under a Creative Commons license are encouraging signs that an open web of linked data is evolving.

PricewaterhouseCoopers forecast the Semantic Web

June 7th, 2009

The freely available PricewaterhouseCoopers Spring 2009 Technology Forecast explains the value of the Semantic Web and Linked Data in the context of Enterprise applications, presenting interviews with leaders in the field and outlining how CIOs and individual departments can introduce Semantic Web technologies into their organizations.

Forecasts include:

  • “During the next three to five years, we forecast a transformation of the enterprise data management function driven by explicit engagement with data semantics” and
  • “PricewaterhouseCoopers believes a Web of data will develop that fully augments the document Web of today”.

W3C standards providing the foundation for this Web of data include URIs, RDF, RDF Schema (RDFS), the Web Ontology Language (OWL) and the Semantic Protocol and RDF Query Language (SPARQL).

“URIs are more specific in a Semantic Web context than URLs, often including a hash that points to a thing such as an individual musician, a song of hers, or the label she records for within a page, rather than just the page itself.”

“RDF takes the data elements identified by URIs and makes statements about the relationship of one element to another.”

Each statement is a triple, a subject-predicate-object combination.
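
A triple needs nothing more than three values. The following is a minimal sketch of the musician, song and label example from the quote above, held as plain Groovy maps; every URI here is a made-up example.org placeholder, and objectsOf is a hypothetical helper rather than part of any RDF library.

```groovy
// Each map is one statement: subject (s), predicate (p), object (o).
// All URIs are hypothetical placeholders.
def triples = [
    [s: 'http://example.org/music#artist1', p: 'http://example.org/vocab#recorded',
     o: 'http://example.org/music#song1'],
    [s: 'http://example.org/music#artist1', p: 'http://example.org/vocab#signedTo',
     o: 'http://example.org/music#label1'],
    [s: 'http://example.org/music#song1', p: 'http://example.org/vocab#releasedBy',
     o: 'http://example.org/music#label1']
]

// Find every object linked to a subject by a given predicate.
def objectsOf = { String subject, String predicate ->
    triples.findAll { it.s == subject && it.p == predicate }.collect { it.o }
}

println objectsOf('http://example.org/music#artist1', 'http://example.org/vocab#recorded')
```

Because the object of one statement can be the subject of another, a list of triples like this is already a graph.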

Ontologies (based on RDFS and OWL) describe the characteristics of these RDF data elements and their relationships within specific domains, facilitating machine interpretability of the data content.

“In this universe of nouns and verbs, the verbs articulate the connections, or relationships, between nouns. Each noun then connects as a node in a networked structure, one that scales easily because of the simplicity and uniformity of its Web-like connections.”

The Web of data approach clearly benefits a company such as the British Broadcasting Corporation (BBC) which “links to URIs at DBpedia.org, a version of the structured information on Wikipedia, to enrich sites such as its music site (http://www.bbc.co.uk/music/)”. It also links to MusicBrainz for information about artists and recordings.

As described by Tom Scott of BBC Earth:

“The relationship between the BBC content, the DBpedia content, and MusicBrainz is no more than URIs. We just have links between these things, and we have an ontology that describes how this stuff maps together.”

Other reviews of the PricewaterhouseCoopers Spring 2009 Technology Forecast include:

Tom Scott has a presentation on Linking bbc.co.uk to the Linked Data cloud, and the article DBpedia Examples using Linked Data and Sparql provides a simple example of using SPARQL to query DBpedia.