Using Linked Data to provide a different perspective on Software Architecture

December 17th, 2011

As outlined in my previous post Linked Data and the SOA Software Development Process, I am interested in using Linked Data to provide a more detailed view of SOA services.

A couple of scenarios during the past week highlighted the value of the approach, and also showed that it would benefit from extending the scope to include more information about the consumers of the SOA services and about the external data sources (in particular databases) they use.

Both scenarios involved setting up environments for the development and testing of new functionality involving a number of different systems, with each system needing to be deployed at a specific version level.

The first scenario related to software versions. The UML diagrams presented to describe the architecture were at too high a level to show the actual dependencies, but adding the level of detail needed would have made the diagrams too busy.

Although not yet complete, the work already done to provide a Linked Data perspective of the SOA services enabled a finer-grained view of the actual dependencies. Knowing the specific lower-level dependencies gave us more flexibility with the actual deployment. In particular, work could start on developing the new functionality for one component because it was not going to be affected by the proposed changes in another component. On the original UML diagram both components were shown as requiring changes; the Linked Data perspective provided enough additional detail to see that the changes could happen in parallel.

The second scenario related to finding the owners of external data sources so that we could determine whether they were available for use in a given test environment. Adding this ownership information to our Linked Data repository would speed up this part of the process in the future.

Linked Data and the SOA Software Development Process

November 17th, 2011

We have quite a rigorous SOA software development process; however, the full value of the collected information is not being realized because the artifacts are stored in disconnected information silos. So far, attempts to introduce tools which could improve the situation (e.g. zAgile Teamwork and Semantic MediaWiki) have been unsuccessful, possibly because the value of a Linked Data approach is not yet fully appreciated.

To provide an example Linked Data view of the SOA services and their associated artifacts I created a prototype consisting of Sesame running on a Tomcat server, with Pubby providing the Linked Data view via the Sesame SPARQL endpoint. TopBraid was connected directly to the Sesame native store (configured via the Sesame Workbench) to create a subset of services sufficient to demonstrate the value of publishing information as Linked Data. In particular, the prototype showed how easy it became to navigate from the requirements for a SOA service through to the details of its implementation.
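
As a rough illustration of that navigation, the sketch below queries the prototype's Sesame repository for the implementation artifacts linked to a given requirement. It is only a sketch: the repository name, the vocabulary (dep:realizedBy, dep:implementedBy) and the example requirement URI are assumptions made for the illustration, not the names actually used in the prototype.

import org.openrdf.query.QueryLanguage
import org.openrdf.repository.http.HTTPRepository

object RequirementToImplementation {
  def main(args: Array[String]): Unit = {
    // Assumed Sesame server URL and repository id
    val repo = new HTTPRepository("http://localhost:8080/openrdf-sesame", "soa-prototype")
    repo.initialize()
    val conn = repo.getConnection
    try {
      // Hypothetical vocabulary: a requirement is realized by a service,
      // which is implemented by one or more artifacts (WSDL, Java classes, etc.)
      val query =
        """PREFIX dep: <http://example.org/soa#>
          |SELECT ?service ?artifact WHERE {
          |  <http://example.org/requirements/REQ-123> dep:realizedBy ?service .
          |  ?service dep:implementedBy ?artifact .
          |}""".stripMargin
      val result = conn.prepareTupleQuery(QueryLanguage.SPARQL, query).evaluate()
      while (result.hasNext) {
        val bindings = result.next()
        println(bindings.getValue("service") + " -> " + bindings.getValue("artifact"))
      }
      result.close()
    } finally {
      conn.close()
    }
  }
}

Pubby exposes the same resources as browsable Linked Data, so a query like this is mainly useful for scripted checks.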

The prototype also highlighted that auto-generation of the RDF graph (the data providing the Linked Data view) from the actual source artifacts would be preferable to manual entry, especially if this could be transparently integrated with the current software development process. This has become the focus of the next step: automated knowledge extraction from the source artifacts.

Artifacts

Key artifact types of our process include:

  • business requirements and UML analysis models defining the business entities
  • Service Contracts defining the services and their operations
  • WSDLs and the Xml Schemas they utilize
  • JAX-WS implementations (Java source code managed in Subversion)
  • Oracle WebLogic Application Server configuration files
  • log files and defect reports

A Graph of Concepts and Instances

There is a rich graph of relationships linking the things described in the artifacts listed above. For example, the business entities defined in the UML analysis model are the subject of the services and service operations defined in the Service Contracts. The services and service operations are mapped to the WSDLs, which utilize the Xml Schemas that provide an XML view of the business entities. The JAX-WS implementations are linked to the WSDLs and Xml Schemas and deployed to the Oracle WebLogic Application Server, where the configuration files list the external dependencies. The log files and defects link back to specific parts of the code base (Subversion revisions) within the context of specific service operations. The people associated with the different artifacts can often be determined from artifact metadata.

RDF, OWL and Linked Data are a natural fit for modelling and viewing this graph since there is a mix of concepts plus a lot of instances, many of which already have an HTTP representation. The graph also contains a number of transitive relationships (for example, a WSDL may import an Xml Schema which in turn imports another Xml Schema, and so on), which promotes the use of owl:TransitiveProperty to help obtain a full picture of all the dependencies a component may have.
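
As a minimal sketch of the transitive modelling, and assuming a hypothetical imports property, the snippet below declares that property as an owl:TransitiveProperty. A store with OWL inferencing enabled (or an equivalent SPARQL property path query) would then return the indirect imports as well as the direct ones when the dependencies of a WSDL are requested; the property URI and repository details are assumptions.

import org.openrdf.model.vocabulary.{OWL, RDF}
import org.openrdf.repository.http.HTTPRepository

object DeclareTransitiveImports {
  def main(args: Array[String]): Unit = {
    val repo = new HTTPRepository("http://localhost:8080/openrdf-sesame", "soa-prototype")
    repo.initialize()
    val conn = repo.getConnection
    try {
      val vf = conn.getValueFactory
      // Hypothetical property linking a WSDL or Xml Schema to the schemas it imports
      val imports = vf.createURI("http://example.org/soa#imports")
      // With this declaration an inferencing store can answer
      // "what does this WSDL ultimately depend on?" in a single query.
      conn.add(imports, RDF.TYPE, OWL.TRANSITIVEPROPERTY)
    } finally {
      conn.close()
    }
  }
}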

Knowledge Extraction

Another advantage of the RDF, OWL and Linked Data approach is the use of unique URIs for identifying concepts and instances. This allows information contained in one artifact, e.g. a WSDL, to be extracted as RDF triples and later combined with the RDF triples extracted from the JAX-WS annotations in the Java source code. The combined RDF triples tell us more about the WSDL and its Java implementation than could be derived from either artifact alone.
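
A small sketch of the idea, using made-up property names and URIs: two extraction passes each contribute triples about the same WSDL URI, and because the URI is shared the statements merge into a single description of the WSDL once loaded into the repository.

import org.openrdf.repository.http.HTTPRepository

object CombineExtractedTriples {
  def main(args: Array[String]): Unit = {
    val repo = new HTTPRepository("http://localhost:8080/openrdf-sesame", "soa-prototype")
    repo.initialize()
    val conn = repo.getConnection
    try {
      val vf = conn.getValueFactory
      // The same (hypothetical) URI identifies the WSDL in both extraction passes
      val wsdl = vf.createURI("http://example.org/wsdl/CustomerService.wsdl")

      // Triples extracted from the WSDL itself
      val hasOperation = vf.createURI("http://example.org/soa#hasOperation")
      conn.add(wsdl, hasOperation, vf.createURI("http://example.org/soa#getCustomer"))

      // Triples extracted from the JAX-WS annotations in the Java source
      val implementedBy = vf.createURI("http://example.org/soa#implementedBy")
      conn.add(wsdl, implementedBy, vf.createLiteral("com.example.CustomerServiceImpl"))
    } finally {
      conn.close()
    }
  }
}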

We have made a start on knowledge extraction, but it is still very much a work in progress. Sites such as ConverterToRdf, RDFizers and the Virtuoso Sponger provide tools and information on generating RDF from different artifact types. Part of the current experimentation is about finding tools that can be transparently layered over the top of the current software development process. Finding the best way to extract the full set of desired RDF triples from Microsoft Word documents is also proving problematic, since some natural language processing is required.

Tools currently being evaluated include:

The Benefits of Linked Data

The prototype showed the benefits of Linked Data for navigating from the requirements for a SOA service through to the details of its implementation. Looking at all the information that could be extracted leads to a broader view of the benefits Linked Data would bring to the SOA software development process.

One specific use being planned is the creation of a Service Registry application providing the following functionality:

  • Linking the services to the implementations running in a given environment, e.g. dev, test and production. This includes linking the specific versions of the requirement, design or implementation artifacts and detailing the runtime dependencies of each service implementation.
  • Listing the consumers of each service and providing summary statistics on performance, e.g. daily usage figures derived from audit logs.
  • Providing a list of who to contact when a service is not available. This includes notifying consumers of a service outage and also contacting providers if a service is being affected by an external component being offline, e.g. a database or an external web service (see the query sketch after this list).
  • Searching for services by different criteria, e.g. business entity.
  • Tracking the evolution of services and assisting with refactoring, e.g. answering questions such as “Are there older versions of the Xml Schemas that can be deprecated?”
  • Simplifying the running of a specific SoapUI test case for a service operation in a given environment.
  • Providing the equivalent of a class lookup that covers all project classes plus all required infrastructure classes and returns information such as the jar file containing the class and the related JIRA and Subversion information.
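
As an example of the notification item above, the query below (reusing the Sesame connection pattern from the earlier sketch) asks for the consumers of a service together with a contact for each, so that they can be told about an outage. The registry vocabulary (reg:consumes, reg:contact) and the service URI are assumptions made for the illustration.

// Consumers of a given service and a contact for each, e.g. to notify them of an outage.
// Run with conn.prepareTupleQuery(QueryLanguage.SPARQL, whoToNotify).evaluate() as in the earlier sketch.
val whoToNotify =
  """PREFIX reg: <http://example.org/registry#>
    |SELECT ?consumer ?contact WHERE {
    |  ?consumer reg:consumes <http://example.org/services/CustomerService> .
    |  ?consumer reg:contact ?contact .
    |}""".stripMargin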

Customizing GATE grammars

July 19th, 2011

GATE is a tool for natural language processing of unstructured text. It includes an information extraction system called ANNIE which provides a basic set of annotations over documents, including parts of speech and word tokens, and is also able to recognize entities such as locations, cities and people.

For example, GATE can identify a piece of text as referencing a particular named individual (a Person). This piece of text is then highlighted in the marked-up document to show how GATE has annotated it.

Let's say that the name Nick appears in a document. To identify that this piece of text refers to a Person, GATE refers to its own internal gazetteer of male and female first names. The word token Nick is found in the male first name list, so GATE annotates the text Nick to say that it refers to a particular ‘male’ individual named Nick.

But how do we best deal with false positives? A classic example is the word PAGE (which in my case was simply the page number in the footer of a Word document). This was marked up by GATE as a Person because it identified Page as a woman’s first name.

Here is an idea:

Create a local ‘negative’ list of names (gazetteer) and an associated JAPE grammar rule which would override (i.e. have a higher priority than) the corresponding GATE rule. This new grammar would use the existing gazetteer of names supplied by GATE, but if a particular ‘name’ was found on the local ‘negative’ list the grammar rule would not create a Person annotation for it. Over time we could add to this negative list of names as we found more false positives (text incorrectly annotated and classified by a GATE grammar).

Another example of a local customization would be where we are dealing with sets of documents in a very confined context within a specific organization. We could create a gazetteer list of names (populated from the organization’s LDAP repository) containing the actual names of people in the organization. We would then create a local grammar (JAPE file) with a higher priority than the one used by GATE to create the Person annotation. This grammar would only add a Person annotation if the name existed in the local gazetteer.

Using the Neon Toolkit and ANNIE to demonstrate extracting RDF from Natural Language

July 10th, 2011

The Neon Toolkit is an open source ontology engineering environment providing an extensive set of plug-ins for various ontology engineering activities.

One such plugin is the GATE web services plugin which adds Natural Language Entity Recognition functionality from the GATE (General Architecture for Text Engineering) framework.

The GATE web services plugin can be quickly added to the Neon Toolkit by

  • opening the Help | Install New Software … menu option
  • selecting “NeOn Toolkit Update Site v2.4 – http://neon-toolkit.org/plugins/2.4” from the Work With drop-down combo box.
  • and selecting GATE Web Services as shown below.

The GATE web services plugin includes ANNIE (Ontology Generation Based on Named Entity Recognition), which can be used to demonstrate basic Named Entity Recognition and ontology generation. The main GATE site provides more details on how ANNIE: a Nearly-New Information Extraction System works.

After the GATE web services plugin has been installed, GATE Services appears as an additional top-level menu option. Selecting GATE Services | Call Multi-document Web Service opens the Call GATE web service dialog box below, which provides the option to select ANNIE as the service to call.

Selecting ANNIE and then Next opens an additional dialog box where the Input directory: (containing the documents to be processed) and the Output ontology: can be specified.

Once the Input directory: and the Output ontology: have been specified and the Finish button selected, ANNIE reads the input and generates a basic ontology based on the concepts, instances and relations found in the text.

When the text below is provided as input, ANNIE generates the following RDF output.

Input Text

Nick lives in Toronto and studies at Concordia University. Toronto is six hours from Montreal. Toronto is a nice place to live.

RDF Output

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
	xmlns:protons="http://proton.semanticweb.org/2005/04/protons#"
	xmlns:protonu="http://proton.semanticweb.org/2005/04/protonu#"
	xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
	xmlns:owl="http://www.w3.org/2002/07/owl#"
	xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
	xmlns:protonkm="http://proton.semanticweb.org/2005/04/protonkm#"
	xmlns:protont="http://proton.semanticweb.org/2005/04/protont#">
<!-- All statement -->

<rdf:Description rdf:about="http://gate.ac.uk/owlim#Nick">
	<rdf:type rdf:resource="http://gate.ac.uk/owlim#Person"/>
	<rdfs:label xml:lang="en">nick</rdfs:label>
	<rdfs:label xml:lang="en">Nick</rdfs:label>
	<rdfs:label>Nick</rdfs:label>
</rdf:Description>

<rdf:Description rdf:about="http://gate.ac.uk/owlim#Montreal">
	<rdf:type rdf:resource="http://gate.ac.uk/owlim#Location"/>
	<rdfs:label xml:lang="en">montreal</rdfs:label>
	<rdfs:label xml:lang="en">Montreal</rdfs:label>
	<rdfs:label>Montreal</rdfs:label>
</rdf:Description>

<rdf:Description rdf:about="http://gate.ac.uk/owlim#Location">
	<rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Class"/>
	<rdfs:label xml:lang="en">Location</rdfs:label>
	<rdfs:label>Location</rdfs:label>
</rdf:Description>

<rdf:Description rdf:about="http://gate.ac.uk/owlim#Concordia_University">
	<rdf:type rdf:resource="http://gate.ac.uk/owlim#Organization"/>
	<rdfs:label xml:lang="en">concordia university</rdfs:label>
	<rdfs:label xml:lang="en">Concordia University</rdfs:label>
	<rdfs:label xml:lang="en">Concordia_University</rdfs:label>
	<rdfs:label>Concordia_University</rdfs:label>
</rdf:Description>

<rdf:Description rdf:about="http://gate.ac.uk/owlim#Organization">
	<rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Class"/>
	<rdfs:label xml:lang="en">Organization</rdfs:label>
	<rdfs:label>Organization</rdfs:label>
</rdf:Description>

<rdf:Description rdf:about="http://gate.ac.uk/owlim#Person">
	<rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Class"/>
	<rdfs:label xml:lang="en">Person</rdfs:label>
	<rdfs:label>Person</rdfs:label>
</rdf:Description>

<rdf:Description rdf:about="http://gate.ac.uk/owlim#Toronto">
	<rdf:type rdf:resource="http://gate.ac.uk/owlim#Location"/>
	<rdfs:label xml:lang="en">toronto</rdfs:label>
	<rdfs:label xml:lang="en">Toronto</rdfs:label>
	<rdfs:label>Toronto</rdfs:label>
</rdf:Description>

</rdf:RDF>

The entities recognized are:

  • Nick as a Person
  • Montreal and Toronto as Locations
  • Concordia University as an Organization.

While relatively simplistic, the overall example (comprising the input text, the generated RDF output and the quick setup process for the Neon Toolkit and the GATE web services plugin) helped to demonstrate the potential of Named Entity Recognition and ontology generation.

The input text actually comes from a demo of the OwlExporter, which provides similar functionality for GATE itself. Longer term, GATE is likely to be part of a Natural Language Processing solution for a government department where the sensitivity of the private data would preclude the use of an external web service. Hopefully there will also be time later on to write up the results of using GATE and the OwlExporter with the same input text.

(For this article Neon Toolkit version 2.4.2 was used.)

Configuring Persistence for Lift Web Applications

May 29th, 2011

Generating a basic Lift web application using Maven (see Using Maven and Eclipse to generate Scala Lift Web Applications) creates a project that by default uses an H2 database. The generated application provides three obvious options for configuring persistence. The starting point is Boot.scala (located in src/main/scala/bootstrap/liftweb/Boot.scala).

In the examples below we look at:

  • how H2 is specified as the default database in Boot.scala
  • using a properties file to specify a different database (e.g. MySQL)
  • using JNDI to specify a third database (e.g. PostgreSQL).

The Default Database

The code fragment below from the generated Boot.scala file determines that:

  • if there is no JNDI entry for the database and
  • there are no JDBC entries specified in the properties file
  • then the application connects to an H2 database using the JDBC driver class "org.h2.Driver" and the JDBC URL "jdbc:h2:lift_proto.db;AUTO_SERVER=TRUE".
class Boot {
  def boot {
    if (!DB.jndiJdbcConnAvailable_?) {
      val vendor =
        new StandardDBVendor(Props.get("db.driver") openOr "org.h2.Driver",
          Props.get("db.url") openOr
            "jdbc:h2:lift_proto.db;AUTO_SERVER=TRUE",
          Props.get("db.user"), Props.get("db.password"))

      LiftRules.unloadHooks.append(vendor.closeAllConnections_! _)

      DB.defineConnectionManager(DefaultConnectionIdentifier, vendor)
    }
    ...
  }
}

Since the property file is not automatically generated with JDBC properties, the H2 database becomes the default.

Updating the Maven POM File

When configuring a database other than H2, the Maven POM file needs to be updated so that a jar file containing the appropriate JDBC driver is available at runtime. The following adds the JDBC drivers for both MySQL and PostgreSQL.

<project ... >
   ...
  <dependencies>
     ...
    <dependency>
      <groupId>com.h2database</groupId>
      <artifactId>h2</artifactId>
      <version>1.2.138</version>
      <scope>runtime</scope>
    </dependency>
    <!-- Added for MySQL datasource -->
    <dependency>
      <groupId>mysql</groupId>
      <artifactId>mysql-connector-java</artifactId>
      <version>5.1.15</version>
      <scope>runtime</scope>
    </dependency>
    <!-- Added for PostgreSQL datasource -->
    <dependency>
      <groupId>postgresql</groupId>
      <artifactId>postgresql</artifactId>
      <version>9.0-801.jdbc4</version>
      <scope>runtime</scope>
    </dependency>
     ...
  </dependencies>
   ...
</project>

Configuring JDBC with a Properties File

Adding the property file “default.props” to the project at

  • src/main/resources/props/default.props

allows a different database to be configured by setting JDBC properties with names matching those expected in Boot.scala.

The example default.props below configures a MySQL database.

# Properties in this file will be read when running in dev mode
db.driver=com.mysql.jdbc.Driver
db.url=jdbc:mysql://localhost:3306/liftbasic
db.user=mysql_username
db.password=mysql_password

Using JNDI to Configure a Datasource

A JNDI-configured database can be enabled by adding one additional line to Boot.scala specifying the JNDI name of the datasource, for example:

DefaultConnectionIdentifier.jndiName = "jdbc/liftbasic"

The original generated code then becomes:

def boot {
  ...
  DefaultConnectionIdentifier.jndiName = "jdbc/liftbasic"

  if (!DB.jndiJdbcConnAvailable_?) {
    val vendor =
      new StandardDBVendor(Props.get("db.driver") openOr "org.h2.Driver",
        Props.get("db.url") openOr
          "jdbc:h2:lift_proto.db;AUTO_SERVER=TRUE",
        Props.get("db.user"), Props.get("db.password"))

    LiftRules.unloadHooks.append(vendor.closeAllConnections_! _)

    DB.defineConnectionManager(DefaultConnectionIdentifier, vendor)
  }
  ...
}

A more configurable option is to replace the line just added with the following, which checks whether the JNDI name has been set via the "jndi.name" property in the "default.props" property file. If not, "jdbc/liftbasic" is used.

DefaultConnectionIdentifier.jndiName = Props.get("jndi.name") openOr "jdbc/liftbasic"
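
For example, assuming the property is added to the default.props file shown earlier, the JNDI name can then be changed without touching Boot.scala:

jndi.name=jdbc/liftbasic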

A resource-ref element is also added to the web.xml file (i.e. src/main/webapp/WEB-INF/web.xml) to allow the container to manage the connection to the database.


<web-app>
    ...
   <resource-ref>
      	<description>Database Connection</description>
      	<res-ref-name>jdbc/liftbasic</res-ref-name>
      	<res-type>javax.sql.DataSource</res-type>
      	<res-auth>Container</res-auth>
   </resource-ref>
    ...
</web-app>

Container Specific JNDI Settings

Each server platform has its own way of configuring the JNDI settings that map the JNDI name read by Lift to a specific database. Below are examples for Tomcat, Jetty and the CloudBees platform.

Tomcat

Tomcat manages JNDI settings via the Context element. The following adds a Context element to the Tomcat server.xml file (i.e. TOMCAT_HOME/conf/server.xml), mapping the JNDI name to a PostgreSQL database.

<Server>
  <Service>
    <Engine>
      <Host>
         ...
		<Context path="/liftbasic" docBase="liftbasic" reloadable="true" crossContext="true">
			<Resource name="jdbc/liftbasic"
			auth="Container"
			description="DB Connection"
			type="javax.sql.DataSource"
			driverClassName="org.postgresql.Driver"
			url="jdbc:postgresql://127.0.0.1:5432/postgres"
			username="postgres_username"
			password="postgres_password"
			maxActive="4"
			maxIdle="2" maxWait="-1"/>
		</Context>
        ...
      </Host>
    </Engine>
  </Service>
</Server>

Jetty

There are a number of options for configuring JNDI for Jetty. One option is to add a jetty-env.xml file to the WEB-INF directory to configure JNDI resources specifically for that webapp. The example jetty-env.xml file below configures a MySQL JNDI datasource for Jetty.

<?xml version="1.0"?>
<!DOCTYPE Configure PUBLIC "-//Mort Bay Consulting//DTD Configure//EN" "http://jetty.mortbay.org/configure.dtd">
<Configure class="org.mortbay.jetty.webapp.WebAppContext">

	<New id="liftbasic" class="org.mortbay.jetty.plus.naming.Resource">
	    <Arg></Arg>
	    <Arg>jdbc/liftbasic</Arg>
	    <Arg>
	     <New class="com.mysql.jdbc.jdbc2.optional.MysqlConnectionPoolDataSource">
	                 <Set name="Url">jdbc:mysql://localhost:3306/liftbasic</Set>
	                 <Set name="User">mysql_username</Set>
	                 <Set name="Password">mysql_password</Set>
	     </New>
	    </Arg>
	   </New>
</Configure>

CloudBees

CloudBees-specific configuration is placed in the cloudbees-web.xml file in the WEB-INF directory of the deployed WAR file (i.e. src/main/webapp/WEB-INF/cloudbees-web.xml). The example below maps a CloudBees-managed MySQL datasource to the JNDI name read by Lift.

<?xml version="1.0"?>
<cloudbees-web-app xmlns="http://www.cloudbees.com/xml/webapp/1">
    <appid>lift</appid>
    <context-param>
        <param-name>application.environment</param-name>
        <param-value>prod</param-value>
    </context-param>
	<resource name="jdbc/liftbasic" auth="Container" type="javax.sql.DataSource">
	 <param name="username" value="cloudbees_mysql_username" />
	 <param name="password" value="cloudbees_mysql_password" />
	 <param name="url" value="jdbc:cloudbees://liftbasic" />
	</resource>
</cloudbees-web-app>