Apache Stanbol

 


Note: This blog conforms to Stanbol 0.10.0-incubating SNAPSHOT release.

We at FORMCEPT are really excited to be early adopters of Apache Stanbol. We are working on integrating it with our Big Data Analysis Stack. While working with Apache Stanbol we created multiple enhancement engines, and we would like to share our experience of building one.

Overview

Apache Stanbol is built as a modular set of components. Each component is accessible via its own RESTful web interface. From this viewpoint, all Apache Stanbol features can be used via RESTful service calls. The components are implemented as OSGi components based on Apache Felix.
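
Since every feature is reachable over REST, a plain HTTP client is all you need to talk to Stanbol. Here is a minimal sketch of a call to the Enhancer. It assumes Java 11 or later (for java.net.http), a Stanbol instance running at http://localhost:8080 with the Enhancer mounted at /enhancer, and Turtle as the response format. The class and method names are made up for this illustration and are not part of Stanbol.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper class, not part of Stanbol.
class EnhancerClientSketch {

    // Assumed endpoint of a locally running Stanbol launcher.
    static final String ENHANCER_URL = "http://localhost:8080/enhancer";

    // Builds (but does not send) a request that posts plain text to the Enhancer.
    static HttpRequest buildRequest(String text) {
        return HttpRequest.newBuilder(URI.create(ENHANCER_URL))
                .header("Content-Type", "text/plain")
                .header("Accept", "text/turtle")
                .POST(HttpRequest.BodyPublishers.ofString(text))
                .build();
    }

    // Sends the request and returns the response body (the enhancements as RDF text).
    static String enhance(String text) throws Exception {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildRequest(text), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // Needs a running Stanbol instance; otherwise the connection is refused.
        System.out.println(enhance("Paris is the capital of France."));
    }
}
```

Building the request is kept separate from sending it so that the request can be inspected without a server running.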

Apache Stanbol Components

  1. Enhancer and Enhancement Engines: Annotate possible entities and link them to public or private entity repositories.
  2. Entityhub: Caches and manages entities stored in local indexes of linked-data repositories including entities specific to a particular domain.
  3. Contenthub: Provides a persistent document store on top of Apache Solr. It enables semantic indexing and semantic search, including faceted search, on the documents. Endpoint- http://localhost:8080/contenthub
  4. CMS Adapter: Acts as a bridge between JCR/CMIS compliant content management systems and Apache Stanbol.
  5. Ontology Manager: Manages ontologies that are used to define the knowledge models that describe the metadata of content.
  6. Reasoners: Used to automatically infer additional knowledge.
  7. Rules: Provides the means to re-factor knowledge graphs.
  8. FactStore: Stores relations between entities identified by their URIs. The relation between the two entities is called a fact.

 Apache Stanbol Component Layer (Credit: Rupert)

Getting Started

  • Check out the latest source code from Apache Stanbol repository- svn co http://svn.apache.org/repos/asf/incubator/stanbol/trunk stanbol
  • Make sure you have at least Java 6 and Maven 2.2.1
  • Set maven parameters: export MAVEN_OPTS="-Xmx512M -XX:MaxPermSize=128M"
  • Compile using the command: mvn clean install (To skip tests, use -DskipTests)
  • Sit back and relax while it compiles and sets everything up

If your build fails with this error-
Reason: Cannot find parent: com.sun.jersey:jersey-project for project: null:jersey-server:jar:null for project null:jersey-server:jar:null

Clean up your existing jersey repository and compile again. To clean up, just remove the directory: rm -rf ~/.m2/repository/com/sun/jersey/

If it still doesn't work, make sure that you are using Maven 3. If not, install it and try to compile Stanbol again. If you are using Ubuntu, here are a few steps to install Maven 3- http://askubuntu.com/questions/49557/how-do-i-install-maven-3

Once the build succeeds, switch to the launchers directory and launch Stanbol with the command: java -Xmx1g -jar full/target/org.apache.stanbol.launchers.full-{snapshot-version}-SNAPSHOT.jar

Open URL http://localhost:8080 and try it out. If everything went fine, you will see Stanbol's default landing page-

Go ahead and play around with it!

Developers

To set up an Eclipse project (recommended only for development), run: mvn eclipse:eclipse

Once the build is successful, import the entire Stanbol directory into a workspace. That is it, you are all set. The rest of this blog explains Stanbol components in detail.

Detailed Description

This section describes some of the components of Stanbol in detail. For more information, I encourage you to take a look at Stanbol's documentation.

ContentHub

For the impatient, there is a quick 5 minute tutorial available here- http://incubator.apache.org/stanbol/docs/trunk/contenthub/contenthub5min

Contenthub is a document repository that stores "content items". A content item consists of the document's metadata in addition to its text-based content. Contenthub has two main subcomponents-

  • Store - Responsible for persistent storage of content items
  • Search - Provides semantic search facilities for the content items
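
To make the split between Store and Search concrete, here is a toy in-memory sketch of the two roles. It is only a conceptual model: the real Contenthub persists and indexes content items with Apache Solr, and none of the class or method names below come from Stanbol.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy in-memory stand-in for the Contenthub idea; not Stanbol's API.
class ContentStoreSketch {

    // A content item: the document text plus its metadata.
    static final class Item {
        final String id;
        final String text;
        final Map<String, String> metadata;

        Item(String id, String text, Map<String, String> metadata) {
            this.id = id;
            this.text = text;
            this.metadata = metadata;
        }
    }

    private final Map<String, Item> items = new LinkedHashMap<>();

    // "Store" role: keep the item under its identifier.
    void store(Item item) {
        items.put(item.id, item);
    }

    // "Search" role: ids of items whose text or metadata values mention the keyword.
    List<String> search(String keyword) {
        String needle = keyword.toLowerCase();
        List<String> hits = new ArrayList<>();
        for (Item item : items.values()) {
            boolean inText = item.text.toLowerCase().contains(needle);
            boolean inMetadata = item.metadata.values().stream()
                    .anyMatch(value -> value.toLowerCase().contains(needle));
            if (inText || inMetadata) {
                hits.add(item.id);
            }
        }
        return hits;
    }
}
```

The point to take away is that a search can match on the metadata as well as on the text, which is what makes the enhancements produced by the engines useful at query time.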

Contenthub uses Apache Solr for storage, indexing and retrieval of content items and LD Path for managing them. Try- http://localhost:8080/contenthub/ldpath to submit or retrieve LD Path programs. Contenthub provides three search interfaces-

  • SolrSearch: Native Solr Search Interface. Results are returned in "org.apache.solr.client.solrj.response.QueryResponse" format.
  • RelatedKeywordSearch: Also finds other related keywords from several sources. Wordnet, domain ontologies and referenced sites are the data sources for these services to retrieve the related keywords.
  • FeaturedSearch: Combines the services of SolrSearch and RelatedKeywordSearch

Creating an Enhancement Engine

The first thing you will want to do after playing around with Stanbol is write your own enhancement engine. I recommend first exploring the org.apache.stanbol.enhancer.engines package for existing engines. Start with the language identification engine and then move on to the Tika engine, the NER engine, etc. The implementations are largely self-describing and will give you a kick-start for your own enhancement engine.

Let's try creating a simple enhancement engine that adds a new label as an enhancement. Don't worry about whether the label is correct or not. Just try to follow the flow to understand the overall development process.

Step-0: Create a Maven project

Create a new maven project for your enhancement engine and add these dependencies-

<dependencies>
 <dependency>
  <groupId>org.apache.stanbol</groupId>
  <artifactId>org.apache.stanbol.enhancer.servicesapi</artifactId>
  <version>0.10.0-incubating-SNAPSHOT</version>
 </dependency>
 <dependency>
  <groupId>org.apache.felix</groupId>
  <artifactId>org.apache.felix.scr.annotations</artifactId>
  <version>1.6.0</version>
 </dependency>
</dependencies>

You will also need a few plugins to build the bundle-

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.felix</groupId>
      <artifactId>maven-bundle-plugin</artifactId>
      <extensions>true</extensions>
      <configuration>
        <instructions>
          <!-- Enable this for including your --> 
          <!-- enhancement chain configuration -->
          <!-- <Install-Path>config</Install-Path> -->
          <Export-Package>
            org.formcept.engine.enhancer.*;version=${project.version}
          </Export-Package>
          <Embed-Dependency>
          </Embed-Dependency>
        </instructions>
      </configuration>
    </plugin>
    <plugin>
      <groupId>org.apache.felix</groupId>
      <artifactId>maven-scr-plugin</artifactId>
      <executions>
        <execution>
          <id>generate-scr-scrdescriptor</id>
          <goals>
            <goal>scr</goal>
          </goals>
          <configuration>
            <properties>
              <service.vendor>FORMCEPT</service.vendor>
            </properties>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>

If you want to enable your own enhancement chain or modify an existing chain, you need to include the Install-Path directive and put the configuration in the config folder under the resources of your project. Don't worry about configuration for now; the Default Chain will pick up your enhancement engine. For more details, take a look at Stanbol's documentation on List Chain- http://incubator.apache.org/stanbol/docs/trunk/enhancer/chains/listchain.html

You will also need a few plugin management directives-

<pluginManagement>
  <plugins>
    <plugin>
      <groupId>org.apache.felix</groupId>
      <artifactId>maven-bundle-plugin</artifactId>
      <version>2.3.7</version>
      <inherited>true</inherited>
      <configuration>
        <instructions>
          <Bundle-DocURL>http://www.formcept.com</Bundle-DocURL>
          <Bundle-Vendor>FORMCEPT</Bundle-Vendor>
          <Bundle-SymbolicName>${project.artifactId}</Bundle-SymbolicName>
          <_versionpolicy>$${version;===;${@}}</_versionpolicy>
        </instructions>
      </configuration>
    </plugin>
    <plugin>
      <groupId>org.apache.felix</groupId>
      <artifactId>maven-scr-plugin</artifactId>
      <version>1.7.4</version>
      <executions>
        <execution>
          <id>generate-scr-scrdescriptor</id>
          <goals>
            <goal>scr</goal>
          </goals>
          <configuration>
            <properties>
              <service.vendor>FORMCEPT</service.vendor>
            </properties>
          </configuration>
        </execution>
      </executions>
    </plugin>
    <!-- This prevents m2e error in Eclipse. -->
    <!-- Does not affect the build -->
    <plugin>
      <groupId>org.eclipse.m2e</groupId>
      <artifactId>lifecycle-mapping</artifactId>
      <version>1.0.0</version>
      <configuration>
        <lifecycleMappingMetadata>
          <pluginExecution>
            <pluginExecutionFilter>
              <groupId>org.apache.felix</groupId>
              <artifactId>maven-scr-plugin</artifactId>
              <versionRange>[1.4.4,)</versionRange>
              <goals>
                <goal>scr</goal>
              </goals>
            </pluginExecutionFilter>
            <action>
              <ignore />
            </action>
          </pluginExecution>
        </lifecycleMappingMetadata>
      </configuration>
    </plugin>
  </plugins>
</pluginManagement>

That's it for the configuration. You can now easily package your enhancement engine with maven.

Step-1: Create a new class for your Enhancement Engine

@Component(immediate = true, metatype = true, inherit=true)
@Service
@Properties(value={
    @Property(name=EnhancementEngine.PROPERTY_NAME,value="formcept-disambiguator")
})
public class FCDEnhancer extends AbstractEnhancementEngine<IOException, RuntimeException>
    implements EnhancementEngine, ServiceProperties {

    public Map<String, Object> getServiceProperties() {
        // TODO Auto-generated method stub
        return null;
    }

    public int canEnhance(ContentItem ci) throws EngineException {
        // TODO Auto-generated method stub
        return 0;
    }

    public void computeEnhancements(ContentItem ci) throws EngineException {
        // TODO Auto-generated method stub

    }

}

Each enhancement engine is an OSGi component. Stanbol uses Apache Felix as the service platform for OSGi components. You need to add a few annotations to your class for Apache Felix. The annotations above direct Apache Felix to treat this class as a component, activate it immediately, generate the Metatype service data, and inherit the service, property and reference declarations from the base class.

We also define a property EnhancementEngine.PROPERTY_NAME that contains the name of the component. You will see the same name in the Stanbol user interface for your component. For more details on the annotations and service declarations, please take a look at the Apache Felix project.

Your enhancement engine should also extend- org.apache.stanbol.enhancer.servicesapi.impl.AbstractEnhancementEngine
and implement- org.apache.stanbol.enhancer.servicesapi.EnhancementEngine and org.apache.stanbol.enhancer.servicesapi.ServiceProperties

The extended class and implemented interfaces give you all the power to play around with Stanbol's content item and add enhancements to it. They are the entry points to the Stanbol architecture for your enhancement engine.

Step-2: Implement the required methods

canEnhance(ContentItem), computeEnhancements(ContentItem) and getServiceProperties

canEnhance(ContentItem)

public int canEnhance(ContentItem ci) throws EngineException {
     // check if content is present
     try {
         // read the text once instead of loading the blob twice
         String text = ContentItemHelper.getText(ci.getBlob());
         if (text == null || text.trim().isEmpty()) {
             return CANNOT_ENHANCE;
         }
     } catch (IOException e) {
         LOG.error("Failed to get the text for " +
          "enhancement of content: " + ci.getUri(), e);
         throw new InvalidContentException(this, ci, e);
     }
     // default enhancement is synchronous enhancement
     return ENHANCE_SYNCHRONOUS;
}

Stanbol provides org.apache.stanbol.enhancer.servicesapi.helper.ContentItemHelper class that contains useful methods to work with the ContentItem.

getServiceProperties

public Map<String, Object> getServiceProperties() {
   return Collections.unmodifiableMap(Collections.singletonMap(
       ENHANCEMENT_ENGINE_ORDERING, (Object) defaultOrder));
}

In the getServiceProperties implementation we provide the ordering for our enhancement engine. For more details on the ordering, please take a look at- http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/
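
To see what the ordering value is used for, here is a small stand-alone sketch that sorts engines by it. It assumes that engines reporting a higher ordering value run earlier, which is how I understand the Stanbol documentation; please verify against the page linked above. The engine names and the numeric values are invented for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Stand-alone illustration; not how Stanbol implements chain execution.
class EngineOrderingSketch {

    // An engine name plus the ordering value it would report via getServiceProperties().
    static final class Engine {
        final String name;
        final int order;

        Engine(String name, int order) {
            this.name = name;
            this.order = order;
        }
    }

    // Assumption: a higher ordering value means the engine runs earlier.
    static List<String> executionOrder(List<Engine> engines) {
        List<Engine> sorted = new ArrayList<>(engines);
        sorted.sort(Comparator.comparingInt((Engine e) -> e.order).reversed());
        List<String> names = new ArrayList<>();
        for (Engine engine : sorted) {
            names.add(engine.name);
        }
        return names;
    }

    public static void main(String[] args) {
        // Hypothetical engines and ordering values.
        List<Engine> engines = List.of(
                new Engine("formcept-disambiguator", 0),
                new Engine("language-detection", 200),
                new Engine("named-entity-extraction", 1));
        // prints [language-detection, named-entity-extraction, formcept-disambiguator]
        System.out.println(executionOrder(engines));
    }
}
```

An engine that depends on the output of another engine (for example, one that needs the detected language) should therefore report a lower value than the engine it depends on.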

computeEnhancements(ContentItem)

public void computeEnhancements(ContentItem ci) throws EngineException {
    // write results (requires a write lock)
    // not required as we are enhancing synchronously
    //ci.getLock().writeLock().lock();
    try {
        // get the metadata graph
        MGraph metadata = ci.getMetadata();
        // update some sample data
        UriRef textAnnotation = EnhancementEngineHelper.createTextEnhancement(ci, this);
        metadata.add(new TripleImpl(textAnnotation, ENHANCER_ENTITY_LABEL, 
             new PlainLiteralImpl("FORMCEPT")));
    } finally {
        //ci.getLock().writeLock().unlock();
    } 
}

Here we add a label named FORMCEPT to the content item's metadata, which is stored as an MGraph. To know more about MGraph, please read about the Apache Clerezza project. I would like to restate that you should not add a label arbitrarily. This is just to understand the overall flow of development.
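
If you have not worked with RDF graphs before, it may help to think of the metadata as a set of (subject, predicate, object) statements. The toy sketch below models only that idea in plain Java; it is not Clerezza's API. The annotation identifier is hypothetical, and the predicate URI is my assumption of what ENHANCER_ENTITY_LABEL resolves to.

```java
import java.util.LinkedHashSet;
import java.util.Objects;
import java.util.Set;

// Toy triple store; Stanbol itself uses Clerezza's MGraph.
class MetadataGraphSketch {

    // Assumed full URI behind ENHANCER_ENTITY_LABEL.
    static final String ENTITY_LABEL = "http://fise.iks-project.eu/ontology/entity-label";

    // One statement of the graph.
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof Triple)) {
                return false;
            }
            Triple t = (Triple) other;
            return subject.equals(t.subject)
                    && predicate.equals(t.predicate)
                    && object.equals(t.object);
        }

        @Override
        public int hashCode() {
            return Objects.hash(subject, predicate, object);
        }
    }

    private final Set<Triple> triples = new LinkedHashSet<>();

    // Returns false when the same statement is already present (a graph is a set).
    boolean add(String subject, String predicate, String object) {
        return triples.add(new Triple(subject, predicate, object));
    }

    // All object values stated for a subject/predicate pair.
    Set<String> objects(String subject, String predicate) {
        Set<String> result = new LinkedHashSet<>();
        for (Triple t : triples) {
            if (t.subject.equals(subject) && t.predicate.equals(predicate)) {
                result.add(t.object);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        MetadataGraphSketch metadata = new MetadataGraphSketch();
        String textAnnotation = "urn:example:text-annotation-1"; // hypothetical annotation id
        metadata.add(textAnnotation, ENTITY_LABEL, "FORMCEPT");
        System.out.println(metadata.objects(textAnnotation, ENTITY_LABEL));
    }
}
```

In the real engine, createTextEnhancement gives you the subject, and every further fact about that annotation is one more triple with the same subject.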

Configuration Parameters

If you want to read a user-defined configuration parameter into your enhancer, you can declare it as a property and populate it in the activate method. For example, if a service URL is required to connect to an external service, we can declare a property, like-

@Property(value = "https://www.formcept.com/analyze")
public static final String FORMCEPT_SERVICE_URL = "org.formcept.engine.enhancer.url";

/**
 * Service URL
 */
private String serviceURL;

The serviceURL can be populated in the activate method, like-

/**
 * Activate and read the properties
 * @param ce the {@link ComponentContext}
 */
@Activate
protected void activate(ComponentContext ce) throws ConfigurationException {
    try {
        super.activate(ce);
    } catch (IOException e) {
        // log
        LOG.error("Failed to update the configuration", e);
    }
    @SuppressWarnings("unchecked")
    Dictionary<String, Object> properties = ce.getProperties();
    // update the service URL if it is defined
    if(properties.get(FORMCEPT_SERVICE_URL) != null){
        this.serviceURL = (String) properties.get(FORMCEPT_SERVICE_URL);
    }
}

/**
 * Deactivate
 * @param ce the {@link ComponentContext}
 */
@Deactivate
protected void deactivate(ComponentContext ce) {
    super.deactivate(ce);
}

You can also use the deactivate method to clean up the resources that you are using.
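
The property lookup is easy to try outside of an OSGi container, because the component properties are just a java.util.Dictionary. Here is a minimal sketch of that lookup with a fallback to the default URL. The class name and the readServiceUrl helper are hypothetical; the property key and default value are the ones declared above.

```java
import java.util.Dictionary;
import java.util.Hashtable;

// Stand-alone sketch of reading one configuration value; no OSGi classes required.
class ServiceUrlConfigSketch {

    static final String FORMCEPT_SERVICE_URL = "org.formcept.engine.enhancer.url";
    static final String DEFAULT_URL = "https://www.formcept.com/analyze";

    // Returns the configured URL, or the default when the key is missing or blank.
    static String readServiceUrl(Dictionary<String, Object> properties) {
        Object value = properties.get(FORMCEPT_SERVICE_URL);
        if (value == null || value.toString().trim().isEmpty()) {
            return DEFAULT_URL;
        }
        return value.toString().trim();
    }

    public static void main(String[] args) {
        Dictionary<String, Object> properties = new Hashtable<>();
        properties.put(FORMCEPT_SERVICE_URL, "http://localhost:9000/analyze"); // hypothetical override
        System.out.println(readServiceUrl(properties));
    }
}
```

Treating a blank value like a missing one is a design choice of this sketch: it keeps an accidentally emptied field in the configuration form from breaking the engine.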

Step-3: Package and Deploy Enhancement Engine

Once you have implemented your enhancement engine, it is time to deploy it and test it using the Stanbol interface. The first step is to make sure that all the dependencies are resolved and declared in your Maven configuration. If they are, Maven will take care of packaging your engine.

To package the bundle, use- mvn clean install

Maven will package the required dependencies, generate the component descriptions for your enhancement bundle, and write the resulting JAR file to the target folder of your project.
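
Before deploying, it can be worth checking that the JAR really is an OSGi bundle with component descriptors. A small sketch that reads two standard OSGi manifest headers with the JDK's java.util.jar classes follows. The class name is made up, and the path to the JAR is whatever your build produced.

```java
import java.io.File;
import java.io.IOException;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

// Hypothetical helper for inspecting a built bundle; uses only JDK classes.
class BundleManifestSketch {

    // Returns the value of a main manifest header, or null if it is absent.
    static String header(File jar, String name) throws IOException {
        try (JarFile jarFile = new JarFile(jar)) {
            Manifest manifest = jarFile.getManifest();
            if (manifest == null) {
                return null;
            }
            return manifest.getMainAttributes().getValue(name);
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass the path of the JAR from your target folder as the first argument.
        File jar = new File(args[0]);
        System.out.println("Bundle-SymbolicName: " + header(jar, "Bundle-SymbolicName"));
        System.out.println("Service-Component: " + header(jar, "Service-Component"));
    }
}
```

If Bundle-SymbolicName comes back as null, the bundle plugin did not run; if Service-Component is null, the component descriptors were probably not generated, and the engine would not be registered after installation.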

Now, we just need to deploy the generated JAR file of our enhancement engine as a bundle. To do so, open the OSGi console provided by Apache Felix- http://localhost:8080/system/console/bundles

The Apache Felix console lists all the bundles, components, configurations and a bunch of other details about the OSGi services platform. Now click on the Install/Update button to install the enhancement engine that we have created.

You will see a dialog box as shown below. Choose your enhancement engine JAR file and check the Start Bundle checkbox. This will make sure that your bundle is started after deployment. Now click on Install or Update.

If your bundle has pulled in all the dependencies correctly and the required configuration files are present within your bundle, it will be deployed successfully and the status of your bundle will be shown as Active. For example, we have created a bundle by the name of FORMCEPT Enhancement Engine, so we will see the bundle listed as shown below-

Now, if you go back to Stanbol's home page, you will see your engine listed as a part of the Default Chain-

 

Write some text and click on Run Engines. You will see your enhancement in the result-

 

Remember... we added a label FORMCEPT and we see the same in the result.

We also added a property for the URL that we wanted to get as a configuration parameter. To change the property, go back to the Apache Felix console and click on the Configuration tab. Locate your enhancement engine and click on it. You will see the configuration parameters as shown below-

 

You can modify them and reload your bundle to pick up the new configuration parameter.

So, that is it. We have covered the basics of developing an enhancement engine for Apache Stanbol. For more details on enhancement engines and how-to methods, please read- http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/

If you are interested in using the code, please feel free to fork- https://github.com/formcept/formcept-enhancer

References