FORMCEPT features in NASSCOM Emerge 50

FORMCEPT has been recognized as one of the “NASSCOM Emerge 50” companies of India for 2013 in the Emerge Start-up category.

The NASSCOM Emerge 50 Awards annually recognize the top 50 highly innovative and agile emerging and start-up companies that are foraying into untapped territories and redefining the way IT can make a difference. In the start-up category, the award recognizes companies that are innovative and growing at a rapid pace within 3 years of existence.

About NASSCOM
NASSCOM is a global trade body with over 1200 members, of which over 250 are global companies from the US, UK, EU, Japan and China. With “Emerge 50”, NASSCOM supports upcoming companies by providing a platform to showcase their potential.

Posted in Award, FORMCEPT

FORMCEPT’s approach to Telematics

“Connected cars have the potential to dramatically reduce the 1.2 million traffic deaths that occur worldwide each year”
                                                   – The National Highway Traffic Safety Administration, June 2013

Since the first successful installation of a car radio in 1930, telecommunications technology has evolved significantly. Our vehicles now provide an enhanced and safer experience by embedding the latest technologies and gadgets. Rapidly evolving mobile technology is driving the next phase of innovation in vehicles and becoming an integral part of the system, providing benefits such as infotainment, assistance and navigation on the go.

With these business opportunities, the automotive industry is rising to the challenge, and it comes as no surprise that the connected car market is projected to be worth a staggering $422 billion by 2022.

Source: Machina Research, 2013

This is the age of smartphones, and we are slowly progressing towards “smart cars”, which will soon become an everyday term.

Posted in FORMCEPT, Infographics, Telematics

Retail – Edging ahead with Big Data

“We all have been hearing about Big Data for over three years. We’re at the point where the industry really needs some tangible examples and show what it means.”
                                                  –  Curt Hecht, Global Chief Revenue Officer, The Weather Company

The retail industry around the world has evolved significantly over the years. From corner stores to department stores, and now to supermarkets and hypermarkets, the industry has seen drastic transitions. One of the major game changers was the advent of the Internet, which revolutionized retail with the introduction of e-commerce.

With the launch of Amazon’s e-commerce services in 1995, many companies started using the Internet aggressively for commercial transactions. Since then, e-commerce has seen a meteoric rise. Online retail sales in the US are forecast to reach $327 billion by 2016, with 56% of the population shopping online.

The rise of e-commerce has led to a large number of online portals that leverage multiple channels, such as social media networks, to generate awareness about their products. In the process, the e-commerce industry has started contributing significantly to the unstructured data generated across the web, in addition to the traditional transactional data it has always generated. This phenomenon is often associated with the term “Big Data”. It was the Internet at the dawn of the 21st century, and it is Big Data today, that is turning heads and giving companies a cutting-edge advantage over their competitors.

With rising competition, several attempts are being made in the retail industry to make the process more efficient and convenient for both consumers and retailers. Most of these attempts adopt a data-driven approach that involves capturing and processing immense amounts of useful social media content in addition to in-store transactional data. This blog presents one such case study. The implementation was done on top of FORMCEPT’s MECBOT platform.

Posted in FORMCEPT, Retail

A Unified Big Data Analytics Platform

For decades, companies have been making business decisions based on the transactional data stored in relational databases. But with the availability of Big Data processing power, it is becoming more meaningful to include external data sources, such as social media, web logs and sensors, and to integrate them with internal data sources. When Big Data is distilled and analyzed in combination with traditional enterprise data, enterprises can develop a more thorough and insightful understanding of their business, leading to enhanced productivity, better competitive positioning and greater innovation, all of which can positively impact the bottom line. No wonder companies are rushing to capture, refine, analyze and experiment with all types of so-far-unexplored data sources, including the data generated from customer interactions and business operations.

The challenge lies in finding the perfect blend of tools and techniques to harness the underlying potential of Big Data. However, the business analysts and data scientists who understand Big Data are themselves facing tremendous pressure to work with multiple disparate systems to get an integrated view before they can actually focus on the business problems.

Posted in FORMCEPT

FORMCEPT transforms Recruitment using Big Data

More than 500 years ago, the first CV was prepared by Leonardo da Vinci [1]. Since then there has been a significant transition, from traditional paper resumes to electronic resumes and, of late, video resumes. Resumes have been moulded into different forms to stand out.

In this innovation-driven era, even with rapid technological advancements, human capital remains the most precious resource for any organization. Everything from recruiting the right candidate to training, developing and retaining the best employees has become increasingly cumbersome for HR managers across organizations. It has become clear that lack of visibility, rather than lack of talent, is the main hindrance for the recruitment industry, which has a global market size of $400 billion [2].

FORMCEPT has developed a first-of-its-kind product, HR Intent, that enables recruiters to optimize their hiring process. It helps recruiters find the ‘Best-Fit’ for a particular job description from a large pool of candidates. Candidates can also use the product to generate an eye-catching resume infographic that can be shared with recruiters and friends.

Posted in FORMCEPT, Infographics, News

Proud to be a part of TiE50 Finalists

On 15th March 2013, FORMCEPT was selected as one of 6 promising big data startups from India and had the opportunity to hold a discussion with Amr Awadallah, CTO of Cloudera, at a public event. Today, we would like to announce that we have made it to the list of TiE50 Finalists.

The TiE brand is now synonymous with entrepreneurship. It is well known for giving startups a stage on which to showcase themselves to the world, and for mentoring and guiding them during their initial days.

As one of the TiE50 finalists, we will be presenting the story behind FORMCEPT at TiEcon on 17th May, 2013 at the Santa Clara Convention Center, in the heart of Silicon Valley, CA. TiEcon is TiE’s flagship annual event, and this will be its 5th edition, held over two days, i.e. 17-18 May, 2013.

Since its inception in 2009, TiEcon has provided an ideal platform for some successful startups. It presents to the world new, disruptive, innovation-driven companies with huge potential that are ready to influence global business. The event thrives on competition, innovation and smart technology, but only 50 companies make it to the list of TiE50 winners, 10 in each of its 5 categories: (1) Software (2) Internet (3) Mobile (4) Energy (5) Life Sciences

Some of the winners and finalists who have been a part of TiE50 include-

   

And this year it will be FORMCEPT. See you guys at TiEcon!

Posted in FORMCEPT, News

Power of Ideas 2012

The last three months have been awesome for FORMCEPT. We were one of the finalists of Next Big Idea 2012, were covered in the Tech30 report at TechSparks, and were also selected among the top 75 ideas of the country. We will be spending our next few days at IIMA with all the other finalists. Power of Ideas 2012 has selected 75 ideas spanning more than 15 different domains and 24 cities of India. Here are some details.

Power of Ideas 2012

Posted in FORMCEPT, Infographics

Healthcare – Analyzing Medical Records

This blog describes a use case of analyzing medical records using Apache Stanbol. For more details, please read FORMCEPT’s proposal. In the previous blog post, we discussed the basics of creating an Enhancement Engine for Apache Stanbol. This blog drills down into the Enhancement Structure of Apache Stanbol and the various properties of a Metadata Graph that can be used to store the enhancements. The concepts are explained using a FORMCEPT Healthcare engine that can be used to annotate medical records.

Introduction

This section describes the concepts and terminologies used in the blog.

Content Item

A content item is the unit of content within Apache Stanbol. It contains the content as well as the entire Metadata graph of enhancements. You can read more about content items in the Apache Stanbol documentation (see References).

Enhancement Engines

Enhancement Engines enhance the content item. A content item is processed by one or more enhancement engines based on the selected enhancement chain. You can read more about enhancement engines in the Apache Stanbol documentation (see References).

Enhancement Structure

Enhancement structure defines the types and properties used in the Metadata graph of enhancements. The enhancement structure is based on RDF and OWL. A sample enhancement structure is shown below. This example has been taken from the Apache Stanbol wiki page. You can read more about enhancement structure in the Apache Stanbol documentation (see References).

Apache Stanbol - Enhancement Structure

(Credit: Apache Stanbol)

Medical Record and Enhancements

[Wikipedia] The terms medical record, health record, and medical chart are used somewhat interchangeably to describe the systematic documentation of a single patient’s medical history and care across time within one particular health care provider’s jurisdiction.

The healthcare enhancement engine considers a medical record as a content item. The knowledge base that is used by the enhancement engine is built on top of DBpedia 3.6 and specifically these domains-

  1. Drugs and Diseases
  2. Chemical Compounds
  3. Species
  4. Others, like- Health, Microbiology, Medical Diagnosis, Medicine, Perception and Biology

Implementation

This section describes the healthcare enhancement engine and how the enhancements are added to the Metadata graph for each Medical Record.

Healthcare Enhancement Chain

FORMCEPT Healthcare enhancement chain consists of-

  1. Tika Engine (Credit: Apache Stanbol)
  2. Language Identification Engine (Credit: Apache Stanbol)
  3. FORMCEPT Healthcare Engine

For research purposes, you can also take a look at the enhancement engine provided by the Apache Stanbol eHealth demo.

An Example

Let’s take an example of a typical medical record that describes the symptoms of a brain tumor-

Symptoms vary depending on the location of the tumour and may be slow in onset. Symptoms include headache, vomiting, nausea, dizziness, poor coordination, disturbance of vision, weakness affecting one side of the body, mental changes and fits. A person with any symptoms of brain disorder should seek medical advice.

Given a statement of symptoms and indications, as shown above, FORMCEPT Healthcare Engine tries to annotate all the entities of interest. For example, in the above statement, headache, vomiting, nausea, etc. are all entities of interest. The annotated entities are then picked up and shown by the Apache Stanbol enhancer user interface as shown below.

FORMCEPT Healthcare Engine

FORMCEPT Healthcare Engine Enhancements

FORMCEPT Healthcare Engine

FORMCEPT Healthcare Engine relies on an external FORMCEPT Spotter Service that spots the keywords in the specified content. The spotter service returns a JSON document that looks like-

{
  "spotID": "64b3b8f0-2b14-44fd-9f07-a4276e377d55",
  "createdOn": "Jul 11, 2012 2:55:26 PM",
  "spottedElements": [
    {
      "spottedWord": "headache",
      "startIndex": 97,
      "endIndex": 105,
      "elements": [
        {
          "uri": "http://dbpedia.org/resource/Headache",
          "dataset": [
            "DBPEDIA_RESOURCE"
          ],
          "label": "Headache",
          "alsoKnownAs": [
            "Headache and Migraine",
            "Hofudverkur",
            "Headech",
            "Headaches",
            "Headache (medical)",
            "Headache disorders",
            "Head pain",
            "Headache syndromes",
            "Encephalalgia",
            "Toxic headache",
            "Hoefudverkur",
            "Höfuðverkur",
            "Head Aches",
            "Headach",
            "Head ache",
            "Chronic headache",
            "Cephalgia"
          ],
          "description": "A headache or cephalgia is pain anywhere in the region of the head or neck. It can be a symptom of a number of different conditions of the head and neck. The brain tissue itself is not sensitive to pain because it lacks pain receptors. Rather, the pain is caused by disturbance of the pain-sensitive structures around the brain.",
          "context": "Health->Diseases and disorders->Neurological disorders->Headache",
          "broaderCategs": [
            "Health",
            "Diseases and disorders",
            "Neurological disorders"
          ],
          "language": "en",
          "timestamp": "Feb 17, 2012 10:56:00 PM"
        }
      ]
    }
    ...
  ]
}

The spotting result contains the spotted elements that were found in the specified content. Each spotted element has these features-

  1. Spotted Word: Word as it appears in the specified content
  2. Start and End Index: Word boundaries within the specified content. This is useful to locate the word within the specified content
  3. Elements: One or more elements in the knowledge base that define the spotted word
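
As a minimal sketch of how the start and end index can be used to locate a spotted word, the following Java snippet assumes that the indices are zero-based character offsets with an exclusive end. That assumption is consistent with the headache example above (97 to 105), but it is not confirmed by the spotter service itself. The class and method names are hypothetical and are not part of the FORMCEPT Spotter Service.

```java
// Hypothetical helper, not part of the FORMCEPT Spotter Service.
final class SpottedWordLocator {

    private SpottedWordLocator() {
    }

    // Returns the text between startIndex (inclusive) and endIndex (exclusive).
    // Assumption: both indices are zero-based character offsets into the content.
    static String locate(String content, int startIndex, int endIndex) {
        if (startIndex < 0 || endIndex > content.length() || startIndex > endIndex) {
            throw new IllegalArgumentException(
                    "indices out of range: " + startIndex + ", " + endIndex);
        }
        return content.substring(startIndex, endIndex);
    }

    // Checks that the spotted word matches the content at the reported
    // boundaries, ignoring case.
    static boolean matches(String content, String spottedWord, int startIndex, int endIndex) {
        return locate(content, startIndex, endIndex).equalsIgnoreCase(spottedWord);
    }
}
```

Applied to the sample medical record, SpottedWordLocator.locate(content, 97, 105) returns the word headache, so a consumer can use such a check to detect a spotter result that has drifted out of sync with the content.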

Each Knowledge Base Element has these features-

  1. URI of the Element
  2. Dataset associated with the Element  (For example, DBpedia)
  3. Standard label for the Element as defined in the Knowledge Base
  4. alsoKnownAs: Other labels by which the Element can be referred to
  5. Short description of the Element
  6. Hierarchical Context of the Element
  7. Broader Categories for the Element
  8. Language of the Element in which the label and other details are specified
  9. Timestamp of the last update

FORMCEPT Healthcare Enhancer converts these features into the Metadata Graph. The computeEnhancements method contains all the implementation to update the Metadata Graph with the above features of the Knowledge Base Elements.

As a first step, a text enhancement is created within the Metadata Graph for the spotted word. Since each spotted word can have one or more associated elements within the Knowledge Base, only the first element (closest match) defines the type of the enhancement. The type is derived from a class defined within the ontology of the dataset. If a type is not found, then the first category of the context is considered as the type. Here is a code snippet to create the text enhancement-

// get the literal factory
LiteralFactory literalFactory = LiteralFactory.getInstance();
// get the metadata graph
MGraph metadata = ci.getMetadata();
// text annotation
UriRef elemAnnotation = EnhancementEngineHelper.createTextEnhancement(ci, this);
// add metadata
metadata.add(new TripleImpl(elemAnnotation, ENHANCER_SELECTED_TEXT, new PlainLiteralImpl(elem.getSpottedWord())));
// add the type
metadata.add(new TripleImpl(elemAnnotation, DC_TYPE, new UriRef(elemType)));
// set the description
metadata.add(new TripleImpl(elemAnnotation, RDFS_COMMENT, new PlainLiteralImpl(knowledgeElem.getDescription())));
// set the context as broader categories
metadata.add(new TripleImpl(elemAnnotation, SKOS_BROADER, new PlainLiteralImpl(knowledgeElem.getContext())));
// set the start and end index of the spotted word within the content
metadata.add(new TripleImpl(elemAnnotation, ENHANCER_START, literalFactory.createTypedLiteral(elem.getStartIndex())));
metadata.add(new TripleImpl(elemAnnotation, ENHANCER_END, literalFactory.createTypedLiteral(elem.getEndIndex())));

The spotted word is added as a selected text along with other metadata, like- description, context and the start/end fields.
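
The type selection described earlier (use the ontology class if one is found, otherwise fall back to the first category of the context) could be sketched as follows. This is only an illustration under stated assumptions: the class name, the method signature, the "->" separator handling and the use of a namespace prefix such as http://dbpedia.org/ontology/ are hypothetical, and the actual FORMCEPT implementation may differ.

```java
// Hypothetical sketch of the type fallback; not the FORMCEPT implementation.
final class EnhancementTypeResolver {

    // Assumption: context categories are separated by "->", as in the spotter output.
    private static final String SEPARATOR = "->";

    private EnhancementTypeResolver() {
    }

    // Returns the ontology type when present; otherwise builds a type URI
    // from the first category of the hierarchical context.
    static String resolveType(String ontologyType, String context, String namespace) {
        if (ontologyType != null && !ontologyType.trim().isEmpty()) {
            return ontologyType;
        }
        if (context == null || context.trim().isEmpty()) {
            throw new IllegalArgumentException(
                    "neither an ontology type nor a context is available");
        }
        String firstCategory = context.split(SEPARATOR)[0].trim();
        // Assumption: spaces are replaced with underscores to form a usable URI.
        return namespace + firstCategory.replace(' ', '_');
    }
}
```

For the headache element, which has the context Health->Diseases and disorders->Neurological disorders->Headache and no ontology type, this sketch yields http://dbpedia.org/ontology/Health when called with the http://dbpedia.org/ontology/ namespace.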

Once the text enhancement is created, all the knowledge elements are added as an Entity Enhancement that refers to the text enhancement created for the spotted word. Here is a code snippet to create the entity enhancements for the spotted word-

// add other entities
for(FCKnowledgeElement entityElem : spottedElems){
    // add a topic enhancement
    UriRef enhancement = EnhancementEngineHelper.createEntityEnhancement(ci, this);
    metadata.add(new TripleImpl(enhancement, RDF_TYPE, TechnicalClasses.ENHANCER_TOPICANNOTATION));
    metadata.add(new TripleImpl(enhancement, org.apache.stanbol.enhancer.servicesapi.rdf.Properties.DC_RELATION, elemAnnotation));
    // add link to entity
    metadata.add(new TripleImpl(enhancement, ENHANCER_ENTITY_REFERENCE, new UriRef(entityElem.getUri())));
    metadata.add(new TripleImpl(enhancement, ENHANCER_ENTITY_TYPE, OntologicalClasses.SKOS_CONCEPT));
    metadata.add(new TripleImpl(enhancement, ENHANCER_CONFIDENCE, literalFactory.createTypedLiteral(entityElem.getConfidence())));
    metadata.add(new TripleImpl(enhancement, ENHANCER_ENTITY_LABEL, new PlainLiteralImpl(entityElem.getLabel())));
    metadata.add(new TripleImpl(enhancement, RDFS_COMMENT, new PlainLiteralImpl(entityElem.getDescription())));
    metadata.add(new TripleImpl(enhancement, SKOS_BROADER, new PlainLiteralImpl(entityElem.getContext())));
}

Each entity enhancement is linked with the text enhancement using the DC_RELATION property. Stanbol User Interface groups all the related entities and links them to the external URI specified by the ENHANCER_ENTITY_REFERENCE property.

Results

A typical enhancement result generated by FORMCEPT Healthcare Engine looks like-

{
  "@context": {
    "broader": "http://www.w3.org/2004/02/skos/core#broader",
    "comment": "http://www.w3.org/2000/01/rdf-schema#comment",
    "Concept": "http://www.w3.org/2004/02/skos/core#Concept",
    "confidence": "http://fise.iks-project.eu/ontology/confidence",
    "created": "http://purl.org/dc/terms/created",
    "creator": "http://purl.org/dc/terms/creator",
    "end": "http://fise.iks-project.eu/ontology/end",
    "Enhancement": "http://fise.iks-project.eu/ontology/Enhancement",
    "entity-label": "http://fise.iks-project.eu/ontology/entity-label",
    "entity-reference": "http://fise.iks-project.eu/ontology/entity-reference",
    "entity-type": "http://fise.iks-project.eu/ontology/entity-type",
    "EntityAnnotation": "http://fise.iks-project.eu/ontology/EntityAnnotation",
    "extracted-from": "http://fise.iks-project.eu/ontology/extracted-from",
    "Health": "http://dbpedia.org/ontology/Health",
    "language": "http://purl.org/dc/terms/language",
    "LinguisticSystem": "http://purl.org/dc/terms/LinguisticSystem",
    "relation": "http://purl.org/dc/terms/relation",
    "selected-text": "http://fise.iks-project.eu/ontology/selected-text",
    "start": "http://fise.iks-project.eu/ontology/start",
    "TextAnnotation": "http://fise.iks-project.eu/ontology/TextAnnotation",
    "TopicAnnotation": "http://fise.iks-project.eu/ontology/TopicAnnotation",
    "type": "http://purl.org/dc/terms/type",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "@coerce": {
      "@iri": [
        "entity-reference",
        "entity-type",
        "extracted-from",
        "relation",
        "type"
      ],
      "xsd:dateTime": "created",
      "xsd:double": "confidence",
      "xsd:int": [
        "end",
        "start"
      ],
      "xsd:string": "creator"
    }
  },
  "@subject": [
    {
      "@subject": "urn:enhancement-bf848e5d-decf-49ae-eaed-248e9e476d29",
      "@type": [
        "Enhancement",
        "TextAnnotation"
      ],
      "created": "2012-07-11T10:54:53.141Z",
      "creator": "org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine",
      "extracted-from": "urn:content-item-sha1-ea34dfcefbb6b4e10c5e1d70953708aa65e7dd69",
      "language": "en",
      "type": "LinguisticSystem"
    },
    {
      "@subject": "urn:enhancement-fd5f3241-f1ea-5b02-01b2-30986c2b7f90",
      "@type": [
        "Enhancement",
        "TextAnnotation"
      ],
      "broader": "Health->Diseases and disorders->Neurological disorders->Headache",
      "comment": "A headache or cephalgia is pain anywhere in the region of the head or neck. It can be a symptom of a number of different conditions of the head and neck. The brain tissue itself is not sensitive to pain because it lacks pain receptors. Rather, the pain is caused by disturbance of the pain-sensitive structures around the brain.",
      "created": "2012-07-11T10:54:53.186Z",
      "creator": "org.formcept.engine.enhancer.FCHealthCareEnhancer",
      "end": 105,
      "extracted-from": "urn:content-item-sha1-ea34dfcefbb6b4e10c5e1d70953708aa65e7dd69",
      "selected-text": "headache",
      "start": 97,
      "type": "Health"
    },
    {
      "@subject": "urn:enhancement-fdc05670-18b8-1703-0210-86e2d12fa36b",
      "@type": [
        "Enhancement",
        "EntityAnnotation",
        "TopicAnnotation"
      ],
      "broader": "Health->Diseases and disorders->Neurological disorders->Headache",
      "comment": "A headache or cephalgia is pain anywhere in the region of the head or neck. It can be a symptom of a number of different conditions of the head and neck. The brain tissue itself is not sensitive to pain because it lacks pain receptors. Rather, the pain is caused by disturbance of the pain-sensitive structures around the brain.",
      "confidence": 1.0,
      "created": "2012-07-11T10:54:53.186Z",
      "creator": "org.formcept.engine.enhancer.FCHealthCareEnhancer",
      "entity-label": "Headache",
      "entity-reference": "http://dbpedia.org/resource/Headache",
      "entity-type": "Concept",
      "extracted-from": "urn:content-item-sha1-ea34dfcefbb6b4e10c5e1d70953708aa65e7dd69",
      "relation": "urn:enhancement-fd5f3241-f1ea-5b02-01b2-30986c2b7f90"
    }
  ]
}

Stanbol User Interface groups the linked entities as shown below. The type is shown as a heading, i.e. Health in this case. The spotted word, i.e. headache in this case, is shown under Mentions.

FORMCEPT Healthcare Enhancer – Headache-Example

Evaluation

The results shown below were obtained by running the tests against the CALBC Corpora. As of now, we have tested against 1,725 test cases (42,368 words). We will continue to add more test cases from different medical classes (types) and report the performance.

FORMCEPT Healthcare Engine – Evaluation Report

Credit: The result shown above has been generated using sgvizler tool that connects to the SPARQL endpoint provided by Fuseki server. The performance report was generated by FORMCEPT Benchmarking tool that uses EARL schema. The concept of visualizing through a SPARQL visualizer has been adopted from Rupert’s and Pablo’s comment on the improvement request [STANBOL-652] for Apache Stanbol Benchmark Tool. Thanks to both of them.

Benchmark

Type                 TP    FP   FN   Precision  Recall  F1
Disease              1826  132  881  0.9326     0.6745  0.7829
Disease, Drugs       1847  140  876  0.9295     0.6783  0.7843
Disease, Drugs, CC   1961  249  848  0.8873     0.6981  0.7814

CC: Chemical Compound, TP: True Positives, FP: False Positives, FN: False Negatives, F1: F-Measure/F-Score
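
For readers who want to reproduce the Precision, Recall and F1 columns from the TP, FP and FN counts, here is a minimal Java sketch using the standard definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), F1 = harmonic mean of the two). The class name is hypothetical, and this is not the FORMCEPT Benchmarking tool, only an illustration of the arithmetic.

```java
// Hypothetical helper that recomputes the benchmark columns from raw counts.
final class BenchmarkMetrics {

    private BenchmarkMetrics() {
    }

    // Fraction of reported annotations that were correct.
    static double precision(int truePositives, int falsePositives) {
        int reported = truePositives + falsePositives;
        return reported == 0 ? 0.0 : (double) truePositives / reported;
    }

    // Fraction of expected annotations that were found.
    static double recall(int truePositives, int falseNegatives) {
        int expected = truePositives + falseNegatives;
        return expected == 0 ? 0.0 : (double) truePositives / expected;
    }

    // Harmonic mean of precision and recall.
    static double f1(int truePositives, int falsePositives, int falseNegatives) {
        double p = precision(truePositives, falsePositives);
        double r = recall(truePositives, falseNegatives);
        return (p + r) == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }
}
```

With the Disease row (TP 1826, FP 132, FN 881), these methods give approximately 0.9326, 0.6745 and 0.7829, matching the values reported in the table.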

FORMCEPT Healthcare Benchmark

FORMCEPT Healthcare Benchmark

Performance

The table given below shows the time taken by the spotting algorithm to spot the annotations out of 42,368 words present in the 1,725 test cases. The table also lists the number of entities present in the Knowledge Base out of which the annotations are identified.

Type       Entities  E1 (sec)  E2 (sec)  E3 (sec)  Avg (sec)  Min (sec)
Disease    5156      0.125     0.096     0.083     0.101      0.083
DiDr       9814      0.155     0.135     0.125     0.138      0.125
DiDrCC     16487     0.209     0.195     0.182     0.195      0.182
DiDrCCSp   185020    0.289     0.284     0.214     0.262      0.214
TypeCateg  221572    0.428     0.424     0.418     0.423      0.418

E1, E2 and E3 represent the independent execution times of the test cases
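
The Avg and Min columns can be derived from the three independent runs. The following Java sketch shows that arithmetic; the class name is hypothetical and the snippet is only an illustration, not the code used to produce the report.

```java
// Hypothetical helper that summarizes independent execution times.
final class TimingSummary {

    private TimingSummary() {
    }

    // Arithmetic mean of the given run times (in seconds).
    static double average(double... runs) {
        if (runs.length == 0) {
            throw new IllegalArgumentException("at least one run is required");
        }
        double sum = 0.0;
        for (double run : runs) {
            sum += run;
        }
        return sum / runs.length;
    }

    // Fastest of the given run times (in seconds).
    static double min(double... runs) {
        if (runs.length == 0) {
            throw new IllegalArgumentException("at least one run is required");
        }
        double best = runs[0];
        for (double run : runs) {
            best = Math.min(best, run);
        }
        return best;
    }
}
```

For the Disease row, the average of 0.125, 0.096 and 0.083 is roughly 0.101 seconds and the minimum is 0.083 seconds, in line with the table.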

FORMCEPT Healthcare Engine Performance

FORMCEPT Spotter builds an in-memory model of the entities to annotate the content. The table given below mentions the amount of memory consumed and the time taken to build the in-memory model.

Type       Entities  Processor        Memory   Time (sec)
Disease    5156      i5-2400 3.10GHz  23 MB    1.081
DiDr       9814      i5-2400 3.10GHz  35 MB    1.380
DiDrCC     16487     i5-2400 3.10GHz  182 MB   2.152
DiDrCCSp   185020    i5-2400 3.10GHz  1.14 GB  11.821
TypeCateg  221572    i5-2400 3.10GHz  1.39 GB  19.880

Entities: Total number of entities present in the Knowledge Base
DiDr: Disease and Drugs, DiDrCC: Disease, Drugs and Chemical Compound,
DiDrCCSp: Disease, Drugs, Chemical Compound and Species,
TypeCateg includes Disease, Drugs, Chemical Compound, Species, Health, Microbiology, Biology, Perception, Medical diagnosis and Medicine

Discussion

  • The results show a high number of false negatives for each type. Here are the reasons for the high number of false negatives-
  1. C2/C3/C4/C6/C6D/C7/C9 deficiency, hematopoiesis – were not marked as diseases within the Knowledge Base
  2. Close to 70% of the false negatives consisted of abbreviations, like- BMD, DMD, MJD, FAP, ALD, AMN, CL/P, PWS, VWS, CP, UPD14, DM, FAP, RCCs, MHP, WAS, CTX, DRD, HPD, VHL, HD, AS, AGU, MPS VII, FRDA, ASPA, etc. Full forms for these abbreviations have already been captured and some of the abbreviations were ambiguous
  3. Annotations, like- deficiency of norrin, deficiency of the enzyme, abnormal growth of lymphocytes are phrases rather than exact terms
  4. Annotations, like- apoptosis, lesions, Enlarged vestibular aqueduct were not included within the Knowledge Base
  • There were a few more annotations that were not captured by the Knowledge Base. Some of them were- hyperalphalipoproteinemia, CETP deficiency, anhaptoglobinemia, Atm-deficient, cardiac abnormality, diurnal fluctuation, Peter’s anomaly, platyspondyly, Axenfeldt anomaly, neurogenetic disorder, fibular overgrowth, facial lesions, Chronic neisserial infection, hemochromatosis, skin pigmentation, lymphoid malignancy, hitch hiker thumb, GPI-anchor deficiency, attenuated polyposis, iminodipeptiduria, hair-follicle morphogenesis, pseudoglioma, hyperalphalipoproteinemia, hyperphenylalaninemic, Morphological abnormalities, cPNETs, Duarte 2 and microvesicular steatosis
  • The number of true positives increased with the addition of types, but so did the number of false positives. The false positives were found to be drug names and chemical compounds that were not annotated in the Corpus.
  • Memory requirements can be further reduced by keeping only the spotted element IDs in memory.

We will continue to add more datasets to reduce the number of false negatives and incorporate the missing annotations.

References

  1. http://incubator.apache.org/stanbol/docs/trunk/enhancer/
  2. http://incubator.apache.org/stanbol/docs/trunk/enhancer/enhancementstructure.html
  3. http://incubator.apache.org/stanbol/docs/trunk/enhancementusage.html
  4. http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/
  5. http://www.ebi.ac.uk/Rebholz-srv/CALBC/corpora/resources.html
Posted in Benchmark, FORMCEPT, Healthcare, Performance, Stanbol

Apache Stanbol

FORMCEPT


Note: This blog conforms to Stanbol 0.10.0-incubating SNAPSHOT release.

We at FORMCEPT are really excited to be an early adopter of Apache Stanbol. We are working on its integration with our Big Data Analysis Stack. While working with Apache Stanbol, we created multiple enhancement engines and would like to share our experience of building an enhancement engine.

Overview

Apache Stanbol is built as a modular set of components. Each component is accessible via its own RESTful web interface. From this viewpoint, all Apache Stanbol features can be used via RESTful service calls. The components are implemented as OSGi components based on Apache Felix.

Apache Stanbol Components

  1. Enhancer and Enhancement Engines: Annotate possible entities and link them to public or private entity repositories.
  2. Entityhub: Caches and manages entities stored in local indexes of linked-data repositories including entities specific to a particular domain.
  3. Contenthub: Provides persistent document store on top of Apache Solr. It enables semantic indexing facilities and semantic search including faceted search capability on the documents. Endpoint- http://localhost:8080/contenthub
  4. CMS Adapter: Acts as a bridge between JCR/CMIS compliant content management systems and Apache Stanbol.
  5. Ontology Manager: Manages ontologies that are used to define the knowledge models that describe the metadata of content.
  6. Reasoners: Used to automatically infer additional knowledge.
  7. Rules: Provides the means to re-factor knowledge graphs.
  8. FactStore: Stores relations between entities identified by their URIs. The relation between the two entities is called a fact.

 Apache Stanbol Component Layer (Credit: Rupert)

Getting Started

  • Check out the latest source code from the Apache Stanbol repository- svn co http://svn.apache.org/repos/asf/incubator/stanbol/trunk stanbol
  • Make sure you have at least Java 6 and Maven 2.2.1
  • Set Maven parameters: export MAVEN_OPTS="-Xmx512M -XX:MaxPermSize=128M"
  • Compile using the command: mvn clean install (To skip tests, add -DskipTests)
  • Sit back and relax while it compiles and sets everything up

If your build fails with this error-
Reason: Cannot find parent: com.sun.jersey:jersey-project for project: null:jersey-server:jar:null for project null:jersey-server:jar:null

Clean up your existing jersey repository and compile again. To clean up, just remove the directory: rm -rf ~/.m2/repository/com/sun/jersey/

If it doesn’t work, make sure that you are using Maven 3. If not, install it and try to compile Stanbol again. If you are using Ubuntu, here are a few steps to install Maven 3- http://askubuntu.com/questions/49557/how-do-i-install-maven-3

Once the build goes through, switch to the launchers directory and launch Stanbol with the command: java -Xmx1g -jar full/target/org.apache.stanbol.launchers.full-{snapshot-version}-SNAPSHOT.jar

Open URL http://localhost:8080 and try it out. If everything went fine, you will see Stanbol’s default landing page-

Welcome Stanbol

Go ahead and play around with it!

Developers

To set up an Eclipse project (recommended only for development), run: mvn eclipse:eclipse

Once the build is successful, import the entire Stanbol directory into a workspace. That is it, you are all set. The rest of this blog explains Stanbol components in detail.

Detailed Description

This section describes some of the components of Stanbol in detail. For more information, I encourage you to take a look at Stanbol’s documentation.

ContentHub

For the impatient, there is a quick 5 minute tutorial available here- http://incubator.apache.org/stanbol/docs/trunk/contenthub/contenthub5min

Contenthub is a document repository that stores content items. A content item consists of the document’s metadata in addition to its text-based content. Contenthub has two main subcomponents-

  • Store – Responsible for persistent storage of content items
  • Search – Provides semantic search facilities for the content items

Contenthub uses Apache Solr for storage, indexing and retrieval of content items and LD Path for managing them. Try- http://localhost:8080/contenthub/ldpath to submit or retrieve LD Path programs. Contenthub provides three search interfaces-

  • SolrSearch: Native Solr Search Interface. Results are returned in “org.apache.solr.client.solrj.response.QueryResponse” format.
  • RelatedKeywordSearch: Also finds other related keywords from several sources. Wordnet, domain ontologies and referenced sites are the data sources for these services to retrieve the related keywords.
  • FeaturedSearch: Combines the services of SolrSearch and RelatedKeywordSearch

Creating an Enhancement Engine

The first thing that you will want to do after playing around with Stanbol is to write your own enhancement engine. I recommend first exploring the org.apache.stanbol.enhancer.engines package for existing engines. Start with the language identification engine and then move on to the Tika engine, the NER engine, etc. The implementations are self-describing and will give you a kick-start for your own enhancement engine.

Let’s try creating a simple enhancement engine that adds a new label as an enhancement. Don’t worry about whether the label is correct or not. Just try to follow the flow to understand the overall development process.

Step-0: Create a Maven project

Create a new Maven project for your enhancement engine and add these dependencies-

<dependencies>
 <dependency>
  <groupId>org.apache.stanbol</groupId>
  <artifactId>org.apache.stanbol.enhancer.servicesapi</artifactId>
  <version>0.10.0-incubating-SNAPSHOT</version>
 </dependency>
 <dependency>
  <groupId>org.apache.felix</groupId>
  <artifactId>org.apache.felix.scr.annotations</artifactId>
  <version>1.6.0</version>
 </dependency>
</dependencies>

You will also need a few plugins to build the bundle-

<build>
  <plugins>
    <plugin>
      <groupId>org.apache.felix</groupId>
      <artifactId>maven-bundle-plugin</artifactId>
      <extensions>true</extensions>
      <configuration>
        <instructions>
          <!-- Enable this for including your --> 
          <!-- enhancement chain configuration -->
          <!-- <Install-Path>config</Install-Path> -->
          <Export-Package>
            org.formcept.engine.enhancer.*;version=${project.version}
          </Export-Package>
          <Embed-Dependency>
          </Embed-Dependency>
        </instructions>
      </configuration>
    </plugin>
    <plugin>
      <groupId>org.apache.felix</groupId>
      <artifactId>maven-scr-plugin</artifactId>
      <executions>
        <execution>
          <id>generate-scr-scrdescriptor</id>
          <goals>
            <goal>scr</goal>
          </goals>
          <configuration>
            <properties>
              <service.vendor>FORMCEPT</service.vendor>
            </properties>
          </configuration>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>

If you want to enable your own enhancement chain or modify an existing chain, you need to include the Install-Path directive and specify the configuration in the config folder under resources of your project. Don’t worry about configuration for now; the Default Chain will pick up your enhancement engine. For more details, take a look at Stanbol’s documentation on List Chain- http://incubator.apache.org/stanbol/docs/trunk/enhancer/chains/listchain.html

You will also need a few plugin management directives-

<pluginManagement>
  <plugins>
    <plugin>
      <groupId>org.apache.felix</groupId>
      <artifactId>maven-bundle-plugin</artifactId>
      <version>2.3.7</version>
      <inherited>true</inherited>
      <configuration>
        <instructions>
          <Bundle-DocURL>http://www.formcept.com</Bundle-DocURL>
          <Bundle-Vendor>FORMCEPT</Bundle-Vendor>
          <Bundle-SymbolicName>${project.artifactId}</Bundle-SymbolicName>
          <_versionpolicy>$${version;===;${@}}</_versionpolicy>
        </instructions>
      </configuration>
    </plugin>
    <plugin>
      <groupId>org.apache.felix</groupId>
      <artifactId>maven-scr-plugin</artifactId>
      <version>1.7.4</version>
      <executions>
        <execution>
          <id>generate-scr-scrdescriptor</id>
          <goals>
            <goal>scr</goal>
          </goals>
          <configuration>
            <properties>
              <service.vendor>FORMCEPT</service.vendor>
            </properties>
          </configuration>
        </execution>
      </executions>
    </plugin>
    <!-- This prevents m2e error in Eclipse. -->
    <!-- Does not affect the build -->
    <plugin>
      <groupId>org.eclipse.m2e</groupId>
      <artifactId>lifecycle-mapping</artifactId>
      <version>1.0.0</version>
      <configuration>
        <lifecycleMappingMetadata>
          <pluginExecution>
            <pluginExecutionFilter>
              <groupId>org.apache.felix</groupId>
              <artifactId>maven-scr-plugin</artifactId>
              <versionRange>[1.4.4,)</versionRange>
              <goals>
                <goal>scr</goal>
              </goals>
            </pluginExecutionFilter>
            <action>
              <ignore />
            </action>
          </pluginExecution>
        </lifecycleMappingMetadata>
      </configuration>
    </plugin>
  </plugins>
</pluginManagement>

That’s it for the configuration. You can now easily package your enhancement engine with Maven.

Step-1: Create a new class for your Enhancement Engine

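import java.util.Map;

import org.apache.felix.scr.annotations.Component;
import org.apache.felix.scr.annotations.Properties;
import org.apache.felix.scr.annotations.Property;
import org.apache.felix.scr.annotations.Service;
import org.apache.stanbol.enhancer.servicesapi.ContentItem;
import org.apache.stanbol.enhancer.servicesapi.EngineException;
import org.apache.stanbol.enhancer.servicesapi.EnhancementEngine;
import org.apache.stanbol.enhancer.servicesapi.ServiceProperties;
import org.apache.stanbol.enhancer.servicesapi.impl.AbstractEnhancementEngine;

@Component(immediate = true, metatype = true, inherit=true)
@Service
@Properties(value={
    @Property(name=EnhancementEngine.PROPERTY_NAME,value="formcept-disambiguator")
})
public class FCDEnhancer extends AbstractEnhancementEngine
    implements EnhancementEngine, ServiceProperties {

    public Map getServiceProperties() {
        // placeholder: implemented in Step-2
        return null;
    }

    public int canEnhance(ContentItem ci) throws EngineException {
        // placeholder: implemented in Step-2
        return 0;
    }

    public void computeEnhancements(ContentItem ci) throws EngineException {
        // placeholder: implemented in Step-2

    }

}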

Each enhancement engine is an OSGi component. Stanbol uses Apache Felix as the service platform for OSGi components. You need to add a few annotations to your class for Apache Felix. The specified annotations direct Apache Felix to treat this class as a component, activate it immediately, generate the Metatype service data and inherit the service, property and reference declarations from the base class.

We also define a property EnhancementEngine.PROPERTY_NAME that contains the name of the component. You will see the same name in the Stanbol user interface for your component. For more details on the annotations and service declarations, please take a look at the Apache Felix project.

Your enhancement engine should also extend- org.apache.stanbol.enhancer.servicesapi.impl.AbstractEnhancementEngine
and implement- org.apache.stanbol.enhancer.servicesapi.EnhancementEngine and org.apache.stanbol.enhancer.servicesapi.ServiceProperties

The extended class and implemented interfaces give you all the power to play around with Stanbol’s content item and add enhancements to it. They are the entry points to the Stanbol architecture for your enhancement engine.

Step-2: Implement the required methods

The methods to implement are canEnhance(ContentItem), computeEnhancements(ContentItem) and getServiceProperties.

canEnhance(ContentItem)

public int canEnhance(ContentItem ci) throws EngineException {
     // check if content is present
     try {
         // read the text once instead of reading the blob twice
         String text = ContentItemHelper.getText(ci.getBlob());
         if (text == null || text.trim().isEmpty()) {
             return CANNOT_ENHANCE;
         }
     } catch (IOException e) {
         LOG.error("Failed to get the text for " +
          "enhancement of content: " + ci.getUri(), e);
         throw new InvalidContentException(this, ci, e);
     }
     // default enhancement is synchronous enhancement
     return ENHANCE_SYNCHRONOUS;
}

Stanbol provides the org.apache.stanbol.enhancer.servicesapi.helper.ContentItemHelper class, which contains useful methods for working with a ContentItem.

getServiceProperties

public Map getServiceProperties() {
   return Collections.unmodifiableMap(Collections.singletonMap(
       ENHANCEMENT_ENGINE_ORDERING, (Object) defaultOrder));
}

In the getServiceProperties implementation we provide the ordering for our enhancement engine. For more details on the ordering, please take a look at- http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/
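To get a feel for what the ordering value does, here is a small stand-alone sketch. It does not use Stanbol at all: the Engine class, the engine names other than formcept-disambiguator and the numeric values are made up for illustration, and it assumes that engines with higher ordering values run earlier, which is how I read the documentation linked above. Please verify the actual constants there before relying on them.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class EngineOrderingSketch {

    /** Hypothetical stand-in for an engine and its ordering property. */
    static final class Engine {
        final String name;
        final int order;

        Engine(String name, int order) {
            this.name = name;
            this.order = order;
        }
    }

    /** Returns the engine names sorted so that higher ordering values come first. */
    static List<String> executionOrder(List<Engine> engines) {
        List<Engine> sorted = new ArrayList<>(engines);
        // Assumption: a higher ordering value means the engine runs earlier.
        sorted.sort(Comparator.comparingInt((Engine e) -> e.order).reversed());
        List<String> names = new ArrayList<>();
        for (Engine e : sorted) {
            names.add(e.name);
        }
        return names;
    }

    public static void main(String[] args) {
        List<Engine> engines = new ArrayList<>();
        engines.add(new Engine("formcept-disambiguator", 0)); // illustrative values
        engines.add(new Engine("language-detection", 200));
        engines.add(new Engine("post-processor", -100));
        System.out.println(executionOrder(engines));
    }
}
```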

computeEnhancements(ContentItem)

public void computeEnhancements(ContentItem ci) throws EngineException {
    // write results (requires a write lock)
    // not required as we are enhancing synchronously
    //ci.getLock().writeLock().lock();
    try {
        // get the metadata graph
        MGraph metadata = ci.getMetadata();
        // update some sample data
        UriRef textAnnotation = EnhancementEngineHelper.createTextEnhancement(ci, this);
        metadata.add(new TripleImpl(textAnnotation, ENHANCER_ENTITY_LABEL, 
             new PlainLiteralImpl("FORMCEPT")));
    } finally {
        //ci.getLock().writeLock().unlock();
    } 
}

Here we add a label named FORMCEPT to the content item’s metadata, which is stored as an MGraph. To know more about MGraph, please read about the Apache Clerezza project. I would like to restate that you should not add labels arbitrarily. This is just to understand the overall flow of development.
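The overall flow of the two methods can be sketched without any Stanbol dependency. Everything in this sketch is a simplified stand-in: the Engine interface, the constant values and the list of strings used as "metadata" are hypothetical and only mirror the shape of the real API, where the metadata is an RDF graph. The point is simply that computeEnhancements is only reached when canEnhance accepts the content.

```java
import java.util.ArrayList;
import java.util.List;

final class EnhancementFlowSketch {

    // Illustrative values only; use the constants from EnhancementEngine in real code.
    static final int CANNOT_ENHANCE = 0;
    static final int ENHANCE_SYNCHRONOUS = 1;

    /** Hypothetical, simplified stand-in for an enhancement engine. */
    interface Engine {
        int canEnhance(String text);

        void computeEnhancements(String text, List<String> metadata);
    }

    /** Mirrors the sample engine: rejects empty text, otherwise adds one label. */
    static final class LabelEngine implements Engine {
        public int canEnhance(String text) {
            return (text == null || text.trim().isEmpty())
                    ? CANNOT_ENHANCE
                    : ENHANCE_SYNCHRONOUS;
        }

        public void computeEnhancements(String text, List<String> metadata) {
            metadata.add("entity-label=FORMCEPT");
        }
    }

    /** Calls computeEnhancements only when the engine accepts the content. */
    static List<String> enhance(Engine engine, String text) {
        List<String> metadata = new ArrayList<>();
        if (engine.canEnhance(text) != CANNOT_ENHANCE) {
            engine.computeEnhancements(text, metadata);
        }
        return metadata;
    }

    public static void main(String[] args) {
        System.out.println(enhance(new LabelEngine(), "Some text"));
        System.out.println(enhance(new LabelEngine(), "   "));
    }
}
```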

Configuration Parameters

If you want to read a user-defined configuration parameter into your enhancer, you can declare it as a property and populate it in the activate method. For example, if a service URL is required to connect to an external service, we can declare a property, like-

@Property(value = "https://www.formcept.com/analyze")
public static final String FORMCEPT_SERVICE_URL = "org.formcept.engine.enhancer.url";

/**
 * Service URL
 */
private String serviceURL;

The serviceURL can be populated in the activate method, like-

/**
 * Activate and read the properties
 * @param ce the {@link ComponentContext}
 */
@Activate
protected void activate(ComponentContext ce) throws ConfigurationException {
    try {
        super.activate(ce);
    } catch (IOException e) {
        // log
        LOG.error("Failed to update the configuration", e);
    }
    @SuppressWarnings("unchecked")
    Dictionary properties = ce.getProperties();
    // update the service URL if it is defined
    if(properties.get(FORMCEPT_SERVICE_URL) != null){
        this.serviceURL = (String) properties.get(FORMCEPT_SERVICE_URL);
    }
}

/**
 * Deactivate
 * @param ce the {@link ComponentContext}
 */
@Deactivate
protected void deactivate(ComponentContext ce) {
    super.deactivate(ce);
}

You can also use the deactivate method to clean up the resources that you are using.
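The lookup done in the activate method can be tried out on its own with a plain java.util.Dictionary. This is a minimal sketch under a few assumptions: the helper readString and the class ConfigLookupSketch are made up for illustration, and falling back to the default when the value is blank is my own choice, not something Stanbol or Felix prescribes.

```java
import java.util.Dictionary;
import java.util.Hashtable;

final class ConfigLookupSketch {

    static final String FORMCEPT_SERVICE_URL = "org.formcept.engine.enhancer.url";
    static final String DEFAULT_URL = "https://www.formcept.com/analyze";

    /** Returns the configured value for key, or fallback if it is missing or blank. */
    static String readString(Dictionary<String, Object> properties,
                             String key, String fallback) {
        Object value = properties.get(key);
        if (value == null) {
            return fallback;
        }
        String text = value.toString().trim();
        return text.isEmpty() ? fallback : text;
    }

    public static void main(String[] args) {
        // Hashtable is the JDK's standard Dictionary implementation.
        Dictionary<String, Object> properties = new Hashtable<>();
        System.out.println(readString(properties, FORMCEPT_SERVICE_URL, DEFAULT_URL));

        properties.put(FORMCEPT_SERVICE_URL, "http://localhost:9000/analyze");
        System.out.println(readString(properties, FORMCEPT_SERVICE_URL, DEFAULT_URL));
    }
}
```

Reading the value once and falling back explicitly avoids a null serviceURL when the property has been cleared in the console.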

Step-3: Package and Deploy Enhancement Engine

Once you have implemented your enhancement engine, it is time to deploy it and test it using the Stanbol interface. The first step is to make sure that all the dependencies are resolved and declared in your Maven configuration. If they are, Maven will take care of packaging your engine.

To package the bundle, use- mvn clean compile install

Maven will package all the required dependencies and generate the component descriptions for your enhancement bundle in a jar file that will be generated in the target folder of your project.
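Before deploying, it can help to check that the generated JAR really carries the OSGi headers. The following is a rough sketch using only the JDK: the class BundleManifestCheck is made up, and the path of the JAR passed as the first argument depends on your own pom. A missing Bundle-SymbolicName header usually means that the bundle plugin did not run.

```java
import java.io.IOException;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.Manifest;

final class BundleManifestCheck {

    /** Returns the Bundle-SymbolicName header, or null if the manifest lacks it. */
    static String symbolicName(Manifest manifest) {
        Attributes main = manifest.getMainAttributes();
        return main.getValue("Bundle-SymbolicName");
    }

    /** True if the manifest exists and declares a symbolic name. */
    static boolean looksLikeBundle(Manifest manifest) {
        return manifest != null && symbolicName(manifest) != null;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the JAR in your target folder (name depends on your pom)
        try (JarFile jar = new JarFile(args[0])) {
            Manifest manifest = jar.getManifest();
            System.out.println("OSGi bundle: " + looksLikeBundle(manifest));
            if (looksLikeBundle(manifest)) {
                Attributes main = manifest.getMainAttributes();
                System.out.println("Symbolic name: " + symbolicName(manifest));
                // Header that points at the generated component descriptions, if any
                System.out.println("Service-Component: " + main.getValue("Service-Component"));
            }
        }
    }
}
```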

Now, we just need to deploy the generated JAR file of our enhancement engine as a bundle. To do so, open the OSGi console provided by Apache Felix- http://localhost:8080/system/console/bundles

Apache Felix Bundles

The Apache Felix console lists all the bundles, components, configurations and a bunch of other details about the OSGi services platform. Now click on the Install/Update button to install the enhancement engine that we have created.

You will see a dialog box as shown below. Choose your enhancement engine JAR file and check the Start Bundle checkbox. This will make sure that your bundle is started after deployment. Now click on Install or Update.

FORMCEPT Bundle Installation

If your bundle has pulled in all the dependencies correctly and the required configuration files are present within your bundle, it will be deployed successfully and the status of your bundle will be shown as Active. For example, we have created a bundle named FORMCEPT Enhancement Engine, so we will see the bundle listed as shown below-

FORMCEPT Bundle Installed

Now, if you go back to Stanbol’s home page, you will see your bundle listed as a part of the Default Chain-

FORMCEPT Enhancement Engine

Write some text and click on Run Engines. You will see your enhancement in the result-

FORMCEPT Enhancements

Remember… we added a label FORMCEPT and we see the same in the result.

We also added a property for the URL that we wanted to get as a configuration parameter. To change the property, go back to the Apache Felix console and click on the Configuration tab. Locate your enhancement engine and click on it. You will see the configuration parameters as shown below-

FORMCEPT Enhancement Engine Configuration

You can modify them and reload your bundle to pick up the new configuration parameter.

So, that is it. We have covered the basics of developing an enhancement engine for Apache Stanbol. For more details on enhancement engines and how-to methods, please read- http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/

If you are interested in using the code, please feel free to fork- https://github.com/formcept/formcept-enhancer

Posted in FORMCEPT, Stanbol | Comments Off

FORMCEPT Architecture

FORMCEPT Big Data Analysis Stack
FORMCEPT provides a highly scalable big data analysis infrastructure that is designed on top of proven open source technologies including Hadoop, HBase and Solr. FORMCEPT Big Data Analysis Stack consists of five layers-

  • MECBOT
  • Grabby
  • C3 (Classify | Compare | Correlate) Engine
  • Storage Engine
  • Intent Channel

MECBOT - Management and Enhancement of Content

We believe that in the near future everyone will own a robot. These bots will be used either for assistance at home or for day-to-day work. They will have to be intelligent enough to accept your commands, understand what you need and respond within a limited time frame. Our vision with MECBOT is to provide a bot (hardware/software) that can assist with all the storage, analysis and retrieval services related to day-to-day work.

MECBOT stands for Management and Enhancement of Content and is responsible for storing and analyzing all your data and events such that they can be retrieved on demand at the right time and on the right device. All the users of FORMCEPT Platform own a MECBOT and configure it based on their requirements.

MECBOT users can also configure their interests, such as which content to grab, where to grab it from, and the device on which the analyzed content or reports should be delivered. Once configured, MECBOT coordinates all the activities within the stack and starts responding to commands and events.

Grabby

FORMCEPT Platform - Grabby
Grabby acts as a content firehose for FORMCEPT Platform. Grabby is used by MECBOT to grab content from a variety of external data sources. Grabby understands the OAuth protocol, which lets it grab content seamlessly from popular social media platforms and document stores such as Dropbox and Google Docs. You can configure your data source for Grabby through MECBOT and fetch structured or unstructured content for storage, analysis and retrieval.

Storage Engine

FORMCEPT Storage Engine
Storage Engine provides the scalable storage implementation for the FORMCEPT platform. The biggest challenge addressed by the Storage Engine is storing content according to its type and structure. All content, structured or unstructured, is managed by the Storage Engine in such a way that storage, retrieval, processing and analysis are efficient. FORMCEPT hides the entire storage complexity and stores the content in the right storage implementation based on its type. The Storage Engine can also sit on top of an existing distributed storage implementation.

C3 Engine

FORMCEPT C3 (Compare | Classify | Correlate) Engine
C3 stands for Classify | Compare | Correlate and is the main processing engine of the platform. FORMCEPT uses its proprietary Natural Language Processing algorithms for classification. The classification is backed by a Knowledge Graph that is built on the concepts of Linked Data. Once the classification is done, the next step is to compare content and find correlations. FORMCEPT uses advanced mathematical concepts to compare and correlate content. Tasks like sentiment analysis, trend analysis and pattern detection are also done by the C3 Engine.

Intent Channel

FORMCEPT Intent Channel
Intent Channel is the delivery channel for all the applications that are built on top of FORMCEPT Platform. These applications are called Intents, hence the name Intent Channel.

Big Data Analysis is a challenging problem but analysis alone is not sufficient. We need to analyze the data and at the same time deliver the analysis results and reports to the right device at the right time.

FORMCEPT Intent Channel is designed to deliver the right content to the right device at the right time. The Intent Channel can be used not only for delivering the analysis results or reports but also documents, images and notifications.

Intents

FORMCEPT provides a Big Data Analysis platform to build next generation intelligent applications with minimal effort. These applications are called Intents. You can develop an Intent with your data and review it in an agile manner. Once you are satisfied with how the Intent works, you can turn on your content firehose by configuring your MECBOT.

FORMCEPT provides a few Intents out-of-the-box with the Big Data Analysis Platform. These include the Human Resource Intent, the Media Intent and the Retail Intent.

Posted in FORMCEPT | Comments Off