Big Data Tech Conclave 2013 – Part-1

Leaders from around the world gathered at the “Big Data Tech Conclave 2013 Winter Edition”, held on 6th and 7th December 2013 in Bangalore. FORMCEPT was associated with the global conclave as an endorsing partner.

The 2-day event hosted back-to-back inspiring sessions around the deluge called Big Data. It was well attended, and eminent personalities from the industry shared their knowledge and experience with the audience.

In this blog, FORMCEPT would like to share the key takeaways from the event.

On the first day of the event, one theme was common across all the talks: “Data Infrastructure Issues”. It is a broad term for issues related to data capture (structured and/or unstructured), storage, analysis, delivery and visualization.

Most of the talks made us ask: do enterprises need to worry about all of these “Data Infrastructure Issues”, or only about solving their business problems? When we buy a fridge or an air conditioner, we never ask which compressor or electronics it uses. Why, then, can’t we spare enterprises the same kind of detail when it comes to their data infrastructure?

In the current data infrastructure landscape, traditional technologies are slowly being replaced by newer ones, and the gap between available human expertise and the technology is widening rapidly. Without a robust data infrastructure, this gap jeopardizes the adoption of data analysis, and Big Data analysis in particular, in most enterprises. CXOs, on the other hand, definitely want to adopt it, as they are aware of competitors who are already taking advantage of emerging data analysis techniques.

FORMCEPT addresses this with MECBOT, a unified analytics platform built on top of state-of-the-art open source technologies such as Hadoop, HBase, Storm and Spark. MECBOT does all the heavy lifting around data and makes it available on demand as well as in real time, so enterprises can focus on their business problems rather than on the “Data Infrastructure Issues”. MECBOT also allows enterprises to develop data-driven applications faster and scale them on demand using their existing skill set.

To know more about FORMCEPT and MECBOT, please write to contactus@formcept.com

Posted in FORMCEPT | Comments Off

FORMCEPT featured at TechCrunch Bangalore

For the first time ever, the TechCrunch International City event came to India. It was held in the tech hub of Bangalore over 2 days (November 14 – 15, 2013).

FORMCEPT was featured among 50 startups selected from hundreds of entries for Pitch Presentations. The event showcased these startups launching their products before a live and online audience, including a panel of 50 investors and expert judges.

We are proud to be among the chosen few to demonstrate our product MECBOT on the TechCrunch platform.

TechCrunch is a leading technology media property, dedicated to obsessively profiling startups, reviewing new Internet products, and breaking tech news. TechCrunch Bangalore focused on encouraging upcoming Indian startups to make a ground-breaking impact on the global stage.

Posted in FORMCEPT | Comments Off

FORMCEPT features in NASSCOM Emerge 50

FORMCEPT has been recognized as one of the “NASSCOM Emerge 50” companies of India for 2013 in the Emerge Start-up category.

The NASSCOM Emerge 50 Awards annually recognize the top 50 highly innovative and agile emerging and start-up companies that are foraying into untapped territories and redefining the way IT can make a difference. In the start-up category, the award recognizes innovative companies that are growing at a rapid pace within their first 3 years of existence.

About NASSCOM
NASSCOM is a global trade body with over 1200 members, of which over 250 are global companies from the US, UK, EU, Japan and China. With “Emerge 50”, NASSCOM supports upcoming companies by giving them a platform to showcase their potential.

Posted in Award, FORMCEPT | Comments Off

FORMCEPT’s approach to Telematics

“Connected cars have the potential to dramatically reduce the 1.2 million traffic deaths that occur worldwide each year”
                                                   – The National Highway Traffic Safety Administration, June 2013

Since the first successful installation of a car radio in 1930, telecommunications technology has evolved significantly. Our vehicles now provide an enhanced and safer experience by embedding the latest technologies and gadgets. Rapidly evolving mobile technology is driving the next phase of innovation in vehicles and becoming an integral part of the system, providing benefits such as infotainment, assistance and navigation on the go.

With such business opportunities on offer, the automotive industry is gearing up for the challenge, and it comes as no surprise that the connected car market is projected to be worth a staggering $422 billion by 2022.

Source: Machina Research, 2013

This is the age of smartphones, and we are slowly progressing towards “smart cars”, which will soon become an everyday term.

Posted in FORMCEPT, Infographics, Telematics | Comments Off

Retail – Edging ahead with Big Data

“We all have been hearing about Big Data for over three years. We’re at the point where the industry really needs some tangible examples and show what it means.”
                                                  –  Curt Hecht, Global Chief Revenue Officer, The Weather Company

The retail industry around the world has evolved significantly over the years. From corner stores to department stores, and now to supermarkets and hypermarkets, it has seen drastic transitions. One of the major game changers was the advent of the Internet, which revolutionized retail with the introduction of e-commerce.

With the launch of Amazon’s e-commerce services in 1995, many companies started using the Internet aggressively for commercial transactions. Since then, e-commerce has seen a meteoric rise. Online retail sales in the US are forecast to reach $327 billion by 2016, with 56% of the population shopping online.

The rise of e-commerce has led to a large number of online portals that leverage multiple channels, such as social media networks, to generate awareness about their products. In the process, the e-commerce industry has started contributing significantly to the unstructured data being generated across the web, in addition to the traditional transactional data it has always generated. This phenomenon is often associated with the term “Big Data”. It was the Internet at the dawn of the 21st century, and it is Big Data today, that is turning heads and giving companies a cutting edge over their competitors.

As competition rises, several attempts are being made in the retail industry to make the process more efficient and convenient for both consumers and retailers. Most of these attempts adopt a data-driven approach that involves capturing and processing an immense amount of useful social media content in addition to in-store transactional data. This blog presents one such case study, implemented on top of FORMCEPT’s MECBOT platform.

Posted in FORMCEPT, Retail | Comments Off

A Unified Big Data Analytics Platform

For decades, companies have been making business decisions based on the transactional data stored in relational databases. But with the availability of Big Data processing power, it is becoming more meaningful to include external data sources, such as social media, web logs, sensors, etc. and their integration with internal data sources. When Big Data is distilled and analyzed in combination with traditional enterprise data, enterprises can develop a more thorough and insightful understanding of their business leading to enhanced productivity, a better competitive positioning and greater innovation – all of which can positively impact the bottom line. No wonder, companies are rushing to capture, refine, analyze and experiment with all types of so-far-unexplored data sources including the data generated from customer interactions and business operations.

The challenge lies in finding the perfect blend of tools and techniques to harness the underlying potential of Big Data. However, the business analysts and data scientists who understand Big Data are themselves facing tremendous pressure to work with multiple disparate systems to get an integrated view before they can actually focus on the business problems.

Posted in FORMCEPT | Comments Off

FORMCEPT transforms Recruitment using Big Data

More than 500 years ago, the first CV was prepared by Leonardo Da Vinci [1]. Since then there has been a significant transition: from traditional paper resumes to electronic resumes and, of late, video resumes. Resumes have been moulded into different forms to appear more conspicuous.

Even in this innovation-driven era of rapid technological advancement, human capital remains the most precious resource for any organization. Everything from recruiting the right candidate to training, developing and retaining the best employees has become increasingly cumbersome for HR managers across organizations. It has become clear that lack of visibility, rather than lack of talent, is the main hindrance for the recruitment industry, which has a global market size of $400 billion [2].

FORMCEPT has developed a first-of-its-kind product HR Intent that will enable recruiters to optimize their hiring process. It helps the recruiters to find the ‘Best-Fit’ for a particular job description from a large pool of candidates. The product can also be used by the candidates to generate an eye-catching resume Infographic that can be shared with the recruiters and friends.

Posted in FORMCEPT, Infographics, News | Comments Off

Proud to be a part of TiE50 Finalists

On 15th March 2013, FORMCEPT was selected as one of 6 promising big data startups from India and got an opportunity to interact with Amr Awadallah, CTO of Cloudera, at a public event. Today, we would like to announce that we have made it to the list of TiE50 Finalists.

The TiE brand is now synonymous with entrepreneurship. It is well known for giving startups a stage on which to showcase themselves to the world, and for continuously mentoring and guiding them during their initial days.

As one of the TiE50 finalists, we will be presenting the story behind FORMCEPT at TiEcon on 17th May 2013 at the Santa Clara Convention Center, in the heart of Silicon Valley, CA. TiEcon is TiE’s flagship annual event, and this will be its 5th edition, held over two days, 17-18 May 2013.

Since its inception in 2009, TiEcon has been an ideal platform for some successful startups. It presents to the world new, disruptive, innovation-driven companies with huge potential that are ready to influence global business. The event thrives on competition, innovation and smart technology, but only 50 companies make it to the list of TiE50 winners: 10 in each of its 5 categories: (1) Software (2) Internet (3) Mobile (4) Energy (5) Life Sciences

Some of the winners and finalists who have been a part of TiE50 include:

   

And this year it will be FORMCEPT. See you guys at TiEcon!

Posted in FORMCEPT, News | Comments Off

Power of Ideas 2012

The last three months have been awesome for FORMCEPT. We were one of the finalists of Next Big Idea 2012, were covered in the Tech30 report at TechSparks, and were also selected among the top 75 ideas of the country. We will be spending the next few days at IIMA with all the other finalists. Power of Ideas 2012 has selected 75 ideas spanning more than 15 different domains and 24 cities of India. Here are some details.

Power of Ideas 2012

Posted in FORMCEPT, Infographics | Comments Off

Healthcare – Analyzing Medical Records

This blog describes a use case of analyzing medical records using Apache Stanbol. For more details, please read FORMCEPT’s proposal. In the previous blog post, we discussed the basics of creating an Enhancement Engine for Apache Stanbol. This post drills down into the Enhancement Structure of Apache Stanbol and the various properties of a Metadata Graph that can be used to store the enhancements. The concepts are explained using a FORMCEPT Healthcare engine that can be used to annotate medical records.

Introduction

This section describes the concepts and terminologies used in the blog.

Content Item

A content item is the unit of content within Apache Stanbol. It contains the content as well as the entire Metadata graph of enhancements. You can read more about content items in the Apache Stanbol Enhancer documentation (reference 1).

Enhancement Engines

Enhancement Engines enhance the content item. A content item is processed by one or more enhancement engines based on the selected enhancement chain. You can read more about enhancement engines in reference 4.

Enhancement Structure

The enhancement structure defines the types and properties used in the Metadata graph of enhancements. It is based on RDF and OWL. A sample enhancement structure, taken from the Apache Stanbol wiki page, is shown below. You can read more about the enhancement structure in reference 2.

Apache Stanbol - Enhancement Structure

(Credit: Apache Stanbol)

Medical Record and Enhancements

According to Wikipedia, the terms medical record, health record, and medical chart are used somewhat interchangeably to describe the systematic documentation of a single patient’s medical history and care across time within one particular health care provider’s jurisdiction.

The healthcare enhancement engine treats a medical record as a content item. The knowledge base used by the enhancement engine is built on top of DBpedia 3.6, specifically these domains:

  1. Drugs and Diseases
  2. Chemical Compounds
  3. Species
  4. Others, like- Health, Microbiology, Medical Diagnosis, Medicine, Perception and Biology

Implementation

This section describes the healthcare enhancement engine and how the enhancements are added to the Metadata graph for each Medical Record.

Healthcare Enhancement Chain

The FORMCEPT Healthcare enhancement chain consists of:

  1. Tika Engine (Credit: Apache Stanbol)
  2. Language Identification Engine (Credit: Apache Stanbol)
  3. FORMCEPT Healthcare Engine

For research purposes, you can also take a look at the enhancement engine provided by the Apache Stanbol eHealth demo.

An Example

Let’s take an example of a typical medical record that describes the symptoms of a brain tumor:

Symptoms vary depending on the location of the tumour and may be slow in onset. Symptoms include headache, vomiting, nausea, dizziness, poor coordination, disturbance of vision, weakness affecting one side of the body, mental changes and fits. A person with any symptoms of brain disorder should seek medical advice.

Given a statement of symptoms and indications, as shown above, the FORMCEPT Healthcare Engine tries to annotate all the entities of interest. For example, in the above statement, headache, vomiting, nausea, etc. are all entities of interest. The annotated entities are then picked up and displayed by the Apache Stanbol enhancer user interface, as shown below.

FORMCEPT Healthcare Engine

FORMCEPT Healthcare Engine Enhancements

FORMCEPT Healthcare Engine

The FORMCEPT Healthcare Engine relies on an external FORMCEPT Spotter Service that spots keywords in the specified content. The spotter service returns JSON that looks like this:

{
  "spotID": "64b3b8f0-2b14-44fd-9f07-a4276e377d55",
  "createdOn": "Jul 11, 2012 2:55:26 PM",
  "spottedElements": [
    {
      "spottedWord": "headache",
      "startIndex": 97,
      "endIndex": 105,
      "elements": [
        {
          "uri": "http://dbpedia.org/resource/Headache",
          "dataset": [
            "DBPEDIA_RESOURCE"
          ],
          "label": "Headache",
          "alsoKnownAs": [
            "Headache and Migraine",
            "Hofudverkur",
            "Headech",
            "Headaches",
            "Headache (medical)",
            "Headache disorders",
            "Head pain",
            "Headache syndromes",
            "Encephalalgia",
            "Toxic headache",
            "Hoefudverkur",
            "Höfuðverkur",
            "Head Aches",
            "Headach",
            "Head ache",
            "Chronic headache",
            "Cephalgia"
          ],
          "description": "A headache or cephalgia is pain anywhere in the region of the head or neck. It can be a symptom of a number of different conditions of the head and neck. The brain tissue itself is not sensitive to pain because it lacks pain receptors. Rather, the pain is caused by disturbance of the pain-sensitive structures around the brain.",
          "context": "Health->Diseases and disorders->Neurological disorders->Headache",
          "broaderCategs": [
            "Health",
            "Diseases and disorders",
            "Neurological disorders"
          ],
          "language": "en",
          "timestamp": "Feb 17, 2012 10:56:00 PM"
        }
      ]
    }
    ...
  ]
}

The spotting result contains the spotted elements that were found in the specified content. Each spotted element has these features:

  1. Spotted Word: Word as it appears in the specified content
  2. Start and End Index: Word boundaries within the specified content. This is useful to locate the word within the specified content
  3. Elements: One or more elements in the knowledge base that define the spotted word

Each Knowledge Base Element has these features:

  1. URI of the Element
  2. Dataset associated with the Element  (For example, DBpedia)
  3. Standard label for the Element as defined in the Knowledge Base
  4. alsoKnownAs: Other labels by which the Element can be referred to
  5. Short description of the Element
  6. Hierarchical Context of the Element
  7. Broader Categories for the Element
  8. Language of the Element in which the label and other details are specified
  9. Timestamp of the last update
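
To make the role of the start and end index concrete, here is a minimal sketch in plain Java. The class `SpottedElementSketch` and its methods are hypothetical names made up for this illustration, not part of the FORMCEPT Spotter Service. The sketch assumes that the end index is exclusive, which is consistent with the values 97 and 105 reported for "headache" in the sample record.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of one "spottedElements" entry; not FORMCEPT's actual class. */
final class SpottedElementSketch {
    final String spottedWord;
    final int startIndex;            // offset of the first character of the word
    final int endIndex;              // offset just past the last character (assumed exclusive)
    final List<String> elementUris;  // URIs of the matching Knowledge Base elements

    SpottedElementSketch(String spottedWord, int startIndex, int endIndex, List<String> elementUris) {
        this.spottedWord = spottedWord;
        this.startIndex = startIndex;
        this.endIndex = endIndex;
        this.elementUris = elementUris;
    }

    /** Cuts the word out of the content using the reported boundaries. */
    String locateIn(String content) {
        return content.substring(startIndex, endIndex);
    }

    /** Checks that the boundaries are in range and really delimit the spotted word. */
    boolean matches(String content) {
        boolean inRange = startIndex >= 0 && startIndex <= endIndex && endIndex <= content.length();
        return inRange && spottedWord.equals(locateIn(content));
    }

    public static void main(String[] args) {
        // Beginning of the sample medical record used in this post.
        String content = "Symptoms vary depending on the location of the tumour and may be slow in onset. "
                + "Symptoms include headache, vomiting, nausea, dizziness.";
        SpottedElementSketch headache = new SpottedElementSketch(
                "headache", 97, 105, Arrays.asList("http://dbpedia.org/resource/Headache"));
        System.out.println(headache.locateIn(content)); // prints headache
        System.out.println(headache.matches(content));  // prints true
    }
}
```

A check like `matches` can be handy when consuming the spotter output, since it catches offsets that have drifted from the content they were computed on.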

FORMCEPT Healthcare Enhancer converts these features into the Metadata Graph. The computeEnhancements method contains all the implementation to update the Metadata Graph with the above features of the Knowledge Base Elements.

As a first step, a text enhancement is created within the Metadata Graph for the spotted word. Each spotted word can have one or more associated elements within the Knowledge Base, but only the first element (the closest match) defines the type of the enhancement. The type is derived from a class defined within the ontology of the dataset. If a type is not found, then the first category of the context is considered as the type. Here is a code snippet to create the text enhancement:

// get the literal factory
LiteralFactory literalFactory = LiteralFactory.getInstance();
// get the metadata graph
MGraph metadata = ci.getMetadata();
// text annotation
UriRef elemAnnotation = EnhancementEngineHelper.createTextEnhancement(ci, this);
// add metadata
metadata.add(new TripleImpl(elemAnnotation, ENHANCER_SELECTED_TEXT, new PlainLiteralImpl(elem.getSpottedWord())));
// add the type
metadata.add(new TripleImpl(elemAnnotation, DC_TYPE, new UriRef(elemType)));
// set the description
metadata.add(new TripleImpl(elemAnnotation, RDFS_COMMENT, new PlainLiteralImpl(knowledgeElem.getDescription())));
// set the context as broader categories
metadata.add(new TripleImpl(elemAnnotation, SKOS_BROADER, new PlainLiteralImpl(knowledgeElem.getContext())));
// record where the spotted word starts and ends within the content
metadata.add(new TripleImpl(elemAnnotation, ENHANCER_START, literalFactory.createTypedLiteral(elem.getStartIndex())));
metadata.add(new TripleImpl(elemAnnotation, ENHANCER_END, literalFactory.createTypedLiteral(elem.getEndIndex())));

The spotted word is added as the selected text, along with other metadata such as the description, the context and the start/end fields.
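
The type fallback described earlier (use the ontology class if there is one, otherwise the first category of the context) can be sketched as follows. This is only an illustration under stated assumptions: `EnhancementTypeSketch` and `resolveType` are hypothetical names, the `->` separator is taken from the `context` field of the spotter JSON, and how the engine turns the resulting name into the `elemType` URI is not shown in this post.

```java
/** Hypothetical helper illustrating the type fallback; not the engine's actual code. */
final class EnhancementTypeSketch {
    // Separator used in the "context" field of the spotter JSON.
    static final String CONTEXT_SEPARATOR = "->";

    /**
     * Returns the ontology class when one was found; otherwise the first
     * category of the hierarchical context; or null when neither is available.
     */
    static String resolveType(String ontologyClass, String context) {
        if (ontologyClass != null && !ontologyClass.isEmpty()) {
            return ontologyClass;
        }
        if (context == null || context.isEmpty()) {
            return null;
        }
        int separator = context.indexOf(CONTEXT_SEPARATOR);
        return separator < 0 ? context : context.substring(0, separator);
    }

    public static void main(String[] args) {
        String context = "Health->Diseases and disorders->Neurological disorders->Headache";
        // No ontology class was found, so the first context category is used.
        System.out.println(resolveType(null, context)); // prints Health
    }
}
```

For the headache example this yields Health, which matches the `type` reported for the text annotation in the results further down.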

Once the text enhancement is created, all the knowledge elements are added as Entity Enhancements that refer to the text enhancement created for the spotted word. Here is a code snippet to create the entity enhancements for the spotted word:

// add other entities
for(FCKnowledgeElement entityElem : spottedElems){
    // add a topic enhancement
    UriRef enhancement = EnhancementEngineHelper.createEntityEnhancement(ci, this);
    metadata.add(new TripleImpl(enhancement, RDF_TYPE, TechnicalClasses.ENHANCER_TOPICANNOTATION));
    metadata.add(new TripleImpl(enhancement, org.apache.stanbol.enhancer.servicesapi.rdf.Properties.DC_RELATION, elemAnnotation));
    // add link to entity
    metadata.add(new TripleImpl(enhancement, ENHANCER_ENTITY_REFERENCE, new UriRef(entityElem.getUri())));
    metadata.add(new TripleImpl(enhancement, ENHANCER_ENTITY_TYPE, OntologicalClasses.SKOS_CONCEPT));
    metadata.add(new TripleImpl(enhancement, ENHANCER_CONFIDENCE, literalFactory.createTypedLiteral(entityElem.getConfidence())));
    metadata.add(new TripleImpl(enhancement, ENHANCER_ENTITY_LABEL, new PlainLiteralImpl(entityElem.getLabel())));
    metadata.add(new TripleImpl(enhancement, RDFS_COMMENT, new PlainLiteralImpl(entityElem.getDescription())));
    metadata.add(new TripleImpl(enhancement, SKOS_BROADER, new PlainLiteralImpl(entityElem.getContext())));
}

Each entity enhancement is linked to the text enhancement using the DC_RELATION property. The Stanbol user interface groups all the related entities and links them to the external URI specified by the ENHANCER_ENTITY_REFERENCE property.
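
The effect of the DC_RELATION link can be illustrated with a small plain-Java sketch that groups entity enhancements by the text enhancement they point to. This is not how the Stanbol user interface is implemented; `RelationGroupingSketch` and its nested `EntityAnnotation` class are hypothetical, and the second entity URI in `main` is made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of grouping entity enhancements by their relation. */
final class RelationGroupingSketch {

    /** Flattened view of one entity enhancement (assumed shape, for illustration only). */
    static final class EntityAnnotation {
        final String relation;        // URN of the text enhancement (the DC_RELATION value)
        final String entityReference; // external URI (the ENHANCER_ENTITY_REFERENCE value)

        EntityAnnotation(String relation, String entityReference) {
            this.relation = relation;
            this.entityReference = entityReference;
        }
    }

    /** Maps each text enhancement URN to the entity URIs that refer to it, in input order. */
    static Map<String, List<String>> groupByTextEnhancement(List<EntityAnnotation> annotations) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (EntityAnnotation annotation : annotations) {
            groups.computeIfAbsent(annotation.relation, key -> new ArrayList<>())
                  .add(annotation.entityReference);
        }
        return groups;
    }

    public static void main(String[] args) {
        String textEnhancement = "urn:enhancement-fd5f3241-f1ea-5b02-01b2-30986c2b7f90";
        List<EntityAnnotation> annotations = Arrays.asList(
                new EntityAnnotation(textEnhancement, "http://dbpedia.org/resource/Headache"),
                // Second entity is invented here to show a group with more than one member.
                new EntityAnnotation(textEnhancement, "http://example.org/resource/Hypothetical"));
        System.out.println(groupByTextEnhancement(annotations));
    }
}
```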

Results

A typical enhancement result generated by the FORMCEPT Healthcare Engine looks like this:

{
  "@context": {
    "broader": "http://www.w3.org/2004/02/skos/core#broader",
    "comment": "http://www.w3.org/2000/01/rdf-schema#comment",
    "Concept": "http://www.w3.org/2004/02/skos/core#Concept",
    "confidence": "http://fise.iks-project.eu/ontology/confidence",
    "created": "http://purl.org/dc/terms/created",
    "creator": "http://purl.org/dc/terms/creator",
    "end": "http://fise.iks-project.eu/ontology/end",
    "Enhancement": "http://fise.iks-project.eu/ontology/Enhancement",
    "entity-label": "http://fise.iks-project.eu/ontology/entity-label",
    "entity-reference": "http://fise.iks-project.eu/ontology/entity-reference",
    "entity-type": "http://fise.iks-project.eu/ontology/entity-type",
    "EntityAnnotation": "http://fise.iks-project.eu/ontology/EntityAnnotation",
    "extracted-from": "http://fise.iks-project.eu/ontology/extracted-from",
    "Health": "http://dbpedia.org/ontology/Health",
    "language": "http://purl.org/dc/terms/language",
    "LinguisticSystem": "http://purl.org/dc/terms/LinguisticSystem",
    "relation": "http://purl.org/dc/terms/relation",
    "selected-text": "http://fise.iks-project.eu/ontology/selected-text",
    "start": "http://fise.iks-project.eu/ontology/start",
    "TextAnnotation": "http://fise.iks-project.eu/ontology/TextAnnotation",
    "TopicAnnotation": "http://fise.iks-project.eu/ontology/TopicAnnotation",
    "type": "http://purl.org/dc/terms/type",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "@coerce": {
      "@iri": [
        "entity-reference",
        "entity-type",
        "extracted-from",
        "relation",
        "type"
      ],
      "xsd:dateTime": "created",
      "xsd:double": "confidence",
      "xsd:int": [
        "end",
        "start"
      ],
      "xsd:string": "creator"
    }
  },
  "@subject": [
    {
      "@subject": "urn:enhancement-bf848e5d-decf-49ae-eaed-248e9e476d29",
      "@type": [
        "Enhancement",
        "TextAnnotation"
      ],
      "created": "2012-07-11T10:54:53.141Z",
      "creator": "org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine",
      "extracted-from": "urn:content-item-sha1-ea34dfcefbb6b4e10c5e1d70953708aa65e7dd69",
      "language": "en",
      "type": "LinguisticSystem"
    },
    {
      "@subject": "urn:enhancement-fd5f3241-f1ea-5b02-01b2-30986c2b7f90",
      "@type": [
        "Enhancement",
        "TextAnnotation"
      ],
      "broader": "Health->Diseases and disorders->Neurological disorders->Headache",
      "comment": "A headache or cephalgia is pain anywhere in the region of the head or neck. It can be a symptom of a number of different conditions of the head and neck. The brain tissue itself is not sensitive to pain because it lacks pain receptors. Rather, the pain is caused by disturbance of the pain-sensitive structures around the brain.",
      "created": "2012-07-11T10:54:53.186Z",
      "creator": "org.formcept.engine.enhancer.FCHealthCareEnhancer",
      "end": 105,
      "extracted-from": "urn:content-item-sha1-ea34dfcefbb6b4e10c5e1d70953708aa65e7dd69",
      "selected-text": "headache",
      "start": 97,
      "type": "Health"
    },
    {
      "@subject": "urn:enhancement-fdc05670-18b8-1703-0210-86e2d12fa36b",
      "@type": [
        "Enhancement",
        "EntityAnnotation",
        "TopicAnnotation"
      ],
      "broader": "Health->Diseases and disorders->Neurological disorders->Headache",
      "comment": "A headache or cephalgia is pain anywhere in the region of the head or neck. It can be a symptom of a number of different conditions of the head and neck. The brain tissue itself is not sensitive to pain because it lacks pain receptors. Rather, the pain is caused by disturbance of the pain-sensitive structures around the brain.",
      "confidence": 1.0,
      "created": "2012-07-11T10:54:53.186Z",
      "creator": "org.formcept.engine.enhancer.FCHealthCareEnhancer",
      "entity-label": "Headache",
      "entity-reference": "http://dbpedia.org/resource/Headache",
      "entity-type": "Concept",
      "extracted-from": "urn:content-item-sha1-ea34dfcefbb6b4e10c5e1d70953708aa65e7dd69",
      "relation": "urn:enhancement-fd5f3241-f1ea-5b02-01b2-30986c2b7f90"
    }
  ]
}

The Stanbol user interface groups the linked entities as shown below. The type is shown as a heading (Health in this case), and the spotted word (headache in this case) is shown under Mentions.

FORMCEPT Healthcare Enhancer - Headache-Example

Evaluation

The results shown below were obtained by running the tests against the CALBC Corpora. As of now, we have tested against 1,725 test cases (42,368 words). We will continue to add more test cases from different medical classes (types) and report the performance.

FORMCEPT Healthcare Engine - Evaluation Report

Credit: The results shown above were generated using the sgvizler tool, which connects to the SPARQL endpoint provided by a Fuseki server. The performance report was generated by the FORMCEPT Benchmarking tool, which uses the EARL schema. The idea of visualizing through a SPARQL visualizer was adopted from Rupert’s and Pablo’s comments on the improvement request [STANBOL-652] for the Apache Stanbol Benchmark Tool. Thanks to both of them.

Benchmark

Type                 TP     FP    FN    Precision  Recall  F1
Disease              1826   132   881   0.9326     0.6745  0.7829
Disease, Drugs       1847   140   876   0.9295     0.6783  0.7843
Disease, Drugs, CC   1961   249   848   0.8873     0.6981  0.7814

CC: Chemical Compound, TP: True Positives, FP: False Positives, FN: False Negatives,
F1: F-Measure/F-Score
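
For readers who want to re-derive the Precision, Recall and F1 columns from the TP, FP and FN counts, here is a minimal sketch using the standard definitions. The class `BenchmarkMetricsSketch` is a hypothetical name for this post only and is not part of the FORMCEPT Benchmarking tool.

```java
import java.util.Locale;

/** Standard precision / recall / F1 definitions; hypothetical helper for illustration. */
final class BenchmarkMetricsSketch {

    /** Fraction of reported annotations that were correct: TP / (TP + FP). */
    static double precision(int tp, int fp) {
        return tp / (double) (tp + fp);
    }

    /** Fraction of expected annotations that were found: TP / (TP + FN). */
    static double recall(int tp, int fn) {
        return tp / (double) (tp + fn);
    }

    /** Harmonic mean of precision and recall, written directly in terms of the counts. */
    static double f1(int tp, int fp, int fn) {
        return 2.0 * tp / (2.0 * tp + fp + fn);
    }

    public static void main(String[] args) {
        // Counts from the "Disease" row of the benchmark table.
        int tp = 1826, fp = 132, fn = 881;
        System.out.println(String.format(Locale.ROOT, "%.4f %.4f %.4f",
                precision(tp, fp), recall(tp, fn), f1(tp, fp, fn)));
        // prints 0.9326 0.6745 0.7829
    }
}
```

Note that when a denominator is zero (for example, no annotations reported at all), these methods return NaN rather than throwing, so callers may want to handle that case explicitly.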

FORMCEPT Healthcare Benchmark

Performance

The table given below shows the time taken by the spotting algorithm to spot the annotations out of 42,368 words present in the 1,725 test cases. The table also lists the number of entities present in the Knowledge Base out of which the annotations are identified.

Type       Entities  E1 (sec)  E2 (sec)  E3 (sec)  Avg (sec)  Min (sec)
Disease    5156      0.125     0.096     0.083     0.101      0.083
DiDr       9814      0.155     0.135     0.125     0.138      0.125
DiDrCC     16487     0.209     0.195     0.182     0.195      0.182
DiDrCCSp   185020    0.289     0.284     0.214     0.262      0.214
TypeCateg  221572    0.428     0.424     0.418     0.423      0.418

E1, E2 and E3 are the execution times of three independent runs of the test cases
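
The Avg and Min columns follow directly from the three runs. As a quick sanity check, here is a small sketch (the class `RunTimingSketch` is a hypothetical name used only for this illustration):

```java
import java.util.Arrays;
import java.util.Locale;

/** Hypothetical helper that summarizes the execution times of independent runs. */
final class RunTimingSketch {

    /** Arithmetic mean of the run times, or NaN when no runs are given. */
    static double average(double... seconds) {
        return Arrays.stream(seconds).average().orElse(Double.NaN);
    }

    /** Fastest run, or NaN when no runs are given. */
    static double minimum(double... seconds) {
        return Arrays.stream(seconds).min().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        // E1, E2 and E3 from the "Disease" row of the performance table.
        double[] runs = {0.125, 0.096, 0.083};
        System.out.println(String.format(Locale.ROOT, "%.3f %.3f", average(runs), minimum(runs)));
        // prints 0.101 0.083
    }
}
```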

FORMCEPT Healthcare Engine Performance

FORMCEPT Spotter builds an in-memory model of the entities to annotate the content. The table given below mentions the amount of memory consumed and the time taken to build the in-memory model.

Type       Entities  Processor         Memory    Time (sec)
Disease    5156      i5-2400 3.10GHz   23 MB     1.081
DiDr       9814      i5-2400 3.10GHz   35 MB     1.380
DiDrCC     16487     i5-2400 3.10GHz   182 MB    2.152
DiDrCCSp   185020    i5-2400 3.10GHz   1.14 GB   11.821
TypeCateg  221572    i5-2400 3.10GHz   1.39 GB   19.880

Entities: Total number of entities present in the Knowledge Base
DiDr: Disease and Drugs, DiDrCC: Disease, Drugs and Chemical Compound,
DiDrCCSp: Disease, Drugs, Chemical Compound and Species,
TypeCateg includes Disease, Drugs, Chemical Compound, Species, Health, Microbiology, Biology, Perception, Medical diagnosis and Medicine

Discussion

  • The results reported a high number of false negatives for each type. The reasons are:
  1. C2/C3/C4/C6/C6D/C7/C9 deficiency and hematopoiesis were not marked as diseases within the Knowledge Base
  2. Close to 70% of the false negatives were abbreviations, such as BMD, DMD, MJD, FAP, ALD, AMN, CL/P, PWS, VWS, CP, UPD14, DM, RCCs, MHP, WAS, CTX, DRD, HPD, VHL, HD, AS, AGU, MPS VII, FRDA, ASPA, etc. The full forms of these abbreviations have already been captured, and some of the abbreviations were ambiguous
  3. Annotations such as deficiency of norrin, deficiency of the enzyme and abnormal growth of lymphocytes are phrases rather than exact terms
  4. Annotations such as apoptosis, lesions and Enlarged vestibular aqueduct were not included in the Knowledge Base
  • A few more annotations were not captured by the Knowledge Base. Some of them were: hyperalphalipoproteinemia, CETP deficiency, anhaptoglobinemia, Atm-deficient, cardiac abnormality, diurnal fluctuation, Peter’s anomaly, platyspondyly, Axenfeldt anomaly, neurogenetic disorder, fibular overgrowth, facial lesions, Chronic neisserial infection, hemochromatosis, skin pigmentation, lymphoid malignancy, hitch hiker thumb, GPI-anchor deficiency, attenuated polyposis, iminodipeptiduria, hair-follicle morphogenesis, pseudoglioma, hyperphenylalaninemic, Morphological abnormalities, cPNETs, Duarte 2 and microvesicular steatosis
  • The number of true positives increased as types were added, but so did the number of false positives. The false positives were found to be drug names and chemical compounds that were not annotated in the Corpus.
  • Memory requirements can be further reduced by keeping only the spotted element IDs in memory.

We will continue to add more datasets to reduce the number of false negatives and incorporate the missing annotations.

References

  1. http://incubator.apache.org/stanbol/docs/trunk/enhancer/
  2. http://incubator.apache.org/stanbol/docs/trunk/enhancer/enhancementstructure.html
  3. http://incubator.apache.org/stanbol/docs/trunk/enhancementusage.html
  4. http://incubator.apache.org/stanbol/docs/trunk/enhancer/engines/
  5. http://www.ebi.ac.uk/Rebholz-srv/CALBC/corpora/resources.html
Posted in Benchmark, FORMCEPT, Healthcare, Performance, Stanbol | Comments Off