Healthcare - Analyzing Medical Records

This blog describes a use case of analyzing medical records using Apache Stanbol. For more details, please read FORMCEPT's proposal. In the previous blog post, we discussed the basics of creating an Enhancement Engine for Apache Stanbol. This post drills down into the Enhancement Structure of Apache Stanbol and the properties of a Metadata Graph that can be used to store the enhancements. The concepts are explained using a FORMCEPT Healthcare engine that annotates medical records.


This section describes the concepts and terminologies used in the blog.

Content Item

A content item is the unit of content within Apache Stanbol. It contains the content as well as the entire Metadata graph of enhancements. You can read more about content item here.

Enhancement Engines

Enhancement Engines enhance the content item. A content item is processed by one or more enhancement engines based on the selected enhancement chain. You can read more about enhancement engines here.

Enhancement Structure

The enhancement structure defines the types and properties used in the Metadata graph of enhancements. It is based on RDF and OWL. A sample enhancement structure is shown below. This example has been taken from the Apache Stanbol wiki page. You can read more about enhancement structure here.


(Credit: Apache Stanbol)

Medical Record and Enhancements

[Wikipedia] The terms medical record, health record, and medical chart are used somewhat interchangeably to describe the systematic documentation of a single patient's medical history and care across time within one particular health care provider's jurisdiction.

The healthcare enhancement engine considers a medical record as a content item. The knowledge base that is used by the enhancement engine is built on top of DBpedia 3.6 and specifically these domains-

  1. Drugs and Diseases
  2. Chemical Compounds
  3. Species
  4. Others, like- Health, Microbiology, Medical Diagnosis, Medicine, Perception and Biology


This section describes the healthcare enhancement engine and how the enhancements are added to the Metadata graph for each Medical Record.

Healthcare Enhancement Chain

FORMCEPT Healthcare enhancement chain consists of-

  1. Tika Engine (Credit: Apache Stanbol)
  2. Language Identification Engine (Credit: Apache Stanbol)
  3. FORMCEPT Healthcare Engine

For research purposes, you can also take a look at the enhancement engine provided by the Apache Stanbol eHealth demo.

An Example

Let's take an example of a typical medical record that describes the symptoms of a Brain Tumor-

Symptoms vary depending on the location of the tumour and may be slow in onset. Symptoms include headache, vomiting, nausea, dizziness, poor coordination, disturbance of vision, weakness affecting one side of the body, mental changes and fits. A person with any symptoms of brain disorder should seek medical advice.

Given a statement of symptoms and indications, as shown above, FORMCEPT Healthcare Engine will try to annotate all the entities of interest. For example, in the above statement, headache, vomiting, nausea, etc. are all entities of interest. The annotated entities are then picked up and shown by the Apache Stanbol enhancer user interface, as shown below.

FORMCEPT Healthcare Engine

FORMCEPT Healthcare Engine relies on an external FORMCEPT Spotter Service that spots the keywords in the specified content. The spotter service returns a JSON that looks like-
{
  "spotID": "64b3b8f0-2b14-44fd-9f07-a4276e377d55",
  "createdOn": "Jul 11, 2012 2:55:26 PM",
  "spottedElements": [
    {
      "spottedWord": "headache",
      "startIndex": 97,
      "endIndex": 105,
      "elements": [
        {
          "uri": "",
          "dataset": [ ... ],
          "label": "Headache",
          "alsoKnownAs": [
            "Headache and Migraine",
            "Headache (medical)",
            "Headache disorders",
            "Head pain",
            "Headache syndromes",
            "Toxic headache",
            "Head Aches",
            "Head ache",
            "Chronic headache"
          ],
          "description": "A headache or cephalgia is pain anywhere in the region of the head or neck. It can be a symptom of a number of different conditions of the head and neck. The brain tissue itself is not sensitive to pain because it lacks pain receptors. Rather, the pain is caused by disturbance of the pain-sensitive structures around the brain.",
          "context": "Health->Diseases and disorders->Neurological disorders->Headache",
          "broaderCategs": [
            "Diseases and disorders",
            "Neurological disorders"
          ],
          "language": "en",
          "timestamp": "Feb 17, 2012 10:56:00 PM"
        }
      ]
    }
  ]
}
The spotting result contains the spotted elements that were found in the specified content. Each spotted element has these features-

  1. Spotted Word: Word as it appears in the specified content
  2. Start and End Index: Word boundaries within the specified content. This is useful to locate the word within the specified content
  3. Elements: One or more elements in the knowledge base that define the spotted word
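
The relationship between the spotted word and its start and end index can be illustrated with a small dictionary lookup. This is only a sketch: the actual algorithm of the FORMCEPT Spotter Service is not described in this post, and the class SpotterSketch, its Spot type and the spot method are hypothetical names introduced here. It assumes that the end index is exclusive, which is consistent with the 97/105 boundaries reported for headache in the sample response.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, NOT the FORMCEPT Spotter implementation.
final class SpotterSketch {

    /** One spotted word with its boundaries in the content (end index is exclusive). */
    static final class Spot {
        final String spottedWord;
        final int startIndex;
        final int endIndex;

        Spot(String spottedWord, int startIndex, int endIndex) {
            this.spottedWord = spottedWord;
            this.startIndex = startIndex;
            this.endIndex = endIndex;
        }
    }

    /**
     * Finds every occurrence of each dictionary term, ignoring case.
     * Plain substring matching is used, so word boundaries are not checked.
     */
    static List<Spot> spot(String content, List<String> terms) {
        List<Spot> spots = new ArrayList<Spot>();
        String haystack = content.toLowerCase(Locale.ROOT);
        for (String term : terms) {
            String needle = term.toLowerCase(Locale.ROOT);
            if (needle.isEmpty()) {
                continue; // an empty term would match everywhere
            }
            int from = haystack.indexOf(needle);
            while (from >= 0) {
                int end = from + needle.length();
                // report the word exactly as it appears in the content
                spots.add(new Spot(content.substring(from, end), from, end));
                from = haystack.indexOf(needle, end);
            }
        }
        return spots;
    }

    public static void main(String[] args) {
        String content = "Symptoms vary depending on the location of the tumour and may be "
                + "slow in onset. Symptoms include headache, vomiting, nausea.";
        for (Spot s : spot(content, Arrays.asList("headache", "nausea"))) {
            System.out.println(s.spottedWord + " [" + s.startIndex + ", " + s.endIndex + ")");
        }
    }
}
```

With exclusive end indices, content.substring(startIndex, endIndex) returns the spotted word directly, which is what makes the boundaries convenient for highlighting.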

Each Knowledge Base Element has these features-

  1. URI of the Element
  2. Dataset associated with the Element  (For example, DBpedia)
  3. Standard label for the Element as defined in the Knowledge Base
  4. alsoKnownAs: Other labels by which the Element can be referred to
  5. Short description of the Element
  6. Hierarchical Context of the Element
  7. Broader Categories for the Element
  8. Language of the Element in which the label and other details are specified
  9. Timestamp of the last update

FORMCEPT Healthcare Enhancer converts these features into the Metadata Graph. The computeEnhancements method contains all the implementation to update the Metadata Graph with the above features of the Knowledge Base Elements.

As a first step, a text enhancement is created within the Metadata Graph for the spotted word. Since each spotted word can have one or more associated elements within the Knowledge Base, only the first element (closest match) defines the type of the enhancement. The type is derived from a class defined within the ontology of the dataset. If a type is not found, then the first category of the context is considered as the type. Here is a code snippet to create the text enhancement-
// get the literal factory used to create typed literals
LiteralFactory literalFactory = LiteralFactory.getInstance();
// get the metadata graph of the content item
MGraph metadata = ci.getMetadata();
// create the text annotation for the spotted word
UriRef elemAnnotation = EnhancementEngineHelper.createTextEnhancement(ci, this);
// add the spotted word as the selected text
metadata.add(new TripleImpl(elemAnnotation, ENHANCER_SELECTED_TEXT, new PlainLiteralImpl(elem.getSpottedWord())));
// add the type
metadata.add(new TripleImpl(elemAnnotation, DC_TYPE, new UriRef(elemType)));
// set the description
metadata.add(new TripleImpl(elemAnnotation, RDFS_COMMENT, new PlainLiteralImpl(knowledgeElem.getDescription())));
// set the context as broader categories
metadata.add(new TripleImpl(elemAnnotation, SKOS_BROADER, new PlainLiteralImpl(knowledgeElem.getContext())));
// add the start and end index of the spotted word
metadata.add(new TripleImpl(elemAnnotation, ENHANCER_START, literalFactory.createTypedLiteral(elem.getStartIndex())));
metadata.add(new TripleImpl(elemAnnotation, ENHANCER_END, literalFactory.createTypedLiteral(elem.getEndIndex())));
The spotted word is added as a selected text along with other metadata, like- description, context and the start/end fields.
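
The type fallback described above (use the ontology class when available, otherwise the first category of the context) can be sketched as follows. The class TypeResolverSketch and the method resolveType are hypothetical names, and the "->" separator is assumed from the context string in the sample spotter response. How the resulting name is turned into the URI passed to DC_TYPE is not covered in this post, so the sketch stops at the category name.

```java
// Hypothetical sketch of the type fallback, not the actual FORMCEPT code.
final class TypeResolverSketch {

    // Assumed separator, taken from "Health->Diseases and disorders->..."
    static final String CONTEXT_SEPARATOR = "->";

    /**
     * Returns the ontology type if one is present, otherwise the first
     * category of the hierarchical context, or null if neither is available.
     */
    static String resolveType(String ontologyType, String context) {
        if (ontologyType != null && !ontologyType.trim().isEmpty()) {
            return ontologyType.trim();
        }
        if (context == null || context.trim().isEmpty()) {
            return null;
        }
        int sep = context.indexOf(CONTEXT_SEPARATOR);
        String first = sep < 0 ? context : context.substring(0, sep);
        return first.trim();
    }

    public static void main(String[] args) {
        String context = "Health->Diseases and disorders->Neurological disorders->Headache";
        // No ontology class is known here, so the first context category is used.
        System.out.println(resolveType(null, context));
    }
}
```

For the headache example, the first category of the context is Health, which agrees with the type reported for the text enhancement in the enhancement result.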

Once the text enhancement is created, all the knowledge elements are added as an Entity Enhancement that refers to the text enhancement created for the spotted word. Here is a code snippet to create the entity enhancements for the spotted word-
// add other entities
for (FCKnowledgeElement entityElem : spottedElems) {
    // add a topic enhancement
    UriRef enhancement = EnhancementEngineHelper.createEntityEnhancement(ci, this);
    metadata.add(new TripleImpl(enhancement, RDF_TYPE, TechnicalClasses.ENHANCER_TOPICANNOTATION));
    // link the entity enhancement to the text enhancement of the spotted word
    metadata.add(new TripleImpl(enhancement, org.apache.stanbol.enhancer.servicesapi.rdf.Properties.DC_RELATION, elemAnnotation));
    // add link to entity
    metadata.add(new TripleImpl(enhancement, ENHANCER_ENTITY_REFERENCE, new UriRef(entityElem.getUri())));
    metadata.add(new TripleImpl(enhancement, ENHANCER_ENTITY_TYPE, OntologicalClasses.SKOS_CONCEPT));
    metadata.add(new TripleImpl(enhancement, ENHANCER_CONFIDENCE, literalFactory.createTypedLiteral(entityElem.getConfidence())));
    metadata.add(new TripleImpl(enhancement, ENHANCER_ENTITY_LABEL, new PlainLiteralImpl(entityElem.getLabel())));
    metadata.add(new TripleImpl(enhancement, RDFS_COMMENT, new PlainLiteralImpl(entityElem.getDescription())));
    metadata.add(new TripleImpl(enhancement, SKOS_BROADER, new PlainLiteralImpl(entityElem.getContext())));
}
Each entity enhancement is linked with the text enhancement using the DC_RELATION property. Stanbol User Interface groups all the related entities and links them to the external URI specified by the ENHANCER_ENTITY_REFERENCE property.
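
The grouping idea can be illustrated with plain Java collections. This sketch does not show how the Stanbol user interface is implemented, which is not covered in this post. It only assumes that each entity enhancement carries the URI of its text enhancement (the DC_RELATION value) and a label. The class and field names, as well as the URNs and the second label in main, are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of grouping entity enhancements by their relation.
final class EntityGroupingSketch {

    /** Minimal stand-in for an entity enhancement read from the Metadata Graph. */
    static final class EntityEnhancement {
        final String relation; // URI of the text enhancement (DC_RELATION)
        final String label;    // entity label (ENHANCER_ENTITY_LABEL)

        EntityEnhancement(String relation, String label) {
            this.relation = relation;
            this.label = label;
        }
    }

    /** Groups entity labels under the text enhancement they refer to, keeping insertion order. */
    static Map<String, List<String>> groupLabelsByRelation(List<EntityEnhancement> enhancements) {
        Map<String, List<String>> groups = new LinkedHashMap<String, List<String>>();
        for (EntityEnhancement e : enhancements) {
            List<String> labels = groups.get(e.relation);
            if (labels == null) {
                labels = new ArrayList<String>();
                groups.put(e.relation, labels);
            }
            labels.add(e.label);
        }
        return groups;
    }

    public static void main(String[] args) {
        // Made-up URNs and labels, for illustration only.
        List<EntityEnhancement> enhancements = Arrays.asList(
                new EntityEnhancement("urn:enhancement-text-1", "Headache"),
                new EntityEnhancement("urn:enhancement-text-1", "Example entity"),
                new EntityEnhancement("urn:enhancement-text-2", "Nausea"));
        System.out.println(groupLabelsByRelation(enhancements));
    }
}
```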


A typical enhancement result generated by FORMCEPT Healthcare Engine looks like-
{
  "@context": {
    "broader": "",
    "comment": "",
    "Concept": "",
    "confidence": "",
    "created": "",
    "creator": "",
    "end": "",
    "Enhancement": "",
    "entity-label": "",
    "entity-reference": "",
    "entity-type": "",
    "EntityAnnotation": "",
    "extracted-from": "",
    "Health": "",
    "language": "",
    "LinguisticSystem": "",
    "relation": "",
    "selected-text": "",
    "start": "",
    "TextAnnotation": "",
    "TopicAnnotation": "",
    "type": "",
    "xsd": "",
    "@coerce": {
      "@iri": [ ... ],
      "xsd:dateTime": "created",
      "xsd:double": "confidence",
      "xsd:int": [ ... ],
      "xsd:string": "creator"
    }
  },
  "@subject": [
    {
      "@subject": "urn:enhancement-bf848e5d-decf-49ae-eaed-248e9e476d29",
      "@type": [ ... ],
      "created": "2012-07-11T10:54:53.141Z",
      "creator": "org.apache.stanbol.enhancer.engines.langid.LangIdEnhancementEngine",
      "extracted-from": "urn:content-item-sha1-ea34dfcefbb6b4e10c5e1d70953708aa65e7dd69",
      "language": "en",
      "type": "LinguisticSystem"
    },
    {
      "@subject": "urn:enhancement-fd5f3241-f1ea-5b02-01b2-30986c2b7f90",
      "@type": [ ... ],
      "broader": "Health->Diseases and disorders->Neurological disorders->Headache",
      "comment": "A headache or cephalgia is pain anywhere in the region of the head or neck. It can be a symptom of a number of different conditions of the head and neck. The brain tissue itself is not sensitive to pain because it lacks pain receptors. Rather, the pain is caused by disturbance of the pain-sensitive structures around the brain.",
      "created": "2012-07-11T10:54:53.186Z",
      "creator": "org.formcept.engine.enhancer.FCHealthCareEnhancer",
      "end": 105,
      "extracted-from": "urn:content-item-sha1-ea34dfcefbb6b4e10c5e1d70953708aa65e7dd69",
      "selected-text": "headache",
      "start": 97,
      "type": "Health"
    },
    {
      "@subject": "urn:enhancement-fdc05670-18b8-1703-0210-86e2d12fa36b",
      "@type": [ ... ],
      "broader": "Health->Diseases and disorders->Neurological disorders->Headache",
      "comment": "A headache or cephalgia is pain anywhere in the region of the head or neck. It can be a symptom of a number of different conditions of the head and neck. The brain tissue itself is not sensitive to pain because it lacks pain receptors. Rather, the pain is caused by disturbance of the pain-sensitive structures around the brain.",
      "confidence": 1.0,
      "created": "2012-07-11T10:54:53.186Z",
      "creator": "org.formcept.engine.enhancer.FCHealthCareEnhancer",
      "entity-label": "Headache",
      "entity-reference": "",
      "entity-type": "Concept",
      "extracted-from": "urn:content-item-sha1-ea34dfcefbb6b4e10c5e1d70953708aa65e7dd69",
      "relation": "urn:enhancement-fd5f3241-f1ea-5b02-01b2-30986c2b7f90"
    }
  ]
}
Stanbol User Interface groups the linked entities as shown below. The type is shown as a heading, i.e. Health in this case. The spotted word, i.e. headache in this case, is shown under Mentions.


The results shown below were obtained by running the tests against the CALBC Corpora. As of now, we have tested against 1,725 test cases (42,368 words). We will continue to add more test cases from different medical classes (types) and report the performance.

Credit: The result shown above has been generated using the sgvizler tool, which connects to the SPARQL endpoint provided by a Fuseki server. The performance report was generated by the FORMCEPT Benchmarking tool, which uses the EARL schema. The concept of visualizing through a SPARQL visualizer has been adopted from Rupert's and Pablo's comments on the improvement request [STANBOL-652] for the Apache Stanbol Benchmark Tool. Thanks to both of them.


Type                 TP     FP    FN    Precision  Recall  F1
Disease              1826   132   881   0.9326     0.6745  0.7829
Disease, Drugs       1847   140   876   0.9295     0.6783  0.7843
Disease, Drugs, CC   1961   249   848   0.8873     0.6981  0.7814

CC: Chemical Compound, TP: True Positives, FP: False Positives, FN: False Negatives,
F1: F-Measure/F-Score
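
The Precision, Recall and F1 columns follow from the TP, FP and FN counts using the standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = 2 * precision * recall / (precision + recall). The sketch below recomputes them for the Disease row. The class name SpottingMetrics is hypothetical and this is not the FORMCEPT Benchmarking tool.

```java
import java.util.Locale;

// Hypothetical helper that recomputes the metrics from the raw counts.
final class SpottingMetrics {

    /** Fraction of reported annotations that are correct. */
    static double precision(int tp, int fp) {
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    /** Fraction of expected annotations that were found. */
    static double recall(int tp, int fn) {
        return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    /** Harmonic mean of precision and recall. */
    static double f1(int tp, int fp, int fn) {
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        return p + r == 0.0 ? 0.0 : 2.0 * p * r / (p + r);
    }

    public static void main(String[] args) {
        // Counts from the Disease row: TP = 1826, FP = 132, FN = 881
        int tp = 1826, fp = 132, fn = 881;
        System.out.println(String.format(Locale.ROOT, "Precision=%.4f Recall=%.4f F1=%.4f",
                precision(tp, fp), recall(tp, fn), f1(tp, fp, fn)));
    }
}
```

The same three calls with the counts of the other rows reproduce their values to four decimal places.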


The table given below shows the time taken by the spotting algorithm to spot the annotations out of 42,368 words present in the 1,725 test cases. The table also lists the number of entities present in the Knowledge Base out of which the annotations are identified.

Type        Entities  E1(sec)  E2(sec)  E3(sec)  Avg(sec)  Min(sec)
Disease     5156      0.125    0.096    0.083    0.101     0.083
DiDr        9814      0.155    0.135    0.125    0.138     0.125
DiDrCC      16487     0.209    0.195    0.182    0.195     0.182
DiDrCCSp    185020    0.289    0.284    0.214    0.262     0.214
TypeCateg   221572    0.428    0.424    0.418    0.423     0.418

E1, E2 and E3 represent the execution times of three independent runs of the test cases

FORMCEPT Spotter builds an in-memory model of the entities to annotate the content. The table given below mentions the amount of memory consumed and the time taken to build the in-memory model.

Type        Entities  Processor        Memory   Time (sec)
Disease     5156      i5-2400 3.10GHz  23 MB    1.081
DiDr        9814      i5-2400 3.10GHz  35 MB    1.380
DiDrCC      16487     i5-2400 3.10GHz  182 MB   2.152
DiDrCCSp    185020    i5-2400 3.10GHz  1.14 GB  11.821
TypeCateg   221572    i5-2400 3.10GHz  1.39 GB  19.880

Entities: Total number of entities present in the Knowledge Base
DiDr: Disease and Drugs, DiDrCC: Disease, Drugs and Chemical Compound,
DiDrCCSp: Disease, Drugs, Chemical Compound and Species,
TypeCateg includes Disease, Drugs, Chemical Compound, Species, Health, Microbiology, Biology, Perception, Medical diagnosis and Medicine


  • The results reported a high number of false negatives for each type. Here are the reasons for the high number of false negatives-
    1. C2/C3/C4/C6/C6D/C7/C9 deficiency and hematopoiesis were not marked as diseases within the Knowledge Base
    2. Close to 70% of the false negatives consisted of abbreviations, like- BMD, DMD, MJD, FAP, ALD, AMN, CL/P, PWS, VWS, CP, UPD14, DM, RCCs, MHP, WAS, CTX, DRD, HPD, VHL, HD, AS, AGU, MPS VII, FRDA, ASPA, etc. Full forms for these abbreviations have already been captured and some of the abbreviations were ambiguous
    3. Annotations, like- deficiency of norrin, deficiency of the enzyme, abnormal growth of lymphocytes are phrases rather than exact terms
    4. Annotations, like- apoptosis, lesions, Enlarged vestibular aqueduct were not included within the Knowledge Base
  • There were a few more annotations that were not captured by the Knowledge Base. Some of them were- hyperalphalipoproteinemia, CETP deficiency, anhaptoglobinemia, Atm-deficient, cardiac abnormality, diurnal fluctuation, Peter's anomaly, platyspondyly, Axenfeldt anomaly, neurogenetic disorder, fibular overgrowth, facial lesions, Chronic neisserial infection, hemochromatosis, skin pigmentation, lymphoid malignancy, hitch hiker thumb, GPI-anchor deficiency, attenuated polyposis, iminodipeptiduria, hair-follicle morphogenesis, pseudoglioma, hyperphenylalaninemic, Morphological abnormalities, cPNETs, Duarte 2 and microvesicular steatosis
  • The number of true positives increased as types were added, but so did the number of false positives. The false positives were identified as drug names and chemical compounds that were not annotated in the Corpus.
  • Memory requirements can be further reduced by keeping only the spotted element IDs in memory.

We will continue to add more datasets to reduce the number of false negatives and incorporate the missing annotations.