Analytics – the Game Changer in the Evolving Life Insurance Sector of India

In our previous article, Transforming Indian Insurance Sector with Data Analytics Edge, we explored how the Indian insurance sector can transform its customer acquisition, customer retention and share of customer’s wallet by adopting advanced, actionable analytics. In this article, we will zero in on one of the most competitive segments of the insurance sector – the Life Insurance segment.

Need for Advanced Analytics in Life Insurance

Traditional Analytics Cannot Tackle Complex Variables Like ‘Trust’ and ‘Confidence’

Globally, the Indian life insurance sector is one of the largest, with about 360 mn [1] policies that are expected to grow at an annual average of 12-15% [2] over the next five years. Since the country’s population still remains largely uninsured, competition is severe, products are by and large similar, and high growth is constantly tempered by profitability challenges. One key aspect that affects how the life insurance segment functions is this – when people buy insurance, they buy a promise, or a future guarantee. Unlike other commodities, insurance cannot be consumed immediately by the customer. This leads to a key characteristic of the sector – the customer’s trust is the most important factor in securing growth and profitability.

We were recently in conversation with life insurance veteran Pinaki Mullick, who has spent over fifteen years in this industry and held the position of a national head in his last stint with a leading insurance provider in the country. Talking about the importance of gaining the customer’s trust, he says, “If you look at the graph of the top insurance companies in India during the last five or seven years, you will see that some of the top-ranked companies have slid down in terms of performance, while there are others who have climbed upwards and onwards. What has made this possible? The stark contrast is between those who ignored customer’s trust, and those who made it their center-piece. And how do you build trust? By knowing your customers – that’s exactly where analytics comes in.”

Mullick’s words are echoed in some of the recent findings on the sector. According to a recent study [3] conducted on 1,900 insured and uninsured people based in Metro and Tier 1 cities, less than 50% of those with life insurance are very confident that they have purchased the right policy, or that their policy is very valuable. Strangely enough, there is a further drop in the level of confidence for customers who purchased the policy directly from an insurance company or through a bank, the most preferred mode of purchase being through agents. Moreover, 20% of the respondents said that they don’t have any trust in the way insurance is sold, or the information they’re provided.

Traditional Analytics Depicts ‘What’ and ‘When’, Not ‘Why’ and ‘How’

Let’s face it – in the financial year 2016, the life insurance industry exhibited an average 13th-month persistency of 61% [4]. What does this mean? In simple terms, one year after sale, only 61 out of every 100 policies were renewed. And the woes don’t end there – five years after sale, only 33.33% [5] of the policies continue to exist, the rest having been discontinued prematurely along the way. What does this mean for the profitability of the insurer? Disaster.

Customer acquisition and retention costs are a huge overhead burden on the insurer’s books. When a policy is discontinued prematurely, the costs of opening it in the first place cannot be recovered, dealing a severe blow to profitability. Madhuri Deb, another industry veteran with over a decade of leadership experience in sales and policy life-cycle management, strikes a chord as she tells us about the vicious cycle of policy lapses. “In life insurance, the field-level salesforce – the pillars of the industry – are not empowered with tools and insights to make informed judgments in everyday decision making. This is at the root of insurance providers not being able to use data and analytics to business advantage.”

“Besides,” she adds, “the picture is more complicated than it seems. We deal with customers and field-staff who are comfortable in vernacular language, but are intimidated by English – the language used in most analytics software. In the absence of predictive analysis at the grassroots level, policy lapses are a vicious cycle of customer dissatisfaction, lack of intelligence on the reasons behind lapses, and policies foregone prematurely. Traditional analytics is not actionable, and does not incorporate the human component and textual / linguistic aspects in a contextualized manner.”

Traditional Analytics Cannot Keep Up with Evolving Life Insurance Sector

In the last couple of years alone, with IRDA’s (Insurance Regulatory and Development Authority) blessings, phenomenal changes are all set to transform the way life insurance works in India. To begin with, the emergence of PoS (Point of Sale)-based insurance products will be a game changer in making analytics indispensable. From penetration levels hovering around 4.60% in life insurance in 2009 (post the global downturn in 2008), we have hit a new low of 2.72% as of March 2016 [6], prompting IRDA to introduce innovative channels like PoS. PoS Saral Nivesh, the country’s first PoS life insurance product, was launched recently by Edelweiss Tokio Life Insurance. Spurred by cutting-edge analytics, the product is expected to complete a policy purchase transaction within 20 minutes, along with verifications [7].

Another major change comes in the form of IRDA mandating an increase in premiums through a series of recent circulars [8], forcing insurers to revisit their assumptions of growth, profitability and pricing. The third most anticipated regulatory change is the much debated possibility of IRDA introducing life insurance portability [9] – under which unhappy customers of an insurance player can shift to a similar product of a competing player instead of closing the policy. With these changes, competition and the battle for survival are all set to get tougher than ever!

Mullick offers a simple yet intuitive way-forward. “Instead of losing sleep over the intricacies, let’s learn lessons from our foreign counterparts, who have spent centuries in this domain, while we have barely completed a couple of decades. They perceive insurance analytics as a two-fold feedback loop – one from the happy customers, and one from the unhappy or ‘orphan’ customers. They set up elaborate processes capturing actionable analytics in each of these segments, and ensure that each decision by each employee is data-oriented and analytics-driven. Many Indian insurers have partnered with foreign players. The ones who have derived lessons from their foreign counterparts and applied the same in the Indian context, have made it to the top.”

Mullick’s thoughts are echoed in an article by Analytics India Magazine, in which a senior official from a life insurance provider that is a joint venture between an Indian and a foreign player speaks on the analytics trends in the industry and the challenges that come bundled with them.

Concluding Note

So, what are the various kinds of advanced analytics that can catapult the life insurance sector to the next level?

The pillars will be formed by the four key components – Descriptive Analytics, Predictive Analytics, Diagnostic Analytics and Prescriptive Analytics. The technology of the future will be a composite of whole-brain analytics that incorporates both rational and emotional components in capturing and analyzing information. The rise of IoT (Internet of Things), Robotics and AI (Artificial Intelligence), along with advanced methodologies such as text mining and deep learning, will be the mainstay of whole-brain analytics in the Indian life insurance sector. In our next article, we will share with you some of the innovative solutions that FORMCEPT has conceptualized for India’s evolving life insurance sector.


Is the Future of Indian Banking Dependent on Actionable Analytics?


Growth Challenges to Indian Banking Sector

With the current 26 public sector banks, 25 private sector banks, 43 foreign banks, 56 regional rural banks, 1,589 urban cooperative banks and 93,550 rural cooperative banks [1], India is poised to become the world’s third largest domestic banking sector by 2050 [2]. However, in the face of unprecedented shifts in consumer expectations and explosive growth of mobile technology and ICT, traditional banks are struggling to attain the business basics – growth, profitability and market share. Ironically, demonetization, which some of us expected to catalyse the growth of formal banking in India, laid bare its weaknesses instead – exposing its deepest vulnerabilities and weak links.

The Demonetization Effect

Over INR 14 lakh crores worth of currency (accounting for 84% of the cash in circulation) was withdrawn during demonetization [3]. But when it came to reaping the benefits of a colossal cash-crunched economy, banks were comfortably outpaced by e-wallets. PayTM alone added millions of subscribers within a few days of the demonetization move. Villages and towns with populations below 100,000 now account for 20 percent of PayTM’s top line, as opposed to 2 percent earlier, thanks to demonetization. Nearly 200 million transactions worth ~ USD 750 mn were conducted through PayTM in the month of January 2017 alone [4].

New Competition – Arrival of Payment Banks

And then, with the Reserve Bank of India (RBI) granting in-principle approvals to 11 out of the 41 applicants for payments banks, we are staring at a tectonic shift in the way bankers survive and thrive against the heat of competition [5]. What will be the key to reviving customer acquisition, retention and increasing the share of customer’s wallet by Indian banks? And how critical is analytics on that front? Let us explore.

Scope of Customer Acquisition in Indian Banking

Digital Payments

It is not just in India – bankers across the world are struggling to answer the question of how to acquire the maximum possible customers at the minimum possible cost per acquisition. Customer acquisition by a company refers to the process of converting prospects and inquiries to new customers by persuading them to buy the company’s products and services. First of all, is there room for expansive growth in customer acquisition in Indian banking? Let us look at customer acquisition in digital payments, for example. As of 2016, there are 616 mn [6] unique mobile subscribers and 970 mn [7] Aadhaar Card holders in India. And yet, the digital payments sector has a paltry 60-80 mn [8] user base – leaving massive space for growth for banks in this domain!

‘New to Bank’ and NPA

According to a recent report, bringing the ‘new-to-bank’ category of population into the fold will be key to India’s customer acquisition spree in the banking sector [9]. India’s current unbanked population is over 160 mn [10] – most of whom reside in rural areas. Only 27 per cent [11] of villages in India have a bank within a five km radius.

It is not just mindless acquisition that counts, however. India has a high account dormancy rate of 43 per cent (against 15 per cent globally). The non-performing assets (NPAs) of banks rose by over 56 per cent during the calendar year 2016 [12].

Fostering Data-Driven Customer Acquisition

In view of the above, it is imperative that Indian bankers streamline customer acquisition strategy using analytics and informed judgement. Firstly, for acquiring new adopters of modern technology such as digital payments and omni-channel banking, it will be pertinent to mine existing data to identify the barriers to such adoption. A recent example of a bank using analytics to formulate effective customer acquisition strategy is the launch of ‘811’ – a zero balance, zero charge account – by Kotak Mahindra Bank. The bank promises to set up an account for the customer digitally within five minutes, and bring down customer acquisition cost drastically [13].

Secondly, to cut down the surge in account dormancy and NPAs, data-driven models could profile ‘profitable’ customers based on demographic and personal characteristics and enable targeted customer acquisition. Thirdly, to capture the ‘new to bank’ category, aspects such as the relative importance of fungibility, the factors causing insecurity during adoption, and the improved convenience offered by banking can be profiled using deep data analytics and used to advantage in expanding customer acquisition.

Lastly, banks acquiring customers using multiple channels over a period of time can rationalize their channel strategy by identifying which channel is performing better and generating more attractive returns. For example, a KPMG LLP-UBS study shows that “the cost of effecting a transaction through a branch is 43 times higher than that of a mobile channel, and Internet banking is twice as expensive as that of the mobile channel. Consequently, the mobile channel seems to be the way forward.” [14]

Customer Retention – Fixing the Leaking Pipe of Indian Banking

If you think that the plight of Indian bankers would be over with better customer acquisition, you could not be further from the truth. In fact, the customer leakage that banks face today is probably more worrisome than the acquisition challenges. Conceptually, customer retention envisages the activities undertaken to retain the maximum number of customers by securing customer loyalty towards the company / brand.

Customers are Spoilt for Choice

According to a recent report, users of digital channels are more likely to switch banks due to low customer loyalty [15]. In fact, customer stickiness has become a wishful proposition, especially for private sector banks that offer almost similar product benefits and user conveniences. It is not unusual for the urban account holder today to have multiple accounts. Leveraging relationship management to win loyalty seems to work less and less as face-time between customers and relationship managers reduces in the digital age.

Analytics – Key to Securing Customer Loyalty

Creating a seamless experience for the customer in the banking value chain, launching products targeted to customer’s investment and credit behaviour, identifying the critical touch points in a customer’s lifecycle (education, marriage, family planning, retirement, etc.), effectiveness of third-party loyalty programmes like PAYBACK, and risk perceptions of different customer segments are some of the key parameters which can be conducive to analytics-driven customer loyalty for Indian banks.

Share of the Customer’s Wallet

Beating the Nexus Between e-Commerce and Payment Aggregators

Share of wallet (SOW) refers to the proportion of the customer’s total spending that a business attains through its products and services. The nexus between innovative payment options (mobile money, e-wallets and payment aggregators) and the burgeoning e-commerce segment is already eating away a sizeable share of the cash flows and revenue streams of the traditional banker. A recent case in point is Amazon receiving approval for its much publicised e-wallet – Pay Balance – from the RBI. Pay Balance offers an easy and convenient one-click payment system as opposed to the two-step authentication system that is currently popular in India. Another example is MSwipe, which offers an alternative to the point of sale (PoS) machines provided by banks.

Harnessing Customer Advocacy

The gaping divide between a large volume of transactions and only a small bite of the wallet share means that growth is often divorced from profitability. With actionable analytics on the optimum cross-selling strategy, coupled with customer advocacy, a larger pie of the customer’s wallet can be targeted.

Concluding Note

Deep data and credible information are now, more than ever, the drivers of growth and profitability in Indian banking. Given the information load that banks deal with every day, generating sizeable analytics targeted to specific pain points should come in handy for this industry. What remains to be seen is whether our bankers are ready to take the plunge. In the next article, FORMCEPT will bring to you the various solutions that can help banks alleviate challenges using analytics and informed decisions.



Transforming Indian Insurance Sector with Data Analytics Edge

Insurance Industry – A Late Adopter to Analytics

With India’s insurable population projected to reach 750 mn [1] in 2020, the insurance penetration still hovers below the world average (3.9% against a global average of 6.3% in 2013 [2]). However, the woes of the industry are far more than just that. Lurking behind steep growth are burgeoning customer acquisition costs, high customer churn that leads to lower retention, and cut-throat competition among players vying for a larger pie of the customer’s wallet.

Customer Acquisition

To begin with, most Indian insurers employ a salesforce consisting of agents who sell directly to customers and in return get paid fat commissions that shoot up the costs of acquiring new customers. Customer acquisition by a company refers to the process of converting prospects and inquiries to new customers by persuading them to buy the company’s products and services. As of 2011, operating expenditures and customer acquisition costs of Indian insurance companies accounted for 25% to 50% [3] of the total annual premiums.

Customer Retention

The challenges to customer retention are multi-faceted. High rate of attrition among customers translates into too many policies being returned or lapsed way before they enable adequate premium income for the insurer to become profitable. Customer retention envisages the activities undertaken to retain the maximum number of customers by securing customer loyalty towards the company / brand. The key to customer retention is knowing your customer well enough to facilitate meaningful engagement. To put things into perspective, most Indian insurers collect a large amount of customer data during the initial stages of customer onboarding, i.e. the application process. As policy lifecycle changes from filing a new application to its actual usage and recurring premium payments, the data generation practically ceases. In the absence of targeted customer loyalty programmes, and given that insurance is still not exactly a favourite of the Indian household, customer relationships often wane leading to high churn rates.

Share of Customer’s Wallet

The fresh scramble among insurance players to secure a higher share of the customer’s wallet is in the light of the fact that increasing a company’s pie in the customer’s wallet is often a cheaper way of bolstering revenue than increasing the company’s share in the market. Share of wallet (SOW) refers to the proportion of the customer’s total spending that a business attains through its products and services. However, in the absence of actionable customer analytics and insights into their wallet spend, this remains a wishful proposition.

Even as Indian insurers struggled to make ends meet, it took them five years (from 2005 to 2010) to finally start selling policies online. The price for late adoption of data analytics has been paid by both insurers and customers – the Indian insurance sector lost a mind-boggling INR 30,401 Cr [4] (~ 9% of the industry worth in that year) to frauds and scams in 2011.

Data Analytics in Indian Insurance Sector

Data analytics in Indian insurance sector can be perceived as a three-pronged tool – Marketing Analytics, Loyalty Analytics and Risk Analytics.

Marketing Analytics include analytics that drive efforts to maximize fresh influx of customers and attain a higher share of the customer’s wallet, such as promotion campaign analytics, segmentation and targeting analytics, price & premium optimization and marketing mix modelling. Loyalty Analytics are targeted towards optimizing customer retention rates by establishing touch-points for customer engagement on one hand and alleviating customer grievances and doubts on the other. These include analytics on customer satisfaction assessments, customer churn analytics, reduction of claims settlement periods, personalization of customer experience, claims settlement optimization, and customer life-time-value analysis.

Risk Analytics are more for the insurer than for the customer, but also with substantial spill-over effects on the latter. These include risk optimization through analytics such as analytics for fraud detection & management, operationalizing claims approval & risk scorecards, actionable analytics on policy renewal & revival, predictive loss forecasting & modeling.

Customer Analytics for Future Revolution

According to a report by BCG-Google, 75% of insurance policies in India are anticipated to be influenced by digital channels by 2020 [5]. This essentially translates into a vast online ecosystem to catalyse growth and profitability for insurers. Using multi-modal data analysis, they can zoom into both structured data (application and policy data, for example) and text-based data (such as reports and experiential data on social media) to design more powerful products, formulate correct pricing and fuel better acquisition followed by improved customer stickiness.

Data analytics has been identified as one of the most pressing issues for insurance companies by a PwC report on ‘Top Insurance Industry Issues 2014’. Even though the industry virtually sits on a data tank, it lacks the tools and corporate will-power to gain business advantage from that data. According to global risk solutions provider LexisNexis, the present expenditure on data analytics by insurance companies in India is far too low [6].

How far the Indian insurers are ready to walk the extra mile to enter into the analytics fold is yet to be seen – however, the fact remains that analytics will be the key to growth and profitability in the years to come. In our forthcoming blog, we will bring to you the various solutions that FORMCEPT can offer to insurers to enable higher customer acquisition, better retention and larger share of the customer’s wallet.


Locality-Sensitive Hashing on Spark with Clojure/Flambo

Record Linkage is the process of finding similar entities in a dataset. Using this technique, one can implement systems such as Plagiarism Detectors (which are able to identify fraudulent scientific papers or articles), Document Similarity (finding similar articles on the internet), Fingerprint Matching, etc. The possibilities are endless. But the topic we are focusing on in this article is De-Duplication [*], which is the process of finding (and removing if need be) duplicates from a dataset.

Why do we need this? The answer is simple: removing, or at least identifying, redundant data saves space (memory, disk space, etc.) and/or time (avoiding unnecessary or repeated computation on duplicate data). One can simply go about doing this by comparing each entity in the dataset with every other entity, finding a similarity score between those entities and doing some more computations on it depending on the application you are building.

But think of it this way: if there are six strings in a dataset, [“aa”, “ab”, “ac”, “xx”, “xy”, “xz”], and I want to find possible duplicates in it, I’d rather compare “aa” with only “ab” and “ac” than with the entire dataset, because clearly [“aa”, “ab”, “ac”] are sort of similar to each other and way different from [“xx”, “xy”, “xz”]. There is no merit in comparing everyone with everyone.

The computation time of the compare-everything approach is O(n*(n-1)), i.e. quadratic, where ‘n’ is the number of entities in the dataset. This is all good when the dataset is very small. But when the data becomes very “Big”, the awful computation time of O(n*(n-1)) just won’t cut it. So we need a technique that reduces the number of candidates to be compared with a particular entity, i.e. one that generates “similar” subsets from the main dataset and then runs separate de-duplication tasks on the smaller datasets in a distributed manner, which greatly reduces the overall running time of the program.
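To make the saving concrete, here is a minimal sketch in plain Clojure (no Spark) that counts comparisons for the six-string example. Grouping by the first character is only a stand-in for a real locality-sensitive hash, and the names `dataset` and `unordered-pairs` are illustrative, not part of any library.

```clojure
(def dataset ["aa" "ab" "ac" "xx" "xy" "xz"])

(defn unordered-pairs
  "All pairs [a b] where a comes before b in coll."
  [coll]
  (for [[a & more] (take-while seq (iterate rest coll))
        b more]
    [a b]))

;; Brute force: every entity against every other entity.
(def brute-force-count (count (unordered-pairs dataset)))

;; Grouped: only compare entities that landed in the same group.
;; `first` (the first character) is an assumed stand-in for an LSH bucket key.
(def grouped-count
  (->> (group-by first dataset)
       vals
       (mapcat unordered-pairs)
       count))

(println brute-force-count grouped-count) ; prints "15 6"
```

Counting unordered pairs gives n*(n-1)/2 = 15 comparisons for the full dataset, against 3 + 3 = 6 once the strings are split into two groups of three.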

So how do we achieve this? How do we intuitively create smaller “similar” datasets from the “bigger” main dataset without wasting too much time in pre-processing?

LSH to the rescue! It stands for Locality-Sensitive Hashing, and it is one of the most common and convenient algorithms for Document Similarity (in my opinion, of course). The best part about this algorithm is that when one hashes the entities (documents or just strings) using LSH, all the “similar” entities tend to have similar hashes. From then on, it is just a matter of grouping the entities by their hash values, which gives you smaller datasets, and then finding duplicates in a distributed manner. Enough explanation, let’s jump straight into the code. The language of choice is Clojure, and we will be writing it for Apache Spark using a Clojure DSL called flambo.

We will be using two external libraries for this, flambo and a hashing library written on top of Google’s Guava, aesahaettr.

You can use this test file (contains restaurant details from two guides, fodors and zagats, and has about 112 duplicates in it) to play around with.

(require '[flambo.api :as f]
         '[flambo.conf :as conf])

(def c (-> (conf/spark-conf)
           (conf/master "local[2]")
           (conf/app-name "De-Duplication")))

(def sc (f/spark-context c))

(def rdd (f/text-file sc "/path/to/file"))

First things first, we need to generate shingles of the rows (strings) of the RDD. The reason for doing this is that the hash values of the shingles of two similar strings are far more likely to match than the hash values of the entire strings.

(defn k-shingles
  [n s]
  (let [;; map of character index -> character
        indexed-str-seq (into {}
                          (map-indexed (fn [idx itm] {idx itm}) (seq s)))
        ;; for every index that can start a full shingle, join the next n characters
        shingles (->> (map
                        (fn [[idx str-seq]]
                          (if (<= idx (- (dec (count indexed-str-seq)) (dec n)))
                            (reduce str
                              (map #(indexed-str-seq % "") (range idx (+ idx n))))))
                        indexed-str-seq)
                      (filter #(not (nil? %))))]
    shingles))

The function above requires two arguments, ‘n’ -> shingle size and ‘s’ -> The string.
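For comparison, the same idea can be sketched with a sliding window from `clojure.core`. This is an alternative formulation rather than the function used in the rest of the article, and the name `shingles` is only illustrative.

```clojure
(defn shingles
  "Character n-grams of s, produced by a sliding window of width n and step 1."
  [n s]
  (map #(apply str %) (partition n 1 s)))

(shingles 2 "punit") ;=> ("pu" "un" "ni" "it")
(shingles 1 "punit") ;=> ("p" "u" "n" "i" "t")
```

A string of length m yields m - n + 1 shingles, so a string shorter than n yields none.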

NOTE: If your strings are very short, e.g. if the rows of the RDD are just first names, you can simply use the individual characters of each string as its shingles:

(map str (k-shingles 1 "punit"))

After this we need to hash each generated shingle of the string ‘X’ number of times. This is again done to improve the chances of hash values matching.

(require '[æsahættr :as hasher])

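(defn gen-hash-with-seeds
  ;; one murmur3 hash function per seed
  [seeds]
  (map #(hasher/murmur3-32 %) seeds))

(defn hash-n-times
  ;; for every shingle, returns a list of n integer hashes (one per seeded hash function)
  [n shingles-list]
  (let [hash-fns (gen-hash-with-seeds (range n))]
    (map
      (fn [x]
        (map
          (fn [y] (hasher/hash->int (hasher/hash-string y x)))
          hash-fns))
      shingles-list)))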
Now it is time to generate the MinHash Signature of that string (or document). We do this by taking the lists of hashed values (where all of them have the same size) and finding the minimum hash value at a position ‘i’ from every list thereby generating a single list of hash values which is the minhash for that string.

(defn min-hash
  ;; l is the list of per-shingle hash lists; keeps the minimum at every position
  [l]
  (reduce (fn [x y] (map (fn [a b] (min a b)) x y)) l))
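As a small worked example of the position-wise minimum, the sketch below feeds hand-picked integers (not real murmur3 output) through the same reduction. The input stands for three shingles hashed three times each.

```clojure
(defn min-hash
  "Position-wise minimum across equally sized lists of hash values."
  [hashed-lists]
  (reduce (fn [x y] (map min x y)) hashed-lists))

;; Three shingles, each hashed with three (hypothetical) hash functions.
(min-hash [[5 2 9]
           [3 8 1]
           [7 0 4]])
;=> (3 0 1)
```

Position 0 keeps min(5, 3, 7) = 3, position 1 keeps min(2, 8, 0) = 0 and position 2 keeps min(9, 1, 4) = 1, so the signature has one entry per hash function regardless of how many shingles the string has.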

Now we partition the minhash signature into smaller ‘bands’ and then hash each of them one final time. The purpose of doing this is that candidate (or similar) strings will have at least one matching ‘hashed’ band, so we can group the strings by their ‘hashed’ bands and generate candidate lists.

(defn partition-into-bands
 [band-size min-hashed-list]
 (partition-all band-size min-hashed-list))

(defn band-hash-generator
  ;; hashes every band with its own negatively seeded hash function
  [banded-list]
  (let [r (range -1
            (unchecked-negate-int (inc (count banded-list))) -1)
        ; Incrementing "band-size" because we are starting from -1
        hash-fns (gen-hash-with-seeds r)
        hashed-banded-list (map
                             (fn [x y]
                               (hasher/hash->int
                                 (hasher/hash-string x
                                   (clojure.string/join "-" y))))
                             hash-fns banded-list)]
    hashed-banded-list))

; Output of "partition-into-bands" is the input
; for "band-hash-generator"
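To see why banding produces candidates, here is a minimal sketch that compares the bands of two made-up signatures directly, skipping the final hashing step. The values in `sig-a` and `sig-b` are invented for illustration, and `bands` is just a thin wrapper around `partition-all`.

```clojure
;; Two hypothetical minhash signatures that agree in their first two positions.
(def sig-a [3 0 1 8])
(def sig-b [3 0 7 2])

(defn bands
  "Splits a signature into consecutive bands of band-size values."
  [band-size signature]
  (partition-all band-size signature))

(bands 2 sig-a)                          ;=> ((3 0) (1 8))
(map = (bands 2 sig-a) (bands 2 sig-b))  ;=> (true false)

;; partition-all keeps a shorter trailing band when the sizes do not divide evenly.
(bands 2 [3 0 1 8 6])                    ;=> ((3 0) (1 8) (6))
```

The two signatures differ overall, yet their first bands are identical, so hashing each band would place both strings under the same key for that band and make them candidates for a full comparison.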

After this we will have our RDD in the following form

RDD[String, List(Int)]

After this we have to write code that maps through the List of hash values and emits a Key-Value pair of [hash value, String] for each. Then you use the “combine-by-key” function of flambo to gather all the Strings (or Docs) with the same hash value. The only minor issue in this case is that when two strings have multiple matching bands, you will still have to collect the candidate list for all of them and then apply a distinct on the sorted candidate lists. Now it is only a matter of comparing the strings in all the candidate lists. The method that we generally use to compare strings is Levenshtein Distance. You can also set a threshold parameter so that strings are classified as duplicates only if the Levenshtein Distance between them does not exceed it (a smaller distance means more similar strings).
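The grouping and comparison steps can be sketched in plain Clojure, with ordinary collections standing in for the RDD and `group-by` standing in for flambo’s “combine-by-key”. Everything here is illustrative: the band hashes in `sample` are made up, and `levenshtein`, `candidate-groups` and `duplicates` are hypothetical helper names rather than library functions.

```clojure
(defn levenshtein
  "Edit distance between strings a and b (row-by-row dynamic programming)."
  [a b]
  (peek
    (reduce
      (fn [prev-row [i ca]]
        (reduce
          (fn [row [j cb]]
            (conj row
                  (min (inc (peek row))                           ; insertion
                       (inc (nth prev-row (inc j)))               ; deletion
                       (+ (nth prev-row j) (if (= ca cb) 0 1))))) ; substitution
          [(inc i)]
          (map-indexed vector b)))
      (vec (range (inc (count b))))
      (map-indexed vector a))))

(defn candidate-groups
  "Takes [string band-hashes] pairs and returns the distinct sets of
   strings that share at least one band hash."
  [signed-strings]
  (->> (for [[s hashes] signed-strings, h hashes] [h s])
       (group-by first)
       vals
       (map #(set (map second %)))
       (filter #(> (count %) 1))
       distinct))

(defn duplicates
  "Pairs within a candidate group whose edit distance is at most threshold."
  [threshold group]
  (for [[a & more] (take-while seq (iterate rest (sort group)))
        b more
        :when (<= (levenshtein a b) threshold)]
    [a b]))

;; Made-up band hashes: the first two rows share the bands 11 and 7.
(def sample [["arts deli"  [11 42 7]]
             ["art's deli" [11 99 7]]
             ["spago"      [5 6 8]]])

(mapcat #(duplicates 2 %) (candidate-groups sample))
;=> (["art's deli" "arts deli"])
```

Collecting each group into a set and applying `distinct` handles the case where two strings match on several bands: the pair is still compared only once.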

[*] De-Duplication is a subset of a larger topic, Document Similarity, which is mentioned above.



Unlock Insights to Boost User Experience Online

FORMCEPT has achieved a pioneering position in extracting in-depth insights from piles of data. An interesting testimony to the degree of our influence in the data analytics solutions space is our recent collaboration with ESPN Cricinfo to deliver a data analytics solution on our patent-pending platform, coinciding with the Cricket World Cup 2015. The solution, built for one of the leading providers of high-value cricket analysis, news and trends, fits well with its popularity as a trusted authority on the game.

The visually pleasing data points are arranged in tabular and chart form for easy readability. The ability to extract insights from the site helps deliver a superior level of engagement with web visitors, irrespective of how they relate to cricket as a game.

Figure: Player Profile Analysis

Figure: Records – Team-wise Wins

Some of the ways in which it provides engrossing and interesting data analytics experience to different categories of website visitors are as below -

1. A Team Manager

a. Single View – A team manager can easily correlate past performance and gain an idea of a player’s form over a period of time stretching back to last 10-15 years.

b. Decision Making – It helps him make an insightful decision on whether to select, retain or drop a player based on his strengths and performance.

c. All-inclusive – The holistic data platform from FORMCEPT allows a well-thought view of how a player performs in different formats of the game. So if you are a team manager and want to see if a player is fit for IPL, you can check out his T20 record.

2. A Player

a. Visually Attractive – A player can gain invaluable performance insights in a highly interactive tabular and graphical format.

b. Performance Analysis – A batsman can conduct a meticulous self-analysis with help of detailed breakdown. This helps him to carry out an all-inclusive data-backed SWOT analysis and thus improve his game based on the website’s insights.

c. Up-to-date – The platform collates data in near real time. It allows a player to see latest figures in an easy to read manner and do a competitive analysis.

d. Takeaways – After a prolonged duration, it becomes difficult to track what a player’s strong points and weak points are, or which team or player he is susceptible to. The impactful visualizations help him glean useful data on the performance trend of the recent past and take corrective action for the future.

3. End Users (fans)

a. Engrossing – As a fan, you quickly get interesting facts about your favorite player or team. For any cricketer, you get insights on various aspects of his performance.

b. Authority – You come across as a knowledgeable authority on the game of cricket when you share visually appealing stats over social media with your friends and acquaintances.

c. Interesting Data Cuts – Fans get practically innumerable ways to look at data, filtered by parameters such as match format, player, opposition, country played, ground played and year, for both batting and bowling.

4. ESPN Cricinfo site

a. User Loyalty – As a result of the large amount of time visitors spend exploring the interactive platform, the chances of sales conversion and business revenue are higher than for its competitors.

b. Competitive differentiator – The site gains a competitive upper hand from the platform’s potential for repeat visits and its captivating content.

c. Rewarding User experience – The platform provides a highly appealing and immersive online experience for website visitors, thereby giving the site a distinct appeal.

Here is a glimpse of the Insights interface-

Sachin Tendulkar and Opposition

Sachin Tendulkar - Pace vs Spin

Sachin Tendulkar at World Cup

The ESPN Cricinfo site successfully merges ESPN’s proven cricket expertise with FORMCEPT’s technical acumen. The outcome is a visually stunning data analytics and insight platform that increases the value of the information consumed by website visitors.

Posted in Analysis, FORMCEPT

Nolan Scheduler

How often have you come across requirements that demand tasks to be performed repeatedly at a defined interval? Yes, I am talking about a scheduler, but a simple yet powerful one that lives up to its name: it just schedules. That is what Nolan Scheduler is all about.

Kuldeep, a champion Clojurist, wrote the library and it is now an important part of the FORMCEPT platform. It schedules all the jobs within the platform and keeps users up-to-date with the job status.

Email Scheduler

Let’s take the example of an email scheduler that reads emails from an email account, say GMail, and does so periodically. This is a classic use case for a scheduler. So, here is how you can schedule your email reader job-

Step-1: Pick the function to schedule

In this case, we can create a simple function that reads all the unread emails from the specified GMail account. Here is my namespace with the function read-email-

(ns ^{:author "Anuj" :doc "FORMCEPT GMail Reader"}
  fcgmail.core
  (:require [clojure-mail.core :as mcore]
            [clojure-mail.message :as m]
            ; Assumption: the log/ alias refers to clojure.tools.logging
            [clojure.tools.logging :as log]))

; GMail Store Connection
(def ^:private gstore (atom nil))

(defn- read-msg
  "Reads the message and returns the subject and body"
  [msg]
  {:subject (msg :subject)
   ; Keep only the first plain-text part of the body
   :body (-> (filter
               #(and (:content-type %)
                     (.startsWith (:content-type %) "TEXT/PLAIN"))
               (msg :body))
             first :body)})

(defn read-email
  "Reads unread emails and marks them as read"
  [uri email pwd]
  (try
    (reset! gstore (mcore/gen-store email pwd))
    (let [msgs (mcore/unread-messages @gstore :inbox)
          fcmsgs (map #(read-msg (m/read-message %)) msgs)]
      (doseq [fcmsg fcmsgs]
        (log/info (str "Retrieved: " fcmsg))
        ; Do whatever you want with the message
        ))
    (catch Exception e (log/error (str "Failed: " (.getMessage e))))
    ; Always mark the messages as read and close the store
    (finally
      (do (mcore/mark-all-read @gstore :inbox)
          (mcore/close-store @gstore)))))

It uses the clojure-mail project to connect to GMail and read the messages. I will keep that explanation for the next blog, but I encourage readers to go ahead and take a look at this project as well.

Step-2: Schedule

Now comes the most interesting part. This is how you can schedule your target function, i.e. read-email in this example-

(ns fcgmail.core
  (:require [nolan.core :as n]))

; Create Scheduler
(defonce sc (n/get-mem-scheduler))

; Schedule and keep the returned schedule ID
(def scid (n/add-schedule sc "R//PT30S" #(read-email uri email pwd)))

That is it :-) – Your scheduled function will be called every 30 seconds, as per the repeating intervals syntax of ISO 8601. The function add-schedule returns a schedule ID that can be used later to expire the schedule, which stops all further executions and removes it from the schedule store, as shown below-

; scid is the schedule ID returned by add-schedule
(n/expire sc scid)
; Check expiry status
(n/expired? sc scid)
; Should return true

By default, the library comes with a built-in in-memory scheduler, but you can extend the ScheduleStore protocol to the store of your choice. Please give it a try.
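
To make the "R//PT30S" string used above less cryptic, here is a minimal Java sketch of how such a repeating-interval expression could be taken apart. It is not Nolan’s code: the class RepeatingInterval and its fireTime helper are hypothetical names, and the sketch assumes the three-field form R[n]/[start]/<duration> with an optional repetition count and an optional start instant. It relies on java.time.Duration.parse, which understands day and time designators (PT30S, PT1M, P1D) but not years or months.

```java
import java.time.Duration;
import java.time.Instant;

/** Hypothetical helper (not part of Nolan): parses "R[n]/[start]/<duration>". */
final class RepeatingInterval {
    final long repetitions; // -1 means "repeat forever"
    final Instant start;    // null means "no explicit start"
    final Duration period;  // time between two consecutive runs

    private RepeatingInterval(long repetitions, Instant start, Duration period) {
        this.repetitions = repetitions;
        this.start = start;
        this.period = period;
    }

    static RepeatingInterval parse(String text) {
        // A limit of -1 keeps empty fields, so "R//PT30S" yields ["R", "", "PT30S"]
        String[] parts = text.split("/", -1);
        if (parts.length != 3 || !parts[0].startsWith("R")) {
            throw new IllegalArgumentException("Expected R[n]/[start]/<duration>: " + text);
        }
        String count = parts[0].substring(1);
        long repetitions = count.isEmpty() ? -1 : Long.parseLong(count);
        Instant start = parts[1].isEmpty() ? null : Instant.parse(parts[1]);
        Duration period = Duration.parse(parts[2]);
        return new RepeatingInterval(repetitions, start, period);
    }

    /** Time of the n-th run (0-based), counted from the given first run. */
    Instant fireTime(Instant first, long n) {
        return first.plus(period.multipliedBy(n));
    }

    public static void main(String[] args) {
        RepeatingInterval every30s = parse("R//PT30S");
        System.out.println("period in seconds: " + every30s.period.getSeconds());
        System.out.println("third run: " + every30s.fireTime(Instant.now(), 2));
    }
}
```

Read this way, the empty fields in "R//PT30S" simply mean "no repetition limit" and "no fixed start", leaving only the 30 second period.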

Posted in Development, FORMCEPT, Open Source, Research

GDF Graph Loader for TinkerPop 2.x

Recently, we came across .gdf files, a CSV-like format for graphs primarily used by GUESS. Although the GDF file format is supported by Gephi, it was still missing from TinkerPop, one of the most widely used graph computing frameworks.

Today, we are happy to release gdfpop, an open source implementation of a GDF file reader for TinkerPop 2.x under the Apache License, Version 2.0. It allows you to directly import .gdf files into FORMCEPT’s FactorDB storage engine, which is compliant with the TinkerPop 2.x Blueprints APIs.

gdfpop APIs

gdfpop provides a method GDFReader.inputGraph that takes in an existing com.tinkerpop.blueprints.Graph instance and an input stream to the GDF file. There are three optional parameters-

  1. buf: Buffer size for BatchGraph. See BatchGraph for more details.
  2. quote: You can specify the quote character that is being used for the values. Default is double quotes.
  3. eidp: Edge property to be used as an ID

The implementation handles all the missing values, datatypes, default values and quotes gracefully. Here is a sample .gdf file that can be loaded via gdfpop-

nodedef>name VARCHAR,label VARCHAR2,class INT, visible BOOLEAN default false,color VARCHAR,width FLOAT,height DOUBLE
a,'Hello "world" !',1,true,'114,116,177',10.10,20.24567
b,'Well, this is',2, ,'219,116,251',10.98,10.986123
c,'A correct 'GDF' file',,,, ,
edgedef>node1 VARCHAR,node2 VARCHAR,directed BOOLEAN,color VARCHAR, weight LONG default 100
a, b,true,' 114,116,177',
b,c ,false,'219,116,251 ',300
c, a  , ,,
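
To give a feel for what reading such a file involves, here is a small Java sketch that parses a nodedef>/edgedef> header into typed columns and splits a data row while respecting quotes. It is an illustration only, not gdfpop’s actual implementation: the class GdfSketch, its Column type and both method names are hypothetical, treating a column without a type as VARCHAR is our assumption, and default values containing commas are not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of GDF parsing (not gdfpop's actual code). */
final class GdfSketch {

    /** One column declared in a nodedef> or edgedef> header. */
    static final class Column {
        final String name;
        final String type;
        final String defaultValue; // null when the header declares no default

        Column(String name, String type, String defaultValue) {
            this.name = name;
            this.type = type;
            this.defaultValue = defaultValue;
        }
    }

    /** Parses e.g. "nodedef>name VARCHAR,visible BOOLEAN default false". */
    static List<Column> parseHeader(String line) {
        String body = line.substring(line.indexOf('>') + 1);
        List<Column> columns = new ArrayList<Column>();
        for (String spec : body.split(",")) {
            String[] tokens = spec.trim().split("\\s+");
            // Assumption: a column without an explicit type is a VARCHAR
            String type = tokens.length > 1 ? tokens[1].toUpperCase() : "VARCHAR";
            String def = (tokens.length > 3 && tokens[2].equalsIgnoreCase("default"))
                    ? tokens[3] : null;
            columns.add(new Column(tokens[0], type, def));
        }
        return columns;
    }

    /** Splits a row on commas that are outside quotes; quote characters are dropped. */
    static List<String> splitRow(String line, char quote) {
        List<String> fields = new ArrayList<String>();
        StringBuilder current = new StringBuilder();
        boolean quoted = false;
        for (char c : line.toCharArray()) {
            if (c == quote) {
                quoted = !quoted;
            } else if (c == ',' && !quoted) {
                fields.add(current.toString().trim()); // a blank field means "missing"
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        fields.add(current.toString().trim());
        return fields;
    }

    public static void main(String[] args) {
        for (Column c : parseHeader("edgedef>node1 VARCHAR,node2 VARCHAR,weight LONG default 100")) {
            System.out.println(c.name + " " + c.type + " default=" + c.defaultValue);
        }
        System.out.println(splitRow("a,'Hello \"world\" !',1,true,'114,116,177',10.10,20.24567", '\''));
    }
}
```

A blank field returned by splitRow is where a loader would fall back to the column’s default value, if one was declared, or skip the property altogether.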


For example, consider the following graph taken from the default TinkerPop implementation-


It has 6 vertices and 6 edges, with each vertex having two properties- label and age- and each edge having a weight. The only change that we have made to convert it into a GDF file is that the property name has been renamed to label, because name is used as the node/vertex ID in GDF. See GDF File Format for all the possible properties for a vertex. The .gdf file corresponding to the above graph is shown below-

nodedef>name VARCHAR,label VARCHAR,age INT,lang VARCHAR
edgedef>node1 VARCHAR,node2 VARCHAR,name VARCHAR,label VARCHAR,weight FLOAT

Although the GDF specification does not define an ID for edges, you can ask gdfpop to use a specific edge property as the edge ID using the eidp parameter.

Using gdfpop

Suppose an example.gdf file with the above vertices and edges is provided as input and you wish to use all the awesomeness of the TinkerPop 2.x stack on it. To do so, follow these steps-

Step-1: Build gdfpop

Currently, gdfpop is not available on Maven Central, so you will have to pick the latest release or build from source using the following command-

mvn clean compile install

Once Maven builds gdfpop, it will be available within your local Maven repository and ready to be integrated with your existing code base using the following Maven dependency-


Step-2: Load GDF files

Now, you can use the org.formcept.gdfpop.GDFReader functions to process and load the above example.gdf file as shown below-

// initialize
Graph graph = new TinkerGraph();
// load the gdf file
GDFReader.inputGraph(graph, new FileInputStream(new File("example.gdf")), "\"", "name");
// write it out as GraphSON
GraphSONWriter.outputGraph(graph, System.out);

The above code snippet will create a TinkerGraph, load it with all the vertices and edges as defined in example.gdf file and dump the loaded graph in GraphSON format that we can easily verify. For example, here is a JSON dump from the sample run of the above code-

{
    "mode": "NORMAL",
    "vertices": [{
        "name": "3",
        "label": "lop",
        "lang": "java",
        "_id": "3",
        "_type": "vertex"
    }, {
        "age": 27,
        "name": "2",
        "label": "vadas",
        "_id": "2",
        "_type": "vertex"
    }, {
        "age": 29,
        "name": "1",
        "label": "marko",
        "_id": "1",
        "_type": "vertex"
    }, {
        "age": 35,
        "name": "6",
        "label": "peter",
        "_id": "6",
        "_type": "vertex"
    }, {
        "name": "5",
        "label": "ripple",
        "lang": "java",
        "_id": "5",
        "_type": "vertex"
    }, {
        "age": 32,
        "name": "4",
        "label": "josh",
        "_id": "4",
        "_type": "vertex"
    }],
    "edges": [{
        "weight": 1.0,
        "node1": "4",
        "name": "10",
        "node2": "5",
        "_id": "10",
        "_type": "edge",
        "_outV": "4",
        "_inV": "5",
        "_label": "created"
    }, {
        "weight": 0.5,
        "node1": "1",
        "name": "7",
        "node2": "2",
        "_id": "7",
        "_type": "edge",
        "_outV": "1",
        "_inV": "2",
        "_label": "knows"
    }, {
        "weight": 0.4,
        "node1": "1",
        "name": "9",
        "node2": "3",
        "_id": "9",
        "_type": "edge",
        "_outV": "1",
        "_inV": "3",
        "_label": "created"
    }, {
        "weight": 1.0,
        "node1": "1",
        "name": "8",
        "node2": "4",
        "_id": "8",
        "_type": "edge",
        "_outV": "1",
        "_inV": "4",
        "_label": "knows"
    }, {
        "weight": 0.4,
        "node1": "4",
        "name": "11",
        "node2": "3",
        "_id": "11",
        "_type": "edge",
        "_outV": "4",
        "_inV": "3",
        "_label": "created"
    }, {
        "weight": 0.2,
        "node1": "6",
        "name": "12",
        "node2": "3",
        "_id": "12",
        "_type": "edge",
        "_outV": "6",
        "_inV": "3",
        "_label": "created"
    }]
}

Notice that it has the 6 vertices and 6 edges that were defined in the example.gdf file earlier.

Currently, gdfpop is compatible only with the TinkerPop 2.x implementation. Going forward, we may look into providing a plug-in for TinkerPop 3.x as well, based on the interest of the community. Feel free to give us a shout at gdfpop.


  1. GDF: A CSV-Like Format for Graphs
  2. GUESS: The Graph Exploration System
  3. Gephi: The Open Graph Viz Platform
  4. TinkerPop: An Open Source Graph Computing Framework
  5. gdfpop: Open Source GDF File Reader for TinkerPop 2.x
  6. Apache License, Version 2.0
  7. GraphSON Reader and Writer Library
Posted in Development, FORMCEPT, Open Source, Research

Gen-next of resumes : From standard text to visual infographics

Earlier this week, veteran HR executive Lee E. Miller noted in his column how visual resumes will dominate the next big wave in the recruitment industry. With recruiters starting to see more visual resumes, candidates are considering taking that path and catching recruiters’ attention by turning to infographics over textual resumes.

Recruiters who have welcomed the idea of visual resumes believe that acceptance will increase across the recruitment industry, as the innovation and creativity involved considerably reduce the effort at the recruiter’s end. Stunning visuals are often used to impress HR managers and stand out from the crowd.

 Resume Intent @FORMCEPT

“It is easier to absorb visual content as vision rate of humans is very high and over 90% of visual information that is captured gets stored in the brain.”

Images are easily captured by the human brain and are retained for longer periods of time. To deliver a lasting impression on HR managers, FORMCEPT offers infographics that illustrate candidates’ profiles as a visual summary of skills, experience, achievements, education and interests. This is what a resume infographic looks like-

Visual Resume Infographics

In addition, FORMCEPT provides advanced analytics options for recruiters to query and explore multiple resumes and compare them side by side. For more details, please contact us.

Posted in FORMCEPT, Infographics, resume, visual CV

Data Analysis should be your Compass

Imagine that you are going from a well-known location, Point-A, to an unknown location, Point-B. Along your journey, you refer to a GPS-based navigation system to decide how to proceed. In this scenario, there can be two possibilities:

GPS Scenario

  1. You might know how to reach Point-C optimally (even though the GPS may be suggesting a longer route via Point-X) and then rely on the GPS system to reach the destination, i.e. Point-B.
  2. You might blindly follow the GPS-based navigation system to take you to the destination (Point-B) through Point-X, which it considers the best possible route for you at that time.

While you are on your way, you might change your course in-between due to traffic jams, or road blocks. In that case, the online navigation system will re-calculate the route to pick up where you are and start guiding you.

In fact, navigation systems have become intelligent enough to find out whether there is a traffic jam at certain places and provide alternate efficient routes, all in real time. In addition to that, they are non-intrusive and they provide the driver with complete freedom to follow the navigation system or change the course- “Navigation system adapts to the change”.

The navigation system provides you with insights on the traffic data and route, and you, as the decision maker, take that input and act on it.

So, how is this relevant in the context of business? Consider a typical organization where the CXOs know the current state of the business (Point-A) and are eager to accomplish business goals (Point-B) faster. They have enough data collected inherently (knowledge) and are progressing towards the goal (Point-B). In the context of business, Point-B might be any of these, depending on the CXO level within the company-

  1. Increase revenue by x%
  2. Increase product features as per the market demand
  3. Save cost by y%
  4. Increase customer base by z%
  5. Save inventory cost, etc.

Company Compass

What is missing is a data driven analysis platform (GPS Navigation System) that can guide them to reach the desired destination faster and with the existing resources.

The reason they need a platform rather than an application is that one application may not be the silver bullet for all requirements. An organization needs more than one application, custom built for the business, using the available data and resources to solve a particular business problem. The data driven analysis platform should inherently support that. The platform should be agile so that it can support multiple applications and adapt to the business requirement by doing all the heavy lifting of the repetitive and common tasks related to data analysis. In other words, it should quickly re-calculate the optimal path to the destination as and when there is a deviation from the earlier suggested path.

Can the current traditional Business Intelligence systems do that? It is challenging, because traditional BI systems are designed to work on structured data and are monolithic by nature. Moreover, data is being generated these days at a much higher rate, and most of it is unstructured. A platform that can capture, store and analyze such data should:

  • Treat unstructured data with the same rigour as structured data
  • Provide quick insights as and when they are required (on-demand/real-time)
  • Understand context, i.e. put forth the possible strategies to reach the wanted destination and based on the choice taken by the decision maker assist them optimally

FORMCEPT Big Data platform is designed just for that. It enables enterprises

  • To gain business insights faster by leveraging the available data
  • To respond faster to the ever changing Business Intelligence requirements
  • To make “Dark Data” extinct by leveraging the historical data of the organization

FORMCEPT Data Analysis Flow

FORMCEPT uses proprietary Data Folding℠ techniques to discover the relations and patterns that exist across datasets and generate fact-based unified views. For the business, this means that different business units can create their own virtual data in the form of a unified data view and write their own cognitive, data-driven applications for their business problem.

For example, an e-commerce company’s marketing department can build its own Influencer Application that understands customers holistically, based not only on transactional data but also on public data such as social media and blogs. Based on this application, they can target a product promotion campaign effectively, thereby increasing revenue and the customer base.

To learn more about FORMCEPT and how it can solve your business problem, please

Posted in FORMCEPT

Big Data Tech Conclave 2013 – Part-2

In the previous blog, we discussed how FORMCEPT addresses the “Data Infrastructure Issues” using its MECBOT platform. In this blog, we will take you through two real customer use cases and show how enterprises can leverage MECBOT to solve their business problems.

Use Case 1: Loyalty Analysis and Targeted Promotional Campaign

Data Sources-

  • Bank Statements and Bills (PDF documents)
  • Public data sources, like Geolocation, Region, Country, etc.

Goal- To segment the customers based on loyalty and target a promotional campaign on a specific set of products

Following are the basic requirements for this use case-

  • Deploy a scalable data analysis platform for storage and analysis of documents *
  • Extract facts, like- account numbers, transactions, etc. from these documents
  • Enrich the content using the location data
  • Identify transaction patterns from the data and come up with a loyalty model
  • Validate the model
  • Represent the results such that key stakeholders can explore the results and initiate a targeted promotional campaign

* One of the key factors for the underlying Data Infrastructure

Posted in FORMCEPT, Infographics, Retail