Data Duplication and Redundancy - The Challenges
Over the last couple of decades, there has been a phenomenal shift from paper-based records to virtual data rooms in India. This has streamlined and transformed the way businesses collect, store and retrieve customer data. However, as the primary focus remained on the uphill task of converting from paper to digital, aspects like data validation and periodic clean-ups were not given the attention they deserved. As a result, many businesses and public service providers today find themselves in an overwhelmingly redundant data pool.
On the technology front, this massive data duplication has left businesses needing several times more storage capacity and has drastically reduced their speed of operation. Most organizations today are witnessing data growth of 50 to 60 per cent annually. This accentuates the need to optimize storage space through de-duplication. But this is easier said than done: basic de-duplication software that offers a temporary band-aid fix rarely helps businesses.
On the business front, multiple copies of the same data simply mean bad business. Why? Because hazy, cloudy data cripples analytics, which is the cornerstone of most business activities today. Let us look at a few examples.
Use Case 1: Newspaper Publishing
Large news content publishers in India, with lakhs of newspaper copies in circulation, sit on an extensive internal data pool of customers. This includes information such as customer names, gender, age, address, and so on. Theoretically, this enormous data pool can be used for razor-sharp demographic segmentation of their customers.
The catch? The majority of the data does not exist as a unique copy in their system. The implications of this are explained through the following instances:
- Primes Group, an umbrella news publishing house, has several newspapers in circulation targeted at different audiences, such as Primes of India, Economic Primes of India, Primes Group Magazine, and so on. Further, there are multiple editions across cities, such as Primes of India-Mumbai, Primes of India-Kolkata, and so on. A single customer who has subscribed to multiple newspapers or magazines of Primes Group is treated as a different customer in each case by analytics software, because the data is duplicated.
- Several members of the same family who subscribe to Primes Group products are treated as non-related entities. As a result, demographic segmentation at household and locality level is not possible.
Use Case 2: Public Goods and Services Providers
Distribution of and access to public goods and services such as subsidized food and commodities is often linked to social identification documents such as PAN Card, Voter Card, Aadhaar Card, Passport and so on.
Fraudulent practices include making barely detectable changes in addresses or other details to issue multiple identification documents for the same beneficiary. This leads to leakage of resources and an unfair advantage to a few. In the absence of intelligent software that detects such falsification and misrepresentation, it is a nearly impossible task to streamline public distribution.
Use Case 3: Telecom Operators
It is common practice in India that while a SIM card may be registered in the name of an individual, the actual user may be the individual’s friend, family member, or acquaintance. This means that customer data points such as personal information, consumption behaviour, and usage feedback are not unified across multiple data copies. This in turn results in poor customer profiling and, hence, an ineffective marketing strategy.
Data duplication is further complicated by mobile number portability and the consumer trend of using multiple SIM cards. As of 2012, there were 71 million multiple-SIM users in India.
MECBOT to Ensure Uniqueness of Data
Ensuring a unique copy of data is indispensable for businesses to optimize their resources and boost RoI by designing effective marketing and distribution strategies. At FORMCEPT, we understand that a basic layer of technology, such as data compression and incremental backups, is not enough for today’s data-driven businesses. Hence, we have built our systems around the sanctity of data. For example, our flagship product MECBOT, an open cognitive application platform, uses Locality-sensitive Hashing (LSH) to intelligently detect and eliminate data duplication across multiple copies in the following manner:
According to the above flowchart, we first feed into the system a database that has one or more duplicate records in it. This database could contain names, addresses, dates of birth, etc. We first decide which field(s) need to be considered to detect the duplicate records, for example, address. Then the system generates a number-based code for each record, called a ‘hash’. For example, for a place called ‘Shyambazar’ in Kolkata, the hash may be -1709604676.
The next step is to group together the records that have the same hash. For example, ‘Shyambazar’, ‘shyambazar’, ‘shyam bazar’ and ‘Shyam Bazar’ (these are the different ways in which people in Shyambazar actually write their address) will all be assigned the same hash, -1709604676, by the LSH algorithm.
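The grouping step can be illustrated with a minimal Python sketch. It assumes MinHash over character trigrams as the hash family and a simple lowercase-and-strip normalization; the post does not say which hash family or normalization MECBOT uses, and the function names (normalize, shingles, minhash_signature, lsh_buckets) are hypothetical. Unlike the single integer hash in the example above, this sketch produces several band keys per record, and records that share any band key become candidates for comparison.

```python
import hashlib
from collections import defaultdict


def normalize(text):
    """Lowercase and drop everything except letters and digits (assumed rule)."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def shingles(text, k=3):
    """Return the set of overlapping k-character substrings of the normalized text."""
    s = normalize(text)
    if len(s) <= k:
        return {s}
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def _seeded_hash(seed, token):
    """Deterministic 64-bit hash of a token; each seed acts as a separate hash function."""
    digest = hashlib.md5(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(text, num_hashes=16):
    """MinHash signature: the minimum hash of the shingle set under each seed."""
    tokens = shingles(text)
    return tuple(min(_seeded_hash(seed, t) for t in tokens)
                 for seed in range(num_hashes))


def lsh_buckets(records, num_hashes=16, bands=4):
    """Split each signature into bands and group records by (band index, band values)."""
    rows = num_hashes // bands
    buckets = defaultdict(set)
    for record in records:
        sig = minhash_signature(record, num_hashes)
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].add(record)
    return buckets


if __name__ == "__main__":
    addresses = ["Shyambazar", "shyambazar", "shyam bazar", "Shyam Bazar", "Shyamnagar"]
    for key, group in lsh_buckets(addresses).items():
        if len(group) > 1:
            print(key[0], sorted(group))
```

Because the four spellings of Shyambazar normalize to the same string, they receive identical signatures and always share every bucket. Whether a different name such as ‘Shyamnagar’ lands in one of those buckets depends on the hash values, which is why a second, exact comparison step is still needed.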
Next, the records grouped together are compared against each other, and a percentage threshold is set to decide whether one record is a duplicate of another. For example, say the threshold is 15%. Suppose the system compares the two strings ‘Shyamnagar’ and ‘Shyambazar’ by calculating the percentage difference between them using the Levenshtein Distance technique. The two strings differ in 2 of their 10 characters, that is, by 20%. So ‘Shyamnagar’, which has a hash of 1596453476, will not be treated as a duplicate of ‘Shyambazar’, because the Levenshtein comparison shows that they differ by more than the 15% threshold.
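The comparison step can be sketched in Python as follows. This is an illustration under stated assumptions, not MECBOT's implementation: the percentage difference is taken to be the Levenshtein distance divided by the length of the longer string, strings are lowercased and stripped of spaces before comparison, and the names levenshtein, percent_difference and is_duplicate are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalize(text):
    """Assumed normalization: lowercase and keep only letters and digits."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def percent_difference(a, b):
    """Edit distance as a percentage of the longer normalized string (assumed definition)."""
    a, b = normalize(a), normalize(b)
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 100.0 * levenshtein(a, b) / longest


def is_duplicate(a, b, threshold=15.0):
    """Treat two values as duplicates when they differ by no more than the threshold."""
    return percent_difference(a, b) <= threshold


if __name__ == "__main__":
    print(percent_difference("Shyamnagar", "Shyambazar"))  # 20.0
    print(is_duplicate("Shyamnagar", "Shyambazar"))        # False
    print(is_duplicate("Shyambazar", "shyam bazar"))       # True
```

Dividing by the longer length keeps the result between 0 and 100. Other definitions, such as dividing by the average length, would shift where a given threshold falls.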
The technique of de-duplication is based on a more fundamental process called Record Linkage, which is the process of finding similar entities in a dataset. Record Linkage has use cases across applications such as plagiarism detectors, fingerprint matching, and so on. In 2013, we at FORMCEPT took a major step towards bringing our technology on par with global standards when we shifted to Clojure as our programming language. Given its powerful, state-of-the-art features, wide applications, and ability to address complex problems quickly, it is now at the core of our functionality, with over 70% of our codebase already migrated to it.
The shift to Clojure has also necessitated moving some of our processing engine capabilities from Spark to Onyx. Locality-sensitive Hashing is one of the first algorithms we have tried out in the new set-up (see: LSH with Spark). With the recent move of the LSH algorithm from Spark to Onyx, our de-duplication feature is now more powerful than ever. In our next blog, we will explore de-duplication further in the light of this change. To know more about what we do and how we can help your business achieve data-driven RoI, please visit www.formcept.com or write to us at email@example.com.