Author: Punit Naik

Locality-sensitive Hashing: Part 2 (Moving from Spark to Onyx)

In our previous post, we briefly explained the pitfalls of data duplication and how Record Linkage can be used to detect duplicates in databases. In an earlier post (March 2017), we also discussed Record Linkage and data de-duplication on Spark. There have been some recent exciting developments at …

De-duplication Can Empower Your Business to Achieve Data-Driven RoI: Here’s How

Data Duplication and Redundancy: The Challenges

Over the last couple of decades, there has been a phenomenal shift from paper-based records to virtual data rooms in India. This has streamlined and transformed the way businesses collect, store and retrieve customer data. However, as the primary focus remained on the uphill task of converting …

Locality-Sensitive Hashing on Spark with Clojure/Flambo

Record Linkage is the process of finding similar entities in a dataset. Using this technique, one can implement systems such as plagiarism detectors (which identify fraudulent scientific papers or articles), document-similarity search (finding similar articles on the internet), fingerprint matching, and so on. The possibilities are endless. But the topic which we are focusing …
