What if we told you that one of the biggest challenges that businesses face today is not too little data, but rather, too much data? Luckily, a number of tools are available to data scientists to tame unruly data into useful, actionable and compact groups. Enter Data Binning - a popular statistical pre-processing technique with powerful applications in business and image processing alike.
We at FORMCEPT regularly chronicle our adventures with data and lay bare our trysts with powerful analysis techniques especially for you. Here is our exclusive piece on Data Binning - and more specifically, on Equal Frequency Data Binning - a powerful tool to augment Machine Learning algorithms.
For the uninitiated, let us quickly brush over these terms.
Data Binning | What it is
For a moment, imagine that schools had no system to classify students into ‘classes’ (1st standard, 6th standard, etc.). Thousands of students of varied interests, diverse capabilities, and different age groups would flock to the school - each of them expecting education outcomes on par with the rest. The result? Chaos. It would be impossible to develop a curriculum, it would be impossible to create schedules and allocate teachers, and examinations would pretty much go out of the window. A long time ago, someone somewhere must have realized that ‘age’, ‘interest’, and ‘assessments’ are fundamental denominators in learning and pedagogy. The system of classes, electives, pass-fail and grades must have evolved accordingly.
Sometimes (if not always), this is what happens when you throw a stream of continuous data into a decision-making model: the algorithm simply does not know how to deal with it. The remedy is to group the data into clusters called intervals, which are discrete rather than continuous. This process is nothing but Data Binning. Every interval is assigned a discrete value, and each continuous data point is then mapped to the discrete value of the interval it falls in. Your algorithm can make the most of binned data because little information is lost and yet the resource load is reduced significantly.
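To make the mapping concrete, here is a minimal sketch of the simplest form of binning, where every interval has the same width. The function name `equal_width_bins` and the sample ages are our own illustrative assumptions, not part of any particular library.

```python
def equal_width_bins(values, num_bins):
    """Map each continuous value to a discrete bin index in 0..num_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    if width == 0:
        # All values are identical, so a single bin holds everything.
        return [0] * len(values)
    # The maximum value would land in bin `num_bins`, so clamp it into the last bin.
    return [min(int((v - lo) / width), num_bins - 1) for v in values]


# Hypothetical student ages, grouped into 3 intervals of equal width.
ages = [5, 7, 9, 12, 13, 17]
labels = equal_width_bins(ages, 3)
print(labels)  # [0, 0, 1, 1, 2, 2]
```

Each age is replaced by the index of its interval, so a downstream model sees three discrete groups instead of a continuous range.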
Example: When you compress a high-resolution image into a low-resolution one, the file size decreases as binning of pixels (e.g. a 4x4 pixel array becoming a 2x2 pixel array) leads to fewer data points in pixels without compromising on the original image composition.
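As a rough illustration of pixel binning, the sketch below averages every 2x2 block of a small grayscale image, turning a 4x4 array into a 2x2 one. The function `bin_pixels` and the pixel values are hypothetical, and the sketch assumes the image dimensions are exact multiples of the binning factor.

```python
def bin_pixels(image, factor=2):
    """Average each factor x factor block of a 2-D grayscale image."""
    rows, cols = len(image), len(image[0])
    binned = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            # Collect the pixels of one block and replace them with their mean.
            block = [image[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            out_row.append(sum(block) / len(block))
        binned.append(out_row)
    return binned


# Hypothetical 4x4 image with four visibly different regions.
image = [
    [10, 12, 200, 202],
    [14, 16, 204, 206],
    [50, 52, 100, 102],
    [54, 56, 104, 106],
]
small = bin_pixels(image)
print(small)  # [[13.0, 203.0], [53.0, 103.0]]
```

Sixteen data points become four, while the broad layout of dark and bright regions is preserved.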
From Binning to Equal Frequency Binning
Let us go back to our example of a school. Now let us imagine that out of 1000 students 875 are in standard 9, and the rest are distributed in the other classes. We still have groups, and we are still adhering to the age-group based classification. The catch? The distribution of students across the classes is so distorted that it is impossible to optimize resources like classrooms, faculty, labs, and so on.
Now, map this scenario onto your data. Lopsided discrete groups binned from continuous data lead to intractable data skewness that can mess with your algorithm. The rescue here is a slightly modified form of Binning - called Equal Frequency Data Binning. In this form, the groups are designed so that the same, or nearly the same, number of continuous data points is allocated to each discrete group.
Example: Again, when you compress an image, the color distribution (RGB) of pixels is mapped pro rata to the compressed image without bringing in unwanted skewness, so a low-resolution image will still look like the original image.
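The idea can be sketched in a few lines by ranking the values and cutting the ranking into equally sized slices. This is a generic rank-based sketch rather than the two versions described later in this piece; the function name and the sample amounts are assumptions made for illustration.

```python
def equal_frequency_bins(values, num_bins):
    """Assign bin labels so each bin holds roughly len(values) / num_bins points."""
    # Indices of the values, ordered from the smallest value to the largest.
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, idx in enumerate(order):
        # Integer arithmetic spreads the ranks evenly over the bins.
        labels[idx] = rank * num_bins // len(values)
    return labels


# Hypothetical, heavily skewed amounts.
amounts = [120, 150, 90, 30000, 400, 75, 2200, 310, 8000]
labels = equal_frequency_bins(amounts, 3)
print(labels)  # [0, 1, 0, 2, 1, 0, 2, 1, 2]
```

Even though the largest amount is several hundred times the smallest, each of the three groups receives exactly three points. Note that this simple version may place identical values in different bins when they straddle a cut.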
How it Works
There are two versions of the Equal Frequency Binning algorithm.
Version 1: Merge the intervals whose frequency is less than a particular threshold. This version determines the appropriate number of bins, and their ranges, such that the number of data points in each bin is uniform.
- Fetch the variable on which binning is to be applied from the attributes.
- Calculate the mean and standard deviation of this variable.
- Choose the initial number of bins as specified by the user. If it is not specified, the algorithm determines the number of bins on its own.
- Identify the initial intervals using the mean and standard deviation.
- After the fourth step, the data distribution is not yet uniform. To overcome this problem, we merge the intervals whose frequency is less than a particular threshold.
- Calculate the frequency, maximum and minimum value of the records contained in every interval. Searching from right to left, whenever the frequency of records contained in an interval is less than a fraction of the frequency, merge the interval into its nearest interval and update the frequency, maximum and minimum of the records contained in the merged interval. This process goes on until no interval is left to be merged.
- Map the values of the continuous attribute to discrete values according to the resulting intervals to get an equal frequency distribution.
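The steps above can be sketched as follows. Two details are not spelled out in the description, so this sketch makes assumptions that are ours alone: the initial edges are spaced one standard deviation apart and centred on the mean, and the merge threshold is a fraction of the average frequency of the initial intervals. The function names and sample amounts are also hypothetical.

```python
from statistics import mean, pstdev


def initial_intervals(values, num_bins):
    """Group values into num_bins intervals built from the mean and std deviation."""
    mu, sigma = mean(values), pstdev(values)
    # Assumption: interior edges sit one standard deviation apart around the mean.
    edges = [mu + (j - num_bins / 2) * sigma for j in range(1, num_bins)]
    bins = [[] for _ in range(num_bins)]
    for v in values:
        # The number of edges below v is the index of v's interval.
        bins[sum(v > e for e in edges)].append(v)
    return bins


def merge_sparse_intervals(bins, fraction=0.5):
    """Merge, from right to left, intervals whose frequency is below the threshold."""
    total = sum(len(b) for b in bins)
    # Assumption: threshold = fraction of the average initial frequency.
    threshold = fraction * total / len(bins)
    bins = [list(b) for b in bins]  # work on a copy
    merged_any = True
    while merged_any and len(bins) > 1:
        merged_any = False
        for i in range(len(bins) - 1, -1, -1):
            if len(bins[i]) < threshold:
                # Merge into the left neighbour; the leftmost bin merges rightwards.
                neighbour = i - 1 if i > 0 else 1
                bins[neighbour].extend(bins[i])
                del bins[i]
                merged_any = True
                break  # restart the search from the right end
    return bins


# Hypothetical transaction amounts with a long right tail.
amounts = [500, 800, 1200, 1500, 1800, 2500, 3000, 4200, 9000, 25000]
merged = merge_sparse_intervals(initial_intervals(amounts, 6), fraction=0.5)
# Frequency, minimum and maximum of every surviving interval.
summary = [(len(b), min(b), max(b)) for b in merged]
print(summary)
```

The `summary` list holds the frequency, minimum and maximum of each final interval, which is what the last step needs to map a raw value to its discrete group.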
Version 2: Transfer data points from one interval to another so that the percentage of data points in every interval is within the deviation limit. Here, users can specify the number of bins and the deviation limit.
After the first 4 steps in Version 1:
- Calculate the upper bound and lower bound based on the deviation limit.
- If the percentage of data points in a particular interval is less than the lower bound, we make the percentage of data points in that interval equal to the lower bound. Similarly, if it is greater than the upper bound, we make it equal to the upper bound. If it lies between the lower and upper bound percentages, we leave the percentage as it is.
- The algorithm runs first from right to left and then from left to right, and this process continues until all the percentages are within the deviation limit.
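A minimal sketch of this transfer step is shown below. It works on interval counts rather than percentages and assumes that the bounds are the uniform share (100 divided by the number of bins) plus or minus the deviation limit, and that points are exchanged only with the adjacent interval in the direction of the sweep. These choices, and the names `rebalance_counts` and `split_sorted`, are our assumptions for illustration.

```python
from math import ceil, floor


def rebalance_counts(counts, deviation_pct, max_sweeps=10):
    """Shift points between adjacent intervals until every share is within bounds."""
    n, k = sum(counts), len(counts)
    target_pct = 100.0 / k  # assumption: uniform share is the target
    lower = max(0, ceil(n * (target_pct - deviation_pct) / 100))
    upper = min(n, floor(n * (target_pct + deviation_pct) / 100))
    if lower * k > n or upper * k < n:
        raise ValueError("deviation limit is too tight for this many points")
    counts = list(counts)

    def settle(i, neighbour):
        # Positive shift pulls points in from the neighbour, negative pushes out.
        if counts[i] < lower:
            shift = lower - counts[i]
        elif counts[i] > upper:
            shift = upper - counts[i]
        else:
            return
        counts[i] += shift
        counts[neighbour] -= shift

    for _ in range(max_sweeps):
        if all(lower <= c <= upper for c in counts):
            break
        for i in range(k - 1, 0, -1):  # right to left
            settle(i, i - 1)
        for i in range(k - 1):  # left to right
            settle(i, i + 1)
    return counts


def split_sorted(values, counts):
    """Cut the sorted values into consecutive intervals of the given sizes."""
    ordered, bins, start = sorted(values), [], 0
    for c in counts:
        bins.append(ordered[start:start + c])
        start += c
    return bins


# Hypothetical initial frequencies of 6 intervals and a 5% deviation limit.
balanced = rebalance_counts([50, 30, 10, 5, 3, 2], deviation_pct=5)
print(balanced)
```

Because the data is sorted before it is cut, moving points between neighbouring intervals amounts to shifting the boundary between them. Like any count-based split, it may separate identical values that sit on a boundary.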
Primary Use-case: Binning customer database of a travel aggregator to design promotional marketing campaigns
Travel aggregators like ClearTrip, MakeMyTrip and Yatra frequently come up with attractive promotional campaigns to promote customer loyalty. Here, we have used sample travel aggregator data to demonstrate Equal Frequency Binning (Version 1). The parameter that we have binned is the ‘Transaction Amount’ of the customer.
In the histograms, the x-axis indicates the transaction amount in INR and the y-axis indicates the number of users/customers.
In the first histogram, 6 highly disparate bins have been formed from the continuous data series of transaction amounts. These bins are not useful for designing attractive promotions because the distribution is highly skewed.
In the second histogram, the 2 bins at the right end of the distribution (as the search algorithm runs from right to left) are merged to create a new composite bin. The data skewness is somewhat reduced now, but it is still there.
Similarly, the last 2 bins are merged once again to form a new bin.
The process continues until we arrive at evenly distributed bins - we now have 3 bins which contain approximately the same number of data points.
Implications on Marketing Strategy:
- The three bins can be categorized as Sporadic Users, Moderate Users and Heavy Users.
- Promotional campaigns that can be targeted at Sporadic Users include referral campaigns and wallet cashbacks.
- Moderate users can be attracted to Frequent Flyer campaigns, discounts on hotel-booking against travel transaction amount, and free cancellation option.
- Heavy users can be targeted with promotion campaigns like Corporate travel benefits, family packages, and other packaged deals.
Other Use-cases:
- Public Distribution Services (PDS) - Setting up PDS centers at an appropriate frequency across the country for even rationing of food grains.
- Product Development - Binning income levels to create suitable insurance and banking products for customers.
- Customer Segmentation - Creating the right mix of web-series and documentaries for different age groups, ethnic groups etc. (e.g. Netflix, Hotstar and Amazon Prime could identify the appropriate mix of kid shows, romance, thrillers, and supernatural shows.)
Analysts today are continuously bombarded with data, and are increasingly inching towards the notion of “small is beautiful.” FORMCEPT is at the forefront of augmenting traditional analytics with state-of-the-art innovation to maximize the ability of models and algorithms to digest data and deliver actionable insights. Data Binning is one of the many approaches to polishing and re-structuring data for better usability and insight-generation. To know more about who we are and what we do, visit www.formcept.com or write to us at firstname.lastname@example.org.