From ‘Real’ to ‘Realistic’: Generating Synthetic Data to Empower Organizations in Making Better Decisions

Do We Have Enough Analysis-Worthy Data?

A lot is being said about the data explosion that 21st-century businesses face. Yet there is surprisingly little buzz around the data crisis that goes hand in hand with it. Some instances of the data crisis are as follows:

Together, data explosion and data crisis form a paradox that FORMCEPT aims to tackle with disruptive innovation. In fact, today, data is viewed as a definitive source of competitive advantage for most businesses. This, in turn, brings us to the concept of Synthetic Data.

What is Synthetic Data?

Synthetic Data is system-generated data that mirrors real data in terms of the essential parameters, as determined by the user of the synthetic data. Synthetic Data is therefore 'realistic' but not 'real': it acts as a proxy for actual data for all practical purposes. The term 'Synthetic Data' was coined by Donald Rubin in 1993. The standard definition of Synthetic Data is any production data applicable to a given situation that is not obtained by direct measurement. It is considered to be a subset of anonymized data.

A related concept is Partially Synthetic Data, where only a few fields or variables of the actual records are anonymized or generated synthetically, while the remaining are taken from the real dataset.
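The idea of Partially Synthetic Data can be illustrated with a minimal Python sketch. The records, the field names (`region`, `salary`) and the helper `partially_synthesize` are hypothetical and exist only for this illustration; drawing replacement values uniformly between the observed minimum and maximum is an assumed strategy, not a description of any particular product.

```python
import random


def partially_synthesize(records, synthetic_fields, seed=0):
    """Return copies of `records` in which only `synthetic_fields` are replaced.

    Each replaced value is drawn uniformly between the observed minimum and
    maximum of that field; every other field is copied from the real record.
    """
    rng = random.Random(seed)
    bounds = {
        field: (min(r[field] for r in records), max(r[field] for r in records))
        for field in synthetic_fields
    }
    result = []
    for record in records:
        new_record = dict(record)  # start from the real values
        for field, (low, high) in bounds.items():
            new_record[field] = round(rng.uniform(low, high), 2)
        result.append(new_record)
    return result


# Hypothetical records: "region" stays real, "salary" is treated as sensitive.
real = [
    {"region": "north", "salary": 52000.0},
    {"region": "south", "salary": 61000.0},
    {"region": "east", "salary": 47500.0},
]
partial = partially_synthesize(real, synthetic_fields=["salary"])
```

The input records are left unchanged, so the real and partially synthetic versions can be compared side by side.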

Applications of Synthetic Data

Defense: One of the most important applications of Synthetic Data is in Defense, where simulated topographical information is stored in the form of synthetic data vaults for depth measurements and landscaping when entering new territories.

Image Processing: An industry that has recently benefited tremendously from synthetic data is image processing. 2D gesture recognition, 3D face reconstruction, de-noising (smoothing) and in-painting (filling in image gaps) are some of the popular use cases of Synthetic Data in image processing.

Apart from the above, Internet of Things (IoT), Artificial Intelligence (AI), Machine Learning and Natural Language Processing systems are often fed Synthetic Data for developing robust solutions.

Benefits of using Synthetic Data across these domains include:

Sharing data with external users for validation purposes
Generating large datasets with bounded conditions to build predictive models
Creating adverse scenarios that may occur in the future but are not present in the current actual dataset
Performance testing by generating and sending a data stream to a service or API
Unit and integration testing by sending random datasets to assess the behaviour of a service or API
Stress testing of an existing database or API by generating large datasets
Building demo applications and models
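As a rough illustration of generating data "with bounded conditions" for performance or stress testing, the sketch below yields an endless stream of records whose fields stay inside given ranges. The function `bounded_record_stream` and the sensor-style field names are assumptions made for this example; actually sending the records to a service or API is left out.

```python
import itertools
import random


def bounded_record_stream(bounds, seed=0):
    """Yield an endless stream of records whose fields stay within `bounds`.

    `bounds` maps a field name to an inclusive (low, high) range.
    """
    rng = random.Random(seed)
    while True:
        yield {name: rng.uniform(low, high) for name, (low, high) in bounds.items()}


# Hypothetical fields and ranges, chosen only for illustration.
bounds = {"temperature_c": (-10.0, 45.0), "humidity_pct": (0.0, 100.0)}

# Take a fixed-size batch; a stress test would simply take more records.
batch = list(itertools.islice(bounded_record_stream(bounds), 1000))
```

Because the generator is lazy, the same code serves both a small unit-test batch and a much larger stress-test volume.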

How It Works

Use Case: Iris Flower Classification Dataset

To validate our approach and methodology of data generation, we generated synthetic data for the famous iris flower classification dataset. The original dataset, as well as the synthetic dataset generated by us, contains 3 classes of 50 instances each, where each class refers to a type of iris plant.
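The generation method itself is not described in this article. Purely as an illustration of what class-wise generation can look like, the sketch below fits a mean and standard deviation per class and feature, samples each feature from a normal distribution, and clips the result to the observed range. Independent per-feature Gaussian sampling is an assumption made here, not FORMCEPT's method, and the rows in `SAMPLE_ROWS` are a small illustrative sample in the iris layout rather than the full dataset.

```python
import random
import statistics

# Illustrative rows in the iris layout:
# (sepal length, sepal width, petal length, petal width, class).
SAMPLE_ROWS = [
    (5.1, 3.5, 1.4, 0.2, "setosa"),
    (4.9, 3.0, 1.4, 0.2, "setosa"),
    (4.7, 3.2, 1.3, 0.2, "setosa"),
    (5.4, 3.9, 1.7, 0.4, "setosa"),
    (7.0, 3.2, 4.7, 1.4, "versicolor"),
    (6.4, 3.2, 4.5, 1.5, "versicolor"),
    (6.9, 3.1, 4.9, 1.5, "versicolor"),
    (5.5, 2.3, 4.0, 1.3, "versicolor"),
    (6.3, 3.3, 6.0, 2.5, "virginica"),
    (5.8, 2.7, 5.1, 1.9, "virginica"),
    (7.1, 3.0, 5.9, 2.1, "virginica"),
    (6.3, 2.9, 5.6, 1.8, "virginica"),
]


def fit_class_profiles(rows):
    """Group rows by class and record (mean, stdev, min, max) per feature."""
    by_class = {}
    for *values, label in rows:
        by_class.setdefault(label, []).append(values)
    profiles = {}
    for label, samples in by_class.items():
        columns = list(zip(*samples))  # one tuple of values per feature
        profiles[label] = [
            (statistics.mean(c), statistics.stdev(c), min(c), max(c))
            for c in columns
        ]
    return profiles


def generate(profiles, per_class, seed=0):
    """Sample `per_class` synthetic rows for every class in `profiles`."""
    rng = random.Random(seed)
    synthetic = []
    for label, profile in profiles.items():
        for _ in range(per_class):
            row = []
            for mean, stdev, low, high in profile:
                value = rng.gauss(mean, stdev)
                # Clip to the observed range so generated values stay bounded.
                row.append(round(min(max(value, low), high), 1))
            synthetic.append((*row, label))
    return synthetic


profiles = fit_class_profiles(SAMPLE_ROWS)
synthetic = generate(profiles, per_class=50)  # 3 classes of 50 instances each
```

Sampling each feature independently ignores the correlations between features, so a generator meant to reproduce a correlation matrix would need a joint model instead.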

Table 1: Classes and Features in Iris Flower Classification Dataset

Classes | Features
Iris Setosa | Sepal Length, Sepal Width, Petal Length, Petal Width
Iris Versicolor | Sepal Length, Sepal Width, Petal Length, Petal Width
Iris Virginica | Sepal Length, Sepal Width, Petal Length, Petal Width


Below are the statistics for the original dataset and our synthetic dataset. Each cell lists the original value first, followed by the synthetic value.

Table 2: Statistics of Synthetic Dataset Generated by FORMCEPT: Very Close to Those of the Original Dataset

Attribute | Min | Max | Mean | Correlation | Stdev
Sepal Length | 4.3 / 4.3 | 7.9 / 7.9 | 5.84 / 5.77 | (not given) | 0.83 / 0.87
Sepal Width | 2.0 / 2.0 | 4.4 / 4.3 | 3.05 / 3.12 | (not given) | 0.43 / 0.64
Petal Length | 1.0 / 0.9 | 6.9 / 6.8 | 3.76 / 3.60 | High / High | 1.76 / 1.72
Petal Width | 0.1 / 0.1 | 2.5 / 2.5 | 1.20 / 1.21 | High / High | 0.76 / 0.72
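A comparison of this kind can be computed with a few lines of Python. The sketch below is a minimal example using only the standard library; the helper names `summarize` and `compare` and the two short columns of values are hypothetical and are not the figures from Table 2.

```python
import statistics


def summarize(column):
    """Return min, max, mean and sample standard deviation of one attribute."""
    return {
        "min": min(column),
        "max": max(column),
        "mean": round(statistics.mean(column), 2),
        "stdev": round(statistics.stdev(column), 2),
    }


def compare(original, synthetic):
    """Pair up each statistic as (original, synthetic) for one attribute."""
    a, b = summarize(original), summarize(synthetic)
    return {name: (a[name], b[name]) for name in a}


# Hypothetical values for a single attribute, for illustration only.
original_col = [5.1, 4.9, 6.3, 5.8, 7.0, 6.4]
synthetic_col = [5.0, 4.8, 6.5, 5.9, 6.9, 6.2]
report = compare(original_col, synthetic_col)
```

Running `compare` once per attribute gives one row of an original-versus-synthetic table.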

Correlation Matrix

Correlation Matrices of Original and Synthetic Datasets Are Very Similar
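One way to put a number on how similar two correlation matrices are is to compute the Pearson correlation for every pair of features in each dataset and take the largest absolute difference between the two matrices. The sketch below does this with the standard library only; the helper names and the two small sets of columns are hypothetical, and the maximum absolute difference is an assumed similarity measure rather than the one used for the figure.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Correlation of every feature column with every other column."""
    return [[pearson(a, b) for b in columns] for a in columns]


def max_abs_difference(m1, m2):
    """Largest element-wise gap between two matrices of the same shape."""
    return max(abs(p - q) for r1, r2 in zip(m1, m2) for p, q in zip(r1, r2))


# Hypothetical feature columns (two features, five rows each).
original_columns = [[5.1, 4.9, 6.3, 5.8, 7.0], [1.4, 1.4, 6.0, 5.1, 4.7]]
synthetic_columns = [[5.0, 4.8, 6.5, 5.9, 6.9], [1.5, 1.3, 5.8, 5.0, 4.9]]

original_corr = correlation_matrix(original_columns)
synthetic_corr = correlation_matrix(synthetic_columns)
gap = max_abs_difference(original_corr, synthetic_corr)
```

A small `gap` indicates that the pairwise relationships between features are preserved in the synthetic data.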

The key inference that can be drawn from both scatter plots is that the species Setosa has features distinct from those of Versicolor and Virginica. As a result, it forms an isolated cluster that is linearly separable, whereas Versicolor and Virginica have overlapping feature clusters.

Thus, the summary statistics, correlation matrix and feature clusters of our synthetic dataset showed nearly the same behaviour as those of the original dataset. We validated this further by training a machine learning model to classify a flower into one of the 3 classes given its 4 features, and then comparing the results of the two models (one trained with original data, the other with synthetic data). We used Keras to train a neural network for this classification task.

Table 3: Machine Learning Model - Hyperparameters

Parameter | Value
Epochs | 200
Layers | 4 -> 4 -> 3
Activation Function | Sigmoid
Optimizer | Adam
Loss Function | Categorical Cross Entropy
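To make the architecture concrete, here is a plain-Python sketch of a single forward pass and loss computation for a 4 -> 4 -> 3 network. It reads the layer description as 4 input features, one hidden layer of 4 sigmoid units and 3 output units. The softmax on the output layer, the random weight initialisation and all function names are assumptions made for this illustration; training with Adam for 200 epochs is omitted, and this is not the Keras code used for the experiment.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases):
    """Weighted sum per output unit; weights[j][i] links input i to unit j."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(zs):
    peak = max(zs)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


def forward(features, params):
    """4 inputs -> 4 sigmoid hidden units -> 3 class probabilities."""
    hidden = [sigmoid(z) for z in dense(features, params["w1"], params["b1"])]
    return softmax(dense(hidden, params["w2"], params["b2"]))


def categorical_cross_entropy(probabilities, one_hot):
    return -sum(t * math.log(p) for p, t in zip(probabilities, one_hot))


rng = random.Random(0)
params = {
    "w1": [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(4)],
    "b1": [0.0] * 4,
    "w2": [[rng.uniform(-0.5, 0.5) for _ in range(4)] for _ in range(3)],
    "b2": [0.0] * 3,
}

# One flower (4 features) whose true class is the first of the 3 classes.
probs = forward([5.1, 3.5, 1.4, 0.2], params)
loss = categorical_cross_entropy(probs, [1, 0, 0])
```

Under this reading the network has only 35 trainable parameters (20 in the hidden layer and 15 in the output layer), which is small enough for a dataset of 150 rows.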

Table 4: Comparing the Accuracy Rates of Machine Learning Model on Iris Flower Classification - Original vs Synthetic Dataset

Dataset | Accuracy (in %)
Original | 93.6
Synthetic | 90.0

Conclusion

While we do not claim that the parameters used by us are the best parameters for this dataset, we are encouraged by the results obtained. We hope to help many more organizations gain competitive advantage through Synthetic Data production and progress towards Smart Data Generation. We will keep sharing our experiences and learnings with you, so please stay tuned. If you would like to know more about what we do and how we can help you, please visit www.formcept.com or write to us at contactus@formcept.com.
