
# Overcoming Data Scarcity and Privacy Challenges with Synthetic Data

### Key Takeaways

• A synthetic dataset is one that resembles the real dataset, which is made possible by learning the statistical properties of the real dataset.
• Synthetic data can help solve the common problem of data scarcity and protect data privacy, making it easier to share data and improve model robustness. This is particularly beneficial for financial institutions.
• To generate synthetic samples, different algorithms should be applied to different types of data. CTGAN is a great open-source project at MIT that provides desirable results for generating synthetic tabular data.
• We explore the power of synthetic data generation through the application of the CTGAN on a payment dataset and learn how to evaluate synthetic data samples.

Data is the lifeblood of artificial intelligence. Without sufficient data, we are unable to train models and then our powerful and expensive hardware sits idle. Data contains the information from which we want models to draw patterns, extract insights, generate predictions, build smarter products, and develop into more intelligent models.

However, data is typically difficult to procure and oftentimes the data collection process can be more arduous and time-consuming than building the actual machine learning models.

Collecting good, clean, high-quality data is a science in itself, and it can be a time-intensive and expensive process. In some cases, data is highly regulated, meaning long lead times to secure permission to access it. Even once access is secured, the dataset might be so small that training models on it is out of the question. Synthetic data can help address these challenges.

Synthetic data is data that is artificially generated rather than collected from real-world events. It is meant to resemble a real dataset but is entirely fake. Data has a distribution, a shape that defines the way it looks. Picture a dataset in a tabular format.

We have all these different columns and there are hidden interactions between the columns, as well as inherent correlations and patterns. If we can build a model to understand the way the data looks, interacts, and behaves, then we can query it and generate millions of additional synthetic records that look, act, and feel like the real thing.
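As a deliberately tiny illustration of "learn the shape, then sample from it," the sketch below fits a single Gaussian to one numeric column and draws new values from it. The invoice amounts and the `fit_gaussian` and `sample_gaussian` helpers are made up for illustration. Real generators such as CTGAN also model the interactions between columns, which this toy ignores.

```python
import random
import statistics

def fit_gaussian(values):
    """Learn the two parameters of a normal distribution from real values."""
    return statistics.mean(values), statistics.stdev(values)

def sample_gaussian(mu, sigma, n, seed=0):
    """Query the fitted model for n synthetic values."""
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]

# Hypothetical invoice amounts standing in for a real column.
real_amounts = [52.0, 61.5, 47.3, 58.9, 55.1, 63.2, 49.8, 57.4]

mu, sigma = fit_gaussian(real_amounts)
synthetic_amounts = sample_gaussian(mu, sigma, n=1000)

print(f"real mean={mu:.2f}, synthetic mean={statistics.mean(synthetic_amounts):.2f}")
```

Eight real values become a thousand synthetic ones, but only because the model assumed a simple shape. The quality of the output is bounded by how well that assumed shape matches the real data.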

Now, synthetic data isn’t a magical process. We can’t start with just a few poor-quality data points and expect a miraculous high-quality synthetic dataset from our model. As the old saying goes, "garbage in, garbage out": to create high-quality synthetic data, we need to start with a dataset that is both high-quality and plentiful. With this, it is possible to expand our current dataset with high-quality synthetic data points.

In this article, I will discuss the benefits of using synthetic data and which types are most appropriate for different use cases, and then explore its application in financial services.

## Why is synthetic data useful?

If we already have a decent high-quality dataset, is there any point in trying to acquire additional fake data points? The answer should always be an emphatic, ‘Yes!’ And here’s why.

Say you have a dataset that is heavily skewed on the column you are trying to predict, and you can’t obtain more data for the minority class. We can use synthetic data to generate more data points for the minority class, rebalancing the training set in the hope of improving model performance. For example, suppose the task is to predict whether a piece of fruit is an apple or an orange from attributes such as color, shape, and seasonality, and there are 4,000 samples for apples but only 200 for oranges. Any machine learning algorithm is then likely to be biased towards apples because of the large class imbalance, which could result in an inaccurate model. However, if we generate 3,800 more synthetic samples for oranges, both classes have 4,000 samples, so the model is far less likely to be biased toward either fruit and can make more accurate predictions.
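The arithmetic behind that rebalancing can be sketched in a few lines. The helper name `synthetic_samples_needed` is hypothetical, and the rule it applies (top every class up to the size of the largest one) is just one simple balancing strategy.

```python
from collections import Counter

def synthetic_samples_needed(labels):
    """Return how many synthetic rows each class needs to match the largest class."""
    counts = Counter(labels)
    target = max(counts.values())
    return {label: target - n for label, n in counts.items()}

# Class counts from the fruit example above.
labels = ["apple"] * 4000 + ["orange"] * 200
needed = synthetic_samples_needed(labels)
print(needed)  # {'apple': 0, 'orange': 3800}
```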

Additionally, say you have a set of data that you wish to share. The caveat here is that the data is sensitive, so data privacy is extremely important. Many datasets contain personally identifiable information (PII) or other sensitive attributes, such as a person’s full name, social security number, or bank account number, making it difficult to share them with a third party for any kind of data analysis or model building. This gets into the hassle of anonymizing data, picking and choosing non-personally identifiable information, sitting with legal teams, creating secure data transfer processes, and much more. This process can lead to months of delay in creating a solution, as the data needed for a model can’t be shared immediately. To combat this, we can generate synthetic samples from the real dataset that preserve its important characteristics and can be shared more easily, with far less risk of invading data privacy or leaking personal information.

## Why might it be useful in financial services?

Financial services are at the top of the list when it comes to concerns around data privacy. The data is sensitive and highly regulated. It’s no surprise that the use of synthetic data has grown rapidly in the financial services field: in addition to improving machine learning model performance, it allows institutions to share their data more easily.

It’s also difficult to obtain more financial data. For example, to get more customer checking account data to feed a model, we need more customers to open up checking accounts. Then we need to wait a length of time for them to start using the accounts and building up transaction histories. However, with synthetic data, we can look at our current customer base and synthesize new checking accounts with their associated usage, allowing us to use this data right away.

## Different types of synthetic data

If you google synthetic data, you will find many different types of data being synthesized. Most commonly, you will see unstructured data, such as synthetic paintings from image data, synthetic videos for advertisements, and synthetic audio for popular public figures. These are some really interesting data types to synthesize, but in financial services, just like many other industries, we commonly deal with databases and flat tabular files containing numerical, categorical, and text-based data points. Additionally, we have data ordered by time and data tables that are relational in nature.

It is important to note that there isn’t one perfect synthetic data generation algorithm that can handle any type of data. When looking into synthesizing your dataset, you need to look at its characteristics and understand which algorithm is right for your data.

## Popular methods for generating synthetic data

So, if you google "synthetic data generation algorithms" you will probably see two common phrases: GANs and Variational Autoencoders. These are two classes of algorithms that have generative properties, i.e., the ability to create data. Heavy research and development have been done around these models and many synthetic data architectures, from images to audio to tabular data to text data, have been created using these core methodologies. Let’s briefly discuss these two architectures.

GANs, or generative adversarial networks, consist of two neural networks, the generator and the discriminator, that play a game against one another. The generator tries to generate fake or synthetic data while the discriminator tries to determine whether the data it is seeing is real or fake. As the two networks battle it out, the generator learns to create better and better fake data, which makes the task harder for the discriminator.
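This game is usually written as the minimax objective from the original GAN formulation, where $G$ is the generator, $D$ is the discriminator, $p_{\text{data}}$ is the real data distribution, and $p_z$ is the noise distribution fed to the generator:

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator pushes $V$ up by scoring real rows near 1 and generated rows near 0, while the generator pushes it down by producing rows that the discriminator scores as real. Specific GAN variants may swap in a different loss, so treat this as the textbook form rather than the exact loss of any one library.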

Variational autoencoders are neural networks trained to reconstruct their own input. In traditional supervised machine learning tasks, we have an input and a separate output to predict. With autoencoders, the input itself is the target: the network tries to reconstruct it. The network has two parts: the encoder and the decoder. The encoder compresses the input into a smaller representation. The decoder takes this compressed representation and tries to reconstruct the original input. The idea here is that we learn how to represent the data by scaling it down in the encoder and building it back up in the decoder. If we can accurately rebuild the original input, then we can query the decoder to generate synthetic samples.
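What makes the autoencoder variational is that the encoder outputs a distribution over the compressed representation rather than a single point. In the standard formulation, with encoder $q_\phi(z \mid x)$, decoder $p_\theta(x \mid z)$, and a prior $p(z)$ (typically a standard normal), training maximizes the evidence lower bound:

```latex
\mathcal{L}(\theta, \phi; x) =
  \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big]
  - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)
```

The first term rewards accurate reconstruction. The second keeps the encoded distribution close to the prior, which is what lets us later draw $z \sim p(z)$ and pass it through the decoder to obtain synthetic samples.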

There are many machine learning algorithms for generating synthetic data out there, but which one performs the best all depends on the specific data types that you are working with. So, it would be smart to explore the data before making a choice.

## How to evaluate synthetic data samples

Once you have a synthetic dataset, you need to ensure that it is of high quality. There are many synthetic data generation algorithms for different types of data, but how do we make sure that the generated, fake samples truly mimic the real data? I will now introduce some methods and tips on how to evaluate synthetic data. Since data exists in many different forms, we will focus on tabular data that is not time series.

There are two core components for validating synthetic data: its statistical similarity to the real dataset and its machine learning efficacy.

### Statistical Similarity

As previously mentioned, data has a distribution. It has a look and feel. It has interactions with other data fields and behaves in its own respective manner. When we have a synthetic dataset and a real dataset, we want to make sure we have similar distributions. We want to make sure the column distribution looks the same. If we have data imbalances, we want to make sure our synthetic dataset captures these imbalances. Here, we want to plot side-by-side histograms, scatterplots, and cumulative sums of each column to ensure we have a similar look.
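Plots are the main tool here, but a single number per column can complement them. The sketch below computes the two-sample Kolmogorov-Smirnov statistic, the largest gap between the empirical cumulative distributions of a real and a synthetic column. This is an additional check of our own choosing rather than something the plots above produce, and the column values are hypothetical. If SciPy is available, `scipy.stats.ks_2samp` provides a tested implementation.

```python
from bisect import bisect_right

def ks_statistic(real, synthetic):
    """Largest vertical gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    real_sorted = sorted(real)
    synth_sorted = sorted(synthetic)
    gap = 0.0
    # The gap can only change at observed values, so checking those is enough.
    for x in real_sorted + synth_sorted:
        cdf_real = bisect_right(real_sorted, x) / len(real_sorted)
        cdf_synth = bisect_right(synth_sorted, x) / len(synth_sorted)
        gap = max(gap, abs(cdf_real - cdf_synth))
    return gap

# Hypothetical values for one numeric column.
real_col = [1, 2, 2, 3, 4, 5, 5, 6]
good_synth = [1, 2, 3, 3, 4, 5, 6, 6]
bad_synth = [7, 8, 8, 9, 9, 10, 11, 12]

print(ks_statistic(real_col, good_synth))  # 0.125
print(ks_statistic(real_col, bad_synth))   # 1.0
```

A value close to 0 means the synthetic column follows the real one closely, while a value close to 1 means the two barely overlap.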

The next step is to look at correlations. If we have interactions between columns in our real dataset, then we should expect a properly generated synthetic dataset to have similar interactions. To do so, we can plot a correlation matrix of both the real and synthetic sets as well as a difference in correlation values between the two to get an idea of how similar or different the correlation matrices are.
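The "difference in correlation values" can be sketched without any plotting. The example below is dependency-free and uses two hypothetical two-column tables. All helper names are ours. With pandas DataFrames that contain only numeric columns, the same comparison is `(real.corr() - synthetic.corr()).abs()`.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def correlation_matrix(table):
    """table maps column name -> list of values; returns {(col_a, col_b): r}."""
    cols = list(table)
    return {(a, b): pearson(table[a], table[b]) for a in cols for b in cols}

def correlation_gap(real, synthetic):
    """Absolute difference between matching cells of the two correlation matrices."""
    real_corr = correlation_matrix(real)
    synth_corr = correlation_matrix(synthetic)
    return {pair: abs(real_corr[pair] - synth_corr[pair]) for pair in real_corr}

# Hypothetical tables: the synthetic one gets the amount/days_late relationship backwards.
real = {"amount": [10, 20, 30, 40, 50], "days_late": [1, 2, 3, 4, 5]}
synthetic = {"amount": [12, 18, 33, 41, 48], "days_late": [5, 4, 3, 2, 1]}

gap = correlation_gap(real, synthetic)
print(max(gap.values()))
```

Here each column looks plausible on its own, yet the large gap for the `("amount", "days_late")` pair shows that the interaction between the columns was not preserved. This is exactly the kind of failure that per-column histograms cannot reveal.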

### Machine Learning Efficacy

If our dataset contains a target variable or column that we are interested in predicting and building a model from, we can dive into machine learning efficacy. This measures how well the synthetic data performs under different models. The idea here is that if we can build and train a model on the synthetic dataset, and it performs well upon evaluation on real data, then we have a good synthetic dataset. To do this, we look at metrics that are appropriate for the problem at hand, such as (but not limited to) F1 score for classification and RMSE for regression. The scores from the individual classification or regression models can then be averaged to give a final machine learning efficacy score for the synthetic data.
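A minimal sketch of this "train on synthetic, test on real" idea follows. The nearest-centroid classifier is a stand-in chosen only because it fits in a few lines, not the kind of model an evaluation library would necessarily build. The rows and helper names are hypothetical.

```python
def centroid(rows):
    """Column-wise mean of a list of feature rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def fit_nearest_centroid(features, labels):
    """Train: store one centroid per class."""
    classes = sorted(set(labels))
    return {
        c: centroid([row for row, label in zip(features, labels) if label == c])
        for c in classes
    }

def predict(model, row):
    """Predict the class whose centroid is closest (squared Euclidean distance)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(row, model[c]))
    return min(model, key=dist)

def f1_score(y_true, y_pred, positive):
    """F1 = 2*TP / (2*TP + FP + FN) for the chosen positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

# Hypothetical rows: [invoice_amount, days_to_settle] -> late or not.
synthetic_X = [[50, 10], [55, 12], [60, 45], [65, 50]]
synthetic_y = ["No", "No", "Yes", "Yes"]
real_X = [[52, 11], [58, 14], [61, 40], [63, 48], [57, 30]]
real_y = ["No", "No", "Yes", "Yes", "No"]

model = fit_nearest_centroid(synthetic_X, synthetic_y)  # train on synthetic
predictions = [predict(model, row) for row in real_X]    # evaluate on real
print(f1_score(real_y, predictions, positive="Yes"))     # 0.8
```

In practice, we would repeat this with several model types and compare against the same models trained on real data. A small gap between the two suggests the synthetic data carries the signal needed for the prediction task.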

## Synthetic data generation in finance

Choosing the right synthetic data generation algorithm depends greatly on the type of data we are dealing with. Since most of the datasets that we work with in the financial industry exist in tabular format, it would be preferable for us to use a machine learning model that is designed specifically for tabular data. Fortunately, there is an open-source project at MIT that developed exactly such an algorithm called CTGAN (Conditional GAN for Tabular Data). As previously discussed, GANs consist of two neural networks: a generator and a discriminator. The CTGAN is a spin-off of this methodology and takes generating data to a different level.

As data scientists, we often deal with tabular data with mixed data types, from numerical to categorical columns. With numerical columns, the distribution of values can become much more complex than an ideal Gaussian distribution. With categorical columns, the common problem is class imbalance, meaning there will be too many data points in some categories but not enough data points in the other categories. It is quite a challenge for traditional GAN models to successfully learn from these data points with non-Gaussian distributions or class imbalances. To produce highly realistic data of this nature we turn to the CTGAN. This model separates the numerical and categorical columns and uses alternate methods to learn the distributions. The Variational Gaussian Mixture Model can detect the modes of continuous columns, while the conditional generator and training-by-sampling will solve any prominent class imbalance problems. Then the two fully-connected layers in the network can efficiently learn the data distributions and the network will generate samples using mixed activation functions since there are both numerical and categorical values.
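One way to picture how training-by-sampling softens class imbalance, based on the description in the CTGAN paper, is that the category to condition on is drawn with probability proportional to the logarithm of its frequency rather than the frequency itself. The sketch below is a simplified illustration of that idea under this assumption, not the library's code, and the helper name is hypothetical.

```python
import math

def sampling_probabilities(counts, log_frequency=True):
    """Turn category counts into the probabilities used to pick a training condition."""
    weights = {
        category: (math.log(n) if log_frequency else float(n))
        for category, n in counts.items()
    }
    total = sum(weights.values())
    return {category: w / total for category, w in weights.items()}

# Counts reused from the earlier fruit example.
counts = {"apple": 4000, "orange": 200}

raw = sampling_probabilities(counts, log_frequency=False)
logged = sampling_probabilities(counts)
print({k: round(v, 3) for k, v in raw.items()})     # {'apple': 0.952, 'orange': 0.048}
print({k: round(v, 3) for k, v in logged.items()})  # {'apple': 0.61, 'orange': 0.39}
```

With raw frequencies, the generator would be asked to produce an orange in fewer than 5% of training steps. With log frequencies, that rises to roughly 39%, so the minority category is explored far more evenly.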

Figure 1: Diagram of a synthetic data generation model with CTGAN

Next, let’s see how we can use the CTGAN in a real-life example in the world of financial services.

To start, we import all the necessary libraries. The CTGAN model is built on top of PyTorch and the table_evaluator library is designed specifically for evaluating tabular data, which will be quite useful to see how our generated samples are performing.

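import pandas as pd
import numpy as np
import seaborn as sns  # used for the correlation heatmaps later on
from dateutil import parser
import torch
from ctgan import CTGANSynthesizer  # exposed as `CTGAN` in newer ctgan releases
from table_evaluator import load_data, TableEvaluator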

The dataset used in this example is the IBM Late Payment Histories dataset that is publicly available through Kaggle. For our example, we will be trying to predict if the payment will be late or not.

Let’s first read in the data and look at the first five rows. We will need to specify the path where our IBM_Late_Payment.csv file is located.

df = pd.read_csv('… /IBM_Late_Payment.csv')
df.head()

Figure 2: Original payment data samples

Now it’s time for data preprocessing. We want to make every column readable for the CTGAN model. We first map the ‘customerID’ column to a finite number of discrete integers, which will be called ‘CustomerIDMap’. Then, we convert all the columns that contain dates to a numerical representation that the CTGAN can effectively model.

def convert_dates(df, date_cols):
    # Turn dates into epochs (seconds)
    for i in date_cols:
        df[i] = df[i].apply(lambda x: parser.parse(x).timestamp())
    return df

customerID = df.customerID.unique().tolist()
customerID_map = dict(zip(customerID,range(len(customerID))))
df['CustomerIDMap'] = df['customerID'].apply(lambda i: customerID_map[i])
df.drop('customerID',axis=1,inplace=True)

df = convert_dates(df, ['PaperlessDate','InvoiceDate','DueDate','SettledDate'])

We also need to create a label column that dictates whether the payment is late or not. This will be the target column that we are trying to predict from the other feature columns.

df.loc[df['DaysLate'] > 0, 'IsLate'] = 'Yes'
df.loc[df['DaysLate'] <= 0, 'IsLate'] = 'No'
df.head()

Figure 3: Pre-processing the data points

We will be handling 2,466 data points, which is considered quite small for training a synthetic data generation algorithm. Usually, the more data points the better, but depending on the data quality, sometimes with fewer data points we can achieve the desired model performance.

df.shape
Out[105]: (2466, 13)

Again, to make sure all the columns have the correct type, we will convert all categorical columns to string values so that the model doesn’t confuse these columns with the continuous ones.

df.dtypes

df['countryCode'] = df['countryCode'].astype(str)
df['CustomerIDMap'] = df['CustomerIDMap'].astype(str)
df['invoiceNumber'] = df['invoiceNumber'].astype(str)

Another thing worth noting is that the CTGAN model can’t handle categorical columns with high cardinalities. So, any column with a large number of unique identifiers or infinite discrete values will cause issues in training. Such columns will need to be removed from training. In our case, we will need to remove the ‘invoiceNumber’ column. Since the dataset already contains features derived from the date columns, we will also remove the date columns themselves in this particular case.

discrete_cols = df.dtypes[(df.dtypes == 'object')].index.tolist()
for i in discrete_cols:
    print(i, len(df[i].unique()))

discrete_cols.remove('invoiceNumber')
df_training = df[['countryCode','InvoiceAmount','Disputed','PaperlessBill','DaysToSettle','DaysLate','CustomerIDMap','IsLate']]

Now it’s time to get prepared for training. The CTGAN model is a neural network-based model that requires intense training sessions, thus GPU usage is recommended.

To start training, we create an instance of the CTGANSynthesizer class and fit it with our data, specifying discrete columns. We then run a long training session on our relatively small dataset and train for 750 epochs with a batch size of 100.

ctgan = CTGANSynthesizer(batch_size=100)
ctgan.fit(df_training, discrete_cols, epochs=750)

After training is done, we can generate as many data samples as we want. For comparison, we will sample the same size as the original training dataset. The model returns all data as strings, so we recast the data columns in the synthetic set to be the same as in the real set.

samples = ctgan.sample(df_training.shape[0])

tys = df_training.dtypes.tolist()
for idx, i in enumerate(df_training.columns.tolist()):
    samples[i] = samples[i].astype(tys[idx])

Let’s take a look at the generated samples. They look very realistic, don’t they? But to see how they really performed we need to use the proper evaluation methods mentioned earlier.

samples.head()

Figure 4: Synthetic data samples generated by CTGAN

We create a TableEvaluator instance, passing in the real set and the synthetic samples, also specifying all discrete columns.

table_evaluator = TableEvaluator(df_training, samples, cat_cols=discrete_cols)
table_evaluator.visual_evaluation()

Looking at the cumulative sums and distribution plots as a way to compare statistical similarity, we can tell that the synthetic samples represent the real ones very well.

Figure 5: Cumulative sum for each feature (blue for real samples and orange for fake samples)

Figure 6: Distribution or histogram for each feature (blue for real samples and orange for fake samples)

Besides distribution visualizations, we will also evaluate our synthetic samples based on machine learning efficacy. We call the evaluate function from table_evaluator and pass in the target column. From here, table_evaluator will build models from the real and fake data and evaluate against each respectively. The numbers all look great, once again confirming that the model has done a great job.

table_evaluator.evaluate(target_col='IsLate')

Figure 7: Evaluation metrics of the synthetic samples

Last but not least, we compare the correlation matrix of the real data to that of the generated samples. If the correlation matrix of the fake data looks similar to that of the real data, then we have a good synthetic dataset as our synthetic data has similar interactions to the real data. For ours, they do look similar, which is great.

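# Correlations are only defined for numeric columns, so select those explicitly
sns.heatmap(df_training.select_dtypes(include='number').corr(), cmap='coolwarm', center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})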

Figure 8: Heatmap of the real data

# Same numeric-only selection for the synthetic samples
sns.heatmap(samples.select_dtypes(include='number').corr(), cmap='coolwarm', center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

Figure 9: Heatmap of the synthetic data

From the unique model structure itself to the great results from our example, we can see that CTGAN is a fantastic tool for learning and generating synthetic samples on tabular data. If you would like to learn more about it, please check out the original GitHub project and share your thoughts with us!

Dawn Li is a data scientist at Finastra’s Innovation Lab, where she stays up-to-date with the latest advances and applications of machine learning and applies them to solve problems in financial services. Dawn holds degrees in applied mathematics and statistics from Georgia Institute of Technology.
