Synthetic data: The future of machine learning?

March 05, 2023

Before a machine-learning (ML) model can perform a task, it needs to be trained. For a computer vision model, this can take millions of example images gathered into a massive dataset. However, getting this data can be problematic: using real images can run afoul of practical and ethical concerns ranging from copyright law and privacy issues to bias against certain demographics.

However, researchers are now discovering that using synthetic data instead of real-world data doesn’t just sidestep these issues; in some situations, it can also result in equal or higher accuracy.

What is synthetic data, and why use it?

Synthetic data is what it sounds like: artificially manufactured data. Traditionally, it has been used as a stand-in for real-world data when getting the latter would be too costly or difficult. This is especially important for machine learning, which can require a lot of training data.

Developers and analysts can also generate synthetic data to meet certain criteria that may simply not occur in the real world. The data can be modified to improve the model and its training, and once the synthetic environment is ready, it’s easy to produce as much data as you need.
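To make the idea of generating data "to meet certain criteria" concrete, here is a minimal sketch using only Python's standard library. Everything in it is an assumption made for illustration: the function name `make_synthetic_transactions`, the record fields, and the rule that fraudulent transactions skew toward larger amounts are all hypothetical, not taken from any real generator or dataset.

```python
import random

def make_synthetic_transactions(n, fraud_rate, seed=0):
    """Generate n made-up transaction records with a chosen share of fraud cases."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    records = []
    for i in range(n):
        is_fraud = rng.random() < fraud_rate
        # Hypothetical rule: fraudulent transactions skew toward larger amounts.
        mean_amount = 900.0 if is_fraud else 60.0
        amount = round(rng.expovariate(1.0 / mean_amount), 2)
        records.append({"id": i, "amount": amount, "is_fraud": is_fraud})
    return records

# Ask for a rare case far more often than it might show up in collected data.
data = make_synthetic_transactions(n=10_000, fraud_rate=0.30)
fraud_share = sum(r["is_fraud"] for r in data) / len(data)
print(f"{len(data)} records, {fraud_share:.1%} flagged as fraud")
```

Changing `n` or `fraud_rate` is all it takes to produce a larger dataset or a different mix of cases, which is the "as much data as you need" property described above.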

Perhaps the most notable benefit is that, since it’s not real-world information, there’s no risk of sensitive or confidential data being leaked. In general, using this sort of data cuts down on a lot of red tape.

What is synthetic data generation?

Synthetic data generation is the act of producing synthetic data using a generator. You can use synthetic data generators to have data ready for use in minutes rather than spending days, weeks, or months trying to collect it. AI-powered synthetic data generators are available online, in the cloud, or on-premises.

Before selecting a synthetic data generator, be sure to do your research. Choose one that is truly AI-powered, retains the structure of your data, and has additional built-in privacy checks. For more information on healthy data testing, refer to this guide.

A case study: How MIT used synthetic data to pretrain their models more accurately

Researchers from MIT, the MIT-IBM Watson AI Lab, and other contributors recently tested the boundaries of using synthetic data for their ML training. Instead of designing a customized image generation program, they assembled a dataset of 21,000 publicly available programs from the internet, and used this large collection of basic image generation programs to train a computer vision model. Each program was just a few lines of code, simple and uncurated.
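For a sense of what "a few lines of code" that generate images can look like, here is a small illustrative sketch. It is not one of the programs from the researchers' dataset; the function `generate_image` and its sine-wave pattern are assumptions chosen only to show how a short procedural program can turn a seed into an endless supply of distinct images.

```python
import math
import random

def generate_image(seed, size=32):
    """Return a size x size grid of grayscale values in the range 0..255."""
    rng = random.Random(seed)
    # Random frequencies and phase define the pattern for this seed.
    fx, fy = rng.uniform(0.1, 1.0), rng.uniform(0.1, 1.0)
    phase = rng.uniform(0, 2 * math.pi)
    image = []
    for y in range(size):
        row = []
        for x in range(size):
            value = math.sin(fx * x + phase) * math.cos(fy * y)  # in [-1, 1]
            row.append(int((value + 1) / 2 * 255))  # rescale to 0..255
        image.append(row)
    return image

# Each seed yields a different pattern, so a dataset is just a loop over seeds.
images = [generate_image(seed) for seed in range(100)]
print(len(images), "images of", len(images[0]), "x", len(images[0][0]), "pixels")
```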

The models they created with this dataset of programs classified images better than other state-of-the-art computer vision models that had been pretrained with synthetic data, which in itself was a huge accomplishment. Increasing the number of image programs in the dataset also led to an increase in model performance, scaling logarithmically.

So how did it fare against models made with real-world data?

Well, it didn’t beat them, but it narrowed the divide considerably. Using their technique, the researchers closed the gap between models pretrained on synthetic data and models trained on real data by 38 percent.

“There is still a gap to close with models trained on real data. This gives our research a direction that we hope others will follow,” said Manel Baradad, the lead author of the paper describing this technique.

That said, in another study, again involving MIT, researchers were able to train models on synthetic data that performed even better than models trained on real data. MIT and Boston University researchers built a synthetic dataset of 150,000 video clips that captured human actions, which they used to train machine learning models. They then evaluated these models on real-world videos to see how the synthetic data measured up, and found the synthetically trained models had higher accuracy.

“The ultimate goal of our research is to replace real data pretraining with synthetic data pretraining. There is a cost in creating an action in synthetic data, but once that is done, then you can generate an unlimited number of images or videos by changing the pose, the lighting, etc. That is the beauty of synthetic data,” said Rogerio Feris, principal scientist and manager at the MIT-IBM Watson AI Lab.

Other examples of synthetic data use

Synthetic data is being used in more than just academic research; it’s also being used in commercial applications.

The rise of synthetic data use is partially tied to cloud computing, because cloud-based virtual machines (VMs) give simulations the speed and computing power to run far faster than they could on local hardware.

“A large simulation that might take years to process if executed on a local server can be completed on cloud-based virtual machines within hours,” said Javier Tordable, Technical Director in Google Cloud’s CTO Office.

“This all comes back to the central goal of synthetic data: to use artificially generated data—which is similar to real-world data in a meaningful way—in order to overcome the limitations or restrictions of obtaining that real-world data.”

While synthetic data can be very useful for training machine learning models, it is important to understand that once a model is deployed, it will likely have real data passed through it to make real predictions. Because the model was trained on synthetic data, it becomes even more important to monitor both its prediction accuracy and any data drift, so that the predictions used in decision making stay accurate.
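One way to watch for data drift is to compare the distribution of a feature in the training data against the same feature in live data. The sketch below uses the population stability index (PSI), a common drift score, as one possible choice; the article does not prescribe a specific method. The function name, the bin count, and the Gaussian samples standing in for training and production data are all assumptions made for illustration.

```python
import math
import random

def population_stability_index(expected, actual, bins=10):
    """Compare two samples of one numeric feature; larger values mean more drift."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def bin_shares(values):
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

rng = random.Random(42)
training = [rng.gauss(0.0, 1.0) for _ in range(5000)]      # stand-in for synthetic training data
live_same = [rng.gauss(0.0, 1.0) for _ in range(5000)]     # live data, same distribution
live_shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]  # live data after a shift

psi_same = population_stability_index(training, live_same)
psi_shifted = population_stability_index(training, live_shifted)
print(f"PSI without drift: {psi_same:.3f}, PSI with drift: {psi_shifted:.3f}")
```

A commonly cited rule of thumb treats a PSI above roughly 0.2 to 0.25 as a shift worth investigating, though the right threshold depends on the feature and the application. In practice, a check like this would run on a schedule for each important input feature, alongside accuracy monitoring on labeled real-world examples.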

How to get real, hands-on experience in machine learning

Are you interested in taking a deeper dive into machine learning? Be sure to check out Pluralsight’s Machine Learning Literacy course, which covers the workflows, modeling techniques, and strategies behind any machine learning solution. Pluralsight’s content library is always expanding to include more educational courses, articles, webinars, and podcasts. Don’t have an account? Sign up for free today.