Synthetic Data Platforms for Training AI Algos Cheaply

December 17, 2018

We recently introduced you to the best facial recognition algorithms out there today. Computer vision is one of the key artificial intelligence technologies powering facial recognition – so you can tag yourself on Facebook or indulge some furry fetish on Snapchat – as well as everything from self-driving cars to retail automation. These achievements rely broadly on three things: Einstein-level geniuses who do more thinking while sitting on the toilet than most of us do all day long; advances in computing power, including chips specially designed to handle AI algorithms; and, in technical terms, boatloads of data.

Plenty of companies leverage our online data exhaust every day in order to predict the future based on a few hundred million tweets. However, the sort of high-quality, realistic datasets used to teach machines how to detect such patterns or recognize a face require time and money to build. That’s partly why more and more startups and even mega-corporations are turning to synthetic data to train their algorithms.

What is Synthetic Data?

It used to be that everything synthetic was bad in some way, whether we’re talking about the height of 1970s fashion in polyester or the sorts of artificial colors that don’t exist outside of a bowl of Froot Loops. Today, synthetic biology startups like Ginkgo Bioworks are designing microbes to create natural fertilizers and flavors or saving the planet (and a few cows) by growing meat in the lab. Similarly, it used to be that only real-world data was thought appropriate fodder for feeding AI algorithms. No longer.

First, let’s get a working definition for synthetic data. We’ll use the one put forth by Evan Nisselson, a partner at LDV Capital, which specializes in funding computer vision and other AI startups, in a piece he wrote for TechCrunch:

Synthetic data is computer-generated data that mimics real data; in other words, data that is created by a computer, not a human. Software algorithms can be designed to create realistic simulated, or ‘synthetic,’ data.

This synthetic data, he adds, “assists in teaching a computer how to react to certain situations or criteria, replacing real-world-captured training data.” Bernard Marr, a futurist and self-proclaimed tech influencer, writes for Forbes that one way to create synthetic data is simply to anonymize real-world data by stripping out private information like names, addresses, and social security numbers. Other methods to generate synthetic data include using different AI techniques, computer games, virtual reality, and other types of software.
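To make the anonymization route concrete, here is a minimal sketch in Python. The field names and the customer records are made up for illustration, and simply dropping the obvious identifiers is only the crudest version of the idea; real anonymization pipelines typically do more than this.

```python
# Hypothetical field names; a real dataset will have its own schema.
PRIVATE_FIELDS = {"name", "address", "ssn"}


def anonymize(records):
    """Return copies of the records with the private fields removed."""
    return [
        {key: value for key, value in record.items() if key not in PRIVATE_FIELDS}
        for record in records
    ]


# Made-up example records, not real customer data.
customers = [
    {"name": "Ada Example", "address": "1 Main St", "ssn": "000-00-0000",
     "age": 36, "basket_total": 42.5},
    {"name": "Bob Example", "address": "2 Side St", "ssn": "000-00-0001",
     "age": 51, "basket_total": 17.0},
]

anonymized = anonymize(customers)
print(anonymized)
```

The original records are left untouched, so the stripped copy can be handed to a training pipeline while the raw data stays wherever it is allowed to live.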

How Does Synthetic Data Work?

One example we came across, from San Francisco-based startup Neuromation (more on them below), involves a company that wants to automate the retail experience in a store. That requires training algorithms on millions of store objects. Traditionally, a human photographer would capture the images needed to build the training dataset. Alternatively, machines can be used to create realistic representations of the objects, which can then be “manipulated virtually to create endless variations that reflect different characteristics in size, shape, and color of a hundred similar products.” Says Neuromation CEO Yashar Behzadi, “Thanks to synthetic data, companies only need 50% of their original, authentic training data to finish the formal training of their algorithms. In fact, some AI applications, such as object recognition, can even be trained almost exclusively with synthetic data.”
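The "endless variations" idea can be sketched in a few lines of Python. This is not Neuromation's pipeline, just a toy illustration: the `ProductVariant` class, the parameter ranges, and the product name are all assumptions, and in practice each set of parameters would be handed to a 3D renderer to produce the actual image.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ProductVariant:
    """Parameters for one synthetic rendering of a product (hypothetical schema)."""
    product: str
    scale: float    # relative size, 1.0 = original
    hue_shift: int  # color shift in degrees
    rotation: int   # rotation in degrees


def generate_variants(product, count, seed=0):
    """Sample `count` random size/color/rotation combinations for one product."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    return [
        ProductVariant(
            product=product,
            scale=round(rng.uniform(0.8, 1.2), 2),
            hue_shift=rng.randrange(0, 360),
            rotation=rng.randrange(0, 360),
        )
        for _ in range(count)
    ]


variants = generate_variants("cereal_box", count=1000)
print(variants[0])
```

Because the generator chose every parameter itself, each image comes with its label for free, which is a big part of why the approach can be cheaper than paying people to photograph and annotate objects.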

Does it work? Yes, according to a study out of MIT last year that tested a generative synthetic data system against real-world data by seeing how it fared in solving a predictive modeling problem. The researchers found no significant performance difference in 11 out of the 15 tests. No doubt future iterations will do even better than roughly 73%.

Why Use Synthetic Data?

The next question is: why use synthetic data over real data? One of the top reasons is what has become a tech buzzword in recent years: democratization. The experts say that startups trying to get out of the gate are at a disadvantage against the data-rich Google giants of the world because they don’t have the resources to build the big real-world datasets required to train algorithms. Synthetic data also eliminates the privacy problems that might hamstring machine learning applications in areas like healthcare. The poster child for privacy breaches, Facebook, announced earlier this year that it would turn to synthetic data for its upcoming AI efforts.

Turning images from Grand Theft Auto into training data for autonomous vehicles. Credit: Darmstadt University

Finally, synthetic data also helps companies large and small scale up their AI training efforts. For example, the self-driving company Waymo has tested its technology over the course of millions of real miles, as well as billions of simulated ones. Some are even turning to video games like Grand Theft Auto to train autonomous cars to drive during the Apocalypse.

Examples of Synthetic Data Platforms

Now, let’s look at a few startups selling synthetic data platforms.

Founded in 2016, Singapore startup CVEDIA has taken in an undisclosed amount of funding to build a synthetic data platform that services enterprise projects. The company claims to work with over 30 major clients, including large aerospace, maritime, and autonomous vehicle companies, besides being funded by FLIR Systems, the world’s largest thermal sensor producer (and a happy client). The international team creates synthetic environments using a mixture of data science and advanced image processing, and Arjan Wijnveen, CEO, has been quoted as saying they’re “the only company thus far to produce data that can mirror, and even outperform real world data at times”. CVEDIA technology is able to auto-generate 15 types of data annotation as well as support sensor fusion in real-time.

Founded in 2017, Neuromation is a San Francisco-based AI startup pioneering the use of synthetic data for computer vision use cases. Originally out of Estonia, the company raised $50 million in an Initial Coin Offering (ICO) in January of this year. Normally we’d make some jokes about ICO scams out of a small Baltic nation with a name that most Americans probably think is a new brand of cannabis vape pens (e-stonia, get it?). But Neuromation seems legit, featured in Wired and other publications, and CEO Yashar Behzadi has emerged as one of the leading proselytizers of synthetic data as a way for the Davids to compete with the Goliaths. Neuromation is still developing its synthetic data generation platform and recently released its third-quarter report about its latest efforts:

Neuromation expects to roll out its synthetic data generation platform next year. Credit: Neuromation

In one use case, the company is creating digital images containing simulated pigs for a client that wants to train an algorithm to track livestock, according to the story in Wired. As we’ve noted before, there’s already facial recognition for cows. The company is also focused on churning out synthetic data for AI in healthcare. With offices in San Francisco, Europe, and Israel, Neuromation is helping enterprises across the world to build better AI models. 

Founded in 2015, Berlin-based TwentyBN has raised $12.5 million, including a $10 million Series A in September. TwentyBN has built an in-house data factory for generating high-quality, labeled video clips to teach neural networks about the real world. At this time, the company isn’t using synthetic generation but actually employing crowdsourced workers to record short video clips for specific descriptions, such as a person doing a thumbs-down gesture. This “crowd acting” approach, as the company calls it, helps generate large amounts of densely labeled video training data at low cost. However, the German startup is starting to experiment with synthetic video datasets. For instance, it generated short video clips using the Unity game engine, which allowed its team to render more than 50,000 densely labeled videos. While the company’s crowd actors aren’t out of a job just yet, the experiment showed that “synthetic data can be a useful complement to real data.”

Founded in 2017, New Yawk-based AI Reverie took in an undisclosed amount of Seed funding in May from investors including Vulcan Capital, the multi-billion dollar investment arm of Microsoft co-founder Paul Allen, who passed away in October, so hopefully the startup already cashed the check. (Too soon?) The young company builds photorealistic virtual worlds to closely mimic any real location for its clients, many of whom appear to be on an African safari.

The virtual worlds replicate everything from weather to weather people, with fully annotated video. It’s like VR without the motion sickness.

Update 04/15/2020: AI.Reverie has raised $5.6 million in funding to accelerate product development and increase hiring to match ramping demand. This brings the company’s total funding to $5.6 million to date.  

Founded in 2017, an Austrian company called Mostly AI uses artificial intelligence and real data to create synthetic, anonymized data. Its generative deep neural network learns to simulate actual customer data by matching patterns and behaviors. That allows its clients to retain the details necessary to train AI algorithms but without identifying too much about your neighbors.
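The general principle of learning the patterns in real data and then sampling brand-new records can be shown with a deliberately simple stand-in. The sketch below swaps the deep neural network for a plain bell curve fitted to a single made-up column of ages, so it says nothing about how Mostly AI's product actually works; the numbers and function names are assumptions for illustration only.

```python
import random
import statistics

# Made-up "real" customer ages, for illustration only.
real_ages = [23, 35, 41, 29, 52, 38, 47, 31]


def fit_gaussian(values):
    """Summarize a column by its mean and sample standard deviation."""
    return statistics.mean(values), statistics.stdev(values)


def sample_synthetic(values, count, seed=0):
    """Draw `count` new values from a normal distribution fitted to `values`."""
    mean, stdev = fit_gaussian(values)
    rng = random.Random(seed)  # seeded for reproducibility
    return [rng.gauss(mean, stdev) for _ in range(count)]


synthetic_ages = sample_synthetic(real_ages, count=10_000)
print(round(statistics.mean(real_ages), 1), round(statistics.mean(synthetic_ages), 1))
```

None of the synthetic values belongs to a real customer, yet the overall shape of the column is preserved. A production-grade generator has to do the same across many columns at once while keeping the relationships between them intact, which is where the heavier machinery comes in.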

Founded in 2010, London-based Chatterbox Labs has created what it calls a Synthetic Data Generator that uses reinforcement learning to create synthetic data for enterprise systems. Reinforcement learning is a form of machine learning that encourages a machine to find the best way forward to maximize a reward, though we’re not exactly sure how you reward a machine outside of 3D printed sexbots. Chatterbox Labs says it can produce machine-ready data 95% faster. Which raises the question: Faster than what exactly?
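As for how you reward a machine: in practice the reward is just a number the program tries to make as large as possible. Here is a minimal sketch of that idea using an epsilon-greedy "multi-armed bandit", one of the simplest reinforcement learning setups. It has nothing to do with Chatterbox Labs' generator, and the payout probabilities, step count, and exploration rate are arbitrary assumptions.

```python
import random


def run_bandit(payout_probs, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy agent choosing among slot machines with unknown payouts."""
    rng = random.Random(seed)
    counts = [0] * len(payout_probs)    # how often each machine was played
    values = [0.0] * len(payout_probs)  # running average reward per machine
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random machine.
            action = rng.randrange(len(payout_probs))
        else:
            # Exploit: play the machine with the best average so far.
            action = max(range(len(payout_probs)), key=lambda a: values[a])
        reward = 1.0 if rng.random() < payout_probs[action] else 0.0
        counts[action] += 1
        # Incremental update of the running average for the chosen machine.
        values[action] += (reward - values[action]) / counts[action]
    return counts, values


counts, values = run_bandit([0.2, 0.5, 0.8])
print(counts)
```

The agent is never told which machine pays best. It works that out by trial and error, and ends up spending most of its plays on the most generous one.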

Deep Vision Data is a division of Kinetic Vision, a 150-person technology R&D company based in Cincinnati, Ohio. They provide outsourced product and technology development services to more than 50 companies in the Fortune 500, including 15 in the Fortune 100. Their Deep Vision Data division was created in early 2018 to meet the growing demand for synthetic training data for machine learning, and they believe they’re the only company in the US that is adequately structured to service this market. The company creates synthetic data using various “proprietary” methods such as computer graphics, real-time simulators, and custom software. For example, the company can quickly create thousands of training images of an object in a specific physical environment by using an augmented reality app to produce hybrid images of virtual objects in real-world settings.

We also came across a company called Synthetic Data, which has something to do with Big Data, IoT and AI, according to its website, and which also manages to have marijuana leaves plastered across its banner. The only thing they haven’t done yet is reverse-merge into some Over-The-Counter (OTC) shell and start pumping the heck out of it, citing every disruptive technology known to man. Or maybe they’re selling synthetic cannabis? Just smoke the real thing, kids.


If AI is the new electricity, then you might think of synthetic data as a potentially cheaper and faster way of generating the power necessary to charge AI algorithms. However, synthetic data techniques are still in the early stages of testing and vetting, which is reflected in the mostly young, modestly funded group of startups that we found. But with companies like Facebook in the market for synthetic data, expect that dynamic to change quickly.  

