Creating Synthetic Data for Computer Vision Algorithms
Ask anyone who’s trying to develop machine learning algorithms what’s most critical for making better algorithms and they’ll probably say data. Having a unique dataset to train your algorithms on is a competitive advantage, but even the best datasets may not contain “boundary cases”, which are situations that happen so infrequently they’re almost impossible to imagine.
That’s where “synthetic data” comes into play. Synthetic data is computer-generated data that’s created to help algorithms correctly understand the world. In essence (and of course, with the help of experts), our brain machines have begun training themselves. It’s only a matter of time before we get pushed to the side so the machines can get on with it, unfettered by our primitive thinking methods. Until then, we need to try and make some money off the whole thing. That’s what a small company called CVEDIA is doing.
Founded in 2016, Singapore startup CVEDIA has taken in an undisclosed amount of funding from FLIR Systems (FLIR), the world’s largest thermal sensor producer, which made a strategic investment in CVEDIA last August. That money has been used to continue developing CVEDIA’s simulation platform – SynCity – which generates photorealistic, labeled 3D worlds for recreating everyday scenarios and developing edge cases. Those worlds are then used for training, testing, and validating machine learning algorithms across multiple domains. We previously wrote about a handful of other companies that are creating synthetic data. What differentiates CVEDIA is having a $6 billion imaging company backing them and actively using their technology.
FLIR Systems works in the area of thermal imaging, a technology that’s also being used for self-driving applications. In our article on 6 New Perception Systems for AI Self-Driving Cars, we noted that “thermal cameras have advantages over other types of sensors, not only with higher resolution over LiDAR and radar, but better performance in poor weather conditions.” It’s those thermal imaging systems that need training data, and a look at FLIR’s investor deck shows us that CVEDIA’s tech is actually driving their long-term strategy.
Let’s talk more about the underlying technology that powers the “SynCity” platform.
CVEDIA and Unity Technologies
A few years back, we wrote about Unity Technologies – The World’s Leading Game Engine, which is pretty much dominating the gaming space. In that article, we talked about how Unity was planning to expand into other applications outside of gaming, one of those being autonomous driving. Last June, Unity held their annual development conference – Unite Berlin 2018 – at which Jose De Oliveira (Lead Engineer for Autonomous Vehicles at Unity) and Michael Ferreira (Development Lead at CVEDIA) gave a 30-minute talk on Synthetic Environments for Autonomous Vehicle Development, which we watched so you don’t have to. Here’s the problem they’re trying to solve:
Unity’s strategy is not to compete with their customers, but to provide a community of developers who can help expand the capabilities of the underlying platform. CVEDIA uses the Unity platform to create and validate data that would be impossible, dangerous, or too expensive to gather. They prove this point by showing a bunch of crazy videos like a tank crossing the road, a guy in a shopping cart spinning around at the entrance to a tunnel, and a helicopter buzzing a highway.
The point is that we can’t develop autonomous vehicles that can navigate roads – especially Russian roads – unless they can deal with some pretty unpredictable stuff happening. In the world of software development, these unforeseen scenarios are called “edge conditions.” These are the things that we don’t know we don’t know, or as Mr. Rumsfeld would say, these are the “unknown unknowns.” In one use case from the talk, CVEDIA created a virtual world and then actually re-created a client’s LiDAR system inside it.
With the LiDAR system living in a virtual world, CVEDIA could begin introducing it to things like animals in the roadway, a tree falling across the road, landslides, or drunk Russian truck drivers. As a result, the client was able to perform new iterations in hours instead of weeks. New edge cases were identified that hadn’t even been considered (unknown unknowns), and the client was able to save all these scenarios so that they could perform regression testing once the problems had been fixed.
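To make the "save the scenarios, then regression test" idea concrete, here's a minimal sketch of what replaying saved edge cases against a detector might look like. Everything in it is an assumption for illustration: the `Scenario` record, the parameter names, and the two toy detectors are hypothetical, and none of it reflects SynCity's actual API or the client's real system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Scenario:
    """A saved edge case: a name, scene parameters, and the expected outcome."""
    name: str
    params: Dict[str, float]
    obstacle_present: bool  # ground truth, known because the scene is synthetic


def run_regression(
    scenarios: List[Scenario],
    detector: Callable[[Dict[str, float]], bool],
) -> List[str]:
    """Replay every saved scenario and return the names of the ones that fail."""
    return [s.name for s in scenarios if detector(s.params) != s.obstacle_present]


# Hypothetical saved scenarios (all values are made up for illustration).
SAVED = [
    Scenario("fallen_tree", {"object_height_m": 0.6, "distance_m": 30.0}, True),
    Scenario("animal_in_road", {"object_height_m": 0.4, "distance_m": 15.0}, True),
    Scenario("empty_road", {"object_height_m": 0.0, "distance_m": 0.0}, False),
]


def old_detector(params: Dict[str, float]) -> bool:
    # Toy stand-in: misses anything under half a meter tall.
    return params["object_height_m"] >= 0.5


def fixed_detector(params: Dict[str, float]) -> bool:
    # Toy stand-in after a fix: flags any object taller than 10 cm.
    return params["object_height_m"] > 0.1


print(run_regression(SAVED, old_detector))    # ['animal_in_road']
print(run_regression(SAVED, fixed_detector))  # []
```

The useful property is that the scenario list only grows: every newly discovered edge case gets saved once and is then replayed after each fix, so an old problem can't quietly come back.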
What’s remarkable is that CVEDIA created a digital twin of the LiDAR system in a virtual world. They’re also able to do that with other technologies, like the sorts of thermal imaging systems that FLIR Systems develops. In the talk, another use case was described where SynCity was used to generate pedestrians in a virtual neighborhood, giving them random poses and body types (imagine an extremely obese person lying in the middle of the road as an example). Then, they populated these neighborhoods with all sorts of things. Again, they recreated the actual imaging system being used – in this case, thermal imaging – in the virtual world. For the objects within the virtual world, they created “thermal distribution textures” which allowed for entropy at the thermal level. The end result was a 25-30% improvement in precision and recall.
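For anyone who needs a refresher on what precision and recall actually measure, here's a short sketch. Precision is the share of detections that were real, and recall is the share of real objects that were detected. The counts below are made up purely for illustration, and the source doesn't say whether the 25-30% figure is a relative or an absolute improvement, so the relative calculation shown here is just one possible reading.

```python
from typing import Tuple


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Compute precision and recall from true/false positives and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def relative_gain(before: float, after: float) -> float:
    """Relative improvement of `after` over `before`, as a fraction."""
    return (after - before) / before


# Hypothetical detection counts before and after adding synthetic training data.
base_p, base_r = precision_recall(tp=60, fp=40, fn=40)
aug_p, aug_r = precision_recall(tp=78, fp=22, fn=22)

print(f"precision: {base_p:.2f} -> {aug_p:.2f} ({relative_gain(base_p, aug_p):.0%} gain)")
print(f"recall:    {base_r:.2f} -> {aug_r:.2f} ({relative_gain(base_r, aug_r):.0%} gain)")
```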
Additional Use Cases
Autonomous driving isn’t the only application with “unknown unknowns”; plenty of other applications being tested today have them too. For example, we may not be able to collect data on an enemy’s territory when developing autonomous defense applications. The enemy might be able to trick your algorithms using cardboard cutouts of tanks or by creating fake people like they did in the Battle of Santo Poco. Things also look different at twilight than they do at sunrise. Extremely hot or cold environments can produce different results, like a sensor lens fogging over or a mirage on the horizon. All of these conditions can affect the way an AI system performs, and are huge barriers to success for people who are trying to develop brand new applications for machine learning. Here are just a few interesting examples of how SynCity is being used to help train tomorrow’s algorithms.
- Healthcare – There was a contest being held – CAMELYON16 – where machine learning algorithms were used to diagnose histological lymph node sections using the dataset provided. CVEDIA’s platform could generate synthetic data to supplement that dataset, giving contest participants a competitive advantage.
- Transportation – Cargo vehicles crashing into airplanes cost airlines millions of dollars a year. A client was developing an autonomous solution and even acquired an airplane fuselage to test on, but it was unsuccessful in mimicking a functioning airplane chassis. Working with CVEDIA from the design phase onward gave the client the ability to create multiple training sets to test the best location and orientation for their sensors.
- Agriculture – Harvest Croo and MechaSpin are working on an autonomous strawberry picker and needed help making sure it wouldn’t run into any of the last few Mexicans who were outstanding in their field. CVEDIA not only helped identify humans but also built training sets that were used to help classify strawberries – green, developing, or fully ripe and red.
These are just a few examples of machine learning applications that can stand to benefit from some synthetic data.
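The common thread in these examples is that a simulator lets you dial in the conditions you want and get the labels for free. Here's a rough sketch of that idea, sampling randomized scene settings like time of day, temperature, lens fog, and pedestrian pose. The `SceneConfig` fields, value ranges, and pose names are all hypothetical and aren't taken from SynCity, which would render actual images from settings like these.

```python
import random
from dataclasses import dataclass
from typing import List

# Hypothetical pose vocabulary, for illustration only.
POSES = ["standing", "walking", "crouching", "lying_down"]


@dataclass
class SceneConfig:
    """One randomized scene description plus its ground-truth label."""
    time_of_day_h: float   # 0-24, covers twilight and sunrise
    ambient_temp_c: float  # extreme heat or cold changes sensor behavior
    lens_fogged: bool      # rare sensor condition
    pedestrian_pose: str
    label: str             # known for free, because we built the scene


def sample_scene(rng: random.Random) -> SceneConfig:
    """Draw one scene configuration from assumed parameter ranges."""
    return SceneConfig(
        time_of_day_h=rng.uniform(0.0, 24.0),
        ambient_temp_c=rng.uniform(-30.0, 50.0),
        lens_fogged=rng.random() < 0.1,
        pedestrian_pose=rng.choice(POSES),
        label="pedestrian",
    )


def generate(n: int, seed: int = 0) -> List[SceneConfig]:
    """Generate n scene configurations; a fixed seed makes the set reproducible."""
    rng = random.Random(seed)
    return [sample_scene(rng) for _ in range(n)]


for scene in generate(3):
    print(scene)
```

Seeding the generator matters more than it looks: a reproducible dataset means a scenario that trips up the algorithm today can be regenerated exactly tomorrow.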
The Future of Synthetic Data
As we exhaust the obvious use cases for machine learning that surround existing big datasets, new business models are popping up that require datasets we can’t collect for whatever reason. These algorithms can be trained just as fast as we can generate synthetic data, and CVEDIA wants to be the catalyst and provider for these brand-new use cases so that they won’t be encumbered by data collection. The company says it is able to achieve results that mirror, and sometimes even outperform, real-world data – and claims to be the first and only company to do this. Up to this point, the synthetic data market has consisted of companies whose data acts as a basic stand-in, and none of that data has proven to be as effective as the real thing.
CVEDIA stands out because they’re not trying to create their own platform from scratch; they’re simply using the best engine out there and then building on top of that foundation. Someone else gets to deal with that whole software development thing while they just download new releases and focus solely on generating synthetic datasets which can be sold for top dollar. Can you say high-margin business? It’s almost like that guy who sold a virtual piece of property in some virtual world for $100,000 – another reason why the aliens don’t want to talk to us.
CVEDIA has already expanded across 19 countries and they claim to be ahead of everyone else. Their investor, FLIR Systems, has more than $520 million in cash right now, so further funding rounds won’t be a problem. Since CVEDIA is expanding into other applications, it could be a great bolt-on acquisition for FLIR that lets them expand into additional industry applications. If that doesn’t fit FLIR’s strategic direction, then maybe Unity – with their $601 million of funding – could provide another potential exit. Regardless, the whole synthetic data business model is showing some real promise, and we’d expect to see other synthetic data players follow suit and align themselves with corporate partners that can help accelerate development and validate their technologies, as CVEDIA has done.