The Big Business of Big Data Labeling as a Service
In the dot-com era of the late ’90s, demand for talent became so fierce that just about anyone could get hired in tech. It wasn’t uncommon in those days for some of your team members to be on work release. Any felon who could fog a mirror was capable of doing software testing by simply following scripts put together by slightly less nefarious software test leads.
Such was the case at a Bellevue, Washington, software testing company called ST Labs, which was acquired by Data Dimensions in 1998, which in turn was acquired by Lionbridge in 2001. Today, Lionbridge is all but unrecognizable. What hasn’t changed is that if you have a pulse and a desire to work, they probably have a role for you in their one-million-strong team of crowdworkers.
From Localization to Data Labeling
As one of the world’s largest language service providers, Lionbridge has a 20-year history of leveraging its 500,000 linguistic experts to provide companies with services like localization testing, making sure that an application translates into other languages while still making sense. In 2019, they acquired a Japanese startup called Gengo which provided language services in addition to artificial intelligence (AI) training data services. At that time, the CEO of Lionbridge told Slator that he could envision a day when their AI business becomes larger than their localization business. Today, AI training services are a core offering for Lionbridge.
This might be a good time to talk about why there’s such a demand for AI training data.
AI algorithms are only as good as the big data you feed them. If you’re developing computer vision for a self-driving car, you’d want it to recognize all the things that a human driver would recognize. Signs, cars, people, animals: these are all things you want a self-driving car to recognize. To train an algorithm to identify a stop sign, you give it 1,000 pictures of stop signs that are labeled as such. Then, you need to do this for all variations of street signs you might encounter.
Labeling street signs in pictures and video is an example of a “data annotation” task that humans would perform to build a data set for self-driving car companies. Pictures, video, and audio are all forms of media that can be annotated. Now, think about doing these same tasks across 300 different languages. Lionbridge brings to the table 20 years of expertise in language translation. Our next company may not have as much experience, but they’re emerging as a leading provider of AI training data.
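To make “data annotation” a bit more concrete, here’s a minimal Python sketch of what a labeled image record might look like and how you’d count the examples collected per class. The class names, field names, file names, and labels are all hypothetical illustrations, not the format of any company mentioned in this article.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List


@dataclass
class BoundingBox:
    """A labeled rectangle in pixel coordinates (hypothetical schema)."""
    label: str
    x: int
    y: int
    width: int
    height: int


@dataclass
class AnnotatedImage:
    """One image plus the boxes a human annotator drew on it."""
    path: str
    boxes: List[BoundingBox] = field(default_factory=list)


def label_counts(dataset: List[AnnotatedImage]) -> Counter:
    """Count how many labeled examples exist per class across the dataset."""
    counts = Counter()
    for image in dataset:
        for box in image.boxes:
            counts[box.label] += 1
    return counts


# Made-up example data, for illustration only.
dataset = [
    AnnotatedImage("img_0001.jpg", [BoundingBox("stop_sign", 40, 30, 64, 64)]),
    AnnotatedImage("img_0002.jpg", [
        BoundingBox("stop_sign", 10, 12, 50, 50),
        BoundingBox("pedestrian", 120, 80, 30, 90),
    ]),
]
counts = label_counts(dataset)
print(counts)
```

A tally like this is how you’d notice that you have plenty of stop signs but hardly any examples of some other sign you need the algorithm to recognize.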
The Data Platform for AI
Founded in 2016, San Francisco startup Scale AI has taken in $277.6 million in disclosed funding from investors who have ascribed a $3.5 billion valuation to the company. (This valuation comes from a $155 million round that was raised weeks after this article was published.) We’ll skip the part about how the 20-year-old founder dropped out of MIT to found the company, because it just reminds us of how little we’ve accomplished with our own lives. Scale uses a combination of high-quality human task work, smart tools, statistical confidence checks, and machine learning to consistently return scalable, precise data.
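Scale doesn’t spell out what its “statistical confidence checks” involve, so the following is only a generic sketch of one common idea: have several people label the same item and accept the label only when enough of them agree. The function name, the 0.7 threshold, and the sample votes are assumptions made for illustration, not Scale AI’s actual method.

```python
from collections import Counter


def consensus_label(votes, min_agreement=0.7):
    """Return (label, agreement) when enough annotators agree,
    otherwise (None, agreement) so the item can be sent for review.

    The 0.7 default is an arbitrary illustrative threshold.
    """
    if not votes:
        raise ValueError("need at least one vote")
    label, count = Counter(votes).most_common(1)[0]
    agreement = count / len(votes)
    if agreement >= min_agreement:
        return label, agreement
    return None, agreement


# Three of four annotators agree, so the label is accepted.
accepted = consensus_label(["stop_sign", "stop_sign", "stop_sign", "yield_sign"])

# No majority here, so the item is flagged instead of labeled.
flagged = consensus_label(["stop_sign", "yield_sign", "speed_limit"])

print(accepted, flagged)
```

Raising the threshold trades throughput for quality: more items get kicked back for review, but fewer bad labels slip into the training set.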
Perhaps the best way to understand what Scale AI does is by looking at some use cases they’ve handled for notable clients such as:
- Skydio – needed a large and varied dataset for teaching their drones subject tracking and obstacle avoidance so they’re able to do things like taking inventory in warehouses.
- Toyota Research Institute – needed a challenging annotation type in which every point in a 3D point cloud needs to be painted. Scale AI adapted to their ever-changing needs, ramping up throughput 10x in a matter of weeks, and coding a new annotation method in 24 hours.
- Embark – needed training data for self-driving semis. Turn-around time for datasets was reduced to 5 business days and data quality improved to over 99%.
- Skip – needed to ensure their shared electric scooters were being parked in appropriate places. Scale AI analyzed their images in real-time against local parking regulations to determine compliance.
Throughout these examples, we see how Scale AI’s flexibility as a startup lets them do things quickly for clients that a larger company might not be able to. When you become a publicly traded company, suddenly there’s all this process and procedure that gets in the way. Still, that hasn’t stopped our next company from growing rapidly.
Update 04/14/2021: Scale AI has raised $325 million in Series E funding to grow their install base amid growing competition from other AI training dataset startups. This brings the company’s total funding to $602.6 million to date.
Reliable Training Data for AI
Last April, we wrote about how Crowdsourced Big Data is Big Business for Appen, a $2.9 billion Australian firm that’s a “global leader in the development of high quality, human annotated datasets for machine learning and artificial intelligence.” Similar to Lionbridge, Appen (APX:AU) has an on-demand crowd of one million workers covering 180 languages in 130 countries. Revenue growth has been nothing short of spectacular in the past four years for this profitable company.
Speech and image annotation made up about 10% of 2019 revenues for Appen, while the remainder came from a menu of “content relevance” services that Appen offers as seen below.
This isn’t meant to be an exhaustive list of all companies out there doing data labeling. There are others, and some we’ve looked at before like Hive. The purpose of this piece was to look at how some of the leaders in data labeling stack up against a company we currently hold shares in.
The Long Term Outlook for Data Labeling
As current shareholders of Appen, we want to be aware of the competitive environment. As AI algorithms continue to be trained on all this data, there must come a point where it’s no longer needed in such large volumes. The question in our minds is this: in what direction would a company pivot when the demand for its data labeling services subsides? We also need to consider substitutes such as synthetic data, or technology solutions that automate the data labeling process.
It seems like the content relevance work that Appen does will outlast media data labeling, at least until we achieve artificial general intelligence. For Appen and Lionbridge, their core competency is a crowd of over one million skilled multi-lingual contractors dispersed throughout the globe. What sort of tasks could this population perform in the absence of data labeling or content relevance? It’s a question we don’t have the answer to, but we’re hoping that becomes apparent in Appen’s communication of their long-term strategy.
Data labeling is a great example of how AI won’t just displace jobs, but also create jobs. The rapid growth of Scale shows how much demand there is for data labeling services, while Lionbridge shows us how to successfully pivot into growth trends using a core competency. As for Appen, they give us a good look under the hood as to how lucrative data labeling can be.
Appen is just one of more than 20 disruptive technology stocks we’re holding in The Nanalyze Disruptive Tech Portfolio. Find out the rest by becoming a Nanalyze Premium annual subscriber.