Unstructured Data Sources for New Opportunities
Many of us spend our days trying to bring order out of chaos, with varying degrees of success. Can you imagine trying to do the same for the Internet, which according to some estimates will contain more than 40 trillion gigabytes of structured and unstructured data by the year 2020? That’s not far off from the rough count of stars in the universe (a “1” with 24 zeroes behind it), and the virtual universe of our computers is doubling in size roughly every two years.
Upwards of 90 percent of that data is unstructured data, a catch-all term that refers to information that isn’t organized in a way that makes it easy to analyze or understand. Unstructured data includes everything from Facebook posts to Anthony Weiner sexting photos to news articles about said sexting photos. In contrast, structured data is organized in a table or a relational database, making it easy to slice and dice, with each bit of information easily categorized in its own special zip code.
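To make the contrast concrete, here is a minimal sketch of what "adding structure" can look like for a single social media post. The function name `to_record` and the fields it extracts are our own illustrative assumptions, not any vendor's method; real tools go far beyond pattern matching.

```python
import re

def to_record(post):
    """Turn one free-text post into a flat record with named fields.

    Hypothetical example: the field names are our own choice.
    """
    return {
        "mentions": re.findall(r"@(\w+)", post),   # accounts named in the post
        "hashtags": re.findall(r"#(\w+)", post),   # topics tagged by the author
        "word_count": len(post.split()),           # whitespace-separated tokens
    }

record = to_record("Dinner with @sam at the new taco place #tacos #yum")
print(record)
```

Once every post is reduced to the same named fields, the records can be loaded into a table and sliced and diced like any other structured data.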
Most people aren’t even aware of their own “unstructured data exhaust,” which is something we touched on in our recent article about big data privacy. Companies like Diffbot build tools that allow us to turn unstructured data into structured data and then use it for targeted marketing or other purposes. We’ve all experienced that Big Brother moment when our last Facebook post about dinner breaks out in a pox of ads for cookbooks and Groupon offers. That’s the low-hanging fruit from unstructured data that businesses have been harvesting for years. (Though Facebook recently upped its game with DeepText, its own AI that can better understand user intent and sentiment.) What we’re interested in talking about here are uses of unstructured data that only a higher mind – an artificial intelligence (AI) – might find ripe for the picking.
We covered this a bit last month when we reviewed a company called Aspectiva that uses AI to make product recommendations from user-generated content, such as product reviews on Amazon. Aspectiva isn’t just aggregating content; it uses a natural language processing algorithm to make sense of customer sentiment across an entire collection of product reviews.
Sentiment seems to be a keyword here when looking at the value of using artificial intelligence to make sense of unstructured data. In fact, computer scientists from the University of Utah’s College of Engineering recently developed what they call “sentiment analysis” software that can automatically determine how someone feels based on what they write or say. They tested the software’s machine-learning model by examining more than a million geo-tagged tweets about the U.S. presidential election. The model determined whether a particular county was all atwitter for Republicans or Democrats. This is similar to some of the work Ayasdi was doing with their topological big data analysis technology.
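The general idea of scoring text and rolling the scores up by county can be sketched in a few lines. To be clear, this is not the University of Utah model, which is machine-learned; it is a toy word-list approach, and the word lists, the sample tweets, and the assumption that each tweet is already tagged with the party it discusses are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical toy lexicons; a real system would learn these from labeled data.
POSITIVE = {"love", "great", "win", "support"}
NEGATIVE = {"hate", "terrible", "lose", "corrupt"}

def score(text):
    """Return (positive word count - negative word count) for one text."""
    words = [w.strip(".,!?#@").lower() for w in text.split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def lean_by_county(tweets):
    """tweets: iterable of (county, party, text) tuples.

    Returns, for each county, the party with the highest net sentiment.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for county, party, text in tweets:
        totals[county][party] += score(text)
    return {county: max(parties, key=parties.get)
            for county, parties in totals.items()}

# Invented sample data, assumed to be geo-tagged and tagged by party.
SAMPLE = [
    ("Salt Lake", "Democrats", "I love this campaign, great rally!"),
    ("Salt Lake", "Republicans", "Terrible debate performance"),
    ("Utah County", "Republicans", "Great speech, full support"),
    ("Utah County", "Democrats", "They will lose"),
]

print(lean_by_county(SAMPLE))
```

A word list like this misses sarcasm, negation and slang, which is exactly why researchers reach for machine learning instead.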
The analysis tracked closely with the New York Times Upshot election forecast, correlating strongly with its state-by-state analysis just a week prior to the election. In other words, the computer got it wrong, just like the pollsters. However, we expect to see more and more of these sorts of analyses from AI systems using unstructured data from social media in the future. For example, a company looking to open yet another brewery in craft beer-loving Colorado might want to know if beer drinkers prefer IPAs over stouts before setting up their tap list. We could also see how tweets and Instagram photos from our vacation might benefit travel engines looking to customize an experience for a family versus an independent traveler pursuing more hedonistic goals.
One company that sees polls and surveys as so 20th century is New Zealand-based Parrot Analytics. It has created a “demand rating” that tracks global interest in a TV program using traditional viewing, streaming services including YouTube, discussions on fan sites and blogs, posts on social media, viewer-generated ratings on sites like Rotten Tomatoes, wikis and other research sites, and downloads and streaming via peer-to-peer networks. One of the company’s biggest clients is BBC Worldwide, which told Recode.net it uses the unstructured data harvested by Parrot Analytics’ AI system to help determine strategy in more than 200 markets worldwide.
In the same vein, the movie industry is rife with examples of movies – some with budgets of $200 million or more – that flopped because people didn’t receive them as the studio had hoped. Since AI is now creating movie trailers, why not have AI produce multiple trailers for a new movie, which can then be posted on YouTube? The comments could then be parsed for sentiment to determine how people might react to elements of the film, as a way to build better movie trailers and to sell more seats in the theater.
Of course, one can only squeeze so much marketable information out of cute pug videos on YouTube, even with the most sophisticated machine learning systems scrolling through all the unstructured data on the webpage. In that case, we prefer to take a bird’s-eye view. Well, let’s go even higher than that – into orbit around the Earth.
Collection of satellite imagery was once the domain of governments, but now private sector players like DigitalGlobe (NYSE:DGI) are constantly shooting pictures of the Earth, producing petabytes of unstructured data. Five years ago, one of the more creative uses of satellite imagery involved counting cars in Wal-Mart parking lots to determine customer flow, which helped analysts come up with mathematical models to predict Wal-Mart’s quarterly revenue each month.
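We don’t know what models those analysts actually used, but the simplest version of the idea is a straight-line fit between car counts and revenue. Here is a minimal sketch using ordinary least squares; every number in it is made up for illustration, and the function names are our own.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict(slope, intercept, x):
    """Apply the fitted line to a new car count."""
    return slope * x + intercept

# Invented data: average cars counted per lot, and revenue in $ billions.
car_counts = [100, 200, 300, 400]
revenue = [12.0, 22.0, 32.0, 42.0]

slope, intercept = fit_line(car_counts, revenue)
estimate = predict(slope, intercept, 250)
print(round(estimate, 2))
```

Real parking lots are messier than this: weather, holidays and store remodels all move the car count without moving revenue in the same way, so a serious model would need many more inputs.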
Now companies are employing machine learning and computer vision to solve much more complex problems using satellite imagery. For instance, Fortune online reported earlier this year that Facebook used its computer vision algorithm and satellite imagery to locate some two billion of the world’s most remote human beings in order to help provide them better connectivity to the internet – and produce new customers. We can envision a time not too far from now – and some of it is happening now – when we will be able to track, say, crop yields from space or real-time shipping traffic, and understand the potential economic impacts.
All of these high-resolution pictures of places around the globe got us thinking about property and real estate. Surely there’s plenty of unstructured data – Craigslist and the like – out there waiting to be exploited by the right AI platform. And then we came across CityBldr, a Seattle-based start-up that has created a tool using data science and machine learning to identify sites for development.
According to the company’s website, it ranks property sites on “their redevelopment potential, using algorithms and feedback from more than 180 variables, including sale data, zoning, lot size, topography, and proximity to transit.” It appears to use a mix of structured and unstructured data, with more than 118 million points of data.
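CityBldr doesn’t publish its algorithms, so the following is only our guess at the general shape of such a ranking: scale each variable to a common range, multiply by a weight, and sort by the total. The two variables, the weights, and the three lots are hypothetical stand-ins for the 180-plus variables the company mentions.

```python
def rank_sites(sites, weights):
    """Rank sites by a weighted sum of min-max normalized variables.

    sites:   {site name: {variable name: value}}
    weights: {variable name: weight}; a negative weight penalizes high values.
    """
    scores = {name: 0.0 for name in sites}
    for var, weight in weights.items():
        values = [attrs[var] for attrs in sites.values()]
        lo, hi = min(values), max(values)
        spread = hi - lo
        for name, attrs in sites.items():
            # Scale to the 0..1 range; a variable with no spread contributes nothing.
            norm = (attrs[var] - lo) / spread if spread else 0.0
            scores[name] += weight * norm
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical lots and weights, invented for illustration.
SITES = {
    "Lot A": {"lot_size_sqft": 5000, "miles_to_transit": 0.2},
    "Lot B": {"lot_size_sqft": 12000, "miles_to_transit": 1.5},
    "Lot C": {"lot_size_sqft": 8000, "miles_to_transit": 0.4},
}
WEIGHTS = {"lot_size_sqft": 0.4, "miles_to_transit": -0.6}

ranking = rank_sites(SITES, WEIGHTS)
print(ranking)
```

In this toy setup the mid-sized lot near transit comes out on top, because the big lot is penalized for being far from a station. A machine learning system would learn the weights from past sales rather than have us pick them by hand.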
“CityBldr’s goal is to create smart cities,” CityBldr CEO and co-founder Bryan Copley told GeekWire. “We can use artificial intelligence paired with empathy and create happy, functional, sustainable communities.”
At first blush, unstructured data may seem like so much chaff that needs to be separated from the wheat. But thanks to innovations in machine learning, computer vision and other aspects of artificial intelligence, we can harvest it all and spin it into gold.