The Wide World of Web Data Integration

October 11, 2019. 6 mins read

Based on some cursory demographic analysis of our reader base, most of you won’t remember when optical character recognition (OCR) first made its debut. That would have been in 1976 when Kurzweil Computer Products developed the first OCR program that could recognize any style of print.

Called the Kurzweil Reading Machine (KRM), the device used flatbed scanners and text-to-speech synthesis to create the first print-to-speech reading machine for the blind that could read ordinary books, magazines, and other printed documents out loud. Having been founded in 1975, Microsoft would certainly have watched the evolution of this technology. It took a mere forty-three years before Microsoft announced that, thanks to “artificial intelligence,” Microsoft Excel will now let you snap a picture of a spreadsheet and import it. That may be the single best example of how unresponsive large corporations can become over time.

While Microsoft fine-tunes their OCR with some machine learning, other companies out there are looking to develop OCR technology that can capture digital content. The process of collecting screen display information – web scraping or “screen scraping” as it’s sometimes called – is much more difficult than it sounds. You need to convert unstructured web data into a structured format by extracting, cleansing, and integrating the web data so that it can be used by other applications. If the website you’re scraping from changes, it’s a showstopper as your IT team scrambles to figure out what broke. One firm wants to make those headaches a thing of the past with something they’re calling Web Data Integration.
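To make the extract, cleanse, and structure steps concrete, here is a minimal sketch using only Python's standard library. It is not how Import.io does it; the HTML snippet, the `name` and `price` class names, and the helper names are all made up for illustration. Note how the parser is tied to specific class names, which is exactly why a site redesign breaks a hand-rolled scraper.

```python
from html.parser import HTMLParser


class ProductParser(HTMLParser):
    """Collects the text of elements whose class is 'name' or 'price'.

    The class names are assumptions about a hypothetical page layout.
    """

    def __init__(self):
        super().__init__()
        self.rows = []       # finished raw records
        self._field = None   # field whose text we expect next, if any
        self._current = {}   # record under construction

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if css_class in ("name", "price"):
            self._field = css_class

    def handle_data(self, data):
        if self._field is None:
            return  # text outside the elements we care about
        self._current[self._field] = data.strip()
        self._field = None
        if all(key in self._current for key in ("name", "price")):
            self.rows.append(self._current)
            self._current = {}


def clean_price(text):
    """Cleanse a display price such as '$1,299.00' into a float."""
    return float(text.replace("$", "").replace(",", ""))


def extract(html):
    """Extract raw fields, cleanse them, and return structured records."""
    parser = ProductParser()
    parser.feed(html)
    return [
        {"name": row["name"], "price": clean_price(row["price"])}
        for row in parser.rows
    ]


# Hypothetical page fragment, invented for this example.
SAMPLE_HTML = """
<ul>
  <li><span class="name"> Drip Brewer </span><span class="price">$49.99</span></li>
  <li><span class="name">Espresso Pro</span><span class="price">$1,299.00</span></li>
</ul>
"""

records = extract(SAMPLE_HTML)
print(records)
```

If the site renamed `price` to something else tomorrow, `records` would quietly come back empty, which is the showstopper scenario described above.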

Import.io and Web Data Integration

Founded in 2012, Saratoga, California startup Import.io has taken in just over $38 million in funding since they first debuted their technology, the ability to turn a web page into an application programming interface (API), back in October of 2013. The technology was hailed by Yahoo! co-founder Jerry Yang as having “the potential to revolutionize how we look at data on the web.” It’s easy to see why given the advantages Web Data Integration has over traditional web scraping:

Web Scraping vs. Web Data Integration
Source: Import.io

If you’ve ever tried loading data from one database into another database, you’re familiar with the extract transform load (ETL) process and can, therefore, appreciate just how difficult it is to consolidate structured data and turn it into something useful. Now, imagine trying to do the same with unstructured data. That’s the power of Web Data Integration. It’s opening up an entire world of alternative data with use cases that are only limited by the imagination.

Alternative Uses for Alternative Data

We’ve been talking a lot about how alternative data can be used to generate alpha, which has raised a serious question: if everyone has access to all these new alternative data sources, won’t that erode the alpha away? The answer is no, because there are so many ways to use the data. The same holds true for use cases. Companies across all industries are finding ways to use unstructured data from the Internet to gain insights.

Source: Opimas Analysis

Let’s look at some real-world use cases for web data.

What Customers Are Saying

Any company that sells a product to consumers wants to know what people are saying about it. Just knowing that 738 people collectively gave your product “4.5 stars out of 5” isn’t that helpful. What would be more useful is to understand what words customers are using to describe your product. Even better, how about adding your competitors’ products to the mix too? In the example below, we see how this works with some coffee makers being sold on Amazon.

Mining product reviews for sentiment
Source: Import.io

The value here goes way beyond simple sentiment. Imagine a product management team being able to scour every review for words that might portray how customers are reacting to a particular product feature.
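As a rough illustration of scouring reviews for feature-related words, the sketch below counts terms across a handful of reviews and pulls out the ones that mention a given feature. This is plain word counting, not sentiment analysis and not Import.io's tooling; the reviews, the stopword list, and the function names are invented for the example.

```python
import re
from collections import Counter

# A tiny, hand-picked stopword list; a real project would use a fuller one.
STOPWORDS = {"the", "a", "and", "is", "it", "but", "too", "to", "i", "this", "of"}


def term_counts(reviews):
    """Count how often each non-stopword appears across all reviews."""
    counts = Counter()
    for text in reviews:
        words = re.findall(r"[a-z']+", text.lower())
        counts.update(word for word in words if word not in STOPWORDS)
    return counts


def feature_mentions(reviews, feature):
    """Return the reviews that mention a given product feature."""
    return [review for review in reviews if feature.lower() in review.lower()]


# Made-up reviews for a hypothetical coffee maker.
SAMPLE_REVIEWS = [
    "The carafe leaks, but the coffee is hot.",
    "Great coffee and the carafe is easy to clean.",
    "Too loud, and the carafe leaks.",
]

counts = term_counts(SAMPLE_REVIEWS)
print(counts.most_common(3))
print(feature_mentions(SAMPLE_REVIEWS, "carafe"))
```

Even on three reviews, "carafe" and "leaks" turning up together is the kind of signal a product team would want to chase down.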

Product Pricing and Placement

E-commerce sites compete for clients by changing their prices regularly, sometimes by the hour or minute. Being aware of how your rivals are pricing their products is crucial in determining your own pricing strategies. “86% of consumers are checking out your competition for at least half of their purchases,” says Import.io, and “studies show that only about 25% of businesses consider the competition when setting their prices.” Import.io makes it easy to establish a baseline report of prices and products sold on Amazon where 55% of shoppers start their product searches. Tracking changes over time becomes a breeze.

Retailers can monitor competitor pricing and manufacturers can check things like “minimum advertised price.”
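Once prices are captured as structured data, tracking changes against a baseline is simple arithmetic. The sketch below compares two price snapshots and flags listings advertised below a minimum advertised price. The product names, prices, and price floors are hypothetical, and the functions are our own illustration rather than anything Import.io ships.

```python
def price_changes(baseline, latest):
    """Compare two {product: price} snapshots and report what moved.

    Returns {product: change}, where a negative change is a price cut.
    Products missing from the baseline are skipped.
    """
    changes = {}
    for product, new_price in latest.items():
        old_price = baseline.get(product)
        if old_price is not None and new_price != old_price:
            changes[product] = round(new_price - old_price, 2)
    return changes


def below_minimum(latest, minimum_advertised):
    """List products advertised below their minimum advertised price.

    Products with no minimum on file are never flagged.
    """
    return sorted(
        product
        for product, price in latest.items()
        if price < minimum_advertised.get(product, 0)
    )


# Hypothetical snapshots of a competitor's listings on two different days.
BASELINE = {"drip-brewer": 49.99, "espresso-pro": 1299.00, "grinder": 89.00}
LATEST = {"drip-brewer": 44.99, "espresso-pro": 1299.00, "grinder": 92.50}

# Hypothetical minimum advertised prices set by the manufacturer.
MINIMUMS = {"drip-brewer": 47.00, "grinder": 85.00}

print(price_changes(BASELINE, LATEST))
print(below_minimum(LATEST, MINIMUMS))
```

A retailer would care about the first report and a manufacturer about the second, but both fall out of the same two snapshots.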

Making sure your product is being accurately portrayed in e-commerce sites is equally important. Import.io lets you extract photos and product descriptions directly from target sites in order to make sure the latest details are being used. Since some retailers use product popularity data from Amazon to organize their own online marketplaces, monitoring your product rankings also becomes something you can easily do with Import.io.

Online Travel

In the travel industry, there’s a term called “revenue management” which refers to maximizing profitability by selling the correct number of seats, rooms, or cars at various price points based on demand and pricing elasticity. “Spillage” means you sold your inventory too fast and “spoilage” means you aren’t selling your inventory quickly enough. It’s an extremely difficult problem to solve and something that lends itself well to machine learning. However, we all know that machine learning algorithms are only as good as the delicious big data you feed them. That’s where Import.io comes in, allowing players in the travel industry to set dynamic pricing strategies, forecast occupancy rates, identify travel trends, and gauge travelers’ sentiments.


The U.S. political circus is a great demonstration of how to drive audience engagement. Just corral a population into opposite corners of a broad spectrum and then feed them what they want to hear. See how engaged they all are as they spout vitriol at each other ad nauseam? One AI startup thinks the entertainment industry can learn a thing or two about what people are interested in by looking at what they’re already engaged in. Austin, Texas startup StoryFit uses Import.io to compare tens of thousands of elements within books and movies to identify audience fit and make market recommendations for books, movies, and television programming. Using the tool saw “time to data” drop by 93%.

Asset Managers

We recently looked at Eight Ways to Use Alternative Data for Trading where all the data is being sourced from a sole data broker – Eagle Alpha. With tools like Import.io, sophisticated asset managers can now start to create their own alternative data sets that are unique to their needs. In fact, they’re already doing that, except now it becomes a whole lot easier.

Nearly half of all asset managers are already using web scraped data. Just imagine how much more use can be had when tools like Import.io make web data so much easier to collect.

Growth Through Acquisition

With potentially billions of dollars at stake, startups need to be aware of who their competition is and address the threat. Frameworks like Porter’s Five Forces can help describe the competitive environment, but firms need to ultimately decide where the biggest threats lie. Acquiring a competitor can create synergies for both firms in a transaction leading to a 2 + 2 = 5 situation.

Import.io had their own transaction earlier this year when they acquired one of their competitors, Connotate, that was also focused on web data extraction. An article by VentureBeat on the transaction talked about how Import.io picked up eight patents, 30 employees, and enterprise clients like HP, Dow Jones, Thomson Reuters, FactSet, Cox Automotive, and Capital One Services. Following the acquisition, Import.io ended up with 850 clients. About seven years ago, Connotate also acquired a competitor, Fetch Technologies, that was primarily focused on retail and background checks.

There’s still plenty of room for competition though, as spending on web data moves from internal to external.

With a tool like Import.io, every bit of data exhaust generated by the Internet becomes alternative data. If your corporate IT department presently scrapes web data using internal resources, now you can save your firm some money and flaunt it in your coming year-end review.


By 2020, it’s estimated that for every person on earth, 1.7 MB of data will be created every second. In the example we gave of StoryFit, we can see an entire business being built around web data, which inherently carries some sole-supplier risk but points to the “revolutionary” aspect of Web Data Integration. It’s become an indispensable tool for companies, a layer that sits between them and the biggest repository of information that exists: the World Wide Web.

The bigger Import.io gets, the better their algorithms become. With enough enterprise clients under their belt, they’ll make a great bolt-on acquisition for a large database company like Oracle. That possible exit event seems a whole lot more credible when you consider that one of the investors in Import.io’s seed round was Michael Widenius, a man who sold his last company, MySQL, to Oracle for $1 billion.

