Diffbot: Extracting Structured Data from the Internet

February 14, 2016. 3 min read

Google search is something we cannot live without. Not a day goes by that we don’t use Google to look up a fact or to research a new topic or concept. All the articles you read on Nanalyze come from information obtained through Google searches. If you’re a regular Google user, you have probably started to notice some changes in search results. We first noticed them when we were looking up populations for various cities and countries around the world. Try asking Google for the population of any country or city you can think of right now. You see? It answers your question and provides a nifty graph like the one seen below:


Traditionally, search algorithms try to answer your query by combing through millions of pages to find the single web page most likely to answer your question. Now, through the power of deep learning, Google can begin to answer questions directly instead of redirecting you to someone else’s web page. If I search for “When did the Beatles start”, Google just tells me right off:


These types of answers are referred to as “structured knowledge”. If you ask Google “how many people are in Athens” or “population of Athens”, it knows what you are looking for even though you are asking the same question in two different ways. Google calls this technology the Google Knowledge Graph, which is a large database of semantic data describing more than 1 billion people, places, and things. Pretty soon you’ll be able to speak to Google using actual human language. If you ask Google “What was the name of that long Western movie about a slave who is freed by a bounty hunter and then goes and seeks revenge”, it’ll soon answer with “Django Unchained”. Using deep learning to comb through big data and create large databases of structured data is the next big thing for search, but it also has applications outside of search. One interesting company in this space is called Diffbot.
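To make the Athens example concrete, here is a minimal sketch of the idea that two differently worded questions can resolve to the same structured query. The function name and the hand-written patterns are our own illustration; a real system such as Google's would learn these mappings rather than rely on a short list of regular expressions.

```python
import re

# Hypothetical, hand-written patterns standing in for a learned language model.
# Each pattern maps one way of phrasing a question to the attribute being asked for.
PATTERNS = [
    (re.compile(r"^how many people (?:are|live) in (?P<entity>.+?)\??$", re.IGNORECASE), "population"),
    (re.compile(r"^population of (?P<entity>.+?)\??$", re.IGNORECASE), "population"),
]


def to_structured_query(question):
    """Map a free-text question to an (entity, attribute) pair, or None if unrecognized."""
    text = question.strip()
    for pattern, attribute in PATTERNS:
        match = pattern.match(text)
        if match:
            return (match.group("entity").strip().title(), attribute)
    return None


# Both phrasings collapse to the same lookup key.
print(to_structured_query("how many people are in Athens"))
print(to_structured_query("population of Athens"))
```

Once both questions reduce to the same (entity, attribute) pair, answering them becomes a single lookup in a database of structured facts, which is the role the Knowledge Graph plays.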

About Diffbot


Founded in 2010, Silicon Valley-based startup Diffbot was the first startup funded by Stanford’s SSE Ventures. The company closed a $10 million Series A funding round last week, bringing its total funding to $12.5 million. The round was led by Tencent, one of China’s largest Internet companies.

Diffbot’s technology automatically extracts content from websites, articles, products, discussions, images and more. Incredibly, it does so with better-than-human-level accuracy across any website or language. The technology is available as software-as-a-service (SaaS) and uses advanced artificial intelligence to retrieve clean, structured data without the need for manual rules or site-specific training. Big companies like AOL, Adobe, Cisco, and eBay all use Diffbot’s technology, as does Microsoft, which uses it to complement Bing (that’s Microsoft’s search engine). Diffbot uses an “on-demand” business model with the pricing shown below:


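For readers curious what "structured data without manual rules" looks like in practice, the sketch below shows how a request to an extraction service of this kind might be built and how a response might be summarized. The endpoint, the `token` and `url` parameters, and the `objects`, `title`, `author` and `text` fields reflect our understanding of Diffbot's article API and should be treated as assumptions to check against the current documentation. The sample response is hand-made for illustration and nothing here touches the network.

```python
from urllib.parse import urlencode

# Assumed endpoint for the article extraction service; verify against current docs.
ARTICLE_ENDPOINT = "https://api.diffbot.com/v3/article"


def build_article_request(page_url, token):
    """Return the GET URL that asks the service to extract a single page."""
    query = urlencode({"token": token, "url": page_url})
    return f"{ARTICLE_ENDPOINT}?{query}"


def summarize_response(payload):
    """Pull a few fields out of a decoded JSON response (response shape is assumed)."""
    return [
        {
            "title": obj.get("title"),
            "author": obj.get("author"),
            "words": len(obj.get("text", "").split()),
        }
        for obj in payload.get("objects", [])
    ]


# Hand-made stand-in for a response; not real output from the service.
sample = {
    "objects": [
        {"title": "Example headline", "author": "A. Writer", "text": "Clean body text here."}
    ]
}

print(build_article_request("https://example.com/story", "YOUR_TOKEN"))
print(summarize_response(sample))
```

The point of the sketch is that the caller supplies only a page address: no selectors, templates or per-site rules appear anywhere in the request.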
Not only does Diffbot offer its technology for others to use, it also uses the technology itself to demonstrate how powerful it is. Last year, Diffbot performed a study in the travel industry to analyze customer sentiment. The unique aspect of this study was that it analyzed user-generated content (UGC) such as article comments, reviews on TripAdvisor or Yelp, blog posts, etc., all of which reflect what users are actually saying about their travel experiences. The study captured 230,303,990 datapoints across more than ten thousand sites in a 2-week time frame. Can you imagine how long it would take you to read 230 million comments? Diffbot completed this study in just two weeks, and found that British Airways, Hertz, and Hilton customers are the grumpiest out there. You can read more about Diffbot’s interesting findings in this article by Fortune on the topic.
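To put those numbers in perspective, here is a quick back-of-the-envelope calculation. The datapoint count and the two-week window come from the study as quoted above. The ten seconds per comment is purely our assumption about a fast human reader, and we treat each datapoint as one comment for the sake of the comparison.

```python
DATAPOINTS = 230_303_990     # figure quoted from the study
STUDY_DAYS = 14              # the "2-week time frame"
SECONDS_PER_COMMENT = 10     # assumption: a fast human reader, one comment per datapoint
SECONDS_PER_DAY = 86_400

per_day = DATAPOINTS / STUDY_DAYS
per_second = per_day / SECONDS_PER_DAY
human_years = DATAPOINTS * SECONDS_PER_COMMENT / (365.25 * SECONDS_PER_DAY)

print(f"{per_day:,.0f} datapoints per day")            # 16,450,285 datapoints per day
print(f"{per_second:.1f} datapoints per second")       # 190.4 datapoints per second
print(f"{human_years:.0f} years of non-stop reading")  # 73 years of non-stop reading
```

Under these assumptions the system processed roughly 190 datapoints every second, while a single person reading around the clock would need about 73 years to get through the same material.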

Diffbot is building its own version of the Google Knowledge Graph, which already contains more than 1.2 billion objects and is growing at a rate of 10 million objects per day. So how can retail investors make money here? The problem for retail investors is that Diffbot remains private at the moment, and with the Series A funding round having closed just a few days ago, it is unlikely to pursue an IPO anytime soon.

