What Does Databricks Do and Why Should Investors Care?

Nanalyze has been writing about big data since, well, it became big. While the term has been around since the 1990s, it really hit the big time about 10 years ago. It refers to extremely large and unwieldy datasets, which can be analyzed to reveal patterns or trends, often related to human behavior. Big data is usually linked to artificial intelligence (AI), as the latter needs the former to provide intelligence or predictions for decision-making. This includes everything from sales forecasting to the likelihood of developing cardiovascular disease. In recent years, a number of platforms have emerged that break down the walls between big data storage, processing, and AI analytics. One of the most hyped and valuable private companies in the world playing in this space today is Databricks.

A Big Data Storage Company for AI

Founded in 2013, the San Francisco-based startup has raised a staggering $3.5 billion in funding, including a $1 billion Series G in February followed six months later by a $1.6 billion Series H in August. There are more than three dozen investors, representing a diverse portfolio of interests. There are the traditional and nontraditional venture capital firms like Andreessen Horowitz and Tiger Global Management, along with investment and asset management firms such as Fidelity, T. Rowe Price (TROW), Baillie Gifford, and Franklin Templeton (BEN). All of the big tech cloud companies are also represented: Microsoft (MSFT), Alphabet (GOOG), and Amazon (AMZN). Databricks is now valued at $38 billion, a jump of $10 billion since February, leaving it just outside the top five most richly valued private companies in the world.

List of top unicorn companies.
Credit: CB Insights

What Does Databricks Do?

So, what exactly is Databricks doing to earn itself such vast wealth and a sky-high valuation? Databricks “empowers data science and machine learning teams with one unified platform to prepare, process data, train models in a self-service manner and manage the full [machine learning] lifecycle from experimentation to production.” Or, to put it a bit more simply, the startup has developed an open-source big data platform that is extremely flexible for deploying AI and machine learning applications for enterprise solutions.

We’ll need to bulldoze through a bucketful of buzzwords to learn more.

What is a Data Lakehouse?

Databricks was founded by the original creators of Apache Spark, an open-source engine for processing big data in a form amenable to training AI algorithms. However, the company really made a splash when it added a component called a lakehouse, which is a hybrid database that combines a data warehouse and a data lake. That’s a lot of data, so let’s break this down a bit.

We defined a data warehouse a few years ago, so we’ll stick to the thumbnail version here: A data warehouse is just a collection of structured, relational databases optimized for quick access, reporting, and analysis. It basically integrates data from disparate sources to create business intelligence, such as how customer demographics have evolved over time. A data lake is also a central repository but is capable of storing both structured and unstructured data at scale. It can run “different types of analytics – from dashboards and visualizations to big data processing, real-time analytics, and machine learning – to guide better decisions.” That last bit actually comes from Amazon Web Services (AWS), along with this helpful table:

Comparison of data warehouse versus data lake.
The big question is where to put all that big data. Credit: AWS

In effect, Databricks breaks down the distinction between the two with its so-called Lakehouse Platform, combining business intelligence and artificial intelligence.

[A Data Lakehouse is] what you would get if you had to redesign data warehouses in the modern world, now that cheap and highly reliable storage (in the form of object stores) are available.

Credit: Databricks

If you’re someone who enjoys Kimball vs. Inmon data warehousing architecture debates, you’ll want the advantages spelled out a bit more. Here are some intriguing ones if you’re fluent in nerd:

  • Multiple data pipelines allow many people to read and write data concurrently, typically using a SQL query.
  • Data types include images, video, audio, semi-structured data, and text.
  • Cluster configuration – storage and compute use separate clusters, thus these systems are able to scale to many more concurrent users and larger data sizes.
  • Atomicity – guarantees that operations (like an INSERT or UPDATE) performed on your data lake either complete fully, or don’t complete at all.
  • Every Databricks workspace has a transaction log that provides an infallible ground truth.
  • Allows business intelligence tools to run on source data – no need for your database administrator (DBA) to create copies of data for the data analysts to run their reports on.
  • And on top of that, real-time reporting is the standard – truly actionable insights at the speed of your data.
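
To make the atomicity bullet a bit more concrete, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in. It is not Databricks code, and the `inventory` table, the store names, and the `transfer_stock` helper are all hypothetical. The only point is that a multi-step write either lands in full or leaves the data untouched.

```python
import sqlite3


def transfer_stock(conn, src, dst, qty):
    """Move qty units from src to dst as one all-or-nothing transaction."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE inventory SET units = units - ? WHERE store = ?", (qty, src)
            )
            conn.execute(
                "UPDATE inventory SET units = units + ? WHERE store = ?", (qty, dst)
            )
            (remaining,) = conn.execute(
                "SELECT units FROM inventory WHERE store = ?", (src,)
            ).fetchone()
            if remaining < 0:
                raise ValueError("insufficient stock")
        return True
    except ValueError:
        return False


def snapshot(conn):
    """Return the current table contents as a {store: units} dict."""
    return dict(conn.execute("SELECT store, units FROM inventory"))


# Hypothetical toy data standing in for a table in a lakehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE inventory (store TEXT PRIMARY KEY, units INTEGER)")
conn.executemany("INSERT INTO inventory VALUES (?, ?)", [("A", 10), ("B", 0)])
conn.commit()

ok = transfer_stock(conn, "A", "B", 4)        # both updates are committed
failed = transfer_stock(conn, "A", "B", 100)  # both updates are rolled back
```

After the failed call, `snapshot(conn)` still shows the state left by the successful one. The half-finished second transfer never becomes visible, which is the guarantee the bullet describes.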

The company’s 5,000-plus customers can build lakehouses on all major cloud platforms – AWS, Microsoft Azure, and Google Cloud – to support every data and analytics workload on a single unified analytics platform.

Databricks Use Cases

It may feel like your head is in a bit of a cloud at this point, so let’s look at some real-world examples of how the Databricks Lakehouse Platform can be used. 

Let’s start with the example of a British multinational consumer goods company called Reckitt (RKT.L), which bought into Azure Databricks, a data analytics platform optimized for the Microsoft Azure cloud services platform. Reckitt has struggled with demand forecasting in retail grocery, especially because it serves thousands of small mom-and-pop stores in emerging markets. The company collects tons of data but its legacy system makes it difficult to extract useful insights in a timely fashion.

Visualizing all of that delicious data with the help of Databricks. Credit: Databricks

By deploying Azure Databricks, Reckitt is now able to provide a unified data science platform that its teams can use to deliver machine learning-powered insights to the business. Some of the benefits included:

  • About 98% data compression, from 80TB to 2TB, reducing operational costs
  • Accelerated the running of 24×7 jobs by 2x (from 24 hours to 13 hours to run all of their pipelines)
  • Increased its ability to support its customers by over 10x – from 45,000 stores to 500,000 stores
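
As a quick sanity check on those headline figures, the short sketch below recomputes them from the raw before-and-after numbers quoted in the list. The helper names are ours, not Reckitt's or Databricks'.

```python
def percent_reduction(before, after):
    """Percentage by which `after` is smaller than `before`."""
    return (before - after) / before * 100


def factor(larger, smaller):
    """How many times bigger `larger` is than `smaller`."""
    return larger / smaller


compression_pct = percent_reduction(80, 2)   # storage, in TB
pipeline_speedup = factor(24, 13)            # pipeline runtime, in hours
store_scale = factor(500_000, 45_000)        # stores supported, after vs. before

print(f"compression: {compression_pct:.1f}%")
print(f"speedup: {pipeline_speedup:.2f}x")
print(f"store coverage: {store_scale:.1f}x")
```

The figures work out to roughly 97.5% compression, a 1.85x speedup, and 11.1x the store coverage, so the rounded claims of 98%, 2x, and "over 10x" hold up.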

Comcast (CMCSA) is another customer success story touted by Databricks, which claims it helped the telecommunications conglomerate vastly improve its entertainment business by creating an innovative intelligent voice command system to boost engagement. The platform also enabled Comcast to optimize data ingestion, replacing 640 machines with 64 while improving performance, which means humans spend more time on analytics than fixing infrastructure.

Some other real-world efficiencies: Databricks helped Comcast reduce the number of devops full-time employees required for onboarding 200 users from five to 0.5 and reduced deployment times for AI models from weeks to minutes.

Databricks Competitors

One would assume that Databricks is a leading company in the Data Management and Machine Learning segment. Gartner certainly believes that’s the case:

Gartner magic quadrant.
Databricks is a leader in the Data Management and Machine Learning market. Credit: Gartner

There are some obvious dinosaur names like IBM (IBM), along with upstarts like Alteryx (AYX), another big data and AI analytics company. (We recently added to our position in Alteryx while the market overreacted to some short-term problems.) We’ve also covered Dataiku, a company that combines a data warehouse with built-in machine learning capabilities. Since that article, the New Yawk-based startup has amassed its own small fortune – nearly $647 million at a $4.6 billion valuation.

One of our astute readers asked whether Databricks was also a direct competitor to C3.ai (AI), a company that has (nominally) expanded its enterprise AI platform beyond analytics for Internet of Things (IoT). Certainly, there is increasing overlap between many of these enterprise AI companies that are focused on big data, data science, and machine learning. However, C3 is largely focused on analyzing data coming from many sources, including traditional databases, social media, and especially sensors, using its own homegrown platform, which is customized by industry vertical. Databricks’ strength is in its hybrid Lakehouse Platform for harvesting insights from structured and unstructured data, which it provides to customers across any and all industries for a fee structure that’s likely based on the amount of data processed. These are two different business models addressing different target customers.

One conspicuous name missing from the above quadrant (though found on a different one for Data Management Solutions for Analytics, which sounds awfully close to the first Gartner version of Hollywood Squares) is Snowflake (SNOW). This is a company that Warren Buffett famously bought at its IPO last year and more than doubled his investment. An article in the latest issue of The Economist profiled Databricks and directly named Snowflake as its most serious rival. For a deep and far more technical dive into the potential rivalry and the history of data warehouses, data lakes, and data spas (OK, we made that up), check out this informative blog post from which the below was taken:

Infographics regarding Databricks Lakehouse and Snowflake Cloud Data Platform. Credit: Datagrom

If Snowflake and Databricks are such close competitors, we can then use our simple valuation ratio to compare the two companies should an initial public offering happen. Here’s how Snowflake stacks up:

  • Snowflake
    Market cap / annualized revenues
    $93.279 billion / $1.088 billion ≈ 86

We don’t touch any stock with a ratio greater than 40, no matter how great a story they’re telling. Let’s hope that should an IPO happen, Databricks will price its shares at a more reasonable valuation.

In fact, there is a great deal of speculation that Databricks could IPO later this year, eclipsing the biggest public offering ever by a software company – a record currently held by Snowflake. Databricks is reportedly on track to generate $1 billion or more in 2022 revenue, growing 75% year over year. The company already claims its annual recurring revenue has climbed to $600 million, up from about $425 million the prior year. 
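
For readers who want to play with the numbers, here is a rough sketch of the simple valuation ratio applied to both companies. The Snowflake inputs are the ones quoted above. The Databricks line is purely an illustrative assumption on our part: it treats the $38 billion private valuation as a market cap and the claimed $600 million ARR as annualized revenue, neither of which comes from an SEC filing.

```python
def valuation_ratio(market_cap_usd_bn, annualized_revenue_usd_bn):
    """Market cap divided by annualized revenues (both in billions of USD)."""
    return market_cap_usd_bn / annualized_revenue_usd_bn


MAX_RATIO = 40  # the cutoff quoted in the article

snowflake = valuation_ratio(93.279, 1.088)

# Assumption: private valuation stands in for market cap and claimed ARR
# stands in for annualized revenue. Neither figure is from a filing.
databricks_estimate = valuation_ratio(38.0, 0.600)

# Growth implied by ARR rising from about $425 million to $600 million.
arr_growth_pct = (600 / 425 - 1) * 100

for name, ratio in [
    ("Snowflake", snowflake),
    ("Databricks (est.)", databricks_estimate),
]:
    verdict = "above" if ratio > MAX_RATIO else "within"
    print(f"{name}: {ratio:.1f} ({verdict} the cutoff of {MAX_RATIO})")
print(f"ARR growth: {arr_growth_pct:.0f}%")
```

On these inputs both companies land well above the cutoff of 40, and the claimed ARR growth comes out at about 41%. Treat the Databricks figure as a back-of-the-envelope estimate only, since ARR and annualized revenue are not the same measure.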


These are all impressive numbers but until we can look under the hood with some real SEC filings, it’s impossible to say if a Databricks position belongs in our portfolio. Even if the valuation is in order, we still need to consider if a position makes sense given the other enterprise AI stocks we’re holding.

There’s been plenty of speculation that Databricks is just fanning the IPO flames in order to drive up its price for an eventual sale to Microsoft, which has made the Databricks platform a premier feature of its Azure cloud ecosystem. One thing is for certain: Big data needs a place to roost to feed AI-powered solutions, and Databricks seems to have emerged as one of the preferred homes of this very valuable corner of the cloud-based software industry.


  1. Your comparison of DataBricks to C3Ai would have readers believe that these are two different forms of companies, that play in different pools, however it would appear they swim in the same pool, only wearing different bathing suits!

    1. That’s a very good analogy. We just pulled this up from The Economist – a bit dated from 2019 but interesting nonetheless:

      C3’s chief rival in building a bona fide AI platform is not Big Tech or the very biggest data-analytics unicorns. It is a company called Databricks. It was founded in 2013 by computer wizards who developed Apache Spark, an open-source program which can handle reams of data from sensors and other connected devices in real time. Databricks expanded Spark to handle more data types. It sells its services chiefly to startups (such as Hotels.com, a travel site) and media companies (Viacom). It says it will generate $200m in revenue this year and was valued at $2.8bn when it last raised capital in February.

      Though C3’s and Databricks’ niches do not overlap much at the moment, they may do in the future. Their approaches differ, too, reflecting their roots. Databricks, born of abstruse computer science, helps clients deploy open-source tools effectively. Like most enterprise-software firms, C3 sells proprietary applications.

      It is unclear which one will prevail; at the moment the two firms are neck-and-neck. In the near term, the market is big enough for both—and more. In the longer run, someone will come up with AI-assisted data analytics that are no more taxing than using a spreadsheet. It could be C3 or Databricks, or smaller rivals like Dataiku from New York or Domino Data Lab in San Francisco, which are also busily erecting AI platforms. The field’s other unicorns are unlikely to give up trying. And incumbent tech titans like Amazon, Google and Microsoft want to dominate all sorts of software, including advanced data analytics.

      The entire piece is published on C3’s blog: https://c3.ai/technology-firms-vie-for-billions-in-data-analytics-contracts/

  2. 5th Aug 2022: Databricks Inc. says it has topped $1 billion in annualized revenue, a milestone that comes as the nine-year-old data analytics company looks to acquire other tech startups to drive growth.