Big Data vs. Data Warehouses. What’s the Difference?
Technology progresses at a pace that’s impossible to keep up with, and aging technology executives will soon find that all those undergraduate technology classes are quickly becoming outdated. If you’re a Chief Technology Officer (CTO) at a large firm, you don’t have a ton of time to learn about new technologies because you’re too busy fighting fires and making sure you look good at the next board meeting in front of all the other people who are trying to do the same thing. At some point in time, you may find yourself asking: what’s the difference between big data and data warehouses?
Any technology professional is going to be familiar with what a database is. It’s simply a collection of data that grows over time, and from which you learn interesting things by querying. Then there’s the notion of a data warehouse, which is pretty much what the name implies. Let’s not get into the whole “Kimball vs. Inmon” conversation and keep this real simple.
A Data Warehouse Simply Explained
A data warehouse brings together data from a number of disparate databases in an organization, linked by a common key. For example, we might connect records across multiple databases using a unique field called CUSTOMER_ID. Here are some departmental databases whose records we may want to link using CUSTOMER_ID:
- Accounting – An invoice management system
- Sales – A Customer Relationship Management system
- Customer Service – A customer support ticketing system
Using CUSTOMER_ID, you can then easily print out on a single page a list of all invoices that haven’t been paid and a list of the 10 most recent service requests, which a salesperson can then take along to a sales meeting. Of course, today we just use Salesforce for all of this, but this simple example gives you an idea of how useful it can be to connect disparate data sources. That’s what data warehouses are all about, except they take it a step further and use the connected data to make decisions at very high levels. When building a data warehouse, you usually know which questions you might want to answer because some C-level person is asking for certain Key Performance Indicators (KPIs) to be measured. You don’t just go building data warehouses for the sake of building them because it’s an expensive task. Now, let’s talk about “big data” and data warehouses.
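To make the CUSTOMER_ID idea a bit more concrete, here’s a minimal sketch in Python using the standard library’s sqlite3 module. Three separate in-memory databases stand in for the accounting, sales, and support systems. Every table name, column name, and record here is made up for illustration, and a real warehouse would involve a lot more plumbing than this.

```python
import sqlite3

# One connection with three attached in-memory databases, each standing in
# for a separate departmental system (all names here are hypothetical).
conn = sqlite3.connect(":memory:")
for name in ("accounting", "sales", "support"):
    conn.execute(f"ATTACH DATABASE ':memory:' AS {name}")

conn.executescript("""
    CREATE TABLE sales.customers (customer_id INTEGER, name TEXT);
    CREATE TABLE accounting.invoices (
        invoice_id INTEGER, customer_id INTEGER, amount REAL, paid INTEGER);
    CREATE TABLE support.tickets (
        ticket_id INTEGER, customer_id INTEGER, opened TEXT, subject TEXT);

    INSERT INTO sales.customers VALUES (1, 'Acme Corp'), (2, 'Globex');
    INSERT INTO accounting.invoices VALUES
        (101, 1, 2500.0, 0), (102, 1, 400.0, 1), (103, 2, 900.0, 0);
    INSERT INTO support.tickets VALUES
        (9001, 1, '2020-01-05', 'Login fails'),
        (9002, 1, '2020-02-10', 'Invoice question'),
        (9003, 2, '2020-02-11', 'Feature request');
""")


def customer_briefing(conn, customer_id):
    """Collect the one-page summary for a customer across all three systems."""
    row = conn.execute(
        "SELECT name FROM sales.customers WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    unpaid = conn.execute(
        "SELECT invoice_id, amount FROM accounting.invoices "
        "WHERE customer_id = ? AND paid = 0 ORDER BY invoice_id",
        (customer_id,),
    ).fetchall()
    tickets = conn.execute(
        "SELECT ticket_id, subject FROM support.tickets "
        "WHERE customer_id = ? ORDER BY opened DESC LIMIT 10",
        (customer_id,),
    ).fetchall()
    return {
        "customer": row[0] if row else None,
        "unpaid_invoices": unpaid,
        "recent_tickets": tickets,
    }


briefing = customer_briefing(conn, 1)
print(briefing)
```

The only thing tying the three systems together is the shared customer_id value, which is the whole point: without a common key, there is nothing to link on.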
Big Data vs. Data Warehouses
The first thing we need to define is the term “big data” which pretty much defines itself. You’ve probably heard the often-cited statistic that 90% of all data has been created in the past 2 years. That’s big data. All the ginormous sets of data exhaust that are now being generated can be mined (remember data mining?) to extract insights. In today’s high-tech world, we might want to generate insights that we don’t know exist. Donald Rumsfeld cleverly referred to these as the “unknown unknowns,” things we don’t know we don’t know about. In the world of psychology, the closest match is the “unknown” quadrant of the Johari Window, which covers things about a person that neither they nor anyone else knows. You know that person in sales who is unaware of the fact that their mere existence makes everyone around them want to pull a Peter Pan off the nearest high-rise? The fact that the person is unaware of how annoying they are – and the fact that the people around that person can’t exactly put their finger on why – is an “unknown unknown” in that nobody knows why Rob in sales is just a big, fat, obnoxious prick. Anyways, moving on.
Remember how we talked about how a data warehouse is just a collection of databases that are connected? Well, what if one of those databases contained “big data”? That doesn’t change our data warehouse one bit, but we might want to attach some machine learning algorithms to that set of big data and try to learn about some “unknown unknowns.” With that in mind, let’s look at a concrete example of a startup that provides the tools you need to connect all your data – both big and small – so that you can then let your hungry AI algorithms munch away happily on it.
Dataiku – Enterprise AI
In their own words, “Dataiku is the centralized data platform that moves businesses along their data journey from analytics at scale to Enterprise AI, powering self-service analytics while also ensuring the operationalization of machine learning models in production.” In other words, it’s a data warehouse with machine learning capabilities built in. That’s especially important, because we’ve talked before about just how difficult DevOps can be for machine learning implementations.
Update 08/24/2020: Dataiku has raised $100 million in Series D funding to fuel their continued growth. This brings the company’s total funding to $246.8 million to date.
Digging through the Dataiku datasheet, everything sounds pretty data-warehouse-ish with statements like this one:
“Connect to existing data storage systems and leverage plugins and connectors for access to all data from one, central location.”
Yep, sounds like the same concept as data warehousing, and the platform is built to plug into a wide range of data sources.
Of course, not all data is created equal, and that’s where Extract-Transform-Load (ETL) work is required. Dataiku calls this “data preparation,” and they’ve worked to build automation around this area since data preparation is often said to take up as much as 80% of the time required for a data project. Once all the data sources are connected and the data has been properly prepared, data scientists can then start developing use cases to solve problems. Some of the problems that Dataiku has been able to solve include the following:
- E-commerce players raise their conversion rates
- Supply chains optimize their stock levels and delivery times
- Energy providers adapt their production to predicted demand
- Banks and financial actors predict risks and detect fraud
- Telcos and similar subscription businesses work on churn prevention
- B2C companies evaluate customer lifetime value to focus their best efforts
- Industrial companies prevent breakdowns before they happen
- Brands run sentiment analysis by gathering Facebook and Twitter conversations
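Before any of those use cases can happen, the data preparation step mentioned above has to turn messy source records into something consistent. Here’s a rough sketch of that idea in plain Python. It is not Dataiku’s API or method, and the field names, formats, and records are all invented for illustration.

```python
from datetime import datetime

# Extract: hypothetical raw records from two systems with inconsistent formats.
crm_rows = [
    {"CUSTOMER_ID": " 001 ", "signup": "2020-01-15"},
    {"CUSTOMER_ID": "002", "signup": "15/02/2020"},
]
billing_rows = [
    {"cust": "1", "total": "2,500.00"},
    {"cust": "3", "total": "n/a"},
]


def clean_id(raw):
    """Normalize an ID like ' 001 ' to the integer 1."""
    return int(raw.strip())


def parse_date(raw):
    """Try each known date format; return None if none of them match."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(raw, fmt).date()
        except ValueError:
            continue
    return None


def parse_amount(raw):
    """Turn '2,500.00' into 2500.0; return None for unparseable values."""
    try:
        return float(raw.replace(",", ""))
    except ValueError:
        return None


def transform(crm_rows, billing_rows):
    """Transform: clean both sources and merge them on the customer ID."""
    totals = {}
    for row in billing_rows:
        amount = parse_amount(row["total"])
        if amount is not None:
            totals[clean_id(row["cust"])] = amount
    prepared = []
    for row in crm_rows:
        customer_id = clean_id(row["CUSTOMER_ID"])
        prepared.append({
            "customer_id": customer_id,
            "signup": parse_date(row["signup"]),
            "billed": totals.get(customer_id, 0.0),
        })
    return prepared


# Load: a plain list stands in for the warehouse table here.
warehouse = transform(crm_rows, billing_rows)
for record in warehouse:
    print(record)
```

Most of the code is cleanup rather than analysis, which is a fair picture of why this step eats so much of a data project’s time.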
If they can rename their company using a name that people might be able to pronounce without needing a secret decoder ring, they may just be the perfect solution for enterprises that want to move from data warehousing to enterprise AI.
Could we then say that a data warehouse with integrated machine learning capabilities that can access multiple sources of big data is “enterprise AI”? Sure seems like it. Of course, every BSD tech executive out there is going to have some opinion about how “big data” and data warehouses are the same, completely different, or somewhat similar. What we’ve given you here are some educated opinions which you can feel free to spew forth at your next board meeting, along with an equation you can write on the whiteboard:
- (DATA WAREHOUSING + BIG DATA) X MACHINE LEARNING = ENTERPRISE AI
Make it look like you’re one step ahead of the game and justify your high salary because you’re a “thought leader” that the company can’t do without. Nobody will challenge you because nobody’s really listening to what you’re saying anyways. They’re all too busy trying to think how they can somehow steal your thunder. If what you’re saying makes logical sense, there is no wrong answer when it comes to talking about how big data and data warehouses differ, so just own the message and the audience will be none the wiser. You’re welcome.