Snowflake’s Vision For The Rebirth Of The Data Warehouse
For too many companies, the data warehouse remains an unfulfilled promise. The work that was started with data warehouses to create a living, clearly defined source of truth about what is happening in a business has never really been finished. Far too few companies have achieved the data nirvana of creating a clearly defined, searchable, and scalable data warehouse. Fewer still have complete metadata management, comprehensive data governance, and data lineage. When these victors started to address big data, they didn’t toss their data warehouse, but rather learned how to extract signal from big data and added it to this beating heart of value.
In my view, Snowflake, a SQL-based data warehouse built from scratch on the cloud, is founded on the premise that it is easier to create data nirvana by:
- Implementing a cloud-native data warehouse that brings new levels of flexibility to adapt to workloads, along with self-service and simplified administration through automation.
- Adding big data capabilities to a SQL data warehouse, instead of adding SQL to big data repositories.
By offering capabilities based on these ideas, Snowflake seeks to both overcome the challenges of previous generations of data warehouse technology and embrace big data. (DISCLOSURE: I have worked on research and content marketing projects with Snowflake, Teradata, and most of the other players in the data warehouse and big data space.)
What Went Wrong With Data Warehouses?
For a variety of reasons, the journey toward a data warehouse has not been victorious for most companies. That is not to say that data warehouses have been a failure. We must remember that the need for the data warehouse grew out of the proliferation of enterprise apps. The data warehouse emerged to collect information, create one version of the truth, and support reporting and analytics. This it has done.
But the ways that scalability was supported with pre-computed structures such as star schemas, the difficulty of changing the structure of the database, and the need for experts to configure and administer the data warehouse eventually led to frustration. The development of MPP technology, BI suites, and the later generation of self-service technology was aimed at solving some of these problems.
The arrival of big data really put the data warehouse to the test, less because of volume (the best MPP data warehouses are quite scalable) than because of the variety of big data and a vast array of new types of analytics, some of which were not easy to perform on the data warehouse.
Snowflake essentially argues that the data warehouse should be as flexible and easy to use as the best of the self-service technologies that allow analysts to get their hands on both massive SQL data sets and big data directly. Technology like 1010data provides a specialized language to enable this. Snowflake says, let’s do it with SQL, and add support for variably structured big data as well.
Of course, the traditional data warehouse vendors such as Teradata, and specialized tools like WhereScape and Attunity Compose, are aimed at solving some of these problems, as are Google’s BigQuery and Amazon’s Redshift. But Snowflake delivers something different from these choices, as I will explain.
In the end, the unrealized potential of the data warehouse has led to high levels of frustration for everyone, from the CEO who can’t understand why the business can’t use data more effectively to analysts who can’t get the data they need in a reasonable amount of time.
Can’t Hadoop Do It All?
Hadoop seemed to offer a chance at solving many of the problems mentioned. When the promise of big data emerged along with a new set of technology, it seemed reasonable to think of Hadoop as the one repository to rule them all. Perhaps by using big data technology, many companies have thought, we could finally achieve the data warehouse victory we had always been after. After all, Hadoop was built to scale out, to handle all forms of data, and to allow any type of analytics. Also, even though Hadoop was not built specifically for the cloud, because it rose to prominence as the cloud matured, there are many cloud-based hosting options for Hadoop.
But it eventually became obvious that the expand part of the Hadoop vendors’ land-and-expand strategy wasn’t working out. Getting data into a data lake is one thing. Managing that data and extracting value is quite another. This second step hasn’t been made easier by Hadoop itself but by the emergence of Spark, which provides mechanisms designed for application development and various types of analytics.
But it also turns out that big data doesn’t stay big for long. It quickly gets distilled into tables, and SQL is a great way to perform data analysis, especially in combination with the huge body of analytics and tooling that supports SQL. This has led to a quest to put SQL on top of Hadoop, but most companies have found trying to use SQL on top of Hadoop even more complicated and exasperating than conventional data warehouses. SQL on Hadoop is a work in progress that started with simple queries against large data sets, which work reasonably well. But when you get to the complex SQL queries and multiple workloads that most data warehouses can handle, Hadoop-based SQL is not yet mature enough to handle them.
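As a toy illustration of that distillation, here is a minimal Python sketch (using hypothetical clickstream events, not any vendor’s actual data or API) showing how raw semi-structured records quickly reduce to a small summary table of exactly the kind SQL handles well:

```python
import json
from collections import Counter

# Hypothetical raw "big data": semi-structured clickstream events as JSON lines.
raw_events = [
    '{"user": "alice", "action": "view", "page": "/home"}',
    '{"user": "bob", "action": "view", "page": "/pricing"}',
    '{"user": "alice", "action": "click", "page": "/pricing"}',
    '{"user": "alice", "action": "view", "page": "/docs"}',
]

# Distill the raw events into a small, columnar summary: one row per
# user with an event count -- the kind of result that is naturally
# stored and queried relationally.
events = [json.loads(line) for line in raw_events]
summary = sorted(Counter(e["user"] for e in events).items())
print(summary)  # [('alice', 3), ('bob', 1)]
```

The point is not the mechanics but the shape of the result: four loosely structured events become two tidy rows, and from there on, SQL-style tooling is the natural fit.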
Because Hadoop did not turn out to be the one repository to rule them all, most companies have actually been left fighting a battle on two fronts: the fight to conquer big data and the battle to make the data warehouse work. The problem with this is that most big data technology seems to ignore the victories of the data warehouse and instead attempts to start everything over from the beginning. This is counterproductive.
Bob Muglia, Snowflake’s CEO, and the company’s founders noticed these trends and realized that a new synthesis was possible. Why not take some of the great aspects of big data technology, such as the ability to scale out and handle semi-structured and machine data, and add them to a SQL data warehouse built from scratch for the cloud, one that embraces the zero administration, ease of deployment, and auto-scaling the cloud makes possible? All of the SQL skills your company now has remain relevant to the world of big data. And given that a huge amount of enterprise data is now generated in the cloud, it makes sense for the engine that processes it to be there as well.
Based on this vision, Snowflake developed a strategy to win both the data warehouse and big data battles by building on the achievements of the data warehouse, the flexibility of systems such as Hadoop, and the true elasticity of the cloud. Essentially, through their work, the data warehouse has been reborn, not as what it has and hasn’t been, but as what Muglia sees as what enterprises have always wanted it to be, including a way to embrace big data.
What makes Snowflake so interesting to me is one of the company’s strategies. It’s easier to make a SQL data warehouse speak big data than to make Hadoop speak SQL. The engineering distance from SQL to big data turns out to be shorter than the distance from Hadoop to mature SQL. Snowflake’s argument is that by creating a data warehouse in the cloud that can truly scale to big data levels, and connect to semi-structured data, the data warehouse can become what it was always meant to be: what Hadoop tried to be.
I think this argument is made stronger because big data is quickly distilled as it is used and becomes columnar data of a manageable size that is best operationalized using SQL. Snowflake doesn’t solve all of the problems that prevent data nirvana, but it takes many off the table.
The Current State of Play
Here’s Snowflake’s approach in more detail. The company was founded on three key ideas. The first was to recognize the victories that emerged from the struggle to make data warehouses easier to use and more powerful. Second was to build on the accomplishments of cloud computing by reengineering the data warehouse to take full advantage of the cloud. Third, Snowflake sought to incorporate big data by remaking the data warehouse with the tools needed to easily handle big data.
As Muglia recently told me, “Understanding data is still very much a work in progress for almost all customers. Data is a huge opportunity for almost everyone and it’s very unrealized.” The reason for this is that the challenge in making data warehouses enterprise-ready is not always technological in nature. It is often most difficult to create a common understanding of the business – as in, what does everyone in a corporation need out of data and what are the tools needed to meet these requirements?
One company that achieved this vision on their own is Warby Parker, which I’ve written about a few times (“Why You Can’t Be Data-driven Without A Data Catalog” and “The Heartbeat Of A Data-Driven Culture: How To Create Commonly Understood Data”). Using Looker, they built a solution in-house, creating a definition of the concepts used to run their business, embedding them in a good data warehouse, and making them explorable and findable. Now the entire company is served by a foundation of clean, well-understood data. Data nirvana come to life.
Yet Warby Parker is the exception, not the rule: most companies have struggled. They simply do not have the time or expertise to create this type of advanced solution. Their frustrations are understandable. Often, they also lack an organized approach to data. But there are two other big reasons companies have failed with existing data warehouses: 1) scale and concurrency limitations due to a lack of resources (for the larger use cases) and 2) poor ease of use. Additionally, data warehouses have not been able to handle big data due to the structure of the data itself. Traditional data warehouses cannot easily incorporate machine-generated and semi-structured data.
For in truth, the traditional data warehouse wasn’t built to do what most companies need it to do in order to fight the two-front battle of big data and the data warehouse. Running SQL on Hadoop hasn’t worked as well as it needs to, and that’s because of the engineering distance issue. Making SQL run on Hadoop in a trivial way is far different from making a truly scalable SQL-based data warehouse. SQL is a relational technology, and Hadoop was never built with a relational model in mind. More broadly, Hadoop was not created with enterprise needs at the forefront. The story of Hadoop has been one of gradually adapting a scalable data processing technology to enterprise needs. That story is still unfolding, and much of the solution seems to be provided by Spark.
Existing data warehouse companies like Teradata, Vertica, Netezza, and Greenplum are facing the challenge of making their technology as easy to use and as simple as public cloud technology. They are attempting to adapt their engines to the cloud model to achieve the simplicity and scalability of cloud-native technology. Muglia thinks this engineering distance will be larger than these companies realize. Part of the reason is that traditionally, data warehousing solutions have the compute and data sides tightly coupled. They’re embodiments of the old adage of bringing the data to the compute. But this is actually causing the problem, because resources end up overtaxed at any given time. To be fair, this limitation is high on the minds of all the traditional data warehouse vendors, and they are working feverishly to address it. For example, Teradata recently announced support for AWS using the same engine it uses on-premises. My recent article on Teradata (“Teradata’s Quest To Become The Perfect Cloud Data Warehouse”) describes their journey. It is the first entry in a series I am doing on evaluating the capabilities of cloud data warehouses based on the framework for comparison set forth in the story “What Should The Data Warehouse Become In The Cloud?” I will use this framework to evaluate Snowflake, as well as Google BigQuery and Amazon Redshift, other examples of the data warehouse based in or brought to the cloud.
One thing is for sure: All of the vendors are claiming they have separated compute from storage in one way or another. I have found it difficult to understand what each of them is doing and what impact it will have. I will address this point in later stories.
So what, exactly, is Snowflake doing that is different?
A Way Forward
For Snowflake, separating compute from data is key to overcoming the limitations of traditional data warehouses. This division is the foundation of Snowflake’s model. By making it, they’ve created a modern cloud infrastructure that is low on administration and high on automatic scaling, making it easier to bring big data to the data warehouse than most companies thought possible.
Muglia explained the thinking behind Snowflake’s approach. “We completely break apart data and compute so you can store as much data as you possibly want and do so very cost-effectively on the one hand, and throw compute resources against that in a way that is completely independent of the storage,” he told me.
Snowflake can thus have multiple sets of computing resources working on the same data at the same time. “We can allow for what is effectively infinite concurrency because we can throw multiple sets of computing resources against the data problem at the same time,” Muglia said. “And of course, underneath us we leverage the fact that there’s a cloud, which essentially gives us the ability to muster up those resources on demand for our customers.”
This architecture of a modern data warehouse built for the cloud is a crucial factor for companies. The reason is obvious: the more difficult you make it to access, analyze, or use data, the less likely people are to make the effort to integrate it into their work. Semi-structured data and machine data have made the administrative and formatting side of the traditional on-premises or cloud data warehouse cumbersome, and therefore made the whole data extraction and exploration process seem intimidatingly complex.
Snowflake’s solution for this has been to ensure that semi-structured data can be treated as if it has columnar structure that can then be used with the relational SQL model. Muglia told me that this approach greatly facilitates a better user experience. “All customers have to do is load data and run queries,” he said. “We can load JSON, Avro, and XML into our database because we columnarize data, so we discern information about that data as we load it.”
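To make the idea of columnarizing concrete, here is a rough Python sketch (my own illustration with hypothetical records, not Snowflake’s actual implementation) of discerning structure from variably shaped JSON and pivoting it into columns:

```python
import json

# Hypothetical JSON records with slightly different shapes, as might
# arrive from machine-generated sources.
records = [json.loads(s) for s in (
    '{"id": 1, "temp": 21.5}',
    '{"id": 2, "temp": 19.0, "humidity": 40}',
    '{"id": 3, "humidity": 35}',
)]

# "Columnarize": discover the union of keys across all records as the
# column set, then pivot the row-oriented records into per-column value
# lists, filling gaps with None where a record lacks a field.
columns = sorted({key for rec in records for key in rec})
columnar = {col: [rec.get(col) for rec in records] for col in columns}

print(columns)                 # ['humidity', 'id', 'temp']
print(columnar["humidity"])    # [None, 40, 35]
```

Once the data is in this columnar form, each field can be queried relationally even though no schema was declared up front, which is the user-experience win Muglia describes: load the data, then run queries.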
Snowflake’s cloud-built data warehouse is software as a service, engineered for rapid scaling. Snowflake was designed by engineers for the enterprise. Muglia said his core belief about business today is that “data is the fuel of modern business and customers need to get an answer to their business problems.”
Snowflake is well worth watching going forward. The company is yet another indication that the data warehouse is not outdated – in fact, it’s still essential for enterprises. It just has to be reborn with innovations specifically targeted to help enterprises win the battle with big data.
Article courtesy of Dan Woods.