
Snowflake Architecture: Building a Data Warehouse for the Cloud


Summary

Thierry Cruanes covers the three pillars of the Snowflake architecture: separating compute and storage to leverage abundant cloud compute resources; building an ACID compliant database system on immutable storage; and delivering a scalable multi-tenant data warehouse system as a service.

Bio

Thierry Cruanes co-founded Snowflake and currently serves as Chief Technical Officer. He is a leading expert in query optimization and parallel execution. He spent 13 years at Oracle focused on the optimization and parallelization layers in Oracle databases including as lead for the optimization group.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Cruanes: You have to go back in time a little bit. In 2012, a data warehouse was a big honking machine that you had in your basement. It was an appliance. At the time, I was working in a network company, and when we were describing what we were doing as a technology to new employees or prospective employees, people were looking at us like we were dinosaurs. We were building software for something of the past. What happened around 2010 was actually the rise of the cloud.

The Dream Data Warehouse

Around 2012 we said, "Ok, if we had to build the dream data warehouse, what would that be? What would be the characteristics of that system?" We wanted that system to have unlimited and instant scalability. We wanted that system to have no data silos. We wanted it to be 10 times faster than other systems, because you can gather a lot of resources. We wanted that system to be able to store all your data. If you have to store your data in different machines, in different systems, then you are losing, because they are very complex systems to manage.

At that time, there was huge pressure, because all these big data warehouse systems were designed for structured data, for a relational system. You want that system to be able to store both structured and unstructured data. You want it to be able to scale to petabytes because of the very low cost of storage. There was a lot of talk about simplicity. These systems are insanely complex to manage, so you would want that system to be super simple. You don't want to deal with management tasks. You want that system to be offered as a service. No tuning knobs.

If you go back to Hadoop, MapReduce, and all that crowd of people who were pitching big data systems, they were all compromising on things. They were compromising on performance. They were compromising on security. They were compromising on a lot of things. The mantra at the time was that, in order to build a very big, scalable analytic system, you had to give up on all these things. You had to give up on transactions, you had to give up on security, you had to give up on SQL, you had to give up on ACID transactions. You had to give up on everything just to be able to scale. We said, "No, you don't have to give up on all of that to build a data warehouse."

Why Then?

What happened around that time? Why then? We were a little bit naive too when we started, and our interpretation of the cloud was a little bit naive in terms of engineering. The first thing that happened is that storage became dirt cheap. It's super easy to store petabytes and petabytes of data. You don't have to worry too much about the cost, and a storage system can take advantage of that. The other thing that happened is that the network gave us the bandwidth we needed in order to build very scalable, very large systems. These big, million-dollar switches gave us flat networks. Luckily, Intel helped us, helped the cloud a little bit, by giving up on improving single-core performance.

Now, in order to get performance, you need to gather cores, multiple cores, and multiple machines that can aggregate all this processing power. Data warehouse and analytic workloads are super CPU-bound. They are CPU-hungry. They want a lot of CPU. They want to be able to aggregate a lot of resources in order to do their work. When we were looking at building that new system, we said, "What is the perfect sandbox for this to happen?" This is the cloud. This was our naive view of the cloud: an infinite amount of resources that we can use and abuse in order to build these big analytic systems.

The design principle we were going after was that you have to design for an abundance of resources instead of designing your system for scarcity – for a very small number of CPUs, a very small number of SSDs, a very small amount of network. You don't do that. You design your system for abundance.

Three Pillars

If you can do that, you have something amazing. If you can build a system that can actually gather the resources of the cloud in order to do something, then you have something magical. This something magical rests on three different things that are, I believe, very general. One is an architecture where you can leverage these resources. We call it the multi-cluster shared data architecture. Essentially, it's an architecture that allows you to leverage the elasticity of the cloud, and where you pay only for what you use. If you don't have to keep these resources always on, and can remove them when you don't need them, then you win. Because you have this architecture, you have instant scalability: "I want machines in the next two minutes. I want resources in the next second." If you have a system that can do that, you win. These systems also give you performance isolation. If I can isolate workloads on different sets of clusters, then I win.

The second pillar comes from the fact that storage became incredibly cheap. Because storage is cheap, you can keep multiple versions of the same data. You can build a system on immutable storage. There was a great talk this morning. I remember a paper from a long, long time ago about the immutability of storage and its implications. It has very deep implications across the whole software stack. If you have immutable, scalable storage, you can have extremely fast response times at scale, because you can have multiple resources reading that read-mostly storage. It allows for fine-grained vertical and horizontal pruning on any column. We wanted that storage, and that storage can be applied to both structured and semi-structured data.

The third aspect is very important to all systems, but it's something we learned along the way. We didn't really have experience with it, so we had to learn. It's probably obvious for most of you, but building a multi-tenant system is insanely important and has very deep implications in the architecture of a system. The system has to be self-tuning. It has to be self-healing. You cannot babysit that thing all the time. For your customer, it has to be 24 by 7. Therefore, it has to provide transparent upgrades. The system is upgraded all the time. Most of the components of the system [inaudible 00:08:15] all the time. Customers don't see any of this; only you do. The second thing is that you want an architecture which is designed for availability, durability, and most of all, security. Because you are providing a service, you are responsible for providing all these things to your customer.

I'm going to go through these three different pillars of the architecture, and we will start with compute.

Architecture

If you go back in time, or even if you look at the most traditional architectures today, in order to build scalable systems, people have either used a shared-disk architecture or a shared-nothing architecture. I'm not going to talk too much about the shared-disk architecture, because almost everybody today uses shared-nothing in order to scale. It seems very simple. You take a piece of data – you have a petabyte of it – you slice it into pieces, and you put it on local machines. Then the application, or the way you process that data, targets each and every one of these machines, and then you do scatter/gather processing. It's very easy to understand.

The problem with that architecture in the cloud is that now the data is tied to your processing. If you want to scale that processing to support more and more customers, you still have the data located on those machines. You need to replicate. You need to have more and more things. Beyond that, when you put more and more machines into the system and you're trying to scale it, the amount of data that each machine processes gets smaller and smaller, and you get eaten by hotspots and things like that. If you want to increase concurrency on the system, you are also forced to scale the system in order to allow more users onto it. That's not really what you want to do. You want to gather more CPU for the same data. You don't want to spread the data super thinly in order to support more and more workloads. More importantly, when you want to get rid of that system – if you want to leverage the cloud's promise that you can provision your own system and then get rid of it at any point in time – you cannot get rid of this system, because data is on each and every one of those machines.
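To make that coupling concrete, here is a minimal Python sketch of the shared-nothing pattern (not Snowflake code; the column names and hash-partitioning scheme are purely illustrative): rows are hash-partitioned onto local nodes, and a query scatters work to every node and gathers the partial results. Adding or removing a node means moving the data itself, which is exactly the problem described above.

```python
from collections import defaultdict

NUM_NODES = 4

def partition(rows, num_nodes=NUM_NODES):
    """Hash-partition rows onto local node storage (shared-nothing)."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[hash(row["customer_id"]) % num_nodes].append(row)
    return nodes

def scatter_gather_sum(nodes, column):
    """Scatter the aggregation to every node, then gather partial results."""
    partials = [sum(r[column] for r in shard) for shard in nodes.values()]
    return sum(partials)

rows = [{"customer_id": i, "amount": i * 1.5} for i in range(1000)]
nodes = partition(rows)
print(scatter_gather_sum(nodes, "amount"))
# Adding a fifth node means re-partitioning, i.e. physically moving the data.
```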

When we were designing the architecture for Snowflake, we said, "We are in trouble now," because, yes, we have infinite resources, but we cannot really leverage these infinite resources if we don't change something. So we separated compute and storage. What you really want is the data to be shared. You want the state of the database system to be shared and unique, because you want a lot of different use cases on that data. You want the different compute accessing that data to be isolated. You want to be able to scale them independently. You want to have multiple customers accessing the same data. You want to have a lot of processing for a certain workload, and no processing for others. This architecture is what we use to scale.

How does it work? What you really want is the data to be at the center of our universe. You have a single copy of the data, you manage that data, and that data can have multiple formats: JSON, XML, Parquet, etc. Then, in order to process that data, you allocate compute resources. That is what we call, in Snowflake, a virtual warehouse. That virtual warehouse provides you compute resources to access that data. When you're done with it, you get rid of these compute resources; you don't need them, you don't pay for them. The data is still there, but the compute is gone. What it enables is actually having multiple workloads accessing the same data, but with very different compute resources.
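As a rough sketch of how this looks from the client side, the snippet below uses the snowflake-connector-python client to create a virtual warehouse, run a query, and drop the warehouse again; the account, credentials, database, and table name are placeholders, and the warehouse options shown are illustrative rather than anything specific to this talk.

```python
# A minimal sketch, assuming the snowflake-connector-python client.
# Account, credentials, and the sales.public.orders table are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    user="ANALYST", password="***", account="myorg-myaccount"
)
cur = conn.cursor()

# The warehouse is pure compute, allocated on demand; the data itself
# lives in the shared storage layer and is untouched by these commands.
cur.execute("CREATE WAREHOUSE IF NOT EXISTS reporting_wh "
            "WITH WAREHOUSE_SIZE = 'XSMALL' AUTO_SUSPEND = 60")
cur.execute("USE WAREHOUSE reporting_wh")
cur.execute("SELECT COUNT(*) FROM sales.public.orders")
print(cur.fetchone())

# When the work is done, the compute goes away; the data does not.
cur.execute("DROP WAREHOUSE reporting_wh")
conn.close()
```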

When we started, it was a very technical thing, and it took us a while to understand the implications of that architecture for our customers. The implication for our customers is that there is no data silo. Your storage is completely decoupled from compute. You store any data. You have unlimited scalability. The cost of compute is very easily controlled, because you decide to allocate these compute resources only for the amount of time you are doing the processing. Because the data is centralized, it provides an easy way to do dev, test, and QA, because the same data can be used for your test system and your production system. It is exactly the same thing. The storage system that we are leveraging is cloud storage, the object storage of the cloud provider: S3 or GCS or Azure Blob Storage. That storage has incredible durability and incredible availability. We have 11 9s of durability. That is hard to pull off in a database system that sits on a single cluster of machines. We don't have that problem.

What is this virtual warehouse? You can think of it as a cluster of one or more MPP systems. It's your own dedicated system. It's a set of compute. It's a unit of failure and of performance isolation. When a workload is running on a particular warehouse, which is a cluster or a set of clusters, it does not impact another workload, which is another set of compute. Each and every one of these virtual warehouses is resizable on the fly. Because the storage is centralized and can be moved into these different warehouses, you can resize on the fly. You can access any part of the storage. In order to get performance, this data is moved lazily from the blob storage, which is remote, slow, super-durable storage, into SSD and memory, and that's how you get performance. You move data closer to the processing, and you get instant performance. Of course, if you do that, you have split your workload, and now you need something else to coordinate transactions, etc. We'll see a little bit later how you can do that.
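A minimal sketch of the lazy data movement described here, in plain Python with hypothetical class names: the first read of a micro-partition pays the remote object-store latency, and later reads are served from the warehouse's local SSD/memory cache.

```python
import time

class RemoteObjectStore:
    """Stand-in for S3/GCS/Azure Blob: durable but comparatively slow."""
    def __init__(self, partitions):
        self._partitions = partitions

    def get(self, key):
        time.sleep(0.05)              # simulate remote object-store latency
        return self._partitions[key]

class WarehouseCache:
    """Local SSD/memory cache on a virtual warehouse node."""
    def __init__(self, remote):
        self._remote = remote
        self._local = {}

    def read(self, key):
        if key not in self._local:    # lazy pull on first access
            self._local[key] = self._remote.get(key)
        return self._local[key]

remote = RemoteObjectStore({"part-001": [1, 2, 3], "part-002": [4, 5, 6]})
cache = WarehouseCache(remote)
cache.read("part-001")   # slow: fetched from blob storage
cache.read("part-001")   # fast: served from the local cache
```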

Now, if you have such an architecture where you have decoupled storage from compute, you can abuse the cloud. You can leverage the abundance of resources to allocate multiple clusters of machines. Think of it as a ride-sharing system for the database: I'm allocating one cluster, two clusters, three clusters, or four clusters as my workload increases. It automatically scales compute resources based on concurrent usage. What is interesting to notice is that it's not about growing a cluster horizontally; it's really about allocating new clusters of machines to absorb the same workload. If you look at query processing on a system, queries have a sweet spot of resources that they consume; there's a hot set of data that they are processing. The virtual warehouse is a way to capture that sweet spot for a particular workload. Of course, these different clusters that you see, again because of the cloud, we decouple them and we put them in different availability zones. This means that if something happened to one of the data centers, the other two clusters in that picture would still be available for query processing.
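The scaling policy sketched below is a simplified illustration of that idea (the slot counts and bounds are invented, not Snowflake's actual algorithm): as concurrency grows, whole clusters are added up to a maximum, and they are released when the load drops.

```python
import math

def clusters_needed(concurrent_queries, per_cluster_slots=8,
                    min_clusters=1, max_clusters=4):
    """Scale out by adding whole clusters, not by growing one cluster."""
    if concurrent_queries <= 0:
        return min_clusters
    needed = math.ceil(concurrent_queries / per_cluster_slots)
    return max(min_clusters, min(max_clusters, needed))

for load in (3, 12, 40, 0):
    print(f"{load} concurrent queries -> {clusters_needed(load)} cluster(s)")
```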

What does it mean in the real world? You have a production database where you store all your data, and usually you have multiple workloads going after this database. Here is just an example of the things you want to do. Usually, you have your ETL and your maintenance, which run on a virtual warehouse 24 by 7, just pushing data into the system. You have continuous loading, coming from either a Kafka queue or any streaming system, going into a warehouse continuously. These different workloads, because they run on different, isolated compute clusters, don't interact with each other. It's not because at 8 a.m. Monday morning I suddenly need to load 10 terabytes of data into the system that I'm going to impact my continuous loading, or the reporting that I need to do, because these things are running on completely different compute systems.

Scalable Immutable Storage

One of the important things to notice is that, in order to make that happen, you need a very scalable storage system, which is very smart about how the data is accessed and how the data is controlled. Now you have a lot of [inaudible 00:19:27] accessing that data, you need transaction consistency, and you need storage which is very scalable. Luckily, Amazon and Google and all these guys built insanely scalable systems. Now, how do we build a scalable storage system for a database system on top of this object storage? That's why it was [inaudible 00:19:53].

One fundamental property that we leverage in the design of scalable storage is immutability. Immutability allows a system to accumulate immutable data over time: there is version 1 of the data, version 2, version 3, version 4. Of course, if you do that on a traditional system, which is bounded by physical resources – I have only 100 terabytes of storage on SSD, or 200, or 1 petabyte – it's a very stupid thing to do. The cost of the storage, the cost of the hardware you have to put on the floor in order to accumulate all these versions, is crazy expensive, because the same system – your SSD, your memory – is used both for query processing and for versioning the data.

This immutability property allows you to separate compute and storage, because now the compute accesses a particular version of the system at a point in time. If I take a copy of the data and send it to somebody, they can do the exact same processing on that data locally. This immutable storage is heavily optimized for read-mostly workloads. There are different caching layers that you can build in order to get performance across your stack. It's, of course, a natural fit for analytical processing. What is interesting is that when you have storage based on immutable data objects, almost everything becomes a metadata problem. Transaction management becomes a metadata problem: which version of the data do I access? I need to track all these different versions.

Multi-version concurrency control and snapshot isolation semantics are given by this. This is what we use in order to provide transaction semantics. Now we have switched the problem from being a data problem to a metadata problem, which has problems of its own. Again, transaction processing becomes a coordination between storage and compute: who has the right version, how do I lock a particular version, etc. That transaction management across multiple, separated compute systems is global, and it is what allows consistent access across all these compute resources. Of course, there is coordination across all these compute resources; they are not just writing blindly to the storage. We need coordination. By moving all the coordination of transaction management to a different place in the architecture, you allow for synchronization across all these compute resources. All of that is enabled by the immutability of storage.
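Here is a small, self-contained Python sketch of the idea that "everything becomes a metadata problem": data files are immutable, a commit only publishes a new version entry, and a reader that pinned an older version keeps a consistent snapshot while a concurrent load commits a new one. This illustrates MVCC over immutable files in general, not Snowflake's implementation.

```python
import itertools
import threading

class VersionedTable:
    """Immutable data files; each commit just publishes new metadata."""
    def __init__(self):
        self._versions = {0: frozenset()}   # version -> set of file names
        self._latest = 0
        self._lock = threading.Lock()
        self._ids = itertools.count(1)

    def snapshot(self):
        """Readers pin a version: snapshot isolation falls out naturally."""
        return self._latest

    def files(self, version):
        return self._versions[version]

    def commit(self, added_files, removed_files=frozenset()):
        """Writers never modify files; a commit is an atomic metadata swap."""
        with self._lock:
            base = self._versions[self._latest]
            new_version = next(self._ids)
            self._versions[new_version] = (
                (base - frozenset(removed_files)) | frozenset(added_files)
            )
            self._latest = new_version
            return new_version

table = VersionedTable()
table.commit({"part-001", "part-002"})
reader_view = table.snapshot()                        # long query pins v1
table.commit({"part-003"}, removed_files={"part-001"})  # concurrent load
print(table.files(reader_view))      # the reader still sees v1's files
print(table.files(table.snapshot())) # new queries see the new version
```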

Now I have immutable storage, great, but I want that storage to be scalable. How do I make that storage scalable? Dirty secret: for data warehouse workloads, you want to partition the data, and you want to partition the data heavily. The unit of access that you have on the data in that storage system is going to be your unit of modification, your unit of locking, your unit of replication, your unit of recovery.

You want this thing to be as small as possible, and you want, again, the system to learn about the micro-partitioning of the data automatically. Each of the micro-partitions that you see here is columnar: if I have 200 columns, we'll have 200 columns in each of these micro-partitions. I have very precise data demographics about each and every one of these columns. Because you have data demographics for each of these columns and each of these partitions, and we have hundreds of millions of these partitions on immutable storage, you can essentially skip the I/Os you would otherwise need to do in order to process that data. If I have min/max for each and every column, I don't really need indices on the data. I can zoom very precisely to the set of partitions that are supposed to fulfill a particular operation. This representation of a partition is true for query processing, but also for DML – update, delete, insert, all these things – and also for very large bulk operations. If I want to drop last year's data, it becomes completely a metadata operation. Again, by moving the understanding of the storage into the system, we turned it into a metadata problem.
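A tiny sketch of that min/max pruning over micro-partition metadata (the partition names, column, and date ranges are hypothetical): the query touches only the partitions whose ranges can overlap the predicate, and dropping a date range is a pure metadata change.

```python
# Hypothetical micro-partition metadata: per-column min/max demographics.
partitions = [
    {"name": "part-001", "order_date": ("2018-01-01", "2018-12-31")},
    {"name": "part-002", "order_date": ("2019-01-01", "2019-06-30")},
    {"name": "part-003", "order_date": ("2019-07-01", "2019-12-31")},
]

def prune(parts, column, lo, hi):
    """Skip every partition whose [min, max] range cannot overlap the predicate."""
    return [p["name"] for p in parts
            if not (p[column][1] < lo or p[column][0] > hi)]

# WHERE order_date BETWEEN '2019-07-01' AND '2019-12-31'
print(prune(partitions, "order_date", "2019-07-01", "2019-12-31"))
# Dropping "last year" is the same trick in reverse: remove the matching
# partition entries from the metadata without rewriting any data files.
```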

One of the things we wanted to have was a system for pushing more and more semi-structured data. The big data wave was a lot about pushing JSON documents, XML documents, very nested things. It was really a goal for us to have the same performance characteristics for structured or relational data, which is really rows and columns, and for semi-structured data, pushing my documents into that storage. Why we wanted that is because, if you imagine a data warehouse system, it is taking snapshots in time of multiple production systems. If you are looking at the cloud, then you are looking at a centralized system where you have multiple production systems pushing data from different sources.

20 years ago, it was one system, one OLTP system, that was pushing data to a data warehouse system. The pipeline between that OLTP system and the data warehouse system was very simple: I was snapshotting my production system in time. Nowadays, people are talking about microservices, about services. Modern companies today have 20,000 different sources of data that need to land in a single system for [inaudible 00:27:35]. To be fair, you cannot expect the existing traditional data warehouse systems to sustain that, because each time a new source of data is added to the system, you need to change the ETL workflow that pushes that data into the centralized system. It was critical for us to create a modern system in which semi-structured data coming from cellphones, IoT devices, etc., goes into a relational system.

Just a quick example of how the architecture is deployed. You have, at the top, client applications – ODBC drivers, the Web UI, Node.js, etc. – that access the system through HTTP. The first thing you do when you are new to the database is create a new table, so I'm pushing this table into the metadata. Then, in order to process that data, I'm going to allocate compute resources. This is an example of a warehouse: I'm allocating a number of resources to support my workload. As you access the data, these micro-partitions at the bottom move lazily into each warehouse, into the memory or the SSDs of your warehouse. Same thing for the other one.

I'm not just doing read-only things. I'm also going to load that data warehouse. I'm allocating a loading warehouse, which is going to push new data into the system. On commit, this new data is pushed to the back end, to the storage system, which gives us 11 9s of durability. Of course, this is suddenly a new version of the data that needs to be processed, and the other two warehouses there need to access it. Because we realize that a new version of the data has been pushed, each of the query workloads will lazily access the new data.

What is interesting is that we struggled at the beginning to make things super secure, because by default the data would be shared by everybody. But this architecture actually enables data sharing between companies. If I'm Walmart and I want to share data with Nike, or if I'm Heusen and I want to share data with somebody else, I can do it through that architecture. The architecture of the system enables data sharing between companies, not only between different entities inside a single company. It also enables replication – replication between Azure West and Azure East, or AWS West and AWS East, but also replication between different clouds. I can replicate between Azure and AWS. I can actually have a disaster recovery scenario where I fail over between different clouds. This slide is outdated, because we now support Google too.

Multi-Tenant Service

This is probably the biggest learning experience for us, because all of that is database technology, and we were database guys. The biggest learning for us – and maybe not for you, you're already there – was actually what it means to build a multi-tenant service, and we had to learn that over time.

Providing a data warehouse as a service means multiple things. First, it's a multi-tenant service, so we are responsible for all the problems of the system. We are responsible for the administration and the upgrades. The system is also responsible for the availability of data. It's also responsible for durability. When you have your own system – your own MySQL, your own PostgreSQL – you are responsible for it. You are responsible for building these things. We are taking ownership of that.

There were a lot of discussions about open source and things like that. The new way software is delivered to customers is through services. It's no longer through packaged software that you install somewhere; it's delivered as a service. If you look at the Snowflake service – and it's probably the case for any service – there's a metadata layer, a control plane I would say, which contains the semantics and the manageable state of our service: authentication, metadata management, transaction management, optimization, anything that deals with state is in that cloud services layer. Then you have units of processing that are completely stateless: because you moved the state into the cloud services layer, you want the rest of the system to be completely stateless. These virtual warehouses that we are talking about are stateless in every sense. They are dumb number crunchers that don't really know what they are working on. The state is managed in the upper, metadata layer. The storage has to provide you durability.

I'm not going to spend too much time on that slide, because it seems this is your expertise. When you are building a service, you want that service to be built for disaster recovery and high availability, meaning you want that service to be replicated across a few data centers, active-active. You want all the tiers of your service to scale out independently. If you take any database book today and look at the different layers that form the database system, essentially what Snowflake did was take that picture of how to build a database system and move the different layers of the database system into different, completely independently managed services.

The way these services communicate is interesting, because when you put all the services into a single box – if you think not about a database system but about an operating system – the device driver is co-located with the memory manager, which is co-located with the process manager, etc. If you architect an operating system, or a database system as in our case, for the cloud, you split all of these things into different layers so that you can scale them independently.

Another interesting thing is that, by having different layers that communicate in a very asynchronous and decoupled manner, you get reliability, you can upgrade parts of the service independently, and you can scale each and every one of these services independently of the others. We are lucky because, since we own the client – we own the ODBC drivers and the JDBC drivers that live on the client side – we can easily do back pressure, throttling, retries, all these mechanisms that services put in place in order to protect the service from bad actors or from fluctuations in workload. The way you access a database system is very well understood and has a very narrow API, in a sense. Then you can implement all of these things transparently to the client, because the client is not directly connected to the internal services. All these services can scale up and down, retry, and recover independently of each other. Not all systems have that. It's interesting that we control the client API.
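Because the drivers own the client side, they can apply protection mechanisms like the one sketched below: a generic jittered exponential-backoff retry wrapped around a flaky service call. This is an illustrative pattern in Python, not the actual driver code; the exception type and the flaky_send function are hypothetical.

```python
import random
import time

class TransientServiceError(Exception):
    """Stand-in for a throttling or temporarily-unavailable response."""

def execute_with_retry(send, request, max_attempts=5, base_delay=0.5):
    """Retry transient failures with jittered exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return send(request)
        except TransientServiceError:
            if attempt == max_attempts:
                raise
            # back off: 0.5s, 1s, 2s, ... with +/-50% jitter
            time.sleep(base_delay * (2 ** (attempt - 1)) * random.uniform(0.5, 1.5))

def flaky_send(request):
    """Hypothetical service call that fails transiently most of the time."""
    if random.random() < 0.6:
        raise TransientServiceError()
    return f"ok: {request}"

print(execute_with_retry(flaky_send, "SELECT 1"))
```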

The previous slide was probably something you know a lot about, because you are all building services, but this adaptation and this fluctuation of performance is actually important all the way down to the lowest level. A database system is not a black box. There are things happening inside that system that allow it to adapt. You want all the layers of these services to be self-tuning and self-healing internally. You want algorithms to be able to do automatic memory management. If any of these components manage resources on a fixed-size basis, then you have a system which is not very adaptive, not very flexible. If I cannot adapt memory, I commit memory to a particular system for a long period of time. If I cannot scale the CPU on my system, then I'm committing resources to this activity for a long period of time. If I cannot automatically handle failures as part of the processing, then I'm committing resources for the duration of this particular activity. For an analytic workload, that can be minutes, it can be seconds, it can be hours, it can be – hopefully not – weeks.

This principle of having the adaptability of a system go all the way from the client down to the processing is very important and has implications all the way down. We use a few things that help guide our thinking when we are designing new features for the system. I'm just giving an example of how we do skew avoidance inside the system; maybe it's a little bit too database-geeky for the audience. When you have a join, you want to be able to detect skew, because skew kills the parallelism of a system. The way you want that feature to work is completely transparently.

You want that thing to be always enabled, automatic. You want to detect popular values at runtime. You don't want to have somebody telling you, "These are the popular values for my join." You want the system to detect the popular values. You want the system to be self-tuning: "What is the number of distinct values that I want to propagate in order to optimize my join?" You don't want somebody to tell you that. You don't want the DBA to tell you that, because we have millions, hundreds of millions, of queries in that system. In order for that system to be trusted, it has to guarantee that it does no harm. That should probably be number one, because when people are designing adaptive systems, all this back pressure, etc., they need to do no harm. You need a guarantee that the system is going to deliver the service without performance degradation before enforcing these optimizations. The system should decide automatically when it kicks in and when it does not. It has to be enabled by default. It has to be invisible to the user.
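As a toy illustration of runtime skew detection (not the actual implementation, and the 5% threshold is arbitrary): sample the join keys, flag the heavy hitters, and route them through a broadcast-style path while everything else hash-partitions normally.

```python
from collections import Counter

def detect_popular_keys(sample_keys, threshold=0.05):
    """Find join keys frequent enough to overload a single hash bucket."""
    counts = Counter(sample_keys)
    n = len(sample_keys)
    return {k for k, c in counts.items() if c / n >= threshold}

def route(key, popular, num_workers):
    """Popular keys get a replicated/broadcast plan so no worker drowns;
    everything else hash-partitions across the workers as usual."""
    if key in popular:
        return "replicated"
    return hash(key) % num_workers

sample = ["US"] * 900 + ["FR", "DE", "JP"] * 33 + ["BR"]
popular = detect_popular_keys(sample)
print(popular)                           # {'US'} dominates the sample
print(route("US", popular, 8), route("JP", popular, 8))
```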

What's next? Today, database systems are a little bit in a cave. The way database systems are used is: you connect to a database and then you push a workload to that database by expressing it through SQL. Serverless data services are about the system taking ownership of workloads that today run outside the database or data warehouse system and are pushed into it. The same principle applies if you want to reoptimize your storage. If you want to create a data structure that optimizes your workload, if you want to do things that are part of your database workload, you want these things to be taken care of by the system. You want data services.

You want the system to take ownership of this workload for you. These services have to scale horizontally and automatically. They have to handle failures, because you take ownership of what your customer wants to do. Because you take ownership of that, you have to have transparent retries on failure. The state of the service – the state of that workload – is maintained by the service. Therefore, we can secure it, we can manage it, we can scale it, because the state is maintained by the back end, not by the application. To come back to a previous talk: in order for people to trust the system, you have to give back observability into what the system is doing. People have to be able to monitor the system and be confident. It's like your self-driving car: you still have speed control and some feedback that you trust about your car.

The next frontier for the database – or shall we say the data warehouse – is actually to take ownership of these different workloads. "I want to do forecasting," and push that down into the back end such that it can be self-managed, secured, and automatically kept up to date. Not easy.

Cloud Native Architecture

It's an interesting journey, because when we started in 2012, the cloud was the sandbox for us engineers to scale. It was about performance. By rethinking the architecture for the cloud, you can actually add features over time. It's really a gift that keeps on giving. If you don't architect your system for these properties of the cloud, then your competitor will. You really have to rethink how you manage resources for this type of workload.

Questions and Answers

Participant 1: I'm really surprised by the fact that the system can store all types of files. In your data warehouse, you store JSON, XML, CSV, and relational data together. Is it good practice to store everything in a single database, in a single place? When you do a query, searching a file versus searching data in a table is different. How do you handle this? Is that a good practice?

Cruanes: It is. Forget about the format; what you really want is the information to be in a single place. You want performance, you want security, you want all of that. It's not the data format that is important. You want to be able to query, for example, your IoT data, which is pushed into the system, and join it with your business data – my towers, for a cellphone company. You want to separate systems only when the systems don't provide you these characteristics of a database system. If you don't have to use a specialized system, then you don't need to separate that data.

Participant 2: You maintain multiple versions of the data in the system. At the time of an ETL transformation, how do you know which is the latest version? How do you make sure it's the latest version that is being accessed?

Cruanes: Snowflake is fully ACID compliant. You start a transaction and do all your changes in your ETL; that creates versions of the data under the covers. Then, when you commit, this version becomes visible to everybody. The transaction system is based on multi-version concurrency control, or snapshot isolation, in the database structure, where you can maintain transaction visibility across these versions. It's transactionally consistent. We never gave up on transactions.

Participant 3: With decoupled storage and compute, are we not going to flood the network by constantly pulling data into compute for short-lived computations?

Cruanes: Yes. If you look at network bandwidth today compared to SSD, you probably have a 1-to-10 performance difference, maybe 1 to 15. What I didn't go into in too much detail is that you really access only the data you need: the columns you need, the micro-partitions you need. The working set of your query usually fits into these local tiers. You're right: for very short-lived work, your system is going to run at the speed of your network. Today, networks are pretty good, and that's another thing that changed and essentially created the cloud – the ability to build switches and networking architectures that are very flat and give you uniform throughput across data centers.

 


 

Recorded at:

Feb 04, 2020
