Automatic Clustering at Snowflake

Summary

Prasanna Rajaperumal presents Snowflake’s clustering capabilities, including their algorithm for incremental maintenance of approximate clustering of partitioned tables, as well as their infrastructure to perform such maintenance automatically. He also covers some real-world problems they run into and their solutions.

Bio

Prasanna Rajaperumal is a senior engineer at Snowflake, working on Snowflake Databases' Query Engine. Before Snowflake, he worked on building the next generation data infrastructure at Uber. Over the last decade, he has been building data systems that scale at Cloudera, Cisco and a few other companies before that.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Rajaperumal: This is about automatic clustering at Snowflake. I'm a developer at Snowflake, my name is Prasanna [Rajaperumal]. Let's get started.

I wanted to spend a couple of slides going over the general philosophy that permeates all the design choices we make and the algorithms we pick. It looks like a marketing slide, but I'm going to break this down into how it applies to automatic clustering. What is Snowflake? Snowflake is a SQL data warehouse that's built for the cloud and delivered as a service. That's quite a few terms put together, so I'm going to break that down. It's a SQL data warehouse, which means we are SQL compliant, we have to be efficient in doing all forms of joins, and we have to be efficient in evaluating those big WHERE clauses that usually come with the SQL statements.

We are a data warehouse, which means we are good at doing analytical queries. People from databases usually know that there are two categories of workloads. One is OLTP, that's Online Transaction Processing, and the other is OLAP, that's Online Analytical Processing. Our system is optimized to do OLAP queries. OLAP queries usually have large scans with some form of summarization or aggregation done over that scan. Usually, there are filters on a bunch of dimensions, and one of those is a time dimension. A large scan doing some form of summary, that's usually OLAP. OLTP is more like a point update or a point lookup. You want to update a specific record, and that's usually the OLTP style.

Snowflake is built for the cloud. What we mean here is that all our algorithms and system design choices are cognizant of the fact that compute can scale elastically and that it's a separate layer from the actual storage. It's good to also keep in mind that the cost of compute in the cloud is much higher than the cost of storage. These trade-offs play a huge role when we are picking the algorithms.

Snowflake is delivered as a service. Sure, Snowflake is a hosted solution on the cloud, but that's not all we mean here. I think there is a fundamental mind shift: customers really don't have to worry about maintaining or monitoring the health of their database system anymore. Snowflake would like to be the DBA for our customers. When we are developing features, the general theme that connects all of them is, "Can we do things automatically?" There are a bunch of things that, when we introduced them, we wanted to do automatically, and we have done that to a large extent. We do memory management of individual operators automatically. You can think about individual operators that run on a specific server: they can scale up in terms of cores and in terms of memory, and scale down if there are fewer resources available. An operator can spill to disk if it's constrained on memory. It can spill to remote storage if it's constrained on disk as well. You can visualize these individual operators as each having its own little brain and adapting to its environment.

Snowflake also detects the total number of workers that should be working on your query, that's the degree of parallelism. It also detects, for a cluster, how many parallel queries it makes sense to push on to that cluster, that's the number of concurrent queries. It does automatic failure recovery: if we detect that a failure is probably due to an intermittent problem with disk or network, we automatically retry, and usually the retry succeeds if the problem is really intermittent. We can scale entire clusters up. If there are enough queries running on a cluster, we can just spin up a new cluster and direct the next set of queries to it. We can scale query clusters up, and scale them down if your load has gone down.

Obviously, when we thought about doing clustering, we wanted to do that automatically as well. We didn't want the customers to think about, "When do I run my clustering statement? How does it affect my existing workloads? How much resources should I provision for this thing?" We didn't want the customers to think about any of that.

Architecture

Quickly, a high-level architecture of how Snowflake works. At the bottom of the stack there is the cloud storage, which is Amazon S3 or Azure Blob Storage depending on the cloud provider. That is the source of truth, and all persistent data is stored in that layer. The top of the stack is a bunch of servers that we call the cloud services layer, and these are common to all the clients that connect to the system. This layer takes care of all the common things like security, infrastructure, transaction management, and the optimizations that are done on the query. It's interesting to point out that the metadata here is stored separately from the actual data, it's not stored along with the data. We extract the metadata about the data, and it is managed by a completely different system and made available to the cloud services layer.

When a user connects to Snowflake, they are connected to one of the servers in the cloud services layer, and they can provision their own cluster or their team's cluster. Compute is elastic, so you can have any number of servers in the cluster and you can direct your queries to a specific cluster. One thing to point out here is that, sure, reading from S3 is costlier than reading from local disk, but it turns out that in modern cloud environments, for OLAP style workloads, the IO that you do to scan files is not the bottleneck anymore. That's why we are able to do this and have a shared disk architecture. This is the shared disk architecture, where your compute is separate and you have a shared storage.

Table Data

Let's get into some of the details about how data is actually stored inside the tables. The table data is partitioned into segments, which I'm calling files here. I have to put a caveat here, they're not really files. Internally we call them micro-partitions, but that's not easy to visualize, so I'm just going to stick with the term files. Each one maps to a single key-value entry in your S3 or Blob Storage. The file format is columnar, and it's very similar to the one described in the PAX paper, which is an interesting paper. It's actually hybrid columnar. I have a little visualization of what hybrid columnar is. Row-oriented is essentially just storing every row next to each other. It compresses an entire row together and stores the bytes, and then the next row goes sequentially after that. Columnar is exactly the opposite, where all values of a single column go into a file.

Hybrid columnar is the hybrid of these two approaches. You pick row sets, a bunch of rows that go into a file, and then within the file you store the values in a columnar format. Why do we columnarize? One of the main benefits is that we get really good compression. If we pick the compression scheme based on the data type of the column, then we get really good compression. We use base encoding for numbers, and dictionary and trie encoding for strings. If you compare the compression ratio, the difference is huge when you store data columnar versus row-oriented. If your file size is small, remember that the virtual warehouse caches the files it reads from S3, so we end up caching a lot more files. That's quite beneficial for us.
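To make the hybrid columnar idea concrete, here is a minimal Python sketch. It is only an illustration of the layout described above, not Snowflake's file format: the function name `to_hybrid_columnar`, the sample rows, and the `rows_per_file` parameter are all assumptions made for the example.

```python
from typing import Any

Row = dict[str, Any]


def to_hybrid_columnar(rows: list[Row], rows_per_file: int) -> list[dict[str, list[Any]]]:
    """Split rows into row sets, then store each row set column by column."""
    files = []
    for start in range(0, len(rows), rows_per_file):
        row_set = rows[start:start + rows_per_file]
        # Inside one file, all values of a column sit next to each other,
        # which is what makes per-column compression effective.
        columns = {name: [row[name] for row in row_set] for name in row_set[0]}
        files.append(columns)
    return files


# Hypothetical sample data using the name/age/country columns from the talk.
rows = [
    {"name": "Ann", "age": 31, "country": "US"},
    {"name": "Bob", "age": 45, "country": "DE"},
    {"name": "Cid", "age": 27, "country": "US"},
    {"name": "Dee", "age": 52, "country": "IN"},
]
files = to_hybrid_columnar(rows, rows_per_file=2)
print(files[0])  # {'name': ['Ann', 'Bob'], 'age': [31, 45], 'country': ['US', 'DE']}
```

Each resulting "file" holds a horizontal slice of the table (a row set), but within that slice the values are grouped by column.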

Some characteristics of these files. These files are immutable. When you have to make a change to a file, it's essentially a copy-on-write. You have to copy all the other records that you don't change and write them into a new file, and that's why the file is the unit of update. Say you have a file with two records from the same table that we had before, with name, age, and country, and for some reason I want to update the age of Trevor to 45. Maybe 36 would have made more sense, but let's say he lied about his age and we had to update it to 45. We have to remove that file and write a new file. It's the same for an insert into the file. There is no way to append or make an in-place update.
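The copy-on-write update can be sketched as follows. This is a simplified illustration under assumptions: the `File` class, the `update_age` helper, the file IDs, and the sample values (including Trevor's original age and the countries) are hypothetical and not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class File:
    file_id: int
    rows: tuple  # immutable tuple of (name, age, country) records


def update_age(table, file_id, name, new_age, new_file_id):
    """Return a new table in which the old file is replaced by a rewritten copy."""
    old_file = table[file_id]
    # Copy every record; only the matching record gets the new age.
    new_rows = tuple(
        (n, new_age if n == name else age, country)
        for n, age, country in old_file.rows
    )
    # The old file is dropped from the table and a new file takes its place.
    new_table = {fid: f for fid, f in table.items() if fid != file_id}
    new_table[new_file_id] = File(new_file_id, new_rows)
    return new_table


table = {1: File(1, (("Trevor", 35, "ZA"), ("Alice", 29, "US")))}
new_table = update_age(table, file_id=1, name="Trevor", new_age=45, new_file_id=2)
print(sorted(new_table))        # [2]
print(new_table[2].rows[0])     # ('Trevor', 45, 'ZA')
```

Note that the unchanged record for Alice is rewritten too, which is the write amplification that motivates keeping files small.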

The file is also a unit of concurrency for us. If there are multiple updates going into the system, they conflict with each other only if they end up deleting the same set of files. That makes sense, the file is the unit of locking. Because files are the unit of locking and the unit of update, we have to size them accordingly. If we pick too big a file size, it reduces concurrency and we pay a high write amplification when doing an update. We want to be cognizant of that, and we usually size the files to be a few tens of MBs to avoid that pitfall. Because we size the files that way, for a petabyte-scale table we might end up with tens of millions of these files. It's good to have this picture at the back of your mind when we talk about how we can do clustering on these files.
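As a quick sanity check of the "tens of millions of files" figure, here is the arithmetic in Python. The 50 MB file size is an assumed value within the "few tens of MBs" range mentioned above, and decimal units are assumed.

```python
def files_needed(table_bytes: int, file_bytes: int) -> int:
    """Number of files of a given size needed to hold a table (ceiling division)."""
    return -(-table_bytes // file_bytes)


PETABYTE = 10**15
MEGABYTE = 10**6

print(files_needed(PETABYTE, 50 * MEGABYTE))  # 20000000
```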

What kind of metadata do we actually collect? We collect zone maps on each of these files. Zone maps are basically stats on every single column of every single file. Let's say you have these four files here and they have two columns, ID and date. The metadata for these four files would be, for each of those files and for each of the columns within those files, the min and max. We collect a lot more than just min and max. We also collect the number of distinct values and the number of nulls in that file. If somebody says, "Give me all the records where this column is not null," we can look at the metadata and do some optimizations there.

You can also see how we can do optimizations when a query has a range. Let's say the query on a column is, "Get me all the rows that fall between 10 and 20." By looking at the min and max of each file, we can do an initial optimization to prune away all the files that don't fall in that range. This is called pruning. I'll be using that word quite a bit. Pruning is basically a way of reducing the initial scope of the search for the query down to the set of files that we think are relevant. We prune away the rest of the files. The more pruning we do, the better the query performance, because we are scanning a lot less data.
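A minimal sketch of zone maps and range pruning, assuming one numeric column per file. The `ZoneMap` class, the helper names, and the sample files are hypothetical; the statistics collected (min, max, distinct count, null count) follow the description above.

```python
from dataclasses import dataclass


@dataclass
class ZoneMap:
    min_value: int
    max_value: int
    distinct_count: int
    null_count: int


def build_zone_map(values):
    """Collect per-file column stats (assumes at least one non-null value)."""
    non_null = [v for v in values if v is not None]
    return ZoneMap(min(non_null), max(non_null), len(set(non_null)),
                   len(values) - len(non_null))


def prune(zone_maps, low, high):
    """Keep only files whose [min, max] range can contain a value in [low, high]."""
    return [name for name, zm in zone_maps.items()
            if zm.max_value >= low and zm.min_value <= high]


# Hypothetical column values for four files.
files = {
    "f1": [1, 5, 9],
    "f2": [12, None, 18],
    "f3": [25, 30],
    "f4": [8, 15],
}
zone_maps = {name: build_zone_map(values) for name, values in files.items()}
print(prune(zone_maps, 10, 20))  # ['f2', 'f4']
```

Only the metadata is consulted here. Files `f1` and `f3` are never opened, because their min/max ranges cannot intersect the query range.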

Index Comparison

At this point, let's do a quick comparison: why zone maps, and why not a conventional index structure like a B-Tree? B-Trees are built for a specific column. OLAP style queries, like we talked about, can have different dimensions in the WHERE clauses, so there could be multiple columns in the WHERE clause. This means we would have to create B-Trees for all the dimensions that we care about. It's not a single B-Tree that we would maintain. With OLAP, the workload is usually bulk updates and large scans.

B-Trees are not efficient when we do bulk updates, because a lot of records are going to be moved around and updated, so you have to rewrite a large portion of the tree. Large scans also mean that you are going to chase down pointers to figure out where each of these records is, and that's not an efficient way to do it. We need a really lightweight indexing mechanism, and zone maps are the right fit. Zone maps were actually introduced by a paper called Small Materialized Aggregates, or SMAs, way back in the '90s I think, and that's a really good paper to read if you're looking for suggestions.

Default Clustering

Let's look into how Snowflake clusters the data, or partitions the data, by default. Clustering is basically grouping a bunch of values together so that it improves your query performance. By default, Snowflake clusters based on the order in which we receive the records. Just imagine a stream of records coming into the system, and we just chop whenever we are able to create the file size we want. The only partitioning logic that we use is, "Are we able to create the file size that we want?" Let's say we reach the file size that we want at 10 records. We chop at every 10 records, create the file, and flush it to S3. As you can see, the values are grouped only by one dimension, and that is the order in which data is being loaded into the system. It's not grouped by any other logical dimension within the data. By default, we don't look into the data itself and try to group it based on a specific column.

Let's look at a quick example here. Let's say I'm executing this query: select a sum, some aggregate, some summarization from a table called orders where the date column falls within three months, between September of last year and December of last year. Here, because date matches the order in which we received those records into the table, you can see that pruning works pretty well. We are able to skip all the files that were created before September and after December, so we are only reading the set of files that are relevant for this query. What if we are interested in a different dimension than date? Say somebody executes this query, select count from orders, because they want to see how many orders came in for a particular popular product like iPhone. Think about the optimizer here.

It has the min and max for the product column, and it tries to prune the files for that query based on the min and max of that column in each of those files. Since iPhone is quite popular, it could be there in pretty much all the files, in at least one row, so you won't end up pruning a lot of these files. You have to scan each file completely to find the records that have the product iPhone. This ordering clearly does not help for any dimension other than the load time.

Partitioning Comparison

I want to take a quick detour and compare other partitioning schemes that you might have heard about or seen. There are a couple of ways to do data distribution. One is hash-based, where you look at a key, hash the key, and based on the hash value the record goes to either partition one, partition two, or partition three. Essentially, you lose the order because your hash is not order preserving, so it's not efficient for range scans. The other way is range-based partitioning, where you have strict range boundaries. Partition 1 is responsible for all records that have keys from 0 to 25, partition 2 for 26 to 50, and so on. The problem with this is that a particular range can be really popular and it can get big. If 0 to 25 is a very popular range of values, then that particular partition gets really big, and we end up having to split that partition so that work can be equally distributed across all of these partitions.
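The two distribution schemes can be contrasted with a small Python sketch. It is an illustration only: the choice of CRC32 as the hash function, the partition count, and the boundary list are assumptions for the example.

```python
import zlib
from bisect import bisect_left


def hash_partition(key: int, partitions: int) -> int:
    """Hash-based placement: not order preserving."""
    return zlib.crc32(str(key).encode()) % partitions


def range_partition(key: int, upper_bounds: list[int]) -> int:
    """Range-based placement with strict, inclusive upper boundaries."""
    return bisect_left(upper_bounds, key)


# Partition 0 holds keys 0..25, partition 1 holds 26..50, partition 2 holds 51..75,
# and partition 3 holds everything above 75.
upper_bounds = [25, 50, 75]

scan = range(10, 21)  # a range scan over keys 10..20
range_hits = {range_partition(k, upper_bounds) for k in scan}
hash_hits = {hash_partition(k, 4) for k in scan}

print(range_hits)  # {0}: the whole range lives in one partition
print(hash_hits)   # typically several partitions, since neighbouring keys scatter
```

With range partitioning the scan touches a single partition, while with hashing neighbouring keys generally land in different partitions, which is why hashing is a poor fit for range scans.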

A key thing to note here is that Snowflake does not use these keys for data distribution. We are a shared storage system. In that sense, the compute nodes don't actually manage data. The only reason we do partitioning here is to do pruning, and approximate sorting is good enough. We can get away with overlap in these ranges and still be fine. A value from 20 to 30 could be either in partition 1 or in partition 2 or in both, because there is an overlap between those ranges, and that's fine. If you're querying between 20 and 30, we might end up scanning 2 partitions, and that's not the end of the world.

Default Clustered Table

Going back to our point that the default way we cluster is not optimal for any other dimension, this is another illustration of that. We have 24 records coming into the system and we have created these 4 micro-partitions based on the order in which they came in. We have the first six rows going into partition one, the next six going into partition two, and so on. You can see that this data is roughly ordered.

There is some overlapping because the date is probably when the event was actually created, and the order is the way the rows were actually loaded. These are late-arriving data and that's fine, there's usually some interleaving. Let's say I'm executing this query: select count from this table where ID equals 2 and date is 11/2. Let's try to be the optimizer and optimize this. Look at the min and max of each of these files and let's try to prune. Let's look at date. It's 11/2, so this file we have to include, but this file we can prune because it doesn't fall into the range. Let's look at ID now. ID equals 2 does not really help you, because the min and max pretty much contain the entire range of IDs. The value 2 is in pretty much every partition.

Explicit Clustering

I think we can do better. If we know that ID is going to be there in most of the queries on this table, then we should order things differently. We should be able to select a list of expressions or a list of columns, saying that these are my interesting dimensions in the table and most of the queries are going to have WHERE clauses on these dimensions, and I can order these columns and set that as my clustering key. Now the clustering key defines its own order. There is the natural order in which we got the data, and then there is the order defined by whatever you set as the clustering key. Automatic clustering is basically the problem of turning files that are ordered the natural way into files that are ordered explicitly using the clustering keys. Roughly, the problem definition can be reduced to continuous sorting, but at petabyte scale.

Just to drive the point home, suppose we set the clustering keys on the table that we just saw. The clustering key is now date and ID. This is a multi-dimensional key where date is the higher order key, so you group by date first and after that you group by ID. You can see that the interleaving in the date is gone here, so we only have two partitions that have 11/2, and we have one partition where all the 2s for that date went. There are 2s here as well, but we don't care about that because date already pruned those micro-partitions. We end up selecting only that one micro-partition.
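The effect of a clustering key on pruning can be reproduced with a small sketch. The 24 records below are made up for illustration (they are not the rows from the slide): dates are represented as the day of November, IDs run from 1 to 6, and each micro-partition holds six rows. The helper names are hypothetical.

```python
def chunk(rows, rows_per_file):
    """Cut a stream of rows into files of a fixed number of rows."""
    return [rows[i:i + rows_per_file] for i in range(0, len(rows), rows_per_file)]


def files_to_scan(files, date, record_id):
    """Count files that survive min/max pruning on both date and ID."""
    hits = 0
    for rows in files:
        dates = [d for d, _ in rows]
        ids = [i for _, i in rows]
        if min(dates) <= date <= max(dates) and min(ids) <= record_id <= max(ids):
            hits += 1
    return hits


# (date, id) pairs in load order: roughly ordered by date, with late arrivals.
load_order = [
    (1, 1), (1, 4), (1, 2), (1, 6), (2, 3), (1, 5),
    (2, 1), (1, 3), (2, 2), (2, 6), (2, 4), (3, 2),
    (3, 1), (2, 5), (3, 5), (3, 3), (3, 6), (4, 2),
    (4, 1), (3, 4), (4, 4), (4, 6), (4, 3), (4, 5),
]

default_files = chunk(load_order, 6)            # clustered by load order only
clustered_files = chunk(sorted(load_order), 6)  # clustered on (date, id)

# Query: WHERE id = 2 AND date = 11/2
print(files_to_scan(default_files, date=2, record_id=2))    # 3
print(files_to_scan(clustered_files, date=2, record_id=2))  # 1
```

In this sketch a full sort is used to produce the clustered layout. The rest of the talk is about getting close to that layout without ever fully sorting the table.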

Hopefully, I've convinced you that good clustering yields better pruning, which in turn yields good query performance. That's not everything that we get from good clustering. There is pruning that's done during query execution as well, during the execution of joins. When you're trying to aggregate based on certain columns, you can see how the data that we send to the upper layers is reduced a lot if we cluster together the things that are most often grouped together in the query. The fundamental thing is that we care about good clustering because it directly impacts query performance.

Challenges

This seems obvious, but what are the challenges? The challenge is with that word, continuous. There is constantly data coming in, changes coming into the table. How do we keep the order? One way to do this is, whenever changes come into the table, we put them in the right bucket. Here, the table is clustered on a column that can go from 0 to 100. It's nicely partitioned from 0 to 10, 10 to 20, and so on. Now there is a new file that covers the entire range, with rows from 0 to 100, all possible values. How do we maintain the order? For us, there are no in-place updates or in-place appends, so we have to rewrite these files completely. This, obviously, is not efficient.

Reclustering inline with the changes has really high write amplification for us. The speed at which you can load data into the table is now bound by how quickly you can do the reclustering, which is not ideal. It's not really practical even for terabyte tables. Obviously, the solution is to not do it right away with every single change. Let's batch the changes up together. But because you're batching these changes, you are more likely to touch a lot more existing partitions, because each of these partitions has a fan-out degree, and we have to be transactionally consistent. We have to block other changes that come into the table while we are doing this huge reclustering job. Depending on how frequently we recluster, we could end up with a huge variance in query performance. Think about running a query right before we start the reclustering job, when the sorting is all messed up, and running the same query right after, when everything is all nice and sorted. For these reasons, it's not practical for petabyte-scale tables.

Requirements

None of the off-the-shelf approaches works for us. We have to really think about what are the actual requirements that we need from the algorithm and let's actually list them down. Key thing here is actual sort order does not matter for us. Approximate sorting, some form of overlapping is acceptable. The goal is to be able to prune files effectively with the min and max.

We want the whole thing to be a background service. We don't want the customer to think about when to run this reclustering job and all the effects that it might have. This next one is critical: we have to find incremental work that improves the clustering state, or improves query performance, because there are always going to be periods of time where we are not keeping up with the changes happening to the table. Whatever work we do during that time should give the biggest bang for the buck in terms of query performance. We want to do the incremental work that is most effective at that point in time.

Obviously, it should not interfere with other changes. Other changes to the table should just go on. Customers should be able to change these clustering keys, because workloads change. The way I access tables changes, so I should be able to set a different clustering key. There is a lot of work in doing the whole resorting, but that should be taken care of by the background service. It should be able to elastically scale out, do the job, and get to a good clustering state based on the new set of keys, automatically and quickly.

Clustering Metrics

Let's go back to the problem definition that we had and add this little caveat. It's approximate sorting. I'm going to define a couple of metrics here to help us understand how well clustered a table is or how badly clustered a table is. Unless we have a way to measure that, we cannot say, "This is bad. Let's get to a good state." Let's go over one metric that is the width of a partition.

You're going to see this X-axis a lot in the next few slides, so I'm going to explain it. The clustering value range is based on the clustering key that we set, and it covers all possible values that key can have. It goes from the min of the mins of that clustering key across all the files to the max of the maxes across all the files, so this is the entire range. For the width of a file, you plot the min and the max, and the length of that line is the width of this partition. It's confusing that the length is the width, but how wide the partition is, is critical to understanding how many overlaps it might have. Here, the min and max for that particular file are 18 and 78, so the width is 60. The clustering key is age there.
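The width metric is simple enough to state in code. In this sketch a file is represented only by the (min, max) of its clustering key. The first file uses the 18 and 78 from the example above, while the second, narrower file is an assumed value added for illustration.

```python
def width(file_range):
    """Width of a file: distance between the min and max of its clustering key."""
    low, high = file_range
    return high - low


def value_range(file_ranges):
    """Overall clustering value range: min of mins to max of maxes."""
    return (min(low for low, _ in file_ranges), max(high for _, high in file_ranges))


files = [(18, 78), (30, 35)]  # (min age, max age) per file; second file is hypothetical

print(width(files[0]))     # 60
print(width(files[1]))     # 5
print(value_range(files))  # (18, 78)
```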

The width of the file does not directly correlate with the number of records in that file. We could have a narrower file with a lot more records, where the min and max are just pretty close to each other. This narrow file is an ideal state, and this wide file is a bad state for us. The narrow one is ideal because it's hard to have a lot of overlap with a narrower width.

The other metric is the depth. Depth is calculated at a single point in the range. You calculate how many lines overlap, or how many of these files overlap, at a particular value. For example, we have the two files that we saw in the previous slide, and I added one more file which ranges from 27 to 100. For this particular value, 32, the depth is 3. For 70, the depth is 2. For another value like 20, the depth is 1. You see what the depth here really means. If I have a query that accesses a particular value, then I'm going to scan that many files, because I can prune away the other files and don't have to look into them. If a query asks for age equals 20, then I don't have to look at these other files, but if a query asks for age equals 32, this is bad, because we end up scanning three files.
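The depth numbers quoted above can be checked with a short sketch. The ranges 18 to 78 and 27 to 100 come from the talk. The range of the narrow file is not stated, so 30 to 35 is an assumed value chosen to be consistent with the quoted depths.

```python
def depth(file_ranges, value):
    """Number of files whose [min, max] range contains the given value."""
    return sum(1 for low, high in file_ranges if low <= value <= high)


files = [
    (18, 78),   # the wide file from the width example
    (30, 35),   # the narrow file (assumed range)
    (27, 100),  # the file added on this slide
]

print(depth(files, 32))  # 3
print(depth(files, 70))  # 2
print(depth(files, 20))  # 1
```

The depth at a value is exactly the number of files a point query on that value has to scan after min/max pruning.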

Goal

We see that the query performance can vary a lot depending on what value range the query is interested in. The fundamental goal is to have a predictable maximum query time. Depending on which range you query, you can have huge variation. Like we saw, we could be scanning thousands of files versus a few hundred files. The goal here is to reduce the max depth across all of the overlapping files. The way to do that is to reduce the width of the files so that they don't overlap with each other. That's how these two metrics tie into the algorithm. The goal is basically to reduce the worst clustering depth, and the way we do this is by looking at the width of the files and the overlap.

Let's go through this scenario really quickly. We have the same value range here and we are looking at the depth at two particular values, B and F. We have four files here, and the letters are the actual values in the files. You can see that this particular file has a width of 11 because it has values from A to K. Let's say there are two new files that are added into the system. The intention here is to work on the files that are really wide and overlap a lot with each other. If we merge these three files and create three new files, then the overlapping goes down.

If you go back, the depth increased to three and four at those particular values because we added these new files, and now, after working on these three files, we are able to reduce the depth at those values back to two. This is the fundamental intuition. We can pick a small set of files that we can merge and repartition to reduce the depth, but which set of files do we pick? There are tens of millions of these files lying around. What is the small set of files that we pick to incrementally improve the clustering state?
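The merge-and-repartition intuition can be sketched as follows. The three wide files and their letter values are made up for the example (they are not the files on the slide), and `merge_and_repartition` is a hypothetical helper, not Snowflake's implementation.

```python
def depth(files, value):
    """Number of files whose min..max range contains the value."""
    return sum(1 for f in files if min(f) <= value <= max(f))


def merge_and_repartition(files, rows_per_file):
    """Merge the chosen files, sort their values, and cut them into new narrow files."""
    merged = sorted(v for f in files for v in f)
    return [merged[i:i + rows_per_file] for i in range(0, len(merged), rows_per_file)]


# Three wide files that overlap heavily on the clustering key.
wide_files = [
    ["A", "D", "G", "K"],
    ["B", "E", "H", "J"],
    ["B", "C", "F", "I"],
]

new_files = merge_and_repartition(wide_files, rows_per_file=4)
print(new_files)
# [['A', 'B', 'B', 'C'], ['D', 'E', 'F', 'G'], ['H', 'I', 'J', 'K']]

print(depth(wide_files, "B"), depth(new_files, "B"))  # 3 1
print(depth(wide_files, "F"), depth(new_files, "F"))  # 3 1
```

The same number of files and rows exist before and after. Only the width of each file, and therefore the overlap, has changed.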

This is the same visualization but at a much larger scale. This is the clustering value range and these are files with the width as the length of the line here. If they don't overlap with each other, I've just put them next to each other and I'm just stacking up. Here, when somebody queries at this particular value range, we are going to scan a lot more files than say here.

This portion is poorly clustered, and we see that by plotting this histogram. We have this target depth that we are OK with, and that's how approximate we want our algorithm to be. It sets the level of approximation. As long as we bring these peaks down below the target depth, we are good. If we find the ranges where the overlap goes beyond the target depth and work on the files that fall in those ranges, we should be able to bring the peak down. Once we finish reclustering, it should look something like this. This is a good state to be in. We don't have to do any more reclustering on that table because we are at our target depth.
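A simplified sketch of the selection idea: build a depth histogram, find the points where the depth exceeds the target, and pick the files that cover those points. This is an illustration under assumptions, not the production partition selection algorithm. The file ranges, the target depth of 2, and the function names are hypothetical, and depth is only evaluated at file endpoints, which is sufficient because depth over closed ranges only changes there.

```python
def depth_at(files, point):
    return sum(1 for low, high in files if low <= point <= high)


def depth_histogram(files):
    """Depth evaluated at every file endpoint, in key order."""
    points = sorted({p for f in files for p in f})
    return {p: depth_at(files, p) for p in points}


def select_files(files, target_depth):
    """Indexes of files that cover at least one point deeper than the target."""
    peaks = [p for p, d in depth_histogram(files).items() if d > target_depth]
    return [i for i, (low, high) in enumerate(files)
            if any(low <= p <= high for p in peaks)]


files = [
    (0, 9), (10, 19), (20, 29), (30, 39),  # well-clustered files
    (12, 27), (15, 25), (18, 22),          # overlapping files causing a peak
]

print(max(depth_histogram(files).values()))  # 4
print(select_files(files, target_depth=2))   # [1, 2, 4, 5, 6]
```

Files 0 and 3 are left alone because no part of their range exceeds the target depth, so no work is spent on regions that are already in a good state.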

Algorithm Outline

Quick outline. When changes come in, we run the partition selection algorithm. That's the one that draws out the histogram, figures out the peak ranges, picks the files that need to be merged, and creates small batches of files that need to be worked on. The actual merging happens in our elastic warehouse, which can scale up and down based on the queue size. It quickly pulls in these tasks and tries to do an optimistic commit, because we don't want to interfere with any change that's happening to the table, and then goes back and checks whether the state is good enough. If it is good, then there's no more work we need to do on the table. Otherwise, it just keeps going.
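The outline above can be sketched as a single-process loop. This is a toy model under stated assumptions and not the actual service: files are sorted lists of key values held in an in-memory list, "optimistic commit" is modeled as succeeding only if every input file still exists, and the names `pick_batch`, `try_commit`, `recluster`, `batch_size`, and `max_rounds` are all hypothetical.

```python
def depth(files, point):
    return sum(1 for f in files if f[0] <= point <= f[-1])


def pick_batch(files, target_depth, batch_size):
    """Pick the widest files covering the deepest point, or [] if the table is good."""
    if not files:
        return []
    points = sorted({p for f in files for p in (f[0], f[-1])})
    worst = max(points, key=lambda p: depth(files, p))
    if depth(files, worst) <= target_depth:
        return []
    covering = [f for f in files if f[0] <= worst <= f[-1]]
    covering.sort(key=lambda f: f[-1] - f[0], reverse=True)
    return covering[:batch_size]


def merge(batch, rows_per_file):
    merged = sorted(v for f in batch for v in f)
    return [merged[i:i + rows_per_file] for i in range(0, len(merged), rows_per_file)]


def try_commit(table, old_files, new_files):
    """Optimistic commit: give up if a concurrent change already removed an input."""
    if not all(f in table for f in old_files):
        return False
    for f in old_files:
        table.remove(f)
    table.extend(new_files)
    return True


def recluster(table, target_depth, batch_size, rows_per_file, max_rounds=100):
    rounds = 0
    while rounds < max_rounds:  # guard so the sketch always terminates
        batch = pick_batch(table, target_depth, batch_size)
        if not batch:
            break  # state is good enough, no more work
        try_commit(table, batch, merge(batch, rows_per_file))
        rounds += 1
    return rounds


table = [[1, 50, 99], [2, 51, 98], [3, 52, 97], [4, 53, 96], [10, 11, 12]]
rounds = recluster(table, target_depth=1, batch_size=4, rows_per_file=3)
print(rounds)         # 2
print(sorted(table))  # [[1, 2, 3], [4, 10, 11], [12, 50, 51], [52, 53, 96], [97, 98, 99]]
```

Each round touches only a small batch of files, so the table improves incrementally and a failed commit costs one small task rather than a whole reclustering job.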

Effect on Query Performance

I want to show this slide where the X-axis is time and the green line is your average query performance, measured by the average time we took to respond to every query. This line is basically the number of recluster jobs that happened during that time. The Y-axis here applies to each of these lines, and this blue bar shows how many files were actually added into the system. At time T1, we see a lot of files being added and your query performance degrades, because now the clustering is affected and the pruning is not very efficient.

At T2, clustering reacts to the fact that new files were added to the table, and it starts executing these tasks. As soon as it starts doing some incremental work, you can see the dip in the query time, it actually goes down. A few files are added at T3, and it again goes down and reaches an eventual steady state, which is where it started, and there is no more work for us to do.

Future Work

A little bit about future work, what's left? I think multi-dimensional keys are a very interesting problem. If you linearize the key sequentially, with a higher order column and a lower order column, then the min of the mins of these columns and the max of the maxes could look like a very wide file. Think about geo coordinates. With the min of the mins of two points and the max of the maxes of two points, if we serialize this linearly, we could think that this is a really wide file, whereas actually it is not. We need other linearization mechanisms where the higher order key bits and the lower order key bits actually interleave with each other. We have prototypes around Z-order, Gray order, and Hilbert curves, and we are figuring out which one works best for which data type.
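Bit interleaving is the core of Z-order linearization, and it can be sketched in a few lines. This is the textbook Morton encoding for two non-negative integer keys, shown only to illustrate what "interleaving the key bits" means. It is not a description of what Snowflake ships, and the function name and `bits` parameter are assumptions.

```python
def z_order(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order (Morton) key."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)      # bits of x go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)  # bits of y go to odd positions
    return z


print(z_order(3, 5, bits=3))  # 39, binary 100111
print(z_order(0, 1))          # 2
print(z_order(1, 0))          # 1
```

Because both keys contribute to the high-order bits of the result, points that are close in both dimensions tend to get nearby keys, unlike simple concatenation where the first column dominates the order.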

There is a manual step here, where the customers have to figure out the list of columns that is going to be useful to set as the clustering keys, and their order. We want to analyze the workload and set that automatically for the customer. Today, the partition selection algorithm does not take query usage into account. It just works on peaks that look bad, but those ranges may not even be queried at all. That should be a factor in which set of files we work on to get to a better clustering state, and that's phase two of the partition selection algorithm.

I think Snowflake is an amazing team and we are taking on a lot of good, interesting projects, so if you're looking for change, then just reach out to us. We recently opened up a Berlin engineering office as well in case you're interested.

I shouldn't take credit for a lot of the things that I said. This guy, who's my colleague, actually implemented a lot of this, and I give a special shout out to him and to our founders for thinking up this algorithm. I should give them credit for that.

Questions and Answers

Participant 1: Sorry, I don't know enough about Snowflake. Wouldn't it be helpful to know the business model and understand whether you charge by query, by computation, or by amount of data? Also, when customers provide data, do they have to define some partition keys or not?

Rajaperumal: About the billing, we have a fixed cost for storage, and the cost of compute is based on usage. If you end up storing a few petabytes of data in cloud storage, we bill pretty much the same as what Amazon would bill us, and the compute is based on how much you use. That's an elastic cluster.

Participant 1: By usage, you mean in terms of time elapsed for queries to run?

Rajaperumal: Yes.

Participant 1: If they are not efficient, essentially, you will be charging more.

Rajaperumal: Yes.

Participant 1: What I'm trying to say is, you don't have a lot of incentive to optimize it. Is that correct?

Rajaperumal: No, one of our first values is to do right by the customer. We don't really think that we are going to lose money by optimizing. We are just going to bring on more workloads if we are able to do things much better. We are going to have customers run a lot more queries on us. That's the way we think about it.

Participant 1: Also, the partition keys, when you create your data set, do you have to define that key beforehand?

Rajaperumal: Yes. Those are our clustering keys. Today, you have to think about which columns are most likely to be used in the queries, think about their ordering, and set that as the clustering key.
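
A small sketch in Python can show why the column order matters. This is an illustration under assumptions, not Snowflake's code: the table, the column names `region` and `day`, the file size and the helper functions are all hypothetical. Rows are sorted by a two-column key, cut into files, and a filter is checked against each file's min/max range.

```python
from itertools import product


def build_files(rows, key, rows_per_file):
    """Sort rows by the clustering key and cut them into fixed-size files."""
    ordered = sorted(rows, key=key)
    return [ordered[i:i + rows_per_file] for i in range(0, len(ordered), rows_per_file)]


def files_scanned(files, column, value):
    """Count files whose [min, max] range on `column` could contain `value`."""
    return sum(
        1
        for f in files
        if min(r[column] for r in f) <= value <= max(r[column] for r in f)
    )


# Hypothetical table: 4 regions x 8 days, clustered on (region, day).
rows = [{"region": r, "day": d} for r, d in product(range(4), range(8))]
files = build_files(rows, key=lambda r: (r["region"], r["day"]), rows_per_file=8)

by_region = files_scanned(files, "region", 2)  # filter on the leading key column
by_day = files_scanned(files, "day", 3)        # filter on the trailing key column
print(len(files), by_region, by_day)
```

In this toy setup a filter on the leading column touches one file, while a filter on the trailing column touches every file, which is why the order of the clustering key should follow how the queries filter.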

Participant 2: I'm just curious, did you try to use some other randomized algorithms or locality-sensitive hashing, which are really popular in the search and ranking field? You mentioned you don't need to be exactly correct, and approximately 95% is good enough for you?

Rajaperumal: Not that I know of.

Participant 3: Thanks for the talk, that was really interesting. How do you guys at Snowflake communicate to your customers that, as lots of files are coming in, the query time can go up for a period of time before the background work catches up? As a dev, I have spent a lot of time putting a lot of data into something like Firestore, and then watching spikes, not really understanding what's happening and wondering why it works that way. How do you guys communicate to the customers to watch out for that?

Rajaperumal: We get inquiries from customers saying that their queries are sometimes slow, and that's usually because of this. They didn't realize that there is a lot of data that just came in and they are querying that particular range. The one way we actually tried to solve this problem is by scaling out. We know that there is a lot of work that needs to be done when a lot of data comes in. We can potentially just throw a lot of machines at the problem and try to get to a better state really quickly, whereas in a fixed-size cluster, the reclustering work competes with the queries that come in, so that worsens the query performance even more. Being in the cloud gives us that leverage.

Participant 4: How would you differentiate yourself from Google's BigQuery?

Rajaperumal: I'm sure there are a lot of bullet points that the sales engineers have. As an engineer, I respect Google BigQuery a lot because they are the pioneers of thinking about how to do SQL execution in the cloud. We are pretty similar. I was about to cite an internal performance comparison that we did, but I'm not sure I can.

Participant 4: I mean, you're multi-cloud, you could run on Amazon or Azure.

Rajaperumal: Yes.

Participant 4: That's probably one big differentiator.

Rajaperumal: Yes, that's a big differentiator. I'm just so wired to think about query performance and not think about features like this, but yes, that's a huge differentiator. We can potentially have a single Snowflake database that can be in multiple regions and across multiple cloud providers and provide a single view for you.

Recorded at:

Jul 19, 2019

Prasanna Rajaperumal