
From Open Source to SaaS: the Journey of ClickHouse


Summary

Sichen Zhao and Shane Andrade discuss architectural design decisions and some of the pitfalls one may run into along the way.

Bio

Sichen Zhao is a Senior Software Engineer @ClickHouse. Shane Andrade is a Principal Software Engineer @ClickHouse.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Zhao: Have you watched a tennis match before, or have you hit a tennis serve yourself? What makes a perfect tennis serve? It mainly needs two things: speed and accuracy. Speed is intuitive. You need to hit the ball hard enough that the opponent has very little time to react to it. That gives you a better chance to score. How about accuracy? The ball needs to land in bounds, inside the service box, and not right in the middle of the service box, but on the edge or on the lines, so that even if the opponent has a very fast reaction, they need to move a step or two to the side in order to reach the ball. That gives you a better chance to score.

Why am I talking about tennis in a QCon architecture track? The reason is that, in our minds, building the ClickHouse Cloud service from the ClickHouse open source database is very similar to a perfect tennis serve. First, we need speed. In this competitive business world, we need to build something fast. Speed is essential. Second, we need accuracy as well. We need to build something that really meets customer needs and solves customer pain points. That includes not only features, but also scalability, security, and reliability. That makes building ClickHouse Cloud even harder than making a perfect tennis serve. In tennis, you at least have the lines on the court, so the player can aim at the right spot in order to get the accuracy. In the business world, those lines are blurry. Sometimes there are no lines at all. That means we need to talk with many customers, multiple times, in order to hear what they really want us to build.

Outline and Background

In today's talk, we will go through our journey of building ClickHouse Cloud from the ClickHouse open source database. First, I'll introduce what ClickHouse is: the database and some of its common use cases. Then we'll dive deep into the ClickHouse Cloud architecture and share some of the big decisions we made to build ClickHouse Cloud. Then we will look back on our timeline and key milestones and share what we did, when, and how. Finally, we'll summarize with key takeaways and lessons learned in a retrospective.

My name is Sichen. I'm a Senior Software Engineer at ClickHouse.

Andrade: I'm Shane Andrade. I'm a Principal Software Engineer at ClickHouse.

What is ClickHouse?

Zhao: What is ClickHouse? ClickHouse is an OLAP database. It's good for analytics use cases, good at aggregation queries, and good for supporting data visualizations. It works best with mostly immutable data. It has been developed since 2009 and was open sourced in 2016. Since then, it has gained a lot of popularity. It's a distributed OLAP database. It supports replication, sharding, multi-master, and cross-region deployments, so it's production ready as well. Finally, it's fast. That's the most important thing about ClickHouse; that is what makes ClickHouse special. How fast is ClickHouse? Let's take a look at this query. This is a simple SQL query, running against a ClickHouse Cloud instance, that aggregates all the new videos uploaded between August 2020 and August 2021. It aggregates by month and calculates the sum of the view counts for each month, and also the percentage of those videos that have subtitles. The bottom line shows how much data was processed by this query alone: more than 1 billion rows processed and more than 11.75 gigabytes of data touched. Let's have a guess how long this query will take. I'll give you three options: A, 1 to 2 minutes; B, 1 to 60 seconds; and C, under 1 second. It's under 1 second: 0.823 seconds to process that much data.
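
For readers who want to try something similar, here is a minimal sketch of the kind of query described above, run from Python with the clickhouse-connect client. The table and column names (a youtube table with upload_date, view_count, and has_subtitles) are assumptions based on the public example dataset rather than the exact slide contents, and the connection details are placeholders; adjust them to your own instance and schema.

```python
# Minimal sketch: run an aggregation like the one described in the talk
# against a ClickHouse Cloud instance using the clickhouse-connect client.
import clickhouse_connect

client = clickhouse_connect.get_client(
    host="your-instance.clickhouse.cloud",  # placeholder hostname
    username="default",
    password="...",
    secure=True,
)

sql = """
SELECT
    toStartOfMonth(upload_date) AS month,
    sum(view_count)             AS total_views,
    round(100.0 * countIf(has_subtitles) / count(), 2) AS pct_with_subtitles
FROM youtube
WHERE upload_date BETWEEN '2020-08-01' AND '2021-08-31'
GROUP BY month
ORDER BY month
"""

result = client.query(sql)
for month, total_views, pct in result.result_rows:
    print(month, total_views, pct)
```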

How is that possible? Why is ClickHouse so fast? In a traditional database, all the data is organized in rows. For queries like the one on the previous slide, even if the query is interested in only three columns, basically month, view count, and has_Subtitles, the aggregation query needs to retrieve not only those three columns but also all the other columns present in the same rows. That creates a lot of waste. ClickHouse, on the other hand, organizes data by columns. For aggregation queries like the one on the previous slide, we only need to retrieve and compute the data of the three columns the query is interested in. That makes ClickHouse fast, as you can also tell from the animation. Beyond the high-level columnar database architecture that makes ClickHouse fast, there are a lot of other columnar databases out there in the market. What makes ClickHouse even faster than those columnar competitors? For that I need to mention bottom-up optimization. This is just one example picked out of hundreds or even thousands of examples in the ClickHouse open source code. If you know C or C++, you must know memcpy. It copies a piece of memory from source to destination. Usually it's provided by the standard library itself, as shown on the right-hand side. In ClickHouse, after running aggregation queries, and after running memcpy in a loop many times for those big aggregation queries, we figured there might be a better way of implementing memcpy; that's on the left. At a high level, it adds a little bit of right padding to each memcpy, so that when we use this left-hand version for the small copies in the loop, we can load a little bit more data each time. That allows us to use SIMD, Single Instruction, Multiple Data optimization at the CPU level: a little bit of parallelism on each pass through the loop.
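
As a toy illustration of the row-versus-column point (this is plain Python, not ClickHouse internals), compare summing one column when the data is laid out row by row versus as a contiguous column:

```python
# Toy illustration of row-oriented vs column-oriented layouts; a sketch of why
# a columnar scan touches less data, not actual ClickHouse code.
import array

# Row-oriented: every row carries all columns, even the ones the query ignores.
rows = [
    {"month": "2020-08", "view_count": 1200, "has_subtitles": 1, "title": "...", "uploader": "..."},
    {"month": "2020-08", "view_count": 300,  "has_subtitles": 0, "title": "...", "uploader": "..."},
]
total_row_store = sum(r["view_count"] for r in rows)  # drags whole rows through memory

# Column-oriented: each column is a contiguous block; the aggregation reads only
# the one column it needs, which is also what makes SIMD tricks like the padded
# memcpy described above pay off.
view_count = array.array("q", [1200, 300])
total_column_store = sum(view_count)

assert total_row_store == total_column_store
```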

I just want to show this as one example out of hundreds or even thousands of bottom-up optimizations we did in ClickHouse. We have our own implementations for special cases like this in place of the C++ standard library. We also have special implementations of hashing and compression, and sometimes optimizations at the compiler level or for specific hardware as well.

Accumulating all of these optimizations, we're able to achieve faster speeds than the competitors. As a result, over the years of ClickHouse's lifespan, we have attracted different use cases and different customers as well. Because ClickHouse is fast, real-time data processing naturally comes in as a common use case of ClickHouse. Also, because ClickHouse is able to process a large amount of data, is good at aggregation, and is good for supporting data visualization, business intelligence is a good use case as well. ClickHouse supports reads of billions of rows per second and inserts of millions of rows per second, so it also performs well as a tool for logging and metrics, those large amounts of machine-generated data. Recently, we added vector database functionality as well; along with the aggregation and big data processing we already supported, ML and data science use cases have started to come to ClickHouse too.

ClickHouse Cloud Architecture

Andrade: As you saw, we had this wildly successful open source database. We knew at this point we wanted to put it into the cloud and offer it as a service for people to use. That's great, but we had to actually ask ourselves this question: how do we take something that's open source and put it into the cloud? Think about that for a second, because the answer isn't obvious, and it wasn't obvious to us. We had a lot of decisions to make. We didn't start with any architectures or any designs, but we did have a lot of decisions, especially a lot of technical decisions. Within AWS, for example, there's a variety of services that we're able to use; this is just a subset of them. For compute, we have things like Kubernetes and EC2. Even within EC2, we have different flavors: we have VMs, we have bare metal. On the more managed side, we have things like Lambda, Amazon ECS, and Fargate. For storage, there are a number of options. We have things like Amazon S3 object storage, we have AWS EBS volumes, and some EC2 instances have SSD disks attached to them. For networking, there's a variety of options as well. VPCs themselves can be configured in a variety of ways, and there's a variety of load balancers to choose from, including classic load balancers, ALBs, and NLBs.

Knowing that we had to make all these decisions, how did we do so knowing that we were making the right ones? To do that, we had to come up with some guiding principles. These guiding principles, you can think of them almost like the bumpers that you would put in the gutters of a bowling lane. These bumpers are going to push your ball down the lane and keep it centered within the lane so it's not going to veer off course and miss the target at the end of the lane, the pins. This is exactly how we treated our guiding principles. They were a way for us to navigate these decisions and let us reach our goal of being able to deliver this cloud product. What were these guiding principles? The first one is that we knew we wanted to offer our customers a serverless experience. This was very important to us from the beginning, because we knew that a lot of people coming into our cloud were not going to be people who were familiar with ClickHouse and had probably never seen it before. We wanted to make it as easy as possible for them to get started, ideally as simple as pushing a button. In order to do this, we wanted to remove the barriers to entry: no having to learn about all the different ClickHouse server configurations or figure out the right sizing for your cluster. We wanted to make this really simple.

The second guiding principle was performance. ClickHouse is known to be extremely fast, and we wanted to keep that same level of performance that our ClickHouse users were already used to and offer it in the cloud as well. The third guiding principle was around this idea of separation of compute and storage. We knew that doing this would unlock a few things for us, such as being able to temporarily pause the compute while still maintaining the data that's stored within the cluster. It would also allow us to scale these two resources independently of each other. We saw that we had a variety of different use cases, and we had to be able to support all of them in the cloud. Those different use cases are going to have different scaling properties for compute and storage, and having them separated lets us scale those independently. The fourth guiding principle was around tenant isolation and security. Being a SaaS company and having users trust us with their data and upload it to us, we knew that we had to keep that data safe and secure. We knew that we needed to prevent situations such as one cluster being able to access the data of another cluster. Our last guiding principle was around multi-cloud support. We knew that we wanted to be wherever our customers were, meaning that if they're running workloads on AWS, we should have a presence there. If they're running workloads on GCP, we should have a presence there as well. Same for Azure. In order to support multiple clouds, we knew that we needed to make architectural decisions that were going to be portable across these different clouds.

With these guiding principles in place, how exactly did we use them? One early decision we had to make was what kind of compute we wanted to run our cloud on. As you saw on one of the previous slides, there were a number of options for compute just within Amazon. Using these guiding principles, we could already discard a few of them. For example, the bare metal option just wasn't going to work for us because it didn't align with our serverless vision. Some of the managed services such as Lambda, ECS, and Fargate are specific to Amazon; they're not going to be multi-cloud. That really only left us with two viable options: Kubernetes and virtual machines. As you can see, Kubernetes actually checks more of the boxes for us. It provides a better serverless experience; it's basically an abstraction over the hardware, so we're dealing with pods instead of the actual hardware and servers. It also gives us better separation of compute and storage with built-in primitives like persistent volume claims. Lastly, it provides better multi-cloud support, in that it's much more portable across clouds, again thanks to that hardware abstraction it provides for us.

Now that we're able to start making decisions, we can actually come up with an architecture. This is a very high-level overview of what our cloud looks like. At a high level, there are two primary components: the control plane at the top and the data plane at the bottom. The control plane you can think of as essentially the customer-facing piece of our architecture. It's what our users interact with when they log into our website, and if they write any automation against our API, that's going to interact with the control plane too. The control plane offers a number of services and features for our customers, including things like cluster and user management. It provides the authentication into the website. It also handles user communication, such as emails and push notifications. This is also where the billing takes place. On the other side of things, we have the data plane. This is essentially where the ClickHouse clusters themselves reside, as well as the data that belongs to those clusters. The data plane provides a number of sub-components that offer additional features for ClickHouse in our cloud, including things like autoscaling and metrics. This is also where our Kubernetes operator lives.

When users want to connect to their cluster, they do so through a load balancer that's provided by the data plane. This allows them to connect directly to the data plane rather than going through the control plane, saving a few network hops. When the control plane needs to communicate with the data plane, one such example being when a user wants to provision a new cluster, the control plane has to reach out to the data plane, and it talks to the data plane via the API that the data plane exposes. It's just a simple REST API. The data plane will start working on that provisioning process. Once that's done, it's going to communicate back to the control plane via an event bus and send it an event. The control plane can then consume that event, update its own internal state of the cluster once it's completed, and inform the user that their cluster is ready for use.
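
Roughly, the control-plane side of this flow looks like the sketch below. The endpoint path, payload fields, and event shape are hypothetical stand-ins; the real APIs are internal to ClickHouse Cloud.

```python
# Sketch of the control-plane-to-data-plane flow described above:
# a REST call to request provisioning, and an event-bus consumer for the result.
import json
import requests

DATA_PLANE_API = "https://data-plane.internal.example/api/v1"  # hypothetical


def request_provisioning(org_id: str, cluster_name: str) -> str:
    """Control plane asks the data plane (via its REST API) to provision a cluster."""
    resp = requests.post(
        f"{DATA_PLANE_API}/clusters",
        json={"org_id": org_id, "name": cluster_name},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["cluster_id"]


def handle_event(raw_event: str) -> None:
    """Control plane consumes events from the event bus and updates its own state."""
    event = json.loads(raw_event)
    if event.get("type") == "cluster.provisioned":
        # Update internal state, then notify the user their cluster is ready.
        print(f"cluster {event['cluster_id']} is ready")
```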

Using that same example, I want to take you through what actually happens on the data plane when someone provisions a cluster. Once the control plane calls the API of the data plane, the API is going to drop a message into a queue. This queue is consumed by our provisioning engine, which resides on the data plane. The provisioning engine has a number of responsibilities during this process. The first thing it's going to do is start updating some cloud resources. In AWS, it's going to create some IAM roles and update our Route 53 DNS entries. We use IAM here, going back to that tenant isolation principle, as one of the ways we enforce it. These IAM roles are created specifically for this cluster, and they allow only this cluster to access the location in S3 where this cluster's data will reside. We also have to update Route 53 because each cluster that gets created in our cloud gets a unique subdomain associated with it. In order to properly route things, of course, we have to make some DNS entries for that.
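
To make the cloud-resource step concrete, here is an illustrative boto3 sketch of the two pieces described: an IAM policy scoped to a single cluster's S3 prefix, and a DNS record for the cluster's unique subdomain. The bucket name, ARNs, hosted zone, and domain are placeholders, not ClickHouse Cloud's actual values.

```python
# Illustrative sketch of per-cluster IAM scoping and DNS setup with boto3.
import json
import boto3

iam = boto3.client("iam")
route53 = boto3.client("route53")


def scope_iam_to_cluster(cluster_id: str, role_name: str) -> None:
    # Tenant isolation: the role for this cluster may only touch this cluster's prefix.
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-clickhouse-data",
                f"arn:aws:s3:::example-clickhouse-data/{cluster_id}/*",
            ],
        }],
    }
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName=f"{cluster_id}-s3-access",
        PolicyDocument=json.dumps(policy),
    )


def create_cluster_subdomain(cluster_id: str, lb_dns_name: str) -> None:
    # Each cluster gets a unique subdomain pointing at the shared load balancer.
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000000000",  # placeholder hosted zone
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": f"{cluster_id}.example-region.clickhouse.cloud.",
                "Type": "CNAME",
                "TTL": 300,
                "ResourceRecords": [{"Value": lb_dns_name}],
            },
        }]},
    )
```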

The second thing the provisioning engine will do is update some routing in Istio. Istio is a Kubernetes-based service mesh built on top of Envoy. We use it to route all the external requests coming from our users, using that unique subdomain that was generated as part of their cluster, and send them to the right place. The provisioning engine is basically going to put some routing rules in there so Istio knows where to send requests for that subdomain. Lastly, the provisioning engine is going to actually start provisioning resources in Kubernetes, specifically the compute resources. We do this through a pattern called the operator pattern in Kubernetes. The operator pattern essentially lets us define and describe a ClickHouse cluster in a declarative manner using something like YAML. The operator knows how to interpret this YAML file and create the corresponding cluster that reflects that specification or configuration. In this YAML, you're going to have things that are specific to this cluster, such as the number of replicas and the size of each replica, so the amount of CPU and memory, as well as other things that are specific to ClickHouse, like different ClickHouse server configurations, for example.
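
The operator pattern can be sketched like this: the provisioning engine submits a declarative description of the cluster as a custom resource, and the operator reconciles Kubernetes to match it. The CRD group, kind, and field names below are hypothetical, not ClickHouse Cloud's internal definition; they just mirror the shape described above (replicas, CPU, memory, server settings).

```python
# Sketch of submitting a declarative cluster spec as a Kubernetes custom resource.
from kubernetes import client, config

cluster_spec = {
    "apiVersion": "example.clickhouse.com/v1",   # hypothetical CRD group/version
    "kind": "ClickHouseCluster",                 # hypothetical kind
    "metadata": {"name": "abc123", "namespace": "cluster-abc123"},
    "spec": {
        "replicas": 3,
        "resources": {"cpu": "8", "memory": "32Gi"},
        "serverSettings": {"max_concurrent_queries": 200},
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="example.clickhouse.com",
    version="v1",
    namespace="cluster-abc123",
    plural="clickhouseclusters",
    body=cluster_spec,
)
```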

Now that we're able to provision a cluster and it's up and running, how does the user actually connect to it? They do so through the load balancer that I mentioned on the previous slide. It is a shared load balancer per region: for all the clusters in that region, all the traffic goes to the same load balancer. All it does is hand off the request to Istio, which has already been updated with the routing rules when the cluster was provisioned, so Istio knows exactly where to send it. It'll hit the underlying cluster in Kubernetes. Once that request hits ClickHouse, it's probably going to have to start interacting with some data. Depending on whether you're doing an insert or a select, it's going to either read or write some data. To do that, it's going to go out to S3. S3 is where we keep all of our clusters' data. It's our persistent and durable storage for all of our customers' cluster data.

Going back to our guiding principles, we had this idea of separation of compute and storage, and this is one such example of that. With open source ClickHouse, this wasn't the case; it used local disks. This is something we had to build specially for our cloud within ClickHouse, to be able to support this separation and store the data on S3. That being said, this also introduced some network latency, because we're no longer going to a local disk; we have to go out to S3 and fetch that data. So we have two layers of caching on top of that: an EBS volume, and an SSD. We only use instance types that have SSDs attached to them. The reason for that is that EBS volumes, you can think of them as essentially network-attached storage devices. Because they are network attached, there is still network latency involved; they're not as fast as a hardware-attached SSD, so we also introduced a second layer of caching, the SSDs. With this setup, we essentially get more or less the same performance in our cloud as you would with a self-hosted ClickHouse.

If you heard me mention that we have a shared load balancer, you might be wondering how that aligns with our guiding principle of tenant isolation if we have shared infrastructure. We actually didn't start with that; we started with something like this, where we had a dedicated load balancer per cluster. The provisioning engine, instead of updating Istio, which wasn't in place at the time, actually provisioned a brand-new load balancer every time someone requested a cluster, and each cluster had its own load balancer. There were a few problems with this approach. The first problem was around provisioning times. These load balancers on AWS could sometimes take up to 5 minutes, which wasn't a great experience for our customers, because the cluster was completely unusable until that was provisioned. The second problem was around cost. Because we had a load balancer per cluster, this was essentially eating into our margins. The third problem was around potential limits we could run into with AWS, for example. I think by default AWS has a pretty conservative limit on the number of load balancers you can have per account, something like 50. As we were approaching beta and GA, we knew that we were going to be provisioning hundreds, if not thousands, of these clusters. It just wasn't going to be scalable for us to constantly reach out to AWS support to increase our service limits for load balancers. That's why we decided to bypass the tenant isolation principle here; it was a compromise, a tradeoff, to provide a better user experience and also reduce some of the costs.

That being said, we still wanted to offer tenant isolation, so we did so using Cilium. Cilium is essentially a Kubernetes network plugin that provides things like network policies that you can use within Kubernetes. We couple this with the logical isolation we already get from Kubernetes in the form of namespaces: all of a cluster's resources within Kubernetes belong to the same namespace. With the network policies on top of that, we can prevent cross-namespace traffic, so one cluster cannot call into the network of another cluster running in the same Kubernetes cluster. If you want to read more about how we use Cilium, there's a link at the bottom to a white paper from the CNCF, the Cloud Native Computing Foundation, where you can read more about our use case.
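
The kind of isolation described here can be sketched with a standard Kubernetes NetworkPolicy, which Cilium enforces. This is an illustrative policy, not ClickHouse Cloud's actual one: it allows ingress only from pods in the same namespace, so one customer's cluster cannot reach another's.

```python
# Sketch of per-namespace isolation via a NetworkPolicy applied with the
# Kubernetes Python client. Namespace name is a placeholder.
from kubernetes import client, config

namespace = "cluster-abc123"  # each customer cluster lives in its own namespace

same_namespace_only = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "NetworkPolicy",
    "metadata": {"name": "allow-same-namespace-only", "namespace": namespace},
    "spec": {
        "podSelector": {},                             # applies to every pod in the namespace
        "policyTypes": ["Ingress"],
        "ingress": [{"from": [{"podSelector": {}}]}],  # only peers in this same namespace
    },
}

config.load_kube_config()
client.NetworkingV1Api().create_namespaced_network_policy(
    namespace=namespace, body=same_namespace_only
)
```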

We have this architecture that we're able to provision clusters with. Now we need to start scaling those clusters. Because we're offering a serverless experience, we don't have any way for customers to tell us how many resources they need or how big of a cluster they need. We need to do this in some automated fashion, which we call autoscaling. There are two types of autoscaling: vertical and horizontal. The vertical side refers to the size of the individual replicas, so the amount of CPU and memory that each of them has. There are some technical complications with this. The first is that it can be disruptive. Because we're using Kubernetes, when we have to resize a Kubernetes pod, we actually have to terminate the old pod. That means for any workloads running on a pod that is the target of a vertical autoscaling operation, we either have to wait for them to finish, or we have to terminate them. That's not a great experience for users if an important query is running. It's also not a great experience if important system processes such as backups are running on that pod. Because we have automated backups in ClickHouse Cloud, we didn't want to be interrupting backups that customers might actually need in emergency situations. We don't want to interrupt their queries or their backups.

The second problem with vertical autoscaling is around cache loss. I showed how we have two layers of cache: the EBS volumes and the SSD disks. Because the SSD disks are attached to the underlying hardware, if a pod gets autoscaled and moves, there's no guarantee from Kubernetes that it will be rescheduled on the same node, meaning that it may not have access to the same SSD cache that it had before. The way we get around that is by using the EBS volumes. I mentioned that they are network attached, so they can actually follow the pod when it gets rescheduled. If it ends up on another node, we can just reattach that EBS volume and provide a warm cache instead of the hot cache that the SSD would have provided.

On the other side, we have horizontal autoscaling. This refers to the size of the cluster itself, so how many replicas it has. By default, our clusters come with three replicas, and horizontal autoscaling can either increase or decrease that number. There are some technical challenges around horizontal autoscaling. The first is around data integrity. This has to do with the way you can specify your write quorum in ClickHouse. When you're inserting data into ClickHouse, you can specify what quorum level you want. You can ask for a full quorum, meaning that all replicas have to acknowledge the write before it is considered successful. At the other end of the spectrum, you can say, I just want to insert it, I don't really care if anyone else acknowledges it, and that's fine. What can happen in that situation, in the case of no quorum, is that the data resides only on the replica that received the insert until it's replicated in the background to the other replicas. Until that happens, that data only resides there, meaning it would be lost if that replica is the target of a horizontal downscale. The second problem with horizontal autoscaling has to do with the way ClickHouse clusters work. Each replica has to be aware of every other replica so that they can communicate properly. What can happen is, if you remove a replica without unregistering it, the other replicas will still try to talk to it. This is a problem for certain commands: for example, when you're trying to create a table, all the other replicas have to be aware of that change, and if one of the replicas cannot be reached, the statement will fail. If you don't properly unregister the replica during the horizontal downscale, you're going to end up in that situation and your CREATE TABLE will not execute. With these challenges in place for horizontal autoscaling, we realized that these are pretty risky things, so we decided to launch beta and GA with just the vertical autoscaling piece. We're still working through some of the kinks, specifically these ones for horizontal autoscaling, but it should be out shortly.
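
The quorum tradeoff can be illustrated with the clickhouse-connect client. The table and rows are placeholders; insert_quorum is a real ClickHouse setting (0 disables quorum, while a number requires that many replicas to acknowledge the write before the insert is considered successful).

```python
# Sketch of the write-quorum tradeoff discussed above.
import clickhouse_connect

client = clickhouse_connect.get_client(
    host="your-instance.clickhouse.cloud",  # placeholder
    username="default",
    password="...",
    secure=True,
)

rows = [[1, "event-a"], [2, "event-b"]]

# No quorum: fast, but until background replication catches up the data lives
# only on the replica that took the insert -- risky if that replica is removed
# by a horizontal downscale.
client.insert("events", rows, column_names=["id", "payload"],
              settings={"insert_quorum": 0})

# Full quorum on a 3-replica cluster: every replica must acknowledge the write,
# so a downscale cannot lose it.
client.insert("events", rows, column_names=["id", "payload"],
              settings={"insert_quorum": 3})
```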

We did launch with vertical autoscaling, and I want to go into the details of that and show you how we solved the disruption problem for vertical autoscaling. We want this to be done in an automated fashion, and in order to automate it, we have to really understand what's happening on the cluster. To do that, we publish metrics to a central metric store. These usage metrics include things that are internal to ClickHouse, so metrics that are already generated by ClickHouse; we ship those to the metric store. We also ship operating-system-level metrics, for example memory and CPU utilization. Some of the other signals we publish tell us, for example, how many queries are actively running on that cluster, as well as the number of backups that are currently in progress. Now that we've captured those metrics, we're able to make smarter decisions about whether or not it's safe to proceed with a vertical autoscaling operation. The second piece of vertical autoscaling is the scale controller. A controller in Kubernetes is essentially a control loop that watches for Kubernetes resources and is able to react to them. What is the scale controller watching for? It's watching for resources called recommendations coming from the recommender component. The recommender wakes up every so often, takes a look at the metric store, generates recommendations based on the metrics coming out of those clusters, and sends them to the scale controller. The scale controller at that point can decide whether or not to scale up or scale down based on the current state of the cluster and the incoming recommendation. This is how we're able to provide vertical autoscaling.
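
Conceptually, the recommender/scale-controller split looks something like the Python sketch below. The metric names, thresholds, and sizes are made up for illustration; the point is that the recommender only turns metrics into a desired size, while the controller decides whether it's actually safe to act, for example never while a backup is in flight.

```python
# Conceptual sketch of a recommender plus scale controller, not production code.
from dataclasses import dataclass


@dataclass
class ClusterMetrics:
    cpu_utilization: float      # 0.0 - 1.0, from the metric store
    memory_utilization: float   # 0.0 - 1.0
    active_queries: int
    active_backups: int


@dataclass
class Recommendation:
    cluster_id: str
    target_memory_gib: int


def recommend(cluster_id: str, m: ClusterMetrics, current_memory_gib: int) -> Recommendation:
    """Recommender: wakes up periodically and turns metrics into a desired size."""
    if m.memory_utilization > 0.8 or m.cpu_utilization > 0.8:
        return Recommendation(cluster_id, current_memory_gib * 2)
    if m.memory_utilization < 0.2 and m.cpu_utilization < 0.2:
        return Recommendation(cluster_id, max(current_memory_gib // 2, 8))
    return Recommendation(cluster_id, current_memory_gib)


def should_apply(rec: Recommendation, m: ClusterMetrics, current_memory_gib: int) -> bool:
    """Scale controller: only act when the change is real and non-disruptive."""
    if rec.target_memory_gib == current_memory_gib:
        return False
    if m.active_backups > 0:  # never interrupt a running backup
        return False
    return True
```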

Now that we're able to scale these individual clusters, how do we scale our cloud? One of the things companies often run into is cloud provider limits and service quotas. How did we get around this, again knowing that we were going to be provisioning hundreds, if not thousands, of these clusters? Taking a look at our architecture again, we were able to identify a distinct line that could be drawn between two pieces of our infrastructure: a more static side and a more dynamic side. The top part in yellow is the static piece, meaning that it doesn't really change based on the number of clusters running in our cloud. The other piece is dynamic and does change and grow with the number of clusters we have running. If we give these the names management and cell, draw a more simplified version of this, and now think of them as individual AWS accounts, we have a management account and a cell account. All the resources on the previous slide go into their corresponding accounts. Once we start running up against limits in the cell account, because that's the part that grows as the number of clusters grows, we can just add another cell account and register it with the management account. We can continue to do that without running into individual limits. This is how we were able to scale our cloud. This is called a cellular architecture.

To wrap up the architecture portion, I just want to bring it back to the guiding principles, because that's what allowed us to get to this point. We wanted to offer our users a serverless experience to make it as easy as possible to get started. We also wanted to maintain, in our cloud, that same level of performance that users were already used to with ClickHouse. We wanted to offer this separation of compute and storage to be able to scale them independently and support a variety of workloads. We also wanted to provide tenant isolation and security so that our users can trust us with their data. Lastly, we knew that we wanted to have multi-cloud support across AWS, Google, and Azure. The architecture I showed earlier was actually built on AWS, and we were able to port it successfully to Google, which is going to be launching soon. We're also already looking at Azure using the same architecture.

Timeline and Key Milestones

Zhao: Now that you all know about our architecture and all of the hard decisions we made along the way to build ClickHouse Cloud, how did we fit all of this software development work into just one year? Let me share that with you. Using the initial tennis metaphor, now that you know our techniques as a player, how did we end up getting the trophy, which is launching ClickHouse Cloud within one year? It all started at the end of 2021, when our co-founder got initial funding and started the company. From the beginning of 2022, we set a very aggressive timeline for ourselves, which was launching the cloud service within just one year. We broke this one-year goal into three key milestones: private preview in May of 2022, public beta in October, and then GA in December. What did we do for each of these milestones? For the private preview, since ClickHouse is already an open source database, there were already a lot of customers using ClickHouse, so we invited those existing ClickHouse customers to try ClickHouse Cloud. We built a basic cloud offering. That offering was limited in terms of functionality: we didn't have autoscaling in place, and we didn't have metering, because we were not going to charge customers during private preview. We had all the other essential things in place, though, including SOC 2 Type 1 compliance. We thought about security from the very beginning, as those customers are real customers with real customer data, and we didn't want to leak any of it. We wanted to be responsible even in our private preview. Also, we had the fully managed system, so that customers could come and self-serve. They had an end-to-end cloud service experience; they didn't need to ask us to provision for them manually. On top of that, we also had well-defined and well-developed monitoring, alerting, and on-call systems. We had full reliability in place, so customers could self-serve and everything was monitored; we took care of whatever clusters they created in the cloud during private preview. Internally, we also prioritized automation, infrastructure as code, and CI/CD. That allowed us to move quickly before private preview, and afterwards in the second half of the year, when, after we got the private preview feedback, we needed to speed up and build even more than we had initially planned.

What did we do for public beta? Initially, we planned mainly two things. One was autoscaling, which is very hard to do; as you saw in the previous slides Shane talked about, there are all those challenges, and it takes a lot of time. The other was metering: we were going to start charging customers at public beta, so we needed to build metering. On top of those two, because of the private preview feedback we got, we also added enhanced security features like private link, IP filtering, and auditing, which all of those enterprise customers asked us for. Also, because public beta was going to be public, it was going to be our first appearance in the general market, so we did rounds of reliability and scalability testing before the public beta launch. That's when we decided to adopt the cellular architecture internally as well.

After public beta, we were heading towards GA with just two months to go. What did we do there? The first two items were mainly about customer feedback from public beta. After we launched public beta, we heard feedback that our SQL console didn't have enough functionality, so we built our enhanced cloud console within just two months. We also realized that the public beta customers were a little bit different from our private preview customers. The private preview customers were mostly enterprise customers, requesting the kinds of features you see in the public beta phase: all the security features and compliance that big companies worry about. After public beta, we got a lot of traffic from smaller, individual developers who use ClickHouse to build their own projects or to learn ClickHouse as a database. For those individual customers, all the fancy security, durability, reliability, and high-availability features are not as appealing as a low charge on their credit card. As a result, we started to support a Developer Edition at general availability. Also, we kept going on reliability and security, supporting an uptime SLA for GA, and also SOC 2 Type 2 compliance.

Key Takeaways

Looking back, in retrospect, to our initial question: how did we fit all that software development work into just one year? We think there are three key things that helped us build something with the speed and accuracy to solve customer problems. The first is milestone-driven development. We set ourselves a very aggressive timeline from the very beginning. This helped the entire company, across different teams, to have the same goal. We also set three milestones at different times of that year, which made the goal clear to everybody in the company. We also respected the timeline, regardless of what was going on. Sometimes we underestimated some of the projects, and sometimes projects slipped. We respected the timeline; we never changed it. We kept a long prioritization list and kept adjusting priorities, so that when something happened, we were able to move around the priorities and scope of the different projects we were targeting instead of postponing the date. That made sure the dates were always met, and everybody believed that we were going to launch ClickHouse Cloud within just one year.

Second, reliability and security are features as well. From the beginning, as you saw, even before the private preview, we invested a lot in reliability and security. That's because, throughout our years of experience building cloud services in different places, we have found it's easy to track features, but it's hard to track and prioritize security and reliability. On the other hand, as a cloud service provider, we need customers' trust in order for them to put their valuable data with us. Reliability and security are the foundation of our company's success. We started early on reliability and security, so that we built a lot of those features ahead of time, before the final crunch. We also continued to invest in reliability and security for public beta and GA, to further strengthen the offering we had for our customers. That made sure our final product was production ready for big enterprise customers as well as small customers. Finally, listening to users early and often is also key to building an accurate product that solves customer pain. As you saw in the previous slides, in both private preview and public beta we heard customer feedback and changed our priorities based on it. We had enhanced security features for public beta, and the enhanced console and Developer Edition for GA. Both times, it proved that being able to act fast and react to customer feedback really delights customers and earns more trust, so that they are willing to put more workloads onto our cloud service as a result.

Network Latency Difference

Andrade: With the SSDs there's not going to be any network latency involved. They're much speedier, and we did see a pretty big performance increase once we introduced them. They weren't part of our original design; they came a little bit later, once we saw EBS volumes weren't going to be enough for us. With the local SSDs, we're able to essentially match the performance we get with self-hosted ClickHouse.

Zhao: The difference is huge. It's a difference of a few times.

ClickHouse's Istio Adoption

Participant: I'm just curious about your adoption of Istio and whether you're using it for any other things besides load balancing and traffic management, and if you have any hurdles or stories there to share.

Andrade: We use it in only one place. Another feature we get with Istio is that we can idle our instances. What that means is that if there's a period of inactivity on a cluster, we can suspend it to free up the compute and save us some costs, and our customers some costs as well. We use Istio to intercept the calls coming in. If the underlying cluster is in an idle state, Istio will hold that connection while it triggers the steps to bring that cluster back up and then serve that connection. That's another use case we have for it, and it's the same instance of Istio.

 


Recorded at:

Jan 16, 2024
