
Polyglot Persistence Powering Microservices


Summary

Roopa Tangirala takes a look at Netflix’s common platform used to manage, maintain, and scale persistence infrastructures (including building reliable systems on unreliable infrastructure). She shares the benefits, pitfalls, and lessons learned from their polyglot persistence architecture to shape the success of various services.

Bio

Roopa Tangirala leads the Cloud Database Engineering team at Netflix, responsible for the cloud persistent runtime stores for Netflix, ensuring data availability, durability, and scalability to meet growing business needs. The team specializes in providing polyglot persistence as a service, which includes Cassandra, Elasticsearch, Dynomite, MySQL, etc.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Thank you, everyone. We all work in companies that started small, with a monolithic app built as a single unit. It generates a lot of data, so you pick a data store and, quickly, the database becomes the lifeline of the company. Since you are all doing such an amazing job, growth picks up and you need to scale the monolithic app. It starts to fail at high load and runs into scaling issues. What do you do now? You do the right thing: you break the monolith up into multiple microservices that have better fallback and can scale well horizontally. But you don't give the back-end data store the same attention; you wire all the microservices back to the originally chosen back end, and things become complicated and complex at your back-end tier. And your data team feels something like this: they are the ones who have to manage the uptime of your data store, and they are trying to support all kinds of antipatterns that the database might not be capable of handling.

And now, imagine that instead of trying to force all of your microservices into one store, you leverage the strengths and features of your back-end data tier to fit your application needs. No more worrying about squeezing your graph usage or your ad hoc search queries into Cassandra. And then your data team can be peaceful and in a state of zen.

Polyglot Persistence Powering Microservices

Hello, everyone, I'm Roopa Tangirala. I manage the cloud database engineering team at Netflix. As Randy was saying, I have been with Netflix for almost a decade now, and I have seen the company transition from the monolith in the data center to microservices and polyglot persistence in the cloud. I will cover how, at Netflix, we embraced polyglot persistence. I will go through five use cases, and for each one I will cover the use case and the reasons for choosing a particular back-end data store.

Being a central platform team providing different flavors of database as a service, I will go over the challenges my team and I face in providing database as a service across all of the microservice platforms at Netflix. I will also go over our current approach to overcoming these challenges, and end with the takeaways.

About Netflix

And briefly, about Netflix. Netflix has been leading the way for digital content since 1997. We have over 109 million subscribers in 190-plus countries, and we are a global leader in streaming. We deliver an amazing viewing experience across a wide variety of devices: smart TVs, mobile devices, name a device and Netflix is probably on it. And we bring you great original content in the form of Stranger Things and many more.

And all the interactions you have as a Netflix customer with the Netflix UI or device, all of that data, your membership information, your viewing history, your bookmarks, all the viewing information, all of the metadata needed for a title to move from script to screen and beyond, is stored in some shape or form in one of the data stores we manage.

Introducing the cloud database engineering team at Netflix: we are running on the Amazon cloud, and we support a wide variety of polyglot persistence. We have Cassandra, Dynomite, EVCache, Elasticsearch, TitanDB, MySQL, and a few other data stores. I will go over five of these data stores and, to give everyone a brief history, we will look at each of them and see their strengths.

The first is Elasticsearch. Elasticsearch can take any data set in any format and provides great search capability in near real time. EVCache was open-sourced by Netflix in 2011; it is a distributed caching solution based on memcached. Dynomite is a distributed Dynamo layer open-sourced by Netflix; it supports Redis and memcached, and it adds Cassandra-like sharding and replication to otherwise non-distributed data stores. And TitanDB is a scalable graph database optimized for storing and querying graph data sets.

We will look at the architecture, how we are deployed in the cloud, and how the data stores run in Amazon Web Services. In Amazon, we are running in three regions globally, taking all of the traffic. At any given time, based on your location, your user traffic is routed to whichever region is nearest to you; primarily we have US-West, US-East, and EU. If there's a problem in one region, our traffic team can shift the traffic in less than seven minutes to the other two regions with no, or very minimal, downtime. So all of our data stores need to be distributed and highly scalable.

And now, let's move on to the very first use case. If you are like me, a fan of Netflix who loves to binge on titles like Stranger Things, you click play. After you click play, we look at user authorization and the licensing of the content. Netflix has Open Connect appliances deployed all over the world; these are what store the movie bits, and their sole purpose is to deliver the bits of the movie as quickly and efficiently as possible to all your devices. So we have the control plane in Amazon that handles the microservices, and Open Connect, which stores and delivers the movies to you. And the service I'm talking about is the one responsible for generating the URL from which we stream the movie to you.

Requirements – CDN URL

So what are the requirements for this service? The very first one: it needs to be highly available. We don't want any user's experience to be compromised when they are trying to watch a movie, so high availability was number one. And very low read and write latencies, less than a millisecond, because this is in the path of streaming; the moment you click play, we want the movie to stream to you. That's a very important requirement. And high throughput per node: the files are pre-positioned in all of these caches, and they can keep changing based on cache health, new movies being introduced, and multiple other dimensions, so this service receives very high read and write throughput. We wanted something where the per-node throughput can be high so we can optimize.

Now, let's take a guess. What do you think we use for this particular service, any guesses? I don't see any hands. No penalty for guessing wrong. We used EVCache; it is a distributed in-memory caching solution based on memcached, and it has low latency because it is in memory. The data model for this use case was simple: a simple key-value lookup, not a very complicated model, and you can get that data from the cache. And EVCache is distributed, with multiple copies in different availability zones, so we get better fault tolerance as well.
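As a rough illustration of this key-value access pattern, here is a minimal sketch using a plain memcached client (pymemcache). This is not the EVCache client itself, and the key layout and TTL are assumptions for illustration only.

```python
# Minimal sketch of a key-value cache for CDN stream URLs.
# NOT the EVCache client; key format and TTL are hypothetical.
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def cache_stream_url(device_id, title_id, url, ttl_seconds=300):
    # Simple key-value write: the latest URL computed for this device/title.
    cache.set(f"cdnurl:{device_id}:{title_id}", url, expire=ttl_seconds)

def lookup_stream_url(device_id, title_id):
    # In-memory read in the streaming path; returns None on a cache miss.
    value = cache.get(f"cdnurl:{device_id}:{title_id}")
    return value.decode() if value else None
```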

Moving on. You have been watching a movie, and then you click play again, and this time you get this error. How many of you have seen this error? Oh, wow, that's not a good sign. Anyway, this is called a playback error. A playback error happens whenever you click a title and the title is not playable. When you click a title, there are multiple things in that title: you have the title metadata, the ratings, the genre, and the description; you have the languages it supports, the audio languages and the subtitle languages; and you have the location from where the movie needs to be streamed to you. All of this metadata is what we call the playback manifest, and it is needed for us to play the title to you successfully.

Now, there are hundreds of dimensions that can produce this playback metadata error, hundreds of dimensions on which a user's playback experience can be altered. To name a few: depending on your country, some content is licensed only in specific countries and we cannot play it to you. Depending on user preference, maybe the user wants to watch Narcos in Spanish. If you are on WiFi versus a fixed network, we might have to change the bit rate at which we are streaming the movie. And based on your device, some devices do not support 4K or HD, and some support just HD, so we have to adjust for the device type, and the title metadata can change based on location. So there are hundreds of dimensions on which your playback experience can be altered.

Requirements – Playback Error

So what are the requirements for this service? We wanted the ability to do quick incident resolution: when there's a problem happening, we want someplace where we can quickly look at the cause, which dimension is out of sync and causing your playback error. And if we have rolled out a push, we want to see whether we need to roll back or roll forward, based on the scope of the error: is the error happening in all three regions, in a specific region, or on a particular device? There are multiple dimensions across which we need to slice the data set.

And interactive dashboards: we wanted the ability to slice and dice the data set so I can see the root cause of the error. And near real-time search: when an issue is happening, I want to figure out whether a push has caused the problem or not. And ad hoc queries: because there are so many dimensions, I don't know my query patterns up front; there may be multiple ways I want to query the data set to arrive at what is causing the error. So ad hoc queries were another thing the back end needed to support.

Take a guess: what do we use for this service? Yes, Elasticsearch. It takes data in any form, it provides great search and analytics, and it gives us interactive dashboards through Kibana; Elasticsearch is used a lot for debugging use cases.

Moving on, this is the interactive exploration I was talking about. Kibana provides a great UI, and we can slice and dice the data set and see what the error is: is it in a specific region across multiple devices, on a specific device, or on a particular title that is causing the playback error? It helps to debug that. And there is support for top-N queries; I can look at the top 10 devices across Netflix, and that analysis can also be done using Elasticsearch.
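To make the slice-and-dice idea concrete, here is a hedged sketch of the kind of aggregation query you could run with the Python Elasticsearch client. The index name "playback-errors" and the field names (region, device_type, @timestamp) are assumptions, not Netflix's actual schema, and the keyword-argument style follows the 8.x client.

```python
# Sketch: count recent playback errors, bucketed by region and then device type.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="playback-errors",                               # hypothetical index name
    query={"range": {"@timestamp": {"gte": "now-15m"}}},   # near real-time window
    aggs={
        "by_region": {
            "terms": {"field": "region"},
            "aggs": {"by_device": {"terms": {"field": "device_type", "size": 10}}},
        }
    },
    size=0,  # we only care about the aggregations, not individual documents
)

for region in resp["aggregations"]["by_region"]["buckets"]:
    print(region["key"], region["doc_count"])
```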

Before Elasticsearch, the incident resolution time was more than two hours. The way this was handled was by looking at the logs to see what the cause of the error was and where there was a mismatch between the manifest and what was being played to you. With Elasticsearch, the resolution time decreased to under 10 minutes. That has been a great thing.

Viewing History

Now, moving on to the third use case, which is viewing history. You have been watching Netflix, building up the titles you watch into what we call your viewing history. Viewing history is basically the titles you have been watching over the past few days; you can resume from where you stopped, and it keeps the bookmark. If you look at the account activity, you can see the date you watched a particular title and the title name, and you can report a problem on the title; if there's a problem with the title viewing, you can always report back. And now you can see what I have been up to in the past few days: binge watching Stranger Things, basically.

Requirements – Viewing History

So what are the requirements for viewing history? We needed a data store that can store a time-series data set; you can see that this is a time series. We needed the ability to support really high writes, because viewing history receives a lot of writes as a service: a lot of people are watching Netflix, which is great, so the number of writes on the service is pretty high. We wanted cross-region replication, and the reason, as I was showing you in the first slides, is that we are in three regions at any given time; if there's a problem and we shift the traffic, we want the user's viewing history in the other two regions as well. And we needed support for large data sets; viewing history has been growing exponentially, so we wanted storage that can support very large data sets. This should be a simple guess, yeah? Which one? Cassandra, yes.

So we used Cassandra for this particular use case. Cassandra is a great NoSQL distributed data store which offers multi-datacenter, multi-directional replication, which works out great since Cassandra is doing the replication for us. It is highly available and highly scalable. It has a good failure detection mechanism; you can have multiple replicas, and even if one node is down, it does not bring down the website. You can define different consistency levels so that you never experience downtime, even though nodes will always be going down in your regions.

Data Model

The data model for viewing history was simple to start with: we have the customer ID as the row key, and each title you watch is a column in that column family. So as you watch, we write a very small payload, the latest title you watched, and when you are reading, you read the whole viewing history. Over time the viewing history grows, and Cassandra is great for wide rows, so that is not a problem; when you are reading, you paginate through the row, and it is okay.
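Here is a minimal sketch of what such a wide-row model could look like in CQL, driven from the Python cassandra-driver. The keyspace, table, and column names are assumptions for illustration; the actual Netflix schema is not shown in the talk.

```python
# Sketch of a wide-partition viewing-history table: one partition per customer,
# one row (CQL) / column (Thrift terms) per play event. Assumes a keyspace
# named "viewing" already exists.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("viewing")

session.execute("""
    CREATE TABLE IF NOT EXISTS viewing_history (
        customer_id text,
        watched_at  timestamp,
        title_id    text,
        PRIMARY KEY ((customer_id), watched_at)  -- one wide partition per customer
    ) WITH CLUSTERING ORDER BY (watched_at DESC)
""")

# Each play event is a tiny write into the customer's partition.
session.execute(
    "INSERT INTO viewing_history (customer_id, watched_at, title_id) "
    "VALUES (%s, toTimestamp(now()), %s)",
    ("customer-123", "stranger-things-s2e1"),
)

# Reading the full history pages through the (potentially very wide) partition.
rows = session.execute(
    "SELECT title_id, watched_at FROM viewing_history WHERE customer_id = %s",
    ("customer-123",),
)
```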

Very quickly, we ran into issues with this model. The reason is that viewing history is so popular, and there are a few customers with a very large viewing history, so the row becomes very wide. Even though Cassandra is great for wide rows, when you try to read all of that data into memory, it causes heap pressure and compromises latency. And the overall data set keeps growing in size.

New Data Model

So what did we do? We have a new model: you split it into two column families. One is the live viewing history, where each column is a title, so you are writing a very small payload. And then we have a roll-up column family, which is the combination of all the historical data, rolled up and compressed. So you have to do two reads: you read from the historical, compressed column family and from the live column family.

But it definitely helps with the sizing. You can compress the data, and we reduced the actual data set size drastically because half of the data is compressed. And the roll-up happens in the read path: when the user tries to read their viewing history, the service knows how many columns it has read, and if there are more columns than we think there should be, it compresses the historical data and moves it to the other column family. So the roll-up is happening all the time, driven by your reads, and that works out very nicely.
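As a rough sketch of the roll-up-on-read idea, the logic could look something like the following. The two column families are modeled as plain store objects here, and the threshold, compression scheme, and method names are illustrative assumptions, not Netflix's actual implementation.

```python
# Sketch: stitch "live" and "rolled-up" viewing history together, and roll up
# in the read path when the live row has grown too wide.
import json
import zlib

ROLLUP_THRESHOLD = 500  # assumed: max live columns before we compress history

def read_viewing_history(customer_id, live_store, rollup_store):
    live = live_store.read_columns(customer_id)          # recent, uncompressed titles
    blob = rollup_store.read_blob(customer_id)           # single compressed historical blob
    historical = json.loads(zlib.decompress(blob)) if blob else []

    # Roll-up happens during the read: archive everything but the newest titles.
    if len(live) > ROLLUP_THRESHOLD:
        to_archive, live_tail = live[:-ROLLUP_THRESHOLD], live[-ROLLUP_THRESHOLD:]
        merged = historical + to_archive
        rollup_store.write_blob(customer_id, zlib.compress(json.dumps(merged).encode()))
        live_store.delete_columns(customer_id, to_archive)
        historical, live = merged, live_tail

    return historical + live   # two reads stitched into one complete history
```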

Digital Asset Management

Now, moving on to our fourth use case, which is digital asset management. Our content platform engineering team at Netflix deals with tons of digital assets, and they needed a tool to store the assets and the connections and relationships between those assets.

What do I mean by that? You have artwork; artwork is what you see on the website, and it can come in different formats: we have JPEG, PNG, and others. A movie can have artwork, a character can have artwork, a person can have artwork; there are various components to artwork. And a movie is a combination of different things. This is a package, which is nothing but a combination of trailers and montages in this case, and you can have a video and subtitle combination as a package: here we have French as the video format, and the subtitles are French and Spanish. And then you have relationships, like a montage is a type of video. We wanted a store where we could keep all of these entities as well as the relationships between them.

Requirements - DAM

So the requirements for the digital asset management service were a back end to store the asset metadata, the relationships and the connected data sets, and the ability to search it quickly. And again, take a guess. What do you think? We used TitanDB; it is a distributed graph database, great for storing graph data sets, and a great feature is that it supports various storage back ends. Since we already support Cassandra and Elasticsearch, it was easy to integrate into our service.
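To show what "entities plus relationships" means in practice, here is a toy, in-memory illustration of the asset model described above. In the real system this lives in TitanDB and is queried with graph traversals; the entity IDs and relationship labels below are assumptions for illustration only.

```python
# Toy graph of assets and relationships (not TitanDB; just an illustration of
# the data shape: vertices with properties, labeled edges between them).
from collections import defaultdict

vertices = {}              # vertex id -> properties
edges = defaultdict(list)  # (source id, relationship label) -> [target ids]

def add_vertex(vid, **props):
    vertices[vid] = props

def add_edge(src, label, dst):
    edges[(src, label)].append(dst)

add_vertex("movie:stranger-things", kind="movie")
add_vertex("art:st-poster", kind="artwork", fmt="JPEG")
add_vertex("pkg:st-fr", kind="package", video_lang="fr", subtitle_langs=["fr", "es"])

add_edge("movie:stranger-things", "has_artwork", "art:st-poster")
add_edge("movie:stranger-things", "has_package", "pkg:st-fr")

# "Which artwork belongs to this movie?" -- the kind of connected-data question
# a graph store answers with a one-hop traversal.
for art_id in edges[("movie:stranger-things", "has_artwork")]:
    print(art_id, vertices[art_id])
```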

Distributed Delayed Queues

Now, moving on to the last use case, the distributed delayed queues. The content platform engineering team at Netflix runs a number of business processes: rolling out a new movie, content ingestion, uploading to the CDN. All of these require asynchronous orchestration across multiple microservices, and a delayed queue is a required part of this orchestration.

Requirements – Delayed Queues

What are the requirements for the queues? We want something that is distributed and highly concurrent, because multiple microservices are accessing it; at-least-once delivery semantics for the queue; and a delayed queue, because there are relationships between all of these microservices and we don't know when a queue element will be consumed, so there has to be some way to delay it. We also need priorities within the shard, so that we can pick up the element with the higher priority; having that ability to set priorities is critical. And again, let's take a guess, this is an easy one. What do you think? This is the one data store that is left.

For this particular service, we used Dynomite. It has been open-sourced by Netflix for some time now; it is a pluggable data store that supports memcached, Redis, and RocksDB. It works for this use case because Redis supports queues very well. We had tried early on to make queues work with Cassandra and failed miserably, running into all kinds of edge cases. Dynomite worked really well for us here; it provides multi-datacenter replication, and it adds the sharding and replication so you don't have to worry about data being replicated across regions or data centers.

Data Model

The data model for the distributed queue is basically three Redis structures. The first is a sorted set containing the queue elements. The second is a hash set containing the payloads, where the key is the message ID. And the third is another sorted set containing the messages consumed by the client but not yet acknowledged; that third one is the unacknowledged set.
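Here is a simplified sketch of those three structures using redis-py. The key names and the scoring scheme (deliver-at timestamp plus a small priority offset) are assumptions for illustration; Netflix's open-source dyno-queues is the real implementation, and this sketch ignores details like atomicity across consumers.

```python
# Sketch of a delayed queue on Redis: sorted set of pending messages,
# hash of payloads keyed by message ID, and a sorted set of unacknowledged messages.
import time
import uuid
import redis

r = redis.Redis()

QUEUE, PAYLOADS, UNACK = "queue:pending", "queue:payloads", "queue:unack"

def push(payload, delay_seconds=0.0, priority=0):
    msg_id = str(uuid.uuid4())
    score = time.time() + delay_seconds + priority / 1000.0  # earliest-deliverable time
    r.zadd(QUEUE, {msg_id: score})       # sorted set of queue elements
    r.hset(PAYLOADS, msg_id, payload)    # hash: message ID -> payload
    return msg_id

def poll():
    now = time.time()
    due = r.zrangebyscore(QUEUE, 0, now, start=0, num=1)
    if not due:
        return None
    msg_id = due[0].decode()
    r.zrem(QUEUE, msg_id)
    r.zadd(UNACK, {msg_id: now})         # consumed but not yet acknowledged
    return msg_id, r.hget(PAYLOADS, msg_id).decode()

def ack(msg_id):
    r.zrem(UNACK, msg_id)
    r.hdel(PAYLOADS, msg_id)
```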

So these are the five use cases I wanted to share with you, and the reasons for choosing the different data stores. Now let's look at the challenges my team and I face in offering database as a service and providing all of these different flavors of data stores. I love this quote; I don't know if my on-call team feels quite like this: "I expected times like this, but I never thought they would be so hard, so long, and so frequent."

So the first challenge is the wide variety and the scale. We have so many different flavors of data stores, and we have to manage and monitor all of these different technologies. How do I think about building a team that is capable of doing all this? What should the skill set within the team be to cater to all of these technologies? We are a small team, not a very big one, so handling that wide variety becomes a challenge.

The next challenge is predicting the future. With a combination of all of these technologies, we have thousands of clusters, tens of thousands of nodes, and petabytes of data; how do I know at what time in the future a given cluster will run out of capacity? As a central platform team, I should know the headroom for each cluster, so that if an application team comes and says they are increasing the throughput or the capacity, or adding a new feature that causes an increase in back-end operations, I can tell them whether their cluster is sized well or whether we need to scale it up.

The next challenge is maintenance and doing upgrades across all the different clusters, be it software or hardware. How do I think about doing maintenance without impacting any production services? How do I decide between building our own solution or buying something that is out there? All of those are challenges.

And the next challenge is monitoring. We have tens of thousands of instances, and all of them are sending metrics. When there's a problem, how do we decide which metrics make more sense, which metrics we should be looking at? How can I improve the signal-to-noise ratio? All of these are big challenges.

Subject Matter Expert

And now, we will look at the current approach we followed to overcome these challenges. So the very first one is having subject matter experts; so we have a few core people in the Cassandra cloud base engineering team that we call subject matter experts- two or three people, and all of these people, they provide a best-practices, and as well as work very closely with the microservice teams to understand what their requirements are, and so just a back-end data store. They are the ones who drive the features and best practices, as well as the product future and vision.

Beyond this set of core people, everybody on the team goes on-call for all of these technologies. Having a core set of people who understand what's happening and how we can fix the back end means that, instead of patching things and building automation on top just to work around what is broken, we can contribute the fix to open source or to the back-end data tier itself, so it becomes a feature and we avoid all kinds of automation that we would otherwise need to build.

The next thing is that we build intelligent systems that do the work for us. These intelligent systems are the ones that take on all the automation and do the remediation for us; they take the alerts, look at the config, see the latency requirements we have for each application, and decide for us, instead of us getting paged for each and every alert.

CDE Service

So, introducing the CDE service. This is what empowers the CDE team to provide data stores as a service. The main components of the CDE service: the first one captures the thresholds and SLAs. We have thousands of microservices; how do I know which service has which 99th-percentile latency requirement? My team and I need a way to look at the clusters and see what the requirements are and what we have promised, so I can tell whether a cluster is sized effectively or whether I need to scale it up. The second is cluster metadata: getting a global view of all the clusters, what software version they are running, what kernel version they are running, how big each cluster is, and what the cluster costs the company to manage, so the application team can understand the cost of the particular back end and the data they are trying to store, and whether it makes sense or not.

Then there is the ability to do self-service, so we are not in the way of creating clusters. The CDE service provides self-service capabilities, so application users can create clusters on their own. They don't need to understand all the nitty-gritty details of the back-end YAML; we do the cluster creation and make sure it uses the right settings and the right version, and that it has the best practices built in.

Contact information is another component. Before the CDE service, it was all in people's heads: whom to contact, which team to page for a given application. It becomes very tricky when you are managing so many clusters, so having a central place to capture this metadata is crucial. And maintenance windows: there are clusters that can have maintenance windows at night, and clusters that receive high traffic in the evening, where we will not do maintenance then. So depending on the cluster, we decide the maintenance window.

Architecture

The way the architecture is laid out, you have the data store in the center. On the left, you have the scheduler; we use Jenkins, it is cloud-based, and with the click of a button, when you have to do replacements, those are kicked off through Jenkins. Under that, you have the CDE service, which captures the metadata and is the source of all information such as SLAs, PagerDuty details, and much more. On the top, you have the monitoring system; at Netflix we use Atlas, which is open-sourced, a telemetry system that captures all of the metrics. When there's a problem and we cannot meet your 99th-percentile latency, an alert will fire. And on the very right, you see the remediation system, which is basically an execution framework that runs on containers and can execute automation for you.

Any time an alert fires from the monitoring system, say a high-disk alert where a cluster pages because it is running at 85 percent or so disk usage and we want more headroom, the monitoring system sends the alert to the remediation system, which takes into account any known patterns. If it is a Cassandra store, it will check whether there are compactions running, or whether the backups have been cleaned up. It will do a lot of remediation, it will work on the data store, and it won't even let the alert come to the CDE team. Only in situations where we have not built automation do alerts come directly to us. So it is in our team's best interest to build as much automation as possible, so we have minimal pages for the on-call.
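As a hedged sketch of that alert-to-remediation flow, the decision logic might look like the following. The specific checks (compactions running, stale backups) come from the talk, but the threshold constant, function names, and return values are illustrative assumptions, not the actual remediation system.

```python
# Sketch: handle a high-disk alert with known remediations before paging a human.
DISK_ALERT_THRESHOLD = 0.85  # assumed: the "85 percent or so" mentioned in the talk

def handle_high_disk_alert(disk_used_fraction, compactions_running,
                           stale_backup_bytes, clean_up_backups, page_on_call):
    """Decide what to do with a high-disk alert before it ever reaches on-call."""
    if disk_used_fraction < DISK_ALERT_THRESHOLD:
        return "noop: disk is back under threshold"
    if compactions_running:
        return "wait: running compactions will reclaim space"
    if stale_backup_bytes > 0:
        clean_up_backups()
        return "remediated: stale backups removed"
    page_on_call("high disk with no known remediation")  # last resort: page the CDE on-call
    return "escalated"

# Example: a remediable alert never pages anyone.
print(handle_high_disk_alert(0.90, False, 10_000_000, lambda: None, print))
```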

SLA

Here is the screen showing how we capture the SLAs. For a given Cassandra cluster, you can see that we capture the email address, the PagerDuty key, the Slack channel, and all of the latencies and SLAs: read latency, write latency, disk usage, at the coordinator level and the node level, and the averages. All of these are very important, so we need some way to capture them. And this is the cluster view, where I can look at all of my clusters and see what version they are running, which environment they are in, which region they are in, and the number of nodes. There is the customer email, as you saw in the previous slide, and also the Cassandra version, the software version we are running, the hardware version, and the average node count. In the interest of space not all columns are shown, but you can pick what you want to see and it will show you that information.
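For flavor, here is a sketch of the kind of per-cluster SLA and metadata record such a screen could be backed by. All field names and values below are illustrative assumptions, not the actual CDE schema.

```python
# Sketch of a per-cluster record: ownership, contact, and SLA thresholds
# that drive alerting. Values are made up for illustration.
from dataclasses import dataclass, field

@dataclass
class ClusterRecord:
    name: str
    datastore: str        # e.g. "cassandra", "elasticsearch"
    environment: str      # prod / test
    regions: list
    node_count: int
    software_version: str
    owner_email: str
    pagerduty_key: str
    slack_channel: str
    sla: dict = field(default_factory=dict)  # latency/disk thresholds that drive alerts

record = ClusterRecord(
    name="viewing_history",
    datastore="cassandra",
    environment="prod",
    regions=["us-east-1", "us-west-2", "eu-west-1"],
    node_count=96,
    software_version="2.1.x",
    owner_email="team@example.com",
    pagerduty_key="PDXXXX",
    slack_channel="#cde-alerts",
    sla={"read_p99_ms": 5, "write_p99_ms": 5, "max_disk_used_pct": 85},
)
```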

Another one is the cost. I can also look at my oldest nodes, so if a cluster has a very old node that we need to replace, we will just run the remediation; there's a job that scans for old nodes and runs terminations.

And this is how we create new clusters in the CDE service. I'm showing you the different data stores, and here you can create Elasticsearch clusters. As an application user, you need to know your cluster name, your email address, and the size of the data set you are planning to store in Elasticsearch.

If you give us these three things and the regions where the cluster should be created, the automation kicks off the cluster creation in the background. This makes it easy for teams to create clusters whenever they want, and since we own the infrastructure, we make sure the cluster creation uses the right version of Elasticsearch and has all of the best practices built in.

And this is how we track progress on goals. For example, if we have an upgrade running, how do we know what percentage of the clusters are upgraded across the whole fleet? When we are talking about thousands of clusters, it is tricky to manage and track the numbers. This is self-service too: all of the application teams can subscribe, and they can quickly log in to see how far along we are in the upgrades. This makes it easy for everyone to take a look.

Machine Learning

The other aspect, as I was saying, is predicting the future. How do we know when a cluster is running out of capacity? We store all the metrics from our telemetry system in S3; the telemetry system only keeps two weeks of data, since it is hard to hold very large data sets there, so we push them to S3 and then into Kibana for dashboards, from which we can do the analysis. It also has Slack channel integration.

Pattern in Disk Usage

For example, here I'm showing the pattern in disk usage. The line in green is the actual disk usage; you can see that there's a drop in the middle, because when compactions run in Cassandra, the disk usage goes down. But you can definitely see a pattern here: the disk usage is going up.

We have a system called predictive analysis, which runs our models and predicts at what time in the future your cluster will run out of capacity. Here, the blue line is the actual disk usage and the red is what the model predicted. It has a nice Slack integration, so the system runs in the background and pages, or sends a Slack notification, saying that your cluster is running out of disk, that in 90 days it will run out of capacity. Cassandra is unique in that we only want to use one-third of the disk; the next third we keep for backups, and the last third for compactions. So at any given time, we only use one-third of the disk. It is important to have monitoring in place and a system that tells us beforehand, not at the cusp of the problem when it is already happening, because that leads to all kinds of issues.
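A minimal sketch of the forecasting idea, assuming a simple linear trend over daily disk-usage samples (the real system may use more sophisticated models); the one-third usable-disk rule comes from the talk, the data below is made up.

```python
# Sketch: extrapolate a linear disk-usage trend to the usable-capacity limit.
import numpy as np

USABLE_FRACTION = 1.0 / 3.0   # Cassandra rule of thumb: 1/3 data, 1/3 backups, 1/3 compactions

def days_until_capacity(daily_used_fraction):
    """Fit a linear trend to daily disk-usage samples and extrapolate forward."""
    days = np.arange(len(daily_used_fraction))
    slope, intercept = np.polyfit(days, daily_used_fraction, 1)
    if slope <= 0:
        return None               # flat or shrinking usage: no projected exhaustion
    return (USABLE_FRACTION - intercept) / slope - days[-1]

usage = [0.18, 0.19, 0.21, 0.22, 0.24, 0.25, 0.27]   # made-up fraction of disk used per day
print(f"about {days_until_capacity(usage):.0f} days of headroom left")
```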

Since we are dealing with persistence stores, which are stateful, it is not easy to scale up. For stateless services, you can do a red-black push or scale up the cluster with auto-scaling groups and the cluster grows in size. For persistence stores, it is trickier: the data lives on the node, and it has to stream to multiple nodes. That's the reason we rely on this kind of prediction.

And how do we think about maintenance? We do a lot of proactive maintenance. Things go down in the cloud, hardware is bound to fail, so we subscribe to Amazon's maintenance notifications and terminate the nodes in advance instead of waiting for Amazon to terminate them for us. Because we are proactive, we can do the maintenance in the window we like, as well as the hardware replacements, terminations, or whatever else we want to do. That really helps us with the process.

For example, in Cassandra, we don't rely on Cassandra's bootstrap ability to bring up nodes, because that takes a lot of time; when we are talking about clusters with more than one TB of data per node, it can take hours or sometimes even days. In those cases, we have built a process that, before terminating a node, copies the data from the node that needs to be terminated to the new node, instead of relying on Cassandra's bootstrap; we let the automation kick that off. So proactive maintenance helps in those areas.

And also upgrades. Looking at all of these different persistence stores, how do we do software as well as hardware upgrades? It is a big effort, because any time you introduce a change at the back end it can have a big impact; if there is a problem, if there's a buggy version, all of your uptime can be compromised. The way we do upgrades is that we build a lot of confidence using Netflix NDBench. There was a talk yesterday about this; it is open-sourced as well. It is a benchmarking tool that can be used for any data store; it is extensible, so you can plug it in for Cassandra, Elasticsearch, or any store that you want, and it also comes with a client. You specify the number of operations you want to throw at your cluster, you specify the payload, and you can use the data model you want.

Application teams can test their own applications using NDBench. The way we do the upgrades is that we look at four or five popular use cases; we try to capture 80 percent reads/20 percent writes, 50 percent reads/50 percent writes, a few profiles like that with the more common payloads people are using in the clusters. We run the benchmark before the upgrade and capture the 99th-percentile and average latencies, and we run it again after the upgrade. Then we compare whether the upgrade has any regression and caused any problems, with the latencies spiking up. This helps us catch a lot of issues before they happen in production; we delay upgrades when we find issues in this comparison, and we have paused around three upgrades because of it. For Cassandra, when we are looking at newer versions, we will not upgrade until all of the issues have been fixed. That's how we are able to roll out all of these upgrades behind the scenes without our application teams even realizing we are upgrading their cluster.
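As a hedged sketch of that before/after comparison: run the same benchmark profile against the old and new version, then flag a regression if the 99th-percentile latency degrades beyond a tolerance. This is not NDBench itself, and the 10 percent tolerance is an assumption.

```python
# Sketch: compare p99 latency before and after an upgrade and decide whether to pause.
REGRESSION_TOLERANCE = 0.10   # assumed: allow up to 10% p99 degradation

def p99(samples_ms):
    ordered = sorted(samples_ms)
    return ordered[int(0.99 * (len(ordered) - 1))]

def upgrade_is_safe(before_ms, after_ms, tolerance=REGRESSION_TOLERANCE):
    before_p99, after_p99 = p99(before_ms), p99(after_ms)
    print(f"p99 before={before_p99:.2f}ms after={after_p99:.2f}ms")
    return after_p99 <= before_p99 * (1 + tolerance)

# e.g. latencies captured from the same 80/20 read/write profile, before and after
before = [1.1, 1.3, 1.2, 1.4, 1.2, 1.5, 1.3, 2.0]
after  = [1.2, 1.4, 1.3, 1.6, 1.4, 1.8, 1.5, 3.5]
print("roll forward" if upgrade_is_safe(before, after) else "pause upgrade")
```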

Health checks are another thing: node-level and cluster-level health checks. Node level is whether Cassandra, or the data store, is running or not, plus hardware failures and so on. Cluster level is what the nodes think about the other nodes in the cluster: you take the input from all the nodes and figure out whether the cluster is healthy or not. The traditional cron- and poll-based approach is noisy; if there's a problem getting from the poller to a node, you will see false alarms. So it has a lot of noise.

Streaming Micro-Services

So we moved from that poll-based approach to continuous streaming health checks. Instead of polling, we have a continuous stream of fine-grained snapshots being pushed to an aggregator, which aggregates the data and computes a health score. If the score is beyond a certain threshold, it reports that the cluster is not healthy. Now we will look at the cluster details and the dashboard view that we have. We have this dashboard view where we can see each cluster; the size of the tile represents the size of the cluster, and I can see at a glance whether my cluster is healthy or not.
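To make the scoring idea concrete, here is a rough sketch of turning streamed node snapshots into a single cluster health score. The snapshot fields, weights, and the 0.8 threshold are all assumptions for illustration, not the actual system.

```python
# Sketch: aggregate per-node snapshots into a cluster health score.
HEALTHY_SCORE = 0.8   # assumed threshold

def node_score(snapshot):
    # Each snapshot is a fine-grained view pushed by the node (fields are hypothetical).
    if not snapshot["process_up"]:
        return 0.0
    score = 1.0
    if not snapshot["gossip_ok"]:
        score -= 0.4
    if snapshot["disk_used"] > 0.85:
        score -= 0.3
    if snapshot["p99_ms"] > 10:
        score -= 0.3
    return max(score, 0.0)

def cluster_health(snapshots):
    score = sum(node_score(s) for s in snapshots) / len(snapshots)
    return score, ("healthy" if score >= HEALTHY_SCORE else "unhealthy")

snapshots = [
    {"process_up": True, "gossip_ok": True, "disk_used": 0.55, "p99_ms": 3.2},
    {"process_up": True, "gossip_ok": False, "disk_used": 0.90, "p99_ms": 12.0},
]
print(cluster_health(snapshots))
```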

Real Time Dash

Here, I'm seeing that cluster three is in red, so there is some problem. There is a macro view, and if I click on it, I can see all of the details: how big my cluster is, where the problem is, which node, which instance is bad. When I click on that instance, I can quickly look at the instance that is causing trouble, or having trouble. This helps the on-call engineer very easily debug where the problem is. So that is all I wanted to share.

Takeaway

The takeaway is that balance is the key to life. You cannot have all your microservices using one persistence store; at the same time, you don't want each and every microservice to use a distinct persistence store. There's a balance, and I'm hoping that with the talk today, you will find your own balance and build your own data store as a service. Thank you. Questions?

Raise your hand and I will come find you. Okay.

When you adopt a persistence store for production use, do you invest in backup and disaster recovery for that type of store?

Absolutely. For all our persistence stores, we ensure they have proper backups. Netflix is very big on failures; we assume failures will happen, so every type of recovery and backup is very important. That was your question, right, whether we invest in those tools?

You mentioned you have a separate graph database on top of Cassandra; did you look at having graph and search together?

You mean, like one product --

Yeah, Cassandra including a graph database and Solr.

So the problem with having Cassandra serve the graph, with Solr integration built in as one product, is that all of the Cassandra clusters we are talking about are in the path of streaming; most of them receive a lot of traffic, and the latency requirements are pretty stringent. Running a search process alongside Cassandra as a co-process adds a lot of performance overhead; you might get high CPU. We did initial testing and it did not work out well, so we thought this was the better way. For other search use cases, we use search separately. In this particular case it was easier: TitanDB uses Cassandra underneath, and it provides the UI. Yeah?

Excuse me, one of the things you mentioned as a problem is having a small team supporting all of this. How do you solve that? How do you hire the right people; is it one specialist per technology, or --

That is true; it is a challenge to have a small team. Netflix hires great people, so we put a lot of thought into the interview process and hire mature candidates. The way it works out is having one or two experts in each of these areas who drive the product, and that is how we make sure we are able to scale. With all of these different technologies, we are around 18 or 20 people altogether. It is a challenge, but that's the whole thing, right? You invest a lot of time in hiring the right person for the team and for the job. And at the same time, it is not that only those people work on these technologies; the great thing about the CDE team is that you have so many technologies and you can contribute to any one of them as you need. I think that also helps.

Hello, thank you for the talk. I'm Quintin from ING, and I'm wondering about using a graph database myself for a big data set. Titan is discontinued, so you are using --

Yes, we're in the process of transitioning to JanusGraph. It is still in progress; that is why I didn't put it in the slide. Yeah.

Do you have a scenario where the data has to be related between Elasticsearch and Cassandra, where you would need similar data in both? Do you have those types of scenarios?

There are use cases. Cassandra is great for high availability and as the source of truth. But Elasticsearch does not provide that strong repair, some way to ensure your data is not lost, because whenever you are doing upgrades, you may have to take downtime for Elasticsearch. So there are a few applications that use both: for the main data store they use Cassandra, and if they have any random ad hoc searches, they write to Elasticsearch and query from there, but the source of truth is always Cassandra.

Yes, they are separate. The application team writes to both; there is nothing bridging the gap between the two.

It looked like there were instances where different microservices shared a Cassandra cluster. How do you make the choice about when to introduce that coupling between different services?

Sure. For most of our microservices, it is based on what they need: if they have a requirement for very high ops and very low latency, we spin up their own dedicated cluster. For Cassandra, having a shared cluster is very complicated; you don't know which is the noisy neighbor, and having to debug that is painful. We use shared clusters for use cases where the footprint is small. Most of the microservices have their own dedicated clusters.

Hi. So, technologically, the problem is straightforward: provide the data store as a service and your microservices can have polyglot persistence. But organizational challenges are a bigger impediment. Where did you start out in providing a database as a service? You went from one data store to multiple data stores, hiring people, and so on and so forth. How did you bring about that change to be where you are today?

That's a great question, and I think a lot of it has to do with the company culture. We were using Oracle a few years back, almost -- well, I don't remember the year. But Oracle was the primary data store, with the monolith in the data center, and we migrated to the cloud, using microservices and different polyglot persistence stores. One of the things that drives that change is the culture of the company and where we want to go. We knew that we would not be able to scale in the data center and be a global leader in streaming. We wanted Amazon to run the data centers for us so we could focus on building the product itself. That's the main driving factor, I think. It takes time to build the culture, and it takes time to build the team that needs to service those things, but I definitely credit the culture of the company.

Sorry. Thank you so much, thank you Roopa. Can we give her another hand? That was awesome.

Live captioning by Lindsay @stoker_lindsay at White Coat Captioning @whitecoatcapx.


Recorded at:

Jan 14, 2018
