Transcript
Voicu: This talk is about the challenges we have at PayPal scaling database connections, and basically how we tackled them. We developed a service called Hera, which we open-sourced recently. I will be talking about Hera's features for scaling, resiliency, and manageability. Towards the end, we'll also talk about our journey in developing this in Golang.
I'm working on Hera, the database proxy. In the past, I've done various work in PayPal's backend. I'm a contributor to the Golang runtime. Prior to PayPal, I worked on various systems: distributed systems, embedded systems, even desktop applications.
Kang: I'm Kenneth [Kang], I work at PayPal. I worked on some German products for PayPal, as well as PayPal's two-factor authentication backend. Then I moved on to do some shared memory work for our C++ sharding, and then I helped migrate it to Go. Previously, I was at GlaxoSmithKline, NASA, and the Stanford Linear Accelerator.
Let's talk about how PayPal scales. PayPal has a rather large footprint - we're in about 200 markets around the world. We have a lot of transactions, and that volume continues to grow. Behind all those transactions, we have a lot of engineers and a lot of services. Those services address not only our peer-to-peer market but also our merchants, from small and medium businesses up to the largest scale. In terms of transaction processing, we have a large number of databases supporting all these services, and a lot of storage. They support a whole lot of servers in a whole lot of places. We organize them into about 300 different schemas, and some of the busiest ones process billions per hour, which requires a lot of connections and a lot of scale. Petrica [Voicu] will take us through how we handle those connections.
HERA: Microservice-First DB Proxy
Voicu: First, let's talk about why we wrote Hera in the first place. We all love microservices. This is an example of a microservice talking to a database: typically, when a request comes in, the microservice will first acquire a database connection, make a few queries, perhaps do some processing, and at the end release the connection. If you look at how much time each of those operations takes, you can see that establishing the database connection dominates. It's 100 milliseconds in this example, and it can be much higher for Oracle. For MySQL it's less than Oracle, but it's still a lot.
A common practice is to pre-establish those connections, cache them in a connection pool, and reuse them for every request. When a request comes in, the microservice acquires a connection - it leases one from the pool - performs the actual work, and releases the connection at the end. This is a common practice; for Java, for example, there are many third-party solutions available, like c3p0, UCP, and DBCP. At PayPal, we also have our own connection pool, which we'll open-source soon - keep an eye out for that.
For other languages like Golang - Go actually comes with a connection pool built into its standard library. This is common practice. What may not be so obvious is that this scheme also helps a lot with the database load, because connecting to the database is not only slow, it's also very resource-intensive. For Oracle, for example, it creates a shadow process, it touches a bunch of catalog tables, and it may even send some resources back to the client. It's quite a large load that the database takes on instead of doing the actual work of servicing your queries.
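As a point of reference, here is a minimal sketch of Go's standard-library pool in use; the driver and DSN are placeholders, not PayPal's setup:

```go
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/go-sql-driver/mysql" // any driver works; MySQL used for illustration
)

func main() {
	// sql.Open returns a *sql.DB, which is itself a connection pool.
	db, err := sql.Open("mysql", "user:pass@tcp(dbhost:3306)/mydb") // placeholder DSN
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The standard library's pool-sizing knobs:
	db.SetMaxOpenConns(32)                 // hard cap on connections to the database
	db.SetMaxIdleConns(16)                 // connections kept warm between requests
	db.SetConnMaxLifetime(5 * time.Minute) // recycle connections periodically

	// Each query transparently leases a connection from the pool
	// and releases it when the result has been consumed.
	var now string
	if err := db.QueryRow("SELECT NOW()").Scan(&now); err != nil {
		log.Fatal(err)
	}
	log.Println("db time:", now)
}
```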
Let's see some of the issues with the connection pool in the client. Typically, you have to configure your connection pool size, and you configure it large enough that you always have enough connections to do your actual work. As you scale up, you keep adding microservices, and eventually you're going to hit the connection limit. All databases have a connection limit - some have a hard limit, some only a soft limit - but they all have one, because the more connections you have and the busier the database gets, the more contention there is, and database performance will degrade.
You're going to look for solutions. You may realize that your connection pool size is too large - this is the easiest fix: configure fewer connections per microservice. Then you may improve your code: maybe you hold on to connections for too long, maybe some of the queries are not efficient. You may even buy better database hardware. All of this will help.
Let's see the main problem with the connection pool in the client: the pool is partitioned. You allocate to each microservice a subset of your connections. In real life, the balance is not always perfect - some microservices may need more connections than others. If one microservice gets a surge of requests and all of its connections are busy, the requests that can't get a database connection have to wait. There will be latency, maybe timeouts, which is unfortunate because the database has capacity: there are connections available, but they are allocated to other microservices.
The solution for this is scaling with a shared pool. This may be counter-intuitive - you add an extra hop - but it is well worth it, because database connections are your most important and most critical resource. In this setup, microservices get database connections on demand: if one needs more, it takes more connections, and if another needs fewer, it holds fewer.
Let's see if we can improve even further. If we look again at a typical request, we see that the database connection is allocated to the session for the whole service request, but most of the time it's idle. It would be nice if, while the database connection is idle, it could serve other requests. The solution for this is called multiplexing. With multiplexing, all the microservices think they have a database connection to themselves, but they actually get a physical database connection only when it's needed. This is similar to a phone switch, or a PBX in an office building: you have thousands of phones that are mostly idle, but only, say, 100 physical outgoing lines to the phone company, to manage cost. When somebody picks up a phone in the office, the switch makes the physical connection to an outgoing line.
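To make the idea concrete, here is a minimal sketch of request-level multiplexing, assuming an abstract PhysicalConn type; Hera's real implementation is far more involved:

```go
package mux

// PhysicalConn stands in for one established database connection.
type PhysicalConn interface {
	Exec(query string, args ...interface{}) error
}

// Multiplexer hands out physical connections only for the duration of a
// request, so a few connections can serve many mostly-idle sessions.
type Multiplexer struct {
	free chan PhysicalConn
}

func New(conns []PhysicalConn) *Multiplexer {
	free := make(chan PhysicalConn, len(conns))
	for _, c := range conns {
		free <- c
	}
	return &Multiplexer{free: free}
}

// Exec blocks until a physical connection is free, uses it for one request,
// and returns it immediately - like the phone switch picking an outgoing line.
func (m *Multiplexer) Exec(query string, args ...interface{}) error {
	conn := <-m.free
	defer func() { m.free <- conn }()
	return conn.Exec(query, args...)
}
```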
That's how we decided to write Hera: an out-of-process shared connection pool and multiplexer. We wrote it many years back, and it evolved over the years based on PayPal's needs. Most importantly, at PayPal all the payments must succeed. We work with people's money, so correctness is our most important principle, and of course resiliency, availability, and performance are very important. As we said earlier, we deploy this on hundreds of databases at PayPal, in different configurations, so it's important to have minimal configuration - we default most of the parameters to make it easier to deploy. We have a multi-stack environment at PayPal, with five primary languages: C++, Java, Python, Node.js, and Go. We don't want to duplicate the logic across all those stacks, so our client is thin and most of the logic is in the server. We have our own wire protocol, which we haven't changed over the years - it has very low overhead, and the ROI to replace it was never justified. Now that we've talked about why we wrote Hera, Kenneth [Kang] will talk about its other features.
Other Features
Kang: First, we'll start off with some features for scaling. We covered multiplexing already; next, let's cover read-write split before we get into a pretty big chunk on sharding. Read-write split mainly addresses the Oracle RAC cluster. A RAC cluster is a synchronous cluster: all transactions stay up-to-date, so you can query any of the nodes and get whatever Oracle has processed, and Oracle does the hard work of keeping it synced. However, this means all the nodes are exchanging their lock information and their updated blocks, so it gets very chatty, and you can start to run out of interconnect bandwidth, whatever interconnect you're using. Hera offers a read-write split capability: we preferentially target the writes to one particular node. The interconnect traffic then becomes much more one-way, fanning out from that node, which decreases the interconnect traffic. Read queries can still hit any node, so you still get some scale.
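A toy version of that routing decision might look like the following sketch; real routing also has to pin whole transactions to the write node, which this ignores, and the node names are invented:

```go
package rwsplit

import (
	"math/rand"
	"strings"
)

// Node stands in for a connection pool to one RAC node.
type Node struct{ Name string }

var (
	writeNode = &Node{Name: "rac1"} // all writes preferentially target one node
	readNodes = []*Node{{Name: "rac1"}, {Name: "rac2"}, {Name: "rac3"}}
)

// route sends DML to the designated write node and spreads reads across the
// cluster, cutting down cross-node interconnect chatter.
func route(query string) *Node {
	q := strings.ToUpper(strings.TrimSpace(query))
	if strings.HasPrefix(q, "SELECT") {
		return readNodes[rand.Intn(len(readNodes))]
	}
	return writeNode
}
```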
Next, we'll cover sharding, in a few parts. We'll cover doing it at the client versus the server; we'll run through a small example of finding the shard; we'll go through a query; we'll convert a legacy app, which PayPal has a lot of; and then we'll see how we operationally move a scuttle bin's worth of data over.
Client versus server: PayPal started off as a C++ shop. We could have embedded all the logic for finding the data - finding which shard to go to - in a C++ library at the client. We've tried this, I think twice, and said, "OK, it works." But soon we added Java, which meant we'd have to replicate all that code, all that logic, all that control in Java. Then we added our current language, Go, and that would have replicated it a third time. Soon we'll add Python, and then Node. That means five languages, five libraries, all of which have to have the same sharding logic - hopefully with no bugs - and when you change that logic, you have to change all of them and then monitor everyone's rollout before you can use it. So we did it in the server: we kept the client thin and moved the logic into the server. Now at least the bugs are consistent, and the logic lives in one place, in the server workers.
Let's go on to see how this sharding logic works. We shard on a key value: we take a key and its value, pass it through, and figure out which database we're supposed to go to. Here, we'll take an account ID and say its value is 2000. First, we hash it into one of 1,024 scuttle bins - in this case, we get scuttle bin 280. Then we take a database-side map that maps the scuttle bins to actual shards; when we look scuttle bin 280 up in our shard map, we find it goes to shard one. Then, to get to a physical shard - we don't always have to map every logical shard to its own physical database; they can share - we use the actual configuration, whether it's Oracle or MySQL, to reach the actual physical database.
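A minimal sketch of that two-level lookup follows. The hash function and the map contents are placeholders; Hera's actual hash and shard-map storage may differ:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

const numScuttleBins = 1024

// scuttleBin hashes a shard key value into one of 1,024 scuttle bins.
func scuttleBin(key string) uint32 {
	h := fnv.New32a() // placeholder hash; not necessarily what Hera uses
	h.Write([]byte(key))
	return h.Sum32() % numScuttleBins
}

func main() {
	// scuttleToShard maps each bin to a logical shard. In production this
	// map lives on the database side and is reloaded periodically.
	scuttleToShard := make(map[uint32]int)
	for bin := uint32(0); bin < numScuttleBins; bin++ {
		scuttleToShard[bin] = int(bin) % 4 // placeholder: 4 logical shards
	}

	accountID := "2000"
	bin := scuttleBin(accountID)
	shard := scuttleToShard[bin]
	// The logical shard then resolves to a physical database via
	// configuration (Oracle TNS entry or MySQL host), not shown here.
	fmt.Printf("account %s -> scuttle bin %d -> shard %d\n", accountID, bin, shard)
}
```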
Let's look at a query next. We might start off with a basic query that does a basic join, but here we need the shard key - the account number - to actually find the data. We need to modify the application a little so that it sends the shard key and its value over. The application uses positional binds, but for Hera to pick the shard key out easily - we don't know if it's in position one or position two - our Hera driver rewrites those positional binds to named binds. Now the Hera server can pick out both the shard key and its value, so it can actually find the shard.
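Here is a toy illustration of that rewrite; the real logic lives inside the Hera driver and handles quoting and edge cases, and the bind names here are invented:

```go
package main

import (
	"fmt"
	"strings"
)

// rewritePositional replaces positional binds ("?") with named binds, in
// order, given the bind names the client declared - so the server can spot
// which bind carries the shard key.
func rewritePositional(query string, names []string) string {
	for _, name := range names {
		query = strings.Replace(query, "?", ":"+name, 1)
	}
	return query
}

func main() {
	q := "SELECT t.amount FROM txns t, accts a WHERE a.account_id = ? AND t.acct_ref = ?"
	fmt.Println(rewritePositional(q, []string{"account_id", "acct_ref"}))
	// ... WHERE a.account_id = :account_id AND t.acct_ref = :acct_ref
}
```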
We just converted a legacy query. Let's see what we have to do to convert a legacy application with some confidence, because legacy applications can be large, sometimes they don't bind the shard key, and we need to migrate them smoothly without causing too much trouble or taking too much risk. First of all, we configure the server to send everything to shard zero. This makes it reasonably easy. Next, we allow any query that failed to convert to just go through to shard zero. At least the data will be there, it will be safe, and the query can execute. We log it, so you can find that log and say, "We forgot this query; we can fix it."
After you've done all those fixes, the next step is to take exactly one shard key value - only that value goes through the whitelist and is directed to shard one. Now your risk is limited: if that somehow screws up for some reason, only that one particular value ends up in a corrupt state, and hopefully it's easy to fix. You can repeat this and add more values - maybe 10, 100, maybe 1,000 - so you can ramp up and gain more confidence. It's also redirected to the same database, which keeps data migration out of the picture: you don't have to worry about whether you copied all the data or forgot a table, and it's easy to roll back and forth.
The next thing is to try moving it to a physical shard. Just in case you screwed up a permission or missed a table, using a whitelist keeps your impact small. Once you're done with the whitelist and you have more confidence, you can try moving a full scuttle bin. That's 1 out of 1,024 - 0.1% - but at PayPal scale, this is pretty big. We move it, we tell it to use shard one; it's logical, so if something screws up, you flip it back and you're back to normal pretty quickly.
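The staged routing described above could be sketched like this, assuming a whitelist keyed by shard-key value; the shape of the config is illustrative, not Hera's:

```go
package sharding

// routeShard implements the staged migration: values on the whitelist go to
// their assigned shard, everything else stays on shard zero for now.
func routeShard(shardKeyValue string, whitelist map[string]int) int {
	if shard, ok := whitelist[shardKeyValue]; ok {
		return shard
	}
	return 0
}
```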
Next, we use the production operational sequence to actually start moving shards, because all the checks are done and we're ready to start the actual data copy. You start the data copy first for that particular scuttle bin; then you block the writes, but only after most of the data copy is done. This is so you can run all the checks: has the data fully copied over? You want to make sure you don't miss anything, because missing a transaction is bad. Once it's all copied over, you update the scuttle map, the bin moves over to the new shard, and operationally, you've finished sharding.
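Written out as code, the cut-over sequence might look like the sketch below; every function is a stand-in for operational tooling, and only the ordering of the steps reflects the talk:

```go
package sharding

import "errors"

type Shard string

func copyData(bin int, from, to Shard) error   { return nil } // bulk copy while writes still flow
func blockWrites(bin int)                      {}
func unblockWrites(bin int)                    {}
func verifyCopied(bin int, from, to Shard) bool { return true } // did every transaction make it?
func updateScuttleMap(bin int, to Shard)       {}

// migrateScuttle cuts one scuttle bin over to a new shard.
func migrateScuttle(bin int, from, to Shard) error {
	if err := copyData(bin, from, to); err != nil {
		return err
	}
	blockWrites(bin) // brief write block once the copy is nearly done
	defer unblockWrites(bin)
	if !verifyCopied(bin, from, to) {
		return errors.New("copy incomplete; aborting cut-over")
	}
	updateScuttleMap(bin, to) // flip the map; traffic moves to the new shard
	return nil
}
```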
When PayPal does this, we have done up to 400 or 500 scuttles at once - I think in about 40 seconds. We don't do it too often; we did it last year once or twice, I think, so we haven't needed to rebalance, but all that capability is there so that we can move data very quickly, and the DBAs can do it on their own. For sharding, the key things are making sure we bind the shard key value, managing the risk as we migrate legacy code, and being able to move data back and forth in a controlled manner.
The next thing we'll cover is resiliency. This is important because sometimes things get overloaded, and in that case we want to limit the impact, start any recovery action automatically, and move back to a normal state automatically once things have recovered. We have a few ways of doing these recovery and resiliency operations: we can bounce a connection, we can use a queue to absorb surges, we can evict slow queries, and we have read replica management that works transparently and lets us scale more on the read side.
First one: we're overloaded, meaning all the connections are busy and we aren't getting through all the work being requested. When another service connects to the Hera server while we're overloaded, we accept the connection at the TCP level and then close it immediately - we don't even finish the SSL handshake. This helps that service fail fast: it can start recovery, it can attempt another connection, it has control again instead of waiting for a timeout.
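A minimal sketch of that fast bounce, assuming an isOverloaded() load check (the port and handler are placeholders):

```go
package main

import (
	"crypto/tls"
	"log"
	"net"
)

func isOverloaded() bool  { return false } // stand-in for the server's load check
func handle(c net.Conn)   { c.Close() }    // stand-in for the real session handler

func main() {
	ln, err := net.Listen("tcp", ":10101") // illustrative port
	if err != nil {
		log.Fatal(err)
	}
	tlsConfig := &tls.Config{ /* certificates elided */ }
	for {
		conn, err := ln.Accept()
		if err != nil {
			continue
		}
		if isOverloaded() {
			conn.Close() // accept at the TCP level, close before the TLS
			continue     // handshake: the client fails fast, no timeout wait
		}
		go handle(tls.Server(conn, tlsConfig))
	}
}
```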
The next case: a client is already connected and sends more traffic down that connection. It thinks it has a database connection, but we know we're overloaded. What do we do? It waits a little bit - it gets queued. Typically, other queries finish in a millisecond, so this request just had to wait an extra millisecond before getting its answer, which isn't too bad. Most services don't mind a little delay; they prefer it to a hard error. However, if it waits too long, that's not good. If it's been waiting a full second, we'll say, "No, I'm sorry. We couldn't get to your query, we are overloaded. Try to do whatever you can to recover or deal with the failure."
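The bounded wait maps naturally onto a Go select; here is a sketch, with the worker type and one-second cutoff taken from the description above:

```go
package queue

import (
	"errors"
	"time"
)

type worker struct{} // stand-in for a free backend database connection

var freeWorkers = make(chan *worker, 16)

// dispatch waits up to one second for a free connection. Most requests are
// served within a millisecond or so; one that waits the full second gets a
// hard error instead of hanging.
func dispatch(run func(*worker)) error {
	select {
	case w := <-freeWorkers:
		defer func() { freeWorkers <- w }()
		run(w)
		return nil
	case <-time.After(time.Second):
		return errors.New("server overloaded; request rejected")
	}
}
```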
Another thing about being overloaded: typically we're overloaded because we have slow queries. If you evict those slow queries, the hundreds of millisecond-scale queries can go through. It can really help keep your site up and working. It's not perfect, but once you're overloaded, you want to start mitigating the failure and reducing the impact rather than taking more impact - whatever steps keep most of your business working. The way we monitor this overload is that Hera writes to a state log, very regularly recording how many database connections are free. In this chart, we start off with a database that has plenty of free connections, then it drops to zero - maybe the database crashed, maybe it went down for maintenance; for some reason every connection was in use. Then over time it comes back, the connections free up, and it's back to normal.
When we first turned on this eviction of slow queries, this is what actually happened, a few years back. You can see a number of Hera nodes doing fine. Then at some point something happens, slow queries start appearing, and we start evicting queries - very slowly, at the bottom of the chart, just fluttering low - because we're taking impact and making just enough room, just a couple of connections, to keep a lot of queries flowing through. This was good enough that we didn't notice it happening. We went back and asked, "Did this feature kick in?" It did, and we hadn't noticed. We're better at spotting it now, but those first couple of times it was, "Interesting. We didn't know."
Another thing we do is read replicas. PayPal has a lot of reads, so we created read replicas: we take the data from our system of record - our active transactional database - and put it on replicas. Applications use those read replicas when they know they can tolerate some data latency, absorbing the few seconds it takes to replicate data from the write database to the read databases. We also made two of those read databases, because they can fail - sometimes we don't give them the same hardware; they're lighter weight.
The application then needs to know how to fail over from one to the other. Traditionally, it does this by waiting until a connection in the first pool bounces because that replica is overloaded, then going to the second one. This is not ideal. First, you have to keep that first replica saturated, which is not great. Second, when you switch from a highly connected database to a cold one where nothing is ready yet, you take a lot of impact just getting connections up on the second one.
We thought we could do better, so we built it directly into Hera. Hera now holds sets of connections to both databases. It can not only detect that the first one is down and move everything to the second, it can also detect when a single query is just a little slower, and use that slowness as a signal to switch to the second one. The application sees the failover transparently - yes, a query takes a little longer, since we waited for the first replica to prove slow - but the application can just go ahead and query. It doesn't need failover logic inside it, which makes it simpler and, I guess, happier. Next, Petrica [Voicu] will cover manageability.
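A sketch of that transparent failover, assuming two warm replica pools and an illustrative slowness cutoff (the names and threshold are not Hera's):

```go
package replicas

import (
	"context"
	"database/sql"
	"time"
)

var (
	replica1, replica2 *sql.DB // pools to both read replicas, initialized elsewhere
	slowThreshold      = 200 * time.Millisecond // illustrative cutoff
)

// readQuery tries the first replica with a deadline; if it is down or merely
// slow, it retries on the second, so the application never sees the failover.
func readQuery(ctx context.Context, q string, args ...interface{}) (*sql.Rows, error) {
	c1, cancel := context.WithTimeout(ctx, slowThreshold)
	defer cancel()
	rows, err := replica1.QueryContext(c1, q, args...)
	if err == nil {
		return rows, nil
	}
	return replica2.QueryContext(ctx, q, args...)
}
```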
Manageability
Voicu: We've talked a lot about the features that serve microservices' needs, but we also have needs coming from the operational side. We want to perform maintenance or updates on the Hera nodes, and the Oracle DBAs need to apply Oracle patches once a year or whenever needed. What do you do to support this maintenance without customer impact? First, because connections to Hera are cheap - pretty much just a TCP socket on the server - we actually recommend that clients recycle their connections often. This is probably not very intuitive, but it helps a lot, and I'll describe how.
Let's say you want to bring one of the nodes down to perform an update. You mark it in the load balancer so that it doesn't take any more connections. One minute later, when all the client connections have been recycled, they've all been re-established to whatever nodes are still configured - in this case, the first two nodes. The third node is out; it's very fast, one minute, and it can be taken out. If we want to put the node back, we mark it in the load balancer, and one minute later all the connections have recycled and are evenly spread across all the available nodes. Everything is done very quickly. In a similar fashion, we can even do a database cut-over, fast and easy, without the clients even knowing.
Another feature helps with Oracle RAC cluster maintenance. If the DBAs want to do maintenance on one of the nodes - say, node number three - they insert a row into a table that Hera reads every few seconds. When the scheduled time comes, Hera disconnects from that particular node so the DBAs can perform the maintenance, and when it's done, Hera reconnects everything. It all happens without the clients knowing what's going on in the server, and without downtime. This pretty much concludes our discussion of features.
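The polling side of this might look like the sketch below; the table and column names are invented for illustration, and the drain callback stands in for Hera's actual connection management:

```go
package maint

import (
	"database/sql"
	"time"
)

// pollMaintenance re-reads a maintenance table every few seconds. When a DBA
// inserts a row for a RAC node, drain is called so connections to that node
// can be closed at the scheduled time and re-established afterwards.
func pollMaintenance(db *sql.DB, drain func(node int, at time.Time)) {
	for range time.Tick(5 * time.Second) {
		rows, err := db.Query("SELECT node_id, start_time FROM rac_maint") // illustrative schema
		if err != nil {
			continue
		}
		for rows.Next() {
			var node int
			var at time.Time
			if rows.Scan(&node, &at) == nil {
				drain(node, at)
			}
		}
		rows.Close()
	}
}
```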
Transition to Go
I want to talk a bit about our journey to Go and how we wrote the latest version of Hera. This is not a greenfield implementation; we already had a previous Hera service written in C++. It was based on a framework similar to libevent, Netty, or Node.js, if you're familiar with asynchronous programming. It was great - scalable and efficient. The only problem was that over time the software got very complex, and it became very costly to enhance it and add new features. At some point, we wanted to improve the IO framework we used, and eventually we considered Go. Go would help with the complexity, because in Go you can write blocking code, so the complexity goes down. At the same time, Go scales efficiently out of the box: the runtime takes care of scheduling what we call light threads - goroutines - over a few operating system threads.
However, as I said earlier, because we were replacing an existing service, we weren't sure: would it meet our requirements? Would it be better than the previous version, or worse - and if worse, how much worse? We could pay maybe 5%, but not more. The main concern was whether the garbage collection stop-the-world pauses would be a problem for us. We had thought about moving to Go even earlier, maybe five years back, but at the time, the stop-the-world garbage collection pause was too long. Two years back it was under five milliseconds, so we thought it was going to work for us.
We decided to write it in Go. However, almost all of us were C++ engineers with no real experience in Go, so we took a gradual approach to mitigate the risk, in case it didn't work out and we had to throw away the work. First, we did a proof of concept, but we tackled the hard problems first: we implemented some of the hardest features we had, and we wanted to see two things - performance, and complexity, meaning how the code looked, old versus new. We liked it, and we saw similar performance. Our first release to the live site went to one of the databases that didn't need all the features - for example, we didn't implement sharding. We went live, and it was fine.
Then we moved to the next step: we implemented all the features the previous version had. However, we still used some C++ code - Go interoperates very well with C++, so we didn't have to implement everything in Go at once. Then we implemented everything in Go, and it was successful. It has been running in production already for about one year. The latency requirements were met: what we measured was specific to the client - we measured P99 as well as average latency versus the previous C++ version, and it was comparable. For CPU, we measured consumption on the Hera nodes, and the new version was surprisingly better than the old one, mainly because of the IO work: the Golang runtime does a much better job than our previous framework. We ran all of this without any client changes, because the wire protocol was kept the same, so it remained compatible. Our internal customers didn't even know.
We had some issues and some learnings, and I want to share one of the lessons we learned. Do you want to see some Golang code? Here's an artificial example, of course, where the main program spawns 10,000 goroutines, which, if you're familiar with Go, are like light threads. What's supposed to happen is that the Go runtime schedules all these thousands of light threads over a very few operating system threads. I ran this test in my VM with two cores, expecting very few operating system threads. When I ran it and checked from my shell, I saw 6 threads, which is about right: besides the threads for the two CPUs, Go internally has some threads to handle IO and garbage collection in the background.
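The slide code isn't in the transcript, so here is a reconstruction of the experiment as described - 10,000 goroutines sleeping via time.Sleep; the runtime/pprof thread count is one way to see, from inside the program, what the shell's thread count showed:

```go
package main

import (
	"fmt"
	"runtime/pprof"
	"sync"
	"time"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 10000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			time.Sleep(10 * time.Second) // parked by the Go runtime; no OS thread held
		}()
	}
	time.Sleep(time.Second) // give the goroutines time to start
	// All 10,000 goroutines are multiplexed over a handful of OS threads.
	fmt.Println("OS threads created:", pprof.Lookup("threadcreate").Count())
	wg.Wait()
}
```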
Then I made one change: my light thread sleeps for the same 10 seconds, but via a system call. I expected the same behavior, right? Actually, no. When I ran this, the Go runtime created one operating system thread per goroutine. This is not good - it's the opposite of what's supposed to happen in Go. If we dig into the documentation, we find the explanation: when a goroutine blocks in a system call, the runtime locks it to its operating system thread, and if there are other goroutines ready to run, it spawns new threads for them.
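A reconstruction of the changed version, assuming the system call in question was a raw sleep (Linux-specific; the talk doesn't name the exact call). Note that at this scale the program can even hit the Go runtime's default 10,000-thread limit, which matches the degradation and crash mentioned next:

```go
package main

import (
	"sync"
	"syscall"
)

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 10000; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// A raw blocking syscall pins the goroutine to its OS thread for
			// the whole 10 seconds, so the runtime spawns new threads for the
			// remaining runnable goroutines - roughly one per goroutine.
			ts := syscall.Timespec{Sec: 10}
			syscall.Nanosleep(&ts, nil)
		}()
	}
	wg.Wait()
}
```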
There's a key takeaway here. Pay attention to system calls - an os.File read, for example, will lock your goroutine to its OS thread, and if you call into C++ code, pay attention to the system calls it may make. On the other hand, Go's locks and socket reads and so on, although they eventually involve system calls, are optimized by the Go runtime - there is not a blocking system call every time you do a socket read - so those are fine. It's also worth watching how many OS threads you have even before you exceed any limit. For us it didn't show up at first in our real workload, but once the number of threads exceeded a certain limit, we saw degradation and eventually a crash.
We expected the community to face problems similar to the ones we had at PayPal with database connections, so we decided to open-source Hera, and you can find it on GitHub. We have two clients, a JDBC client and a Golang client. If you want to try it, you just swap your database driver for the Hera SQL driver - basically no client change. We have a Google group for support if you have any questions. I want to thank my colleagues who contributed to the project. Please take a look at it on GitHub - whether you're running out of connections or you just want to lower the number of connections you're using, please give it a try.
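That driver swap would look something like the following; the import path and driver name here are illustrative, not confirmed - check the GitHub repo for the actual ones:

```go
package main

import (
	"database/sql"
	"log"

	// Illustrative import path for the Hera Go driver; see the repo.
	_ "github.com/paypal/hera/client/gosqldriver/tcp"
)

func main() {
	// Only the driver registration and DSN change; every other database/sql
	// call stays exactly as it was with the old driver.
	db, err := sql.Open("hera", "127.0.0.1:10101") // illustrative name/address
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()
}
```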
We open-sourced it just two or three weeks ago, I believe, but this is not a beta. This is real, solid code, running in production for about a year. We plan to invest more, and we're going to run the same version internally as externally. We do have some things in-house that we couldn't open-source yet - for example, some additional tests, where we need to clean up our dependencies first. We have some more clients internally, for C++ and Python, which we plan to open-source; it's not a big deal. We also have some tools around it for scaling and capacity, and we plan to open-source those too.
Questions and Answers
Participant 1: You had some mention of Oracle with the split read-write. Is that tied specifically to Oracle, or does it port across any particular database vendor?
Kang: We specifically use Oracle in that case because we want synchronous replication. When we route the queries, our clients typically expect synchronous behavior: if you've updated something and the next operation is a read, it should see the update. We've thought about running it with asynchronous replication, where there might be some replication lag, but we don't think that would satisfy most of our clients. For a lot of them, you update your address, and the next call wants to see the new address properly in the system; if you have any amount of replication lag and it flutters just a little, that's not going to look right. That's why we use Oracle's synchronous replication.
Participant 2: We're doing something similar, and we're running into problems migrating data from one shard to the other. I was wondering, how long do your migrations take for a specific user, or whatever you're sharding on? How long do you hold the write lock?
Voicu: For one of the bigger databases, we used that particular internal design - scuttles - and moved entire scuttles; that took basically 40 seconds.
Participant 3: I have two questions. One is, you talked about the protocol that the clients use to communicate with Hera. Can you talk a little bit about that? Then my second question is about sharding. If there are two databases with independent sharding, any way of querying data between them?
Voicu: The internal protocol is very simple; it has been used internally for pretty much 20 years. You have a length, a command, and a payload, and typical commands like prepare map very well to JDBC. That's why it was easy to write the JDBC driver and the Golang driver. Our Golang client is about 1,000 lines, that's it. The JDBC driver is a bit bigger, but the actual client is 1,000 lines; the rest is the roughly 300 JDBC methods. The internal protocol is very simple. We're thinking about maybe adding the MySQL protocol or something in the future, but we don't know yet.
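As a sketch of what length-command-payload framing looks like in general - the field widths here are invented for illustration; Hera's actual wire format may differ:

```go
package wire

import (
	"encoding/binary"
	"io"
)

// writeFrame sends one length-command-payload frame.
func writeFrame(w io.Writer, cmd byte, payload []byte) error {
	var hdr [5]byte
	binary.BigEndian.PutUint32(hdr[:4], uint32(1+len(payload))) // length covers cmd+payload
	hdr[4] = cmd
	if _, err := w.Write(hdr[:]); err != nil {
		return err
	}
	_, err := w.Write(payload)
	return err
}
```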
About sharding: we have some limitations - we don't support cross-shard transactions. Some of our customers wrote their applications without sharding in mind, and now they want to shard. We have some logging to help them migrate. For example, they think they bind shard keys on all their queries; if they don't, we alert them. They can run in a test mode first, and it will alert: this query has no shard key, or this looks like a cross-key or cross-shard transaction. It helps, and it works for us. We don't believe in two-phase commit and cross-shard transactions.
Participant 4: You said when a query goes through Hera and can't be resolved as to which shard it should go to, that was how I understood it, that it went to shard zero. Assuming I understood that correctly, does that mean that shard zero has a complete copy of all the data so it can resolve the query, or did I miss some important step in the process?
Kang: That's as we're migrating a legacy application. Legacy applications have lots of queries, and maybe they missed one, so for a legacy application taking its first steps in migrating, we default all queries that neglect to bind a shard key and send them to shard zero. However, once we enter production and all the errors have been fixed, we turn off that behavior. When you actually have distributed data and a query comes in without a shard key bound, it results in an error, because sending it to the wrong shard is probably worse than an error - cleaning up bad data is really hard.
Participant 5: I see you have multiple microservices talking to one database. When my team builds microservices, one principle we have is that each service has to be independent and has its own database. I wonder why you chose to share the database across multiple microservices. Is there a reason for that, or is it just a legacy problem?
Voicu: We have different pools. Ideally, yes, every domain would have its own database. In practice, they may share some. Internally, of course, we're always improving on this.
Participant 6: I have two questions. One, when you're sharding the databases, are you maintaining multiple copies or backups in different regions or locations? Second, are you using any continuous integration and deployment in this entire process, to make it faster for the entire feature set to be developed and deployed?
Kang: There are multiple methods of getting some sort of failover capability, including backups. We do backups, I guess, at the TNS names or failover level. We also do a kind of backup where the current method is to have a second shard that has no scuttle bins in it but has data replicating to it, and to switch that over in real time, within a four-second window for the shard map update. That's another backup strategy. With regard to continuous integration, we do have some continuous integration to run basic tests on the current Hera software - that covers Hera itself and its functionality. However, some of what we face is actually more operational deployment, so we're working on the next level of testing - containerization, auto-scaling - to make sure that functionality is also continuously tested and integrated.
Participant 7: Before you had Hera, what kind of performance improvement did you see, before and after?
Voicu: Once you get multiplexing, for some of the cases, it was like 10 times. Honestly, we did it because we were running out of connections. We were expecting that, because at PayPal we grow 25% in volume every year, and complexity grows too; we grow tremendously every year, and we use a lot of connections. That's why we did it. The nice surprise was the huge improvement in database load - it went down, mainly because there was less contention. Doing the same work with fewer connections means less contention: huge improvements. If you are read-heavy, you see more improvement; if you have more writes, the improvement is smaller, but you should get improvements either way.
Participant 8: Two questions. How many different database backends do you support? The second question is related to slow query eviction: do you actually kill the query on the database side?
Voicu: Our main footprint is Oracle, but we also support MySQL. It's pretty easy to add any other backend. Using a standard driver, we can easily support Postgres or any other database.
Kang: With regard to slow query eviction, yes, we actually kill the query. The reason we kill it is that we've detected we're in an overloaded state anyway - something is taking impact somewhere. We're already rejecting incoming connections, and most likely any incoming query that doesn't have a connection yet is close to timing out. At that point, if that's the only query being slow and we can reject it and restore more normal operation, it takes pressure off the database and helps it recover. We were skeptical when we developed that feature, then we saw the actual incident graph showing it was active without our noticing, and said, "That does actually provide some relief for the DBAs as well as the database." We still debate what parameters to use - how fast the eviction is, how many queries should be evicted. Fortunately, we haven't had enough incidents to actually tune those parameters, and I like it that way. Evicted queries also go on a blacklist for 20 seconds.
Participant 9: This is a follow-up to that. You mentioned one millisecond versus one second. In terms of the distribution of the time for queries, is it very homogeneous like that where it can be broken in two buckets, or is it very much heterogeneous and there are all these different 1 millisecond, 100 milliseconds all over the place?
Kang: Our queries perform multi-modally. They don't conform to any easily characterized distribution, at least.
Participant 10: For sharding and joins, do you ensure that the related data ends up on the same shard?
Kang: We have a set of queries that cannot be run on sharded databases. It is up to the application to determine that its queries are shard-compliant. We are unable to block queries that might span multiple shards; the application developers really do have to check that their code is sharding-compliant. We can help with some of them, but not all.