Bio Attila Szegedi is a software engineer working for Twitter, in its Core System Libraries group, and serves as cross-team expert for JVM performance. He also works on OSS projects: he is a contributor to Mozilla Rhino, Twitter's server-optimized Ruby runtime Kiji, author of Dynalink (dynamic linker framework for JVM languages), and a principal developer of the FreeMarker templating language runtime.
QCon is a conference that is organized by the community, for the community.The result is a high quality conference experience where a tremendous amount of attention and investment has gone into having the best content on the most important topics presented by the leaders in our community.QCon is designed with the technical depth and enterprise focus of interest to technical team leads, architects, and project managers.
Yes sure. Well, it’s been a while that I actively did a lot with Rhino. It came into my focus back in, I think, 2005 or because my previous job, we really needed a good scripting environment and my initial encounter with it was I was basically prodding the then maintainers to port the Apache Cocoon fork which had continuations back into the mainline.
And then, you know, by gradual adoption, started fixing bugs and submitting them: eventually became a developer; eventually became a committer; eventually became the only active committer which de facto made me the project administrator. And I wouldn’t say I did anything too big during that time.
I really just tried to keep the project alive, fixing bugs and more, important trying to attract talents to it which I managed to attract one person who fixed the XML support. I reactivated Norris Boyd who was the original author and then eventually managed to give over the maintenance to Hannes Wallnöfer who is right now the maintainer.
Last thing I did, quite recently actually, about a year-and-a-half ago was that with the advent of CommonJS, I figured that it would be good if Rhino had a built-in implementation for CommonJS. So, I actually wrote a working CommonJS modules implementation on top of which you can actually build any other CommonJS system. So, you need to have that as a support in the run-time and then you can get the others on top of that. So, that was my last significant contribution as recently.
Well, invokedynamic is, well Charlie Nutter says that invokedynamic is neither invoke nor dynamic. It’s actually a way for applications to specify their own linking rules for calls. And for dynamic languages, this really means that if you can somehow translate your dynamic language source code into Java bytecode which is what most dynamic languages do: Rhino can compile to bytecode; Ruby can compile to bytecode; Clojure obviously compiles to bytecode, and so on. Then you need not do some slow dispatch and look up when you’re invoking the methods and properties and so on in your language.
But you can actually do more efficient linking to implementations of these methods in Java and you can actually let the HotSpot optimizing machinery kick in at that point; which means that your call can actually be inlined as if it were a Java call and then the inlining also allows for loop unrolling and a host of optimizations.
So basically in the long term, this means that it will be a very significant performance boost to the dynamic languages on the JVM. There is, however, quite lot of additional high level plumbing that needs to be done either on a per language basis or the languages can choose to adopt maybe a unified approach to do this high level plumbing that’s on top of JVM.
That is exactly what my library solves and it’s on a high level basically when you think about it, Invokedynamic provides you with low-level plumbing for customized linkage but when at the end of the day, you need to decide that an expression in your language to what method call will it link to either in Java which is statically typed or any other language, right. Then there’s this decision, this look-up, this dispatch has to be implemented in a unified way. At least, what every dynamic language wants to do is being able to interoperate with Java, right?
So, at least that problem has to be solved with a Metaobject Protocol and actually my solution turned out to be more generic because it went one step further saying that: "You know there’s no reason why you would limit yourself to just having the Metaobject Protocol for linking with POJOs, you might as well be linking with something else."
So if you are in Ruby, you might be linking through to a Python function or to a Python object if you want to.
Yeah, that’s correct, last July. Well, interestingly enough, it was a tweet. I met Evan Weaver who was then the head of infrastructure at Twitter back in 2009 at QCon in London. And he’s an extremely nice person and we just randomly happened to sit next to each other in a pub in London at a Speaker event and struck up a conversation, started to following each other on Twitter and so on and about a year later, he was basically sending out a tweet saying, "Dude, come work at our company. We are hiring." And I was thinking about, maybe taking up a new challenge, I felt like Twitter is an environment where you really have a lot of tough problems to solve and I just basically responded to the tweet and, you know, the rest is history.
Yes, Core System Libraries is a sort of a subteam within the runtime systems and I’m dividing my time basically between the two; runtime systems in general and Core System Libraries in particular.
Within runtime systems, I’m doing pretty much what I’m speaking about in my talk which is I describe myself a bit of a wandering engineer. I basically join various teams at the company when they have performance issues and then sit down with them and then we hammer them out. And this is normally JVM performance tuning.
I did some wacky stuff. One of the things that I did when I joined was actually replacing the garbage collector implementation in the Ruby runtime because it used to spend 33% of the CPU in GC and then we pushed that down to 10%. I mean, I’m not taking credit for doing all that work again. Again, Evan Weaver did a lot of that and few other colleagues at the company. Brandon Mitchell also did a lot of work on that but together we came up with a solution that is actually a Ruby that is usable for enterprise applications.
So, these kinds of things. So these days, Core System Libraries has the mandate of basically writing a bunch of lower level libraries that are used by all the more product oriented teams within Twitter - service discovery, system-wide exception reporting, you think it sounds mundane until you realize that you are talking about real-time propagation and monitoring of error conditions over 10,000 hosts or so. So, it poses some challenges of its own.
Well, the difference is one thing is that for natural communication, again, we want to be low latency. So, interestingly enough, most of the time, we code for load balancing moved into the clients; so, we combine this with service discoveries so clients will normally know where are the instances of the services that they can use and they fail over really fast. We have ridiculously low time outs on connections, which again, is a problem if your service starts to garbage collect while you’re connected. But again, the client would just abandon it and fail over to a different instance really quickly.
However, the real big differences are strong favoring of asynchronous communication model. We have an open source library that’s named Finagle which is actually something that allows you to write really scaling non-blocking IO clients and servers with fail over load balancing, retry and some capabilities baked right in. It’s built on top Netty which is already an asynchronous network communication library but we take it basically one step further by making these other desirable properties of load balancing and so on available in it.
So basically when you write your service or your client for that matter on top of Finagle, you will mostly be dealing with programming with futures, programming with asynchronous callbacks and so on. It’s a fairly palatable and Scala which is the native language on which Finagle is written and we take great pains to also provide adaptors for Java, so it's fairly friendly for Java as well.
So, asynchronous communication is one thing; the other thing is that data storage for almost every data storage needs we need to consider sharding. So, whatever is your back-end, either you’re storing your data in MySQL, either you’re storing in Cassandra or whatever, you need to have a sharding strategy.
Again, we have an open source library for that a well. We’re quite good at open sourcing our generic solutions. We have a library in Gizzard which is again on GitHub, Finagle and Gizzard are both there and then basically you can define sharding strategies for any data source in the background, MySQL, Redis, Cassandra - what have you.
And you can even combine these. We have our own database library named Querulous? Which is an asynchronous sort of it looks a little bit like JDBC except that it’s completely geared towards again, asynchronous communication with database systems because if you want to have database operations behind your otherwise asynchronous network communication stack, you don’t want to have blocking I/O on your database, right?
So these are all the things that if we really want to scale, if we really want to have low latencies because we want to quickly be able to fail over with low time outs and so on, asynchrony is the key and data sharding is the key. So, these are the biggest differences.
And now, as for what stays the same, I would say, "Well, you know, pretty much everything else, probably stays the same but these two aspects they really inform the final architecture of your system in a big way."
I think that’s simply because we are not dogmatic about any particular technology. So right now, I think, that tweets are still being stored in a MySQL database because historically they were. And for them, for instance, I believe that we actually use temporal sharding which is an interesting technique where we just shard in databases based on the timestamps of the tweets and putting in new databases and eventually having the older ones just contain the older tweets and so on. So, it’s for historical reasons and for many purposes, MySQL works just fine and for purposes where it’s an overkill to the point to be a performance hindrance, we will just resort to something else.
But right now, we have timeline caches which use Redis and we use Redis without any storage because at the end of the day, they are caches, right? So, we are using Redis for the pleasant property that it has built-in data types so we can just append to a list in Redis but if Redis instance goes down, no harm, no foul, it will be restarted and eventually be repopulated and the caching is redundant anyway, so the same data is normally available in several instances. So, it comes up it’s backfilled.
And Cassandra, it’s fantastic for storing time series. So, if our monitoring stuff is in Cassandra, basically just pick whatever feels like the best tool for the job and we’re not dogmatic about any of that.
8. So when go and deal with a performance problem with some team within Twitter, are you looking at the code first or do you tend to look at the way the garbage collector's configured or where do you start?
Well, a garbage collector is a global service for a particular JVM and as such, its own operation is affected by the operation of all the code in the JVM which is the Java libraries, third party libraries that have been used and so on, which means that, you can’t really, or, let me put it this way: if you need to look at the application code in order to tune the garbage collector, then you are doing it wrong because from the point of view of the application, garbage collectors are a blackbox and vice-versa.
From the point of view of the garbage collector, the application is a blackbox. You only just see the statistical behavior basically: allocation rates, the typical duration of life of the objects and so on. So, the correct way to tune the GC is to actually inspect the GC logs, see the overall utilization of memory, memory patterns, GC frequencies - observe it over time and tune with that in mind.
Yes, we do. It’s not that heavy because GC will only log when it does something. Now, if it’s doing something too frequently, then your problem is not the logging; then your problem is that it’s doing something too frequently and when it’s sufficiently nicely tuned, then it’s infrequent than compared to the work that it has to do to clean up memory, just the cost of writing a line to the log is completely negligible. You don’t really perceive that.
Yes, that’s the thing that bites us, yes.
Yes. So, within HotSpot, the frequency and duration of the garbage collector pauses; well, generally: if you had a JVM with infinite memory, then you will never have to GC, right? And if you have a JVM with a single byte of free memory then you are GC-ing all the time. And between the two extremes, you have an asymptotically decreasing proportion of your CPU going towards GC which basically means that the best way to minimize the frequencies of your GC is to give your JVM as much memory as you can. Specifically, the frequency of minor GCs is pretty much exactly inversely proportional to the size of the new generation. And as for the old generation GCs, but you really want to avoid those altogether. So, you want to tune your systems so that those never happen. It’s another question whether it’s actually possible to achieve in a non-trivial system with a HotSpot, it’s hard.
I have read papers about it. I did not have the opportunity to try it out yet myself.
13. Jumping to the idea of using code as a way of reducing the amount of GC pause time that you get. You mentioned in your presentation about Cassandra’s use of a slab allocator. So, could you explain how that works?
And basically just to go back to me just telling you that you normally don’t look at the code when you’re tuning GC, right? This is valid in so far as you’re always trying to tune the GC for a given application hoping that you can actually achieve the desired results. Now, if to the best of your knowledge, you have tuned the JVM as good as you could and your GC performance is still unacceptable? Now, that’s the time when it actually makes sense going into the application and try to see whether you can actually ease the pressure on the GC somehow.
So with that in mind, yes you can definitely do things in your application code that will make the GC’s life easier. As I’ve mentioned in my presentation, there’s actually a lot of things that you can do. You can try to have more efficient representation of your data in memory, less wasteful, choose the right data types to represent them; try to minimize the garbage and if everything else is given, then actually doing things like using larger chunks of memory either ‘on-heap or off-heap’ and then managing them yourself to a point might actually be valid. I’m never advocating full blown memory management in your byte array or byte buffer but if you have limited uses like you are filling it up linearly and then you are flushing it and you would need to provide a binary presentation anyway, in case of Cassandra’s disk buffers, then it’s a valid approach.
So, if you have certain constraints for your problems, it might be a valid approach.
Well compared to say, two years ago, yes, just being pedantic. Weak references are a different kind of beast. They’re always eagerly deleted by the garbage collectors when the reference is no longer accessible, so they don’t normally contribute to a memory pressure at all.
Soft references, on the other hand, are, we used to be a fan of soft references and right now, I’m less of a fan of soft references, I just cannot wholesale recommend them in any situation as a way of memory sensitive management of your data in memory because the interaction with the garbage collector is such that with throughput collectors, they are really just all cleared at once when memory fills up. And with CMS, they are gradually cleared so you do get this kind of a memory sensitive gradual eviction of soft reference data; but the problem is that it just increases the unpredictability of your garbage collectors and that’s not really what you want with CMS.
Yes, again double edged sword, it turns out that the "when you can" part of the clause does not really happen that often. So, the best thing you can do to improve the concurrent performance of the programs is to have completely isolated threads that never need to communicate. So you have different HTTP requests being responded on different threads and if they don’t need to access shared data in any way, then you’re fine. Then if you need to access shared data but most of it is immutable, you're again fine because multiple threads can just access them without any synchronization.
And then when you actually need shared mutable state, then in the probably 85% of the cases, you will need to synchronize. In the remainder of the cases, you can get away with cautious use of volatiles, cautious use of java.util.concurrent atomic constructs. But the important thing to know here is to run the application under a profiler, a profiler that knows how to do lock contention monitoring.
And when you actually identify performance bottlenecks with the synchronized keyword, then it makes sense to try to think about whether you can apply some of these other techniques at that point and then do them. But again, these are code changes that are best done when you can make informed decisions based of what you see in the profiler, not just blindly.
Yes, well, of course it does because different languages will have completely different runtime behavior. It’s specifically Scala which started out as a better Java, right? Over the iterations different language semantics divert so much from Java that the compiler actually has to do really hard work to bridge the semantic gap between the JVM execution model and the Scala language execution model.
And a bunch of short-lived objects will be created, lots of the time. So, quite often the primitive values are being boxed to be returned across closure boundaries. Closures themselves are being created quite often. Scala will use a bunch of linked lists, anonymous function instances and so on that will be created, shortly lived. For this reason, it also creates a lot of small classes that also get loaded in your cache with permanent generation.
So it does exert quite a bit of additional pressure on the JVM compared to Java code. Now, of course, there is a bunch of techniques that modern JVMs employ to minimize the impact of these constructs most notably escape analysis is really helpful to say, minimize exactly these kinds of things like a long integer being boxed and returned through a closure boundary and then unboxed.
So, good escape analysis in a HotSpot can go a long way. So, what I notice is that Scala compiler actually generates code so that it pretty much relies on the fact that JVM is very efficient at optimizing. So, even a field accessor in Scala always goes through a virtual method invocation even within a class because in Scala you can actually override a field in a subclass. So, you cannot just use a getfield bytecode to get a value of the field. So, even its own class, we only access its own fields through a virtual method invocation that can be over-ridden in a subclass.
And then it pretty much relies on the fact that, you know, hotspot will recognize that these things can be nicely inlined and then will inline them?
JVM itself, not so much I think that using applications with lots of memory is a problem that manifests itself in every runtime system. So, even writing complex C application that manages large heaps especially if it also has to be concurrent and has complex object model - something that’s really hard to write. Again, paraphrasing Gil from his talk is that you actually think about writing these applications in Java because you can and in C you’ll just see that basically infrastructure stuff like memcache is written this way and then you just use memcache when you have the need for these kinds of big memory storage.
Therefore, when we have a problem that are not specifically Java or JVM related. We have a generic issue with using blocking I/O from any language that’s why we go towards the non-blocking I/O and so on. We had a particular problem with a timeline cache system that we were using internally which is speaking to 150 machines and each machine is running 7 instances of Redis because Redis is single threaded and machines are 8-core and they needed something in the 8th core, so they were using 7 for Redis so this goes out to seven times 150 that’s (probably not, I don’t know math anymore) I’m a software engineer, I just know how to operate a calculator and this is embarrassing).
Anyway, the point I was making was that we had 700-ish or 900-ish network connections basically from each JVM to potentially all those Redises and at some point, the actual bottleneck was the throughput of the network card in the machines.
So, we just saturated that and then we have to figure out whether we can maybe talk a little less chatty to Redis, the protocol is given we cannot change it. It’s fairly compact as protocols go but we had to figure out whether we can somehow batch things or just do some a little bit more pre-processing before we send it on the wire so that we’ll make it a little bit smaller and we sometimes run into these.
At the end of the day you will always have this like of multifaceted boundary for application, disk I/O here, network I/O here, CPU here and then you hit one and then you fix one, you hit the other, you’re always constrained by something.
We are not usually constrained by CPU, we are normally constrained by IO and yes, Cassandra folks had to solve some problems with disk IO. They did a ton of experiments with various block sizes and file systems. They were very vigorous about it. They ran a full lab scale experiment with every known Linux file systems and a bunch of block sizes and different read/write strategies until they found the one that performs the best for us.
And then the timeline cache people they (it’s so funny when they’re going to work) and it turns out, "Oh we have a bottleneck." And they start investigating it and after a while, they realize, "Oh my God, we’ve saturated the network card.
Why did they not consider GC-less C++?
If one goes the proper C++ way and spends some thoughts on memory management initially, tweaking is reduced and less complex. Critical resources like memory should never be given into the hands of an automated manager.
Re: Why did they not consider GC-less C++?