Transcript
Alex Seaton: I'm Alex Seaton. I'm from ArcticDB at Man Group. I'm going to speak to you today about how to build a database without a server. I work at Man Group, which is a very large hedge fund. If you have a hedge fund with a lot of technology, you end up building a trading system in it that looks something like this. At the broadest possible level, this is what a hedge fund trading system looks like. At the top left, you have some source of traditional market data, tick data, what's the price of Apple, this sort of thing coming in.
At the bottom left, you have other sources of data coming in, and more esoteric stuff. This could be stuff like satellite pictures of crop fields or credit card transactions. Anything imaginable, you want it to be as esoteric as you can, because the more esoteric it is, the less likely that your competitors have the same information as you. This all gets piped through this huge system, and you do risk management and simulation over the top of it. It gets piped in and fed into a trading system that does its own calculations and calculates covariance between stocks and all this crazy stuff that gets used as inputs for the next step in the pipeline. You need a really good way of managing this flow of information and these connections between systems. Man's approach to this, for a long time, has been to use Python everywhere, have as many of these systems talking Python to each other as you can, and, probably controversially, use pandas DataFrames a lot as the interchange format.
For the data warehouse stuff on the furthest left, this makes sense: you need to archive off data somehow and work with it in an in-memory structure that makes sense. As you go further along, we also use DataFrames to exchange data between systems in this pipeline. Importantly, you also care a lot about versioning data, because data vendors will often correct data after they've delivered it to you, and it will have this cascading effect on all the calculations that you've done off of that data. You really need to be able to track modifications to data over time and see not just what's the most up-to-date, most corrected version of the data, but also what's the data at the time that my system saw it, so you can reproduce the behavior of your systems. We clearly need some good technology for dealing with pandas DataFrames, serializing them, reading them out really fast, versioning them.
The Old World
Starting in 2012, Man began working on an older system to do this, all built on top of MongoDB: tooling to serialize pandas DataFrames off to Mongo really efficiently and read them back out efficiently. It was a big success. It was open-sourced. Actually, a lot of our competitors started using it, so you can debate whether that was successful or not. You can see this problem in the middle here, which is this enormous Mongo cluster. This is a slide I stole from the version of this talk given 10 years ago, about the predecessor system. Already, within a few years of this system, this Mongo cluster is getting pretty enormous.
Over time, it just kept growing out of control. We were spending our lives chasing around space and asking people to delete data because you can't expand this thing endlessly. Apparently, there's an urban myth that we were running the biggest Mongo cluster in Europe for quite a long time in order to manage this system. Like lots of these systems, it's a victim of its own success. You get into an arms race with your users, and they adopt the system a lot. You get to a point where you can't keep up anymore.
The New World
We wanted to move to a new world, and we were thinking, can we just get rid of the Mongo bit of this system completely? What's it there for? What's it giving us? Can we have a world where all we need to manage is storage itself? Can we have a world where we can easily expand and upgrade what we're doing? I mentioned upgrade because actually upgrading a Mongo system is very annoying, and you have problems with old client versions not being able to talk to it anymore, stuff like this.
Crucially, can you get to a point where your read speed is only limited by the bandwidth of the storage itself? Can you actually get rid of this middle layer in this big database server that's maybe not giving you as much as you would think? To motivate this, this is a plot of bandwidth on top of one of our storage clusters. This is S3 read and write speed. This is just from an arbitrary date. You can see peak read bandwidth of 40 gigabytes a second or more. If you can just stream data straight out of these devices, they can give you a lot. You don't really need the big database server in the middle. The second observation is you can see that data's clearly read a lot more than it's written, which isn't surprising. It makes sense to optimize for reading data rather than writing it usually.
Outline
In this talk, I want to go through a bit about what we've built, and actually some simple ideas about how you use object storage to lay out structures so that you can work with data at good scale and get some good atomicity and consistency and durability guarantees just with object storage. Then move on to some of the harder bits: how you manage global state without any synchronization, these obscure things called CRDTs, and some of the nuances of working with them. Finally, a few ideas about the future, some new facilities that object storage vendors give us, and some ideas about how we can use them.
1. What is a Serverless Database?
To start with, I thought I'd talk a bit about what is a serverless database, or more specifically what it's not. If you Google what's a serverless database, you get a bunch of people selling you managed database as a service solutions, which means they somehow provision database servers for you on some cloud and manage nice scale up and scale down for you. If they do their job perfectly, you don't even know there's a database server, which is great. I can see why people want this, but it's not what I'm talking about. What I'm talking about is just a software library that connects directly to object storage. You just have a thick client connecting directly to object storage.
In our case, this would be an ArcticDB client, which is our package, which is like a Python wheel with most of its work actually done in C++ native extensions. You'd have any number of uncoordinated Arctic clients all talking directly to the object storage to do their work and read and write data. Just to make this a bit concrete, I'll show you briefly what it is, but the rest of the talk will be more details about how it works. You might just create some S3 bucket, create a client, connect to the bucket that you've created, and create a library in it. You can think of the library as just a database, a place where you put multiple keys, and then you're away and you can start reading and writing whatever data you're interested in. I'm writing some example data about the prices of Apple here. Importantly, it has this versioning concept built in. After you change the Apple data, you can go back and see old versions of it.
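As a rough sketch of that workflow using ArcticDB's public Python API (the endpoint, bucket name and authentication below are placeholders, so check the exact connection string format against the documentation), it looks something like this:

```python
import arcticdb as adb
import numpy as np
import pandas as pd

# Connect straight to an S3 bucket: no database server in the middle.
# Endpoint, bucket and auth here are placeholders.
ac = adb.Arctic("s3://s3.eu-west-1.amazonaws.com:my-bucket?aws_auth=true")

# A "library" is roughly a database: a namespace holding many symbols (keys).
lib = ac.get_library("prices", create_if_missing=True)

# Write a DataFrame of made-up Apple prices under the symbol "AAPL".
idx = pd.date_range("2024-01-01", periods=5, freq="D")
lib.write("AAPL", pd.DataFrame({"close": np.random.rand(5)}, index=idx))

# Change the data, then read an earlier version back: versioning is built in.
lib.write("AAPL", pd.DataFrame({"close": np.random.rand(5)}, index=idx))
print(lib.read("AAPL").data)           # latest version
print(lib.read("AAPL", as_of=0).data)  # the data as it was first written
```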
2. The Basics
I thought I'd talk about the basics of how you make a system like this to begin with, because you can actually get surprisingly far quite simply with a few simple ideas that I think are generally applicable to other systems. Maybe people who don't care about databases could steal some of these ideas. The idea is that we have a tree of objects all in object storage. We're using the same file format for all of them. All of these objects that we're storing are immutable except for the very top layer. When we write some data like this, this is looking at some interesting bit of financial history. This is like made-up prices around the financial crash. We might be writing data for Google and tracking them. What we do is start by taking this data and breaking it up into chunks so that you can read data later without reading absolutely everything. You can do some filtering and pruning and only read blocks that you're interested in. Importantly at this point, none of these blocks are actually readable to a user. They're all stored under opaque, undiscoverable keys. We're free to do what we want until we get to the very end and make all this whole system readable and observable to users.
Then we'd write some flat indexing structure over the top of it. This is just a mapping between, how do you find data within a given range of time? Some extra bit of metadata to do our versioning tracking. At this point, none of this is visible to users. This is how we get some decent atomicity promises. We can build up this tree of several objects all on object storage without any of them being visible to users until we do the very last step, which is writing this single mutable reference key over the top, which is like a pointer to the latest version of the data that you're interested in. It's only this that makes this discoverable to users. Then you might want to add on some data. Here's maybe some data for September 2008 with some more prices for Google. You would do the same exercise again. You'd write just the September block at the bottom, index over it.
Importantly, this index would reuse the blocks from the previous version so you don't have to rewrite everything. Here we have really only the second main idea of this, which is that you build this second versioning structure that points back to the previous version. Once you actually finish what you're doing and repoint and publish version one, you'd still be able to traverse back through this linked list of versions and find older data. In a way, we've gotten quite far quite easily. All you really need is this idea of doing writes bottom up and exporting them with some mutable pointer at the top of the tree to get good atomicity promises. The object storage helps you a lot with really good durability. You can't get partial writes on object storage.
In some ways, working with object storage to do this is easier than working with a local file system, because you're not worrying about things like dirty pages and whether my write has actually been flushed to disk and what happens if someone pulls the power out. The object storage handles all of that for you. An intrinsic limitation of a system like this is the consistency model. What we're left with is this last writer wins consistency model where, if you have two people writing to the same key in the database, one of them will succeed in writing to this mutable layer at the top, and the other will lose and their data will be left dangling. I'll come back at the end to some ideas about how we could improve this.
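To make that write path concrete, here is a toy sketch of the pattern rather than ArcticDB's actual on-disk format (the bucket and key layout are made up): data and index objects are written bottom up under opaque keys, and only the final put of the mutable reference publishes the new version. Because that final put is an unconditional overwrite, concurrent writers to the same symbol resolve as last writer wins.

```python
import json
import uuid

import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"  # placeholder

def write_version(symbol: str, chunks: list[bytes], parent_index_key: str | None) -> str:
    # 1. Write the immutable data blocks under opaque, undiscoverable keys.
    chunk_keys = []
    for chunk in chunks:
        key = f"data/{uuid.uuid4().hex}"
        s3.put_object(Bucket=BUCKET, Key=key, Body=chunk)
        chunk_keys.append(key)

    # 2. Write an immutable index over the blocks; via the parent pointer it
    #    also forms the linked list of versions (a new version can reuse
    #    blocks from the parent instead of rewriting them).
    index_key = f"index/{uuid.uuid4().hex}"
    index = {"symbol": symbol, "chunks": chunk_keys, "parent": parent_index_key}
    s3.put_object(Bucket=BUCKET, Key=index_key, Body=json.dumps(index).encode())

    # 3. Publish: overwrite the single mutable reference key. Nothing above is
    #    discoverable until this put, so a crash mid-write leaves only
    #    unreachable garbage, never a half-visible version. A racing writer on
    #    the same symbol simply overwrites this key: last writer wins.
    s3.put_object(Bucket=BUCKET, Key=f"ref/{symbol}", Body=index_key.encode())
    return index_key
```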
A few more remarks about the basics of this system. A lot of the motivation for this system is actually getting good performance and really good read speeds for users. You care a lot about how you pipeline IO work. You want to have really good scheduling that will be able to kick off tons of IO operations all in parallel. Ideally, you want to use proper async APIs so that you're not tying up a bunch of operating system threads all blocked on IO work, but you do something smart with an event loop or something like that so that you can have very many IO requests all happening in parallel. Compression is interesting. At the moment, all of these blocks are just LZ4 encoded, but there's lots of quite interesting ideas about using different compression algorithms for the metadata layers at the top than for the data layers at the bottom, or maybe doing adaptive encodings and actually inspecting the data that you're writing and trying to figure out what's the best compression algorithm for it, which we're working on at the moment.
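As a rough illustration of the scheduling point (ArcticDB does this in C++ with a proper async IO layer; a thread pool is just the simplest way to sketch the idea in Python), the goal is to have many GETs in flight at once rather than fetching blocks one at a time:

```python
from concurrent.futures import ThreadPoolExecutor

import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"  # placeholder

def fetch_block(key: str) -> bytes:
    return s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()

def fetch_blocks(keys: list[str]) -> list[bytes]:
    # Kick off all the GETs in parallel; throughput is then bounded by the
    # storage's bandwidth rather than by per-object round-trip latency.
    with ThreadPoolExecutor(max_workers=32) as pool:
        return list(pool.map(fetch_block, keys))
```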
My last point about performance, which is maybe specific to our system, but I think in general, there's a big cost to getting the data out of storage and then munging it into a format where you can expose it to the user. I think this would be true with any system like this, but for us, this is the cost of taking objects that you're holding in memory in C++ and getting them into a format where you can map them into Python objects and expose them in Pandas DataFrames to the user. It can be really expensive, and I'll show you a bit why in a second. You really want to have good APIs to do grouping and aggregations and filtering and as much predicate pushdown as you can so that you restrict the amount of data that you actually need to export back to the user and do as much as you can just talking to objects on storage.
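For example, with ArcticDB's read parameters (treat the exact names here as a sketch to verify against the documentation, and `lib` as the library handle from the earlier snippet), you can push the date range, column selection and filter down into the read instead of exporting everything and filtering in pandas:

```python
import pandas as pd
from arcticdb import QueryBuilder

# Date-range and column pruning: only the blocks covering September 2008,
# and only the column we actually need, ever get read and exported.
september = lib.read(
    "GOOG",
    date_range=(pd.Timestamp("2008-09-01"), pd.Timestamp("2008-09-30")),
    columns=["close"],
).data

# Filter pushdown: the predicate is applied before rows are exploded into
# Python objects and handed back as a DataFrame.
q = QueryBuilder()
q = q[q["close"] > 100]
expensive_days = lib.read("GOOG", query_builder=q).data
```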
To illustrate this, this is all the CPU-bound work that one of our threads is doing, and it's absolutely dominated by just copying strings around and taking them from this nice string pool format out into an exploded format that you can expose up to Python. You obviously wouldn't have to do this if you were able to restrict the data that you were exporting back to the user. Then maybe a downside of working on a system like this is compatibility. This is a downside of any thick client or library-based architecture. You end up very quickly with an absolute zoo of client versions all cooperating with this system, and you can't be opinionated about which versions are still there, so you end up having to support clients that are several years old. You have to worry about backwards and forwards compatibility, like what happens if you write a new data type and an old client can't understand it, stuff like this. My only thought about this really is it would have been great if we had some client version numbers in the objects that we were storing right from the beginning, because that's something that's impossible to retrofit, and you could have used that to introduce some restrictions about which clients can talk to which databases.
3. Managing Global State
So far, so good. It seems a bit too good to be true so far, so I thought I'd talk a bit about one of the harder problems on a system like this, and that's the problem of managing global state. I like talking about this problem because it seems like it should be incredibly straightforward. We just want to find out what all the keys are in our database, so imagine that you've saved off some data for Apple and some data for Meta. Maybe you delete the data for Meta, and then you want to find out what all the keys are again and just find that you're left with Apple. I know this seems like it ought to be the easiest thing in the world, but if you have no coordination between the processes that are writing data for one key in the database versus writing data for another key in the database, this actually gets pretty difficult pretty quickly, and I think it's illustrative of some of the problems you have with this programming model.
The simplest approach would be to walk through the structures that I already described, which would work, but it's completely unperformant. You could imagine, in practice, some of these databases might have millions of keys in them. That's perfectly normal. You would have to list them all out. Suppose that you have a million keys: S3 gives you lists back in a page size of 1,000, so you've already done 1,000 HTTP requests just to find them, and then you need to do further work to work out whether they are actually deleted or not. In practice, this just doesn't work. Then we start thinking about ideas like, is there some way that we can cache the state of these objects in the database? Is there some way that we can journal them off and have some fast lookup cache to find the total state of the overall thing? The most naive way you could imagine doing this is: what if, when we write a new key, we write a little marker into object storage describing that we've done that and the time at which we did it? What if, when we remove one, we write a little marker describing when we removed it? Then you can imagine folding these things together, compacting them into a bigger key that describes the overall state, and cleaning up these tiny keys that you've written as you go along. There are some pretty obvious problems with this.
The main one of which is that you're completely relying on timestamp ordering. This delete at time 4, you're relying on the time 4 being after time 1, in a sense, to conclude that Apple's deleted. We actually tried this approach for a while. We knew about the clock drift problem, but I think we were really surprised how much of a problem and a big limitation it was in practical systems, especially on some of our research clusters that are maybe maintained less obsessively than production trading servers and stuff like this. We've observed clock drift up to tens of seconds. I want to emphasize as part of this that clock drift is a real thing that you need to worry about. The other problem is respecting this last writer wins consistency model that I spoke about. If the write and the delete were actually racing with each other, nothing about this is resolving which one actually won. It's quite easy for things to get out of sync.
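To make the naive journaling approach and its flaw concrete, here is a toy sketch (made-up bucket and key scheme, not our real format): every add or delete writes a tiny marker whose key carries the symbol, the operation and the writer's local wall-clock time, and a reader folds the markers together by timestamp, which is exactly where clock drift and racing writers bite.

```python
import time

import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"  # placeholder

def journal(symbol: str, op: str):
    # e.g. journal("AAPL", "add") or journal("AAPL", "delete").
    # Everything lives in the key itself, so a reader only ever has to list.
    s3.put_object(Bucket=BUCKET, Key=f"journal/{symbol}/{op}/{time.time_ns()}", Body=b"")

def live_symbols() -> set:
    latest = {}  # symbol -> (timestamp, op) of the newest marker seen
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix="journal/")
    for page in pages:
        for obj in page.get("Contents", []):
            _, symbol, op, ts = obj["Key"].split("/")
            latest[symbol] = max(latest.get(symbol, (0, "add")), (int(ts), op))
    # The flaw: "newest" is decided purely by wall clocks on different machines,
    # so a drifting clock can make a delete lose to an older add (or vice versa),
    # and two genuinely racing writers are never properly reconciled.
    return {symbol for symbol, (_, op) in latest.items() if op == "add"}
```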
This brings me into the weird and wonderful world of CRDTs, which are this obscure concept. It stands for conflict-free replicated data types. I thought they were obscure, but I think this is actually the third talk that mentions them, so maybe they're becoming fashionable. The prototypical example of CRDTs is like collaborative text editing. You could imagine having three people all trying to edit the same Google document or something, and they want to sync their updates across each other. They'll send little messages to the other editors describing the changes that they've made. There'll be some algorithm to fold together the changes that all these parties are making to the document and resolve it into some deterministic state without any manual intervention. It has to cope with problems like, what if one of the people editing the document is offline for a while and its messages can't get sent to the other? When it comes back online, you still have to be able to sync everything up and end up with all the replicas in the same state.
A more formal definition of what these things are is they're a data type and an algorithm for how you merge these data types together such that you can update any replica in your system without any coordination. You have to have this automatic conflict resolution, and everything has to be eventually convergent onto the same state. A simple corollary of this is that the operations you're doing have to commute with each other because you can't rely on the delivery order between these updates being synced across replicas. To make this a bit more concrete, this is an example of a fairly simple CRDT just modeling a counter where maybe we have one replica and another replica. They're each encoding that I'm incrementing the counter by certain amounts, and eventually anyone involved in this system would agree that the counter ends up with a value of 6. To explain how this might look on object storage, this might look like writing objects into storage, encoding the operations that you're doing.
Then someone interested in the overall state of the CRDT could list out everything under this prefix describing all the updates and merge them together. One thing to mention here is it's important that you encode all the information you need just in the keys in object storage because you need to be able to list them out, and from just that listing, figure out what's going on. You don't want to have to list them out and then read them one by one because that would completely kill performance.
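A toy sketch of that counter on object storage (placeholder bucket and key scheme): each replica records its increments as empty objects whose keys carry all the information, increments commute, and anyone can recover the value from a listing alone.

```python
import uuid

import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"       # placeholder
REPLICA = uuid.uuid4().hex  # this process's replica identity

def increment(amount: int):
    # One object per operation; the key alone says who incremented and by how
    # much. Increments commute, so no ordering or coordination is needed.
    s3.put_object(Bucket=BUCKET, Key=f"counter/{REPLICA}/{uuid.uuid4().hex}/{amount}", Body=b"")

def value() -> int:
    total = 0
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix="counter/")
    for page in pages:
        for obj in page.get("Contents", []):
            total += int(obj["Key"].rsplit("/", 1)[1])
    return total
```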
To get back to our problem of just figuring out which are the keys in the database, we become interested in these things called set CRDTs. Sadly, there's bad news immediately, which is that there aren't actually very many CRDTs that are known and proven to be correct. There isn't one for a set. The reason for that is obvious: adding to and removing from a set are not commutative operations. We've already shown that all operations with CRDTs have to commute with each other, otherwise you can't get convergence. Instead, what we have is this zoo of different weird approximations to sets. We start with this thing, the grow-only set, which is a really primitive construct, but it's useful to build more interesting CRDTs out of, where you can only ever add keys to a set. You can never remove them. This removing of the key at time 3 wouldn't be allowed, which seems useless and not at all like a set.
Then we can build more interesting structures on top of this. This is a two-phase set, and the idea of this is, you can add keys to the set and remove them, but once you've removed them, you can never re-add the key. This uses this common practice with CRDTs, where you build it out of two more primitive CRDTs. Here we have one CRDT modeling this grow-only set of live keys and another grow-only set of keys that have been deleted. You can imagine how this would work quite easily. You can see why it's impossible to re-add Apple once it's been deleted, because once it's in the tombstone set, it's in there forever. This brings me to the last one of these I want to speak about and the inspiration for how we went about solving this problem, which is this quite clever thing called an observed-remove set or an add-wins set. The idea for this one is that when you add a key to the database, or it's not really a database thing at the moment, but when you write something, you also tag it with some unique identifier. Here this 0x53 is just one byte that's meant to be representing a unique identifier.
Then suppose that some other replica also adds Apple; it would generate its own unique identifier. Maybe its state syncs through to the first replica and the first replica knows about both these unique identifiers. Then when the first replica finally decides to remove Apple, the message that it propagates to the other replicas would also include information about the tags that it knew about at the time. This is why it's called an observed-remove set. When you remove stuff, you also publish the state that you observed at the time that you removed from the set. Here you would eventually see that Apple isn't in the set because there's a removal matching the tag for each addition. The reason it's called an add-wins set is this idea of what happens when there are concurrent writes to the same structure. Here, if we have two replicas both writing to the same structure, the eventual state will be that Apple is still in the set. The addition operation wins because we can't possibly see a removal for the unique tag generated by replica two.
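A minimal in-memory sketch of an observed-remove set (illustrative only, not our implementation): every add carries a fresh tag, a remove tombstones only the tags the remover had actually observed, and a concurrent add, whose tag the remover could not have seen, therefore survives the merge.

```python
import uuid

class ORSet:
    def __init__(self):
        self.tags = {}           # key -> set of unique add-tags
        self.tombstones = set()  # tags that have been observed and removed

    def add(self, key):
        self.tags.setdefault(key, set()).add(uuid.uuid4().hex)

    def remove(self, key):
        # Tombstone only the tags we have *observed*; a concurrent add's tag
        # is unknown to us, so that addition wins after merging.
        self.tombstones |= self.tags.get(key, set())

    def __contains__(self, key):
        return bool(self.tags.get(key, set()) - self.tombstones)

    def merge(self, other: "ORSet"):
        # Merging is a union of tags and tombstones, so it commutes and all
        # replicas converge regardless of delivery order.
        for key, tags in other.tags.items():
            self.tags.setdefault(key, set()).update(tags)
        self.tombstones |= other.tombstones
```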
This inspired a new approach for us for how we manage this set of keys across the database. That was, instead of doing this based on timestamp, we try and record the state of the version history and the state of the structures I spoke about at the time that things are created and deleted. When we first write a symbol, we might write a little mark saying it's just been created and it's new. On the next write, we might say we've changed it again and it was at this particular version number in this chain.
Then when we delete it, we record what version it was at, at the time that we deleted it. What version were we deleting here? This is much better now because this gives us a way to detect when there's conflicting modifications to the same key in the database. Now if you have two people writing and deleting and you need to figure out which one wins and which one ends up valid, you have this way of detecting that this sort of problem has happened and then you have this fallback plan of going back to the golden source structures and resolving and figuring out which of these two states you're in. That's pretty nice. Something I've really glossed over with this though is how you do garbage collection and compaction with structures like this. We have this huge practical problem where we're accumulating tons of tiny objects, modeling changes to this set. We have this quite thorny problem like, how do we actually compact these together into a single bigger object?
In particular, how do we do it without changing the view of reality that other clients are getting of these structures? This is a huge problem with all CRDTs. People working on other CRDT systems also have exactly these problems. I think in a way, our problems with this are worse because all our state is stored on object storage in this global place. You have to really be careful about how you do the compaction without breaking other clients that might be reading it. Whereas if each replica had some local state, you can imagine more easily them being able to block for a bit and not accept any updates while they do some compaction work.
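Going back to the version-tracking idea, here is a toy sketch of what those markers might look like (made-up key scheme, and the fallback function is hypothetical): creates and deletes record where in the version chain they happened, so a conflicting write and delete is detectable and can be resolved from the golden-source structures rather than from wall clocks.

```python
import boto3

s3 = boto3.client("s3")
BUCKET = "my-bucket"  # placeholder

def mark_written(symbol: str, version: int):
    # "Version n of this symbol now exists": no wall-clock time involved.
    s3.put_object(Bucket=BUCKET, Key=f"symbols/{symbol}/added/{version}", Body=b"")

def mark_deleted(symbol: str, observed_version: int):
    # Record which version we had *observed* when we deleted, so a write that
    # raced past us shows up as a version no delete has ever seen.
    s3.put_object(Bucket=BUCKET, Key=f"symbols/{symbol}/deleted/{observed_version}", Body=b"")

def resolve_from_version_chain(symbol: str) -> bool:
    # Hypothetical fallback: walk the golden-source version structures from
    # earlier in the talk to decide the symbol's true state.
    raise NotImplementedError

def is_live(symbol: str) -> bool:
    added, deleted = set(), set()
    pages = s3.get_paginator("list_objects_v2").paginate(
        Bucket=BUCKET, Prefix=f"symbols/{symbol}/"
    )
    for page in pages:
        for obj in page.get("Contents", []):
            _, _, kind, version = obj["Key"].split("/")
            (added if kind == "added" else deleted).add(int(version))
    if not added:
        return False
    if not deleted:
        return True
    if max(deleted) >= max(added):
        return False  # the latest delete had observed every write we know about
    # There is a write that no delete ever observed: a re-create or a race.
    # Detect it and fall back to the golden source instead of trusting clocks.
    return resolve_from_version_chain(symbol)
```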
Just to summarize this quick section, one of the main points I want to make is that clock drift is a real thing. I think too often we ignore it or think that it happens on a much smaller scale than we think. Maybe we think it might happen on the order of milliseconds or microseconds, but it's actually observable on the order of seconds in some real systems. I don't think we should ignore it. I think we should be very scared whenever we see code that's comparing timestamps written by different machines. Also, this first approach that I described, this timestamp-based approach at the beginning, we actually kept for quite a long time in the early days of this project.
The evolution to tracking based instead on some causal history of what's going on wasn't rocket science, but it was a big mental leap from what we had at the beginning. We spent way too long fiddling around on incremental improvements to this timestamp-based approach, trying wider and wider tolerances and looking for some way of getting away with it. My last point is that CRDTs are useful. For a problem like this, they're the only option you have by definition, but they're very subtle things. None of the things I described actually model a set correctly. They're all some vague approximation of a set. You need to be really careful with them. You need to read some papers about how they work. You need to have some formalism and some proofs behind what you're doing.
4. Synchronization
I thought I'd talk a bit now about how to do synchronization on object storage. I want to talk about this because object storage vendors are always improving. Last year, Amazon announced support for doing conditional writes onto object storage. They're actually behind some other vendors. Google and Microsoft have had support for conditional writes for quite a long time now. These are things where you can say, when you upload an object to storage, upload it only if its state matches some identifier of its original state: has it changed since I read it out? You can use that to detect conflicts. Or you can say, upload this object only if there's no key already in the object store that matches what I'm uploading; upload it only if it's brand new. This is great for us, and we're quite excited about it, because it finally gives us a way to improve the last writer wins consistency model that I showed you at the beginning, where you can have racing writes on the mutable reference layer at the top and one writer loses out. You can easily imagine how you might use this if-match facility to detect whether there's been a racing modification. You'd at least be able to fail loudly to users.
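With that support (recent boto3 versions expose it as IfNoneMatch and IfMatch parameters on put_object; treat the parameter and error code names here as ones to verify for your SDK version), publishing the mutable reference key can be made conditional, so a racing writer fails loudly instead of silently losing:

```python
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "my-bucket"  # placeholder

def publish_ref(symbol: str, new_index_key: str, expected_etag: str | None):
    try:
        if expected_etag is None:
            # Create the reference only if nobody has created it yet.
            s3.put_object(Bucket=BUCKET, Key=f"ref/{symbol}",
                          Body=new_index_key.encode(), IfNoneMatch="*")
        else:
            # Replace the reference only if it is still the version we read.
            s3.put_object(Bucket=BUCKET, Key=f"ref/{symbol}",
                          Body=new_index_key.encode(), IfMatch=expected_etag)
    except ClientError as e:
        if e.response["Error"]["Code"] in ("PreconditionFailed", "ConditionalRequestConflict"):
            # Someone else won the race: surface it instead of silently losing.
            raise RuntimeError(f"concurrent write to {symbol} detected") from e
        raise
```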
The other reason people are excited about it (I'm probably slightly less excited for this reason) is ways of making distributed locks, just using object storage as a communications protocol. For a system like ours, there are a lot of reasons why you might be interested in this. One is maybe thinking about ideas of how to do transactions and modify many keys in the database at once atomically. You can imagine you might need some synchronization to do it. Maybe ways of doing the garbage collection and compaction and cleaning up old objects, doing some re-indexing, compacting old data segments together, stuff like this. You can imagine it would be a lot easier if you know no one else is trying to do the same thing at the same time. Maybe doing leadership election and having some process for hot standby failover just with object storage, without any explicit coordination between the processes that are failing over. Until there were these compare and swap operations in storage, we were left with this.
The only way of doing locking without a compare-and-swap operation doesn't really work. It's this notion of an approximate lock, where you put some UUID into some sentinel object in storage. You go to sleep for some time, until you think you're probably fine and there aren't any racing writes, and then you get the object that you've written out and hope that it's still yours. If it is, then you just assume you have the lock, which is really error-prone.
People are really excited about ideas for making more reliable locks just using object storage. The idea here is that you might have a bunch of lock files, each with some epoch identifier in them, all laid out on storage. You might have lock 0, lock 1, and some process taking the lock would list out all the lock files and conclude that the next file it wants to take is lock 2. It would be able to go and write to storage with an if-none-match, saying: write lock 2 only if it doesn't already exist, and if it does exist, then I assume someone else is taking the lock, and I'll try and get the next one or whatever. Each of these lock files would contain some information tracking whether they're expired or not, which is essential with a distributed lock because you can't rely on processes to elegantly clean up the locks after they're done with them. Even if they're programmed correctly and they are cooperative, someone could rip the plug out of this whole system at any time. They need to have some notion of a time at which they expire.
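A toy sketch of that scheme (placeholder names, and deliberately simplified): list the existing epochs, check whether the latest lease has expired, and try to create the next epoch with an if-none-match put. The embedded expiry is exactly the part the next point takes issue with.

```python
import json
import time

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "my-bucket"  # placeholder
LEASE_SECONDS = 30

def try_acquire_lock(name: str):
    # Find the highest existing epoch under the lock prefix.
    latest = -1
    pages = s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=f"locks/{name}/")
    for page in pages:
        for obj in page.get("Contents", []):
            latest = max(latest, int(obj["Key"].rsplit("/", 1)[1]))

    if latest >= 0:
        # Only take the next epoch if the current holder's lease has expired.
        current = json.loads(
            s3.get_object(Bucket=BUCKET, Key=f"locks/{name}/{latest}")["Body"].read()
        )
        if current["expires_at"] > time.time():
            return None

    body = json.dumps({"expires_at": time.time() + LEASE_SECONDS}).encode()
    try:
        # Create the next epoch only if nobody else creates it first.
        s3.put_object(Bucket=BUCKET, Key=f"locks/{name}/{latest + 1}",
                      Body=body, IfNoneMatch="*")
        return latest + 1  # "hold" the lock, but only until the lease expires...
    except ClientError:
        return None        # lost the race; someone else took this epoch
```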
My point with this is that this idea of having a lease that expires stops these things from being a useful lock at all. We might say that my lock on this talk expires at 11:25, and you start some operation thinking that you have plenty of time, but baked in is this assumption that what I'm doing is only going to take as much time as I have left. It's very easy to think that in practice you'll be fine here, because maybe you're just doing a single put operation onto storage or something like that, and think, how long can it take? It should take on the order of milliseconds. Really, we've observed in practice all kinds of problems. There have been weird, horrible network congestion problems where requests have taken on the order of minutes to get to storage and get a response back. There have been times where there have been bugs on the storage devices themselves and things get stuck. There have been times where you get stuck in some work queue that's implemented on the storage device and you don't get a response for minutes. These assumptions break down very quickly as soon as anything else in the system is going wrong, and you can't rely on them.
The only way really that people do this is to say, if I'm relying on one of these locks that has an expiry mechanism, I also need to have an extra layer of defense, and I would send to the server some token identifying my lock epoch or whatever, and the server would be keeping track of all the different lock epochs, and know what the newest one is, and be able to reject any requests from clients that don't have the latest one. Which makes me think, what's the point of this perfect lock that people are thinking about? Either you're in a situation where failures are ok, and the lock's just an efficiency improvement, or you need a perfect lock and you need this extra layer of defense on top.
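That extra layer of defense is essentially a fencing token. A minimal sketch, assuming whatever resource the lock protects can itself track epochs: it remembers the highest epoch it has seen and rejects anything older.

```python
class FencedResource:
    """Whatever the lock protects must itself check the lock epoch."""

    def __init__(self):
        self.highest_epoch_seen = 0

    def write(self, epoch: int, payload: bytes):
        # A client whose lease quietly expired (long pause, stuck request)
        # still believes it holds the lock, but it carries a stale epoch and
        # gets rejected here, so it cannot corrupt anything.
        if epoch < self.highest_epoch_seen:
            raise PermissionError(f"stale lock epoch {epoch}")
        self.highest_epoch_seen = epoch
        ...  # perform the actual write with `payload` here
```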
My summary about this is like, I'm trying to make the point that object storage is nice to build on in the sense that it's improving, and vendors are giving you new tools to use all the time. These atomic operations are quite a big deal. A few years ago, Amazon also introduced stronger read-after-write consistency models and stuff like this. They're getting more advanced over time. You can build more advanced systems on them over time. I'm also trying to make a more general point about distributed locking, and the idea that once you have a lock with a lease inside of it, it ceases to be a useful lock anymore, or at least a bulletproof lock anymore. I don't know if people identify with what I'm saying, but I've heard a lot of times in my career that a Zookeeper lock or some distributed locking service can solve all of your problems and guarantee exclusion between processes. My point is, actually, no, you need other lines of defense besides that if you're relying on it for correctness rather than just for efficiency and not wasting work.
Conclusion
In conclusion, you can actually get quite far and get quite a useful system out of some pretty simple ideas. You have to be really careful, because we've seen clocks, these timestamps across processes, that don't really act like clocks. We've seen sets that are actually weird approximations to sets that don't quite act like sets. We've seen locks that don't actually give you mutual exclusion. My more general point here is, I think, quite often we build distributed systems by analogy with a multi-threaded, single-process piece of software, and it just doesn't work. You need to be much more careful, much more rigorous about what you're doing, to be sure you're building correct stuff. Here's a reading list. I'd recommend especially reading the top paper, which is a really good research review of the state of the art of CRDTs, at least the state of the art a few years ago. It's quite a nice world because there's a fixed number of them that are known, so you can know everything you need to know about them fairly easily. The system I've been speaking about, this database that just connects directly to storage, is available. It's its own tech business now, spun off by Man Group. Its code is available on GitHub, so you can go and have a look if you're interested in what we're doing.
Questions and Answers
Participant 1: You said that you had problems with compaction, but you never said whether you solved the problems or not. Do you compact today?
Alex Seaton: Yes. We compact today. I'm not quite sure if it's all perfect at the moment. It's quite knotty and difficult to describe, because you have to do lots of strange things where you do compaction of these small objects that you're writing out, but also leave the things that you've just compacted lying around for some length of time. There are all these heuristics that you have to bake in where you say, ok, it's a day later or a week later, so I'm pretty sure I'm safe to clean up the old objects now. All sorts of horrible problems where, like, what if you have two things doing the compaction at the same time, and you have two of these compacted objects that are there, and you need to fold them together? It gets really knotty.
Participant 2: You mentioned early on in the talk that there are certain things that you wish you'd incorporated early on in thinking. Are there other things that you're now finding that you'd wish you'd considered earlier on, or how is that shaping up?
Alex Seaton: The problem with a system like this is that it's really impossibly difficult to retrofit changes to the storage afterwards. If you're introducing a new concept that relies on changing the contents of your files that you're writing, you have this eternal problem where an older client that you shipped the code for years ago is still out there writing objects that don't respect this new thing you've written. One thing we've been thinking about recently is ideas of how you track how much space these different objects are writing, and do you have ways of forbidding people from using too much space within a given database? We had ideas there about maybe including some notion of size in these objects and percolating that up the tree and making writes explode if someone's used up their allowance of space.
That's another example of something that's really difficult to retrofit afterwards, because you'll always have clients that are writing structures without that extra piece of metadata. I think definitely the client version thing I spoke about is the main thing, where it'd be really cool to be able to say, this database can only be written to by clients from the last six years, and if an older client tries to, we'll blow up with an elegant message explaining what they need to do. That would help us a lot, but again, it's impossible to retrofit now. These kinds of problems are one of the main downsides of this library-based model of development.
Participant 2: That must be a lot of work, driving version control among your clients. Because you haven't retrofitted that in, it's challenging.
Alex Seaton: Exactly. You can't force people to upgrade in practice. You can't be opinionated about what's up there.
Participant 3: You went down this route because you were trying to remove a Mongo database, a very specific component of your overall system. If you hadn't been constrained by keeping your existing system, has anything you've learned taught you what you'd like to do if you were doing it from scratch, or what would you do differently, or what would you like to change in your existing system to make it work better?
Alex Seaton: This basically was a ground-up rewrite of the existing system, so it didn't take much. The Python API was done very carefully to be like a swap-in with the old system. The only other thing we really took from it was this idea of versioning data, but that's really paid off and that's proven to be useful. Apart from that, it has almost nothing in common with the old system now, other than the API over the top of it.
Participant 4: Tell us a little bit more about Arctic for those who don't know Arctic, just at a very high level, and then just a little bit about adoption outside of Man and what you'd hope for the project.
Alex Seaton: The idea of Arctic is that it's an efficient way of storing very large, time series-indexed datasets. You can store whatever you like, but it's best for time series data. This can scale out to datasets that are terabytes or even petabytes in scale, billions of rows or hundreds of billions of rows, very wide datasets with hundreds of thousands of columns. Over the last few years, we've been spinning it out from something that's used in Man Group for its trading systems to a commercial venture, working with Bloomberg in particular. Bloomberg are using it for a bunch of their products now, especially Bloomberg Lab, which is like a JupyterLab trading environment. If you're a small hedge fund, you just connect to Bloomberg Lab, and they provide a bunch of tooling and data feeds and stuff, and you can run a hedge fund on top of infrastructure that's all provided by them. A bunch of different investment banks and stuff have taken on a bunch of smaller hedge funds. It's growing now and being used by a bunch of different people.
Participant 5: If this is mostly time series data, then what part of time series data do you need to delete or update? I suppose it's mostly adds, and then the conflict resolve should be easier.
Alex Seaton: Some of the time, it's mostly adds. We have quite a lot of infrastructure that's designed to listen to Kafka queues or some messaging queue of streaming data, buffer it up into reasonably sized objects, and then flush it out to storage periodically, and that's this append-only streaming model that you're probably imagining. In that case, I agree with you. It's all adds. That is a big part of what we're doing. Like I said, it's very common for data vendors to correct data that they've written, so they can say, "Remember this data I sent you yesterday? That was actually wrong, so you need to update some part of the history".
Another thing is, lots of these systems are dealing with derived data and the consequences of the source data that you've written, so you could easily have maybe software bugs or inaccuracies in the systems, writing out data downstream that you'd want to overwrite. This system isn't just dealing with raw data coming off of a feed. It's also dealing with all the intermediate steps along the way and sharing intermediate calculations with other systems.