Jon Moore on Hypermedia APIs and Distributed Monotonic Clocks


1. I’m Charles Humble. I’m here at QCon SF 2015 with Jon Moore from Comcast Cable. Jon, could you introduce yourself?

I’m Jon Moore. I’m a senior fellow at Comcast Cable, and I do a wide variety of things there, ranging from managing engineering teams to software architecture and direct technical research. In addition, I sit on the committee that looks at our contributions to the open source community to figure out how to help people contribute back, as well as a committee that helps Comcast technical folks give talks at conferences.

   

2. I know you’re involved in public APIs and that sort of stuff as well?

Yes, I definitely have done a lot of work in the API space in the past. I spent a lot of time looking into hypermedia APIs in particular. And one of the projects that one of my engineering teams maintains right now is an API management platform that we use for doing access control and capacity management for internal APIs.

   

3. What is a hypermedia API?

A hypermedia API is an API typically delivered over HTTP, what we commonly refer to as RESTful APIs, but one where control information is actually embedded in the responses from the server. So, the easiest way to think of this is that your API responses have links and forms in them that indicate to a client what the next set of possible actions are, whether that’s a subsequent API call you might want to make, or links to additional data you might want to download.

   

4. Why is the hypermedia constraint important?

Well, one of the reasons the hypermedia constraint is important is that the "REST police" will come and tell you that you don’t have a RESTful API if you’re not using the hypermedia constraint. But in all seriousness, one of the big things that the hypermedia constraint gives you is it introduces a certain degree of decoupling between servers and clients. So, in particular, it means that you have to hard code less a priori information into clients, which means that conversely on the service side you have a little bit more freedom to make changes that you know won’t break clients.

   

5. How does that prevent the client from breaking if the semantic definitions change?

So, one of the interesting things about hypermedia APIs is that of course there’s no magic bullet. If you actually change the semantics of what you’re doing, there is nothing in the world that’s really going to prevent your clients from breaking. If you change really what an API does, if I change it from let’s say changing the channel on my TV to turning my lights on and off in a home automation project, it’s going to be really difficult for a client to realize that you’ve done that.

On the other hand, there are certain mechanics of APIs, right? Lots of APIs come with an implicit recipe for how to make calls. So, in the case of web APIs that are delivered over HTTP, you typically need to know the URL you need to access, what HTTP method you want to use and then, in the case where you as a client need to send data, what the relevant pieces of data to send are. And with a hypermedia format, what you get is that some of that information is delivered at run time.

And so rather than having it hard coded into a client, some of these basic recipes are instead delivered at run time which means that again, as the servers may need to change (we all like to do continuous delivery and constantly evolve our services), then it means that clients can adapt to that without having to ship a release to the client, for example.
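As a rough illustration of that point, here is a minimal Python sketch of a client that reads the URL, HTTP method and required fields from a form in the response instead of hard coding them. The response layout and the names `links`, `forms`, `rel`, `href`, `method` and `fields` are assumptions made up for this example, not a standard hypermedia format and not anything Comcast uses.

```python
# Hypothetical hypermedia response. The keys ("links", "forms", "rel",
# "href", "method", "fields") are illustrative, not a standard format.
response = {
    "channel": 7,
    "links": [{"rel": "guide", "href": "/tv/guide"}],
    "forms": [
        {
            "rel": "change-channel",
            "href": "/tv/channel",
            "method": "PUT",
            "fields": ["number"],
        }
    ],
}


def build_request(doc, rel, **values):
    """Build (method, url, body) from a form the server sent at run time."""
    for form in doc.get("forms", []):
        if form["rel"] == rel:
            missing = [f for f in form["fields"] if f not in values]
            if missing:
                raise ValueError(f"missing fields: {missing}")
            body = {f: values[f] for f in form["fields"]}
            return form["method"], form["href"], body
    raise LookupError(f"server did not offer action {rel!r}")


# The client only hard codes the relation name and the data it wants to send.
request = build_request(response, "change-channel", number=12)
print(request)
```

If the server later moves the action to another URL or switches the method, the same client code keeps working, because it only depends on the relation name `change-channel`. A change in what the action means would still break it, as noted above.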

   

6. What does HTTP 2.0 mean for the world of hypermedia API?

I’m actually really excited about HTTP 2.0, in particular for the ability to do server push for cached resources. So, prior to that, in an HTTP 1.1 world, it was very common that if I went and fetched a resource that had a list of things in it, the response would not just have, say, the identifiers of all the objects, but would actually also embed the data for those objects in it. And this is because you wanted to make just a single HTTP request in order to build a responsive user experience.

You didn’t want to go fetch a list of IDs and then individually go and fetch each of the individual objects. However, the downside of this is that if the data for just one of those objects changes, then you have to fetch the whole list again if you’ve embedded the data into it. And so, one of the nice things about HTTP 2.0 is that you get to have your cake and eat it too, so you really can just fetch the list of IDs and the server can push you the data for each of the objects and then those can be individually cached, which means that if there’s an update to one of the objects you can actually just download the new data for that single object. And so, it really brings a lot more cache flexibility for clients where you really can start to take advantage of really deep linking and really native linked structures that can still be efficiently accessed.

   

7. How are hypermedia APIs useful in the context of micro services?

I think hypermedia APIs are incredibly useful in a micro services world. Adrian Cockcroft often says that your monolithic services have a lot of complexity baked into them. You just can’t see it, because it’s encapsulated in a single process. The problem is that when you break these modules out into their own services, now you’ve introduced a boundary that’s much harder to change. The interface between those things has become harder to change.

And this is a double-edged sword, because on the one hand it means that it’s harder to change the interface, so the interfaces tend to become more stable. On the other hand, if you didn’t happen to draw them in the right place, it gets difficult to change where that boundary is. Hypermedia APIs, by making the clients make fewer assumptions about where a particular piece of functionality might be provided, or which service they might want to talk to in order to access that functionality, give you a little more ability to refactor your set of services without breaking clients.

And now we’re in a world where mobile clients are certainly a first-class user experience, if not potentially even the primary one. You can’t necessarily turn around and ship a new update through Apple’s approval process every day, for example. And so, you really have to think about how to modify things when you can’t actually do client updates easily, and hypermedia APIs are an important tool that helps there.

   

8. Can you briefly explain your solution for time synchronization and causality?

Sure. I’ve been working on a scheme called Distributed Monotonic Clocks. And this is based on some research that some folks at SUNY Buffalo published in a tech report last year, in 2014. And the key problem that I was interested in solving is that clocks can’t be perfectly synchronized in a data center. Our applications, particularly in a micro services world where you have many interacting services just for a single client interaction, all generate log messages, and we, as many other places do, have a log analysis cluster where we bring all those application logs in to review them.

However, if the server clocks are not well synchronized, you can run into cases where log messages for a given client interaction appear out of the order in which they actually happened. If you look at their timestamps, the timestamps aren’t in the proper order. So, to some degree, the timestamps don’t reflect the causal history of which things happened before which other things in the overall system.

And so there are approaches like Lamport clocks from the distributed systems research community that solve this. They have ways of tracking which things happened before which other things, but lots of these are logical timestamps. So they say things like, “Oh, well that happened at time 42.” Okay, that’s great. Time 42 happened after time 41, but really, when was that? When you have to support things in production, 42 is not particularly meaningful to a human operator.
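For readers who have not met Lamport clocks, the standard rules fit in a few lines of Python. This is a minimal sketch; the class and method names are chosen for this example only.

```python
class LamportClock:
    """Logical clock: a counter that respects the happened-before order."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event or message send: advance the counter."""
        self.time += 1
        return self.time

    def receive(self, remote_time):
        """Message receipt: move past both our time and the sender's."""
        self.time = max(self.time, remote_time) + 1
        return self.time


a = LamportClock()
b = LamportClock()
sent = a.tick()             # a stamps an outgoing message
received = b.receive(sent)  # b's timestamp is guaranteed to be larger
print(sent, received)
```

The receive rule is what guarantees that an effect is always stamped later than its cause. The value itself is just a counter, which is why "time 42" says nothing about the time of day.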

We have this problem where the wall clock time that we get from our system clocks, is useful to humans, but isn’t synchronized so it doesn’t reflect causality. And then the systems we had for causality aren't useful for humans. And so, the research has been based on hybrid timestamps that contain components from both of these things. So from a semantic point of view, they reflect causality just like Lamport clocks do, but they have a component that tracks system time that can still be sensible to a human.

So, it’s not really about better clock synchronization. It’s more about delivering logical timestamps that have a component that’s sensible to a human. So for example, if I know there’s a problem around 3:00 A.M., I can go find the logs from that approximate timeframe and still have them come out in the right order. Now, whether it happened at exactly 3:00 A.M., or two milliseconds later, that’s not the primary concern here.
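One way to picture such a hybrid timestamp is a pair of a wall-clock component and a small counter, compared in that order. The Python sketch below follows the update rules commonly published for hybrid logical clocks. It is an assumption that this matches the scheme in the tech report mentioned above, and it is not Jon Moore's implementation. The names `HybridClock`, `wall`, `counter` and the injectable `now` function are invented for this example.

```python
import time


class HybridClock:
    """Timestamps are (wall, counter) tuples that compare lexicographically."""

    def __init__(self, now=lambda: int(time.time() * 1000)):
        self.now = now      # physical clock in milliseconds (injectable)
        self.wall = 0       # largest physical time seen so far
        self.counter = 0    # tie-breaker when wall time does not advance

    def tick(self):
        """Timestamp a local event or an outgoing message."""
        previous = self.wall
        self.wall = max(previous, self.now())
        self.counter = self.counter + 1 if self.wall == previous else 0
        return (self.wall, self.counter)

    def receive(self, remote):
        """Merge the timestamp carried on an incoming message."""
        remote_wall, remote_counter = remote
        previous = self.wall
        self.wall = max(previous, remote_wall, self.now())
        if self.wall == previous and self.wall == remote_wall:
            self.counter = max(self.counter, remote_counter) + 1
        elif self.wall == previous:
            self.counter += 1
        elif self.wall == remote_wall:
            self.counter = remote_counter + 1
        else:
            self.counter = 0
        return (self.wall, self.counter)


# Server b's system clock runs 50 ms behind server a's.
a = HybridClock(now=lambda: 1000)
b = HybridClock(now=lambda: 950)
sent = a.tick()
received = b.receive(sent)
later = b.tick()
print(sent, received, later)
```

Even though b's system clock is behind, the timestamp b assigns on receipt sorts after the one a sent, and the wall component stays close to a real time that an operator can search for. The sketch also shows the weakness of the basic scheme: `receive` takes the maximum, so a single clock far in the future pulls every node it talks to forward.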

So, my research in particular has been in trying to address some practical limitations with this scheme that’s been proposed. And so one of the things is the use of something called a population protocol for members of an overall cluster to figure out when their clocks have actually drifted away from everybody else’s. And so, that solves a particular problem where a runaway system clock could actually drag everybody’s logical time forward into the future, in the basic case that was presented in the SUNY Buffalo paper.

   

9. So, can you give some examples of population protocol implementations that are in use today?

That’s a good question. I’m not aware of protocols that take advantage of this. Population protocols, at least right now, are more of a theoretical construct that I’ve run into. But they can be used to model things like flocks of birds or schools of fish. And the general principle of population protocols is that individual participants make decisions based on local information. And yet, as a whole, there’s an emergent behavior that happens across the whole thing.

If you’ve ever seen any of the videos of starlings flocking, for example, right? Individual starlings actually make their own "which direction should I fly in" decisions based on the birds that are right around them. And yet, as a whole, the flocks form these right-on-the-edge-of-chaos kinds of formations.

I’m not aware of particular production implementations that have taken advantage of this, but this is still very much at the research stage where I’ve been able to use the population protocol to get what you might call "flocking clocks". So, the idea is that all of the servers in the cluster can use this population protocol to figure out, “Hey, am I outside the great mass of everybody else’s system notion of what time it is?”
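The interview does not spell out the interaction rule, so the following Python sketch is only a guess at the flavor of the idea, not the protocol from the talk or from the released simulator. It uses pairwise averaging: two randomly chosen nodes meet and each replaces its estimate of the group's clock offset with their average. The function name `flock`, the offsets and the `threshold` value are all assumptions made for this example.

```python
import random


def flock(offsets, interactions, seed=0):
    """Return each node's estimate of the group's average clock offset."""
    rng = random.Random(seed)
    estimates = list(offsets)   # each node starts from its own clock
    for _ in range(interactions):
        i, j = rng.sample(range(len(estimates)), 2)   # two nodes meet
        average = (estimates[i] + estimates[j]) / 2
        estimates[i] = estimates[j] = average         # purely local update
    return estimates


# Offsets in milliseconds from true time; the last node's clock has run away.
offsets = [3, -2, 0, 1, -1, 2, -3, 1, 0, 5000]
estimates = flock(offsets, interactions=500)

threshold = 1000  # hypothetical tolerance in milliseconds
drifted = [abs(own - est) > threshold for own, est in zip(offsets, estimates)]
print(drifted)
```

No node ever sees the whole cluster, yet every estimate settles near the group average, so each node can compare its own clock with that estimate and decide locally whether it is the one outside the flock. A plain average is itself pulled by the outlier, so a real design would probably want something more robust.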

   

10. How do you manage synchronization across some network of devices that are completely out of whack, like a network of sensors for instance?

So that’s very interesting. I think some of these population protocols provide some interesting techniques there. So, in the case of sensor networks, for example, we wouldn’t expect individual sensors to run NTP on a regular basis to keep tight time synchronization. However, they could use these lighter-weight population protocols to get a center of mass, if you will, for an overall estimate of what time it is. And then perhaps when you realize you’ve drifted, maybe then you might expend the effort to resynchronize with an NTP server, or something of that sort.

   

11. How does what you’re doing compare to Google’s Spanner TrueTime API where they use hardware assistance from GPS and atomic clocks to have synchronized masters?

Yes, so Google’s Spanner project, I think, solves a slightly different problem. There, what they are trying to do is get really good clock synchronization and then use that to narrow a window of uncertainty, so that they can still have consistent database updates without necessarily incurring a lot of back-and-forth coordination across a wide area network.

In this case, the distributed monotonic clocks are not really about getting better synchronization, meaning getting a tighter flock of clocks together in some sense. But rather it’s really just a way of thinking about timestamps. So it’s more about logical timestamps that have a part that’s meaningful to humans rather than getting more and more accurate system clocks.

   

12. What’s next for your distributed monotonic clock research?

There are definitely some open problems that I think are tractable. And I’ve been able to release a simulator implementation of the population protocol that I’ll be talking about here at QCon under an open source license on Comcast's GitHub account. And I’ve actually had some indications from people who would like to do some joint research to continue this forward, so I expect to see further developments on this. And I’m happy to say that at Comcast it’s possible to do those sorts of things a little further out in the open than you might ordinarily expect.

   

13. Fantastic. How do you keep up to date with the latest in distributed systems?

This is very challenging, because not only do you have to keep up to date with the latest developments in distributed systems, but of course there’s this whole wealth of historical information and research that’s out there, so it’s a little bit of a daunting task. The way that I address it is, I actually use Twitter a lot, so I primarily just follow really smart people who come and talk at conferences on distributed systems topics, as well as subscribing to their blogs. But I’m sure my news reader is just as full of unread articles as most people’s are.

But I find that that’s a good way to at least collect the set of things that I should be reading about. There are definitely folks who are avid readers in this space and, in particular, very good writers. So, Peter Bailis, for example, is a researcher who actively publishes research in distributed systems, but also has his own blog where he provides really well-written descriptions of his research that are easy to follow.

And then, there’s also a blog called The Morning Paper, which is fantastic. I really don’t know how Adrian manages to have the time to read all of those papers, much less write all of those summaries. But he often features papers there that have distributed systems concepts in them. And so those are the main ways that I try to keep abreast of what’s going on.

   

14. You mentioned the Lamport clocks paper earlier. Are there any other academic papers that have maybe particularly informed your thinking?

I have to say the Dynamo paper from Amazon is one of my favorites in the sense that I really feel like it presented a number of distributed systems concepts in a very clearly written way, and for a system that solved very practical problems. And so, I actually think for many folks, this is actually their gateway paper into modern distributed systems work, because there are so many really great concepts that are discussed there that are explained so well that it makes them accessible.

Charles: That’s great. Jon, thank you very much indeed.

Yes, thanks for having me.

Jan 19, 2016
