Kevin Montrose on the History and Mistakes of the StackExchange API
Creating a public API for an existing website is always a risky venture, and StackExchange’s open editing policy makes it even riskier than most. In a recent series of articles, Kevin Montrose talks about what decisions went into the StackExchange API and what lessons they learned along the way.
When designing the API, they began by laying out goals and constraints. For example, one of their principal goals was to “eliminate the need to scrape our sites.” The content licensing agreement that StackExchange members agree to, cc-wiki, allows any third party to reuse the content on their own sites, but before the API there wasn’t a good way to access it in bulk. One of their biggest constraints was that the API would be read-only. Kevin explains,
Philosophically, write is incredibly dangerous. Not just in the buggy-authentication, logged in as Jeff Atwood, mass content deleting sense; though that will keep me up at night. More significantly (and insidiously) in the lowered friction, less guidance, more likely to post garbage sense.
We do an awful lot to keep the quality of content on the Stack Exchange network very high (to the point where we shut down whole sites that don’t meet our standards). A poorly thought out write API is a great way to screw it all up, so we pushed it out of the 1.0 time-frame. It looks like we’ll be revisiting it in 3.0, for the record.
Some of the key design points covered by Kevin are:
- Vectorized Requests: “almost everywhere we accept an id we’ll accept up to 100 of them”
- Compressed Responses: All responses are compressed using GZIP, even if the client doesn’t ask for it
- Sorting and Filtering: “Most endpoints accept sort, min, max, fromdate, and todate parameters to craft these queries with.”
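The points above can be combined in a single call. The sketch below is illustrative, not an official client: the host and version segment are placeholders, though the semicolon-delimited id list and the sort/min/fromdate/todate parameters follow the conventions the article describes. The GZIP step is simulated locally to show that a raw client must always be prepared to decompress the body.

```python
import gzip
from urllib.parse import urlencode

# Vectorized request: up to 100 ids in one call, joined here with
# semicolons (the delimiter used by the Stack Exchange API).
ids = [1, 11, 111]
params = {
    "sort": "votes",         # sorting/filtering parameters from the article
    "min": 10,               # e.g. only questions with score >= 10
    "fromdate": 1293840000,  # Unix timestamps bounding the window
    "todate": 1296518400,
}
# Host and version segment are illustrative placeholders.
url = "https://api.stackoverflow.com/1.0/questions/%s?%s" % (
    ";".join(str(i) for i in ids),
    urlencode(params),
)
print(url)

# Responses are GZIP-compressed whether or not the client asked,
# so decompression is unconditional. A real response body is
# simulated here with a locally compressed payload.
body = gzip.compress(b'{"questions": []}')
decoded = gzip.decompress(body).decode("utf-8")
print(decoded)
```

Batching 100 ids into one request and forcing compression both serve the same end: fewer, smaller round trips against a read-heavy public API.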
But what is more interesting are the mistakes they made. For example, their decision to return totals by default.
Total is useful for rendering paging controls, and count(*) queries (how many of my comments have been up-voted, and so on); so it’s not that the total field itself was a mistake. But returning it by default definitely was.
The trick is that while total can be useful, it’s not always useful. Quite frequently queries take the form of “give me the most recent N questions/answers/users who X”, or “give me the top N questions/answers owned by U ordered by S”. Neither of these common queries care about total, but they’re paying the cost of fetching it each time.
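The trade-off Kevin describes can be made concrete with a hypothetical response wrapper; the field names below mirror the article’s description, not any exact Stack Exchange payload.

```python
import math

# Hypothetical response wrapper -- "total" is returned by default.
response = {
    "total": 1234,
    "page": 1,
    "pagesize": 30,
    "questions": [],   # the items the caller actually asked for
}

# Where total earns its keep: rendering paging controls.
last_page = math.ceil(response["total"] / response["pagesize"])
print(last_page)  # 42

# A "give me the most recent N" query never reads total, yet the
# server still paid for the count(*) needed to produce it.
recent = response["questions"][:10]
```

Making the field opt-in would have let the common "top N" queries skip that count entirely.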
Another problem is the use of implicit types. Instead of explicitly saying what data type is being returned, the client developer has to infer it from the fields that are present. This is annoying in any language, but especially problematic for statically typed languages where one wants to map those unnamed types to actual classes.
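A sketch of the inference burden implicit types place on clients: the object’s kind must be deduced from which fields happen to be present. The field names here are illustrative assumptions, not the actual API schema.

```python
# Without an explicit type tag, clients resort to duck-typing the
# payload. Order matters: an answer may also carry a question_id,
# so the more specific field is checked first.
def classify(obj: dict) -> str:
    if "answer_id" in obj:
        return "answer"
    if "question_id" in obj:
        return "question"
    if "comment_id" in obj:
        return "comment"
    return "unknown"

print(classify({"question_id": 7, "title": "..."}))  # question
print(classify({"score": 3}))                        # unknown
```

In a statically typed client this guesswork has to happen before deserialization can even pick a class, which is exactly the pain the article points at.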
We will skip over a few of the other mistakes and conclude with the last one discussed by Kevin: wasteful request quotas. To prevent excessive API usage, they used the same quota system exposed by the Twitter API.
This turns out to be pretty wasteful in terms of bandwidth as, unlike Twitter, our quotas are quite generous (10,000 requests a day) and not dynamic. As with the total field, many applications don’t really care about the quota (until they exceed it, which is rare) but they pay to fetch it on every request.
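A back-of-the-envelope calculation shows why this adds up. The quota snippet below is an assumed stand-in for what each response might carry, not a measured payload, and the daily request figure is the quota quoted above.

```python
# Illustrative quota fields echoed on every single response;
# the exact bytes are an assumption for the arithmetic.
quota_fields = b'"max":10000,"remaining":9876,'
requests_per_day = 10_000  # the generous daily quota

# Uncompressed bytes spent per client per day on data that most
# applications never read.
wasted_bytes = len(quota_fields) * requests_per_day
print(wasted_bytes)
```

Per client it is small change, but multiplied across every application that maxes out its quota, the server is repeatedly paying to serialize fields almost nobody consumes.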