
Hilary Mason on and Trending Clickstreams
| Interview with Hilary Mason by Ryan Slobojan on Jan 20, 2011 |

Bio Hilary Mason is the lead scientist at, where she is finding sense in vast data sets. She is a former computer science professor with a background in machine learning and data mining, has published numerous academic papers, and regularly releases code on her personal site, She has discovered two new species, loves to bake cookies, and asks way too many questions.


1. is a service which converts long URLs to short URLs, and the naive implementation of that seems like it could just be a hash table. Is more than just a hash table in Apache?

Yes, is more than a hash table, and in fact, at this time we don’t even use hash tables anymore because we found that they didn’t scale. provides not just the long URL to short URL conversion, but we also cache real-time metrics at per-second resolution for every click on every link for all time, and we make those available through a pretty robust API and direct info page. At this time we’re doing tens of millions of new unique URLs per day and hundreds of millions of clicks on these URLs, so as you can imagine, there is quite a bit more to it than just one hash table at that kind of scale.
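As a rough illustration of what "metrics at per-second resolution for every click on every link" can mean in data-structure terms, here is a minimal sketch. It is not the service's actual implementation: the function name `bucket_clicks`, the short codes, and the use of an in-process `Counter` are all assumptions made for the example.

```python
from collections import Counter

def bucket_clicks(clicks):
    """Count clicks per (short_code, whole second).

    `clicks` is an iterable of (short_code, unix_timestamp) pairs.
    Both the input shape and the short codes are hypothetical.
    """
    buckets = Counter()
    for short_code, timestamp in clicks:
        # Truncating the timestamp groups all clicks within the same second.
        buckets[(short_code, int(timestamp))] += 1
    return buckets

sample = [("abc", 100.2), ("abc", 100.9), ("abc", 101.0), ("xyz", 100.5)]
per_second = bucket_clicks(sample)
```

At the volumes mentioned in the answer, a structure like this would have to be partitioned across machines, which is part of why a single hash table is not enough.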


2. Can you describe the architecture in a little bit more detail? What are some of the major components?

At we take a very careful and thoughtful approach to keeping as much of our system de-coupled as possible. At its very core we have the URL table with all of the URL data; that still lives in a MySQL database and it’s quite large (billions) at this point - that is one piece. We queue as much as possible, so around that system, we have other systems: we have a user history system, which is running on MongoDB, and we have a metrics system, which is running on our own real-time in-memory database and MongoDB. We also do a lot of spam and malware detection, and that’s a separate system that does intelligent classification of content as it comes through and can leave all those things. Spam and our look-ups are milliseconds on every click to make sure that we don’t serve any bad content. We have our very own info API that can preserve metadata and then even more systems around that data as well. We do a lot of queuing.


3. I guess that queuing allows you to absorb spikes more readily?

Absolutely. It allows us to absorb spikes in traffic and it also allows us to deal with pieces of our system failing.
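The spike-absorbing effect of a queue can be shown with a small simulation. This is only a sketch under assumed numbers: the function `simulate`, the arrival pattern, and the fixed per-tick capacity are invented for illustration and say nothing about the real system's queueing technology.

```python
from collections import deque

def simulate(arrivals_per_tick, capacity_per_tick):
    """Feed bursty arrivals into a queue drained at a fixed rate.

    Returns (processed, backlog_at_end, max_queue_depth).
    """
    queue = deque()
    processed = 0
    max_depth = 0
    for tick, arrivals in enumerate(arrivals_per_tick):
        # Producers enqueue immediately, even during a burst.
        queue.extend((tick, i) for i in range(arrivals))
        # The consumer handles at most capacity_per_tick items per tick.
        for _ in range(min(capacity_per_tick, len(queue))):
            queue.popleft()
            processed += 1
        max_depth = max(max_depth, len(queue))
    return processed, len(queue), max_depth

# A burst of 10 events arrives at tick 2 while the consumer handles 3 per tick.
processed, backlog, max_depth = simulate([2, 2, 10, 2, 0, 0], 3)
```

In this run the burst is never dropped: the queue grows to a depth of 7 and then drains, so all 16 events are eventually processed. The same buffering lets work pile up safely while a failed downstream component recovers.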


4. One of the recent issues, which affected another major user of MongoDB, FourSquare, involved a sharding issue, which ended up taking them down for several hours. Is a similar type of issue a concern for or is the architecture structured in such a way that that’s not a possibility?

It is not a possibility that if Mongo fails that will fail. Only two of our sub-systems run on Mongo and that said, those two are fairly robust to date. If either one of them fails, we have systems in place to catch that data and cache it and make sure we can recover elegantly. It might mean that you couldn’t search your encode history for a little while, but all of our links will always keep on working. We have multiple redundant backup systems in place to ensure that that’s the case.


5. What are some of the biggest challenges that faces from an architecture and software standpoint?

The biggest challenges really are managing the volume of data, managing the flow of data, and managing to do it all in "real time". Real time deserves quotes here because the definition I like to use of it is "relevance constrained by the temporal context". That means that a link you shared a day ago might be relevant in a real-time context, but one you shared a year ago probably is not, so the relevance decays over time. Those are our biggest architectural and engineering challenges.
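One common way to express "relevance decays over time" is an exponential decay with a half-life. The interview does not say which decay function is used, so the following is a sketch under that assumption; `decayed_relevance` and the one-day half-life are hypothetical.

```python
DAY = 24 * 60 * 60  # seconds in one day

def decayed_relevance(base_score, age_seconds, half_life_seconds=DAY):
    """Halve the score for every half-life that has elapsed (assumed model)."""
    return base_score * 0.5 ** (age_seconds / half_life_seconds)

fresh = decayed_relevance(1.0, 0)
day_old = decayed_relevance(1.0, DAY)
year_old = decayed_relevance(1.0, 365 * DAY)
```

With a one-day half-life, a link shared a day ago keeps half its score, while a link shared a year ago scores effectively zero, which matches the intuition in the answer.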


6. What kinds of data volumes are you seeing? You mentioned - was it 100 million URLs a day? What kinds of data are there?

The data we work with is records of URLs being shortened and shared, and clicks on those URLs, and these clicks come with their referral data and IP and all that. We actually crawl every page that gets shared to We have the actual page content, we cache that, we do analysis of that. The data we are working with - there are a few different types. There is all of this very clean, structured "this link got clicked on this many times" data, and there is the messy web content itself.


7. What kinds of data analyses are performed at

Currently, the first thing we do is spam and malware analysis to make sure that we’re protecting our users from malicious content, and we do that very aggressively. Secondarily, as you see on the site, we always extract the page title and we extract metadata. We have this site called, where we use the Facebook share spec to extract all the video metadata and show trending videos. We’re also looking at the content of the links themselves to say what topics these links are about, what the entities are, what the things are that people are sharing pages about, and then "Can we plot those things over time?" I can tell you things like 0.3% of all links are about the weather and 0.02% of them mention cheeseburgers.


8. You mentioned during your presentation earlier today that you could also do analysis for instance determining the likelihood that certain events have happened based on the click patterns that you are seeing. For instance with the World Cup Game where you could predict who had won based on the click patterns from each of the nations that were playing. How do you approach those sources of analysis?

There is a lot of insight to be gained from looking at really simple signals at high volume. The signal we have at is the content and the people clicking on it and sharing that content. The World Cup example is pretty cool. It turns out that while a match is being played, we see the click volume, normalized, to be pretty much equal in the two countries, but after the match is over, the winners continue to increase the volume of clicking for hours while the losers stop. Then it plummets off and they go back and do other things, hopefully pat their puppies and love their children. We can tell that that’s happening because we’re building this human story around this effect in the data.

Another similar project I did was looking at the kinds of content shared about volcanoes during that big volcano eruption back in April [2010]. We saw this huge spike in content about that particular volcano and that eruption, and then we saw a decline as people got bored, and then we saw another spike when celebrities’ travel plans were ruined. Then that human story started declining as well, and then at the very end of that we saw a little uptick, but it wasn’t about the same particular volcano. It was actually around things like "10 steps to protect your family from volcano eruptions" and "volcano sightseeing packages". So this one world event changed the nature of the human conversation on the Internet around this whole idea of volcanoes.

I find that fascinating! As to how we actually go about addressing it, it’s a pretty creative and hack-ish process. One of the research questions I’m interested in, which is not currently part of any of our products, is whether we can use the data stream to detect unusual events. So when things happen, like when Michael Jackson died, we saw links about Michael Jackson spread across every network, and by "network" I mean both social click and I mean network as in Twitter to Facebook to email to IM. It happened so rapidly that just by looking at those graphs you could tell that something major had happened in the world. The same thing happened again with the earthquakes in Haiti.

I’m really interested in trying to model when that happens so we can understand what’s going on in the world based on what people are clicking on. It is a very simple signal.
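The interview describes this as an open research question and does not give a method. As one simple baseline, a trailing-window z-score can flag time buckets whose click count is far above recent history. Everything below (the function `find_spikes`, the window size, the threshold, and the sample counts) is an illustrative assumption, not the approach actually used.

```python
from statistics import mean, stdev

def find_spikes(counts, window=5, threshold=3.0):
    """Return indexes whose count exceeds the trailing mean by more than
    `threshold` standard deviations of the previous `window` buckets."""
    spikes = []
    for i in range(window, len(counts)):
        history = counts[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            sigma = 1.0  # flat history: fall back to a unit scale
        if (counts[i] - mu) / sigma > threshold:
            spikes.append(i)
    return spikes

# Hypothetical clicks per minute on links about one topic.
clicks_per_minute = [10, 12, 11, 9, 10, 11, 10, 95, 12, 10]
spikes = find_spikes(clicks_per_minute)
```

Here only the bucket with 95 clicks is flagged. A real detector would also have to handle daily cycles and the way a large spike inflates the baseline for the buckets that follow it.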


9. It seems that with the growth of the internet and with how pervasive the internet has become, combined with the ability to do large amounts of calculations rapidly on large datasets, it’s almost like we could do something like real-time anthropology, where you can see what the society is doing and understand that society while it’s still in the process of happening. Do you see the field moving in this direction?

I feel like this is one of the new superpowers that new technologies have enabled. It’s not so much that we can do more than we used to be able to do before, but that we can do it much faster. Yes, I do absolutely think that we’re gaining an insight into human behavior that we’ve never had before. It’s going to allow us to learn things about human society and human communication that we’ve never known, and I find that really exciting.


10. What kinds of discoveries do you believe that will lead to?

I hope that will lead to a better cross-cultural understanding, though that’s a huge goal that I hope we all share. More specifically and practically, I think it can lead us to products that can absolutely help predict when these kinds of events are happening and help allocate resources for resolving them. Think of the project Google did to predict flu outbreaks based on search queries around flu symptoms. They looked at the query volume in different locations for things like "runny nose" or "I’m feeling sick" and then they rolled that up to actually show that it predicted flu clusters in the United States. If we have tools like that, we are then able to allocate our anti-flu drugs much more effectively where they’re needed. If we can do the same thing for other kinds of disasters or even for other more positive events, I think it will be a great tool and very useful.


11. How has the architecture evolved over time and what were some of the big bottlenecks that you encountered?

The architecture even in the time I’ve been studying it has changed quite a bit and grown and become more robust. Obviously, at one time there was only one URL table and it held a lot of data and that didn’t scale particularly well. That’s still in MySQL, but it’s been sharded and split in a much more intelligent way and now it’s working much better. In fact, on my very first day at, we were experiencing very high load on our database because of collisions in the hash function. actually used to use a Jenkins hash function, which you can look up on Wikipedia. It’s a very simple hash function, but there are often collisions. It’s not particularly secure in that way and not necessarily highly distributed.

So, when there was a collision, it was just adding one to the value and seeing if that was free, which meant that when there was one collision it could actually set off this chain reaction of millions of queries against that database. We had to fix that my very first day there. It’s one of those things where when you are writing it in 10 lines of code it seems perfectly adequate, just to call whatever your favorite hash is and then use that as the output and increment if that doesn’t work, but when you are doing it at volume with a lot of stored data, it doesn’t work so well. We’ve since gone to a much more intelligent hash allocation scheme.
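The increment-on-collision scheme described above is essentially linear probing, and the cost of one insert grows with the length of the occupied run it lands in. The sketch below uses a Python dict as a stand-in for the URL table and counts lookups in place of database queries; the function names, the key space of 1000, and the example URLs are all hypothetical.

```python
def insert_with_increment(table, start_slot, url, space):
    """Store url at start_slot, adding one on every collision.

    Returns (final_slot, lookups), where each lookup stands in for
    one query against the URL table.
    """
    slot = start_slot % space
    lookups = 1
    while slot in table and table[slot] != url:
        if lookups >= space:
            raise RuntimeError("key space exhausted")
        slot = (slot + 1) % space
        lookups += 1
    table[slot] = url
    return slot, lookups

def demo():
    space = 1000
    # A contiguous run of 100 occupied slots, as builds up once collisions cluster.
    table = {i: f"http://example.com/{i}" for i in range(100)}
    # A new URL whose hash lands at the start of the run must walk all of it.
    return insert_with_increment(table, 0, "http://example.com/new", space)

final_slot, lookups = demo()
```

A single insert here costs 101 lookups and makes the run one slot longer, so every later collision in that region gets more expensive. That is the chain reaction the answer describes.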


12. What types of load and scale testing are done at and how do those help you to have confidence in the code before you push it out and before users discover problems?

The technique that we primarily use with new systems or revisions to systems is that we run them in parallel with the old systems, handling our production volume, and we make sure that they experience a few bursts so that we know they can stand up to that volume before we roll off the old systems and rely on the new ones. That technique works very well, but of course, it requires you to have lots of extra servers sitting around.


13. With the hardware that’s present at is all of that requisitioned hardware that’s purpose-built for datacenters, or is there some utilization of Cloud services? How do you approach that need?

We have a managed co-located datacenter and use that for all of our core systems and then we use machines that are pretty much in EC2 as necessary for non-core systems. So a lot of the data analysis infrastructure runs in EC2, which allows us to just spin it up and down as necessary and a lot of our non-core but nice to have machines and data storage is up in EC2 as well.


14. What would you say are three of the biggest lessons that you’ve learned since you started at

The biggest lesson I’ve learned by far is that you really have to have a great team that’s very collaborative and made up of people who are smart and know what they are doing, and we definitely do. The team at is fantastic. More technically, I’ve learned that a lot of people are starting to address the same kinds of problems and yet we often don’t really collaborate on solutions. I hope that conferences like this one can help with that. I do a lot of machine learning on web-scale data, if you have to put a title on it, and so do plenty of other people, and yet there are very few resources out there to help people who want to start, to keep them from making the same mistakes we all made, and to expose some of those dirty secrets to the world.

Finally, I think I’ve learned that working at this scale becomes a very challenging engineering problem. More literally, we’re at the point where we’re moving our logs around. That is an operation that requires thought. To have to think about these things in advance, really requires you to understand the entire system before you make decisions. And so it changes the process with which you build new things.


15. With the large volume of data that you have at, what are some of the numbers around that? Are you storing several petabytes or exabytes? How do you approach that storage?

We’re not storing that much yet. The amount of data we keep on hand to work with is in the terabytes. Fortunately, the data that we keep forever is very small and so it’s not too much of a challenge to work with, but we do a combination of in-memory and on-disk storage. We use a lot of EBS volumes and, to be entirely honest, it’s still a little bit hacky and I really do hope we can improve.


16. Thank you very much.

Thank you.
