Large Scale Event Tracking with RabbitMQ

Goodgame Studios is a German company which develops and publishes free-to-play web and mobile games. Founded in 2009, their portfolio now comprises nine games with over 200 million registered users.

To continuously improve the user experience of the games, it is crucial to analyze the impact of new features, prime times and game tutorials. To make this possible, specific player actions and events are recorded, stored in a central data storage and evaluated by data analysts. We refer to these as tracking events.

With growing user counts, the number of tracking events has grown to an impressive 130 million per day, with up to 4000 events per second during peak times.
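As a rough back-of-the-envelope check, the daily total can be converted into an average rate and compared with the peak rate. The sketch below uses only the two figures quoted above; the class and method names are illustrative and not part of our system.

```java
// Back-of-the-envelope check of the event rates quoted above.
// Class and method names are illustrative only.
class EventRate {

    static final long SECONDS_PER_DAY = 24L * 60 * 60; // 86,400

    /** Average events per second for a given daily total. */
    static double averagePerSecond(long eventsPerDay) {
        return (double) eventsPerDay / SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        long eventsPerDay = 130_000_000L; // figure from the article
        long peakPerSecond = 4_000L;      // figure from the article

        double average = averagePerSecond(eventsPerDay);
        System.out.printf("average: %.0f events/s%n", average);
        System.out.printf("peak / average: %.1f%n", peakPerSecond / average);
    }
}
```

This works out to roughly 1,500 events per second on average, so the peak load is a little less than three times the average.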

In this article we will describe the tracking architecture that has been developed at Goodgame Studios to deal with this challenge, and outline the technology stack used and problems encountered.

Challenges and Requirements

Tracking events are triggered from various sources, such as game clients (browser or mobile devices), game servers or landing pages. A common requirement for all event sources is that sending the events does not affect performance and is not perceptible by the user.

Whereas on the source side, we can have millions of clients dispersed over the whole world, on the target side, we have only one target, the data storage, where the events are made available for data analysts. As one can imagine, the target could easily become a bottleneck, and a situation could arise where the sources are unable to offload their events fast enough. At some point this could result in reduced performance and thus, reduced user satisfaction.

This means it is necessary to establish a buffer between source and target, in order to decouple event production from event consumption. We opted for a buffer in the form of message broker queues. Due to the geographically distributed sources, we deployed the queues likewise in a geographically distributed manner to keep network latency from the sources to a minimum.

In order to allow quick reaction to peaks we chose a cloud based solution, and deployed our queues to the AWS Cloud.

The advantage of this solution is that even during peak times, sources can offload their events very quickly, without needing to queue them locally. If the target (i.e. the database or data store) cannot keep up with handling the events during peak times, they are temporarily queued in the message broker and can then be processed gradually.
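The decoupling idea can be illustrated with a minimal in-process sketch, in which a java.util.concurrent queue stands in for the message broker and a plain list for the data storage. All names here are hypothetical; the real setup uses RabbitMQ queues rather than an in-memory queue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// In-process stand-in for the broker queue; a sketch, not the production setup.
class BufferedTracking {

    // Unbounded queue playing the role of the message broker.
    private final BlockingQueue<String> buffer = new LinkedBlockingQueue<>();
    // Stand-in for the data storage.
    private final List<String> storage = new ArrayList<>();

    /** Source side: hand the event over and return immediately. */
    void publish(String event) {
        buffer.offer(event); // does not block on an unbounded queue
    }

    /** Target side: store at most maxEvents per call, at the target's own pace. */
    int drain(int maxEvents) {
        List<String> batch = new ArrayList<>();
        int drained = buffer.drainTo(batch, maxEvents);
        storage.addAll(batch);
        return drained;
    }

    int buffered() { return buffer.size(); }

    int stored() { return storage.size(); }

    public static void main(String[] args) {
        BufferedTracking tracking = new BufferedTracking();
        // Peak: 10 events arrive at once, the target handles only 4 per round.
        for (int i = 0; i < 10; i++) {
            tracking.publish("event-" + i);
        }
        while (tracking.buffered() > 0) {
            int stored = tracking.drain(4);
            System.out.println("stored " + stored + ", still buffered " + tracking.buffered());
        }
    }
}
```

The source never waits for the target. The backlog simply grows during the peak and shrinks again as the target catches up.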

Now we have our events in geographically distributed message brokers, and need to get them into the local data storage to make them available for data analysts.

To transfer the messages from the AWS queues to the data storage, we need consumers.

Due to some internal restrictions as well as for performance reasons, we opted not to directly connect the consumers to the cloud hosted message brokers, but to introduce an additional layer of message brokers inside our local data center. We assumed this would allow better performance during peak times due to the two-stage traffic regulation.

Message Broker

As the sources are written in various programming languages (e.g., Java, PHP, Flash), we needed a message broker with a communication protocol that all of those languages could handle. Our choice finally fell on RabbitMQ, a message broker whose server is implemented in Erlang, which offers client libraries for different languages (e.g., Java or PHP) and uses AMQP (Advanced Message Queuing Protocol) as its protocol (although HTTP is also possible).

Additionally, RabbitMQ offers the Shovel plugin, which facilitates the task of transferring events from one broker (AWS Cloud) to another (local).

RabbitMQ allows different types of message routing, such as work queues, publish/subscribe, routing or topics. They are described in more detail, with tutorials, on the RabbitMQ Get Started page.

At Goodgame, we use topic exchanges, where messages are routed to predefined queues based on their routing key.

In this scenario, the producer sends a message with a routing key to a defined exchange in the broker. Based on the routing key, the exchange decides which queue(s) the message should be routed to. This has the advantage that the producer does not need to know which queue or consumer the message is sent to. Furthermore, it allows us to handle different event types flexibly, as each event type has its own routing key. If we would like to temporarily exclude a specific event type from processing, we simply change the bindings of the exchange accordingly and route that event type to a trash queue.
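To make the routing behaviour more concrete, here is a simplified in-memory sketch of topic matching. It follows the AMQP topic rules, where `*` stands for exactly one word and `#` for zero or more words. This is not RabbitMQ's implementation, and the routing keys and queue names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of a topic exchange; routing keys and queue names are made up.
class TopicRouting {

    // binding pattern -> queue name (one queue per pattern, to keep the sketch short)
    private final Map<String, String> bindings = new LinkedHashMap<>();

    /** Adds a binding, or repoints an existing pattern to another queue. */
    void bind(String pattern, String queue) {
        bindings.put(pattern, queue);
    }

    /** Returns the queues whose binding pattern matches the routing key. */
    List<String> route(String routingKey) {
        List<String> queues = new ArrayList<>();
        for (Map.Entry<String, String> binding : bindings.entrySet()) {
            if (matches(binding.getKey(), routingKey) && !queues.contains(binding.getValue())) {
                queues.add(binding.getValue());
            }
        }
        return queues;
    }

    /** Topic matching: words are separated by dots, '*' is one word, '#' is zero or more. */
    static boolean matches(String pattern, String routingKey) {
        return matches(pattern.split("\\."), 0, routingKey.split("\\."), 0);
    }

    private static boolean matches(String[] p, int i, String[] k, int j) {
        if (i == p.length) {
            return j == k.length; // pattern used up: key must be used up too
        }
        if (p[i].equals("#")) {
            // '#' may absorb any number of words, including none
            for (int next = j; next <= k.length; next++) {
                if (matches(p, i + 1, k, next)) {
                    return true;
                }
            }
            return false;
        }
        if (j == k.length) {
            return false; // pattern expects another word, key has none left
        }
        return (p[i].equals("*") || p[i].equals(k[j])) && matches(p, i + 1, k, j + 1);
    }

    public static void main(String[] args) {
        TopicRouting exchange = new TopicRouting();
        exchange.bind("game.login", "storage");
        exchange.bind("game.tutorial.*", "storage");
        System.out.println(exchange.route("game.tutorial.step1"));

        // Temporarily exclude tutorial events by pointing their binding at a trash queue.
        exchange.bind("game.tutorial.*", "trash");
        System.out.println(exchange.route("game.tutorial.step1"));
    }
}
```

The producer code stays the same in both cases. Only the binding changes, which is what makes excluding an event type a pure configuration matter.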

The following code snippet shows, in Java, how simple it is to send a message to a RabbitMQ server using the AMQP basic publish. In this example, we assume that the server is running on localhost, with a predefined exchange named tracking and no further authentication setup.

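import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.TimeoutException;

public class EventProducer {

   // Name of the predefined exchange on the broker
   private static final String EXCHANGE = "tracking";

   public static void main(String[] args) throws IOException, TimeoutException {

      ConnectionFactory connectionFactory = new ConnectionFactory();
      connectionFactory.setHost("localhost");

      // try-with-resources assumes a recent Java client (5.x), in which
      // Connection and Channel can be closed automatically
      try (Connection connection = connectionFactory.newConnection();
           Channel channel = connection.createChannel()) {

         String routingKey = "some.key";
         String message = "Hello World!";

         // The exchange routes the message based on the routing key
         channel.basicPublish(EXCHANGE, routingKey, null,
               message.getBytes(StandardCharsets.UTF_8));
      }
   }
}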

Further examples, also in other programming languages, can be found on the RabbitMQ website.

To transfer messages from the brokers in the cloud to the local brokers, we use the Shovel plugin provided by RabbitMQ. This plugin allows the user to configure shovels, which consume messages from one RabbitMQ queue and publish them to another RabbitMQ broker (exchange or queue). They can either run on a separate RabbitMQ server or on the destination or source broker. In our case, they run on the source (cloud) broker.

Putting it all together, the figure below outlines the high level tracking architecture as described above.

Issues Encountered

One of the main problems we encountered in our production environment was an issue with blocked connections. RabbitMQ comes with a flow control mechanism, which blocks connections either when they publish too fast (i.e. faster than routing can happen), or when memory usage exceeds a configured threshold (memory watermark).

When connections are blocked, clients cannot publish any more messages. During peak times, we had connections that were under flow control for up to two hours.

Another issue we encountered was that from time to time one of the RabbitMQ servers would crash inexplicably, without notice and without further information in the log files.

At first it was not possible to identify any kind of pattern causing these crashes, except that at the moment of the crash, a considerable number of messages were in the queue, in the state “unacknowledged”. However, the number would vary from several thousand to several million. One day everything would run smoothly with several million messages unacknowledged in the queue, another day, the broker would crash with only a few hundred thousand.

After some rather tedious analysis, we finally found out that the cause of the crashes was actually the Shovel plugin. In our configuration, we had not set the prefetch_count [1].

With the version of the plugin we were using, not setting this value implies that an unlimited number of messages are prefetched (i.e. all that are in the queue). This meant that the shovel had an unlimited number of messages in its memory, but could not publish them, as the destination broker was blocking connections due to flow control.

If, at the same time, the connection between source and destination broker was lost (e.g. due to network instability), the shovel crashed and took the whole RabbitMQ server down with it.

Once we set the prefetch_count to a value of 1000, the brokers ran much more stably.
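The effect of the prefetch limit can be sketched with a small in-memory model. It is a simplification written for illustration, not the Shovel plugin's code, and all names in it are hypothetical. As in AMQP's basic.qos, a prefetch count of 0 is taken to mean "no limit". With the RabbitMQ Java client, the corresponding consumer-side setting is Channel.basicQos(int).

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Simplified in-memory model of a consumer prefetch limit; names are illustrative.
class PrefetchModel {

    private final Deque<String> queue = new ArrayDeque<>();   // waiting in the broker queue
    private final Deque<String> unacked = new ArrayDeque<>(); // held by the consumer, not yet acknowledged
    private final int prefetchCount;                          // 0 means "no limit" in this model

    PrefetchModel(int prefetchCount) {
        this.prefetchCount = prefetchCount;
    }

    void enqueue(String message) {
        queue.add(message);
    }

    /** Delivers as many messages as the prefetch limit allows; returns how many were delivered. */
    int deliver() {
        int delivered = 0;
        while (!queue.isEmpty() && (prefetchCount == 0 || unacked.size() < prefetchCount)) {
            unacked.add(queue.poll());
            delivered++;
        }
        return delivered;
    }

    /** The consumer acknowledges its oldest message, which frees one prefetch slot. */
    void ack() {
        unacked.poll();
    }

    int inFlight() {
        return unacked.size();
    }

    public static void main(String[] args) {
        for (int prefetch : new int[] {0, 1000}) {
            PrefetchModel model = new PrefetchModel(prefetch);
            for (int i = 0; i < 5000; i++) {
                model.enqueue("event-" + i);
            }
            // The destination is blocked, so nothing gets acknowledged.
            model.deliver();
            System.out.println("prefetch=" + prefetch + " -> in flight: " + model.inFlight());
        }
    }
}
```

Without a limit, the consumer ends up holding the entire backlog while the destination is blocked. With a limit of 1000, it never holds more than 1000 unacknowledged messages, however long the queue gets.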

Setting this count also helped with our blocked connections problem. At a given memory threshold, RabbitMQ starts paging messages from memory to disk. However, this does not seem to happen with unacknowledged messages. Such huge numbers of unacknowledged messages thus filled our memory, which then triggered flow control and resulted in blocked connections.

Finally, we also had to improve the performance of our final consumer, the data storage, to avoid too many messages queuing in the broker, as we realized that the fewer messages were queued, the better the RabbitMQ queues did their job.

Outlook and Conclusion

In this article, we have presented Goodgame Studios’ architecture for event tracking. The benefits of gathering these events are diverse. On the one hand, they provide game designers and game balancers with a valuable tool for their work. The event data helps them answer questions such as whether players regularly quit the game at a specific quest, or how a new feature that has been implemented is performing. The insights gained are used to improve the gameplay and user experience.

On the other hand, they are a powerful tool for marketing specialists. Specific events make it possible to identify which marketing channel a new player was acquired through, and thus allow a constructive adaptation of marketing strategies and channels.

Finally, they can be used by the developer, for example to measure and improve performance of loading times or to identify and adapt to the mobile devices used.

The architecture described provides the following advantages:

  • Two-stage traffic regulation: if an outage (planned or unplanned) occurs at the local data center, the brokers in the cloud can still buffer events.
  • Brokers in the cloud can be scaled much more easily than in a local data center. This allows fast reactions to short-term traffic increases.

However, there are also some disadvantages:

  • Near-real-time data analysis is difficult to achieve: due to the many steps in the process, it takes some time before events actually arrive at the final data storage.
  • Error analysis is difficult due to the numerous intermediate steps.

Based on the lessons learned, we will further adapt our architecture to satisfy upcoming needs.

A first step will be to replace the local brokers and shovel plugin with custom-built Java consumers, consuming the events directly from the AWS cloud and storing them directly to HDFS.

These will be implemented in order to provide easy scalability, high availability and performance.

We opted for custom consumers as we need to apply some transformation and validation to incoming events.

However, in the future one could also consider other solutions, for example the Apache Flume framework, which integrates smoothly with RabbitMQ and the Hadoop framework.


[1] This value limits how many messages can be delivered to a consumer (the shovel in our case) before it has acknowledged them as successfully handled.

About the Author

Dr. Claire Fautsch is Senior Server Developer at Goodgame Studios, where she works in the Java core team and is also involved in the data warehouse project. Previously, she was employed as an IT Consultant in Zurich and Hamburg and as a Research and Teaching Assistant at the University of Neuchâtel (Switzerland). There, Dr. Fautsch also obtained her PhD in Computer Science on the topic of information retrieval as well as her bachelor’s and master’s degrees in mathematics. She enjoys exploring new technologies and taking on new challenges.
