From Monolith to Multilith at ticketea
ticketea is an online ticketing platform present in Spain, Germany and the UK, among others. We like to think of ourselves as technological partners of event organizers, helping them and their attendees through the whole lifecycle of an event. To understand some of the architectural decisions behind ticketea, it's important to note that the ticketing business has sudden spikes in traffic. When a hot event goes on sale, fans are willing to crush your servers.
ticketea’s product team currently has 16 members in charge of its development and maintenance. Out of those 16 members, there are 3 designers, one QA and the rest are developers. Developers are usually full-stack or quite multifaceted, although there are some specialists in the team. We don't have any sysadmins, mainly for two reasons: we rely on some AWS services for hosting our projects, and we all do our own devops. It is nobody's responsibility to deploy or provision machines, to work on the orchestration system or to improve internal development tools, because it's everyone's responsibility.
Initial satellites, the big monolith
ticketea is now in its sixth year of life and, like many startups, started with a simpler, more modest platform. Over the years, as it kept growing, some parts had to be redone to cope with increasing demand. Other parts have been refactored to improve robustness and quality. This has allowed us to reach the high availability that makes organizers rely on us for selling large events.
Three years ago ticketea was basically a monolith, an all-in-one solution that was designed this way due to some constraints and advantages at that time. The basic constraints were the size of the team and money. The advantages were a reduced time to market for new features, easy deployments, a small and cheap infrastructure, and the fact that most members of the team at the time had a full picture of the platform.
We basically had an API and a frontend web application, which is better than having everything in one single web application. Having a separate API was already a big head start. At the beginning of 2013, we had to create a business intelligence solution that fit our needs and thus we created Odin, which was more like a satellite to this monolith. It wasn't really using the API to any important extent. We later realized that satellites are usually the first sign of needing to move towards a service-oriented architecture (SOA).
After Odin, we developed Heracles, a background task execution system that relied on a RabbitMQ cluster and was using Python Celery at that time. This way we deprecated our previous custom-made Ruby task system, which wasn't up to the number of tasks and granularity needs of the new upcoming workload.
Change of architecture, going distributed
However, these previous projects, aside from setting the precedent of using other programming languages and starting to distribute some parts, were only scratching the surface. One of the main challenges that the team faced was to start breaking apart this monolith, which wasn't at all easy. We now internally call this "from the monolith to the multilith". A multilith - we are not sure if we made this term up or heard it somewhere else - is what you get when you start breaking your huge stone into smaller stones, although usually the biggest one at the beginning is still the biggest one at the end.
We like to emphasize that what we had at our beginning wasn't bad or wrong - we've seen developers at conferences state that everything was crap and that they brilliantly had to start pretty much from scratch. Even if this were true, you are actually starting over with a lot more knowledge of your problem domain, which will usually have a positive impact on your new design. Obviously, years of startup development involve legacy code, technical debt and other issues, but the engine had to keep running while we were tearing apart this monolith. Therefore rewriting the whole system wasn't an option for us. We had to tackle problems one at a time.
There were some lengthy discussions about where to start and how to do this, and finally we decided to create a new project for our venue access control system. ticketea sells tickets that contain a QR code, and we also provide a venue access control app named Checkpoint (for iOS and Android) that is able to download your validation session, scan these codes and help you dispatch your event lines. Checkpoint was running against our API. At the time, this was one API to rule them all: a single repository, a single PHP project, and it was big, really big.
We created a new repository named Thor (yes, we have a thing for Norse and Greek gods). This was our first Python API at ticketea, built entirely with Django and Django REST framework. You've probably read a lot about rewriting software and what it implies, but for us rewriting this component turned out very well; it was a major success. To be honest, we didn't simply do a full rewrite. The data model didn’t change much and the API endpoints stayed the same, but the internals were overhauled to better handle concurrency and large events, such as festivals in Spain with hundreds of thousands of attendees. The split was quite easy because validation and inventory were already fairly separate, which made them good candidates for semantically different APIs.
However, the inventory (events, sessions, tickets, etc.) and the access control system shared data that needed to be kept in sync, but how? One of the major components we relied on for building Thor was our queue system. We decided that the venue access control API would receive notifications of events happening in the inventory API through RabbitMQ. Thor therefore had workers consuming the messages published to some RabbitMQ exchanges and synchronizing its data accordingly.
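The queue-based synchronization described above can be illustrated with a minimal sketch. It is not Thor's actual code: the message shape (`ticket_id`, `version`, `status`) and the in-memory `tickets` store are assumptions made for illustration, standing in for the real payloads and the Django models. The point it shows is that a consumer should tolerate duplicate or out-of-order deliveries, which are common with message queues.

```python
import json

# Hypothetical in-memory stand-in for the access control database.
tickets = {}


def apply_inventory_event(raw_message: bytes) -> bool:
    """Apply one inventory notification; return True if local state changed.

    The fields "ticket_id", "version" and "status" are assumed for this
    sketch and are not taken from the real inventory API.
    """
    event = json.loads(raw_message)
    ticket_id = event["ticket_id"]
    current = tickets.get(ticket_id)

    # Ignore duplicates and out-of-order deliveries: only newer versions win.
    if current is not None and current["version"] >= event["version"]:
        return False

    tickets[ticket_id] = {
        "version": event["version"],
        "status": event["status"],
    }
    return True
```

Because applying the same message twice leaves the state unchanged, a worker built this way can safely reprocess messages after a crash or a redelivery.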
High Availability concerns
High availability has lately become a hot buzzword; many developers bring it up as a sign of how well their team is handling growth. However, being highly available in a distributed system implies a mind shift and comes at a cost. For example, the architecture discussed before posed several questions:
- What if the API thread that queues a message dies right after queueing it but before committing its transaction? Well, distributed ACID is not easy, so you end up in a new world named BASE (Basically Available, Soft state, Eventual consistency), which is uglier than ACID, but necessary. To be eventually consistent, you need to have a way to re-sync systems and/or find and clean inconsistencies.
- What if RabbitMQ is down? Well, one of the often quoted advantages of distributing systems is that you can avoid a single point of failure. However, distributed systems are complex, and when they fail they sometimes do so in a domino effect, and other times multiple parts fail together. So we decided that Thor should have a synchronous REST API too. If our inventory API couldn't queue a message, it would simply call Thor over HTTP as a fallback.
- What if everything fails? No RabbitMQ and no HTTP? That probably makes for a day full of fun. You need to be ready for the worst. That's why we created a migrator: a piece of software able to scan part of the inventory database and check it against Thor's database, finding and fixing inconsistencies. That's why we always say that even if you have an A team you need a plan B.
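The last two answers in the list above (fall back from the queue to HTTP, then reconcile whatever still got lost) can be sketched roughly as follows. This is an illustrative sketch, not ticketea's implementation: the function names, the transport callables, the use of `ConnectionError` to signal an unavailable transport and the dictionaries standing in for both databases are all assumptions. It also assumes the inventory is the source of truth.

```python
from typing import Callable, Dict

Event = Dict[str, str]


def notify_access_control(
    event: Event,
    publish_to_queue: Callable[[Event], None],
    post_over_http: Callable[[Event], None],
) -> str:
    """Deliver an event, preferring the queue and falling back to HTTP.

    Returns "queue", "http" or "failed" so the caller can record the outcome.
    Both transports are hypothetical callables that raise ConnectionError
    when they cannot deliver.
    """
    for name, send in (("queue", publish_to_queue), ("http", post_over_http)):
        try:
            send(event)
            return name
        except ConnectionError:
            continue  # Transport unavailable, try the next one.
    return "failed"  # Nothing worked; left for the reconciliation pass.


def reconcile(inventory: Dict[str, str], access_control: Dict[str, str]) -> int:
    """Make access_control match inventory; return the number of fixes applied.

    Assumes the inventory is the source of truth. Both arguments map a
    ticket id to its status and stand in for the two real databases.
    """
    fixes = 0
    for ticket_id, status in inventory.items():
        if access_control.get(ticket_id) != status:
            access_control[ticket_id] = status  # Missing or stale entry.
            fixes += 1
    for ticket_id in list(access_control):
        if ticket_id not in inventory:
            del access_control[ticket_id]  # Orphaned entry.
            fixes += 1
    return fixes
```

Running `reconcile` a second time on already consistent data applies no fixes, which makes it safe to schedule periodically or to run by hand after an incident.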
Not only do developers need to be aware of how to handle these situations in a high availability system, but your devops toolchain also becomes more complicated and your running costs rise.
Technical implications of going distributed
It’s a good idea to have monitoring in place before going distributed. In fact, we would say it’s a prerequisite. If you are not measuring your system, you will hardly be prepared to handle a distributed system.
Be aware that you will be handling more projects, which usually implies more machines than before, more technology stacks, more inter-dependencies and other problems. Without monitoring you will be blind to how your system is behaving, as you now have more moving parts than ever before. When you have everything bundled together, it’s usually a matter of pinging the machine to check your system’s heartbeat. However, when you have a distributed system it’s not enough to know that your system is running; you should also be aware of network problems, the status of the services you rely upon, etc.
Even if this is the case, and you are aware of all the moving parts, you can’t easily state that the platform is fully working. Maybe your queues are working but are flooded, or maybe your workers cannot cope with the tasks they receive. To have everything under control you’ll have to take care of:
- Centralized logging: In a distributed system, requests tend to go through different services. In order to track issues among different systems or trace a user for metric purposes, we use a unique token that is stored in all the logs. All these logs are collected by rsyslog and processed by Logstash, allowing us to search through bazillions of logs.
- Error handling: At ticketea we use Sentry for logging code stack traces, which allows us to find and fix bugs proactively.
- Graphing: We also log metrics in Graphite using StatsD, and these metrics are displayed in friendly dashboards made with Grafana. For example, we can easily see the response time of our payment providers over time and check if there have been issues, which sometimes leads us to switch providers.
- Alerting: We didn't want to be staring at these fancy dashboards all day. Since we had plenty of good metrics (number of purchases, background job queue sizes, cron job heartbeats, etc.) and knew what thresholds to use to find issues, we looked for an alerting system and found Cabot. Now we could get woken up at night by an SMS when things were failing.
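The unique token mentioned under centralized logging can be illustrated with Python's standard `logging` module. This is a minimal sketch under stated assumptions, not ticketea's setup: the names `request_id_var`, `RequestIdFilter`, `build_logger` and `start_request` are hypothetical, and how the token travels between services (for instance in an HTTP header or a message property) is left out.

```python
import contextvars
import io
import logging
import uuid

# Holds the token of the request currently being handled (hypothetical name).
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request token to every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True  # Never drop the record, only enrich it.


def build_logger(name, stream):
    """Create a logger whose lines start with the request token."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(request_id)s %(name)s %(message)s")
    )
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def start_request(incoming_token=None):
    """Reuse the caller's token if one was sent, otherwise create a new one."""
    token = incoming_token or uuid.uuid4().hex
    request_id_var.set(token)
    return token
```

If every service reuses the token it receives, searching the collected logs for that single value returns the whole path of a request across projects.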
With all of the above in place you will be able to tell when things fail but not always why. Debugging and fixing issues is now harder than in a monolith. Some issues might not stem from a specific project but from somewhere in “no man’s land”, for example connectivity, or the issue may lie in the software you are communicating with. In those cases developers need to step out of their comfort zone and look into other projects.
A distributed system is also harder to replicate in a development environment. At ticketea we use Vagrant with Ansible to provision a development environment that is as close as possible to our production systems.
For example, at ticketea whenever a user buys a ticket we log this event with detailed information. We have graphs that show the average time taken by a user to buy a ticket during the day, so we can tell if the system is slowing down and how long the payment system takes to answer back. We track if the ticket was free or paid, or if the ticket belongs to a numbered event or not. Our purchase system should be working as long as the platform is available, but it is now a complex piece of software that handles several use cases.
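As a rough illustration of the purchase metrics described above, the sketch below formats measurements using the StatsD plain-text protocol (`name:value|type`, with `ms` for timers and `c` for counters). The metric naming scheme (`purchase.paid.numbered` and so on) and the helper names are assumptions made for this example, not ticketea's real metric names.

```python
import socket


def purchase_metric_name(is_free: bool, is_numbered: bool) -> str:
    """Build a metric prefix from the purchase attributes (hypothetical scheme)."""
    price = "free" if is_free else "paid"
    seating = "numbered" if is_numbered else "general"
    return f"purchase.{price}.{seating}"


def timing_line(name: str, elapsed_seconds: float) -> str:
    """Format a StatsD timer: the value is expressed in milliseconds."""
    return f"{name}:{round(elapsed_seconds * 1000)}|ms"


def counter_line(name: str, count: int = 1) -> str:
    """Format a StatsD counter increment."""
    return f"{name}:{count}|c"


def send(line: str, host: str = "127.0.0.1", port: int = 8125) -> None:
    """Fire-and-forget one metric over UDP (8125 is StatsD's default port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

For example, a paid purchase for a numbered event that took a quarter of a second would produce `timing_line(purchase_metric_name(False, True) + ".duration", 0.25)`. Because the transport is UDP, a metrics outage does not slow down or break the purchase itself.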
Whenever we do a deploy we track unhandled code exceptions in Sentry. The alerting system we have in place has some tolerance in its thresholds. This means that if the purchase system is stuck due to erroneous logic and no exception is raised, the alerting system will take some time to notify us that something is wrong. For that reason we carefully monitor our metrics dashboards after a deploy, so we can tell as soon as possible whether we’ve broken something or introduced a regression, and roll back if necessary.
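To make the effect of that tolerance concrete, here is a small sketch of one way such a check could behave: it only fires after a number of consecutive bad readings. This is an assumed model for illustration, not how Cabot is implemented, and the `ToleranceAlert` class and its parameters are hypothetical.

```python
class ToleranceAlert:
    """Fire only after `tolerance` consecutive readings below `threshold`."""

    def __init__(self, threshold: float, tolerance: int):
        self.threshold = threshold  # Minimum healthy value, e.g. purchases per check.
        self.tolerance = tolerance  # Consecutive bad readings needed to alert.
        self.failures = 0

    def observe(self, value: float) -> bool:
        """Record one reading; return True if the alert should fire now."""
        if value < self.threshold:
            self.failures += 1
        else:
            self.failures = 0  # A healthy reading resets the streak.
        return self.failures >= self.tolerance
```

With `ToleranceAlert(threshold=1, tolerance=3)` and one reading per check interval, a purchase flow that stops right after a deploy is only reported on the third bad reading. That delay is the gap that watching the dashboards after a deploy is meant to cover.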
Team implications of going distributed
Going distributed makes it more difficult for the team to have the full picture of the system but, at the same time, makes having it optional. It becomes easier to hire new people and get them to be productive in less time by focusing only on a small part of the system.
However, we believe it’s really important to have a highly product focused team so we established some practices to keep everyone in the loop:
- 2-week Scrum sprints: This methodology helps us keep the team focused on the product. We follow most of its practices, like daily standups, retrospectives and product demos, where the full team can have a look at how the product is evolving in other areas.
- Discussions for important architectural changes: When we make an important change in our architecture, it’s no longer a one-person decision, but a team decision. Everyone is going to have to live with it and should at least understand the reasoning behind it and be convinced of the way we are going to implement it. During these discussions everyone can point out flaws in the new concepts and propose alternatives. The team ends up with a mutual agreement based on this discussion.
- Roadmap meetings: We have at least one meeting per quarter to inform everyone about what is going to be done that quarter from a business point of view. Although everyone knows the roadmap of the year, we’ve found this to be a good time to get everyone on track and focused. This way developers know why they are working on something and what it implies for the company.
We know that having all the team involved doesn’t scale when the team grows over a certain number, but it’s working pretty well with our current size. Adapting to your current needs and size is usually a good indication of a mature team.
While becoming distributed, our way of working also adapted. Instead of all developers touching the same code base, some people moved to specific projects. After a while, they became experts in those projects and we soon realized we needed developers to switch between projects, reducing the bus factor and sharing the knowledge.
Because projects use different technology stacks, this sometimes implies training the developer in a different language. We try to keep the number of technologies reasonable. Sometimes one technology might supersede an existing one and they can coexist for a while, but this rarely happens.
Most of ticketea’s developers are full stack. For example, several of them can work on either frontend or backend, and some also do mobile development. Obviously some developers are specialists in one area, and these are the people others usually turn to with questions about that technology. Team members know not only their own strengths but also each other's.
The development organization has changed too while the team continued to grow. At the beginning, there was only the CTO and the developers. As the number of developers kept growing, the “design lead” and “lead developer” roles were born, to offload some of the CTO’s responsibilities while keeping the rest of the hierarchy quite flat.
Now that the team has surpassed 15 members, we are starting to organize ourselves in small non-isolated teams of 5-7 members. Every team has a leader close to the team members that helps them to be more productive, for example when they need help from other developers to debug issues or just to be sure they are doing things right.
To sum up
Distributing systems comes with several benefits. We can now use different programming languages for different projects, with different requirements. Also, it’s easy for members of a project to understand it, although having a full picture of the whole architecture in mind is harder. That's why interfaces really matter. APIs and their versioning are crucial as they establish protocols and communication points, allowing you to scale not only your infrastructure but also your team - with fewer code conflicts, smaller releases and multi-pace release cycles. If you do it carefully you can fail gracefully. For example, we can now sell tickets even in the unlikely case that our access control system is down.
All these benefits come at a cost. Distributed architectures are harder to maintain and more difficult to deploy, orchestrate, etc., so you have to weigh the benefits against the costs of implementing this kind of architecture.
About the Authors
Miguel Araujo works as lead developer at ticketea. After studying computer science in Madrid, he worked as a freelancer in the startup world, turning into a full stack developer and contributing actively to open source software. He joined ticketea three years ago to modernise ticketea’s technology stack and help it scale and open new markets in Europe. He loves learning, working on challenges and tinkering with electronics.
Jose Ignacio Galarza currently works as CTO at ticketea. After finishing his computer science degree in Madrid, he stayed researching at the University until he started to work as a developer at different product-focused startups. He joined ticketea three years ago to help improve the product and transition ticketea’s tech stack to a more scalable and distributed system. He also loves burgers.