
Handling High Demand Ticket On-Sales with Anderson Parra and Vitor Pellegrino

May 30, 2022

Live from the venue of the QCon London Conference we are talking with Vitor Pellegrino and Anderson Parra. They will talk about how SeatGeek is handling ticket on-sales where a large number of users use their service in a short time, and which engineering challenges this brings.

Key Takeaways

SeatGeek is handling ticket on-sales where a large number of users use their service in a short time, which brings massive engineering challenges.
A next step to improve their architecture is understanding whether their services are in "on-sale" or normal operations mode.
The vision is to run the system completely automated. Reducing human interference would make the process smoother.
Storing data and encoding logic at the edge improves the runtime of the system. However, maintaining a global state with a lot of users is a challenge.
Virtual waiting rooms help you with traffic surges to your website, but they are no replacement for properly scaling your infrastructure.


Transcript

Introductions [00:05]

Roland Meertens: Welcome, everybody, to the InfoQ podcast. My name is Roland Meertens, and today I am interviewing Anderson Parra and Vitor Pellegrino. Anderson is a senior software engineer at SeatGeek, and Vitor is the director of engineering at SeatGeek. I am speaking to them in person at a venue of QCon London. And in this podcast, we are discussing how SeatGeek is handling high-demand ticket sales, and how they deal with spikes in their traffic.

The day before we recorded the podcast, they discussed this topic in their presentation at QCon. And in this podcast, we are going to dive a bit deeper into some of the future ideas they want to work on. At the moment of recording, you can still register for QCon Plus, if you want to see their talk live, and you can ask them any questions yourself.

Now on to the podcast. Welcome, Vitor and Ander to the InfoQ podcast. We are here at QCon London. We just had our lunch, and how is your day so far?

Vitor Pellegrino: Well, it's been great. So second day of the event, we had amazing talks. I mean, it's been very good to be back in person in a live event. So yes, lots of good energy so far.

Roland Meertens: Good to hear. You actually had your talk yesterday. Can you maybe tell a bit about who are you guys actually? Where are you working?

Vitor Pellegrino: My name is Vitor Pellegrino. I run the cloud platform teams at SeatGeek, so we are responsible for all the components of the platform that other SeatGeek engineers use. So think about compute, storage, networking, but also capabilities for the whole company, such as SRE incident response.

Anderson Parra: I'm Anderson, Ander. I work as a senior software engineer at SeatGeek. I work on the app platform team, and we're responsible for building the virtual waiting room. That was the topic of our presentation yesterday. Today is less stressful, I can enjoy the rest of the conference, so yes, it was good. We received good feedback there.

The problem of ticket on-sales [01:54]

Roland Meertens: So what are you guys currently working on? What was your talk about? What are the challenges you're having?

Vitor Pellegrino: At SeatGeek, we handle very large on-sales, so for the folks that do not know that space, on-sales are typically large events that are also met with a significant marketing push. They usually happen at specific times. I mean, you can imagine you want to attend a Liverpool soccer match, and then you want to buy tickets for it. So maybe the club is going to announce, "Okay, at 2:00 PM, Wednesday, everybody's going to be able to finally buy their tickets." So there is a lot of demand, and that's part of an everyday thing for us, so we have to build systems that are capable of handling that. And our talk was defining that problem space, talking a little bit more about how we reason about that problem when we are actually developing systems. So as I said, that's an everyday thing for us, so our systems must also handle that as if it were an everyday thing for them, as well.

And then we dive deeper into what Ander is talking about, the Vroom, that's what we call the virtual waiting room, Vroom. We also like to give funny names to stuff, I guess. And I think that is a core component that allows us to handle those situations, those on-sale situations.

Roland Meertens: To summarize it, if I want to buy a ticket for an event, I go to your webpage. Of course, thousands or tens of thousands of people do this all at the same time, so I assume that normal auto scaling techniques do not apply here anymore. So everyone enters the virtual waiting room. And I think what you especially mentioned really nicely in your talk was how do you kind of prioritize people? I think you were talking about fairness, right?

Vitor Pellegrino: Yes.

Roland Meertens: And maybe explain that for a bit.

Anderson Parra: As you mentioned, it was well described by Vitor yesterday. Sometimes the auto scaling doesn't help, when we receive the high traffic in one second. Then you try to avoid errors, and the idea is that you need a queue to control the traffic to the infrastructure. But we are selling tickets, and the idea is that everybody should have the opportunity to purchase tickets as fairly as possible.

The way that you guarantee that is you manage the state of the queue. Everybody, when getting settled in the queue, is associated with a timestamp. And based on this timestamp, you can sort, and then you are draining the queue in a first-in, first-out approach. The idea is that whoever arrived earlier should have the opportunity to try to purchase earlier.
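As a rough illustration of the timestamp-ordered draining described here, the following Python sketch keeps visitors in a heap keyed by arrival time and admits them earliest first. It is a minimal sketch, not SeatGeek's implementation: the `WaitingRoom` class, its method names, and the in-memory heap are assumptions made for the example (the transcript says the real queue state lives in DynamoDB).

```python
import heapq
import itertools


class WaitingRoom:
    """Hypothetical in-memory queue: visitors leave in order of arrival time."""

    def __init__(self):
        self._heap = []                    # entries: (timestamp, tie_breaker, visitor_id)
        self._counter = itertools.count()  # keeps insertion order for equal timestamps

    def join(self, visitor_id, timestamp):
        """Record a visitor together with the time they entered the queue."""
        heapq.heappush(self._heap, (timestamp, next(self._counter), visitor_id))

    def drain(self, n):
        """Admit up to n visitors, earliest arrival first (first-in, first-out)."""
        admitted = []
        while self._heap and len(admitted) < n:
            _, _, visitor_id = heapq.heappop(self._heap)
            admitted.append(visitor_id)
        return admitted


room = WaitingRoom()
room.join("carol", timestamp=3.0)
room.join("alice", timestamp=1.0)
room.join("bob", timestamp=2.0)
first_batch = room.drain(2)  # the two earliest arrivals, regardless of insert order
```

Sorting on the recorded timestamp rather than on insertion order is what lets the queue stay fair even when join requests reach the store out of order.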

Roland Meertens: So it's really all about kind of replicating the experience of buying something in real life, and the person who comes there first gets the tickets?

Anderson Parra: Exactly. Yes. I mean, queues are bad in the real world, and in the virtual world as well. We know that. And we're trying to drain the queue as fast as possible, to make the on-sale go as fast as possible. That's a good on-sale, when people can purchase the tickets earlier, and there is no bad experience with errors. The idea is that we're controlling the traffic to avoid errors for the user.

Roland Meertens: I think you also mentioned a bit like what's actually the limiting factor? Why can't everyone buy the ticket at the same moment?

Vitor Pellegrino: Yes. We're talking about something that actually exists in the real world, there are seats, and there is a lot of demand for the same seats, so there is a physical limitation there. So imagine that we have a specific seat, premium or not, doesn't really matter, but one specific seat that you have 10 people trying to buy at the same time.

How do you tie break? So how do you actually resolve it? Does the person that first saw the seat have priority? Or the person that actually submitted the first successful card payment for it? Or the first one to reserve? So these are the kind of things that we need to design for, to your point. I mean, that's why we cannot just allow everybody to buy at the same time.

Roland Meertens: So basically, everybody tries to go for the front row seats.

Vitor Pellegrino: Yes.

Roland Meertens: But of course, there's a limited amount of front row seats.

Vitor Pellegrino: Yes.

Anderson Parra: For the business model, when the sale starts, the race condition starts as well. Then there are a lot of people trying to purchase, sometimes the same seat. The way that we're operating, you can reserve the seat, and you have a time window to finish the purchase. But a lot of bad things could happen in that purchase phase: the credit card could be denied, or you realize it's too expensive and you give up. Then the seat becomes available again. Then another user has the opportunity to try to purchase this ticket. When you control the traffic, you can try to maximize the chance for people to get a ticket. You need time to finish the purchase sometimes, so the queue helps on the business side as well.
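The reserve-then-purchase flow described here can be sketched as a seat hold with an expiry time. This is only an illustration under assumptions: the `SeatHolds` class, the 600-second hold, and the tie-break rule (the first successful reservation wins until the hold expires) are made up for the example, and the transcript does not say which rule SeatGeek actually applies.

```python
class SeatHolds:
    """Hypothetical seat-hold table: a reservation blocks a seat for a limited time."""

    def __init__(self, hold_seconds):
        self.hold_seconds = hold_seconds
        self._holds = {}    # seat_id -> (user_id, expires_at)
        self._sold = set()  # seats whose purchase completed

    def reserve(self, seat_id, user_id, now):
        """Try to hold a seat; fails if it is sold or held by someone else."""
        if seat_id in self._sold:
            return False
        holder = self._holds.get(seat_id)
        if holder is not None and holder[1] > now and holder[0] != user_id:
            return False  # another user's hold is still active
        self._holds[seat_id] = (user_id, now + self.hold_seconds)
        return True

    def purchase(self, seat_id, user_id, now):
        """Complete the purchase only while the user's own hold is still valid."""
        holder = self._holds.get(seat_id)
        if holder is None or holder[0] != user_id or holder[1] <= now:
            return False
        self._sold.add(seat_id)
        del self._holds[seat_id]
        return True


holds = SeatHolds(hold_seconds=600)
first = holds.reserve("A1", "user-1", now=0)     # user-1 gets the hold
second = holds.reserve("A1", "user-2", now=100)  # rejected: hold still active
retry = holds.reserve("A1", "user-2", now=600)   # user-1's hold expired, seat is free again
```

An abandoned checkout needs no explicit cleanup in this sketch: once the hold expires, the next reservation attempt simply overwrites it, which matches the "seat becomes available again" behaviour described above.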

The thundering herd problem [06:15]

Roland Meertens: And kind of how many people are we talking about? What's the scale of the system? I think you had some graphs where you showed the normal usage of your site versus those kind of thundering herd events.

Vitor Pellegrino: Usually our flat line is relatively stable, but it can go up several orders of magnitude, two or three or sometimes four, depending on the event. I mean, you can expect that for a large stadium, let's say, imagine your largest stadium, as for an American football club or something, we are talking about several tens of thousands of users, and then you are going to have perhaps hundreds of people interested in buying each seat. So that is the magnitude we are talking about.

Roland Meertens: So you really have massive peaks.

Vitor Pellegrino: And I think one important thing, and that was something that I tried to stress, and was one of the key points for us proposing this talk in the first place, is it isn't enough to just say, "Okay, I'm going to always sustain that kind of load at all times." You need to be able to also reduce and shrink your infrastructure when you do not have those events. We could be always prepared to handle that kind of load, but that wouldn't be economical. It wouldn't make sense for us as a business.

Anderson Parra: When there is an on-sale, we try to predict the traffic for that on-sale. The good part of having the queue is that we're collecting metrics for the on-sales. And then we are using those metrics to predict the next seasons, to see, "Okay, I have the seasons of the on-sales. Then I collect metrics for that, and how is the next one going to be? How can I use the [inaudible 00:07:47] that I have seen trying to purchase the tickets, to predict the next event, in terms of traffic?"

Roland Meertens: And this is something which you do manually right now? Do you say, "Okay, these teams are really the top league teams, so they tend to sell out. And these teams are like the lower level teams, so they have a bit more time." Or is this something you're also trying to learn from the data?

Vitor Pellegrino: A lot of that is already automated. It's still a next step for us to increase the amount of automation that we have. We have a close relationship with our customers. So one thing that I forgot to mention in the beginning about SeatGeek: I would say that most of our listeners are going to be falling in the consumer category, which means somebody trying to attend an event, but we also design for the folks offering those events in the same place. So these are our customers as well, the enterprise customers, as we call them. So we work in conjunction with them. Whenever they're about to do one of these large on-sales, we're typically in close contact with them. So we do have systems that understand when an on-sale is about to happen, but we're increasing, even more, the amount of automation from that starting point.

Stateful versus stateless architectures [08:55]

Roland Meertens: So in terms of scaling, I think you guys were mentioning the stateful versus stateless architectures. Maybe you can talk a bit about that, what kind of decisions are you making? What kind of options do you have?

Anderson Parra: Well, that was the main topic when we started to build our virtual waiting room: how we're going to control the traffic, in terms of reducing the latency. So the best way, I mean, in the ideal world, you could run in the edge part. Then, for example, we are using Fastly as our CDN provider. Then we can try to create a mechanism to control the traffic on the CDN, but that environment is completely stateless, completely state ... Well, I'm going to talk about that. Then you have this idea that it's stateless, and then if it's stateless, you can have a rate limit, but you cannot control the order. And as I mentioned, the order matters for us. If you'd like to create a fair approach, then you need to manage the state of the queue. Then you need the state, a stateful model, and then we have a traditional back end, where you can store it. We're controlling the state of the queue in our database; we are using DynamoDB as our primary data store.

But also, we have a hybrid mode where we have part of our logic running in the CDN. And when I said that's a completely stateless environment, that's the change that the CDNs are making right now. So there are some small data stores on the CDN, and we're taking advantage of that. Fastly offers the edge dictionary, a simple key-value store that we are using as our primary cache.

Then we have the problem of syncing two data stores. We have that data store running in the CDN, and also our primary data store on the back end. Then we have all the mechanisms to keep those data stores synced. Then we can try to take that advantage, when it's possible, to run the logic on the CDN to remove the latency. And then, if you don't need to send a request to the back end, we avoid that and the CDN takes that part.

Roland Meertens: So I think that's a good summary of your talk, or most of the things you taught in your talk. And so if people are listening, it will be online on InfoQ, so you can re-watch it. But I think, also, what I may want to do is go a bit deeper into some of the suggestions you had for the future. So what are the next steps with your system? Or what are the things you are thinking about, about scaling it even better, or even further?

Vitor Pellegrino: Yes, that's something that we're spending a lot of time on. And as I mentioned in the talk, as well, it's a topic we're actively working on. We don't have all the answers yet, but one thing that is important for us is really having the systems understand which mode they are operating in. So we have a lot of metrics. We made a lot of investments in observability. So every system provides logging, extensive logging, tracing, metrics, which I think a lot of our listeners are probably used to, but we aren't able, yet, to say, "Okay, I want to see how my system behaved outside of an on-sale, versus how it behaved during an on-sale."

We can infer that, we can see that in the graph, it is pretty obvious, but I would like to be able to say, "Okay, that was the latency of this endpoint when we were under an on-sale. Oh, that's the number of requests that happened throughout my entire system for this particular on-sale," not only in the front end, but actually all the stuff that happened, to be able to categorize each one of the requests and say, "That's an on-sale request."

Roland Meertens: I can imagine that in this case, your P99 is kind of irrelevant, because, 99% of the time, you're not having an on-sale.

Vitor Pellegrino: Exactly, yes. A problem that we have, I mean, going even further, we talk a lot about SLOs. It's a very common thing that I see. For us, SLOs are very difficult to be used, as we normally see in the industry, precisely for that. If I have an error budget of, for the sake of the argument, 100 errors, if I have 100 errors outside of an on-sale, that's not a big deal. But if I have two during an on-sale, might be disruptive enough. So how can I think about SLOs for a specific time of the day? So I'm not interested to know how many errors I had in the last 30 days. I'm far more interested to know how many errors I had in on-sales in the past, I don't know, 30 days, right?
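One way to picture the mode-specific SLO idea is to tag each request with whether it happened during an on-sale and to evaluate a separate error budget per mode. The sketch below is purely illustrative: the `RequestRecord` type, the sample traffic, and the budget thresholds in `MAX_ERROR_RATE` are invented for the example and do not come from SeatGeek.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    during_on_sale: bool  # hypothetical tag attached to every request
    failed: bool


# Hypothetical budgets: errors during an on-sale are tolerated far less.
MAX_ERROR_RATE = {False: 0.02, True: 0.001}


def error_rate(records, during_on_sale):
    """Fraction of failed requests among those recorded in the given mode."""
    selected = [r for r in records if r.during_on_sale == during_on_sale]
    if not selected:
        return 0.0
    return sum(r.failed for r in selected) / len(selected)


def within_budget(records, during_on_sale):
    """Compare the mode's error rate against that mode's own budget."""
    return error_rate(records, during_on_sale) <= MAX_ERROR_RATE[during_on_sale]


# 100 failures in 10,000 normal requests, 2 failures in 50 on-sale requests.
records = [RequestRecord(during_on_sale=False, failed=i < 100) for i in range(10_000)]
records += [RequestRecord(during_on_sale=True, failed=i < 2) for i in range(50)]

normal_ok = within_budget(records, during_on_sale=False)
on_sale_ok = within_budget(records, during_on_sale=True)
```

With these made-up numbers, the 100 errors outside the on-sale stay within budget while the 2 errors during the on-sale do not, which is the asymmetry a single 30-day budget would hide.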

Roland Meertens: I think, especially for users, it's always that, when you want to buy those tickets, you want to buy them effortlessly.

Vitor Pellegrino: Yes.

Roland Meertens: And I think I have cases where you went from place 10,000 in the queue to place 300,000. You're like, "What's going on?"

Vitor Pellegrino: Yes.

Anderson Parra: The on-sale is the critical window for us. I mean, what Vitor said makes total sense, in terms of, if you have the critical window, when everybody was looking to you to try to do an action in the product, to purchase the tickets, that's the moment that you need to avoid error, that you need to care more about our system. And then you need to know that the on-sale is there for the size of the [inaudible 00:13:37], for what's the impact of the error? How many people are going to be affected?

And with all that information, you can try to react to that. And we have processes, we are training people so that they can handle errors better, because it's really hard to say that a system is completely error free. It's common that you can have errors in the applications, but I think the most important thing is, if you have an error, how can you handle that? And what are the lessons learned that you can take from the error, to try to prevent it, because you are always in that process, in continuous improvement. Then you can say, "Okay, I've seen an error. I prevent that. Then, what's the next one?"

Roland Meertens: So how does it work? Do you log a lot of data, also, during the on-sale, when you're actually putting tickets on sale? Or I can also imagine that at some point, you want the most bare-bones structure possible, to actually handle everything, right?

Vitor Pellegrino: Yes. I mean, we log ... And that's the whole point. Right now, we make no distinction whether we are in either mode. And that's something that we would like to change in the future. I mean, we're going to still log everything, but we would like to categorize. I mean, maybe a way to think is just, "Okay. Well, I want to place things into different logical buckets. And I want to be able to reason about either bucket differently."

So another thing that is important for us is understanding our non-functional trade-offs. So I think, if I'm browsing right now, if I want to see what's happening in London, for tonight, I actually would care much more about the site feeling very snappy, I'm getting access to what I need, latency is very important to me. But if I'm in the middle of an on-sale, and I'm already in that stressful situation, I care far more if I press the button, I actually get the ticket. I don't mind as much if I have to wait 200, 300 milliseconds, or even a second, for the sake of the argument. So that's the kind of stuff that we are building that knowledge inside of our application. So I'm spending a lot of time thinking about that. "Okay. How do we get teams to design their systems with that in mind?" So how can I perhaps pass that information using, I don't know, a notification system that each microservice is able to understand.

Roland Meertens: So your non-functionals are really, indeed, changing over these different phases.

Anderson Parra: I think the key point is automation. We're trying to make our systems a little bit more sophisticated, so that they can understand what mode the system is running in. If it's an on-sale, how can you trigger the alerts differently? If you have an error, how can you recover as well as possible in that moment? Then, when we're not in an on-sale, it's a less stressful situation, and you can say, "Okay, I have time to see what's going on, and to provide a fix for that." You know?

Automating the ticket sales process [16:17]

Roland Meertens: You were mentioning kind of trying to run the system, at some point, by robots, or running everything automatically. How is it currently? Is there a lot of manual work involved into putting each ticket online?

Anderson Parra: When we started the virtual waiting room, we had a lot of manual work to set up the protected zones, and to see the paths of the events that were going to be on sale. Nowadays, everything is automatic. When the event is created, in terms of, okay, someone was designing the event, the stage, and how many tickets are going to be available, and says, "Okay, this is going to be on sale on a certain day," then the protected zone is created automatically. We have more than 2,000 protected zones running in production. It means that all the events are protected by that queuing system, and that completely reduces the manual work. And we are still working to reduce it even more. The idea is that, okay, we know that the engineers' time is really important for doing engineering things. And we're trying to reduce the time engineers spend operating systems, you know?

And then, we can automate it. You can see that, "Okay, if the CPU is going up, then you can take decisions," like someone who is looking at the chart, at the graph, sees the spike in the CPU and decides to reduce, for example, the exit rate of the protected zone. I mean, if it's manual work, you could do the automation as well, because you can try to understand what's going on with the CPU. Then, you can take a decision, and the system applies that. And that's the idea, the next step for the evolution of our on-sales: we can try to reduce the number of people operating it, you know?

Roland Meertens: Yes. Because right now, a lot of people are still looking at how many people are buying tickets at the same time. So can we allow more in, or less in, right?

Vitor Pellegrino: Yes. That's what Ander was mentioning with the exit rate. So right now, people have to make a decision like, "Okay, it seems like we're able to sustain more load, so instead of allowing, fictional numbers, like 300 people every minute, let's allow around 500, 1000." Or maybe it's going the other way around, like, "Actually we're not able to sustain as much. So let's reduce that, to avoid a bad experience for everybody that is already buying." So that's the kind of stuff that we want to allow for much more automation, so the systems are able to adjust their thresholds automatically.
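The manual decision described here (raise the exit rate when there is headroom, lower it when the back end is struggling) could be automated with a simple control rule. The following Python sketch is an assumption-laden illustration: the function name `next_exit_rate`, the CPU target, the 25% step, and the rate bounds are all hypothetical, and the transcript does not say which signals or thresholds SeatGeek would use.

```python
def next_exit_rate(current_rate, cpu_utilization, *, target=0.60, tolerance=0.10,
                   step=0.25, min_rate=50, max_rate=2000):
    """Return the number of visitors to admit per minute for the next interval.

    Hypothetical rule: back off when CPU is well above target, speed up when it is
    well below, and otherwise keep the current rate. The result is clamped so the
    queue never stalls completely and never floods the back end.
    """
    if cpu_utilization > target + tolerance:
        proposed = current_rate * (1 - step)  # overloaded: let fewer people in
    elif cpu_utilization < target - tolerance:
        proposed = current_rate * (1 + step)  # headroom: let more people in
    else:
        proposed = current_rate               # close to target: hold steady
    return int(max(min_rate, min(max_rate, proposed)))


faster = next_exit_rate(300, cpu_utilization=0.40)  # plenty of headroom
slower = next_exit_rate(300, cpu_utilization=0.90)  # back end under pressure
steady = next_exit_rate(300, cpu_utilization=0.60)  # on target
```

Run once per interval, a rule like this replaces the person watching the CPU graph; a production version would likely smooth the signal and combine several metrics rather than react to a single reading.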

Roland Meertens: And I think you are also thinking about the alerting. How does it currently work? Do you wake people up at night to push more tickets?

Vitor Pellegrino: No, no. I mean, our customers define how they want to buy. So I think the thing that wakes up engineers at night is more, like at most companies, when things don't work as intended. But I think, in the future, we would like to adjust the priority of these alerts. Again, coming back to the overall theme about on-sale or not. So right now, if people are having any service disruption, we're going to treat that the same. But I would like to be able to say, "Okay, if it's not an on-sale, it's actually something that I can wake up, fully refreshed, and take a look at with fresh eyes in the morning. But if that's an on-sale, please wake me out of, I don't know, whatever I'm doing." So that's the kind of thing we're looking to do.
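A mode-aware alert policy of the kind discussed here might look like the sketch below. Everything in it is hypothetical: the `Mode` enum, the `route_alert` function, and the response times are placeholders chosen to show the shape of the idea, not a description of SeatGeek's alerting setup.

```python
from enum import Enum


class Mode(Enum):
    NORMAL = "normal"
    ON_SALE = "on_sale"


def route_alert(service, mode):
    """Decide how urgently a service disruption should reach an engineer.

    Hypothetical policy: page immediately during an on-sale, otherwise open a
    ticket that can wait until working hours.
    """
    if mode is Mode.ON_SALE:
        return {"service": service, "action": "page", "respond_within_minutes": 5}
    return {"service": service, "action": "ticket", "respond_within_minutes": 8 * 60}


during_on_sale = route_alert("checkout", Mode.ON_SALE)
outside_on_sale = route_alert("checkout", Mode.NORMAL)
```

The same disruption produces a page in one mode and a ticket in the other, which is the distinction the current "treat everything the same" setup cannot make.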

Fraud detection [19:13]

Roland Meertens: I think the other thing you mentioned at some point during the talk was fraud detection, that someone could maybe, very quickly, buy a single ticket automatically. Is that fraud? Or someone, maybe, buying 100 tickets for the entire group of friends, is that fraud? How do you handle this at the moment?

Vitor Pellegrino: It's a good point. That was within the topics of we need to think about these things, every on-sale. So we leverage a lot of machine learning and fraud detection systems, throughout the entire stack, so sometimes people will execute some actions, and then, post-factum, realize that they could have been problems, and we have systems to care for that.

We use a lot of different tooling around bot protection and all of that, but it comes with the question, if I am trying to buy a ticket, and I use, I don't know, Selenium to automate that task, where do we draw the line? Is 10 tickets okay? Is 1 ticket? Is 100 tickets? So that is the kind of thing where, I would say, we work very closely with our customers, and then we define, "Okay, that's what we believe is an acceptable behavior."

Roland Meertens: And the queue helps?

Anderson Parra: The queue helps on that part, because of the way that you're controlling the traffic: you try to identify real users and bots, and remove bots from the traffic. Then you can try to guarantee that people that get into the protected zones, that have the opportunity to purchase the tickets, are real users, but it's hard. The same way that we're working to prevent it, people are also working to bypass it. It's always that way.

Roland Meertens: So as you also mentioned machine learning, what are some of the best features to detect if someone is a bot or not?

Vitor Pellegrino: It's a good question. I think we use systems that provide that, almost as a kind of standalone service. So they analyze the usual patterns, like how fast people will navigate through webpage, just one example. There are a lot of signals involved.

Anderson Parra: Well, there are systems that create fingerprints of the request. Fastly helps, as well, to create a stamp on the request that says, "Okay, that's a bot, or not a bot." And we don't rely on only one bucket, because as I mentioned, people are trying to bypass that. Then you have the combination to try to identify whether that's bot traffic or not. Then we try to guarantee a user experience that is as fair as possible for real users that are trying to purchase the tickets, because that's the most important part.

Roland Meertens: At the end of the day, you want real people to sit there, and not a scalper making money off your tickets.

Anderson Parra: Exactly.

Vitor Pellegrino: Exactly. Yes.

Edge processing [21:28]

Roland Meertens: And you were mentioning Fastly as your content delivery network, so how does edge processing, how does a content delivery network work here? Because I can imagine, that because the state of your database and available tickets changes so often, you can't cache too much at the moment.

Anderson Parra: Well, the problem is that if you are caching, then you need to have a way of purging the cache. Then, you are thinking in terms of event orchestration, because if you cache in the CDN, then you can see the latency going down. But if there's a change in the event, for example, then you need to have a way of purging the cache that was made in the CDN. The way that we're thinking about that, systems could react to the changes through events, and then you can orchestrate, choreograph the events, in terms of, "Okay, if something changed in the event model, then I know that I need to change the protected zone." That's the queuing system. "And also I need to purge some cache." Then, again, it's connected with the automation part. We would like to keep our systems as smart as possible, in terms of reacting to changes without manual intervention.

But we have other uses of the CDN, in the end, with caching as an example, and part of our virtual waiting room works there. Then, we have logic to validate visitor tokens and access tokens, because in the end, the virtual waiting room exchanges visitor tokens for access tokens.

And also, we need to maintain the state of the protected zones in the edge dictionary. That's the way that we can control how we're going to route the traffic. It should go to the queue, it should go to the target, or it should go to a block page. And then we have that part of the logic running in the CDN, and then you don't need to communicate with our back end. That's good, as well, in that you can reduce the cost that we have with our back end, because we're distributing how we are executing the computing across the different layers.
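The choreography sketched in this answer, where one change event triggers both a cache purge and a protected-zone update, can be illustrated with a tiny in-process publish/subscribe example. It is a sketch under assumptions: the `EventBus` class, the `event_changed` event name, and the dictionaries standing in for the CDN cache and the zone store are all hypothetical, and a real system would use a message broker and the CDN's purge API instead.

```python
from collections import defaultdict


class EventBus:
    """Hypothetical in-process publish/subscribe bus."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._handlers[event_type].append(handler)

    def publish(self, event_type, payload):
        # Every subscriber reacts independently; the publisher does not know them.
        for handler in self._handlers[event_type]:
            handler(payload)


# Stand-ins for the CDN cache and the protected-zone configuration.
edge_cache = {"/events/42": "<cached page>", "/events/7": "<cached page>"}
protected_zones = {}


def purge_cache(payload):
    """Drop the cached page for the event that changed."""
    edge_cache.pop(f"/events/{payload['event_id']}", None)


def update_protected_zone(payload):
    """Create or refresh the queueing configuration for the event."""
    protected_zones[payload["event_id"]] = {"on_sale_at": payload["on_sale_at"]}


bus = EventBus()
bus.subscribe("event_changed", purge_cache)
bus.subscribe("event_changed", update_protected_zone)
bus.publish("event_changed", {"event_id": 42, "on_sale_at": "2022-06-01T14:00:00Z"})
```

Because each reaction is its own subscriber, adding another consequence of an event change means adding a handler, not editing the code that emits the change.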
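The routing decision described here (queue, target, or block page, decided at the edge from locally stored zone state and a validated access token) could be sketched as follows. This is a minimal Python illustration, not Fastly code and not SeatGeek's logic: the zone states, the HMAC-based token format, the `SIGNING_KEY`, and the function names are all assumptions made for the example.

```python
import hashlib
import hmac

SIGNING_KEY = b"demo-only-key"  # hypothetical secret shared by the token issuer and the edge


def issue_access_token(visitor_id, zone_path):
    """Token handed out when a visitor leaves the queue (hypothetical format)."""
    message = f"{visitor_id}:{zone_path}".encode()
    return hmac.new(SIGNING_KEY, message, hashlib.sha256).hexdigest()


def route_request(path, zones, visitor_id=None, access_token=None):
    """Decide where a request goes using only data available at the edge."""
    zone = zones.get(path)
    if zone is None:
        return "target"      # path is not protected by a waiting room
    if zone["state"] == "blocked":
        return "block_page"
    if visitor_id and access_token:
        expected = issue_access_token(visitor_id, path)
        if hmac.compare_digest(access_token, expected):
            return "target"  # valid access token: skip the queue
    return "queue"           # everyone else waits


# Stand-in for the key-value store kept at the edge.
zones = {"/events/42": {"state": "queueing"}, "/events/13": {"state": "blocked"}}
token = issue_access_token("visitor-1", "/events/42")

unprotected = route_request("/about", zones)
waiting = route_request("/events/42", zones)
admitted = route_request("/events/42", zones, "visitor-1", token)
```

Since both the zone lookup and the token check use only local data, none of these decisions requires a round trip to the back end, which is the latency and cost benefit mentioned above.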

Storing data at the edge [23:13]

Roland Meertens: And I think you were also mentioning storing data at the edge. What are your ideas around that? Is it something you're already doing, or is this something you're planning to do?

Anderson Parra: That's something new. This edge dictionary from Fastly is something new we're using, and we are taking advantage of that. We can see it in other companies, like AWS with CloudFront: they have Lambda@Edge, and they're using DynamoDB as the edge data store, with the global tables, because with the edge running in different regions, you can try to make the data available for all the regions.

For us, Fastly works quite well. The time to replicate the data store is around 30 seconds. DynamoDB is around two minutes. Then, I know that Akamai is working on a data store at the edge as well. But in my opinion, it looks like we are going to have more logic running at the edge, in terms of reducing the latency. We're not planning to run the complete system on the edge, but the idea is that you can have a first layer there, and try to avoid firing a request to the back end when it's possible.

Vitor Pellegrino: This is something that I would have loved to have, I would say, about seven, eight years ago, when I was working for a company that had a very heavy usage of a social graph. So you would have users, their followers, people that they interact with. One of the things that I could see, if I were to rebuild that system, is actually trying to figure out how I can take some of the key users, that have huge crowds that follow them, and actually store some of that information already close to the visitors, like in their edge locations.

So these are the kind of maybe hinting towards new technologies that we're excited about too. That's the kind of things that I think could be so useful to solve that use case, when we're investigating also. So storage at edge is definitely something that unlocks a lot of possibilities for us.

Anderson Parra: Thinking about the idea of the edge, I think we are at the moment where we are going to expand what the edge is in the end. We have 5G right now. Then, we are going to have more devices with good connections, which you can sync with some back end systems, so the edge will not be only the CDN. The edge will be the gate at the stadium that we need to open when the ticket is checked.

Then we validate that, check that it's valid, and open the gate for that person, because it's a lot that you're getting. Then, I think, going for that direction, that the edge will be everywhere. The idea for the internet of things is going on. Then finally, the problem with the connection is going to be figured out. Then we're going to have massive data, and you can try to think, and how can you just improve our business, because you have the opportunity to run software connected everywhere.

Roland Meertens: And especially with what you said about the stadium, which you are basically proposing, you said, "Maybe the database is at the stadium, so that even if there would be an outage outside of the stadium, you can still keep running."

Anderson Parra: Or imagine that you have the database in the gate. You have a ticket, and you need to go to gate seven. And imagine that gate seven has all the tickets that are allowed to get in through that one. I mean, if the gate is working, then you don't care if there's an outage. But what's the problem with that? You need to sync, right?

Then how can you sync for that event? If you can sync at the right moment, then you allow people to get in fast.

Roland Meertens: And of course, you don't want people to check into two gates at exactly the same time.

Vitor Pellegrino: Yes. And also reporting that customers do, and all of that, I think the main thing that kind of technology unlocks, and again, hinting towards what we could be doing, we don't only do ticketing. That is, I would say, our bread and butter, but we also help customers, enterprise customers I'm talking about here, manage their food stamps, manage their convenience stores inside a stadium. So we can see expanding the devices to actually where the work is happening. So closer to the users, like visiting a stadium, so that in-stadium experience is also important for us.

Roland Meertens: And maybe, as the last thing you were mentioning, the elasticity as all the layers of the infrastructure, what are your thoughts on that? What's the future of that?

Vitor Pellegrino: I think that's something we're working on right now. Our architecture, as at any other company, has things that were built at a different time, and I don't think we're able to grow all the systems in lockstep. So I would like to get to a point where, let's say there's an on-sale: while users are in a queue, perhaps you can add more database computing power, or increase, I don't know, even storage. I'm making up an example here, but I can take the idea of auto scaling throughout all the layers of my architecture. Perhaps I add different components as I need them, or activate other vendors to add extra redundancy. Sometimes people focus only on the compute part, and sometimes only on one component, and then forget to also scale its downstream components. And for us, that adds tremendously to the amount of time that people are waiting for all the auto scaling components to kick in.

So, on the whole flexibility in all layers, we would like to be able to reason the same way people do nowadays to increase the amount of processing. Let's say there is a backlog in a Kafka infrastructure, or a RabbitMQ queue, I don't know. There, you add more consumers. We would like to reason the same way: "Oh, we have more people in the queue. Therefore, I want to scale the whole vertical that is serving that on-sale at the same time, so everything is ready." And then we can tie back to what we talked about and increase the amount of people that we allow in, because now we have more capacity. Waiting to have, "Okay, first my compute is increased, then my second service," and so on, that is too long for us.
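The reasoning described here, one queue signal driving every layer of a vertical at once, can be sketched as below. This is an illustrative sketch only: the `Layer` type, the `users_per_replica` capacity figures, and both function names are assumptions made for the example, not anything from SeatGeek's systems.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Layer:
    """One layer of the vertical serving an on-sale (all values hypothetical)."""

    name: str
    users_per_replica: int  # assumed capacity of a single replica
    min_replicas: int = 1   # floor to scale back down to after the on-sale


def plan_vertical(queue_length: int, layers: list[Layer]) -> dict[str, int]:
    """Derive a replica count for every layer from the same queue length."""
    return {
        layer.name: max(layer.min_replicas,
                        math.ceil(queue_length / layer.users_per_replica))
        for layer in layers
    }


def admission_capacity(plan: dict[str, int], layers: list[Layer]) -> int:
    """Users that can be let in at once, limited by the smallest layer."""
    return min(plan[layer.name] * layer.users_per_replica for layer in layers)
```

Because every layer is planned from the same number, the downstream components are sized together with the front end instead of waiting for each autoscaler to react in turn. Calling `plan_vertical(0, layers)` returns each layer's minimum, which corresponds to scaling back down once the on-sale is over.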

Roland Meertens: I can also imagine that it's a bit hard to... Like the front end is maybe easiest to scale, but the database will be way harder, right?

Vitor Pellegrino: Yes.

Anderson Parra: It's hard. And that's the reason that we're monitoring everything. We're trying to avoid blind spots, so that you can use those metrics to identify bottlenecks, and sometimes the bottleneck is in the database. It means that you need to go to the whiteboard again, rethink the solution, and provide a different one that can support the traffic. And that's a constant improvement. I mean, there is no right answer. What works today may not work tomorrow with different traffic. Then, for our side, that we are always [inaudible 00:29:22] the details, you have dashboards, then we know when something is going bad. And when something's going bad, we refactor to support what's going on.

The database part is just an example. I think we know it's a little bit hard to change that particular layer in any architecture, but I think we can make a lot of progress already just by scaling the closer dependencies. For instance, if I have a back end for front end, which lives closer to my front end and talks to 10 different services, and I can scale all of that at the same time, looking at the same queue size, perhaps I'm going to be able to increase the amount of people that I can let in at once in an on-sale. And most importantly, once the on-sale is over, I am able to scale all of that back down, because our traffic follows that kind of movement, and we would like to keep the efficiency of our infrastructure as well.

Roland Meertens: In this case, the vertical scaling is easy, but scaling to the right vertical size is the problem.

Vitor Pellegrino: Yes, yes.

Roland Meertens: All right. Any other things you wanted to talk about?

Vitor Pellegrino: We're very happy to be here. Maybe just one thing to leave with the listeners: if they want to hear more about any of these topics, we're open to having that kind of discussion. A lot of this is something that we're still thinking about. We don't claim to have all the answers, but it's something we're very excited about. And it's great to be here at an event like this, with lots of energy. I've been spending a lot of time in breakout rooms and between talks, talking to people. And then I'm just like, "I can't wait to come back," and then actually put a lot of these things into practice.

Roland Meertens: You can definitely talk to a lot of people who are struggling with the same problems, or have maybe already solved it.

Anderson Parra: And also, we have great problems, and we're looking for engineers. If you'd like to work on that kind of challenge as well, you're more than welcome to talk to us about it.

Roland Meertens: Then thank you very much for being here.

Anderson Parra: Thank you.

Roland Meertens: I hope you enjoy the rest of the conference, and I'll see you in breakout rooms. [...]

Anderson Parra: Thank you.

Vitor Pellegrino: I appreciate it. See you.

Roland Meertens: So this was the interview with Anderson and Vitor. I really hope you enjoyed this in-person interview recorded at QCon London, and thank you very much for listening to the InfoQ podcast.

About the Authors

Anderson Parra


Vitor Pellegrino


More about our podcasts

You can keep up to date with the podcasts via our RSS Feed, and they are available via SoundCloud, Apple Podcasts, Spotify, Overcast and Google Podcasts. From this page you also have access to our recorded show notes. They all have clickable links that will take you directly to that part of the audio.