Challenging Google Analytics: Building a Scalable, Cost-Effective User Tracking Service

Summary

Alina Krasavina explains how Delivery Hero successfully deprecated Google Analytics and migrated to an internal user tracking platform. She discusses how a simplistic, highly scalable architecture allowed them to handle 10 times more load while capturing 97% of tracking data.

Bio

Alina Krasavina is an experienced Engineering Manager at Delivery Hero, currently leading the design and development of a user tracking service that supports millions of users worldwide. With 14 years in software engineering, she has contributed to both internal and customer-facing products, delivering complex, scalable solutions and building strong, collaborative engineering teams.

About the conference

InfoQ Dev Summit Munich software development conference focuses on the critical software challenges senior dev teams face today. Gain valuable real-world technical insights from 20+ senior software developers, connect with speakers and peers, and enjoy social events.

Transcript

Alina Krasavina: My name is Alina. I'm an engineering manager in Delivery Hero. I'm working on the internal user tracking service, like Google Analytics, but internal. We deprecated Google Analytics at some point. I'm going to be talking about how we did that. First, a couple of words about the company. Delivery Hero is a food delivery company, but it's not like one brand in one country. It's a central office located in Germany, and many local brands around the world. In the central office, we are providing centralized services for those local delivery brands. My service is one of those centralized tools for local delivery companies.

MVP Overview

Has anybody used Google Analytics? Think a little bit about why you don't like Google Analytics, even if you're using it. Maybe you come up with something, some reasons. We had our reasons too. I'm going to be talking first about our MVP rollout, where I'm going to introduce the problem, tell about architecture, how we did testing and the rollout, and how we approached further development. The second part is how we did that further development, which is mostly some optimizations, like cost optimization, solving some problems. This pretty much is going to be two parts.

Intro to the Problem - Deprecation of UA, GDPR, Costs, Capabilities

Let's start with the intro of the problem. Why did we choose to deprecate Google Analytics? The first reason was that we needed to do migration anyway, because Google had Universal Analytics, which is the previous version of Google Analytics, and we needed to do an effort of migrating anyway. We chose to migrate to our internal tool, not to GA4, because anyways, you need to invest this effort into some migration. GA had actually some limitations that we needed to deal with. We needed the real-time data, and it didn't provide the data for real-time use cases. We have some billable data from advertisement, and we just couldn't do that with Google Analytics, because it was providing the data once or twice per day, and we needed it just real-time. Also, we have many different events, and we have, at some point, reached the limit of events to be defined in Google Analytics. In our system, that's unlimited. We have as many event types as we want.

The second reason was GDPR, because Google is a third party, and we cannot store all the data in the third party. We have some data to be stored in our infrastructure for legal reasons, so our privacy officers are happy. We don't have any legal concerns. The third reason was cost, but at first, it was just a condition that, when migrating, we are not paying for our service more than for Google Analytics. Later on, it improved a lot, and we were just happy that we moved to our internal tool, and further capabilities. We, later then, introduced data validation that allowed us to have more quality data, and Google Analytics just couldn't provide that functionality. Then, I'm going to be talking about those challenges that are listed below on the slide, how we have dealt with data quality, what we did with cost, scalability, testing, and rollout process.

Architecture

A little bit about what our tracking system is. Just like any other user tracking service, we have a mobile SDK, and a frontend SDK, in TypeScript, to collect the data and to send to our API. Then, we have our internal infrastructure that is streaming the data to consumers and to our data storage, which is BigQuery, because we are using Google infrastructure. We have two types of consumers: real-time consumers that are consuming from Pub/Sub, and everyone else who are consuming from the storage, from BigQuery. How it started? We started with just a very simple service. It contained just an API and two processors that are reading messages from Pub/Sub. Processor, and fallback processor, just for fallback cases. That's it. Very simplistic architecture, but it allowed it to be scalable, and it caused zero problems. We were very happy with that.

How is it going now? We now have many additional services around this simple first thing, the API and processors. Everything else is there to solve some of our problems, like reliability or data validation, for instance, the data validation service. We have a lot more curation jobs and more SDKs. To serve our data producers and data consumers, we have built additional services that are all around our simple, tiny little API. After the rollout, we had some success. With Google Analytics, we had an order match rate, a data quality metric that we invented, of 85%, and after the rollout we got 6% more, which, at large scale, when it comes to billable data, is a lot of money. Also, we have improved the cost by 25%. We got twice the load that we had with Google Analytics. During testing and rollout, we had zero data loss incidents, so we avoided the revenue loss that comes with such an incident.

Data Quality Improvement

Let's now talk about those previously listed challenges. The first requirement was that Perseus Tracking should track at least as much data as GA, but it would be perfect if it tracked more. After the rollout, I mentioned brands, we started with those four brands. This is a chart of one year after the rollout. You see that at some point the two lines, GA and Perseus, first become equal, and then we have more data than GA. How did we do that? Basically, we were fixing our SDK so it doesn't lose data, and we were building our infrastructure reliably, so it also doesn't lose data that has already arrived, because that would happen too. About order match rate. That was the metric that we defined because we have a source of truth: when people order food, the backend of the food delivery knows that a user has actually paid for their food.

We have 100% a source of truth of some data. There's no other data that could be tracked that good. We were comparing orders from backends to orders that have arrived with our user tracking. That was a good metric because that is very much predictable because even if it's a high season of Christmas or whatever, public holidays, you get instantly a higher amount of orders for no reason, but we have a source of truth about orders. We know that it's not our service or something, it's just behavior of people. We can compare it really well to our user tracking. The second challenge was that we are not paying more for our internal tool than we pay for Google Analytics. We were comparing cost. We have defined a simple metric. It's a cost per message. We were optimizing it, of course, and we have continued to optimize it further.
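The two KPIs described above are simple ratios. The following is a minimal sketch, assuming orders carry an ID that is shared between the delivery backend and the tracking events. The function names and the ID-based matching are illustrative assumptions, not the actual Perseus implementation.

```typescript
// Hypothetical helpers; names and matching logic are assumptions for illustration.

/** Share of backend orders (the source of truth) that also arrived via tracking. */
function orderMatchRate(backendOrderIds: string[], trackedOrderIds: string[]): number {
  if (backendOrderIds.length === 0) return 1; // nothing to match against
  const tracked = new Set(trackedOrderIds);
  const matched = backendOrderIds.filter((id) => tracked.has(id)).length;
  return matched / backendOrderIds.length;
}

/** Total infrastructure cost divided by the number of messages processed. */
function costPerMessage(totalCost: number, messageCount: number): number {
  if (messageCount <= 0) throw new Error("messageCount must be positive");
  return totalCost / messageCount;
}
```

Because the denominator of the match rate comes from the backend, a seasonal spike in orders raises both sides of the ratio and the metric stays comparable over time.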

One year after rollout, we reached the point where we were paying 25% less. That was also an observable metric, and the result of a set of fixes in our infrastructure. What also helped us was load testing. We were testing with real data. We took our peak load and tested with three times that amount of data, and that helped us to survive the load of that day. In one of our countries, it was a public holiday, the situation I told you about. Instantly, the amount of orders was insanely high. We had this peak load and successfully survived. Load testing is good. Please, do load testing. Another part was testing. We haven't invented anything new. That's a pattern, actually, for all our testing of big software pieces, like SDKs, or in that case, GTM removal. With Google Analytics, we used Google Tag Manager to enrich the data and to filter out some data with the rules configured in Google Tag Manager.

When deprecating Google Analytics, we were also deprecating Google Tag Manager. We were testing that removal with just a doubled pipeline. That is very expensive because you are doubling your load and receiving twice as much data as you otherwise would, and you actually need to pay for that. And you pay double not only for, say, a month. You first roll out a new application version with the new SDK version, which is doing that doubled pipeline thing, sending with our SDK and using GTM, too. It's sending the data twice. Then, your test is over, and you need the user to update the application on their phone, and that doesn't happen instantly. For some time, they are still sending the data twice, and you are receiving the data twice if you haven't deprecated your flow. I think at least half a year ago, there were some users that were still using that old version with this testing thing.

That should be taken into account because it's very expensive. We had the same approach for testing our competitor. There is a Snowplow SDK. That's an open-source SDK for user tracking, and we were basically deciding whether we are going with our SDK or Snowplow SDK. It was just like head-to-head, but we decided to take our internal tool because we had a control on it. Because if you need to do some development, even on open-source, of course, you can make a pull request, and you can talk to the maintainer, but if you are not a maintainer of the SDK, everything becomes very slow. If you need to implement features really fast and have control over the features, you would, in the end, just fork this SDK and maintain your own version. Why do that if you can just develop the whole thing from the very beginning? That's what we have gone forward with.

The last thing, at least for MVP, was the rollout. That thing that I already described, the deprecation of the second pipeline. We tested it very well, and we reached the point where we were getting at least the same amount of data, and only then did we deprecate the second pipeline of GTM dispatching. The later application versions continued with our SDK without Google Tag Manager. Because we tested everything very well, we didn't lose any data because it was doubled. Key takeaways. Surprisingly, we did better than Google. That was a real surprise because that was actually an experiment. It's not like nobody believed that it could happen. We were doing our job and tried to be better in every iteration, and that magically happened. Because it's like, we were all thinking, it's Google, they're the best, if they have 85% of data, no way we're getting more. I will talk about that later, but we have reached even better quality.

The second takeaway is that if you choose your KPIs carefully, you can very well track how your changes are affecting those KPIs, and if you are actually doing better with your fixes or any other improvements you're introducing, if your features are working well. Load testing helped. Also, parallel testing helped. We didn't lose any data. We survived the peak load. Very much a win. Also, we introduced a progressive rollout strategy for our backend and also for our SDK. We were rolling out our SDK first on smaller brands and then on bigger brands, because if you test your changes on smaller markets, you have obviously less impact, and if something goes wrong, it's just cheaper to roll back on a smaller market. Lessons learned, also, takeaways. Of course, you need to add logging. If you're testing, you need to add logging everywhere. If you are doing an experiment, you should add logging.

We were thinking about it, but still, when we saw every data loss, we couldn't debug it that easily. We were lacking logging. We were lacking also alerting because we had further data loss incidents, and some of the data loss incidents, they were also hard to debug. Better observability is better than no observability or worse observability, so you need monitoring and alerting at key points of your system. Also, progressive rollout, that is coming also from incidents. We had a data loss incident because of our SDK rollout, and we introduced that progressive rollout to that thing with small markets that I mentioned, because it's just cheaper if you have an incident. Also, once we had a GCP outage, so we introduced also chaos testing to test if Google infrastructure doesn't work very well.
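The progressive rollout mentioned above, smaller brands first and bigger brands later, can be sketched as a simple planning function. This is an illustration only: the `Brand` shape, the use of daily active users as the size measure, and the fixed wave size are assumptions, not details from the talk.

```typescript
// Hypothetical brand descriptor; the size measure is an assumption.
interface Brand {
  name: string;
  dailyActiveUsers: number;
}

/** Orders brands smallest-first and groups them into rollout waves. */
function planRollout(brands: Brand[], waveSize: number): Brand[][] {
  if (waveSize < 1) throw new Error("waveSize must be at least 1");
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...brands].sort((a, b) => a.dailyActiveUsers - b.dailyActiveUsers);
  const waves: Brand[][] = [];
  for (let i = 0; i < sorted.length; i += waveSize) {
    waves.push(sorted.slice(i, i + waveSize));
  }
  return waves;
}
```

Each wave would only start once monitoring and alerting show no data loss in the previous one, which keeps the cost of a rollback limited to the smaller markets.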

Challenges, after the Rollout

We are done with the first part about MVP. MVP is rolled out, the service is working, and the question is what to do next. Next, we basically optimized everything that we could optimize, and we were solving any reliability problems and application-not-responding problems, on mobile and on the backend side. Later, we were just solving any problem possible and optimizing infrastructure usage and also cost. A couple of years after that, we have the following results. First, with Google Analytics we were receiving 85% of the data, and now we receive roughly 97%. Also, it is now not 25% cheaper, it's three times cheaper than Google. In our case, there is still room for further improvement. There are several cases when you can do even more. Now we have 10 times more load than during that period when we were testing against Google Analytics. We are doing pretty well, I think.

We had the next challenges after the rollout. We were never thinking about data completeness, for instance. We were thinking about data accuracy, whether all the data arrives at our storage, but we were never thinking about how good this data actually is. It wasn't. There was a huge data governance problem there because, for instance, null values could be anything. They could be a space character, an empty string, the string "null", zero, anything. We needed somehow to reach the point where our data doesn't require normalization afterwards. We still have such problems, but at least they are not bigger than before. We started measuring data completeness. Of course, when you introduce a new metric and look at it carefully, the numbers are probably not that good, but that's a point of improvement and you can do things to improve, and we did.
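To make the null problem concrete, here is a minimal sketch of normalizing the null-like encodings listed above and computing a completeness ratio over required fields. The function names and the definition of completeness are assumptions for illustration. Zero is deliberately left alone here because it can be a legitimate value, which is part of why producer-side fixes are preferable to guessing downstream.

```typescript
/** Maps null-like encodings (whitespace, empty string, the string "null") to a real null. */
function normalizeNullLike(value: unknown): unknown {
  if (value === null || value === undefined) return null;
  if (typeof value === "string") {
    const trimmed = value.trim();
    if (trimmed === "" || trimmed.toLowerCase() === "null") return null;
  }
  return value; // numbers (including 0), booleans, and objects pass through
}

/** Share of required fields that hold a non-null value after normalization. */
function completeness(event: Record<string, unknown>, requiredFields: string[]): number {
  if (requiredFields.length === 0) return 1;
  const present = requiredFields.filter(
    (field) => normalizeNullLike(event[field]) !== null,
  ).length;
  return present / requiredFields.length;
}
```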

The second challenge was backend reliability because we had a couple of problems to be solved. For instance, one huge problem was that we were losing the data on pod restarts. Sometimes our pods were out of memory and just killed, and when the pod was killed, it was losing all the data it has. We needed to do something around so it doesn't lose the data. Or somebody is sending it again if the pod died. We had also further cost improvements, that were our key results for the last two years. Every quarter, we were looking at the metrics of cost and constantly were optimizing it. We had a bunch of SDK improvements. Let's talk about that. I described a little bit about data completeness. There was this problem with nulls, with inaccurate data. We needed this metric to actually improve the data, first, to convince producers to produce the data more carefully.

Of course, that somehow works, but it actually doesn't. Because if you just tell people, "If you're sending null, please send it as null and not as a string," sometimes they just won't listen to you because they are humans and there is a human factor. After some attempts to convince people to send accurate data, we introduced code generated event models that were stored in the third-party event schemas, so the developers could receive those errors at compile time of the application. They are writing the code, then they compile, and they see that it is supposed to be null, but it's the string "null", and they receive a warning. You see it in your console when you are warned, and you go and fix your thing. That worked much better than just showing metrics and saying, it is supposed to be a numeric value, not a number as a string, please fix it. If you have an error at compilation time, that works much better.
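A sketch of what a generated event model could look like in TypeScript, and how it moves these mistakes to compile time. The event name, its fields, and the `track` function are invented for illustration; the talk does not show the real generated code or SDK API.

```typescript
// Hypothetical output of a schema-to-code generator; the event and its
// fields are invented for illustration.
interface OrderPlacedEvent {
  name: "order_placed";
  orderId: string;
  totalAmount: number;           // a number, not a numeric string such as "12.50"
  discountAmount: number | null; // a real null when absent, not the string "null"
}

/** Stand-in for an SDK call; serializes the event instead of sending it. */
function track(event: OrderPlacedEvent): string {
  return JSON.stringify(event);
}

const payload = track({
  name: "order_placed",
  orderId: "o-123",
  totalAmount: 12.5,
  discountAmount: null,
});

// Either of the following calls is rejected by the compiler with a type
// error, before the application ships:
//   track({ name: "order_placed", orderId: "o-123", totalAmount: "12.50", discountAmount: null });
//   track({ name: "order_placed", orderId: "o-123", totalAmount: 12.5, discountAmount: "null" });
```

The feedback arrives in the developer's own build output, which is the point made above: no dashboard or reminder is needed to notice the mistake.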

We used code generation for that from the schemas. Also, the problem to solve with those schemas was that we had, and we still have, global requirements for events and some local properties that are needed only for the local brands. We require some things for central statistics, and we don't require anything else. This could be documented in one way, and local brands, if they need something, they document it in their own way. Storing those event models at some place that helped with the data governance. You wouldn't need a data governance person carefully looking at the global requirements and keeping it synchronized. You could just store it in one place and just be not thinking about it at all. That was about automation of data governance.
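One way to express the split between global requirements and local brand properties is to compose types, so the central definition lives in one place and each brand only adds its own fields. This is a sketch under assumptions: the property names and the `BrandEvent` helper are hypothetical and not taken from the actual schemas.

```typescript
// Hypothetical global requirements shared by every brand.
interface GlobalEventProps {
  eventName: string;
  timestamp: number;
  brand: string;
}

// A brand's event is the global part plus whatever the brand adds locally.
type BrandEvent<Local extends object> = GlobalEventProps & Local;

// Hypothetical local properties for one brand.
interface ExampleLocalProps {
  loyaltyTier: "basic" | "gold";
}

const example: BrandEvent<ExampleLocalProps> = {
  eventName: "screen_opened",
  timestamp: 1700000000000,
  brand: "example-brand",
  loyaltyTier: "gold",
};

const GLOBAL_REQUIRED = ["eventName", "timestamp", "brand"] as const;

/** Runtime check for untyped producers: lists global fields that are missing. */
function missingGlobalFields(event: object): string[] {
  const record = event as Record<string, unknown>;
  return GLOBAL_REQUIRED.filter(
    (field) => record[field] === undefined || record[field] === null,
  );
}
```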

This slide is about reliability. As I mentioned, we had some problems, that big problem with pods that we had. We resolved it with just making every request synchronous. It had increased latency like seven times maybe, but that was considered acceptable because our SDK works in a non-blocking way on the mobile application. It can be some latency, but with this latency, at least we're not losing any data. Kind of an obvious solution, but we still needed to experience some data loss incident to figure that out, surprisingly. Sometimes you figure out obvious things through incidents and losing money. Also, we introduced gRPC and we have a doubled flow now, not only our processors, but also gRPC writing to our BigQuery. Cost improvement, that is, I think, the most obvious one. We were not archiving data at all, and we introduced data archival. You have some data that you don't expect to be frequently accessed, so you can archive it after some time.
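The synchronous fix for pod restarts described above boils down to one rule: answer with success only after the batch has been handed to durable storage, so a failure surfaces as an error and the client resends. The sketch below simulates this with plain synchronous functions. A real handler would await an asynchronous publish call, and all names here are assumptions.

```typescript
// Stand-in for a durable write (for example, publishing to a message bus).
// It throws when the write fails.
type Publish = (batch: string[]) => void;

/** Server side: acknowledge only after the durable write has succeeded. */
function handleBatch(batch: string[], publish: Publish): number {
  try {
    publish(batch); // nothing is held only in pod memory after the response
    return 200;
  } catch {
    return 500; // tells the client to keep the batch and resend it
  }
}

/** Client side: resend the same batch until it is acknowledged or attempts run out. */
function sendWithResend(
  batch: string[],
  send: (batch: string[]) => number,
  maxAttempts: number,
): boolean {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    if (send(batch) === 200) return true;
  }
  return false; // the caller keeps the batch queued for a later attempt
}
```

The price is latency, since the response now waits for the write, which matches the trade-off described in the talk.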

After some time, we are archiving our data. If you use a cheaper storage, your cost is lower, pretty obvious, but we needed to think about it to save cost. Another thing, we were storing our JSON field as a text and it consumes more storage, and we then just started storing it as a JSON, and it lowered the cost 20%. If you are using less storage, your cost is lower. Of course, if you are a data-intense application. One more thing, also very obvious, if you use standard nodes, that is more expensive than on-demand nodes. What we did, basically, is that we used everything cheaper but still reliable, and that's it. The last thing were our SDK improvements. We had several decisions around our SDK. First of all, how it works. It is not that our developer calls send event, and it's instantly sent. First, it goes to a queue, and then a work manager, in a synchronous manner, on the device, somewhere in the background, it reads from the queue, takes a batch of some events, and is sending to our API.

Not that straightforward, but it is working in non-blocking manner. We had some discussions around how our queue should work, if it should be last-in, first-out, or first-in, first-out. There are still discussions about that. It appears, I think, every year, a couple of times, they're still talking about it, how it should work. Probably, we should make it configurable. Maybe not. Also, we introduced monitoring. It helped us a lot, because we had a couple of data loss incidents when we have launched some A/B tests, and they were very data-intense A/B tests. We started receiving less billable data because of that experiment overflow. We needed to think about monitoring of our queue. We introduced, also because of the same thing, event prioritization, because not all your data is equal. You have data that's less important and more important. For instance, more important, the billable data. The less important, everything else. We have several layers of this importance of data, but the idea is that some data is important, some data is not that important. Of course, we are using exponential backoff, but that's, I think, an industry standard for sending data like that.
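The queue behavior described above, batching, prioritizing billable data, and backing off exponentially between retries, can be sketched as follows. This is not the real SDK: the two priority levels, the eviction policy when the queue is full, and the delay constants are all assumptions chosen to keep the example small (the talk mentions several layers of importance).

```typescript
type Priority = "billable" | "standard";

interface QueuedEvent {
  name: string;
  priority: Priority;
}

class EventQueue {
  private readonly events: QueuedEvent[] = [];
  private readonly capacity: number;

  constructor(capacity: number) {
    if (capacity < 1) throw new Error("capacity must be at least 1");
    this.capacity = capacity;
  }

  /** Adds an event. When full, standard events are sacrificed before billable ones. */
  enqueue(event: QueuedEvent): boolean {
    if (this.events.length >= this.capacity) {
      const victim = this.events.findIndex((e) => e.priority === "standard");
      if (victim >= 0) {
        this.events.splice(victim, 1); // evict the oldest standard event
      } else if (event.priority === "billable") {
        this.events.shift(); // only billable events queued: evict the oldest
      } else {
        return false; // full of billable events: drop the standard newcomer
      }
    }
    this.events.push(event);
    return true;
  }

  /** Removes and returns up to `size` events, billable first, oldest first within a priority. */
  nextBatch(size: number): QueuedEvent[] {
    const ordered = [
      ...this.events.filter((e) => e.priority === "billable"),
      ...this.events.filter((e) => e.priority === "standard"),
    ];
    const batch = ordered.slice(0, size);
    for (const event of batch) {
      this.events.splice(this.events.indexOf(event), 1);
    }
    return batch;
  }
}

/** Delay before retry number `attempt` (0-based), doubling each time up to a cap. */
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 60000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt));
}
```

With this policy, a data-intense experiment can overflow the queue without pushing out billable events, which is the failure mode the monitoring and prioritization were introduced to prevent.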

At the moment, what we took away. The data validation is very important, and it has changed our support line, because now we have different types of errors. Most errors are caught at development time. We are thinking about developers, because our developers are also our stakeholders. Because we introduced this, for instance, event modeling, and of course, developers also use the SDK. Before, we were building our product for product analysts, and we were thinking about analysts the most. What data are they receiving? What use cases the data is serving? Now had to think also about developers, because developers are using our product, and we needed to think about how to make developers produce the better data, and that helped a lot. Also, we were very thankful for the previous, us, for building a very simplistic, but very scalable architecture that saved a lot of time. It could have gone a lot worse.

Summary

To summarize what we did from the very beginning. We created an MVP, and we kept it as simple as possible. We have tested it, load tested it. We tested all the possible outcomes, if we are using GTM or not using GTM, if we are using this SDK or this SDK. Figured out what works the best, and then rolled it out gradually after a successful test. For further improvements, we started receiving more load. We added more features, like this data validation, like event forwarding through third parties. For instance, when we are receiving some of the data, this data goes to some other third parties that work for marketing use cases. We are not only receiving the data, and let our consumers to basically select the data from our storage, but we are also serving as a source for marketing campaigns on Facebook, for instance. We refactored our SDKs as a further development. Interesting moment, that for some markets it's extremely important that the application is as small as possible, because their phones don't have much memory.

We had to make our SDKs as small as possible. We had a couple of challenges around that, and especially a couple of challenges around our code generation thing. Because if you have many event types, obviously you generate more code, and this code should be very small for some markets. There was one challenge around that. We started measuring data completeness, and improved cost, and we're still improving. I think that's a never-ending story. We could improve cost all the time. The benefits from an in-house solution. We have full control on our implementation. We have our own priorities. We are serving our needs. We don't have that challenge when you use a third party and you request them to implement something, and they just don't implement it because they have other priorities, other bigger customer, more important feature. We are just doing what we should do. We have full control on that. Full control also on cost, on cost improvement, on ways of cost improvement. Our compliance officers are also happy because we are not sending data to any third party.

What's Next?

What's next? We are reusing, actually, our tool for more use cases, like this marketing thing, and also, we are reusing it to store application metrics data, like Firebase, for instance. That's a completely different type of data, but still we can use similar storage to serve those additional use cases. Also, we are going to be doing some developments around experimentation data because that's very data-intense, and the data is a little bit different. Also, we are going to be refactoring our mobile SDK further because there is some room for improvement that could give 1% or 2% of the data. We are still losing the data on the phone, but we could improve it. On a large scale, 1% or 2%, it could save much money.

Questions and Answers

Participant 1: With Google Analytics you also get this analytics dashboard. How did you visualize your data and act upon it or filter it? How did you manage to draw some conclusions?

Alina Krasavina: We are using Looker Studio for that. That's pretty ugly, but it serves the purpose.

Participant 2: How did you manage to solve the issue when the pods died and the data is lost inside of the pods.

Alina Krasavina: We made it synchronous. If the pod dies, it just answers 500 something. The client is resending the data.

Participant 3: Most use cases were for your mobile application and your web app, but do you have use cases where maybe you need to track internal application user data, like internal tools and stuff like that. Is there something else that you use or is it the same API?

Alina Krasavina: We are tracking data for the help page, is that what you meant?

Participant 3: Something like internal tools. I'm pretty sure you have internal tools.

Alina Krasavina: Yes, we have internal tools. That depends on what exactly are you calling an internal tool? Because the help page is also an internal tool.

Participant 3: Let's say some management application for some billing maybe.

Alina Krasavina: We are tracking data from the application that the deliverers are using. If that's considered an internal tool? That's kind of an internal tool.

Participant 3: Then in terms of like events that then you have to do with mobile application and stuff, you have very fixed events with internal applications. The kind of tracking that you need for a user journey is very different if you want to figure that out. Is there some event validations, data federation stuff there?

Alina Krasavina: Yes, now we have a similar thing with experiment data and for some other use cases when you actually don't need that validation. Probably we're going to be building some other pipeline that just skips the validation thing.

Participant 4: Before you started building your own, what other alternatives did you consider?

Alina Krasavina: GA4, so moving from the previous version just to GA4, and also using that Snowplow SDK. We were also looking for other mobile SDKs. Google Analytics and the Snowplow SDK were the main things.

 


Recorded at:

Jun 22, 2026
