Event Sourcing to the Cloud at HomeAway

by Srini Penchikala on Nov 05, 2018. Estimated reading time: 6 minutes

Adam Haines, Data Architect at HomeAway, recently spoke at the Data Architecture Summit 2018 Conference about how his team leverages the event sourcing cloud design pattern to accelerate the big data initiatives in their organization.

Data architecture patterns like event sourcing help address new challenges teams face when migrating to the cloud. These challenges include eventual consistency, legacy service dependencies, elastic scale, and unreliable networking. The HomeAway team's main goal in the cloud migration was to minimize technical debt in the cloud platform. They had data architecture challenges like limited replay of the events, little auditability, copies of data everywhere, limited discoverability & lineage, data drift and some resiliency challenges as well.

Event sourcing changes the paradigm of data storage by storing data as a sequence of domain state changes. Data is stored like a log that services can consume. Event sourcing is based on an immutable log of change and offers maximum auditability. It comes with challenges like eventual consistency, event iteration, and data growth.
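To make the log idea concrete, here is a minimal sketch of an append-only event log and a replay function that rebuilds current state from it. It is an illustration only, not HomeAway's implementation: the `EventLog` class, the event names (`ReservationCreated`, `DatesChanged`, `ReservationCancelled`), and the in-memory storage are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    sequence: int   # position in the log, starting at 1
    kind: str       # hypothetical event name, e.g. "ReservationCreated"
    data: dict      # the state change carried by this event


class EventLog:
    """Append-only, in-memory log (a stand-in for a real event store)."""

    def __init__(self):
        self._events = []

    def append(self, kind, data):
        # Events are only ever added; nothing is updated or deleted.
        event = Event(sequence=len(self._events) + 1, kind=kind, data=dict(data))
        self._events.append(event)
        return event

    def read(self, after=0):
        # Consumers read everything after the last sequence they have seen.
        return [e for e in self._events if e.sequence > after]


def apply(state, event):
    """Fold one event into the current state of a reservation."""
    if event.kind == "ReservationCreated":
        return {**event.data, "status": "booked"}
    if event.kind == "DatesChanged":
        return {**state, **event.data}
    if event.kind == "ReservationCancelled":
        return {**state, "status": "cancelled"}
    return state  # unknown events are ignored in this sketch


def replay(events):
    """Rebuild current state by applying every event in order."""
    state = {}
    for event in events:
        state = apply(state, event)
    return state
```

Because the log is never mutated, replaying a prefix of it yields the state at that point in history, which is where the auditability comes from.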

Haines also discussed another design pattern that goes hand in glove with event sourcing, Command Query Responsibility Segregation (CQRS), which separates the write and read concerns and allows applications and services to scale independently.

At HomeAway, the event sourcing solution includes making events canonical, capturing delta changes (change data capture), "exactly once" semantics for messages, guaranteed ordering and unblocking services.
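The article does not describe how HomeAway implements these guarantees. One common way to approximate "exactly once" processing on top of at-least-once delivery is an idempotent consumer that remembers the last sequence number it applied for each key. The sketch below assumes every event carries a per-key sequence number; the `IdempotentConsumer` class and its running-total state are hypothetical.

```python
class IdempotentConsumer:
    """Applies each (key, sequence) at most once and rejects out-of-order gaps."""

    def __init__(self):
        self.last_seq = {}  # key -> highest sequence number applied so far
        self.totals = {}    # key -> running total (example state only)

    def handle(self, key, seq, amount):
        last = self.last_seq.get(key, 0)
        if seq <= last:
            # Redelivered or stale event: applying it again would double count.
            return False
        if seq != last + 1:
            # A gap means an earlier event has not arrived yet.
            raise ValueError(f"gap for {key}: expected {last + 1}, got {seq}")
        self.totals[key] = self.totals.get(key, 0) + amount
        self.last_seq[key] = seq
        return True
```

Redelivering the same event is then harmless, and an ordering violation surfaces as an error instead of silently corrupting state.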

They also used the Strangler pattern successfully for incremental migration to the cloud, asset capture, event redirection and routing the requests.

InfoQ spoke with Adam Haines about the event sourcing pattern, lessons learned from using it in their organization, and recommended practices.

InfoQ: Can you discuss why event structure matters in event sourcing solutions?

Adam Haines: Event structure matters because this is the basis for being either data-centric or application-centric. The more the data shape is dictated by the service writing it, the further we get from true data democratization. Each and every event should be very granular and bound to a known real-world thing. For example, a reservation is not a tangible thing, but has a very well-known set of attributes/properties that define it simply. A reservation can have a person, check-in date, check-out date, a property, a price, etc. If we have done our job right, the service/platform/function writing the event is irrelevant because the reservation domain is already interoperable with all other platforms and services. This is the essence of a data-centric architecture.
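As an illustration of a canonical, writer-independent event shape, the sketch below models the reservation attributes Haines lists. The field names, the validation rules, and the JSON encoding are assumptions made for the example and are not HomeAway's schema.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date


@dataclass(frozen=True)
class ReservationEvent:
    """Hypothetical canonical shape for a reservation, independent of its writer."""

    reservation_id: str
    guest_id: str       # the person
    property_id: str    # the property
    check_in: date
    check_out: date
    price_cents: int    # integer cents avoid floating-point rounding
    currency: str

    def __post_init__(self):
        # The domain rules travel with the event, not with a particular service.
        if self.check_out <= self.check_in:
            raise ValueError("check_out must be after check_in")
        if self.price_cents < 0:
            raise ValueError("price_cents must not be negative")

    def to_json(self):
        payload = asdict(self)
        payload["check_in"] = self.check_in.isoformat()
        payload["check_out"] = self.check_out.isoformat()
        # Sorted keys give every writer the same serialized form.
        return json.dumps(payload, sort_keys=True)
```

Two different writers that describe the same reservation produce equal objects and identical JSON, so consumers never need to know which service emitted the event.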

InfoQ: What are event sourcing benefits and challenges?

Haines:

Benefits: Event sourcing allows services to separate their read and write concerns and truly allows services to encapsulate data. Having full encapsulation not only prevents a death star architecture, but also reduces integration cost for each microservice. One of the biggest advantages of an event sourcing architecture is data democratization. Having data in the center of the architecture allows services to easily discover and subscribe, which is essential for developer velocity and implementing near-real-time experiences. Event sourcing also opens the door for pattern-based programming. If the pattern and libraries are set in place, the goal should be to have an entry-level engineer execute the development lifecycle with very little ramp-up time or training. Event sourcing provides a great audit trail as the entirety of history is persisted, which makes auditing and visualizing what happened very easy. I think this is a very critical aspect as services become more asynchronous, as customers need real-time updates or feedback about the state of their transaction.

Challenges: Due to the nature by which we have been programming for the last decade, there is a certain expectation that I can read my own write consistently. While read/write separation of concerns counts as a benefit, it also counts as a detriment. With an event sourcing model, software engineers now have to really think through idempotent operations and how to handle inconsistent results. Another challenge is data growth. An event source will start off fast and manageable, because everything is fast with little data. It quickly becomes apparent that as data grows, so does the storage footprint and the time it takes to get back to current state. I think it is important to have a strategy (recovery and/or snapshotting) around minimizing the cost and impact of replays.
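The snapshotting strategy mentioned above can be sketched as follows. This is a minimal illustration under assumed names (`Snapshot`, `rebuild`) and an assumed event shape of `(sequence, reservation_id, status)` tuples. A real store would fetch only the events after the snapshot instead of skipping through a full list.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Snapshot:
    sequence: int = 0                           # last event folded into `state`
    state: dict = field(default_factory=dict)   # reservation_id -> status


def apply(state, event):
    """Fold one (sequence, reservation_id, status) event into the state."""
    _, reservation_id, status = event
    state[reservation_id] = status
    return state


def rebuild(events, snapshot=None):
    """Return (new snapshot, number of events replayed)."""
    if snapshot is None:
        snapshot = Snapshot()
    state = copy.deepcopy(snapshot.state)  # never mutate the stored snapshot
    last = snapshot.sequence
    replayed = 0
    for event in events:
        if event[0] <= snapshot.sequence:
            continue  # already reflected in the snapshot
        state = apply(state, event)
        last = event[0]
        replayed += 1
    return Snapshot(sequence=last, state=state), replayed
```

Rebuilding from a snapshot taken at sequence N replays only the events after N, so the cost of getting back to current state depends on how recent the snapshot is and not on the full length of history.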

InfoQ: Can you talk about the event sourcing architecture at HomeAway?

Haines: HomeAway has really increased the flexibility and elasticity of its architecture by transitioning into an event-driven architecture. It has been a long road (that we are still on) that started with a messaging architecture; however, we saw that data needs to be the center of what we do. We needed to be more agile and adaptive in how we code. The first step was to make consumption and production of data easier. Our goal is to root events at the very core of our business. Leveraging events and event sourcing has given us greater insights, more flexibility, and the ability to take greater advantage of the cloud, while aiding our effort to deprecate legacy data center services.

InfoQ: What technologies do you use in your solution?

Haines: We have two technologies that we use for event sourcing at HomeAway: Kafka and Photon. Photon is a highly distributed, write-optimized event source platform created internally to solve event sourcing challenges.

A key point that I want to drive home is that technology should be agnostic; however, that is not to say without trade-offs. My goal is to lay down the foundation of the event sourcing pattern and to present the pros and cons for each technology choice. With that said, careful consideration should be taken when deciding what technology backs the event source. Not all persistence is created equal and not all platforms offer the same properties or guarantees as a database.

InfoQ: You talked about Strangler Pattern in your presentation. Can you discuss how your team used this pattern in developing event source architecture?

Haines: The data engineering team realized very early in our cloud journey that services and databases were so integrated that it would be nearly impossible to unwind them.

The problem we had to solve was how to move. We had to find a way to take mutation events from a legacy platform and push those to the cloud.

The easiest way to do that is to tap into a change data capture (CDC) stream. A CDC stream allows our services to subscribe to legacy system mutations and push them into a destination.

Due to the nature of what we were storing and how we were storing it, the only real option was an event source. Event sourcing allowed us to take a data-centric view of the world: a view where it did not matter if a cloud service or change data capture wrote the event, as each service is writing the same thing.
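The idea that a CDC stream and a cloud service can write "the same thing" can be sketched as two writers that emit one shared event shape. Everything specific here is hypothetical: the change record layout (`op`, `before`, `after`), the legacy column names (`RES_ID`, `GUEST_ID`, `ARRIVE_DT`, `DEPART_DT`), and the event type names are invented for the example and are not taken from HomeAway's systems.

```python
# Hypothetical mapping from row-level operations to domain event types.
OP_TO_EVENT = {
    "insert": "ReservationCreated",
    "update": "ReservationChanged",
    "delete": "ReservationDeleted",
}


def event_from_cdc(change):
    """Translate a legacy row mutation into the shared event shape."""
    # A delete has no "after" image, so fall back to the row as it was.
    row = change["before"] if change["op"] == "delete" else change["after"]
    return {
        "type": OP_TO_EVENT[change["op"]],
        "reservation_id": str(row["RES_ID"]),
        "guest_id": str(row["GUEST_ID"]),
        "check_in": row["ARRIVE_DT"],
        "check_out": row["DEPART_DT"],
    }


def event_from_service(reservation_id, guest_id, check_in, check_out):
    """What a cloud-native service would write for a new reservation."""
    return {
        "type": "ReservationCreated",
        "reservation_id": reservation_id,
        "guest_id": guest_id,
        "check_in": check_in,
        "check_out": check_out,
    }
```

Consumers of the event source see identical events from either path, which is what lets the legacy writer be retired later without changing any subscriber.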

After trialing the Strangler pattern, we realized the value and power of the events. This is when we really began to push event sourcing as a means to make the company more data-centric.

He also talked about how the CQRS and event sourcing patterns work together to make systems more resilient.

Command Query Responsibility Segregation (CQRS) is a big part of delivering value for event sourcing and achieving cloud scale. CQRS is really about separating read services from write services. Decoupling these services allows us to scale reads and writes independently and adds a layer of resiliency. In a traditional microservice world, services may delegate to other microservices; however, this comes at an integration cost and negatively affects resiliency. HomeAway leverages CQRS to truly feed and encapsulate data, so the loss of a service is less impactful. In most cases, the worst case scenario is latent data, which puts services in a more degraded than down state. In the cloud, one should expect to see services and resources constantly going down and coming back up. The cloud simply is not as stable as what we are used to in the data center, and event sourcing + CQRS helps us to minimize this risk.

We have found the query layer (the Q in CQRS) to be best represented by a graph, as this allows for better exploration of data across domains, which is typically a requirement for encapsulated microservices or business objects.
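The read/write separation described above can be sketched with a write service that only appends events and a read service that builds its own view from the log at its own pace. This is a simplified illustration: the shared Python list stands in for the event source, and the class and event names are assumptions. The read model here is a plain dictionary, not the graph Haines mentions.

```python
class WriteService:
    """Command side: validates a command and appends an event. It never serves reads."""

    def __init__(self, log):
        self.log = log

    def book(self, reservation_id, property_id):
        self.log.append({
            "type": "ReservationBooked",
            "reservation_id": reservation_id,
            "property_id": property_id,
        })


class ReadService:
    """Query side: keeps a private view that is updated from the log."""

    def __init__(self, log):
        self.log = log
        self.offset = 0         # how far into the log this view has read
        self.by_property = {}   # property_id -> list of reservation ids

    def catch_up(self):
        new_events = self.log[self.offset:]
        for event in new_events:
            self.by_property.setdefault(event["property_id"], []).append(
                event["reservation_id"]
            )
        self.offset += len(new_events)
        return len(new_events)

    def reservations_for(self, property_id):
        # Served entirely from the local view; no call to the write service.
        return list(self.by_property.get(property_id, []))
```

Until `catch_up` runs, queries return older data instead of failing, which matches the "degraded rather than down" behavior described in the interview. If the write service is unavailable, the read service keeps answering from its existing view.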

Read my own Writes by Adrian Ivan

"Due to the nature by which we have been programming for the last decade, there is a certain expectation that I can read my own write consistently."
Any concrete ideas on how you solved this?

Re: Read my own Writes by Adam Haines

This is a tough problem to solve and comes with trade-offs. I think the most immediate solution is to have a write-through cache (in a persisted data platform like Cassandra, MongoDB, etc.) and then leverage change data capture to feed the long-term system of record.

The way I like to think about this is an experience cache, where the experience is localized; however, until something significant happens we don't need to persist to the system of record. Something significant may be the end of a user workflow, like registration.

This pattern allows apps/services to read their own writes and allows for CQRS and eventual consistency downstream.
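A rough sketch of the experience cache idea follows. It is an interpretation of the comment above, not a described implementation: the `ExperienceCache` and `SystemOfRecord` classes are hypothetical, the in-memory dictionary stands in for a persisted store such as Cassandra or MongoDB, and `drain_changes` stands in for a change data capture feed.

```python
class ExperienceCache:
    """Local store a service writes to and reads from during a user workflow."""

    def __init__(self):
        self._data = {}
        self._changes = []  # captured mutations, in write order

    def write(self, key, value):
        self._data[key] = value
        self._changes.append((key, value))  # what a CDC feed would pick up

    def read(self, key):
        # The service reads its own writes immediately from the local store.
        return self._data.get(key)

    def drain_changes(self):
        # Hand over everything captured since the last drain.
        changes, self._changes = self._changes, []
        return changes


class SystemOfRecord:
    """Long-term event source, fed asynchronously."""

    def __init__(self):
        self.events = []

    def ingest(self, changes):
        self.events.extend(changes)
```

The service sees its own writes at once, while the system of record only catches up when the captured changes are drained, for example at the end of a workflow such as registration. Downstream consumers therefore remain eventually consistent.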

Re: Read my own Writes by Adrian Ivan

Interesting. The major downside is that the more complex business rules get (plus other concerns like Authorization), the more duplication is required, both in the main SoR and the Experience Cache. I guess it depends case by case and by type of application.
Thank you for sharing.
