Slack Leverages Bespoke Tracing Architecture for Message Notifications

Slack used its bespoke tracing architecture to investigate notification-delivery issues. Tracing helped resolve notification issues 30% faster and reduced escalations to the development team. It also simplified the analytics pipeline and unlocked new use cases for the data science team.

Message notifications are a key element of Slack’s user experience. However, the notification flow spans many components of Slack’s overall platform, both server-side and client-side, so issues reported to the customer experience team can be tricky to investigate. Development teams often had to spend many days looking through multiple systems with different logging backends and formats.

Source: https://slack.engineering/tracing-notifications/

Slack had previously built a bespoke tracing architecture, SlackTrace, and uses it to trace regular message delivery, where one percent of client requests are traced. The company decided to create its own tracing solution after concluding that none of the available third-party solutions fully met its needs.

For tracing message notifications, the team mapped the flow to a trace by identifying notable events and determining attribute mappings. They decided to separate notification traces from the message request traces. This way, they could support 100 percent sampling for notification flows, which Slack’s customer experience team requested.
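The difference between the two sampling rates can be illustrated with a small sketch. The article does not describe how Slack makes its sampling decision, so everything here is an assumption: the flow names, the `SAMPLE_RATES` table, and the hash-based `should_trace` helper are hypothetical. The sketch only shows why keeping notification traces separate lets them run at 100 percent while message requests stay at one percent.

```python
import hashlib

# Hypothetical per-flow sampling rates. The article only states the two
# percentages; the flow names and this table are illustrative assumptions.
SAMPLE_RATES = {
    "message_request": 0.01,  # one percent of client requests
    "notification": 1.0,      # every notification flow
}


def should_trace(flow: str, trace_id: str) -> bool:
    """Decide deterministically whether a trace for this flow is kept.

    Hashing the trace ID means every component that sees the same ID
    reaches the same decision without coordination.
    """
    rate = SAMPLE_RATES.get(flow, 0.0)  # unknown flows are not traced
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < rate
```

Because the decision depends only on the flow and the trace ID, raising the rate for one flow does not change the volume of the other.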

Notification tracing has improved issue triage and debugging. Customer experience team members can use trace data themselves to understand what went wrong and answer a customer’s query without involving the development team. The new functionality also helped iOS and Android engineers to start using Grafana to monitor notification delivery in mobile applications. Lastly, the data science team has derived insights from the tracing data. They computed funnel analytics to understand notification open rates better and identified bugs in the application and the instrumentation code using historical notification traces.

Suman Karumuri, a senior staff software engineer at Slack, summarizes the benefits of tracing:

Modeling product analytics data as traces provides high-quality data in a consistent data format across all of our complex stack. Further, the built-in sessionization of trace data simplified our analytics pipeline by eliminating additional jobs to de-dupe and sessionize the trace data.

The SlackTrace architecture consists of a Go web server application that publishes trace span events to Apache Kafka and a Go consumer service that persists the events into the real-time store (Elasticsearch) and the data warehouse. Backend services use Zipkin and Jaeger instrumentation libraries to report spans, which are converted into the internal span representation, while desktop and mobile apps use the span API directly.
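The conversion step for backend spans might look roughly like the sketch below. The input keys (`traceId`, `id`, `parentId`, `name`, `timestamp`, `duration`, `tags`, `localEndpoint.serviceName`) follow the public Zipkin v2 JSON span format. The `InternalSpan` class and its field names are hypothetical, since the article does not give the schema of Slack's internal representation, and the real services are written in Go rather than Python.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class InternalSpan:
    # Hypothetical internal representation; these field names are
    # assumptions for illustration, not Slack's actual schema.
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    name: str
    start_micros: int
    duration_micros: int
    tags: dict = field(default_factory=dict)


def from_zipkin_v2(span: dict) -> InternalSpan:
    """Convert one parsed Zipkin v2 JSON span into an InternalSpan."""
    tags = dict(span.get("tags", {}))
    # Keep the reporting service name as an ordinary tag.
    service = span.get("localEndpoint", {}).get("serviceName")
    if service is not None:
        tags["service"] = service
    return InternalSpan(
        trace_id=span["traceId"],
        span_id=span["id"],
        parent_id=span.get("parentId"),       # absent on root spans
        name=span.get("name", ""),
        start_micros=span.get("timestamp", 0),  # epoch microseconds
        duration_micros=span.get("duration", 0),
        tags=tags,
    )
```

Flattening endpoint details into tags is one possible design choice; it keeps the converted span generic enough that client-reported spans can share the same shape.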

Source: https://slack.engineering/tracing-at-slack-thinking-in-causal-graphs/

Slack has opted for a simple representation of trace spans, which makes the solution more flexible and less centered on request and network tracing. The simple span structure allows the data to be stored in a single table and supports a wide range of queries, so engineers can extract exactly the data they need to answer specific questions.
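To make the single-table idea concrete, the sketch below stores a few spans in an in-memory SQLite table and computes a simple notification funnel with one query. The column names, the span names such as `notification.sent`, and the sample rows are all invented for illustration; Slack's actual schema and storage backends differ. Counting distinct trace IDs per span name relies on the trace ID acting as a session key, which is the built-in sessionization mentioned in the quote above.

```python
import sqlite3

# One flat table holds every span; column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE spans (
           trace_id TEXT, span_id TEXT, parent_id TEXT, name TEXT,
           start_micros INTEGER, duration_micros INTEGER, tags TEXT)"""
)

# Invented sample data: three notification traces at different stages.
rows = [
    ("t1", "a", None, "notification.sent", 0, 50, "{}"),
    ("t1", "b", "a", "notification.received", 60, 20, "{}"),
    ("t1", "c", "b", "notification.opened", 500, 5, "{}"),
    ("t2", "d", None, "notification.sent", 0, 40, "{}"),
    ("t2", "e", "d", "notification.received", 55, 25, "{}"),
    ("t3", "f", None, "notification.sent", 0, 45, "{}"),
]
conn.executemany("INSERT INTO spans VALUES (?, ?, ?, ?, ?, ?, ?)", rows)


def funnel(connection: sqlite3.Connection) -> dict:
    """Return the number of distinct traces that reached each step."""
    cursor = connection.execute(
        "SELECT name, COUNT(DISTINCT trace_id) FROM spans GROUP BY name"
    )
    return dict(cursor.fetchall())


print(funnel(conn))
```

With the sample rows above, three traces reach `notification.sent`, two reach `notification.received`, and one reaches `notification.opened`, so an open rate falls out of a single grouped query without a separate sessionization job.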
