Email Classification at Slack: Designing an Eventually Consistent Custom Classifier

Slack recently published the details of how it built an email address classification engine that can determine if an email address is internal or external. Slack engineers utilized an eventually consistent near real-time representation of the data in its system and implemented a drift detection mechanism to fix erroneous data, keeping the engine's operation in order.

Sarah Henkens, a staff software engineer at Slack, describes the motivation behind the team's architecture decisions:

To provide smart suggestions to our end-users, we developed a classification engine that can determine if an email address is internal or external. To support this, we need a near real-time representation of the total number of users each team has grouped by domain and role.

A backfill to perform aggregation for millions of users would have been too expensive to execute on a daily basis. By designing an eventually consistent data model, we could roll out this feature without backfills and with real-time data updates to provide a highly accurate data model for predicting email types in context to your team.

When Slack users invite colleagues to their Slack workspace, Slack needs to determine whether they intend to invite internal or external collaborators. Slack then recommends using standard Workspace Invites for internal employees and Slack Connect invites for external ones. The following diagram describes the process.

Source: https://slack.engineering/email-classification/

First, Slack engineers compare each invitee's email domain against the tenant's domain settings and the inviter's domain. This comparison runs in constant time, O(1), because it relies only on data already available in the execution context.
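The domain comparison described above can be sketched as follows. This is a minimal illustration, not Slack's code: the function names, the `tenant_domains` set, and the choice to treat a match as "internal" are all assumptions made for the example.

```python
def domain_of(email: str) -> str:
    """Return the lowercased domain part of an email address."""
    return email.rsplit("@", 1)[1].strip().lower()


def classify_by_domain(invitee_email, inviter_email, tenant_domains):
    """Return "internal" on a domain match, else None (undecided).

    Assumption: a match against the tenant's configured domains or the
    inviter's own domain suggests an internal invitee.
    """
    domain = domain_of(invitee_email)
    # Both checks use data already in memory: a set lookup and a comparison.
    if domain in tenant_domains or domain == domain_of(inviter_email):
        return "internal"
    return None  # fall through to the more expensive Team Context lookup


tenant_domains = frozenset({"example.com"})  # hypothetical tenant setting
first = classify_by_domain("ann@Example.com", "bob@corp.io", tenant_domains)
second = classify_by_domain("cat@other.org", "bob@corp.io", tenant_domains)
```

Here `first` is decided from in-memory data alone, while `second` stays undecided and would need the database-backed check.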

In contrast, the Team Context lookup requires a database round-trip per domain. Henkens explains the problem:

Since teams at Slack can grow to over a million users, the query to perform this aggregation check in real-time during the classification process would be too expensive. We had to design a data model that reflects the aggregate dataset in real-time.

We use Vitess to store the aggregate data in a simple table sharded by team_id. Since classification always happens in the context of a single team, this ensures that we always only hit a single shard.

Each shard maintains a materialized view that records, per tenant and role, how many times each email domain appears. The classification engine can then query this view and make a heuristic decision based on the counts. Below is the view's schema.

CREATE TABLE `domains` (
   `team_id` bigint unsigned NOT NULL,
   `domain` varchar NOT NULL,
   `count` int NOT NULL DEFAULT '0',
   `date_update` int unsigned NOT NULL,
   `role` varchar NOT NULL,
   PRIMARY KEY (`team_id`,`domain`,`role`)
)
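To illustrate how such a table might be queried during classification, here is a small sketch that mirrors the schema in an in-memory SQLite database. SQLite stands in for Vitess/MySQL purely so the example is runnable; the column types are adapted accordingly, and the role names and the `counts_by_role` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE domains (
           team_id INTEGER NOT NULL,
           domain TEXT NOT NULL,
           "count" INTEGER NOT NULL DEFAULT 0,
           date_update INTEGER NOT NULL,
           role TEXT NOT NULL,
           PRIMARY KEY (team_id, domain, role)
       )"""
)
# Sample rows; the values and role names are made up for illustration.
conn.executemany(
    "INSERT INTO domains VALUES (?, ?, ?, ?, ?)",
    [
        (1, "example.com", 120, 0, "member"),
        (1, "example.com", 3, 0, "guest"),
        (2, "example.com", 7, 0, "member"),
    ],
)


def counts_by_role(conn, team_id, domain):
    """Fetch the per-role counts for one domain within one team."""
    rows = conn.execute(
        'SELECT role, "count" FROM domains WHERE team_id = ? AND domain = ?',
        (team_id, domain),
    ).fetchall()
    return dict(rows)


team_one = counts_by_role(conn, 1, "example.com")
```

Because the query always filters on a single `team_id`, it only ever touches rows belonging to that team, which is what allows a table sharded by `team_id` to answer it from one shard.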

Whenever users, for example, join a workspace or change their email address, an event is generated in the system. A mutation job reads these events and applies appropriate mutations on the materialized view instead of constantly recalculating it.

Source: https://slack.engineering/email-classification/
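The incremental mutation approach can be sketched as below. The event shapes, event type names, and function names are assumptions for illustration, and a `Counter` stands in for the materialized view.

```python
from collections import Counter


def deltas_for(event):
    """Translate one event into (key, delta) pairs.

    A key is (team_id, domain, role), matching the view's primary key.
    """
    team, role = event["team_id"], event["role"]
    if event["type"] == "user_joined":
        return [((team, event["domain"], role), +1)]
    if event["type"] == "email_changed":
        return [
            ((team, event["old_domain"], role), -1),
            ((team, event["new_domain"], role), +1),
        ]
    return []  # unknown event types leave the view untouched


def apply_event(view, event):
    """Apply an event's deltas to the view instead of recomputing it."""
    for key, delta in deltas_for(event):
        view[key] += delta


view = Counter()
apply_event(view, {"type": "user_joined", "team_id": 1,
                   "domain": "example.com", "role": "member"})
apply_event(view, {"type": "email_changed", "team_id": 1,
                   "old_domain": "example.com", "new_domain": "corp.io",
                   "role": "member"})
```

Each event costs a constant number of small updates, regardless of how many users the team has.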

However, asynchronous job queues typically do not guarantee that each mutation event is handled exactly once. Events that are duplicated or dropped cause the total counts to drift over time, and the engineers must handle this drift. One possible solution is to periodically rebuild the view from scratch.

Instead, Slack engineers decided to build a Healer component. The Healer is triggered whenever a domain's aggregate count falls below zero, which indicates incorrect data. It then compares the aggregate data with the real-time data and performs an update to correct it. Since heal operations are simple UPSERTs that apply a +N or -N to the existing count value in the database, mutations performed while the healer is running are never lost, and processing does not need to pause.
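The reason a relative correction preserves concurrent mutations can be shown with a short sketch. A dictionary stands in for the database table, `apply_delta` stands in for an UPSERT of the form "count = count + N", and all names and numbers are hypothetical. The sketch also assumes the stored and real counts are read at the same point in time.

```python
def heal_delta(stored_count, actual_count):
    """Relative correction that moves the stored aggregate to the real value."""
    return actual_count - stored_count


def apply_delta(view, key, delta):
    """Relative update, standing in for an UPSERT of count = count + delta."""
    view[key] = view.get(key, 0) + delta


key = (1, "example.com", "member")
view = {key: -1}   # drifted below zero, which triggers the healer
actual = 4         # hypothetical count derived from the real user data

delta = heal_delta(view[key], actual)  # correction computed from a snapshot

apply_delta(view, key, +1)     # a user joins while the healer is running
apply_delta(view, key, delta)  # heal is applied as a delta, not "set to 4"
```

The final value reflects both the correction and the join that arrived in the meantime. Writing the absolute value 4 at the last step would have silently discarded that join.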

Slack's next step is to utilize deep learning to enhance the email classification engine further and provide more accurate results.
