
Grammarly Replaces its in-House Data Lake with Databricks Platform Using Medallion Architecture

Grammarly adopted the medallion architecture while migrating from its in-house data lake, which stored Parquet files in AWS S3, to a Delta Lake lakehouse. The company created a new event store for over 6,000 event types from 40 internal and external clients and, in the process, improved data quality and reduced data-delivery time by 94%.

Grammarly had been using a bespoke data lake solution called Gnar since 2015. Last year the company decided to replace it because it was approaching scalability limits and made new capabilities challenging to implement. The team chose the Databricks platform, which consists of Apache Spark for data processing, Delta Lake for storage, and Unity Catalog for data governance.

The team used a medallion architecture to organize active data tables and backfill historical data from the old data lake. This data design pattern, also known as the 'multi-hop' architecture, logically organizes data in the lakehouse into three distinct layers, each improving data quality and structure.

Medallion architecture for data ingestion and transformation (source: Databricks Documentation)

The bronze layer contains raw data ingested from external source systems. It focuses on providing a historical archive of source data, which can then be used for data lineage, auditability, and reprocessing if necessary. The silver layer stores conformed and cleansed data from the bronze layer together with the company's reference data. It provides the enterprise view of the data and enables self-service for ad-hoc reporting, analytics, and machine learning (ML). The gold layer of the lakehouse is organized into project-specific databases and uses read-optimized data models. These are used to power customer, product, inventory, and sales analytics, or specific ML/AI use cases.
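The three layers can be illustrated with a minimal pure-Python sketch; the event fields, cleansing rules, and aggregate below are hypothetical examples, not Grammarly's actual pipeline, which runs on Spark and Delta Lake:

```python
from datetime import datetime

# Bronze: raw events as ingested, kept verbatim for lineage and reprocessing.
bronze = [
    {"event": "doc_edit", "user": "u1", "ts": "2023-05-01T10:00:00", "chars": "120"},
    {"event": "doc_edit", "user": "u2", "ts": "2023-05-01T10:05:00", "chars": "bad"},
    {"event": "doc_edit", "user": "u1", "ts": "2023-05-01T11:00:00", "chars": "80"},
]

def to_silver(rows):
    """Silver: conform types and drop records that fail validation."""
    out = []
    for r in rows:
        try:
            out.append({
                "event": r["event"],
                "user": r["user"],
                "ts": datetime.fromisoformat(r["ts"]),
                "chars": int(r["chars"]),  # schema enforcement: invalid rows rejected
            })
        except (ValueError, KeyError):
            continue  # a real pipeline would quarantine these, not silently drop them
    return out

def to_gold(rows):
    """Gold: a read-optimized aggregate, e.g. characters edited per user."""
    agg = {}
    for r in rows:
        agg[r["user"]] = agg.get(r["user"], 0) + r["chars"]
    return agg

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'u1': 200} -- the malformed u2 record was rejected at the silver layer
```

The key property the sketch preserves is that each hop only reads from the previous layer, so any downstream table can be rebuilt from bronze.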

The team had to reconcile differences between Delta Lake and their old platform, which lacked schema enforcement. Previously, they stored all data in a single table-like structure, partitioned by event name and time into chunks of 2-3 million records. The solution stored data files in the Parquet format in AWS S3 and metadata in MySQL. The system used the metadata to route data to the correct S3 files and to execute queries expressed in a custom domain-specific language (DSL).
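A metadata-driven router of this kind might map an event to a partitioned S3 path along these lines; the path layout and names below are illustrative assumptions, not Gnar's actual scheme:

```python
# Hypothetical sketch of partition routing by event name and time,
# with chunk files of 2-3 million records each.

def partition_key(event_name: str, ts: str, chunk: int) -> str:
    """Derive the S3 object path for a chunk of one event type on one day."""
    date = ts[:10]  # ISO-8601 timestamp -> YYYY-MM-DD
    return f"s3://data-lake/events/{event_name}/dt={date}/chunk-{chunk:05d}.parquet"

# The MySQL metadata store would record, per (event, date), which chunk files
# exist, so a DSL query can be dispatched to only the relevant Parquet files.
print(partition_key("doc_edit", "2023-05-01T10:00:00Z", 7))
# s3://data-lake/events/doc_edit/dt=2023-05-01/chunk-00007.parquet
```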

Christian Acuna, technical lead manager at Grammarly, summarizes the learnings from the project:

One of the big lessons we learned was the importance of schema management and enforcing schemas early on to improve data reliability and cost. Our existing system allowed for rapid experimentation and ease of use for clients, but scaling [...] has been costly in terms of cloud resources and the engineering effort required to maintain it.

As Delta Lake requires data schemas for all its tables, the team had to work out target schemas based on the inferred schemas stored in MySQL and handle any schema incompatibilities, which proved challenging. Furthermore, each event type had to be stored in a dedicated table, so the team created a Spark Structured Streaming job to route events to their target tables. A similar approach was used to backfill historical data; addressing the data quality and processing scalability issues it surfaced required an iterative process and close collaboration.
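The schema-reconciliation problem can be sketched in miniature: inferred schemas may disagree across chunks of historical data, and a single target schema must be derived before events can land in a typed Delta table. The type-widening rules below are illustrative assumptions, not Grammarly's actual logic:

```python
# Toy schema merge: widen compatible numeric types, fail loudly on conflicts
# so incompatibilities surface before backfill rather than during it.

WIDENING = {("int", "long"): "long", ("long", "int"): "long",
            ("int", "double"): "double", ("double", "int"): "double"}

def merge_schemas(a: dict, b: dict) -> dict:
    """Merge two inferred schemas (field name -> type name)."""
    merged = dict(a)
    for field, b_type in b.items():
        a_type = merged.get(field)
        if a_type is None or a_type == b_type:
            merged[field] = b_type  # new field, or types already agree
        elif (a_type, b_type) in WIDENING:
            merged[field] = WIDENING[(a_type, b_type)]
        else:
            raise ValueError(f"incompatible types for '{field}': {a_type} vs {b_type}")
    return merged

s1 = {"user": "string", "chars": "int"}
s2 = {"user": "string", "chars": "long", "lang": "string"}
print(merge_schemas(s1, s2))
# {'user': 'string', 'chars': 'long', 'lang': 'string'}
```

Folding this merge over every chunk's inferred schema yields one target schema per event type, or an explicit error naming the fields that need manual resolution.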

Event Store design with bronze and silver layers and ingestion jobs (source: Grammarly Engineering Blog)
