
DynamoDB Data Transformation Safety: from Manual Toil to Automated and Open Source


Key Takeaways

  • Data is the backbone of many SaaS-based services today.
  • With the dynamic nature of data and cloud services, data transformation is a common need due to changing engineering requirements.
  • Data transformation remains a continuous challenge in engineering and is still largely built on manual toil.
  • There is currently a lack of tools for performing data transformations programmatically, safely, and in an automated way.
  • The open source utility Dynamo Data Transform was built to simplify data transformation for DynamoDB-based systems and to add safety and guardrails - it grew out of a robust manual framework that was then automated and open sourced.

 

When designing a product to be a self-serve developer tool, there are often constraints - but likely one of the most common ones is scale. Ensuring our product, Jit - a security-as-code SaaS platform, was built for scale was not something we could embed as an afterthought, it needed to be designed and handled from the very first line of code.

We wanted to focus on developing our application and its user experience, without scale and infrastructure issues becoming a constant struggle for our engineers. After researching the infrastructure that would enable this for our team, we decided to use AWS with a serverless-based architecture.

AWS Lambda is becoming an ever-popular choice for fast-growing SaaS systems, as it provides plenty of benefits for scale and performance out of the box, along with the database that commonly backs these systems, AWS’s DynamoDB.

One of DynamoDB’s key benefits is that it is already part of the AWS ecosystem, which abstracts away many of the operational tasks of management and maintenance, such as maintaining connections with the database, and it requires minimal setup to get started in AWS environments.
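The code samples throughout this article call DynamoDB through a DocumentClient from the AWS SDK for JavaScript v2 (that is what the `.promise()` style calls rely on). As a point of reference only, a minimal setup sketch for the `dynamodbClient` variable used in the snippets below could look like this (the region default is just illustrative):

// Minimal setup sketch for the DocumentClient reused by the snippets in this article.
// Assumes the AWS SDK for JavaScript v2; the region value is an illustrative default.
const AWS = require('aws-sdk');

const dynamodbClient = new AWS.DynamoDB.DocumentClient({
  region: process.env.AWS_REGION || 'us-east-1',
});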

As a fast-growing SaaS operation, we need to evolve quickly based on user and customer feedback and embed this within our product. Many of these changes in application design have a direct impact on data structures and schemas.

With rapid and oftentimes significant changes in the application design and architecture, we found ourselves needing to make data transformations in DynamoDB very often, and of course, with existing users, it was a priority that this be achieved with zero downtime. (In the context of this article Data Transformation will refer to modifying data from state A to state B).

Challenges with Data Transformation

In the spirit of Brendon Moreno from the UFC:

Maybe not today, maybe not tomorrow, and maybe not next month, but only one thing is true, you will need to make data transformations one day, I promise.

Yet, while data transformation is a known constant in engineering and data engineering, it remains a pain point and a challenge to perform seamlessly. Surprisingly enough, there is currently no easy way to do it programmatically in DynamoDB in a managed way.

While data transformation takes many forms - replacing an existing item’s primary key, adding or removing attributes, updating existing indexes, and the list goes on - there remains no simple way to perform any of these in a managed and reproducible manner without resorting to breakable, one-off scripting.

User Table Data Transform Example

Below, we are going to dive into a real-world example of a data transformation process with production data.

Let’s take the example of splitting a “full name” field into its components “first name” and “last name”. As you can see in the example below, the data aggregation currently writes names to the table with a “full name” attribute, but let’s say we want to split this field into separate first name and last name fields.

Before

  Id    FullName
  123   Guy Br

After

  Id    FirstName    LastName
  123   Guy          Br

Looks easy, right? Not so fast. To achieve even this simple change, these are the steps that need to be performed on the business logic side in order to successfully transform the data:

  • Scanning the user records
  • Extracting the FullName attribute from each record
  • Splitting the FullName attribute into new FirstName and LastName attributes
  • Saving the new records 
  • Cleaning up the FullName attribute

But let's discuss some of the issues you would need to take into account before you even get started. How do you run and manage these transformations in different application environments, particularly when direct access to each environment is not considered a security best practice? In addition, you need to think about service dependencies: what should you do when another service depends on this specific data format? Your service needs to remain backward compatible and still provide the same interface to the external services relying on it.
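For example, one way to keep external consumers unaffected during the transition (a minimal sketch, not taken from our codebase; the `getUserFullName` helper is hypothetical) is to have the read path understand both the old and the new format:

// Hypothetical helper: keeps the external interface stable while the stored format changes.
// Prefers the new FirstName/LastName attributes and falls back to the legacy FullName.
async function getUserFullName(id) {
  const { Item } = await dynamodbClient.get({
    TableName: 'Users',
    Key: { Id: id },
  }).promise();
  if (!Item) return undefined;
  return Item.FirstName ? `${Item.FirstName} ${Item.LastName}` : Item.FullName;
}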

When you have production clients, possibly one of the most critical questions you need to ask yourself before you modify a single line of code is: how do you ensure that zero downtime will be maintained?

Some of the things you’d need to plan for to avoid any downtime revolve around testing and verification. How do you even test your data transformation script? What are some good practices for running a reliable dry run of a data transformation on production data?
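One simple pattern (a sketch of the idea, not our exact implementation) is to build a dry-run switch into the transformation script itself, so the same code path can be exercised without writing anything:

// Sketch of a dry-run guard: print the would-be writes instead of mutating the table.
async function saveItems(items, isDryRun) {
  if (isDryRun) {
    console.log(JSON.stringify(items, null, 2));
    return;
  }
  await Promise.all(items.map((item) => dynamodbClient.put({
    TableName: 'Users',
    Item: item,
  }).promise()));
}

We will come back to dry runs later, since the tool we ended up building supports them out of the box.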

There are so many things to consider before transforming data.

Now consider that this is usually, for the most part, done manually. What an error-prone, tedious process! It looks like we need a fine-grained process that prevents mistakes and helps us manage all of these steps.

To avoid this, we understood we’d need to define a process that would help us tackle the challenges above.

The Rewrite Process

Figure 1: Rewrite Process Flow Chart

First, we adjusted the backend code to write the new data format to the database while still keeping the old one - writing FullName alongside FirstName and LastName - to give us some reassurance of backward compatibility. This would allow us to revert to the previous format if something went terribly wrong.

async function createUser(item) {
   // FullName = 'Guy Br'
   // 'Guy Br'.split(' ') === ['Guy', 'Br']
   // Just for the example assume that the FullName has one space between first and last name
   const [FirstName, LastName] = item.FullName.split(' ');
   const newItemFormat = { ...item, FirstName, LastName };
   return dynamodbClient.put({
       TableName: 'Users',
       Item: newItemFormat,
   }).promise();
};

Link to GitHub

Next, we wrote a data transformation script that scans the old records and appends the FirstName and LastName attributes to each of them; see the example below:

async function appendFirstAndLastNameTransformation() {
  let lastEvalKey;
  let scannedAllItems = false;

  while (!scannedAllItems) {
    // Pass the last evaluated key so each iteration scans the next page of the table
    const { Items, LastEvaluatedKey } = await dynamodbClient.scan({
      TableName: 'Users',
      ExclusiveStartKey: lastEvalKey,
    }).promise();
    lastEvalKey = LastEvaluatedKey;

    const updatedItems = Items.map((item) => {
      const [FirstName, LastName] = splitFullNameIntoFirstAndLast(item.FullName);
      const newItemFormat = { ...item, FirstName, LastName };
      return newItemFormat;
    });

    await Promise.all(updatedItems.map(async (item) => {
      return dynamodbClient.put({
        TableName: 'Users',
        Item: item,
      }).promise();
    }));

    scannedAllItems = !lastEvalKey;
  }
}

Link to GitHub

After writing the actual script (which is the easy part), we needed to verify that it actually does what it’s supposed to. To do so, the next step was to run the script in a test environment and make sure it works as expected. Only after the script’s behavior is confirmed can it be run in the application environments.
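A lightweight way to gain confidence (a sketch only, assuming the same `dynamodbClient` as above) is to sample the table after the script runs in the test environment and assert that the new attributes are present:

// Sketch of a post-transformation sanity check for the test environment.
async function verifyFirstAndLastNamePresent() {
  const { Items } = await dynamodbClient.scan({ TableName: 'Users', Limit: 25 }).promise();
  const broken = Items.filter((item) => !item.FirstName || !item.LastName);
  if (broken.length > 0) {
    throw new Error(`${broken.length} sampled records are missing FirstName/LastName`);
  }
  console.log('Sampled records contain the new attributes');
}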

The last phase is the cleanup: taking the plunge and deleting the FullName attribute entirely from our records. This is done in order to purge the old data format which is no longer used, reduce clutter, and prevent any future misuse of the old format.

async function cleanup() {
  let lastEvalKey;
  let scannedAllItems = false;

  while (!scannedAllItems) {
    // Pass the last evaluated key so each iteration scans the next page of the table
    const { Items, LastEvaluatedKey } = await dynamodbClient.scan({
      TableName: 'Users',
      ExclusiveStartKey: lastEvalKey,
    }).promise();
    lastEvalKey = LastEvaluatedKey;

    const updatedItems = Items.map((item) => {
      delete item.FullName;
      return item;
    });

    await Promise.all(updatedItems.map(async (item) => {
      return dynamodbClient.put({
        TableName: 'Users',
        Item: item,
      }).promise();
    }));

    scannedAllItems = !lastEvalKey;
  }
}

Link to GitHub

Let’s quickly recap what we have done in the process:

  • Adjusted the backend code to write in the new data format
  • Created a data transformation script that updates each record
  • Validated that script against a testing environment
  • Ran the script on the application environments
  • Cleaned up the old data

This well-defined process helped us build much-needed safety and guardrails into our data transformations. As we mentioned before, it enabled us to avoid downtime by keeping the old format of the records until it was no longer needed, and it provided a good basis and framework for more complex data transformations.

Transforming Existing Global Secondary Index (GSI) using an External Resource

Now that we have a process, let’s be honest: real-world data transformations are hardly ever so simple. Let’s assume a more likely scenario, in which the data is actually ingested from an external resource, such as the GitHub API, and our more advanced data transformation requires us to combine data from multiple sources.

Let’s take a look at the example below for how this could work.

In the following table, the GSI partition key is GithubUserId.

For the sake of this data transformation example, we want to add a “GithubUsername” column to our existing table.

Before

  Id    GithubUserId
  123   7654321

After

  Id    GithubUserId    GithubUsername
  123   7654321         Guy7B

This data transformation seems as straightforward as the full name example, but there is a little twist.

How can we get the GitHub username if we don’t have this information? We have to use an external resource: in this case, the GitHub API.

GitHub has a simple API for extracting this data (you can read the documentation here). We pass the GithubUserId and get back information about the user, which contains the login (username) field that we want.

https://api.github.com/user/:id

The naive flow is similar to the full name example above:

  • Adjust our code to write in the new data format.
  • Assume that we have the Github username when creating a user.
  • Scan the user records (get `GithubUsername` by `GithubUserId` for each record using Github API), and update the record. 
  • Run that script on the testing environment
  • Run it on the application environments

However, in contrast to our previous flow, this naive flow is not safe enough. What happens if you run into issues while calling the external resource during the data transformation? Perhaps the external resource crashes, blocks your IP, or is simply unavailable for any other reason. In this case, you might end up with production errors, a partial transformation, or other issues with your production data.

What can we do on our end to make this process safer?

While you can always resume the script if an error occurs, or try to handle errors in the script itself, it is important to be able to perform a dry run with the data from the external resource already prepared before running the script on production. A good way to provide greater safety is to prepare that data in advance.

Below is the design of the safer flow:

  • Adjust our code to write in the new data format (create a user with GithubUsername field)
  • Create the preparation data for the transformation

To create the preparation data, we scan the user records, fetch the GithubUsername for each of them using the GitHub API, collect the mappings into a JSON object of the form `{ [GithubUserId]: GithubUsername }`, and then write that JSON to a file.

This is what such a flow would look like:

const fs = require('fs/promises');
// `fetch` is available globally in Node.js 18+; otherwise use a library such as node-fetch

async function prepareGithubUsernamesData() {
  let preparationData = {};
  let lastEvalKey;
  let scannedAllItems = false;

  while (!scannedAllItems) {
    const { Items, LastEvaluatedKey } = await dynamodbClient.scan({
      TableName: 'Users',
      ExclusiveStartKey: lastEvalKey,
    }).promise();
    lastEvalKey = LastEvaluatedKey;

    const currentIdNameMappings = await Promise.all(Items.map(async (item) => {
      const githubUserId = item.GithubUserId;
      const response = await fetch(`https://api.github.com/user/${githubUserId}`, { method: 'GET' });
      const githubUserResponseBody = await response.json();
      const GithubUsername = githubUserResponseBody.login;

      return { [item.GithubUserId]: GithubUsername };
    }));

    currentIdNameMappings.forEach((mapping) => {
      // append the current mapping to the preparationData object
      preparationData = { ...preparationData, ...mapping };
    });

    scannedAllItems = !lastEvalKey;
  }

  await fs.writeFile('preparation-data.json', JSON.stringify(preparationData));
}

Link to GitHub

Next, we scan the user records, look up the GithubUsername for each record by its GithubUserId in the preparation data, and update each record.

async function appendGithubUsername() {
  // Load the mappings produced by the preparation step
  const preparationData = JSON.parse(await fs.readFile('preparation-data.json', 'utf8'));
  let lastEvalKey;
  let scannedAllItems = false;

  while (!scannedAllItems) {
    const { Items, LastEvaluatedKey } = await dynamodbClient.scan({
      TableName: 'Users',
      ExclusiveStartKey: lastEvalKey,
    }).promise();
    lastEvalKey = LastEvaluatedKey;

    const updatedItems = Items.map((item) => {
      const GithubUsername = preparationData[item.GithubUserId];
      // Only update items for which we have a prepared username
      const updatedItem = GithubUsername ? { ...item, GithubUsername } : item;
      return updatedItem;
    });

    await Promise.all(updatedItems.map(async (item) => {
      return dynamodbClient.put({
        TableName: 'Users',
        Item: item,
      }).promise();
    }));

    scannedAllItems = !lastEvalKey;
  }
}

Link to GitHub

And finally, like the previous process, we wrap up by running the script on the testing environment, and then the application environments.

Dynamo Data Transform

Once we built a robust process that we could trust for data transformation, we understood that to do away with human toil and ultimately error, the best bet would be to automate it.

We realized that even if this works for us today at our smaller scale, manual processes will not grow with us. This isn’t a practical long-term solution and would eventually break as our organization scales. That is why we decided to build a tool that would help us automate and simplify this process so that data transformation would no longer be a scary and painful process in the growth and evolution of our product. 

Applying automation with open source tooling

Every data transformation is just a piece of code that performs a specific change in our database, and these scripts should ultimately live in your codebase.

This enables us to do a few important operations:

  • Track the changes to the database and know its history at every moment, which helps when investigating bugs and issues.
  • No need to reinvent the wheel - reusing data transformation scripts already written in your organization streamlines the process.

By enabling automation for data transformation processes, you essentially make it possible for every developer to be a data transformer. While you likely should not give production access to every developer in your organization, applying the changes is only the last mile. When only a handful of people have access to production, they are left with validating the scripts and running them, rather than also doing the heavy lifting of writing them; having those few people do everything manually consumes more time than necessary and is not safe.

When the scripts live in your codebase and their execution is automated via CI/CD pipelines, other developers can review them, and essentially anyone can perform data transformations in any environment, alleviating bottlenecks.

Now that we understand the importance of having the scripts managed in our codebase, we want to create the best experience for every data-transforming developer.

Making every developer a data transformer

Every developer prefers to focus on their business logic, with as few context disruptions and changes as possible. This tool helps keep them focused on that logic, so they don’t have to start from scratch every time they need to perform a data transformation to support their current task.

For example, dynamo-data-transform provides:

  • Exported utility functions that are useful for most data transformations
  • Versioning of the data transformation scripts
  • Dry runs, to easily test the data transformation scripts
  • Rollback for when a transformation goes wrong - without it, easily reverting to the previous state is not possible
  • Usage via CLI, for developer friendliness and to remain within developer workflows. You can run the scripts with simple commands like `dynamodt up`, `dynamodt down` for rollback, and `dynamodt history` to show which transformations were executed.

Dynamo Data Transform: Quick Installation for Serverless

The package can also be used as a standalone npm package (see here).

To get started with DynamoDT, first run:

npm install dynamo-data-transform --save-dev

This installs the package through npm (you can also install it via…).

Next, add the tool to your serverless.yml. Run:

npx sls plugin install -n dynamo-data-transform

You also have the option of adding it manually to your serverless.yml:

plugins:

  - dynamo-data-transform

You can also run the following command to see all of the capabilities that DynamoDT supports:

sls dynamodt --help

Let’s run an example with DynamoDT. We’ll start by selecting one of the code samples in the repo; for the sake of this walkthrough we’re going to use `v3_insert_users.js`, but you are welcome to test it out using any of the examples you’ll find here.

We’ll initialize the data transformation folder with the relevant tables by running the command: 

npx sls dynamodt init --stage local

For serverless (it generates the folders using the resources section in the serverless.yml):

resources:
 Resources:
   UsersExampleTable:
     Type: AWS::DynamoDB::Table
     Properties:
       TableName: UsersExample

The section above should be in serverless.yml

The data-transformations folder is generated with a template script, which can be found here.

We will start by replacing the code in the template file v1_script-name.js with:

const { utils } = require('dynamo-data-transform');

const TABLE_NAME = 'UsersExample';

/**
 * The tool supplies the following parameters:
 * @param {DynamoDBDocumentClient} ddb - dynamo db document client https://docs.aws.amazon.com/AWSJavaScriptSDK/v3/latest/clients/client-dynamodb
 * @param {boolean} isDryRun - true if this is a dry run
 */
const transformUp = async ({ ddb, isDryRun }) => {
  const addFirstAndLastName = (item) => {
    // Just for the example:
    // Assume the FullName has one space between first and last name
    const [firstName, ...lastName] = item.name.split(' ');
    return {
      ...item,
      firstName,
      lastName: lastName.join(' '),
    };
  };
  return utils.transformItems(ddb, TABLE_NAME, addFirstAndLastName, isDryRun);
};

module.exports = {
  transformUp,
  transformationNumber: 1,
};

Link to GitHub
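The template above only defines transformUp. Since the tool also exposes `dynamodt down` for rollback, the same script can ship a matching rollback step as well. The following is a sketch of what that could look like for this file, assuming a `transformDown` export that mirrors transformUp (check the repository’s templates for the exact format):

// Sketch of a rollback counterpart for the script above (assumed `transformDown` export).
const transformDown = async ({ ddb, isDryRun }) => {
  const removeFirstAndLastName = (item) => {
    // Drop the derived attributes and keep the original `name` field intact
    const { firstName, lastName, ...rest } = item;
    return rest;
  };
  return utils.transformItems(ddb, TABLE_NAME, removeFirstAndLastName, isDryRun);
};

module.exports = {
  transformUp,      // as defined above
  transformDown,
  transformationNumber: 1,
};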

For most regular data transformations, you can use the util functions from the dynamo-data-transform package. You also don’t need to manage the versions of the data transformation scripts; the package does this work for you. Once you’ve customized the data you want to transform, you can test the script using the dry run option by running:

npx sls dynamodt up --stage local --dry

The dry run option prints the records in your console so you can immediately see the results of the script, and ensure there is no data breakage or any other issues.

Once you’re happy with the test results, you can remove the --dry flag and run it again; this time it will run the script on your production data, so make sure to validate the results and outcome.

Once you have created your data transformation files, the next logical step is to add them to your CI/CD. To do so, add the command to your workflow/CI file for production environments.

The command will run immediately after the `sls deploy` command, which is useful for serverless applications.
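For illustration only (this exact workflow is not part of the tool), a CI job that deploys and then applies pending transformations could look roughly like the following, assuming a GitHub Actions workflow and a stage named `prod`:

# Hypothetical GitHub Actions workflow: deploy first, then apply pending data transformations.
# AWS credential configuration is omitted for brevity.
name: deploy-and-transform
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: 18
      - run: npm ci
      - run: npx sls deploy --stage prod
      # Applies any data transformation scripts that have not run yet
      - run: npx sls dynamodt up --stage prod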

Finally, all of this is saved, as noted above, so if you want to see the history of the data transformations, you can run:

`npx sls dynamodt history --table UserExample --stage local`

The tool also provides an interactive CLI for those who prefer to work that way, and all of the commands above are supported there as well.

With Dynamo Data Transform, you get the added benefits of being able to version and order your data transformation operations and manage them in a single place. You also have the history of your data transformation operations if you would like to roll back an operation. And last but not least, you can reuse and review your previous data transformations.

We have open-sourced the Dynamo Data Transform tool that we built for internal use to perform data transformations on DynamoDB and serverless-based environments and manage these formerly manual processes in a safe way.

The tool can be used as a Serverless Plugin and as a standalone NPM package.

Feel free to provide feedback and contribute to the project if you find it useful.

Figure 2: Data Transformation Flow Chart 

