Plaid.com Cuts Their Deployment Times on Amazon ECS with Custom Process Relaunching

Plaid's engineering team cut their deployment times on AWS ECS by 95% with a custom wrapper that relaunches their Node.js processes without recreating the containers.

Plaid.com - a financial technology company that enables applications to connect with users' bank accounts - integrates with over 9,600 financial institutions, from which it pulls and processes data for later analysis. Plaid runs over 20 internal services, and its core services receive 50+ code commits per day. The bank integration service, which runs as Node.js processes in containers on ECS, was slow to start up during deployments, which in turn lengthened the overall time to ship code. Multiple environments in the pipeline added to the slowdown. The long-term plan was to move to Kubernetes. As a short-term solution, the team wrote a custom process wrapper that relaunches the application inside the same container, avoiding container recreation.

Plaid runs 4,000 Node.js processes in containers. A profiling exercise by the team exposed some possible areas for optimization in the deployment process, during application startup. ECS health checks - similar to Kubernetes liveness checks - were tweaked, but to little avail. Reducing the number of containers was another option, but it would have required re-architecting the service. Spinning up more instances was not cost-effective. These approaches shaved a few minutes off the deployment time. InfoQ got in touch with Evan Limanto, engineer at Plaid, to learn more about the internals.

The team came up with a hot reloading technique by writing a process wrapper. Internally called Bootloader, the wrapper runs in the containers and launches the actual application as a sub-process. Bootloader also traps and forwards signals, and handles logging output. The application was modified to listen on a gRPC endpoint for a message sent from the Jenkins deployment pipeline. Limanto says that "each container advertises its own address on a Redis set with an application level heartbeat." This Redis set is used to keep track of all healthy containers at any given time.
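To make the wrapper idea concrete, here is a minimal Python sketch of a process that launches an application as a child, forwards termination signals to it, and starts it again when it exits with a designated status code (the relaunch behaviour the article goes on to describe). This is an illustration only, not Plaid's Bootloader, which wraps Node.js processes. The name `run_with_relaunch`, the exit code `42`, and the `max_relaunches` cap are all assumptions made for the example. Log handling is reduced to letting the child inherit the wrapper's stdout and stderr.

```python
import signal
import subprocess
import threading

# Hypothetical value: the article only says "a special status code".
RELOAD_EXIT_CODE = 42


def run_with_relaunch(command, reload_code=RELOAD_EXIT_CODE, max_relaunches=None):
    """Run `command` as a child process and relaunch it on `reload_code`.

    Returns (final_exit_code, relaunch_count). `max_relaunches` is an optional
    safety cap added for this sketch; None means no cap.
    """
    relaunches = 0
    # signal.signal() may only be called from the main thread.
    can_trap = threading.current_thread() is threading.main_thread()
    while True:
        child = subprocess.Popen(command)  # stdout/stderr are inherited

        def forward(signum, _frame, child=child):
            # Pass termination signals through to the application.
            child.send_signal(signum)

        previous = {}
        if can_trap:
            for sig in (signal.SIGTERM, signal.SIGINT):
                previous[sig] = signal.signal(sig, forward)
        try:
            code = child.wait()
        finally:
            # Restore whatever handlers were installed before this launch.
            for sig, handler in previous.items():
                signal.signal(sig, handler if handler is not None else signal.SIG_DFL)

        capped = max_relaunches is not None and relaunches >= max_relaunches
        if code != reload_code or capped:
            return code, relaunches
        relaunches += 1  # same container, fresh process
```

A container entrypoint could call something like `run_with_relaunch(["node", "server.js"])` (a hypothetical command) and exit with the returned code, so that any status other than the reload code still ends the container as usual.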

The gRPC message carries the commit hash in its payload, so a rollback can be performed by sending an older hash, explains Limanto. The message triggers a download of the application code from AWS S3, and the app then exits with a special status code. Bootloader traps this code and relaunches the app, thus loading the new code into memory. How does the reload happen across all containers? Limanto explains:

The reload happens in a phased manner according to a simple formula:

Reload the current container if the hash of its address is less than `min(TargetPercentage, MaxUnhealthyPercentage + % of containers on new commit)`. Some background job runs this reloading logic on an interval.
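The formula above can be sketched in Python as follows. The quote does not say how the hash of an address is turned into a value comparable with a percentage, so the mapping below (SHA-256 reduced to a bucket in the range 0 to 100) is an assumption, as are the function and parameter names. Read this way, the threshold rises as more containers report the new commit, until it reaches the target percentage.

```python
import hashlib


def address_bucket(address: str) -> float:
    """Map a container address to a stable value in [0, 100).

    Assumption for this sketch: SHA-256, reduced to 0.01% granularity.
    """
    digest = hashlib.sha256(address.encode("utf-8")).digest()
    return (int.from_bytes(digest[:8], "big") % 10_000) / 100


def should_reload(address: str,
                  target_pct: float,
                  max_unhealthy_pct: float,
                  pct_on_new_commit: float) -> bool:
    """Apply the quoted rule: reload if hash(address) < min(target, max_unhealthy + % on new)."""
    threshold = min(target_pct, max_unhealthy_pct + pct_on_new_commit)
    return address_bucket(address) < threshold


# Each container would evaluate this periodically for its own address, e.g.:
# should_reload("10.0.0.7:50051", target_pct=100, max_unhealthy_pct=10, pct_on_new_commit=0)
```

Because the bucket depends only on the address, every evaluation gives the same answer for a given container, and a container that qualifies at one threshold still qualifies at any higher one.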

A reload can be triggered while a process is still handling requests. How is this handled? Opinions on this differ. While Plaid keeps track of requests being processed and exits only after they are all done, another view endorses writing the app so that it can recover from abrupt shutdowns.
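One way to picture the first approach, tracking in-flight requests and exiting only once they have finished, is the small Python sketch below. The article gives no details of Plaid's implementation, so the class `InFlightTracker` and its methods are hypothetical names chosen for illustration.

```python
import threading


class InFlightTracker:
    """Counts requests in progress and lets a reload wait for them to finish."""

    def __init__(self):
        self._count = 0
        self._draining = False
        self._cond = threading.Condition()

    def try_begin(self) -> bool:
        """Register a new request; refused once draining has started."""
        with self._cond:
            if self._draining:
                return False
            self._count += 1
            return True

    def end(self) -> None:
        """Mark one request as finished and wake a waiting drain()."""
        with self._cond:
            self._count -= 1
            if self._count == 0:
                self._cond.notify_all()

    def drain(self, timeout=None) -> bool:
        """Stop accepting work and wait until no requests remain.

        Returns True if the process is idle, False if the timeout expired.
        """
        with self._cond:
            self._draining = True
            return self._cond.wait_for(lambda: self._count == 0, timeout)
```

In this sketch, a reload handler would call `drain()` and only then exit with the reload status code. The timeout is a design choice: without one, a single stuck request could hold up the reload indefinitely.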

Community comments

  • Just curious, why bother?

    by Richard Clayton

    In my experience, upgrading ECS deployments is painfully slow, and after years of support, Amazon doesn't seem to care. Why spend the effort of creating technology on top of ECS to speed up deployments rather than use a different scheduler (like k8s or Nomad)?

  • Re: Just curious, why bother?

    by Hrishikesh Barua

    Hi Richard,

    The long term goal for Plaid's engineering team is to move to Kubernetes - this was a short term solution. See the linked article - blog.plaid.com/how-we-reduced-deployment-times-...

    Regards
    Hrish
