InfoQ Homepage Presentations Building Tech at Presidential Scale

Building Tech at Presidential Scale

View Presentation

Speed:

Download

29:52

Summary

Dan Woods discusses the unique challenges of building and running tech for a presidential cycle, highlighting the digital duct tape that held the pieces together and the data flowing.

Bio

Dan Woods was CTO for Biden for President during the 2020 election. Previously he worked building tech for Hillary Clinton's 2016 presidential campaign and before that he was a senior software engineer at Netflix. Dan is an O'Reilly author, an avid public speaker, and technologist. He's currently a Distinguished Engineer at Target where he focused on enterprise-wide technology initiatives.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.

Transcript

Woods: I'm Dan Woods. I'll be covering some of the more elaborate architectures built by the Biden for President 2020 campaign tech team. Overall, this is going to be a fairly technical conversation with deep dives into specific tools we built to solve a variety of problems.

Background

I was the primary CTO for Biden for President 2020, which means I built and led the technology organization within the campaign, and was overall responsible for the technology operations as a whole. I had a brilliant team along with me that built the best tech stack in presidential politics. We covered everything from software engineering, to IT operations, to cybersecurity, and all pieces in between. I'm currently a Distinguished Engineer at Target, which is also what I did before the campaign. I focused on infrastructure and operations with an emphasis on building high scale, reliable distributed systems. Prior to joining Target, I was a member of the team that worked on the Hillary for America 2016 technology platform by way of The Groundwork, which was a campaign adjacent technology company. Before my foray into presidential politics, I worked as a senior software engineer at Netflix, where I worked in the operations engineering organization, and helped create Netflix's open source continuous delivery platform, Spinnaker. I'm also a member of the Ratpack web frameworks open source team, though haven't been as involved lately as I've been a little bit busy. I wrote a book a couple years ago on Ratpack, and it's published under O'Reilly. Be sure to go and check that out.

On Campaign Structure

Before we dive right into the technology, I want to spend a couple of minutes to discuss the mechanics of the organization of a campaign. Hopefully, this will show the intricacies of the moving parts of the campaign and help to underscore some of the specific avenues we had to follow when considering tech in this environment. I've laid out the different departments within the campaign to give you a sense of the varying degrees of responsibility across the organization. There were a number of teams across the campaign, each with their own specific concentration on a day-to-day basis. Although campaign tech gets a lot of attention in the press, it is suffice to say that tech is not the most important thing in politics. Obviously, reaching voters, driving the candidate's message, and connecting as many people as possible to the democratic process is the North Star for everyone on the team.

The Role of Tech on a Campaign

What does tech do on a campaign? Quite literally, whatever needs to be done. Nearly every vertical of the campaign needs tech in one form or another. In practical terms, my team was responsible for building and managing our cloud footprint, all IT operations, vendor onboarding, and so forth. Our approach to building technology during the presidential cycle is simply to build the glue that ties vendors and systems together. This is not the right environment to delve into full-blown product development. As you'll soon see, the needs of the campaign very much drive toward a similar outcome.

When a vendor or open source tool didn't exist, or quite frankly, when we didn't have the money to buy something, we would build tools and solutions. Overall, a huge portion of the work that needs to be done by a campaign tech team is simply getting the data from point A to point B. Strictly speaking, this involves creating a lot of S3 buckets. In addition to all that, we ended up developing quite a wide range of technology over the year and a half of the campaign cycle. We built dozens of websites, some big and some small. We built a mobile app for our field organizing efforts. We built Chrome extensions to help simplify the workloads for dozens of campaign team members. Likewise, though unglamorous to build, Word and Excel macros were developed to drive better efficiency. On a presidential campaign, time is everything, and anything that we could do to simplify a process to save a minute here or there, it was well worth the investment. Much of what we've done distills down to automating the tasks needed to reduce the load on the team any way that we could. It's desirable at times to try to distill campaign tech down to one specific focus, data, or digital, or IT, or cybersecurity. In reality, the technology of the campaign is all of those things, and more, and a critical component of everything the campaign does.

What We Did

What did we do? To spell it out, plainly, we were a small and scrappy team of highly motivated technologists. Whatever the ask, we'd take it on and do our best where we could. Over the course of the campaign, we built and delivered over 100 production services. We built more than 50 Lambda functions that delivered a variety of functionality. We had more than 10,000 deployments in the short time we were in operation, all with zero downtime and a focus on stability and reliability. We built a best-in-class relational organizing mobile app for the primary cycle, the Team Joe App, which gave the campaign the leverage to connect thousands of eager voters and volunteers directly with the people they knew.

When we won the primary, we wanted a new, fresh branding on the website in time for the general election. We dug in and designed and delivered a brand new web experience for joebiden.com. In addition to all that, we implemented a bespoke research platform with robust automation that built on cloud based machine learning that ended up saving us tens of thousands of hours of real-world human time, and gave us incredible depth of insight. We made a commitment early on as a campaign, that we would hold ourselves to a higher standard, and make sure that we were not accepting donations from organizations that harm our planet, or from individuals who may have ulterior motives. To hold us true to that promise, we built an automation framework that vetted our donors in near real-time for FEC compliance and campaign commitments.

A huge aspect of every campaign is reaching directly to voters to let them know when an event is coming to their area, or when it's time to get out to vote. For that, we built an SMS outreach platform that facilitated massive nationwide scale, and saved us millions of dollars in operational expense along the way. Beyond that, we operationalized IT for HQ, states, and ultimately facilitated a rapid transition to being a fully remote organization at the onset of quarantine. Through all of this, we made sure we had a world class cyber security posture. Indeed, it was a core focus for everything we did. We built a number of services on top of robust machine learning infrastructure, and did we do a lot of that? Ultimately, though, no one job on the campaign is single responsibility, which means everyone needs to dig in and help where help is needed. Whether that's calling voters and asking them to go and vote, sending text messages, or collecting signatures to get on the ballot. We did it all.

Infrastructure and Platform

A little bit more under the tech. We relied on a fully cloud based infrastructure for everything we did. It was paramount that we didn't spend precious hours reinventing any wheels for the core foundations of tech. For the main website, joebiden.com, we made use of Pantheon for hosting. For non-website workloads, APIs, and services, we built entirely on top of AWS. In addition to that, we had a small Kubernetes deployment within AWS that helped us deliver faster and simple workloads, mostly for cronjobs. We still needed to deliver with massive scale and velocity overall. For a small team, it was critical that our build and deploy pipeline was reproducible. For continuous integration, we used Travis CI. For continuous delivery, we used Spinnaker.

After services were deployed, and once they were up and running in the cloud, there's a core set of capabilities they all need, like being able to find other services and securely accessing config and secrets. For those, we made use of HashiCorp's Consul and Vault. This helped us build fully immutable and differentiated environments between Dev and prod, with very few handcrafted servers along the way, not zero, but very few. A huge part of the technology footprint was dedicated to the work being done by the analytics team. To ensure they had best-in-class access to tools and services, the analytics data was built on top of AWS Redshift. This afforded a highly scalable environment with granular controls over resource utilization.

For all of the services that we built and deployed, we utilized Postgres via RDS as the back-end datastore. From an operational perspective, we desired a centralized view into the logging activity for every application, so that we could quickly troubleshoot and diagnose, and not spend unnecessary time on MTTR when there was a problem. For that, we deployed an ELK Stack and backed it by AWS Elasticsearch. The logs are very important to applications. Metrics were the primary insight into the operational state of our services, and critical when integrating with our on-call rotation. For service level metrics, we deployed Influx and Grafana, and wired it into PagerDuty to make sure we were never experiencing an unknown outage. Many of the automation workloads and tasks don't fit the typical deployment model. For those, we'd favor Lambda where possible. Also, for any workload that needed to integrate with the rest of the AWS ecosystem, and fan-out jobs based on the presence of data. All in all, we ended up creating a truly polyglot environment. We had services and automation built in a wide variety of languages and frameworks, and we were able to do it with unparalleled resiliency and velocity.

We covered a lot of ground during the campaign cycle, and it would take a lifetime to dive deep on everything we did. I'm going to pick out a few of the more interesting architectures we ended up putting together on the software engineering side of the tech team. By no means should that take away from the impressive array of work that was done on the IT and cybersecurity side of things. We'll dive deep on the architectures. Just know that for each one of these, there are at least a dozen more conversations, covering the full breadth of what we accomplished across the entire tech team.

Donor Vetting

Let's get started by diving into the donor vetting pipeline. We made a commitment early on as a campaign that we were not going to accept donations from certain organizations and individuals. The only way really to do this at any measure of scale, is to have a group of people comb through the donations periodically, usually once a quarter, and flag people who might fit a filtering criteria. This process is difficult, time consuming, and is prone to error. When done by hand, this usually involves setting a threshold for the amount of money an individual has given, and research whether or not they fit into a category we flagged as problematic. Instead, what we were able to do was build a high scale automation process that would correlate details about a donation against a known set of criteria we wanted to filter on. For example, we committed to not accepting contributions from lobbyists and executives in the gas and oil industry. Every day, a process would kick off that would go to ngpvan, which is a source of truth for all contributions, export the contribution data to CSV, and dump that CSV into S3. The act of the CSV file appearing in S3 would trigger an SNS notification, which in turn would activate a sequence of Lambda functions. These Lambda functions would split the file into smaller chunks, re-export them to S3, and would kick off the donor vetting process. We could see a massive scale in Lambda while this workflow was executing, up to 1000 concurrent executions of the donor vetting code. It would only take a matter of minutes to go through the comprehensive list of donors and check them against lobbyists' registries, foreign agent registries, and block lists for gas and oil executives.

Once the process completed, the flagged entries were collated into a single CSV file, which was then re-exported to S3 and made available for download. We then used SES to send an email to Danielle, the developer who built the system indicating that the flags were ready for validation. After validation, the results were forwarded to the campaign compliance team to review and to take the appropriate action, whether that be refunding a contribution or investigating it further. To say this process saved a material amount of time for the campaign would be a dramatic understatement. The donor vetting pipeline quickly became a core fixture in the technology platform and an important part of the campaign operations.

Tattletale

Let's talk about cybersecurity. Tattletale was one of the most important pieces of technology we built on the campaign, especially very early on when we were small and very scrappy. Essentially, because we had so many vendors and so many cloud services and not nearly enough time or people to constantly go and check the security posture on each of them, we needed a way that we could easily develop rules against critical user facing systems to ensure we were always following cybersecurity best practices. Tattletale was developed as a framework for doing exactly that. We developed a set of tasks that would leverage vendor system APIs to check to ensure things like two factor authentication were turned on, or to notify us if a user account was active on the system but that user hadn't logged in in a while. Dormant accounts present a major security risk, so we wanted to be sure everything was configured toward least privilege spec.

Furthermore, the rules within Tattletale could check that load balancers weren't inadvertently exposed to the internet, or that IAM permissions were not scoped too widely, and so forth. At the end of the audit ruleset, if Tattletale recorded a violation, it would drop a notification into a Slack channel for someone to go and investigate further. It also had the capability to email a user of a violation so they could take their own corrective action. If the violation breached a certain threshold, a metric would be recorded in Grafana that would trigger a PagerDuty escalation policy, notifying the on-call tech person immediately. Tattletale became our cybersecurity eyes when we didn't have the time or resources to look for ourselves. It also made sure we were standardizing on a common set of cybersecurity practices, and holding ourselves to the highest possible standard.

Conductor and Turbotots

I want to take a moment and remind everybody that the hardest problem in computer science is naming things. To be frank, we didn't spend a lot of time getting creative on names. Joe Biden is known well for his affinity toward Amtrak trains, and aviator sunglasses, but we exhausted all those metaphors pretty quickly. That leads me to a service that we built early on called CouchPotato, which was one of the tools where we heavily leveraged machine learning. Once we got the potato theme down, that was the end of it. We ended up mostly calling everything some derivative of potato. Conductor affectionately named for the trains theme, and Turbotots, a very fast cooked potato, represented a very important part of the platform we built. When we reached a certain point of complexity and span of internal tools, we knew we would need to better organize the API footprint and consolidate the many UIs we had built to manage those systems. We also needed to standardize on a unified security model, so we wouldn't have bespoke authentication and authorization all over the place. Conductor and Turbotots were born. Enter our internal tools UI and platform API.

This is a dramatically simplified diagram of a very complex architecture. The broad strokes are good enough to give you an idea of what Conductor and Turbotots did and represented. Conductor became the single pane of glass for all internal tools we were developing for use by people on the campaign. In other words, this was the one place anybody on the campaign needed to go to access services we'd made. Conductor was a React web app that was deployed via S3 and distributed with CloudFront. Turbotots was a unified API that Conductor talked to and provided a common authentication and authorization model for everything we did. We built the AuthN and AuthZ portions of Turbotots, on top of AWS Cognito, which saved loads of work and provided a very simple SSO through G Suite, which we were able to configure for only our internal domains. Authentication was achieved by parsing JWT tokens through the workflow. For managing this on the front-end, we were able to make use of the React bindings for AWS Amplify, which integrated seamlessly into the app.

To the right is where things get a little bit trickier to understand, but I'll do my best to simplify. The forward facing API was an API Gateway deployment with a full proxy resource. API Gateway has a very easy to use integration with Cognito, which gave us full API security at request time. The Cognito authorizer that integrates with API Gateway also performs JWT validation prior to a request making it through the proxy. We could rely on that system to know that a request was fully validated before it made it to the back-end. When developing an API Gateway proxy resource, you can connect code running in your VPC, by way of a VPC link network load balancer. This is all very complicated, but as I understand it, this effectively creates an ENI that spans the internal zone in AWS between API Gateway and your private VPC. The NLB, in turn is attached to an autoscaling group that has a set of NGINX instances that served as our unified API. This is the actual portion that we called Turbotots, and is, essentially, a reverse proxy to all of our internal services that were running inside the VPC. Following this model, we never needed to expose any of our VPC resources to the public internet. We were able to rely on the security fixtures that were baked into AWS, and that made us all feel a lot more comfortable overall.

Once the requests have made it through to Turbotots, within Turbotots, a lightweight Lua script extracted the user portion of the JWT token and passed that data downstream to services as part of a modified request payload. Once the request reached the destination service, it could then introspect the user payload, and determine whether or not the user was validated for the request. Essentially, users could be added to authorization groups within Cognito, which would then give them different levels of access in the downstream services. The principle of least privilege still applies here, and there are no default permissions. At the end of the day, Conductor and Turbotots represent a single user interface for the breadth of internal tools that was fully secured with seamless SSO to the user's campaign G Suite account. We use this exact same architecture in the next section to open the door on exposing a limited subset of the API to non-internal users as well.

Pencil and Turbotots

Allow me to introduce you to Pencil, the campaign peer-to-peer SMS platform. Pencil grew to be a bit of a beast of an architecture from a very simple subset of early on requirements. It's suffice to say that Pencil ended up being an area where we dedicated a significant amount of time. Originally, the idea with building our own P2P SMS platform was a cost savings play. I knew that obviously, we could send text messages through Twilio for cheaper than any vendor would be able to resell us the same capabilities. Since, originally, we didn't need the broad set of features that vendors were offering, it was fairly easy to put together a simple texting system that met the needs of the moment.

Eventually, the scope of the project grew dramatically as popularity increased and the value was seen at scale. In the end, Pencil was used by thousands of volunteers to reach millions of voters and became a critical component in our voter outreach workflow. If you received a text message from a volunteer on the campaign, it was likely sent through Pencil. The external architecture for how we pulled this off is going to look very familiar because we were able to reuse everything that we did with Conductor and just change a few settings without having to change any code.

The user facing component to Pencil is a React web app deployed to S3 and distributed via CloudFront. The React app in turn talks to an API Gateway resource which is wired to a Cognito authorizer. Pencil users got into the platform, they were invited via email, which was a separate process that registered their accounts in Cognito. At first login, Pencil would automatically add them to the appropriate user group in Cognito, which would give them access to the user portion of the Pencil API. That's admittedly a little confusing to reason about. I think it's enough to say that from a user perspective, they clicked a link to sign up to send text messages. After a short onboarding process with the campaign digital field organizing team, they were all set up and able to get to work without any additional steps. It was very easy to do and worked great because it was built on top of the same Turbotots infrastructure we'd already developed.

Tots Platform

Tots represents a very different slice of the architecture than Turbotots, but with something of a proceeding reference implementation for what Turbotots eventually became. Tots existed first, and therefore was able to keep its name. Tots is actually a direct downstream path from Turbotots. On those prior slides where I'd listed the services beneath Turbotots, a number of them were parsed by way of tots and are a part of this architecture. Tots is a very important platform, because as we start to build more tools that leverage machine learning, we quickly realized that we needed to consolidate logic and dry up the architecture. In effect, our biggest use case for machine learning in various projects was extracting entity embeddings from blocks of text. In turn, we were able to organize that data in an index in Elastic and make entire subjects of data available for rapid retrieval later on. We use AWS Comprehend to extract entity embeddings from text. It's a great service. Just give it a block of text, and it'll tell you about people, places, subjects, and so forth, about what's being discussed in that text.

The blocks of text came into the platform in a variety of forms, be it multimedia, news articles, or document format. Figuring out what to do before something went off to Comprehend is much of the mechanics in the Tots platform that you're looking at. The process codified in this platform saved the campaign, quite literally, tens of thousands of hours of equivalent human time on organizing and understanding these various documents. This includes time spent transcribing live events like debates, and getting them into a text format that could be read later on by various people on the campaign.

CouchPotato

Last we'll cover today but certainly not least, is CouchPotato. CouchPotato became very important to us because of the sheer amount of time savings it gave us on audio, video material. It's also one of the most technically robust platforms, and required us to solve the actual hardest problem in computer science, making audio work on Linux. In spite of the lines connecting independent services here, CouchPotato is the main actor in this architecture, but it needed a lot of supporting characters along the way. From the bottom up, CouchPotato's main operating function was to take a URL or media file as input, open that media in an isolated X11 Virtual Frame Buffer, and listen to the playback on a pulseaudio loopback device, FFmpeg, then recorded the contents from the X11 session. Produced an mp3 file, which then was later able to send over to Transcribe to perform an automated speech recognition.

Once the ASR process was completed, it would correct the transcript for some common mistakes. For example, it could rarely get Mayor Pete's name correct. We had a RegEx in place for some common replacements. Then once all that was done, it would ship it over to Comprehend to extract the entity embeddings. Finally, the text would be indexed in Elasticsearch. That was that. The real key to CouchPotato is that it was able to make use of FFmpeg's segmentation feature to produce smaller chunks of audio that could be reasoned about. Then it would run the whole process for each segment, and make sure they all remained in uniform alignment as they were being indexed in Elasticsearch. The segmentation was an initial feature in CouchPotato, because we used it to transcribe debates in real time.

Normally, you'd have an army of interns sitting there watching the debate and typing out what was being said throughout. We didn't have an army of interns to do that, and so CouchPotato was born. However, the segmentation also created a real-world serialization problem. Sometimes Transcribe would finish subsequent segments, ASR, before the prior segment had completed, meaning all the work needed to be done asynchronously, and the order recompiled after the fact. It sounds like a reactive programming problem to me. We invested a lot of time in solving the problem of serializing asynchronous events across stateless workers. We made it work. That's also where docsweb came in handy for the debate. Docsweb connected the transcription output from the segments produced by CouchPotato to a Google Doc, which allowed us to share the transcribed text in near real-time, except for the lag from Transcribe to Elastic, which wasn't too bad. We were able to share the output of that with the rest of the campaign. We transcribed every single debate and loads of other media that would have taken a fleet of interns a literal human lifetime to finish.

Part of solving the segmentation serialization problem is figuring out a way we could swap a set of active CouchPotato instance for a new one, say, during a deploy or when a live Transcribe event was already running. There's a lot more we can go into on that subject itself even alone, like how we made audio frame stitching work, so we could phase out an old instance and phase in a new one. It's too much to go into for just this presentation. Just know, there was a lot of work in HotPotato to make it so we could do hot deployments of CouchPotato with zero deployments and zero state loss.

KatoPotato is the entry point into this whole thing. It's an orchestration engine that coordinated the movement of data between CouchPotato, docsweb, and Elastic, as well as acting as the main API for kicking off CouchPotato workloads. Furthermore, it was responsible for monitoring the state of CouchPotato, and deciding if HotPotato needed to step in with a new instance. The best thing I can say about that is sometimes audio on Linux can be finicky, and CouchPotato was able to report back to KatoPotato, whether it was actually capturing audio or not. If it was not capturing audio, KatoPotato could scale up a new working instance of CouchPotato and swap it out via HotPotato. At the end of the day, the marketing line for CouchPotato is it's a platform that built on Cloud Machine Learning that gave us rapid insight into the content of multimedia without having to spend valuable human time on such a task. It's a great piece of engineering, and I'm glad it was able to deliver the value it did.

Conclusion

We covered a lot of ground in this presentation. There's so much more we can talk about like the unmatched relational organizing app we built, or the data pipelines we created to connect S3 and RDS to Redshift, or the dozens of microsites we built along the way, or even simpler things like how we manage work collaboratively in an extremely fast-paced and dynamic environment.

See more presentations with transcripts

Recorded at:

Jan 31, 2021

Dan Woods

InfoQ Software Architects' Newsletter