Rewriting an API Gateway Service from Clojure to Golang: AppsFlyer Experience Report

Key Takeaways

  • AppsFlyer processes roughly 70 billion HTTP requests a day and is built using a microservices architecture style. The entry point to the system, which wraps all of the frontend services, is a mission-critical (non-micro) service called the API Gateway.
  • The original API Gateway that was written in the AppsFlyer default language, Clojure, started accumulating technical debt. 
  • Golang was selected as the language to benchmark against Clojure for the proposal for a newly designed API Gateway service.
  • Benchmarking was conducted with NGINX (enhanced by Lua) as an option, alongside Golang and Clojure. Go delivered improved throughput versus Clojure, and this was selected as the language of choice for the implementation.
  • Because the API gateway is now built in a statically typed language with strong library support and an active community, it is much easier to plug in diverse functionality and introduce new technologies.
  • The newly deployed solution is capable of supporting far more traffic than it handles today. With traffic and request volumes growing by factors of 10x, this was important from a forward-thinking perspective.
     

AppsFlyer, a leading mobile attribution and marketing analytics platform, processes roughly 70 billion HTTP requests a day (approximately 50 million requests a minute) and is built using a microservices architecture style. The entry point to the system, which wraps all of the frontend services, is a mission-critical (non-micro) service called the API Gateway. It serves as a single point for routing traffic from customers to our backend services, which greatly simplifies authentication and authorization for our clients, with the tradeoff that it is also a potential single point of failure.

This article explores why and how the engineering team migrated from a Clojure-based API gateway implementation to a Go-based implementation.

Accumulating Technical Debt within the API Gateway

We’ve talked previously about how technical debt originates, and our API Gateway service followed a familiar pattern.

Originally, AppsFlyer’s services were a Python monolith, which required a single solution for authentication and authorization as part of the monolith itself. As time went by, traffic and complexity grew and we migrated to a microservices architecture. As such, we needed to create a unified API gateway that would serve as our authentication and authorization provider.

We started by just rolling up our sleeves and writing this in Clojure, skipping the design phases and building the service largely in proof-of-concept mode. Our company is one of the largest Clojure shops in production in EMEA, so Clojure is often the default language of choice, without much consideration of the specific project at hand. While this is good for velocity and a “get stuff done” mindset, it’s less ideal for the long-term maintenance of a project. As traffic grew, we quickly realized that the code for the newly rolled out API gateway was too complex and needed constant refactoring to deliver the required throughput.

We eventually came to a crossroads where the service was too unstable, and we realized that we needed to rewrite the project completely: either in Clojure (but with a better design) or in another language. With this iteration, we decided not to give in to our cognitive biases and retreat to our Clojure comfort zone, but instead to do the proper design work required to build the service we needed, and not just rework the service we already had.

We eventually selected Golang as the language to benchmark against Clojure for this API Gateway service, which also brought with it the added benefits of language diversity and contributed to our mentality of code craftsmanship, by mastering additional syntaxes. 

We understood the flip side of adding another programming language to our stack. We are strong believers in a CI/CD mentality, and introducing a new language that is not JVM-based (unlike Clojure) had its operational costs, but we were able to resolve those in a short time.

There were also, of course, learning curves in mastering a new language, and the need to ensure that the code would be robust enough for the long term, which is hard to know before actually writing your first project in a specific language and seeing how it performs in production.

I’ll provide a brief aside on why we selected Go for this specific service, just for some context. Go has very strong support for building network services, and specifically for proxy-like services, with a reverse proxy built into its standard library. Its biggest advantage over other solutions, such as the http-kit library we used in Clojure, is the ability to stream data through the proxy. The alternative is to hold the whole response in memory and return it to the client only after the last byte has been received from the server. This feature, alongside support for efficient I/O without the overly complicated asynchronous code we would have to write on other platforms such as the JVM, made the choice of Go very compelling. An additional advantage that became apparent once we started to implement the service was that a statically typed language makes it a lot easier to refactor the code and reason about it, since the types are an excellent way to self-document your code.
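
As a rough illustration of the built-in reverse proxy mentioned above, the following minimal sketch places httputil.ReverseProxy in front of a throwaway backend and reads a response through it. The helper names (newStreamingProxy, fetchViaProxy) and the test servers are assumptions made for this example; this is not AppsFlyer's gateway code.

```go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
)

// newStreamingProxy returns a handler that forwards requests to upstream.
// httputil.ReverseProxy copies the upstream response body to the client in
// chunks as it arrives, instead of collecting the whole body in memory first.
func newStreamingProxy(upstream string) (http.Handler, error) {
	target, err := url.Parse(upstream)
	if err != nil {
		return nil, err
	}
	return httputil.NewSingleHostReverseProxy(target), nil
}

// fetchViaProxy starts a temporary backend that serves payload, puts the
// proxy in front of it, and returns what a client reads through the proxy.
func fetchViaProxy(payload string) (string, error) {
	backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		io.WriteString(w, payload)
	}))
	defer backend.Close()

	proxy, err := newStreamingProxy(backend.URL)
	if err != nil {
		return "", err
	}
	front := httptest.NewServer(proxy)
	defer front.Close()

	resp, err := http.Get(front.URL)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	body, err := fetchViaProxy("hello from the backend")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(body)
}
```

In a real deployment the proxy handler would be passed to http.ListenAndServe rather than to a test server; the test servers are used here only so the example runs and exits on its own.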

Evaluating Our Options

We understood that to properly evaluate each language’s suitability, we would need to examine two aspects: performance, and the specific benefits of each language for the task at hand. To measure performance, we would need to benchmark Clojure against Go in as close a simulation of production as possible.

To do so, we started by doing stress testing, with NGINX (enhanced by Lua) as an option, alongside Golang and Clojure. Go delivered improved throughput versus Clojure.

The basic statistics of the test:

  • We used wrk as our benchmarking tool
  • 3-minute bursts
  • 64 threads
  • A pool of 1,000 connections
  • 2-minute request timeout
  • Each request returned a static 500 KB file
  • All traffic was fired from the same AZ, using c4.xlarge instances, to mitigate network noise

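Proxy solution                             | Req/Sec | Trans/Sec | Total requests | Total transaction size | Bad Req          | Avg. Latency
-------------------------------------------|---------|-----------|----------------|------------------------|------------------|-------------
Direct                                     | 190     | 72 MB     | 34500          | 12.8 GB                | ~ 400 (drop:200) | 4.41 Sec
NGINX                                      | 185     | 73 MB     | 33486          | 12.7 GB                | ~ 300 (drop:37)  | 7.95 Sec
Clojure (basic Http-Kit implementation)    | 190     | 72 MB     | 34412          | 12.8 GB                | ~ 100 (drop:600) | 8.48 Sec
Golang (native reverse proxy & http layer) | 185     | 73 MB     | 33443          | 12.7 GB                | ~ 200 (drop: 0)  | 5.42 Sec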

We moved away from rewriting the service in Clojure not only because Go showed better performance, but also because we wanted to challenge ourselves and be exposed to a different language and a different way of thinking.

The design phase started with outlining the functionality we required of the service. Once the basic concepts were specified, we examined backward compatibility considerations and the potential pitfalls of migrating our production user base to the new service. Once we were sure that we had covered all of our bases, we got to work by assigning an architect and a developer to the project.

From Concept to Delivery

We were surprised by how quickly the coding part of the project was completed: it required only about two months of work. Because this was the first time we introduced Go in-house, we were very careful with the coding. We did two iterations on each function to ensure we were doing it right, and did many code reviews. We knew that this code had to be well crafted and clean, as it would serve as a reference for other Go projects going forward.

Although this was our first project in Go, it gave us the opportunity to get a good grasp of the language and its core functionality. We had to replace the libraries we used in Clojure to communicate with other parts of the stack, including Redis (which holds the persistent state of user login counters, used to prevent DDoS attacks and bots) and Kafka (we manage a CQRS of domain events, one of which is a successful or unsuccessful login), and this required creating similar libraries in Go.
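
The article does not show how the login counters are implemented. As a rough idea of the technique, here is a minimal sketch of a fixed-window login counter. The CounterStore interface, the in-memory store that stands in for Redis, the key format, and the limit and window values are all hypothetical.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// loginWindow is an illustrative window length, not a value from the article.
const loginWindow = time.Minute

// CounterStore is a hypothetical abstraction over where counters live.
// A production version could be backed by Redis; this sketch keeps the
// counters in memory so that it runs without external services.
type CounterStore interface {
	// Incr adds one to the counter for key and returns the new value.
	// The counter starts again from zero once window has elapsed since
	// the first increment of the current window.
	Incr(key string, window time.Duration) int
}

type entry struct {
	count   int
	expires time.Time
}

type memoryStore struct {
	mu      sync.Mutex
	entries map[string]entry
}

func newMemoryStore() *memoryStore {
	return &memoryStore{entries: make(map[string]entry)}
}

func (s *memoryStore) Incr(key string, window time.Duration) int {
	s.mu.Lock()
	defer s.mu.Unlock()

	now := time.Now()
	e, ok := s.entries[key]
	if !ok || !now.Before(e.expires) {
		// No counter yet, or the previous window has ended: start a new one.
		e = entry{expires: now.Add(window)}
	}
	e.count++
	s.entries[key] = e
	return e.count
}

// allowLogin reports whether this login attempt by user is still within
// limit attempts for the current window.
func allowLogin(store CounterStore, user string, limit int, window time.Duration) bool {
	return store.Incr("login:"+user, window) <= limit
}

func main() {
	store := newMemoryStore()
	for i := 1; i <= 4; i++ {
		fmt.Printf("attempt %d allowed: %v\n", i, allowLogin(store, "alice", 3, loginWindow))
	}
}
```

Keeping the store behind an interface is one way to let the gateway code stay the same while the backing store changes from memory to a shared service.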

To match the ecosystem we have in Clojure, we needed to integrate a whole range of libraries, including a metrics collection library, a logging library, and a JWT library. We were very happy to find all of them at a level of maturity that strongly indicates how widely the community has adopted Go. A language’s community sustainability and maturity are important considerations when deciding whether to migrate to it.

We were ready for the basic migration after approximately two months, having the basic functionality covered and tested.  We started migrating services iteratively within the parent group (our domain group) in a controlled way to the new API Gateway, which was basically a canary release.

We decided to do a controlled rollout with the first few services over the course of the first few weeks, so we could discover the bugs and flaws in production, and have the time to properly fix them before rolling out all of our services.  We wanted to learn from the mistake of moving too quickly with the original API solution, which eventually led to delivering low quality.  

Once we felt we were ready and had fixed all the flaws, we started the migration plan for all of our services. This included a migration guide PDF for each service, covering the exact steps needed to move to the new gateway, the benefits of the move, and the optimal way to perform the migration based on the service’s specific stack and dependencies.

To roll out the new reverse proxy in a gradual manner, we used an application load balancer (ALB) to route the traffic based on a set of predefined URLs that indicate the services we want to be exposed via the new API gateway vs. the old one.

This enabled a very controlled approach to routing traffic, with minimal effort and risk. We took our time, tested each migrated service, and worked hand-in-hand with all the other teams that were responsible for their user-facing services. It took us six months, but we managed to migrate ~40 microservices to the new API gateway with zero downtime.
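
The URL-based split between the old and new gateways can be pictured with a small sketch. The real routing was done by ALB rules rather than application code, so the Go function below only mimics that decision; the path prefixes and gateway names are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical names for the two targets behind the load balancer.
const (
	oldGateway = "clojure-gateway"
	newGateway = "go-gateway"
)

// migratedPrefixes lists the URL prefixes of services that have already been
// moved to the new gateway. The values are made up for illustration.
var migratedPrefixes = []string{"/billing/", "/reports/"}

// pickGateway mimics a path-based routing rule: requests whose path starts
// with a migrated prefix go to the new gateway, everything else stays on
// the old one.
func pickGateway(path string) string {
	for _, prefix := range migratedPrefixes {
		if strings.HasPrefix(path, prefix) {
			return newGateway
		}
	}
	return oldGateway
}

func main() {
	for _, path := range []string{"/billing/invoices", "/login", "/reports/daily"} {
		fmt.Printf("%-18s -> %s\n", path, pickGateway(path))
	}
}
```

With this kind of rule, migrating one more service amounts to adding its prefix to the list, and rolling back amounts to removing it.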

Results

The end result enabled us to go from 25 instances (c4.xlarge) running Clojure code, able to process 60 concurrent requests, to two instances (c3.2xlarge) running Go code, able to support ~5000 concurrent requests a minute, which is a huge improvement. The new architecture is also robust enough for our next phase of growth: it gives us a powerful service that can withstand high scale and grow in business complexity easily due to its procedural approach, and a new language to add to our toolbox when dealing with high scale.

Let’s take for example our reverse proxy solution in Clojure and in Go.

Clojure:

;; Creating a connection manager
(let [cm (clj-http.conn-mgr/make-reusable-conn-manager {:timeout 1 :threads 20 :default-per-route 10})]
  ;; Proxying the request using cm (connection manager)
  (client/request {:method             :get
                   :url                (service/service-uri service-spec uri-match)
                   :headers            (dissoc (into {} (:headers req)) "content-length")
                   :body               (when-let [len (get-in req [:headers "content-length"])]
                                         (bs/to-byte-array (:body req)))
                   :follow-redirects   false
                   :throw-exceptions   false
                   :connection-manager cm
                   :as                 :stream}))

And in Golang:

func NewProxy(spec *serviceSpec.ServiceSpec, director func(*http.Request), respDirector func(*http.Response) error, dialTimeout, dialKAlive, transTLSHTimeout, transRHTimeout time.Duration) *MultiReverseProxy {
	return &MultiReverseProxy{
		proxy: &httputil.ReverseProxy{
			Director:       director,     // rewrites each incoming request before it is sent upstream
			ModifyResponse: respDirector, // inspects or rewrites the upstream response
			Transport: &http.Transport{
				Dial: (&net.Dialer{
					Timeout:   dialTimeout, // limits the time spent establishing a TCP connection (if a new one is needed)
					KeepAlive: dialKAlive,  // interval between TCP keep-alive probes on an active connection
				}).Dial,
				TLSHandshakeTimeout:   transTLSHTimeout, // limits the time spent performing the TLS handshake
				ResponseHeaderTimeout: transRHTimeout,   // limits the time spent waiting for the response headers
			},
		},
		// any remaining fields are omitted in this excerpt
	}
}

Notice how many features oriented towards managing connections and reverse proxying are baked into Go’s standard library.
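
To make the director and respDirector parameters in the Go listing more concrete, here is a hedged sketch of what such hooks could look like. The upstream address, the header names, and the helper names are illustrative assumptions; the article does not show the real ones.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/httputil"
)

// newDirector returns a Director that points every request at upstreamHost
// and tags it with a made-up header. Both are illustrative assumptions.
func newDirector(upstreamHost string) func(*http.Request) {
	return func(req *http.Request) {
		req.URL.Scheme = "http"
		req.URL.Host = upstreamHost
		req.Header.Set("X-Gateway", "go")
	}
}

// hideServerHeader is a ModifyResponse hook that removes the backend's
// Server header before the response is returned to the client.
func hideServerHeader(resp *http.Response) error {
	resp.Header.Del("Server")
	return nil
}

// newGatewayProxy wires both hooks into the standard library reverse proxy.
func newGatewayProxy(upstreamHost string) *httputil.ReverseProxy {
	return &httputil.ReverseProxy{
		Director:       newDirector(upstreamHost),
		ModifyResponse: hideServerHeader,
	}
}

// directedURL applies the proxy's Director to a request for rawURL and
// returns the URL the request would be sent to.
func directedURL(proxy *httputil.ReverseProxy, rawURL string) (string, error) {
	req, err := http.NewRequest(http.MethodGet, rawURL, nil)
	if err != nil {
		return "", err
	}
	proxy.Director(req)
	return req.URL.String(), nil
}

func main() {
	proxy := newGatewayProxy("127.0.0.1:9000") // hypothetical upstream
	target, err := directedURL(proxy, "http://gateway.example/users/42")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(target)
}
```

A gateway would typically put its authentication and routing decisions in the Director, and any response clean-up in ModifyResponse, while leaving the copying of bytes to the proxy itself.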

In Summary

Choosing to write the new version of the API Gateway in Go has proven to be a very good decision. The minimal learning curve of Go made it an excellent language to learn “on the fly” while working on a real production service. Its support for low-level networking constructs such as a reverse proxy, and its general mindset towards performance, made the final result both a real, measurable improvement and more robust. All of the production issues that we had as a result of the previous code are now gone, it is much easier to add new features to the gateway, and the increased traffic we can now support lets us all sleep better at night.

This article was updated 15 February 2019 to clarify several minor points raised in the comments discussion.

About the Author

Asaf Yonay is the R&D Group Manager at AppsFlyer. He is passionate about taking managerial and technical challenges and turning them into success stories by adding the human element into the mix. Asaf is a firm believer in defining processes that help R&D teams grow and scale without losing their velocity, and in taking a hands-on, full-stack approach to staying in touch with the challenges, believing that this is what turns managers into leaders. He has worked in start-ups in various roles, ranging from support and QA to various R&D roles, building scalable, robust systems in Clojure, Golang, Node.js and Python to power React and Angular services, while working with Kafka, Aerospike and Neo4J to handle large-scale or complex business logic states.


Community comments

  • Question about the stats

    by Jeffrey Costa

    Asaf: In the table above, you have a column called "Trans/sec" that is measured in MB. What is a "transaction" in your testing, and why are you measuring weight here instead of transactions completed? #confused

  • Questions about evaluated options

    by Hoang Tran

    What are all the options (beside Go) that were evaluated?

  • Re: Question about the stats

    by Asaf Yonay

    Hey Jeffrey - The reason we're measuring weight is to try to simulate a reverse proxy. We wanted to make sure we're not sending only lightweight requests, as our use case does include some heavier POST requests and large responses.

    Overall, the benchmark was to make sure our 99th percentile isn't more than 10 seconds, that we are dropping a reasonable number of requests (if any), and that we can handle payloads. Golang answered all of those.

  • Re: Questions about evaluated options

    by Asaf Yonay

    Hey Hoang - We started by benchmarking a simple request / response scenario using basic HTTP (that's the first line). Then we moved on to try Nginx with some Lua code on top of it (to simulate our business logic layer, which we use for authentication and authorization). We followed up with Clojure and Jetty/Netty, and finally we tested Golang.

  • Very informative! and a question

    by Matan Safriel

    Sounds like great project execution and definitely a great post that really sheds a valuable informative view for the community! I wonder what you mean by saying you found golang better for maintenance and scalability just because it is typed (!?). Care to delineate this a little?

  • Re: Very informative! and a question

    by Asaf Yonay

    Hey Matan - We found it easier to work in a team on the same codebase (even the same functions) in a typed language - it helped tighten up the borders of each change by making sure you don't break the basic API (input & output) of each function. It's a bit harder in functional languages such as Clojure or JavaScript.

    In addition, the community support felt like it was more oriented towards the problem we were trying to solve (concurrency, HTTP handling, reverse proxy).

  • Don't let yourself draw incorrect conclusions from this article

    by Piotr Owsiak

    I really don't mean to be mean or rude, but the article gives the two technologies an unfair comparison by describing how unsuccessful you were with Clojure and how much more successful you are with Go.
    Comparing a poorly designed solution in Clojure to a well designed one in Go can lead readers into believing that Clojure is a bad technology choice while Go is a great technology choice. This obviously is not true; Clojure is definitely not 83+ times slower than Go (60 vs ~5000 concurrent connections).
