Edgemesh is a P2P web acceleration service based on the WebRTC protocol suite that offloads some of the traffic normally handled by traditional CDNs to browser-based caches shared over a P2P network. The team rolled out their release to production in the last few months and shared their experience.
Edgemesh’s technology depends on an "overlay" network created by end users' browsers using the WebRTC protocol suite. Traditional CDNs have edge locations around the globe that lower latency for end users by serving content from the location closest to them. Edgemesh caches content in the user's browser, as browsers usually do but in a secondary cache, and also enables browsers to talk to each other and serve content to one another rather than fetch it from a CDN edge location. Enabling Edgemesh involves including a JavaScript snippet in the web page. In production, this means Edgemesh's client-side JavaScript can be running on thousands of browsers, across multiple geographies. This overlay network is the "mesh".
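Conceptually, the client's fetch path becomes a three-tier lookup: the browser's secondary cache first, then peers in the mesh, and only then the CDN or origin server. The following Python sketch illustrates that decision logic under those assumptions; the class and method names are hypothetical, not Edgemesh's actual client API.

```python
# Illustrative sketch of a mesh-assisted fetch path: local secondary
# cache, then P2P peers, then the CDN/origin as a fallback.
# All names here are hypothetical, not Edgemesh's actual client API.

class MeshClient:
    def __init__(self, peers):
        self.local_cache = {}   # the browser-side "secondary cache"
        self.peers = peers      # other browsers reachable over the mesh

    def fetch(self, url, cdn_fetch):
        # 1. Local secondary cache
        if url in self.local_cache:
            return self.local_cache[url]
        # 2. Ask peers in the mesh (a WebRTC data channel in practice;
        #    modelled here as simple dict lookups)
        for peer in self.peers:
            content = peer.get(url)
            if content is not None:
                self.local_cache[url] = content
                return content
        # 3. Fall back to the CDN / origin server
        content = cdn_fetch(url)
        self.local_cache[url] = content
        return content
```

The key property is that the origin is contacted only when neither the local cache nor any peer can serve the asset, which is how traffic gets offloaded from the CDN.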
Edgemesh uses Docker containers as their basic unit of deployment. The infrastructure runs on the Joyent public cloud in SmartOS zones, on the metal instead of on virtual machines. Edgemesh utilizes Joyent's Autopilot pattern for container management which allows the containers a high degree of operational autonomy. The database systems run inside the containers as well via Triton. State information like database files, log files etc are stored on stable storage via Joyent Manta with a secondary backup to Google's Storage Platform.
The engineering team identified some primary challenges before moving into production on April 1st. These included the ability to debug production errors without client input, automatic time-constrained release of updates, meshing across 500 networks spanning 50 countries, and offloading up to 1TB of data into the mesh. By May 26th, they enabled full traffic inflow into the mesh. Within a week, over half a terabyte of traffic from the client’s origin servers had been offloaded into the mesh, and the customer's mean page load time decreased by 33%. Metrics were continuously measured and monitored - including the size of the mesh as well as individual page load times. When the Edgemesh client hit an error, it fell back and allowed the browser to resume normal operations. The data collected from the error was then reported back to Edgemesh for analysis.
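The fail-open behaviour described above can be sketched as a wrapper that catches any client error, reports it for later analysis, and lets the browser's normal fetch proceed. This is a minimal Python illustration of the pattern; the function names are made up, and Edgemesh's real client is JavaScript running in the browser.

```python
# Hypothetical sketch of "fail open": if the mesh client errors,
# report the error for analysis and fall back to the normal fetch path
# so the end user never sees a broken page.

def mesh_fetch_with_fallback(url, mesh_fetch, browser_fetch, report_error):
    try:
        return mesh_fetch(url)
    except Exception as exc:
        # Never let a mesh error break the page; resume normal operation
        report_error({"url": url, "error": repr(exc)})
        return browser_fetch(url)
```

The design choice worth noting is that error reporting is a side effect of the fallback, so Edgemesh can debug production errors without requiring any input from the client.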
Behind the scenes, the team used three guiding principles for deploying their software into production - the first two being maintaining consistency via automation, and reducing the attack surface available to malicious entities by periodically recreating their infrastructure. The third - economical scaling by starting from the smallest possible infrastructure - was more of an outcome of the other two.
InfoQ got in touch with Jacob Loveless, CEO of Edgemesh Corporation, to find out more about the DevOps challenges they faced while rolling out to production. Asked about the monitoring tools Edgemesh uses for their Docker infrastructure, Loveless responded:
We use Prometheus for system stats and we have an internal system as well which emits things like errors and message type notes (e.g. WebRTC statistics from the clients). These metrics are used to help each application determine when they should request a scale-up event.
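A metrics-driven scale-up decision of the kind Loveless describes might look like the following sketch, where an application compares its recent metrics against limits and requests a scale-up event when one is breached. The metric names and thresholds here are invented for illustration; the source does not state which metrics or limits Edgemesh actually uses.

```python
# Illustrative sketch: decide whether an application should request a
# scale-up event based on its recent metrics.
# Metric names and thresholds are hypothetical, not Edgemesh's.

SCALE_UP_THRESHOLDS = {
    "error_rate": 0.05,       # request scale-up above 5% errors
    "cpu_utilization": 0.80,  # request scale-up above 80% CPU
}

def should_scale_up(metrics):
    """Return True if any watched metric exceeds its threshold."""
    return any(
        metrics.get(name, 0.0) > limit
        for name, limit in SCALE_UP_THRESHOLDS.items()
    )
```

In the Autopilot pattern this decision lives with the application itself, rather than in an external orchestrator, which is what gives each container its operational autonomy.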
Due to the distributed nature of Edgemesh where most of the code lives on the client side, ensuring pre-production sanity is a key challenge. Loveless shed more light on their staging (pre-production) setup:
We have a standard CI/CD platform where we do our Docker builds and those deploy into development environments. When we reach 'staging' we tag an image as staging and those will then get rolled into clean slate. We have one data center which runs staging and handles approximately 10% of the global traffic on that datacenter. Once we are comfortable with tagging a release, we modify the Docker image tag to 'master' and it rolls into production across all data centers. If we need to roll back, we simply change Docker image tags in our registry and things will reset on the next clean slate run. So basically, we do a deployment every day but most times it's not a new release. I would say we get a full release into production every 5-10 days.
The "clean slate" referred to above is Edgemesh's way of recreating their datacenter containers every day. This ensures that they are able to stick to their three deployment principles.
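The tag-based promotion and rollback in the quote can be modelled as moving pointers in an image registry: 'staging' and 'master' are mutable tags pointing at immutable builds, and each clean slate run simply deploys whatever the tag currently resolves to. The following is a simplified Python model under those assumptions; only the tag names come from the quote, everything else is illustrative.

```python
# Simplified model of tag-based promotion and rollback: tags are
# mutable pointers to immutable image digests, and rollback is just
# re-pointing a tag. Digests and class names here are illustrative.

class Registry:
    def __init__(self):
        self.tags = {}  # tag name -> image digest

    def tag(self, name, digest):
        self.tags[name] = digest

    def resolve(self, name):
        return self.tags[name]

registry = Registry()
registry.tag("staging", "sha256:build-42")  # CI tags a build for staging
registry.tag("master", "sha256:build-41")   # production runs the prior build

# Promotion: repoint 'master' at the staging build
registry.tag("master", registry.resolve("staging"))

# Rollback: repoint 'master' at the known-good build; the next
# clean slate run deploys whatever 'master' resolves to.
registry.tag("master", "sha256:build-41")
```

Because the daily clean slate always redeploys from the current tags, promotion and rollback both reduce to a registry update rather than a special-case deployment procedure.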
By allowing the staging setup to absorb live production data, coupled with an easy mechanism to roll back changes, Edgemesh exposes its code to some of the load and traffic patterns it will see in production. However, this is usually not enough, so Edgemesh's strategy is "making sure you can rapidly move versions of software into production", says Loveless.
Part of the clean slate strategy is a daily reset of instance sizes that might have auto-scaled up from their base sizes. This happens on a per-datacenter basis. If a datacenter is experiencing high traffic and the baseline (starting) state cannot handle it after a reset, Edgemesh needs to ensure that clients don't see errors. Loveless explains how this is achieved:
When datacenter A starts the clean slate, the first thing it does is deregister itself from the DNS entry. We run low TTLs on the DNS records (30 seconds) and a five minute pause gives it time to ensure all traffic is redirected to B and C. After five minutes A then begins the clean slate. When it comes back online, it re-registers with the DNS servers - starts receiving traffic and then datacenter B begins the clean slate (B begins when A sends a message letting B know it's back online and receiving traffic). During the transition, datacenters B & C will often scale up to account for the additional load.
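The rolling sequence Loveless describes can be sketched as an ordered drain-reset-rejoin loop over datacenters: each one deregisters from DNS, waits out a drain pause well past the 30-second TTL, performs its clean slate, re-registers, and only then signals the next datacenter to begin. The Python sketch below models that ordering; the DNS and clock interfaces are stand-ins, not Edgemesh's real tooling.

```python
# Illustrative sketch of the rolling clean-slate sequence: each
# datacenter drains via DNS, resets, rejoins, then hands off to the
# next. The dns/wait/log interfaces are stand-ins for real tooling.

DNS_TTL_SECONDS = 30        # low TTL on the DNS records (from the quote)
DRAIN_PAUSE_SECONDS = 300   # five-minute pause to let traffic redirect

def clean_slate_rotation(datacenters, dns, wait, log):
    for dc in datacenters:
        dns.deregister(dc)            # stop attracting new traffic
        wait(DRAIN_PAUSE_SECONDS)     # well past the 30s DNS TTL
        log(f"{dc}: clean slate (reset containers to baseline sizes)")
        dns.register(dc)              # rejoin and start receiving traffic
        log(f"{dc}: back online, signalling next datacenter")
```

The sequencing guarantees that at most one datacenter is ever out of rotation, which is why the remaining datacenters scale up to absorb the redirected load during each transition.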