GitHub recently upgraded its MySQL infrastructure from version 5.7 to 8.0. The motivation behind the upgrade was MySQL 5.7 reaching end of life and the need to leverage the latest security patches, bug fixes, and performance enhancements offered by MySQL 8.0.
GitHub's MySQL infrastructure consists of over 1,200 hosts, a blend of Azure Virtual Machines and bare-metal hosts, managing more than 300 TB of data and serving 5.5 million queries per second across 50+ database clusters. The setup ensures high availability through a primary-plus-replicas cluster configuration. GitHub's data partitioning approach combines vertical sharding, which moves tables for specific product domains into dedicated clusters, and horizontal sharding, which splits domains that have outgrown a single-primary cluster across multiple clusters.
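To illustrate the distinction, the sketch below shows one way vertical (functional) sharding can be modeled, with each product domain routed to its own primary-plus-replicas cluster. The domain names and cluster identifiers here are hypothetical, not GitHub's actual layout.

```python
# Illustrative only: a minimal routing map for vertically sharded domains.
# Domain and cluster names are hypothetical, not GitHub's real topology.
DOMAIN_TO_CLUSTER = {
    "repositories": "mysql-repositories-cluster",
    "issues": "mysql-issues-cluster",
    "notifications": "mysql-notifications-cluster",
}

def cluster_for(domain: str) -> str:
    """Return the dedicated cluster that owns a given product domain."""
    return DOMAIN_TO_CLUSTER[domain]

# A domain that outgrows its single-primary cluster is then split further
# with horizontal sharding (e.g., by key range or hash) across clusters.
```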
Preparation for this major upgrade started in July 2022. Recognizing the criticality of its MySQL infrastructure, the database team needed to keep meeting its service level objectives (SLOs) and agreements (SLAs) throughout the process. Given the diverse workloads and the scale of the operation, preparation involved ensuring infrastructure readiness, verifying application compatibility, and maintaining transparent communication across the teams involved.
The upgrade began with a rolling replica upgrade, with each stage monitored carefully before moving to the next. The team then updated the replication topology, setting up a MySQL 8.0 primary candidate replicating under the current 5.7 primary. The primary was switched to MySQL 8.0 by promoting that candidate rather than upgrading primary hosts in place. Ancillary servers were subsequently upgraded for consistency, followed by a thorough cleanup phase.
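A minimal sketch of how such a staged rollout can be gated is shown below, assuming a simple per-host replication-lag threshold; the hostnames, credentials, and threshold are placeholders rather than GitHub's actual tooling.

```python
import time
import pymysql

# Hypothetical replica fleet and lag budget; not GitHub's real hosts or limits.
REPLICAS = ["replica-01.example.internal", "replica-02.example.internal"]
MAX_LAG_SECONDS = 5

def replication_lag(host: str) -> int:
    """Read Seconds_Behind_Source from a MySQL 8.0 replica."""
    conn = pymysql.connect(host=host, user="monitor", password="secret",
                           cursorclass=pymysql.cursors.DictCursor)
    try:
        with conn.cursor() as cur:
            cur.execute("SHOW REPLICA STATUS")
            row = cur.fetchone()
            return int(row["Seconds_Behind_Source"] or 0)
    finally:
        conn.close()

for host in REPLICAS:
    # The actual upgrade of the host (package install and restart) is elided;
    # the point is to wait for each replica to catch up before moving on.
    while replication_lag(host) > MAX_LAG_SECONDS:
        time.sleep(10)
```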
Source: Upgrading GitHub.com to MySQL 8.0
An important aspect of GitHub's upgrade strategy was the ability to roll back to MySQL 5.7 if necessary. To preserve that option, GitHub tackled compatibility issues, such as character set differences and role management features, ensuring a seamless transition between versions. During the upgrade, the team also faced challenges with Vitess upgrades, encountering replication delays and bugs that necessitated deploying specific MySQL versions and careful workload management. They also ran into the real-world problem of queries passing in Continuous Integration (CI) environments but failing in production.
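One concrete way to keep a 5.7 rollback path open is to verify that no schema has picked up collations that only MySQL 8.0 understands. The sketch below queries information_schema for the utf8mb4_0900_* collations introduced in 8.0; the host and credentials are placeholders, and this is an assumed check rather than GitHub's published procedure.

```python
import pymysql

# Placeholder connection details; point this at an 8.0 primary candidate.
conn = pymysql.connect(host="primary-candidate.example.internal",
                       user="monitor", password="secret")
with conn.cursor() as cur:
    # Collations in the utf8mb4_0900_* family exist only in MySQL 8.0, so
    # tables using them cannot be replicated back to or read by 5.7.
    cur.execute("""
        SELECT table_schema, table_name, table_collation
        FROM information_schema.tables
        WHERE table_collation LIKE 'utf8mb4\\_0900%'
    """)
    for schema, name, collation in cur.fetchall():
        print(f"8.0-only collation blocks rollback: {schema}.{name} ({collation})")
conn.close()
```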
The upgrade process went on for over a year, involving multiple teams and highlighting the importance of an effective observability platform, robust testing plans, and reliable rollback capabilities. The team learned the value of consistency in client-side configurations, which proved vital to successful rollbacks. The experience underscored the benefits of well-understood and standardized connection configurations.
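As a sketch of what standardized client-side configuration can look like, the snippet below pins connection settings in one shared helper so behavior does not drift between services during an upgrade or rollback. The specific parameters and values are illustrative assumptions, not GitHub's configuration.

```python
import pymysql

# Illustrative defaults shared by every service; values are assumptions.
STANDARD_CONNECTION_DEFAULTS = {
    "charset": "utf8mb4",   # pinned explicitly instead of relying on server defaults
    "connect_timeout": 5,
    "autocommit": True,
}

def connect(host: str, user: str, password: str, database: str):
    """Single place to open connections, so a rollback to 5.7 does not expose
    per-client differences in character set or session settings."""
    return pymysql.connect(host=host, user=user, password=password,
                           database=database, **STANDARD_CONNECTION_DEFAULTS)
```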
We came across an interesting discussion on Lobsters, where users discussed MySQL scalability and how large organizations make MySQL work at scale.
One of the key takeaways from this upgrade was the pivotal role of previous efforts in data partitioning. This strategy allowed for more targeted upgrades and minimized the risks associated with unknown variables during the process. The expansion of the MySQL fleet from five clusters to the current 50 also highlighted the need for enhanced observability, advanced tooling, and efficient processes for managing a larger fleet.