Upgrading InfoQ.com's Database: Mostly Smooth, A Couple of Bumps
Recently, InfoQ.com upgraded the backend database that we've been running the site off of since launch. However, everything did not go according to plan, and even though the vast majority of the migration was smooth we encountered some unexpected issues along the way (which have now been resolved). This post will discuss in detail what our plan was, what worked and what didn't, and how we detected and recovered from the errors we saw.
Planning for this upgrade was quite involved and spanned a few months. Our existing database server had been running at almost full utilization since the hardware hadn't changed from the initial launch in 2006, and this was starting to result in performance issues on the site, particularly during regular reporting jobs. In addition, this upgrade was a good opportunity for InfoQ to migrate from a single master MySQL database to a master/slave MySQL setup which would allow for hot failover and the running of reports and backups against a slave rather than the master database, thus isolating InfoQ.com users from the performance ramifications of those processes.
A multi-step migration plan was prepared for this move, and the steps included:
- Preparation: This included teleconferences between the ISP (Contegix) and our development and operations teams, creating an operational plan, advance preparation of all necessary configuration files, and ensuring that the ISP knew what the plan was
- Service Downtime/Data Transfer: This included performing a controlled shutdown of MySQL (stopping the web servers, ensuring that transactions were all complete, setting the DB to read-only, validating the transaction log, and shutting down MySQL), physically connecting the old database server to the new one in the data center to ensure maximum data transfer speed, copying all data from the old server to the new one, validating it, and then bringing all of the services back up so that InfoQ.com was running again (still using the old database server at this point)
- Database InnoDB Table Conversion: This included bringing up MySQL on the new database server and exporting the tables va mysqldump, shutting down MySQL and wiping out the existing data, and re-importing the data in order to convert it to file-per-table mode (with validation at all steps)
- First Slave Setup: This included pointing the new database server at the old one, and ensuring that master/slave replication was happening correctly, and then pointing the reporting jobs at the new slave server
- Second Slave Setup: This included pointing a second new database server at the first new one, and ensuring that master/slave replication was happening correctly - this would result in a
master <-- slave 1 <-- slave 2setup, which allowed for rapid synchronization of slave 2 off of slave 1 rather than a much slower synchronization off of the (fully utilized/overloaded) master which would have resulted from a
master <-- slave 1, slave 2setup
- Swapping To The New Master: This included performing the actual switchover from the existing old master to the new one (what was the first slave) and changing the old master to be a new slave, as well as pointing the web servers to point at the new master database
It was also determined that, by installing MySQL 5.1 on the two new slave servers, that we could save a further downtime later on to upgrade MySQL. Our research indicated that a MySQL 5.1 slave could successfully synchronize against a MySQL 5.0 master, so by starting with MySQL 5.1 on the slaves we would then be able to combine the upgrade to 5.1 with the switching of the first slave database to be the new master.
One of the other potential options (which would have avoided downtime) would have been to perform the mysqldump against the master server and to then transfer that to the slave, rather than taking down the master and performing a file copy - however, the very high utilization level of the server ruled this out as a possibility as it was anticipated that running the backup job would put too much burden on the master database server and would likely result in a de facto downtime period for InfoQ.com anyways due to the slow responsiveness which would be expected.
Executing The Plan
The first challenge was finding a good time to do this upgrade - the development team is in Romania, the ISP is in Missouri, and the operations team is in Vancouver and Oregon. Friday late evening in Pacific time (Saturday morning in Europe) was chosen because it allowed the downtime to occur during a quiet period for InfoQ.com, and also for the operations team to start in their evening and the development team to start in their morning (there is a 10 hour difference between Romania and Vancouver). Once this time was chosen, step 1 was complete.
Step 2, the one which involved server downtime, was originally estimated to take around 2 hours based on the two servers being directly linked via gigabit ethernet. However, due to network issues which were encountered, this ended up taking more than 5 hours. This reminded us of an important point: If you think that networking issues cannot happen within a single datacenter, you are wrong, period.
After that, steps 3 through 6 seemed to happen very smoothly, and no further issues were encountered. We performed the switchover to the new master database on February 26th, and things seemed to be working normally after a basic round of testing. As a welcome bonus, the site felt much more responsive, and we were quite happy with the process overall.
However, this would be a pretty boring post if the story ended there. :)
Problems With The Comment System
A couple of days after the migration, it first came to our attention that there were issues with the commenting system. In particular, new comments would appear to be added to the site but would subsequently disappear after a few minutes, giving users the impression that they had been deleted. The cause of this ended up being a surprising one - by enabling master/slave replication on our database server, the version of Jive which currently use was losing comments. The solution to this was to change the transaction isolation level from READ_UNCOMMITTED to REPEATABLE_READ (because READ_UNCOMMITTED can throw errors in a replicated environment), and this allowed comments to be created once more.
However, this didn't end up solving all of our problems. Although our initial tests showed that creating comments now worked, we started to see very strange consistency issues after a day or two - in particular, we noticed that new comments which were added to posts created after the database migration were being merged into seemingly arbitrary existing conversations on the site. Digging a bit further into this, we discovered that Jive had reset the unique ID it uses to keep track of threads without reporting any errors or warnings to the logs, and as a result we were potentially overwriting or corrupting existing (albeit old) forum threads since that counter had reset from ~16,000 to 0. The culprit on this one ended up being our choice of a row-based replication strategy for MySQL, and once we realized that we were able to address it and perform cleanup on the comment threads which had been incorrectly assigned.
What We Learned
There were a few takeaways that we got from this upgrade process:
- Detailed Planning Is Important: By putting a lot of thought into the planning and reasoning through the contingencies (including what to do in case things failed or errors were encountered), things go much more smoothly and you can stick to a plan when things go wrong as opposed to scrambling in the middle of an upgrade when something unexpected happens
- Plans Will Change: There is an old saying, "no plan survives first contact with the enemy" -- it is valuable to think through planning, but you also need to be ready to react to contingencies, and there is no way to plan for everything
- Systematic Debugging Saves Time: The faster you rule out things, the faster you find the real cause of an issue. And, as soon as you see an error, throw it into a Google search and see what comes out - this can be a huge time-saver for obscure, non-obvious bugs
- Actively Seek Feedback: Don't wait for someone to email you or otherwise directly contact you if they encounter a problem - we first learned of this issue by keeping a close eye on Twitter commentary, and if we had waited for someone to email us about the issue it might've existed for much longer. Twitter searches and Google Alerts are your friend - use them
As always, we're interested to hear any feedback you have, via any one (or even several!) of our feedback channels:
- Twitter via @InfoQ
- Email via firstname.lastname@example.org
- Google Groups via the InfoQ group
- LinkedIn via the InfoQ group
- Commenting on news posts
How Can We Use Our Creative Power and Technological Opportunity to Address the Challenges of the 21st Century?
Gyorgyi Galik Feb 26, 2015