InfoQ

InfoQ

News

My Bookmarks

Login or Register to enable bookmarks for unlimited time.

The content has been bookmarked!

There was an error bookmarking this content! Please retry.

Upgrading InfoQ.com's Database: Mostly Smooth, A Couple of Bumps

Posted by Alex Popescu and Ryan Slobojan on Apr 21, 2010

Sections
Enterprise Architecture,
Operations & Infrastructure,
Process & Practices,
Architecture & Design,
Development
Topics
Agile ,
SOA ,
Ruby ,
.NET ,
Java ,
Inside InfoQ ,
Debugging ,
Architecture ,
Operations
Tags
Database Replication ,
Update ,
Database

Recently, InfoQ.com upgraded the backend database that we've been running the site off of since launch. However, everything did not go according to plan, and even though the vast majority of the migration was smooth we encountered some unexpected issues along the way (which have now been resolved). This post will discuss in detail what our plan was, what worked and what didn't, and how we detected and recovered from the errors we saw.

The Plan

Planning for this upgrade was quite involved and spanned a few months. Our existing database server had been running at almost full utilization since the hardware hadn't changed from the initial launch in 2006, and this was starting to result in performance issues on the site, particularly during regular reporting jobs. In addition, this upgrade was a good opportunity for InfoQ to migrate from a single master MySQL database to a master/slave MySQL setup which would allow for hot failover and the running of reports and backups against a slave rather than the master database, thus isolating InfoQ.com users from the performance ramifications of those processes.

A multi-step migration plan was prepared for this move, and the steps included:

  1. Preparation: This included teleconferences between the ISP (Contegix) and our development and operations teams, creating an operational plan, advance preparation of all necessary configuration files, and ensuring that the ISP knew what the plan was
  2. Service Downtime/Data Transfer: This included performing a controlled shutdown of MySQL (stopping the web servers, ensuring that transactions were all complete, setting the DB to read-only, validating the transaction log, and shutting down MySQL), physically connecting the old database server to the new one in the data center to ensure maximum data transfer speed, copying all data from the old server to the new one, validating it, and then bringing all of the services back up so that InfoQ.com was running again (still using the old database server at this point)
  3. Database InnoDB Table Conversion: This included bringing up MySQL on the new database server and exporting the tables va mysqldump, shutting down MySQL and wiping out the existing data, and re-importing the data in order to convert it to file-per-table mode (with validation at all steps)
  4. First Slave Setup: This included pointing the new database server at the old one, and ensuring that master/slave replication was happening correctly, and then pointing the reporting jobs at the new slave server
  5. Second Slave Setup: This included pointing a second new database server at the first new one, and ensuring that master/slave replication was happening correctly - this would result in a master <-- slave 1 <-- slave 2 setup, which allowed for rapid synchronization of slave 2 off of slave 1 rather than a much slower synchronization off of the (fully utilized/overloaded) master which would have resulted from a master <-- slave 1, slave 2 setup
  6. Swapping To The New Master: This included performing the actual switchover from the existing old master to the new one (what was the first slave) and changing the old master to be a new slave, as well as pointing the web servers to point at the new master database

It was also determined that, by installing MySQL 5.1 on the two new slave servers, that we could save a further downtime later on to upgrade MySQL. Our research indicated that a MySQL 5.1 slave could successfully synchronize against a MySQL 5.0 master, so by starting with MySQL 5.1 on the slaves we would then be able to combine the upgrade to 5.1 with the switching of the first slave database to be the new master.

One of the other potential options (which would have avoided downtime) would have been to perform the mysqldump against the master server and to then transfer that to the slave, rather than taking down the master and performing a file copy - however, the very high utilization level of the server ruled this out as a possibility as it was anticipated that running the backup job would put too much burden on the master database server and would likely result in a de facto downtime period for InfoQ.com anyways due to the slow responsiveness which would be expected.

Executing The Plan

The first challenge was finding a good time to do this upgrade - the development team is in Romania, the ISP is in Missouri, and the operations team is in Vancouver and Oregon. Friday late evening in Pacific time (Saturday morning in Europe) was chosen because it allowed the downtime to occur during a quiet period for InfoQ.com, and also for the operations team to start in their evening and the development team to start in their morning (there is a 10 hour difference between Romania and Vancouver). Once this time was chosen, step 1 was complete.

Step 2, the one which involved server downtime, was originally estimated to take around 2 hours based on the two servers being directly linked via gigabit ethernet. However, due to network issues which were encountered, this ended up taking more than 5 hours. This reminded us of an important point: If you think that networking issues cannot happen within a single datacenter, you are wrong, period.

After that, steps 3 through 6 seemed to happen very smoothly, and no further issues were encountered. We performed the switchover to the new master database on February 26th, and things seemed to be working normally after a basic round of testing. As a welcome bonus, the site felt much more responsive, and we were quite happy with the process overall.

However, this would be a pretty boring post if the story ended there. :)

Problems With The Comment System

A couple of days after the migration, it first came to our attention that there were issues with the commenting system. In particular, new comments would appear to be added to the site but would subsequently disappear after a few minutes, giving users the impression that they had been deleted. The cause of this ended up being a surprising one - by enabling master/slave replication on our database server, the version of Jive which currently use was losing comments. The solution to this was to change the transaction isolation level from READ_UNCOMMITTED to REPEATABLE_READ (because READ_UNCOMMITTED can throw errors in a replicated environment), and this allowed comments to be created once more.

However, this didn't end up solving all of our problems. Although our initial tests showed that creating comments now worked, we started to see very strange consistency issues after a day or two - in particular, we noticed that new comments which were added to posts created after the database migration were being merged into seemingly arbitrary existing conversations on the site. Digging a bit further into this, we discovered that Jive had reset the unique ID it uses to keep track of threads without reporting any errors or warnings to the logs, and as a result we were potentially overwriting or corrupting existing (albeit old) forum threads since that counter had reset from ~16,000 to 0. The culprit on this one ended up being our choice of a row-based replication strategy for MySQL, and once we realized that we were able to address it and perform cleanup on the comment threads which had been incorrectly assigned.

What We Learned

There were a few takeaways that we got from this upgrade process:

  • Detailed Planning Is Important: By putting a lot of thought into the planning and reasoning through the contingencies (including what to do in case things failed or errors were encountered), things go much more smoothly and you can stick to a plan when things go wrong as opposed to scrambling in the middle of an upgrade when something unexpected happens
  • Plans Will Change: There is an old saying, "no plan survives first contact with the enemy" -- it is valuable to think through planning, but you also need to be ready to react to contingencies, and there is no way to plan for everything
  • Systematic Debugging Saves Time: The faster you rule out things, the faster you find the real cause of an issue. And, as soon as you see an error, throw it into a Google search and see what comes out - this can be a huge time-saver for obscure, non-obvious bugs
  • Actively Seek Feedback: Don't wait for someone to email you or otherwise directly contact you if they encounter a problem - we first learned of this issue by keeping a close eye on Twitter commentary, and if we had waited for someone to email us about the issue it might've existed for much longer. Twitter searches and Google Alerts are your friend - use them

As always, we're interested to hear any feedback you have, via any one (or even several!) of our feedback channels:

  • This article is part of a featured topic series on SOA and also Agile

Related Sponsor

In today’s hyper-competitive world, later may be too late to adopt Agile development and this Roadmap for Success will help you get started. Download "Agile Development: A Manager's Roadmap for Success" now!

No comments

Watch Thread Reply

Educational Content

New-age Transactional Systems - Not Your Grandpa's OLTP

John Hugg discusses high volume transaction processing applications with high and low frequency profiles, and how VoltDB can be used for that purpose.

Cool Code

Kevlin Henney examines code samples to see what can be learned from them starting from the premise that one won’t write great code unless he knows how to read it.

Collaboration: At the Extremities of Extreme

Jason Ayers share the observations he made watching a team of developers collaborating in real time on the same code base, pushing XP, pair programming and continuous integration to their extremes.

Yesod Web Framework

Michael Snoyman presents Yesod, a web framework written in Haskell and containing a web server, templating, ORM, libraries (templating, gravatar, etc.).

Transactions without Transactions

Richard Kreuter and Kyle Banker on how to avoid classical RDBMS transactional systems by using compensation mechanisms, transactional messaging or transactional procedures.

Attila Szegedi on JVM and GC Performance Tuning at Twitter

Attila Szegedi talks about performance tuning Java and Scala programs at Twitter: how to approach GC problems, the importance of asynchronous I/O, when to use MySQL/Cassandra/Redis, and much more.

10 tips on how to prevent business value risk

One category of risk that project teams need to ensure they address is business value failure – delivering a product that fails to provide value for the business investor.

Interview: Software Systems Architecture: Working With Stakeholders Using Viewpoints and Perspectives

InfoQ spoke to the authors of Software Systems Architecture on a couple of new topics, the System Context viewpoint and Agile, which have been added to the second edition.