New-age Transactional Systems - Not Your Grandpa's OLTP
John Hugg discusses high volume transaction processing applications with high and low frequency profiles, and how VoltDB can be used for that purpose.
The content has been bookmarked!
There was an error bookmarking this content! Please retry.
Posted by Alex Popescu and Ryan Slobojan on Apr 21, 2010
Recently, InfoQ.com upgraded the backend database that we've been running the site off of since launch. However, everything did not go according to plan, and even though the vast majority of the migration was smooth we encountered some unexpected issues along the way (which have now been resolved). This post will discuss in detail what our plan was, what worked and what didn't, and how we detected and recovered from the errors we saw.
Planning for this upgrade was quite involved and spanned a few months. Our existing database server had been running at almost full utilization since the hardware hadn't changed from the initial launch in 2006, and this was starting to result in performance issues on the site, particularly during regular reporting jobs. In addition, this upgrade was a good opportunity for InfoQ to migrate from a single master MySQL database to a master/slave MySQL setup which would allow for hot failover and the running of reports and backups against a slave rather than the master database, thus isolating InfoQ.com users from the performance ramifications of those processes.
A multi-step migration plan was prepared for this move, and the steps included:
master <-- slave 1 <-- slave 2 setup, which allowed for rapid synchronization of slave 2 off of slave 1 rather than a much slower synchronization off of the (fully utilized/overloaded) master which would have resulted from a master <-- slave 1, slave 2 setupIt was also determined that, by installing MySQL 5.1 on the two new slave servers, that we could save a further downtime later on to upgrade MySQL. Our research indicated that a MySQL 5.1 slave could successfully synchronize against a MySQL 5.0 master, so by starting with MySQL 5.1 on the slaves we would then be able to combine the upgrade to 5.1 with the switching of the first slave database to be the new master.
One of the other potential options (which would have avoided downtime) would have been to perform the mysqldump against the master server and to then transfer that to the slave, rather than taking down the master and performing a file copy - however, the very high utilization level of the server ruled this out as a possibility as it was anticipated that running the backup job would put too much burden on the master database server and would likely result in a de facto downtime period for InfoQ.com anyways due to the slow responsiveness which would be expected.
The first challenge was finding a good time to do this upgrade - the development team is in Romania, the ISP is in Missouri, and the operations team is in Vancouver and Oregon. Friday late evening in Pacific time (Saturday morning in Europe) was chosen because it allowed the downtime to occur during a quiet period for InfoQ.com, and also for the operations team to start in their evening and the development team to start in their morning (there is a 10 hour difference between Romania and Vancouver). Once this time was chosen, step 1 was complete.
Step 2, the one which involved server downtime, was originally estimated to take around 2 hours based on the two servers being directly linked via gigabit ethernet. However, due to network issues which were encountered, this ended up taking more than 5 hours. This reminded us of an important point: If you think that networking issues cannot happen within a single datacenter, you are wrong, period.
After that, steps 3 through 6 seemed to happen very smoothly, and no further issues were encountered. We performed the switchover to the new master database on February 26th, and things seemed to be working normally after a basic round of testing. As a welcome bonus, the site felt much more responsive, and we were quite happy with the process overall.
However, this would be a pretty boring post if the story ended there. :)
A couple of days after the migration, it first came to our attention that there were issues with the commenting system. In particular, new comments would appear to be added to the site but would subsequently disappear after a few minutes, giving users the impression that they had been deleted. The cause of this ended up being a surprising one - by enabling master/slave replication on our database server, the version of Jive which currently use was losing comments. The solution to this was to change the transaction isolation level from READ_UNCOMMITTED to REPEATABLE_READ (because READ_UNCOMMITTED can throw errors in a replicated environment), and this allowed comments to be created once more.
However, this didn't end up solving all of our problems. Although our initial tests showed that creating comments now worked, we started to see very strange consistency issues after a day or two - in particular, we noticed that new comments which were added to posts created after the database migration were being merged into seemingly arbitrary existing conversations on the site. Digging a bit further into this, we discovered that Jive had reset the unique ID it uses to keep track of threads without reporting any errors or warnings to the logs, and as a result we were potentially overwriting or corrupting existing (albeit old) forum threads since that counter had reset from ~16,000 to 0. The culprit on this one ended up being our choice of a row-based replication strategy for MySQL, and once we realized that we were able to address it and perform cleanup on the comment threads which had been incorrectly assigned.
There were a few takeaways that we got from this upgrade process:
As always, we're interested to hear any feedback you have, via any one (or even several!) of our feedback channels:
Getting Started with Stratos - an Open Source Cloud Platform
Fair Trade Software Licensing - A Guide to Neo4j Licensing Options
In today’s hyper-competitive world, later may be too late to adopt Agile development and this Roadmap for Success will help you get started. Download "Agile Development: A Manager's Roadmap for Success" now!
John Hugg discusses high volume transaction processing applications with high and low frequency profiles, and how VoltDB can be used for that purpose.
Kevlin Henney examines code samples to see what can be learned from them starting from the premise that one won’t write great code unless he knows how to read it.
Jason Ayers share the observations he made watching a team of developers collaborating in real time on the same code base, pushing XP, pair programming and continuous integration to their extremes.
Michael Snoyman presents Yesod, a web framework written in Haskell and containing a web server, templating, ORM, libraries (templating, gravatar, etc.).
Richard Kreuter and Kyle Banker on how to avoid classical RDBMS transactional systems by using compensation mechanisms, transactional messaging or transactional procedures.
Attila Szegedi talks about performance tuning Java and Scala programs at Twitter: how to approach GC problems, the importance of asynchronous I/O, when to use MySQL/Cassandra/Redis, and much more.
One category of risk that project teams need to ensure they address is business value failure – delivering a product that fails to provide value for the business investor.
InfoQ spoke to the authors of Software Systems Architecture on a couple of new topics, the System Context viewpoint and Agile, which have been added to the second edition.
No comments
Watch Thread Reply