Serious Red Hat Linux Bug Affects Haswell-based Servers
A recent post by Gil Tene highlights an important but little-known Linux kernel patch that all users and administrators of Linux systems should review, especially those running on Haswell processors. Tene reports that users of Red Hat-based distributions in particular (including CentOS 6.6 and Scientific Linux 6.6) should apply the patch as soon as possible. Even if your instance of Linux is running in a VM, that VM is most likely hosted on a Haswell machine if it runs on one of the popular cloud providers (Azure, Amazon, etc.) and would benefit from the patch.
Tene describes the flaw as follows:
“The impact of this kernel bug is very simple: user processes can deadlock and hang in seemingly impossible situations. A futex wait call (and anything using a futex wait) can stay blocked forever, even though it had been properly woken up by someone. Thread.park() in Java may stay parked. Etc. If you are lucky you may also find soft lockup messages in your dmesg logs. If you are not that lucky (like us, for example), you'll spend a couple of months of someone's time trying to find the fault in your code, when there is nothing there to find.”
Tene goes on to explain how the flawed code behaved (it boils down to a switch block missing a default case). The main reason the problem matters today is that, although the code in question was fixed upstream in January 2014, the flawed code was backported into the Red Hat 6.6 family around October 2014. Other distributions (including SLES, Ubuntu, Debian, etc.) are probably affected as well.
The fix for those systems is only now being distributed, and it could be overlooked. Red Hat users should look for RHEL 6.6.z or newer. A key point made by Tene is that the fix has been unevenly distributed, because different distributions make their own choices about what goes into their kernels.
For example, regarding RHEL 7.1, Tene writes: “The upstream 3.10 didn't have the bug. But RHEL 7's version is different from the pure upstream version. Unfortunately, RHEL 7.1 (much like RHEL 6.6) backported the change that included the bug… I expect that some other distros may have also done the same.”
For RHEL-based distributions, Tene produced a quick table for reference (emphasis in the original):
RHEL 6 (and CentOS 6, and SL 6): 6.0-6.5 are good. 6.6 is BAD. 6.6.z is good.
RHEL 7 (and CentOS 7, and SL 7): 7.1 is BAD. As of yesterday, there does not yet appear to be a 7.x fix. [May 13, 2015]
RHEL 5 (and CentOS 5, and SL 5): All versions are good (including 5.11).
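For anyone auditing a fleet of machines, the table above can be turned into a small lookup. The following is only a minimal sketch: the function name futex_bug_status and the z_stream flag are hypothetical, and it encodes nothing beyond the table as it stood on May 13, 2015. Releases the table does not mention are reported as "unknown" rather than guessed.

```python
def futex_bug_status(major: int, minor: int, z_stream: bool = False) -> str:
    """Classify a RHEL / CentOS / SL release using Tene's table (May 13, 2015).

    Hypothetical helper for illustration only. `z_stream` marks a release
    that has received the z-stream update (e.g. 6.6.z).
    Returns "good", "bad", or "unknown" (release not covered by the table).
    """
    if major == 5:
        # All RHEL 5 versions are listed as good, including 5.11.
        return "good"
    if major == 6:
        if 0 <= minor <= 5:
            return "good"
        if minor == 6:
            # 6.6 is bad unless the 6.6.z update has been applied.
            return "good" if z_stream else "bad"
        return "unknown"
    if major == 7:
        if minor == 1:
            # No 7.x fix had appeared as of the table's date.
            return "bad"
        return "unknown"
    return "unknown"


if __name__ == "__main__":
    for release in [(5, 11, False), (6, 5, False), (6, 6, False),
                    (6, 6, True), (7, 1, False)]:
        print(release, futex_bug_status(*release))
```

Because the table is a snapshot, a "bad" result here should prompt a check of the vendor's current errata, not be read as a final verdict.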
A conversation about this discovery on Hacker News saw some participants disputing the number of affected systems, but it provides useful context for checking whether your system may need the patch.