Facebook engineers Slobodan Predolac and Nicolas Spielberg have recently described how they "solved a long-term mobile debugging problem and reduced the crash rate ... by more than 50 percent." In the process, they showcase generally useful techniques and a few Facebook tools that can help with large, rapidly evolving codebases.
The bug started to manifest itself, Predolac and Spielberg recount, as a Core Data crash. Their first step was using Facebook's own Hipal and Scuba tools to query and aggregate data across crash reports. The outcome of this analysis was that the Core Data error codes "varied across half a dozen different manifestations."
One of the complexities of identifying the origin of the problem lay in how Facebook develops its software, with a monthly release cycle and "hundreds of developers committing to each release." So, the two engineers write, "while we were able to narrow the timeframe, we couldn’t isolate further than a couple thousand commits." Furthermore, the A/B tests carried out during each release made it hard to tell whether "changes were related to code or configuration."
As a last resort, Facebook engineers started formulating different hypotheses, and after excluding a number of them they set out to challenge the idea that Core Data was the root cause of the problem. Consequently, they identified "a piece of affected code where [they] could easily switch from Core Data to SQLite to start testing [their] hypothesis."
Shortly thereafter, they received a corruption crash report pointing at some file being overwritten "by rogue thread or process." That was a step in the right direction, but identifying the rogue thread or process "in a massive codebase" did not seem to be an easy task. The approach they followed was to open a honeypot file right before opening the SQLite file, so they could catch any thread writing to it. By inspecting the corrupted files, they found a common prefix across all attachments: 17 03 03 00 28.
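The honeypot idea can be sketched roughly as follows; this is a minimal illustration of the technique as described, with hypothetical paths and function names rather than Facebook's actual code. Opening a throwaway file immediately before the database makes it likely that a stale, recycled file descriptor ends up pointing at the honeypot instead of the SQLite file, so rogue writes land somewhere harmless and inspectable.

#include <fcntl.h>
#include <sqlite3.h>
#include <unistd.h>

static int trap_fd = -1;
static sqlite3 *db = NULL;

// Hypothetical helper: open a sacrificial file right before the real database.
void open_database_with_honeypot(const char *db_path) {
    // Any descriptor freed just before this point is likely to be recycled here,
    // so a rogue writer holding a stale descriptor hits the honeypot, not the DB.
    trap_fd = open("/tmp/honeypot.bin", O_CREAT | O_RDWR, 0600);

    // The SQLite file is opened next and receives a different (later) descriptor.
    sqlite3_open(db_path, &db);
}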
They then used the following command in lldb to set a breakpoint that would catch any caller trying to pass that content to the POSIX write() function:
breakpoint set -n write -c "(*(char**) ($esp + 8))[0]==0x17 && (*(char**) ($esp + 8))[1]==0x03 && (*(char**) ($esp + 8))[2]==0x03 && (*(char**) ($esp + 8))[3]==0x00 && (*(char**) ($esp + 8))[4]==0x28"
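The $esp + 8 expression reads the second argument of write() (the buffer pointer) from the stack, which matches the 32-bit x86 calling convention used, for instance, in the iOS Simulator. On an arm64 device the same check would read the buffer pointer directly from register x1; an equivalent condition might look roughly like:

breakpoint set -n write -c "((unsigned char *)$x1)[0]==0x17 && ((unsigned char *)$x1)[1]==0x03 && ((unsigned char *)$x1)[2]==0x03 && ((unsigned char *)$x1)[3]==0x00 && ((unsigned char *)$x1)[4]==0x28"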
Quickly they found out that their SPDY network stack was the likely culprit (the 17 03 03 prefix matches the header of a TLS 1.2 application-data record) and set out to verify this hypothesis. To accomplish that, they used Fishhook, an open-source tool developed at Facebook, to rebind the system write() call.
// setup a honeypot file
int trap_fd = open(…);

// Create new function to detect writes to the honeypot
static WRITE_FUNC_T original_write = dlsym(RTLD_DEFAULT, "write");

ssize_t corruption_write(int fd, const void *buf, size_t size) {
  FBFatal(fd != trap_fd, @"Writing to the honeypot file");
  return original_write(fd, buf, size);
}

// Replace the system write with our "checked version"
rebind_symbols((struct rebinding[1]){{(char *)"write", (void *)corruption_write}}, 1);
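For reference, fishhook's rebind_symbols can also hand back the original implementation through the struct rebinding's replaced field, as an alternative to looking it up with dlsym. A minimal sketch along those lines; the checked_write and install_write_hook names are illustrative, not from the article:

#include <unistd.h>
#include "fishhook.h"

// fishhook fills this in with the original write() implementation.
static ssize_t (*orig_write)(int, const void *, size_t);

// Illustrative replacement that simply forwards to the original.
static ssize_t checked_write(int fd, const void *buf, size_t size) {
    // ...honeypot or sanity checks would go here...
    return orig_write(fd, buf, size);
}

static void install_write_hook(void) {
    rebind_symbols((struct rebinding[1]){
        {"write", (void *)checked_write, (void **)&orig_write}
    }, 1);
}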
By the next day, they had a new crash report showing that the SSL layer was writing to a socket whose file descriptor had already been closed and reassigned to the database file.
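The underlying failure mode is a classic file-descriptor reuse bug: once the socket is closed, its descriptor number can be handed out by the next open(), so a late write() through the stale descriptor lands in whatever file now owns that number. A simplified, self-contained illustration of the mechanism (not Facebook's code):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    // Stand-in for the network connection the SSL layer was writing to.
    int sock = open("/dev/null", O_WRONLY);
    int stale = sock;          // a stale copy of the descriptor survives somewhere
    close(sock);               // the connection is torn down

    // The database file is opened next and typically receives the recycled number.
    int db_fd = open("database.sqlite", O_CREAT | O_RDWR, 0600);

    // A late write through the stale descriptor now lands in the database file.
    const unsigned char tls_record[] = {0x17, 0x03, 0x03, 0x00, 0x28};
    write(stale, tls_record, sizeof tls_record);

    printf("stale=%d db_fd=%d (same number means the write hit the database)\n",
           stale, db_fd);
    close(db_fd);
    return 0;
}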
Once the cause of the crash was found, the fix was a matter of a couple of hours' work.