Debugging iOS File Corruption at Facebook

Facebook engineers Slobodan Predolac and Nicolas Spielberg have recently described how they "solved a long-term mobile debugging problem and reduced the crash rate ... by more than 50 percent." In the process, they show general useful techniques and a few Facebook tools that can help with large, rapidly evolving codebases.

The bug started to manifest itself, Predolac and Spielberg recount, as a Core Data crash. Their first step was using Facebook own Hipal and Scuba tools to query and aggregate data across crash reports. The outcome of this analysis was that the Code Data error codes "varied across half a dozen different manifestations."

One of the complexities of identifying the origin of the problem lay with how Facebook develops its software, with a monthly release cycle and "hundreds of developers committing to each release." So, the two engineers write, "while we were able to narrow the timeframe, we couldn’t isolate further than a couple thousand commits." Furthermore, A/B tests carried over each release made it hard to tell whether "changes were related to code or configuration."

As a last resort, Facebook engineers started making different hypothesis, and after excluding a number of them they tried to challenge the idea that Core Data was the root cause of the problem. Consequently, they identified "a piece of affected code where [they] could easily switch from Core Data to SQLite to start testing [their] hypothesis."

Shortly thereafter, they received a corruption crash report pointing at some file being overwritten "by rogue thread or process." That was a step in the right direction, but identifying the rogue thread or process "in a massive codebase" did not seem to be an easy task. The approach they followed was to open a honeypot file right before opening the SQLite file so they could catch any thread writing to the file and then inspect the corrupted file. By doing this, they found a common prefix across all attachments: 17 03 03 00 28 and used the following command in lldb to set a breakpoint to identify anyone that would try and send that content to the POSIX write() command:

breakpoint set -n write -c "(*(char**) ($esp + 8))[0]==0x17 
    && (*(char**) ($esp + 8))[1]==0x03 
    && (*(char**) ($esp + 8))[2]==0x03 
    && (*(char**) ($esp + 8))[3]==0x00
    && (*(char**) ($esp + 8))[4]==0x28"

Quickly they found out that their SPDY network stack was the likely culprit and set off to verify this hypothesis. To accomplish that, they used Fishhook, an open-source tool developed at Facebook, to rebind the system write call.

// setup a honeypot file
int trap_fd = open(…); 
// Create new function to detect writes to the honeypot
static WRITE_FUNC_T original_write = dlsym(RTLD_DEFAULT, "write");

ssize_t corruption_write(int fd, const void *buf, size_t size) { 
    FBFatal(fd != trap_fd, @"Writing to the honeypot file");
    return original_write(fd, buf, size);
}

// Replace the system write with our “checked version”
rebind_symbols((struct rebinding[1]){{(char *)"write", (void *)corruption_write}}, 1);

By the next day, they had a new crash report showing that the SSL layer was writing to a socket that had already been closed and then reassigned to the database file.

Once the reason of the crash was found, the fix was a matter of a couple of hours work.

Topics

Beyond the Breach: Proactive Defense in the Age of Advanced Threats

Cell-Based Architecture Adoption Guidelines

Launching AI Agents Across Europe at Breakneck Speed With an Agent Computing Platform

Making Digital Accessibility More Than Just High Contrast: Building Truly Inclusive Software

Proactive Approaches to Securing Linux Systems and Engineering Applications

Helpful links

Choose your language

Write for InfoQ

Rate this Article

This content is in the Performance topic

Related Topics:

Related Editorial

Related Sponsored Content

Popular across InfoQ

Cloudflare Introduces Workflows for Building Scalable Resilient Multi-Step Applications

Cloudflare Introduces Short-Lived SSH Access, Eliminating the Need for SSH Credentials

Microsoft Introduces Modern Web App Pattern for .NET: Accelerating App Modernization to the Cloud

Apache Tomcat 11.0 Delivers Support for Virtual Threads and Jakarta EE 11

AWS Lambda Introduces a Visual Studio Code-Based Editor with Advanced Features and AI Integration

Generally AI - Season 2 - Episode 5: Do Robots Dream of Electric Pianos?

Beyond the Breach: Proactive Defense in the Age of Advanced Threats

Steve Klabnik and Herb Sutter Talk about Rust and C++

Challenges and Lessons Porting Code from C to Rust

Grab Employs LLMs for Conversational Data Discovery with GPT-4, Glean and Slack

Cell-Based Architecture Adoption Guidelines

Software Architecture Tracks at QCon San Francisco 2024 – Navigating Current Challenges and Trends

Making Digital Accessibility More Than Just High Contrast: Building Truly Inclusive Software

What Developers Can Do to Continue to Program as They Age

How Rules Can Foster Creativity: The Design System of Reykjavík

Launching AI Agents Across Europe at Breakneck Speed With an Agent Computing Platform

OSI Releases New Definition for Open Source AI, Setting Standards for Transparency and Accessibility

Being a Responsible Developer in the Age of AI Hype

Optimizing Uber's Search Infrastructure: Upgrading to Apache Lucene 9.5

Improving the Efficiency of Goku Time-Series Database at Pinterest

Expedia Migrates a Massive Cassandra Cluster to ScyllaDB with Zero Downtime

QCon San Francisco

QCon London

InfoQ Dev Summit Boston

InfoQ Software Architects' Newsletter

Login with:

Don't have an InfoQ account?