How LinkedIn Identified a Kernel Lock Contention Issue Causing Recurring System Freezes

When LinkedIn engineers encountered short-lived, recurring outages in which the database powering their user feed became unavailable and then recovered without leaving helpful traces, they had to devise a novel approach to uncover the root cause using off-CPU profiling with eBPF.

As LinkedIn engineer Pratikmohan Srivastav explains, investigating those incidents was especially challenging because they were ephemeral, lasting only 10-15 seconds, and left no useful logs. They also recurred with no discernible pattern and no apparent external trigger.

A first clue emerged from correlating the incidents with system memory behavior: each event coincided with a momentary spike in memory allocation, after which the system quickly stabilized at a higher baseline. Further analysis ruled out other common causes, including CPU throttling, memory fragmentation and compaction, and file I/O.

Conventional monitoring and metrics thus provided no hints at the root cause of the issue, which prompted LinkedIn engineers to dig deeper into OS- and runtime-level behavior during the freezes. They turned to off-CPU profiling to understand which threads were blocked at the time.

Srivastav describes the approach in the first person: "Our solution was to build a trap. We wrote a monitoring script that would automatically capture an off-CPU profile the instant a freeze was detected."

The script, built on the eBPF toolkit BCC, continuously monitored database health and immediately triggered the BCC offcputime.py profiler to record kernel stack traces of blocked or sleeping threads for 15 seconds. This allowed LinkedIn engineers to capture an off-CPU profile during a live freeze.
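LinkedIn's actual script is not reproduced in this article, but the general shape of such a trap can be sketched in a few lines of Python. The sketch below is illustrative only: the install path of offcputime, the polling interval, and the function names (build_profiler_cmd, capture_profile, watch) are assumptions, and the health check is left as a caller-supplied function because the article does not say how database health was probed.

```python
import subprocess
import time
from typing import Callable, List, Optional

# Assumed install path for the BCC tool; it varies by distribution.
OFFCPUTIME = "/usr/share/bcc/tools/offcputime"


def build_profiler_cmd(pid: int, seconds: int = 15) -> List[str]:
    """Build an offcputime command line for a single process.

    -K keeps kernel stacks only, -f emits folded stacks (flame-graph
    friendly), -p selects the PID; the trailing positional argument is
    the capture duration in seconds.
    """
    return [OFFCPUTIME, "-K", "-f", "-p", str(pid), str(seconds)]


def capture_profile(pid: int, out_path: str, seconds: int = 15) -> None:
    """Run the profiler (requires root) and write its output to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(build_profiler_cmd(pid, seconds), stdout=out, check=True)


def watch(is_healthy: Callable[[], bool],
          on_freeze: Callable[[], None],
          poll_interval: float = 0.5,
          max_polls: Optional[int] = None) -> bool:
    """Poll a health check and fire on_freeze on the first failure.

    Returns True if a freeze was detected, False if max_polls health
    checks passed without one (max_polls=None polls forever).
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        if not is_healthy():
            on_freeze()  # start profiling while the freeze is still live
            return True
        polls += 1
        time.sleep(poll_interval)
    return False
```

In use, is_healthy might be a short-timeout query against the database and on_freeze a call to capture_profile with the database's PID. Both are hypothetical here. What matters is that the watcher is already running before the freeze starts, since a 10-15 second event is over before anyone can react by hand.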

This was the key breakthrough - these events were too brief for conventional monitoring to capture the underlying cause, so the only way to observe the root cause was to have profiling instrumentation already in place when the freeze began.

The root cause was traced to a single huge memory allocation of around 3.5 GB, which held the kernel's mmap_lock semaphore and effectively blocked all other threads in the process.

Any operation that modifies the process's virtual address space - such as a large mmap allocation - must hold this lock in write mode. While the write lock is held, all other threads that need any memory operation (including madvise for purging, and page fault handling for I/O) are blocked.

Further analysis revealed that the allocation was triggered by a Rust in-memory HashMap (pkey_vs_docref), which maps primary keys to internal document references. When it grew past 58,720,256 entries, it hit a resize threshold and doubled in size.
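The specific threshold is not arbitrary. The article does not spell out the arithmetic, but the figure is consistent with a table of 2^26 buckets that grows when it is 7/8 full, which matches how Rust's standard HashMap (backed by the hashbrown crate) is understood to size itself. The short Python calculation below works that out under those assumptions; the helper name is made up for illustration.

```python
def max_entries_before_resize(buckets: int) -> int:
    """Entries a table can hold before it must grow.

    Assumes a power-of-two bucket count and a 7/8 maximum load factor,
    as in hashbrown; these details are not stated in the article.
    """
    return buckets // 8 * 7


buckets = 2 ** 26                      # 67,108,864 buckets
threshold = max_entries_before_resize(buckets)
print(threshold)                       # 58720256, the figure cited above
print(buckets * 2)                     # 134217728 buckets after doubling
```

Under these assumptions, inserting one more entry past the threshold forces the map to allocate a new table of twice as many buckets and move every existing entry into it, which is why a single insert could trigger a multi-gigabyte allocation.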

Once the root cause was identified, LinkedIn engineers quickly resolved the issue by pre-allocating the HashMap, thus preventing the resizing during operation. This came at the cost of an additional ~3 GB of resident memory at startup, which proved to be an acceptable trade-off.

This incident highlighted several important lessons, Srivastav says: pre-allocating large data structures can help prevent sudden memory spikes in latency-sensitive paths; eBPF-based off-CPU profiling is a powerful tool for diagnosing “silent freezes” that leave little to no trace; and for ephemeral issues, automated instrumentation that activates on failure conditions can be essential for capturing meaningful diagnostics when the problem occurs.

About the Author

Sergio De Simone
