Architecting SQL Server on Linux: Slava Oks on Drawbridge, LibOS, & Addressing Between Windows/Linux

In this week’s podcast, Wesley Reisz talks to Slava Oks, who has worked at Microsoft for over 20 years on flagship products, including SQL Server. He also led the kernel team who worked on the Midori operating system. More recently, he has worked on bringing SQL Server to Linux.

Key Takeaways

  • Microsoft SQL Server runs on Linux through a containerised approach called Drawbridge
  • Drawbridge implements a Linux loader and a minimal set of ABI calls to allow an in-process NT user mode kernel to run
  • SQL Server runs on top of a SQL platform layer (called SQL OS) that could be ported to run on Drawbridge
  • SQL Server had supportability commands added to allow the state of the system to be measured with SQL calls
  • A number of efficiency gains were applied to both the Drawbridge components and the SQL Server code to bring performance to within 20% of the equivalent process running on Windows

Show Notes

  • 0m:40s - How did you get involved with the porting effort of SQL Server to Linux?
  • 0m:50s - Two years ago, a couple of former colleagues from the SQL Server team approached Slava.
  • 1m:10s - It was a daunting task; however, after understanding how it could be done, Slava decided to join the effort.
  • 1m:30s - Without the approach taken it would have been a tremendous task.
  • 2m:00s - Having worked on Midori previously, as well as Drawbridge, it seemed a good idea.

Drawbridge

  • 2m:30s - Drawbridge is a hardware abstraction layer which provides a compatibility story for old programs.
  • 2m:45s - When tasked with the Linux port, the reasoning was that if the Drawbridge layer were available on Linux, the port would be much easier.
  • 3m:10s - Since there was an existing prototype of the Drawbridge layer available on Linux already, running SQL Server inside Drawbridge on Linux was a sensible strategy.
  • 3m:30s - Drawbridge on Linux could run a simple Windows application, but didn’t have debug capabilities; in addition, it could crash.
  • 3m:50s - SQL Server helped identify issues in the Drawbridge layer for Linux, and bug fixing helped bring up the server.
  • 4m:25s - Drawbridge was a project from Microsoft Research to introduce security and high density containers in a cloud world.
  • 4m:50s - Using a cut-down kernel with the ring-0 content removed, the container could run unmodified Win32 binaries and give security and high density at the same time.
  • 5m:15s - Over time, Drawbridge became used as a platform abstraction layer in various projects including Midori.
  • 5m:50s - Generally VMs take a lot of time to slim down but Drawbridge provides a higher density solution using a containerised approach.
  • 6m:20s - High density means being able to run more applications on a given piece of hardware without consuming too many resources.

ABIs

  • 6m:20s - An ABI is an Application Binary Interface; one of the key pieces of research was to identify the ABIs needed to run a minimal kernel and application.
  • 7m:10s - Normal systems provide a couple of thousand ABI calls to provide operating system capabilities.
  • 7m:30s - The Drawbridge project was used to show it’s possible to run applications by cutting down that ABI surface to around fifty calls.

Security

  • 7m:45s - The limited number of ABIs improves security, because there’s a much smaller space to review.
  • 8m:15s - On Windows, Drawbridge consisted of three major pieces: LibOS (the NT user mode kernel); a driver providing the 50 system calls (the Platform Abstraction Layer); and the Drawbridge monitor, which was used to provide the resources and monitor the application.
  • 9m:00s - The driver would terminate the application in Drawbridge if calls were made outside of the standard ABIs.
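
The idea of a narrow ABI that terminates anything stepping outside it can be illustrated with a minimal sketch. This is not Drawbridge code: the call names, the `Monitor` class, and the `AbiViolation` exception are all hypothetical, chosen only to show why a fixed set of entry points is easier to review than several thousand.

```python
"""Sketch of a narrow ABI: every guest request goes through one dispatch point."""


class AbiViolation(Exception):
    """Raised when the guest asks for a call that is not part of the ABI."""


class Monitor:
    def __init__(self, abi):
        # The ABI is exactly the set of names registered here (hypothetical names).
        self._abi = dict(abi)
        self.terminated = False

    def dispatch(self, name, *args):
        if self.terminated:
            raise AbiViolation("guest has already been terminated")
        handler = self._abi.get(name)
        if handler is None:
            # A call outside the ABI ends the guest instead of being forwarded.
            self.terminated = True
            raise AbiViolation(f"call {name!r} is not part of the ABI")
        return handler(*args)


# A two-call ABI for illustration; the notes mention roughly fifty calls.
monitor = Monitor({
    "vm_alloc": lambda size: bytearray(size),
    "stream_write": lambda data: len(data),
})
buffer = monitor.dispatch("vm_alloc", 16)
written = monitor.dispatch("stream_write", b"hello")
```

Because every request passes through `dispatch`, the surface to audit is the dictionary handed to `Monitor`, which is the property the security notes attribute to the limited ABI.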

LibOS

  • 10m:10s - LibOS consists of two major pieces.
  • 10m:15s - The upper part is the implementation of the NT user mode kernel with the several thousand APIs that are used by Win32 programs.
  • 10m:40s - The second part is a Drawbridge runtime library which supports the upper part.
  • 11m:00s - The user mode NT kernel has ring-0 functionality cut out, like page table management, interrupt management, scheduling management, and so on.

Linux

  • 11m:20s - The architecture for Linux is different; on Windows, it was about a high-density (and secure) way of running applications, but Linux is about a compatibility layer.
  • 11m:35s - On Linux there is no separate driver or monitor; instead, that functionality is loaded into the Linux process when it starts.
  • 12m:20s - When running on Linux, the binary starts up a thread and loads the next part of the program.
  • 12m:45s - On Linux, there is a SQL platform abstraction layer, which consists of the LibOS and other components.
  • 13m:30s - Everything happens in a single process space which boots the abstraction layer and that then runs the SQL Server program.
  • 14m:30s - The SQL platform abstraction layer also exposes the SQL Operating System platform APIs that are used by the SQL Server program.

SQL OS

  • 14m:50s - SQL OS was put together for SQL Server 2005 to abstract memory management, scheduling, synchronisation primitives, and I/O.
  • 15m:05s - It allows SQL Server developers to not worry about how to manage memory or create threads.
  • 15m:30s - SQL Server on Linux builds on LibOS which provides the semantics of the NT kernel.
  • 15m:50s - SQL Server also uses the SQL OS to provide an abstraction layer.
  • 16m:00s - The Drawbridge runtime library wasn’t built for highly scalable and highly performant applications like SQL Server.
  • 16m:10s - Although appropriate for the kind of applications it was originally built for, it wasn’t suitable for SQL Server.
  • 16m:20s - This required either upgrading the Drawbridge layer to deal with high scalability and performance, or creating the SQL Server PAL; the latter was chosen in December 2015 as the way forward.

Supportability

  • 17m:30s - Supporting a high-performance database is a big task, so the team worked with the support engineers to make it supportable without needing to attach a debugger.
  • 17m:55s - The dynamic management views (DMVs) provide a view of what’s happening in the SQL Server in a SQL style language.

Some assembly required

  • 19m:30s - Everything runs on Linux in user mode. 
  • 19m:00s - The initial binary loads a standard Linux ELF binary, but when the initial loader runs it loads the standard Windows binaries using its own loader mechanism.
  • 19m:20s - Some assembly code was required in order to interact with the ELF and Linux subsystems.
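
To make the distinction between the two binary formats concrete, here is a small sketch that tells a Linux ELF image from a Windows PE image by the signature bytes each format defines. It is only an illustration of the formats, not the loader described in the podcast; the function names are hypothetical.

```python
import struct

ELF_MAGIC = b"\x7fELF"        # first four bytes of an ELF file
DOS_MAGIC = b"MZ"             # first two bytes of the DOS header that begins a PE file
PE_SIGNATURE = b"PE\x00\x00"  # signature found at the offset stored at 0x3C
PE_OFFSET_FIELD = 0x3C


def classify_image(data: bytes) -> str:
    """Return 'elf', 'pe', or 'unknown' based on the signature bytes."""
    if data[:4] == ELF_MAGIC:
        return "elf"
    if data[:2] == DOS_MAGIC and len(data) >= PE_OFFSET_FIELD + 4:
        # The DOS header stores a little-endian offset to the PE signature.
        (pe_offset,) = struct.unpack_from("<I", data, PE_OFFSET_FIELD)
        if data[pe_offset:pe_offset + 4] == PE_SIGNATURE:
            return "pe"
    return "unknown"


def fake_pe_header() -> bytes:
    """Build the smallest byte string that passes the PE check (for the demo only)."""
    header = bytearray(0x44)
    header[0:2] = DOS_MAGIC
    struct.pack_into("<I", header, PE_OFFSET_FIELD, 0x40)
    header[0x40:0x44] = PE_SIGNATURE
    return bytes(header)


elf_kind = classify_image(b"\x7fELF\x02\x01\x01\x00")
pe_kind = classify_image(fake_pe_header())
```

A custom loader for Windows binaries has to go well beyond this check (mapping sections, applying relocations, resolving imports), which is the part the host's own ELF loader cannot do for it.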

Addressing

  • 20m:30s - Once the Linux loader has booted and initialized, it should not do any further memory allocation.
  • 21m:00s - After the platform abstraction layer is running, a chunk of memory is put aside for the Windows programs.
  • 21m:20s - A special memory page is created and put aside for communicating the NUMA and CPU information to the SQL platform abstraction layer.
  • 21m:40s - The SQL platform abstraction layer uses its own memory manager and uses the information provided by the loader to know what memory to use.
  • 22m:00s - The application uses the memory and calls APIs through the PAL that ultimately end up being Linux mmap calls.
  • 22m:30s - Since everything runs in a single Linux process, it’s possible for a memory issue to read outside of the PAL allocated memory and cause corruptions.
  • 22m:35s - The memory problems are mitigated by the fact that the PAL layer’s memory and the embedded process memory are separated, so that reading out-of-bounds will trigger a violation.
  • 22m:45s - Debugging memory issues is difficult enough when they are same-world issues; for cross-world issues, it’s even more difficult.
  • 23m:00s - Debugging tools on Linux can be used to verify that the worlds don’t collide, for example by putting red pages in place and running with randomised memory.
  • 23m:30s - Valgrind is used to make sure that there aren’t any obvious errors.
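
The addressing scheme in these notes (one up-front reservation, an information page for CPU and NUMA details, and an upper layer that manages the rest itself) can be sketched as follows. This is an assumption-laden illustration: the `ReservedRegion` class, the information page layout, and the bump allocator are hypothetical, and a Python bounds check stands in for the hardware page protection a real implementation would use.

```python
import mmap
import struct

PAGE = mmap.PAGESIZE
INFO_FORMAT = "<II"  # hypothetical layout: logical CPU count, NUMA node count


class ReservedRegion:
    """One anonymous mapping made at start-up, carved up without further OS calls."""

    def __init__(self, size, cpu_count, numa_nodes):
        if size < 2 * PAGE or size % PAGE:
            raise ValueError("size must be a multiple of the page size, at least two pages")
        self._map = mmap.mmap(-1, size)  # the only request made to the OS
        # Page 0 is the information page read by the upper layer.
        struct.pack_into(INFO_FORMAT, self._map, 0, cpu_count, numa_nodes)
        self._next = PAGE  # application memory starts after the information page
        self._end = size

    def read_info(self):
        return struct.unpack_from(INFO_FORMAT, self._map, 0)

    def allocate(self, nbytes):
        """Bump-allocate a page-aligned block and return its offset."""
        rounded = -(-nbytes // PAGE) * PAGE  # round up to whole pages
        if nbytes <= 0 or self._next + rounded > self._end:
            raise MemoryError("request does not fit in the reserved region")
        start = self._next
        self._next += rounded
        return start

    def _check(self, offset, length):
        # Stand-in for a protection fault: the information page and
        # unallocated space are off limits to the application.
        if offset < PAGE or length < 0 or offset + length > self._next:
            raise ValueError("access outside allocated application memory")

    def write(self, offset, data):
        self._check(offset, len(data))
        self._map[offset:offset + len(data)] = data

    def read(self, offset, length):
        self._check(offset, length)
        return self._map[offset:offset + length]


region = ReservedRegion(4 * PAGE, cpu_count=16, numa_nodes=2)
first = region.allocate(100)
region.write(first, b"hello")
```

Keeping the information page and the application blocks in separate, page-aligned ranges is what lets an out-of-bounds access be detected rather than silently corrupting the other world's data.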

Performance

  • 23m:50s - Performance and scalability are some of the keys of a database engine such as SQL Server.
  • 24m:10s - The multi-layer approach introduces the potential for problems: SQL Server runs on top of SQL OS, which runs on top of the SQL PAL, which runs on top of the NT kernel, which runs on top of the Drawbridge runtime, which runs on top of the Linux kernel, which may itself be running on top of a hypervisor.
  • 24m:25s - There is the potential for some redundancy between these layers.
  • 24m:40s - There were opportunities to explore the SQL Server interactions to optimise parts of the application layers based on how they work.
  • 25m:05s - The optimisation was more about reducing inefficiencies in the SQL OS (SOS) layer than about the combination of the layers.
  • 25m:30s - Drawbridge had lots of global state and contention, as well as code path length issues inside the LibOS all the way down to the driver, which consumed a large number of instructions.
  • 26m:00s - Drawbridge relied on Windows to do the memory abstraction; once a page came back, global state was kept for every address region, so any allocation from inside the same picoprocess would take a global lock.
  • 26m:55s - Contending on a global lock would be inefficient for large memory spaces, so this was removed to support higher throughput.
  • 27m:20s - Every dynamic allocation, or an I/O, came from a global heap with a global lock, which resulted in serious inefficiencies.
  • 27m:45s - Reducing memory allocations also improved performance considerably in the SQL Server layer.
  • 28m:10s - Running on 16 logical processors with 128 GB of memory brought the performance to within 20% of running on an equivalent Windows runtime.
  • 29m:00s - Optimising the I/O call paths reduced the number of instructions executed, which gave a significant performance boost.
  • 29m:30s - There are further inefficiencies with the networking stack that can be addressed in the near future.
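
The contention problem described in these notes, where every allocation funnels through one global lock, is commonly addressed by splitting the shared state so that unrelated callers take different locks. The sketch below shows that general technique (lock striping); it is not the change made in Drawbridge or SQL Server, and the class and function names are hypothetical. In CPython the interpreter lock hides any speed-up, so the example shows the structure rather than the performance gain.

```python
import threading


class StripedAllocationTable:
    """Per-region bookkeeping guarded by one lock per stripe, not one global lock."""

    def __init__(self, stripes=16):
        self._locks = [threading.Lock() for _ in range(stripes)]
        self._bytes = [0] * stripes

    def record_allocation(self, region_id, nbytes):
        # Callers working on different stripes never wait on each other.
        index = region_id % len(self._locks)
        with self._locks[index]:
            self._bytes[index] += nbytes

    def total(self):
        # Locks are taken one stripe at a time, so this is not an atomic
        # snapshot while writers are still running.
        total = 0
        for index, lock in enumerate(self._locks):
            with lock:
                total += self._bytes[index]
        return total


def worker(table, region_id, count, nbytes):
    for _ in range(count):
        table.record_allocation(region_id, nbytes)


table = StripedAllocationTable(stripes=8)
threads = [
    threading.Thread(target=worker, args=(table, region_id, 1000, 64))
    for region_id in range(8)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

With eight workers each mapped to its own stripe, no two threads contend on the same lock, whereas a single global lock would serialise all 8,000 updates.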
