Reconciling Performance and Security in High Load Environments

Summary

Ignat Korchagin explores how to drive security in a high performance environment and make it a welcome and natural part of the product lifecycle.

Bio

Ignat Korchagin is a systems engineer at Cloudflare working mostly on platform and hardware security. His interests are cryptography, hacking, and low-level programming. Before Cloudflare, he worked as a senior security engineer for Samsung Electronics’ Mobile Communications Division. His solutions may be found in many older Samsung smart phones and tablets.

About the conference

Software is changing the world. QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference, QCon is designed for technical team leads, architects, engineering directors, and project managers who influence innovation in their teams.

Transcript

Korchagin: My name is Ignat. I work for Cloudflare. We're going to talk about performance and security. I do performance and security at Cloudflare. I'm also very passionate about cryptography. I enjoy low level programming, like Linux kernel, bootloaders, and other low level scary, unsafe things.

Performance vs. Security

Performance or security, what is more important? How many of you have found yourselves in this scenario? Who here are security engineers? The rest are performance. On the one hand, we have the rest of the organization. Usually, the organization drives for performance, because you just get better returns if you have better performance. Then there is always this pesky security team which is like, "You need to secure systems. You need to add this and that." Sometimes that adds overhead. We're continuously in this war, asking what is more important. There are organizations which try to balance that somehow. They recognize there should be some trade-off: we should add some security, but really care about performance as well.

Definition of Performance

People miss that sometimes you can have both, and one helps the other. We'll explore which security improvements might also give you performance. When we talk about performance, most people think of performance in the narrow sense, which is usually speed, throughput, or latency. Sometimes you need to consider performance in the broader sense, which includes all of the above, but also resource optimization, process optimization, or something else. Performance improvement basically means you get more for the same cost, or you get the same for less. It doesn't matter which.

Zero-Cost Security

Let's start simple. We're not going to jump straight to security things which improve performance. First of all, there are many security things which have zero performance overhead. I call it zero-cost security. I'll walk through the examples we encountered at Cloudflare. There are a lot of other ones, and a lot of potential elsewhere. You might find these examples useful and implement them at your company. What does zero-cost security mean? There are several things which fall into this category. First of all, it's when the actual security cost is negligible, or it affects something but it's a non-primary metric. Also, security costs may be hidden or amortized by the architecture or the implementation. Finally, the security cost may not be incurred by normal system behavior, so-called prohibitive security.

Negligible Security Cost: Secure Boot Chain

Let's see an example of negligible security cost. Cloudflare operates a large network. Due to the nature of our business, we need to run on real hardware. We can't use any cloud providers. One of the things we've implemented is secure boot. We have a lot of servers throughout the world. They boot. We need to ensure that boot is secure. Does anyone here not know how secure boot works? No? Secure boot is a simple concept. It usually starts with some trust anchor on the server, the system firmware, which is sometimes called the BIOS. When you enable the feature, the BIOS verifies your primary bootloader before executing it. Once it's verified, you trust that bootloader. Then you trust that bootloader to verify your operating system kernel. Now you have extended the chain of trust to the operating system kernel. Then, because you trust your operating system kernel, you trust it to verify its drivers and applications. In the end, you build yourself a trust chain, or a secure boot chain. The result is that all running software on your systems, even if they're remote, is signed and authorized by you.
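To make the chain of trust concrete, here is a minimal Python sketch. It is a toy model, not how any real firmware works: real secure boot verifies public-key signatures, whereas this sketch assumes each stage simply embeds the SHA-256 digest of the one stage it is allowed to launch. The names `Stage`, `build_chain`, and `boot` are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    code: bytes
    trusted_next: str = ""  # digest of the stage this one is allowed to launch

    def digest(self):
        # The trusted digest is part of the measured image, so an attacker
        # cannot swap it out without changing this stage's own digest.
        return hashlib.sha256(self.code + self.trusted_next.encode()).hexdigest()


def build_chain(names_and_code):
    """Build stages back to front so each can embed its successor's digest."""
    stages = []
    trusted_next = ""
    for name, code in reversed(names_and_code):
        stage = Stage(name, code, trusted_next)
        stages.append(stage)
        trusted_next = stage.digest()
    stages.reverse()
    # trusted_next is now the digest of the first stage: the firmware's anchor.
    return stages, trusted_next


def boot(firmware_anchor, stages):
    """Run stages in order, verifying each one before handing over control."""
    expected = firmware_anchor
    booted = []
    for stage in stages:
        if stage.digest() != expected:
            raise RuntimeError(f"refusing to run {stage.name}: verification failed")
        booted.append(stage.name)
        expected = stage.trusted_next
    return booted
```

All verification happens once, while walking the chain at boot, which is why the running system pays nothing for it afterwards.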

Secure boot has several advantages. It ensures that all running code on your system is authorized by the system owner, by you. It's the most efficient protection from persistent malware. If your system gets compromised, if somebody finds and exploits a vulnerability in your system, the first thing the attacker would do is to ensure they're persistent, so that when you reboot the system, they can regain control. Secure boot is one of the most efficient features to take that possibility away from attackers. It also brings some non-security improvements. When you can only run authorized and signed software on your system, it enforces your operational procedures. This is a performance improvement in the broader sense, because now your operations team always has to go through the version control system to make any changes to the system. They can't do one-offs. Sometimes when operations people debug an issue, they fix it locally and forget to go through the whole process and make sure that fix is replicated to your network. Then your fix is lost. Also, your systems run only what's needed. Again, your operations staff cannot run their own one-off software on the system. As for the negligible security cost: secure boot does not add overhead to the running system, because all images and signatures are checked at boot time. It affects system boot time only, which is a non-primary metric. Even then, for us, it takes modern servers actual minutes to boot up, get all the config, and start serving production traffic. Signature checking adds less than a millisecond to the boot time. It's negligible.

Amortized Security Cost: Data Encryption at Rest

The second category is amortized security cost. Let's talk about data encryption at rest. Who here doesn't encrypt data at rest? Good. You don't? Why not? It's not important. Let's look at how this data is stored and review the software stack. You have your applications or services. They usually write some files. Filesystems care about files. The filesystem translates those files into data blocks and sends them to your operating system's block subsystem. Then the blocks get stored in the hardware. If you want to encrypt data at rest, you can do it at any of these levels. You can just buy self-encrypting disks, which is quite easy. You can use your operating system's full disk encryption. In Linux, you can encrypt data directly in the filesystem. Or of course, you can add code to your application which encrypts data before writing it down. At Cloudflare, like many other SaaS businesses, we prefer operating system full disk encryption for several reasons. Mostly, because it's very easy to set up and needs no configuration. It's fully transparent to applications. Applications do not even know that they're writing encrypted data. Data gets encrypted automatically. We don't roll our own crypto. If instead we do it in the application layer, and we go to each developer asking them to add some cryptography to their software, it usually goes wrong. Unlike the hardware layer, it's open source. We can see and audit the code and make sure it works properly. Even Microsoft these days prefers software disk encryption over hardware disk encryption because of the latest vulnerabilities found in hardware disk encryption.

What Is a CDN?

Disk encryption is a typical example of amortized security cost. If you follow the traditional model of a service, so if you have a server and you have many users across the internet, the disk encryption does add the overhead. That overhead is usually amortized by the network latency between your users and the server. Users most likely will not notice that overhead. It's a bit different for us. One of the main Cloudflare offerings is a CDN. This is where we try to bring our servers as close as possible to users. We're very close actually. We have a lot of data centers across the world, more than 200 now. Each major city with internet usually has our data centers. This is where the overhead of disk encryption starts to get noticed by the users.

Average CDN Cache Response Tail Latency

At some point, it was so noticeable that our users started complaining, and we tried to look into it. We found that disk encryption does create visible overhead. This is our average CDN cache response tail latency. We did an A/B test, and you can see that the latency from an encrypted server spikes much more here. It clearly stands out. We tried to look into this problem. We came up with a patch to the Linux kernel, which improves disk encryption performance. When we developed the patch, we expected to improve performance, but we still expected to see some overhead, because you would expect encryption to cost something compared to no encryption. In the end, the result was not what we expected. The blue line here is our patched encrypted implementation. It adds no visible overhead to our CDN cache tail latency.

Disk Encryption Overhead

We expected a lower disk encryption overhead. We even discussed what overhead would be acceptable to balance performance and security. We actually got none. Our patch doesn't change any crypto algorithm or format. It just changes the architecture of the Linux kernel disk encryption implementation. We deployed it immediately. Zero overhead data encryption is a no-brainer. What we learned is that if we don't see any overhead from adding disk encryption, then encryption is not a bottleneck anymore. Probably, there is something else in our systems which is bottlenecking our tail latency now. This security improvement not only added security, it also encouraged us to review our CDN architecture to seek out these other bottlenecks. We want to get to the point where we see some overhead from encryption.

Prohibitive Security: Syscalls and Seccomp

Who here doesn't know what a system call is? As usual, you have your applications and services. They don't run standalone. They run on some operating system. We run on Linux. Applications cannot do stuff on their own. They need the operating system to do stuff on their behalf, like writing files and sending data over the network. The operating system has a well-defined interface for applications to do that. This well-defined interface is called system calls. Applications just call that interface into the operating system, for example to read files or send data over the network. Linux has this nice thing called seccomp. Who hasn't heard about seccomp? Seccomp is the sandboxing tool in Linux. What it allows you to do is, when your application starts, before it does anything useful, the application can declare its intent to the kernel. Seccomp is a contract, where an application can say, I will definitely use one set of system calls. I will never use the other set of system calls.

Why is it useful? Let's check an example. It's a toy example. In real life, it is much more complex. Imagine you're writing a simple clock application. Seccomp is a Linux feature so your clock application will run on Linux. Before you even do something useful, when your application starts, you tell the Linux kernel, "I'm a clock app. Because I'm a clock app, I don't need anything from you except the time. I will only ask the time from the kernel." When your application executes, at some point, it asks the kernel for the time. The kernel responds properly. However, imagine you wrote your application in unsafe C or assembly, and your application got hacked. Then the attacker tries to extract some private data and send it to their host somewhere on the network. They use some code execution and try to make your application send data over the network. Linux now sees you broke your promise. You broke your oath, so you're dead: the kernel kills the application. Seccomp greatly limits the potential damage of remote code execution exploits. It has zero-cost overhead. Of course, if your application dies, that's an infinite performance drop because your application doesn't run. That only happens if your application misbehaves. If your application behaves as intended, you get no security overhead. It also improves development velocity. When you declare your intent to the kernel, you declare what the application is intended to do. Software runs how it was programmed to run, not how it was intended to run. You always have bugs. Then, if your application suddenly crashes, it is probably behaving in a way you did not intend. Seccomp sometimes just helps to find bugs in software. All those things are examples of zero-cost security.
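The clock app contract can be sketched as a small simulation. This is not real seccomp: a real filter is installed in the Linux kernel (for example through `prctl` or libseccomp) and would typically kill the process. The `ToyKernel` class and its fake system calls below are hypothetical, and exist only to show the allowlist idea in plain Python.

```python
import time


class ContractViolation(Exception):
    """Raised when the app asks for something it never declared."""


class ToyKernel:
    """Hypothetical stand-in for the kernel side of a seccomp-style contract."""

    def __init__(self):
        # Fake system calls for the demo; the names only echo real ones.
        self._handlers = {
            "clock_gettime": time.time,
            "sendto": lambda: "data sent over the network",
        }
        self._allowed = None  # None means no contract has been declared yet

    def declare(self, allowed_calls):
        """Install the allowlist. Later declarations can only narrow it."""
        requested = frozenset(allowed_calls)
        if self._allowed is None:
            self._allowed = requested
        else:
            self._allowed = self._allowed & requested

    def syscall(self, name):
        if self._allowed is not None and name not in self._allowed:
            # Real seccomp would typically kill the process at this point.
            raise ContractViolation(f"{name} was never declared")
        return self._handlers[name]()
```

A clock app would call `declare({"clock_gettime"})` once at startup. Asking for the time keeps working at full speed, while an injected `syscall("sendto")` raises `ContractViolation`, and a second `declare` cannot widen the contract again.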

HTTP/2 and HTTP/3

There are security features which actually improve performance. Here, we'll probably focus more on performance in the narrow sense. Who's heard about HTTP/2? The protocol was adopted as an internet standard in 2015. It's a major rewrite of HTTP/1, which is almost 30 years old. We will celebrate 30 years next year. Unlike HTTP/1, which is a text-based protocol, it's a binary protocol. It does connection multiplexing. In HTTP/1, you need a separate underlying layer 4 connection for each HTTP connection. HTTP/2 can reuse one layer 4 connection for many HTTP connections. It also has a nice feature called server push. HTTP/1 follows a request-response pattern only. In HTTP/2, a server can proactively push resources which it's sure the client will request anyway. There is also HTTP/3. It's still not standardized. It's another improvement. Where HTTP/2 still uses TCP as its transport layer, HTTP/3 uses UDP for various reasons.
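As a rough illustration of multiplexing, the sketch below interleaves several responses onto one connection as frames tagged with a stream id, then reassembles them on the other side. It is a simplification under stated assumptions: real HTTP/2 frames carry binary headers, flow control, and priorities, and the function names and tiny frame size here are hypothetical. The only detail borrowed from the protocol is that client-initiated streams use odd ids.

```python
from itertools import zip_longest


def multiplex(responses, frame_size=4):
    """Interleave several response bodies as (stream_id, chunk) frames."""
    per_stream = []
    for index, body in enumerate(responses):
        stream_id = 2 * index + 1  # client-initiated streams use odd ids
        chunks = [body[i:i + frame_size] for i in range(0, len(body), frame_size)]
        per_stream.append([(stream_id, chunk) for chunk in chunks])

    frames = []
    # Round-robin: take one frame from every stream that still has data.
    for round_of_frames in zip_longest(*per_stream):
        frames.extend(frame for frame in round_of_frames if frame is not None)
    return frames


def demultiplex(frames):
    """Reassemble each stream's body from the interleaved frames."""
    bodies = {}
    for stream_id, chunk in frames:
        bodies[stream_id] = bodies.get(stream_id, b"") + chunk
    return bodies
```

With two bodies of different lengths, the short response finishes early instead of waiting behind the long one, which is the benefit of sharing one layer 4 connection.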

HTTP/2 Performance

When we originally deployed our first version of HTTP/2, we decided to measure performance. These numbers are from 2015, from our first implementation of HTTP/2. Even then, our company's homepage received a 100% performance boost: average page load time decreased by half when loaded over HTTP/2. There is another interesting demo which visualizes the performance HTTP/2 might bring to your service. You see a side by side comparison of how a big resource loads over HTTP/1 versus HTTP/2. Why not have it? There is one thing: HTTP/2 is available only via TLS. Although the standard itself does describe plaintext HTTP/2, no mainstream browsers implement plaintext HTTP/2. If you want HTTP/2, you need to enable TLS. Of course, you can write your own implementation. Then you consider the balance here: the ease of enabling TLS on your service versus re-implementing the whole layer 7 protocol. It's a no-brainer.

Secure Sockets Layer (SSL) / Transport Layer Security (TLS): RSA vs. ECC

Speaking about TLS, who here doesn't know how TLS works? TLS is a security protocol over layer 4. This is a typical TLS handshake. Here, it's not that easy. The blue part is the layer 4 TCP part. In TCP, you have the three-way handshake: SYN, SYN/ACK, and ACK. Then comes the TLS handshake. Why do we need the TLS handshake? TLS has been around for quite a while. It allows for different options and cryptographic algorithms to be used. That's why you need the handshake: the client and the server negotiate which specific crypto they'll be using for that session. It happens at this stage, in the client hello and server hello messages. What you negotiate in TLS is the cipher suite. It's a combination of algorithms: client and server agree which key agreement algorithm to use, which digital signature algorithm to use, and which encryption algorithm to use.

The interesting part here is digital signatures. There are only two options nowadays. You can use either RSA signatures or elliptic curve signatures. RSA is the older cryptosystem. It was originally published in 1977. Its security is based on the hard problem of factoring large numbers. However, sub-exponential complexity cracking algorithms exist for RSA now. Sub-exponential means they're not efficient yet, but they're very close to being efficient. To make these algorithms unusable in practice, modern RSA has a very large key size. That's why you should never use RSA with keys smaller than 2000 bits. In contrast, ECC is called the newer cryptosystem. Newer means 1985. It's based on a different hard problem, the discrete logarithm problem over elliptic curves. Because it's a different problem, currently no sub-exponential cracking algorithms exist for ECC. That's why you have a smaller key size. Today, people use roughly 256-bit keys.

Let's compare the performance of these. I ran it basically yesterday on one of our servers. Here are the numbers: how many signatures per second each algorithm can do. We can see that ECC signing is 15 times faster. Signing is what you care about on your server because your server creates signatures. In the end, if you switch from RSA to ECC, you get faster TLS handshakes. You get less CPU utilization. You need less key storage, because keys are much smaller. As a side effect, you get better security, because ECC is just inherently more secure.

More on CPU utilization. We did this measurement in 2017. We use Google's BoringSSL as our TLS implementation. In one of our data centers we measured which crypto algorithms are used during the TLS handshake and how much CPU was spent on them. The interesting part is here. During our measurement, three-quarters of TLS connections used elliptic curve cryptography, and only half a quarter used RSA. However, we also measured all the CPU time we spent in the BoringSSL library, the crypto library which implements all of this. Of all the time our CPU is busy executing BoringSSL code, almost 50% is spent doing RSA. Only 8% is spent doing elliptic curve cryptography. So 8% of CPU time serves three-quarters of signatures, whereas 50% of your CPU time in crypto serves only the small share that is RSA. Switch to ECC, and you get an instant performance boost and less CPU utilization. Most people use RSA because they're afraid of compatibility issues, that some clients might not support elliptic curve cryptography. Actually, most modern TLS server implementations allow you to provide a fallback. It's the server that decides which algorithm to use, so if the server detects a client which supports elliptic curve cryptography, it will use elliptic curve cryptography. If the client does not advertise support for elliptic curve cryptography, the TLS server can fall back to RSA. Compatibility is not the issue here.
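The fallback logic can be sketched in a few lines. This is a hedged illustration only, not the API of BoringSSL or any other TLS library: the function `choose_certificate` and the labels `"ecdsa"` and `"rsa"` are hypothetical, and a real server compares the signature algorithms advertised in the client hello against the certificates it has loaded.

```python
def choose_certificate(client_signature_algorithms, server_certificates):
    """Return the server's most preferred certificate type the client supports.

    server_certificates is ordered by server preference, for example
    ["ecdsa", "rsa"] to prefer the cheaper elliptic curve signature.
    """
    for cert_type in server_certificates:
        if cert_type in client_signature_algorithms:
            return cert_type
    raise ValueError("no certificate type in common with the client")
```

With `["ecdsa", "rsa"]` configured on the server, a modern client advertising both gets `"ecdsa"`, while a legacy client advertising only `"rsa"` still connects through the RSA fallback.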

The Internet: Network of Networks

Let's go from a small server to the internet. Who knows how the internet works? Who doesn't know how the internet works? This is the internet. It's a little bit bigger than that. The point here is that the internet has no centralized system. It's very decentralized. There is no Skynet you can blow up to get humanity back. It's very distributed. The internet is basically a network of networks, which just interact with each other.

The Internet: Autonomous Systems (AS) and Border Gateway Protocol (BGP)

In internet terminology, these separate networks are called autonomous systems. They communicate via a protocol called BGP, the Border Gateway Protocol. Why do they need to communicate? For your packet to travel from point A to point B, it needs to know the path. Intermediate nodes on the internet need to know who has specific resources, or basically, IP addresses.

They do it via announcements. Different autonomous systems announce which IP addresses they have. We have one autonomous system, which is basically Cloudflare because we now own this nice IP, and it announces that it has 1.1.1.1. There is another autonomous system which announces it has 8.8.8.8, so it's Google. You don't only announce the IP addresses you own, you also announce your neighbors' IP addresses, because you might be an intermediate layer which packets traverse.

The Internet: Packet Switching

The point is, this whole thing is not static. The configuration of the internet changes continuously, so these announcements come and go. In the end, you get what's called packet switching: to get from A to B, your data doesn't always traverse the same path, because these announcements continuously change. Packets may reach your destination through different paths.

The Internet: BGP Security

One of the major drawbacks is that BGP is quite an old protocol that originally didn't have any security. All these announcements are based on a gentleman's agreement. Your neighboring networks expect you to announce what you own. They don't expect you to announce what you don't own. Of course, this leads to bad actors appearing in the system and trying to announce addresses they don't own. It doesn't always create complete network failure, because packets traverse different paths, so some packets may go one way and some packets may go another. As far as a given connection is concerned, you just get some packet loss. It's not critical, but the network throughput is bad.

The Internet: BGP with Resource Public Key Infrastructure (RPKI)

Internet people came up with RPKI, which stands for Resource Public Key Infrastructure. What it basically means is that, to combat this scenario, you now sign your announcements. You provide a clear cryptographic proof that you own this IP address. Your neighbors are expected to check this cryptographic proof before they accept your announcement and consider it valid. This is a nice feature. It allows you to remove these bad actors from the network. RPKI prevents bad actors from claiming resources they don't own, but these false claimers are not always bad actors. Sometimes, you announce invalid routes because you have bugs in network equipment and software. All this network equipment is configured by real people, and people make mistakes. There are network equipment misconfigurations. In the end, when you deploy RPKI for security, you improve network throughput, because you prevent these misconfigurations from having any effect on the internet. Thus, you make things faster. Some misconfigurations cause severe outages. One example we experienced at Cloudflare is described in this blog post, when a small U.S. provider called Verizon announced incorrect routes and shut off half of the internet. Minor misconfigurations take place all the time, and you get some packet loss here and there. It's usually not very impactful. It just decreases performance. With RPKI in place, these misconfigurations either do not happen, or the resulting announcements are simply not accepted as valid. Security improves performance.
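The check a neighbor performs can be sketched as a simplified model of route origin validation. The assumptions are loud ones: the cryptographic verification of the signed records is skipped entirely, the records are modeled as plain `Roa` objects (a hypothetical class), and the prefixes and AS numbers are illustrative. Only the matching logic is shown: an announcement is valid when a covering record names the same origin and allows that prefix length.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Roa:
    """A (hypothetical, already verified) record of who may originate a prefix."""
    prefix: str
    max_length: int
    origin_asn: int


def validate(announced_prefix, origin_asn, roas):
    """Classify an announcement as 'valid', 'invalid', or 'not-found'."""
    announced = ipaddress.ip_network(announced_prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        # Skip records for the other IP version or for unrelated address space.
        if announced.version != roa_net.version or not announced.subnet_of(roa_net):
            continue
        covered = True
        if origin_asn == roa.origin_asn and announced.prefixlen <= roa.max_length:
            return "valid"
    # Covered by some record but matching none: wrong origin or too specific.
    return "invalid" if covered else "not-found"
```

Given a record `Roa("1.1.1.0/24", 24, 13335)`, the owner's announcement comes back `"valid"`, while the same prefix announced from another AS number, whether by a bad actor or by a misconfigured router, comes back `"invalid"` and can be dropped.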

Cloudflare Network

Let's talk about performance in the broader sense. If you can't find a performance improvement in the narrow sense, sometimes you can get one in the broader sense. This is Cloudflare's network. We have a lot of these data centers, more than 200. They're all physical, real hardware. Most of these locations are colocation, so we own the hardware, but we don't own the data center. Many Cloudflare engineers have never actually visited them. There are data centers we have never visited. We just send the hardware there. It gets connected, and we start using it remotely.

Data Center Provisioning

We have to provision new data centers a lot. It's a very time consuming operation. First, we need to connect the hardware. That's done for us by the data center staff. Then when we take over, we need to go through this very intimate process called verifying hardware. Unlike with cloud providers such as AWS, where you get a nice interface, we have nothing. We just have hardware somewhere. We need to set up initial network connectivity. Then we need to manually configure the out-of-band interface, which, if you deal with hardware, is sometimes called the BMC. These usually come from the vendors insecure, so you need to secure the out-of-band interface by manually going in and clicking buttons. Then when we have a secure out-of-band interface, we need to dump all the serial numbers from the remote hardware. Then we need to cross check the serial numbers with our inventory system to make sure we're dealing with the hardware we expect. There is no missing hardware. We received all that we wanted. It's in the right place at the right time. Then when we're done with that, we do initial key provisioning. To efficiently manage servers remotely, you need SSH or some configuration management system. That usually involves public key cryptography. Each server should have a unique public and private key pair. You need to push those keys onto the server and connect them to a configuration management system. At Cloudflare we use SaltStack. Then when you connect to that server, you need to verify key fingerprints over the secure out-of-band channel to ensure it's all secure. You have to do it for each server in the data center. Some data centers are small. Some data centers are large. By the end, your operations team goes crazy.
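The serial number cross check is a good candidate for a first piece of automation. A minimal sketch, assuming both the inventory system and the remote hardware dump can be reduced to plain lists of serial number strings (the function name and result keys are hypothetical):

```python
def cross_check(expected_serials, reported_serials):
    """Compare the inventory's serial numbers with what the hardware reports."""
    expected = set(expected_serials)
    reported = set(reported_serials)
    return {
        "matched": sorted(expected & reported),
        "missing": sorted(expected - reported),      # shipped but not found
        "unexpected": sorted(reported - expected),   # found but never shipped
    }
```

An empty `missing` list and an empty `unexpected` list mean the data center contains exactly the hardware that was sent there.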

Trusted Platform Module (TPM)

What is a TPM? Who knows what a TPM is? Actually, each modern server or laptop has a TPM. It's a tamper resistant crypto chip in modern laptops and servers. Like any other crypto chip should do, it can provide you secure key storage and hardware random number generator. Actually, the most useful point about TPM is it's a fundamental building block for remote attestation, which provides you authenticated identity for remote systems and trustworthy assertions about the remote system state.

Remote Attestation

How does it work in practice? Imagine you have the verifier. For us, it's our configuration management system. You have your remote entity, your remote server with the TPM. At any point in time, the verifier can send the remote entity a request for a so-called quote. What the TPM can provide is information about the identity. It can also provide a full report about the state of the system: which operating system it runs, when it was booted, whatever you can think of. It's basically the full state of the remote system. Because it's a secure crypto chip, it can provide it in a secure manner. It signs the response, so the response is authenticated and integrity-protected. By the time we do this remote attestation, we know we're communicating with the right host. We know that we're communicating with the right host securely. We know the remote host runs only authorized software. The secure boot we discussed at the beginning is a local thing: the server will not boot unauthorized software. Here, the TPM allows the verifier, a remote party at the other end of the world, to make sure that the remote server runs only the operating system we tell it to run and not anything else. After that, we can basically trust the remote host.
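The request and response flow can be sketched as follows. This is a toy model under a significant assumption: a real TPM signs quotes with an asymmetric attestation key, so the verifier only needs the public part, whereas this sketch uses an HMAC with a shared secret as a stand-in so that it runs on the Python standard library alone. `ToyTpm` and `attest` are hypothetical names.

```python
import hashlib
import hmac
import json
import secrets


class ToyTpm:
    """Stand-in for the remote server's TPM (HMAC instead of a real signature)."""

    def __init__(self, key, state):
        self._key = key
        self._state = state

    def quote(self, nonce):
        # The report describes the system state; the nonce ties it to this request.
        report = json.dumps(self._state, sort_keys=True).encode()
        signature = hmac.new(self._key, nonce + report, hashlib.sha256).digest()
        return report, signature


def attest(tpm, key, expected_state):
    """Verifier side: challenge the remote host and check its signed report."""
    nonce = secrets.token_bytes(16)  # fresh per request, so old quotes cannot be replayed
    report, signature = tpm.quote(nonce)
    expected_signature = hmac.new(key, nonce + report, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected_signature):
        return False  # not the host we think it is, or the report was altered
    return json.loads(report) == expected_state
```

The verifier accepts the host only if the signature checks out and the reported state matches what it expects the server to be running.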

Data Center Provisioning with TPM

When you integrate TPMs into your provisioning process, you can get better automation. This is what the automation does. It verifies server identity. It verifies the running operating system. It cross checks serial numbers. Actually, the server itself can cross check serial numbers, because we can trust it to do so, since it runs our software. Because we have a secure channel, the automation can provision and push the configuration management keys. When the configuration management system kicks in, it brings up the server, and the server starts serving production traffic. What our engineers do at this point is drink tea, have a good work-life balance, or work on something interesting. In the end, by employing the TPM, which is a security component, you get better automation. There is less room for human error and misconfiguration during provisioning. You have faster data center provisioning because robots are just faster than humans. It's our big weakness. You make more efficient use of engineering time because engineers can develop interesting new things instead of doing manual, repetitive tasks. As a side effect, you get better security, which is nice to have.

Conclusions

What we learned is that security does not always have to impact performance. Remember zero-cost security. There are many modern things where sometimes it actually improves your performance. If you can't make security improve performance in the narrow sense, perhaps look for opportunities where security can improve performance in the broader sense like process and cost. This approach is actually useful if you're a security engineer and you want to drive and prioritize your work with the rest of the company. This dialog, performance by security, is what the rest of the company understands. Driving security by performance, like describing better security in performance terminology helps your company to understand security better, and helps you to get it prioritized and done. Try to adopt this approach and maybe you'll have an easier time at work.

Questions and Answers

Moderator: At Cloudflare, do you use all these things?

Korchagin: Yes. These are all the things we use.

Moderator: Do you have any more? Maybe some secret tips that you can share?

Korchagin: Yes. I also wanted to mention a nice firewall. A firewall also improves performance. You waste CPU cycles on the firewall, but if you implement the firewall properly, it will block bad traffic from the rest of your systems. Consider a case of a non-malicious zip bomb. There are certain archives, which can basically clog your CPU. If you block them early in the firewall, your CPU will not be wasted. They're broken anyway so no need to process them. A proper firewall is a nice example. I think it is performance in the narrow sense.
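As one hedged illustration of blocking expensive input early, the sketch below inspects a zip archive's directory before anything is decompressed and rejects archives that are broken or that declare an extreme expansion ratio. This is not how any particular firewall does it: the function name and thresholds are hypothetical, and the declared sizes in an archive can be forged, so this is only a cheap first pass.

```python
import io
import zipfile


def looks_like_zip_bomb(data, max_ratio=100, max_total=50 * 1024 * 1024):
    """Cheap pre-check using the sizes declared in the archive's directory."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            infos = archive.infolist()
    except zipfile.BadZipFile:
        return True  # broken archives are not worth processing either

    total = sum(info.file_size for info in infos)          # declared unpacked size
    compressed = sum(info.compress_size for info in infos)  # size inside the archive
    if total > max_total:
        return True
    return compressed > 0 and total / compressed > max_ratio
```

Reading the directory costs far less CPU than inflating the contents, which is the point being made: a little work at the edge saves much more work behind it.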

Moderator: It's like a filter. It works like a filter. You spend some performance at the edge of your network to get better performance inside of your network. Makes sense. Do you use any of these things that Ignat mentioned? HTTP/2. Full disk encryption, probably. Usually, it works just by default.

I didn't know about TPMs because, from my perspective, TPM provisioning is something that should be set up on your data center. It's not something you set up as an engineer for your software.

Korchagin: You don't need to provision TPMs. You can add your own keys to the TPM. TPM also has a manufacturer baked in, so-called endorsement key, which is actually used for remote attestation. You put your own keys for secure storage, but for remote attestation a vendor puts a key there and the certificate so you can actually verify the TPM's public key back to the vendor. When the vendor ships you serial numbers and other data about the systems you're buying, they also send us the TPM public key so we can track which TPM was installed in each server. We can cross check it remotely easily.

Moderator: For developers, we have TPM, let's check. They're valid, fine. We have remote attestation and we don't spend any time on that. I like a security thing that as a developer I don't need to spend time on.

Korchagin: Who definitely knows they use elliptic curve cryptography in TLS?

Participant 1: I've got a question about data encryption. Have you got any tips around runtime encrypting data for storage, maybe long-term storage or anything like that?

Korchagin: You mean the overhead?

Participant 1: Yes. Also for performance optimization.

Korchagin: I just urge you to go to my link, and to my talk. These slides have a full test case, A and B. I didn't do it on a production server. I posted all the commands so you can retrace the steps on your local laptop, and you can immediately see the overhead. The problem we encountered was that the implementation of disk encryption in the Linux kernel was not very efficient. The architecture is a bit old. It was basically designed with spinning disks in mind, which is what people had 10 years ago. Now everyone is using SSDs or NVMe drives. There were also some other things in the Linux kernel which didn't work properly, so workarounds were added. Now that these things work properly, the workarounds are not needed anymore. Most of the overhead comes not from the crypto itself but from the way it's all tied together. To give you some numbers, our patch, on my test scenario, improved the throughput by 100%.
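For readers who want a rough A/B comparison of their own, the sketch below times a sequential write to a path and reports throughput. It is not the test case from the slides: the function names, sizes, and the idea of pointing it at one encrypted and one unencrypted mount are assumptions for illustration, and a single write like this is far cruder than a proper disk benchmark.

```python
import os
import time


def write_throughput_mib_s(path: str,
                           total_mib: int = 64,
                           block_kib: int = 1024) -> float:
    """Write roughly `total_mib` MiB to `path` and return MiB per second."""
    block = os.urandom(block_kib * 1024)
    blocks = max(1, (total_mib * 1024) // block_kib)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(blocks):
            f.write(block)
        # Force the data out of the page cache so the disk path is measured.
        f.flush()
        os.fsync(f.fileno())
    elapsed = time.perf_counter() - start
    written_mib = blocks * block_kib / 1024
    return written_mib / elapsed


def improvement_percent(before: float, after: float) -> float:
    """Relative throughput change; 100.0 means throughput doubled."""
    return (after - before) / before * 100.0
```

Running `write_throughput_mib_s` once against a file on each mount and feeding the two numbers to `improvement_percent` gives a first impression of the overhead; an improvement of 100% corresponds to doubled throughput.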

Participant 1: I was thinking about field-level encryption. Maybe if you wanted to encrypt an email address or something, not the whole set of data. Is that something you've had any experience with, or tips around?

Korchagin: Selective encryption is mostly useful if the crypto is expensive. Crypto is not expensive. It's how you implement it and wire it up that's expensive. Sometimes you may end up over-complicating the system and spending more time parsing your data, looking for email addresses to encrypt, than you would by just encrypting the whole thing sequentially.

Moderator: Here is a sticker: don't roll your own crypto. Actually, let me answer the question about field-level encryption, because we do it a lot. Usually, what we say is that if your software runs on JavaScript, or PHP, or Ruby, or Java, you won't notice cryptography. As Ignat says, all the details of your application, all the network calls, will be more expensive than the cryptographic computations themselves. It also depends on the architecture or the data flow. What we see when we re-engineer applications to add cryptography, especially field-level selective encryption, where you need to encrypt only some fields, is that sometimes you need to re-engineer the data flow of your system and avoid select-all calls to your database. You distinguish the reading flow from the writing flow, and you read plaintext data and sensitive data in different ways. In that case, you can have better security, because you have different data flows, and you can monitor which behavior is normal and which is abnormal. You can even get a performance boost, because you don't do all the separations in one call as you used to do before.
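The separated data flows can be sketched with an in-memory SQLite database. The schema, table names, and function names are hypothetical, and the encrypted field is treated as an opaque blob: encryption and decryption are assumed to happen elsewhere with a vetted library, not in this sketch.

```python
import sqlite3


def open_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users ("
        "id INTEGER PRIMARY KEY, display_name TEXT, email_ciphertext BLOB)"
    )
    # Every read of a sensitive field is recorded here for monitoring.
    conn.execute("CREATE TABLE sensitive_reads (user_id INTEGER, reader TEXT)")
    return conn


def read_profile(conn: sqlite3.Connection, user_id: int):
    """Plaintext flow: names its columns and never touches the ciphertext."""
    return conn.execute(
        "SELECT id, display_name FROM users WHERE id = ?", (user_id,)
    ).fetchone()


def read_email_ciphertext(conn: sqlite3.Connection, user_id: int, reader: str):
    """Sensitive flow: a separate query that leaves an audit record."""
    conn.execute(
        "INSERT INTO sensitive_reads (user_id, reader) VALUES (?, ?)",
        (user_id, reader),
    )
    row = conn.execute(
        "SELECT email_ciphertext FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None
```

Because the common path never fetches the encrypted column, it pays no decryption cost, and an unusual spike in `sensitive_reads` is easy to spot.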

Participant 2: Are you ever worried that the hardware you receive from a manufacturer has been tampered with while in transit? What do you do?

Korchagin: Yes. This is why we want TPMs.

Participant 2: Because last year, there was an article in Bloomberg about some doomsday scenario with the chips.

Korchagin: The Super Micro thing. That article was never actually proven. We definitely consider the possibility that hardware can be tampered with while it is in transit. That's why TPMs are quite helpful. That's why we cross-check the serial numbers and everything. Depending on the complexity of the attack, you can receive hardware with a different network card. If the attacker was not that careful, it will have a different MAC address. That's why TPMs are useful, so you can at least cross-check, to some extent, that the hardware wasn't tampered with in transit.

Participant 3: A follow-up of that. How do you trust the TPM vendor to not be nation-state compromised?

Korchagin: You have to. Unfortunately, the way trust works, you can't create trust out of thin air. You always have to have a trust anchor. That's why it's called Trusted Platform Module, not Trustworthy Platform Module. You need to trust it implicitly, but from that trust, you can start building your trust chain to make other systems trustworthy. That initial trust, which is usually called the trusted base, you have to trust implicitly. To use a Trusted Platform Module, you have to trust the TPM vendor. With our servers, we can check if they were tampered with in transit. What if our vendor is compromised? We have to trust it. Currently, there is no way around that unless you are willing to pay the cost of building everything from scratch, your own server, but then you have to trust the vendors of all the components. It becomes an infinite loop.

Moderator: This is how trust works.

Recorded at:

Oct 09, 2020

Ignat Korchagin
