Who Moved My Code? An Anatomy of Code Obfuscation

Nov 09, 2022 13 min read

Key Takeaways

Keeping programs and technology safe is more important than ever. Combined measures, protection layers and various methods are always required to establish a good protective shield.
Obfuscation is an important practice to protect source code by making it unintelligible, thus preventing unauthorized parties from easily decompiling or disassembling it.
Obfuscation is often mistaken for encryption, but they are different concepts. Encryption converts information into secret code that hides the information's true meaning, while obfuscation keeps the information as is, but in an obscure form.
There are various methods to obfuscate code, such as using random shuffle, replacing values with formulas, adding ‘garbage’ data, and more.
Obfuscation works well with other security measures, and is not a strong enough measure on its own.

In the bipolar world we live in, technology, open source software, and knowledge are freely shared on one hand, while the need to prevent attackers from reverse engineering proprietary technologies is growing on the other. Sometimes, the price of technology theft can even risk world peace, just like in the case of the Iranians, who developed a new attack drone based on a top-secret CIA technology they reverse engineered. Code obfuscation is one measure out of many in keeping data safe from intruders, and while it might not bring world peace, it can, at least, bring you some peace of mind.

Introduction

When it comes to high-end and sophisticated technology, Iran never had the upper hand – the embargo and sanctions did not leave Iran with any technological advantage except for one: creativity. The Iranians find the most creative ways to try and stay on top. To prove our point, here’s an interesting story: in 2011, using simple signal interference, Iran hijacked an American super-secret drone: the RQ170 Sentinel, which was the state-of-the-art intelligence gathering drone used by the CIA. It took the Iranians “only” a few years to reverse engineer the Sentinel, in an effort which paid off well: it led to the production of the Iranian Shahed 191 Saegheh, which is based on the Sentinel’s technology, and was recently sold to Russia.

What can programmers, technology vendors, and governments do to keep their technologies safe from the sticky fingers of malicious attackers who want to reverse engineer valuable technologies?

Keeping programs and technology safe is the same as keeping your house safe from burglars: the more valuables you have, the more measures you take to protect them, taking into account that in most cases, no one can guarantee your home is 100% safe. The same goes for protecting source code: we want to prevent unauthorized parties from accessing the logic, or the “sauce secrète” of our application, extracting data, cloning, redistributing, repacking our code, or exploiting vulnerabilities.

Hide the needle in the haystack

The best security experts will tell you that there’s never an easy, or a single solution to protect your intellectual property, and combined measures, protection layers and methods are always required to establish a good protective shield. In this article, we focus on one small layer in source code protection: code obfuscation.

Though it’s a powerful security method, obfuscation is often neglected, or at least misunderstood. When we obfuscate, our code becomes unintelligible, thus preventing unauthorized parties from easily decompiling or disassembling it. Obfuscation makes our code impossible (or nearly impossible) for humans to read or parse. Obfuscation is, therefore, a good safeguarding measure used to preserve the proprietary nature of the source code and protect our intellectual property.

To better explain the concept of obfuscation, let’s take “Where’s Waldo” as an example. Waldo is a known illustrated character, always wearing his red and white stripy shirt and hat, as well as black-framed glasses. The challenge is to find Waldo among dozens or even hundreds of people doing a variety of amusing things in a double-paged illustration, full of situations, characters, objects and events. It’s not always easy, and it might take some time to parse the illustration, but Waldo will always be found in the end, thanks to his unique looks.

Now imagine Waldo without his signature stripy shirt, hat, or glasses – instead, he wears a different shirt every time, different hat, and a wig. Sometimes he will even be dressed as a woman. How easy would it be to find him? Probably near to impossible.

Figure 1: Imagine looking for Waldo without his signature stripy shirt, glasses, and hat. Instead, he will be wearing regular clothes and a face mask.

Using the same concept, when we obfuscate, we hide parts of a program's code, flow, and functionality in a way that will make them unintelligible – we mask them, we “twist”, scramble, rename, alter, hide, transform them, and on top of that, we pour a layer of junk.

Good obfuscation will use all these methods, while keeping the obfuscated code looking like ordinary, non-obfuscated source code. Code that looks like the real thing confuses an attacker and makes reverse engineering a difficult proposition to undertake.

Bear in mind that obfuscation, like any other security measure, does not come with a 100% guarantee, yet it can come as close to it as possible if done right, especially if combined with other security measures.

Obfuscation != encryption

It’s important to differentiate between obfuscation and encryption, which are often mistaken for the same thing. They are two different concepts, and one does not replace the other. If anything, they complement each other.

When we encrypt, we convert information into secret code that hides the information's true meaning. When we obfuscate, the information stays as is, but in an obscure format, as we increase its level of complexity to the point that it’s impossible (or nearly impossible) to read or parse.

Strong encryption is a strong security measure, but we must keep in mind that any lock will be opened at some point. Anything encrypted must be decrypted in order to be used, which is like opening the door of the fortress: however strong, it’s still a weak spot. This is where the advantage of obfuscation comes into play: when we obfuscate, we do not encrypt, we simply hide our code in plain sight. Think of obfuscation as hiding the needle in the haystack. If done well, it will take an unreasonable amount of time and resources for an attacker to find your “needle”.

From our years of experience as programmers and obfuscation advocates, we have found that obfuscation is a bit like Brexit: experts are either utterly for it, or passionately against it. However, let’s remember that security always requires several methods used in conjunction with one another, so that if one fails, the other will still be there. That is exactly why obfuscation and encryption make a good pair. Obfuscation should always come last: once you have added the layers of encryption and fully debugged the program, it’s time to obfuscate.

Though this article focuses on how to create a string obfuscation tool, it’s important to point out that, in real life, commercial obfuscation tools obfuscate much more than strings – they include obfuscating functions, API calls, variables, libraries, values, and much more.

See me not

Large corporations use obfuscation for sensitive software. For example, Microsoft Windows’ PatchGuard is heavily obfuscated, which makes it extremely difficult to reverse engineer. If you are a programmer, you probably don’t own the fancy security tools big corporations use, and why should you? But that doesn’t mean you should not be able to protect your code using some simple and practical measures. Obfuscating strings is a good way to make your code unintelligible without resorting to expensive and complex obfuscation tools.

In fact, if you take a typical executable and dive into it using any hex editor, or even Notepad, you may find many strings among the binary data which reveal trade secrets, IP addresses, or other pieces of information (figure 2), all in the form of strings, that you really don’t want to give away.

Figure 2: If we open an exe using a hex editor, we can find some strings, which might give out a lot of information that can be exploited by attackers. In this case, the string “calculator” is found.
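To make the point concrete, here is a minimal sketch of the kind of scan a hex editor or a "strings"-style utility effectively performs: it walks over raw bytes and reports every run of printable ASCII characters. The function name find_ascii_strings and the minimum length of 4 are our own assumptions for illustration, and the sketch ignores wide-character (UTF-16) strings, which Windows executables often contain as well.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Collect every run of printable ASCII bytes that is at least min_len long.
// (Hypothetical helper for illustration; real tools also handle UTF-16.)
std::vector<std::string> find_ascii_strings(const std::vector<unsigned char>& bytes,
                                            std::size_t min_len = 4) {
    assert(min_len > 0);
    std::vector<std::string> found;
    std::string run;
    for (unsigned char b : bytes) {
        if (b >= 0x20 && b <= 0x7E) {
            // Printable character: extend the current run.
            run.push_back(static_cast<char>(b));
        } else {
            // Non-printable byte ends the run; keep it only if long enough.
            if (run.size() >= min_len) found.push_back(run);
            run.clear();
        }
    }
    if (run.size() >= min_len) found.push_back(run);
    return found;
}
```

Fed the bytes of an executable, a scan like this would list any plain-text string stored in it, which is exactly what string obfuscation tries to prevent.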

Now, let's say your software connects to a remote server, you store the IP address being used, and you don’t want it to be revealed. You can mask and hide this sensitive data by obfuscating the string. Note that the data will be hidden only inside the executable file. Of course, once you communicate with a remote server, sniffing tools will show the IP along with anything sent and received, so take that into account. We should point out that there are ways to hide both IP and data even from sniffing tools (such as Wireshark), but that’s a subject of its own.

Under the hood of string obfuscation

There is more than one method to obfuscate your code, as obfuscation itself should be implemented on several levels, or layers – whether it’s the semantic structure, the lexical structure, control flow, API calls, etc. In order to create robust protection, we must use several techniques. As the focus in this article is on string obfuscation, let’s explore four sub-methods.

The importance of being random

When we think of random numbers, we can imagine a lottery machine: the machine uses spinning paddles at the bottom of the drum, and it spins the balls randomly around the chamber. A ball is then shot through a tube, meaning that each ball is randomly picked.

You might ask: why do we need to use random elements in our code? The answer is that one of the methods to decode obfuscated data is to examine what you expect to be the logical order of things, and once we randomize this order, it's harder to guess what the obfuscated data is in the first place.

The big question is: can a computer program generate real random numbers without any hidden logic, which turns the random numbers into, well, not so very random? After all, there are no spinning paddles, no shooting balls, just a man-made program run by a computer.

C++, for example, offers the rand() function (declared in <cstdlib>) and the <random> library header. Both are meant to help us generate a random number, or what we should rather call a “pseudo-random” number. Why pseudo? Well, because the “random” output generated using rand() is not really random: it is computed deterministically from a seed. If we call rand() repeatedly, then test the results statistically, we can see that after enough iterations, the generated numbers fail a statistical test, as some of the “random” results can be easily predicted.

An entrepreneur named Arvid Gerstmann developed his own random number generator that is more random, and we use his library as part of the final project in our book, when we develop a mini string obfuscation tool.
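As a point of comparison, the following sketch uses only the standard <random> facilities rather than the library mentioned above. The helper names make_engine and random_in_range are our own. Keep in mind that std::mt19937 is still a pseudo-random generator and is not cryptographically secure, and that std::random_device is allowed to be deterministic on some implementations.

```cpp
#include <cassert>
#include <random>

// Build a Mersenne Twister engine seeded from the system's entropy source.
// (Hypothetical helper; a fixed seed can be used instead for repeatable runs.)
std::mt19937 make_engine() {
    std::random_device rd;
    return std::mt19937(rd());
}

// Return a uniformly distributed integer in the closed range [lo, hi].
int random_in_range(std::mt19937& engine, int lo, int hi) {
    assert(lo <= hi);
    std::uniform_int_distribution<int> dist(lo, hi);
    return dist(engine);
}
```

Unlike the common rand() % n idiom, std::uniform_int_distribution avoids modulo bias, which matters when the random choices drive the layout of the obfuscated code.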

Shuffle ‘em like a deck of cards

When we obfuscate, we shuffle various elements, such as strings, functions and so on, so that their order will be (almost) random, which makes it harder to analyze if someone is trying to crack your code. Think of shuffling data as taking a deck of cards and mixing them up in a random order. We do the same with the function we will be generating.

Shuffling is changing the order of some elements in a random way (or almost random), which makes it harder for an intruder to analyze and reverse engineer our code. One of the methods to decode obfuscated data is to examine what you expect to be the logical order, and when you shuffle that order, it's harder to guess what the obfuscated data’s order is. Of course, the aim is not to alter the behavior of the code, but simply to work on a separate module which handles the shuffled elements as they should be handled once called.
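A minimal sketch of the idea, assuming we shuffle the order in which the characters of a string are written: the write order changes, but the final result does not. The names shuffled_indices and rebuild_in_order are hypothetical and only serve this illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <random>
#include <string>
#include <vector>

// Return the indices 0..n-1 in a shuffled order.
std::vector<std::size_t> shuffled_indices(std::size_t n, std::mt19937& engine) {
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::shuffle(order.begin(), order.end(), engine);
    return order;
}

// Rebuild `text` by writing its characters in the given (shuffled) order.
// Each character still lands at its own index, so the behavior is unchanged.
std::string rebuild_in_order(const std::string& text,
                             const std::vector<std::size_t>& order) {
    assert(order.size() == text.size());
    std::string result(text.size(), '\0');
    for (std::size_t i : order) {
        result[i] = text[i];
    }
    return result;
}
```

An obfuscation tool would emit one assignment per index in this shuffled order, so that reading the generated code from top to bottom no longer spells out the string.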

Replacing Values with Formulas

Another method used in obfuscation is to randomly replace values with different types of formulas such as x=z-y or x=y+z. Let’s say we have the value 72: we can replace it with 100-28, or 61+11. When the formula is x=z-y, we need z to be random but larger than y, so that the formula still evaluates to the positive value we started with. We then insert this randomly generated formula into the generated source code instead of the original value.

Figure 3 shows how obfuscated code will look when we insert random formulas.

Figure 3: Good obfuscation randomly replaces values with different types of formulas such as x=z-y or x=y+z.
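The replacement can be sketched roughly as follows. The Formula struct, the make_formula function, and the chosen ranges are assumptions made for this illustration, not the way any particular tool does it. For x=z-y we pick y at random and set z=x+y, which guarantees that z is larger than y. For x=y+z we split x into two positive parts.

```cpp
#include <cassert>
#include <random>
#include <string>

// Two operands and an operator that together evaluate to a target value.
struct Formula {
    int lhs;
    char op;  // '-' or '+'
    int rhs;

    int evaluate() const { return op == '-' ? lhs - rhs : lhs + rhs; }

    // Text that a generator could paste into source code, e.g. "(100 - 28)".
    std::string to_source() const {
        return "(" + std::to_string(lhs) + " " + op + " " + std::to_string(rhs) + ")";
    }
};

// Build a random formula that evaluates to `value` (illustrative ranges).
Formula make_formula(int value, std::mt19937& engine) {
    assert(value >= 1 && value <= 1000000);
    std::uniform_int_distribution<int> pick_form(0, 1);
    if (pick_form(engine) == 0 || value == 1) {
        // x = z - y: choose y at random, then z = x + y (so z > y).
        std::uniform_int_distribution<int> pick_y(1, 1000);
        int y = pick_y(engine);
        return Formula{value + y, '-', y};
    }
    // x = y + z: split the value into two positive parts.
    std::uniform_int_distribution<int> pick_y(1, value - 1);
    int y = pick_y(engine);
    return Formula{y, '+', value - y};
}
```

For the value 72, one call might yield a formula equivalent to 100-28 and another one equivalent to 61+11. Every result evaluates to 72.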

Adding Junk and ‘garbage’ data

Another method of concealing the content of our code, making it harder to parse and reverse engineer, is adding random junk data in between the real data. Let’s say for example that the result is a NULL terminated array – we place the NULL at the end of the string and the junk after the NULL. An obfuscated string using this method will look like this:

result[12] = L'$';
result[0] = L't';
result[5] = L'5';

Now, imagine that we assign the values of the real and junk characters in random order, so we may start with char [12] then [0], then [5], and so on, which makes it harder to understand the flow and result if examined.
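The sketch below illustrates this under a few assumptions of our own: the Assignment struct and the functions plan_assignments and apply_plan are hypothetical names, and the junk consists of random printable characters. The plan holds the real characters, the terminating NULL, and the junk placed after the NULL, all in shuffled order. Applying the plan to a buffer still yields the original string, because anything after the NULL is ignored when the buffer is read as a string.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// One generated assignment: result[index] = value.
struct Assignment {
    std::size_t index;
    wchar_t value;
};

// Plan the real characters, the terminating NULL, and `junk_count` random
// printable characters after the NULL, then shuffle the assignment order.
std::vector<Assignment> plan_assignments(const std::wstring& text,
                                         std::size_t junk_count,
                                         std::mt19937& engine) {
    std::vector<Assignment> plan;
    for (std::size_t i = 0; i < text.size(); ++i) {
        plan.push_back({i, text[i]});
    }
    plan.push_back({text.size(), L'\0'});
    std::uniform_int_distribution<int> printable(0x21, 0x7E);
    for (std::size_t j = 0; j < junk_count; ++j) {
        plan.push_back({text.size() + 1 + j, static_cast<wchar_t>(printable(engine))});
    }
    std::shuffle(plan.begin(), plan.end(), engine);
    return plan;
}

// Apply the plan to a buffer, as the generated code would do at run time.
std::vector<wchar_t> apply_plan(const std::vector<Assignment>& plan) {
    std::vector<wchar_t> buffer(plan.size(), L'\0');
    for (const Assignment& a : plan) {
        assert(a.index < buffer.size());
        buffer[a.index] = a.value;
    }
    return buffer;
}
```

A generator would print each planned assignment as a line of source code, in the style of the three lines shown above.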

Remember: obfuscated code is only as good as its weakest link. We must always test its resistance and try de-obfuscating it. The harder it gets, the stronger the obfuscation is.

Tip: Keep in mind that obfuscated source code is hard to maintain and update. Therefore, it is recommended to maintain the non-obfuscated version and obfuscate it before deploying a new version.

After having discussed a few general concepts behind code obfuscation, in the next section we will present a simple tool named Tiny Obfuscate, aimed to obfuscate strings, and which works in two modes: ad-hoc stream, and entire source code project.

Tiny Obfuscate

Tiny Obfuscate is a Windows application developed by Michael Haephrati using C++. It was initially introduced in a Code Project article as a small proof of concept that converts a given string into several lines of code that generate it.

Figure 4: The original Tiny Obfuscate interface

You enter the string and a variable name, and the lines of code are generated so they can be copied to the program and replace the original string.

Figure 5: The advanced Tiny Obfuscate commercial version

A more advanced version of Tiny Obfuscate was actually used in real life, as part of the development of several commercial products. This version has a “Project Mode” and an “Immediate Mode”. The Immediate Mode resembles the original version from the old article, but has more features:

Users can select the type of string (UNICODE or wide char, const and more).
The obfuscated code is wrapped inside a new function which is generated.
Optionally: the function code and prototype are inserted into a given .cpp and .h, but only after checking whether a function that obfuscates the given string already exists.
The function call is copied to the Clipboard (either the newly generated function or an existing one, if the given string was obfuscated before), so the user can just paste it instead of the given string.
The generated function is automatically tested to verify that it will return the given string.
Various control and escape characters are handled. These include \n, \t, and so on, as well as format specifiers such as %s and %d.
Comments are automatically added to keep track of the original string that was obfuscated and when it was obfuscated.

Example:

Let’s test and see how string obfuscation works with the following example. Let’s say we have the following line

wprintf(L"The result is %d", result);

Now, we wish to obfuscate the string, which in this case is The result is %d. We enter this string into the Immediate Mode “String to obfuscate” field and press ENTER. An alert is then displayed, and the generated code appears (and is inserted into the project’s source and header files).
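The tool's actual output is not reproduced here. As a rough idea of what a generated function of this kind could look like, here is a hand-written sketch that combines the methods discussed earlier: out-of-order assignments, values replaced with formulas, and junk after the terminating NULL. The function names, buffer size, formulas, and junk characters are all our own assumptions and not Tiny Obfuscate's real output.

```cpp
#include <cassert>
#include <string>

// Hypothetical generated function (illustration only).
// Original string: L"The result is %d"
std::wstring obf_the_result_is() {
    wchar_t result[20];
    result[12] = static_cast<wchar_t>(61 + 54);   // 's'
    result[0]  = static_cast<wchar_t>(130 - 46);  // 'T'
    result[16] = L'\0';                           // terminator
    result[5]  = L'e';
    result[18] = L'$';                            // junk
    result[3]  = static_cast<wchar_t>(90 - 58);   // ' '
    result[9]  = L't';
    result[14] = static_cast<wchar_t>(17 + 20);   // '%'
    result[1]  = L'h';
    result[17] = L'5';                            // junk
    result[7]  = static_cast<wchar_t>(200 - 83);  // 'u'
    result[10] = L' ';
    result[2]  = static_cast<wchar_t>(44 + 57);   // 'e'
    result[15] = L'd';
    result[19] = L'q';                            // junk
    result[6]  = L's';
    result[13] = static_cast<wchar_t>(12 + 20);   // ' '
    result[4]  = static_cast<wchar_t>(150 - 36);  // 'r'
    result[11] = L'i';
    result[8]  = static_cast<wchar_t>(99 + 9);    // 'l'
    return std::wstring(result);  // reading stops at the NULL, junk is ignored
}

// Check that the function returns the original string. A check like this
// belongs in the tool or a test build, since it contains the plain string.
void self_test() {
    assert(obf_the_result_is() == L"The result is %d");
}
```

With a function like this in place, the original call would become wprintf(obf_the_result_is().c_str(), result); and the plain string would no longer need to appear in the source file.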

The Project Mode allows selecting a Visual Studio Solution or Project, going over all source files, selecting what to obfuscate (Variables, Function names, Numeric values, and Strings), viewing a preview of the results, then checking the obfuscated project and interactively checking and unchecking each element to get the optimal result.

The advanced Tiny Obfuscate software generates and maintains a sqlite3 database which keeps track of everything done, allowing it to revert to the original version and undo any action.

Conclusion

In this article, we have introduced the topic of code obfuscation, with emphasis on string obfuscation. If you want to go deeper, in our book Learning C++ (ISBN 9781617298509, by Michael Haephrati and Ruth Haephrati, published by Manning Publications), we teach complete beginners the basics of the C++ programming language, and gradually build their skills towards a final project: creating a compact yet powerful string obfuscation tool.


About the Authors

Michael Haephrati

Ruth Haephrati
