

Analyzing Codebases for Fun and Profit



Jordan Bragg discusses using entry-points, breadth-first scanning, and operation tagging to demystify the domain, see where to dive deeper, and uncover what technical debt may exist.


Jordan Bragg is a generalist who started his career in QA at Lockheed Martin. Jordan has since worked full-stack, focusing on back-end service development and security. At Castlight, Jordan has spent his time integrating with partners, simplifying domains, creating R&D-wide enablers, and pushing for increased security, observability, and event-driven systems.

About the conference

QCon Plus is a virtual conference for senior software engineers and architects that covers the trends, best practices, and solutions leveraged by the world's most innovative software organizations.


Bragg: My name is Jordan Bragg. I'm a software engineer at Castlight Health. I want to spend a little time talking about analyzing code bases and why it's important for being a better developer. Why is it important? The quicker you can jump into a code base, and understand it, the quicker you can provide value. This could also lead to you making better decisions on what types of tools, libraries you adopt. It can also give you a better understanding of the risk in certain libraries, or tools, or even into your internal code base, what risks exist in domains that you're analyzing.

Providing Value

Digging a little bit deeper on the providing value aspect. As you become more senior, the more you are reading code, especially in domains that you know little about, and so being able to jump into a code base to provide guidance and support, and not purely friction for engineers, is a big win. Outside of that, sometimes you have to jump into a domain and own it or bring it forward, or deprecate it, or derisk it. Being able to quickly jump in and understand it is vital for some of these time-sensitive things. Then, lastly, there's a bunch of open source libraries that we've started to use these days, and so being able to understand what you're pulling into your code matters. One of the beauties of open source is that we can read the code and we can iterate and create enhancements or bug fixes, and not just submit tickets and hope somebody fixes it for us. Basically, analyzing code, reading code, writing code, it's all an acquired skill that we get better at over time, the more we do it. It's something where you have to use it or lose it, and the less you do, the more you're going to lose that skill.


Over time of reading a bunch of code bases both for different companies as well as open source libraries, I see some patterns that I generally follow when I do that. I wanted to codify or structure it a bit, and maybe provide it as a way that could be helpful for you to build yours, or at least reflect on how you do it, and maybe even share some of the things that work for you. Breaking things down into three categories is how I did mine, starting with defining the problem, and doing some planning before we actually just jump into code. Then, if needed, we can explore the code base and break it down.


The define stage is really around defining your problem, which could also be, what is your goal? Are you fixing a very specific bug in some code base? Are you adopting or rolling out some new technology or tool? Or are you trying to understand some complex domain to provide value for teammates and guidance? Or, are you also taking ownership of code? There's various goals, just make sure you're clear. The next part is around how much time do you want to invest in this? Do you have an hour, a day, a month? This very much affects how much context you can gather, which leads to the next part of, how much context do you need? Time and the need are intertwined here. Do I need a surface level just so I can provide value and see risks, or do I need to know every component here, because I'm taking ownership and need to innovate on it?

Planning - Context Gathering

Then we move on to the planning phase. In the planning phase, the first thing we want to do is really just gather some context. This involves seeing what documentation exists. For open source libraries that are supported, there's usually a lot. Of course, they don't list all the bad things that happen in the code, but they do give a lot. This is generally not true for your company's code, where documentation is either nonexistent, very limited, or stale. A lot of times as soon as it's written, it's already stale. Then you could also do things like pair program with engineers who know the domain. They can walk you through some bugs or some of the high level flows. Another way could be hands-on debugging. Can you get an example, run a simple example, set up your dev environment, write a test? There's various ways to get your hands dirty and go through a debugger. The only caution here is that sometimes setting up your developer environment takes longer than reading the code, at least depending on how much context you need. Be wary.

Planning - Define Entry Points

The next part of planning is really around understanding the entry points that you want to read through. There's the general procedural stuff that we want to read, but there's also some of the how I got here. If you're looking at a bug, you might have an error or a line number somewhere in the stack that you're interested in. Then there's a question of, do you need a larger context to understand how you got here and why it happened? Your entry point might include something further up the stack. If what you're interested in involves some state that was created outside of this flow, then you might want to also identify entry points where that state was created and used. Then, in a similar way, what are the asynchronous things going on in your code that you need to include because they affect the things you're mainly interested in? You can add those as entry points.

Interests for Tagging, and Deep-Dives

Once you have an idea of your entry points, and you have some context and terminology, then it gets into what you really are interested in. There are some things that I tag as interests as I read through code. One primary interest for me is always I/O. Anytime I'm calling a service, or inserting into a database, or reading from a cache, or even reading and writing from disk, I want to make sure I tag those as important. I also generally care about anything that's configuration or input, so things that I can provide that can change the behavior. Then we get into domains. If you inherited, let's say, a monolithic system, and you're trying to break it down into microservices or at least understand all the domains, you could, as you're going through these entry points, tag pieces where domains are intermingled, or where you see a domain that isn't part of this core responsibility. Tagging these things could aid in breaking your service up. You could tie it in with domain-driven design, understand the domains and the bounded contexts, and then use this tagging to help detangle.

Then the last three are things that I always tag. The first one is anything that I've identified as a key element or responsibility, something that's important to dig deeper into and understand. Then, if there are other things that you think are important and you want to know about, but they're not key responsibilities of this entry point, I mark those as passive interests. Then, as I'm going through the code, I really want to maximize the value that I'm providing. Anytime I have questions about things that don't make sense, or things I don't understand, I write those down. If I see things that just seem wrong, like bugs, inefficiencies, or bad readability, I make sure to mark those too. That way, you could potentially file some tickets or code it yourself to improve the code. You leave the code better than you found it.
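As a rough illustration of the tagging described above, here is a minimal Python sketch of a note structure. The `Tag` names, the `Note` fields, and the sample entries (including `checkout()` and `OrderService`) are assumptions made for illustration; they are not taken from the talk's slides.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Tag(Enum):
    IO = "io"              # service calls, database, cache, disk
    CONFIG = "config"      # configuration or input that changes behavior
    DOMAIN = "domain"      # domain boundaries, intermingled responsibilities
    KEY = "key"            # key responsibility worth a deeper dive
    PASSIVE = "passive"    # worth knowing, not core to this entry point
    QUESTION = "question"  # something that does not make sense yet
    DEBT = "debt"          # bugs, inefficiencies, poor readability


@dataclass(frozen=True)
class Note:
    entry_point: str
    location: str  # free-form label, for example "Class:line"
    tag: Tag
    summary: str


def group_by_tag(notes):
    """Return {Tag: [Note, ...]}, keeping the order the notes were taken in."""
    grouped = defaultdict(list)
    for note in notes:
        grouped[note.tag].append(note)
    return dict(grouped)


# Hypothetical notes from reading one entry point.
notes = [
    Note("checkout()", "OrderService:42", Tag.IO, "inserts order row"),
    Note("checkout()", "OrderService:57", Tag.CONFIG, "retry count read from settings"),
    Note("checkout()", "OrderService:80", Tag.QUESTION, "why is the cache bypassed here?"),
    Note("checkout()", "OrderService:95", Tag.IO, "calls payment service"),
]
by_tag = group_by_tag(notes)
```

Grouping by tag makes it easy to answer a question such as "what I/O does this entry point do?" without rereading every note.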

Explore - Breaking it down

Now that we've defined the problem and done some planning, we're going to dive in. What does that mean? Breaking the code down into sub-problems is a good path. For that, iterate on some of these entry points, tagging these items of interest, and potentially breaking them apart by those items of interest. Then, really, stay on the surface. Don't let the depth pull you away, or else you'll be 10 levels deep, 4 levels removed, and very confused about why you're there. To do this, summarize things the best you can. Make assumptions if it's not super clear. Note questions on things that you can dig into deeper. Just try to simplify how you describe each level.

Exploration Pitfalls

This leads to some pitfalls that I have fallen into. The first one is around understanding every line of code. I always want to understand everything if I'm going to read it. However, the time commitment and sanity involved are generally a limiting factor. You have to be really clear on how much context you really need. Some of the techniques I've used to keep myself on track are, again, the surface level analysis or breadth-first analysis. Try not to go in depth before you can summarize the top level. Then if you're in the top levels, the surface, and there's complex branching logic that you can't summarize, simplify it. If it's not of interest, skip over it. If it is of interest, try to follow only the most common path or the path that your entry point follows. Then, if you really need to go deeper to summarize the top level item, set yourself a depth limit and time box yourself. For me, I try to keep myself at no more than four in depth. Then I like to use the Pomodoro Technique to keep me focused, and tell me when I should come back to the surface.
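The breadth-first, depth-limited reading order described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the call graph is a hand-written, hypothetical dictionary, and the function names in it do not come from any real code base.

```python
from collections import deque


def breadth_first_scan(call_graph, entry_point, max_depth=4):
    """Visit callees level by level, stopping at max_depth.

    call_graph maps a function name to the names it calls directly.
    Returns (name, depth) pairs in visit order; the entry point is depth 0.
    """
    visited = {entry_point}
    queue = deque([(entry_point, 0)])
    order = []
    while queue:
        name, depth = queue.popleft()
        order.append((name, depth))
        if depth == max_depth:
            continue  # come back to the surface instead of going deeper
        for callee in call_graph.get(name, []):
            if callee not in visited:
                visited.add(callee)
                queue.append((callee, depth + 1))
    return order


# Hypothetical call graph for one entry point.
call_graph = {
    "handle_request": ["validate", "load_state", "save"],
    "load_state": ["query"],
    "save": ["query"],
    "query": ["open_connection"],
}
scan = breadth_first_scan(call_graph, "handle_request", max_depth=2)
```

With `max_depth=2`, everything directly under the entry point is listed before `query`, and `open_connection` is never visited, which mirrors summarizing the top level before going deeper.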

A second pitfall that has two parts that I've fallen into in the past is around documenting nothing. Many times, you don't do much planning, you just jump into the code and read, and you don't take any notes or anything. There's a few problems with that. One is that you're really not maximizing the value of your time spent in there. Secondly, like for me, I don't have the memory capacity to consume all that data and be able to verbalize it all and visualize it well. However, this has led to another part of the problem, which is around documenting too much. You can't document every line of code, you have to keep it at a summary, minimal level about what you're really interested in knowing about these things. Keep it high. One way I've gotten around this in the past is by taking handwritten notes. I get cramps really quickly, so I have to make sure I keep it at a summary level.


Now that I've talked a little bit about the structure that I follow internally, I wanted to go through a little example, at least briefly. I'm going to define my problem. My problem is that I wanted to understand how Kafka consumer works. For Castlight, we started leveraging Kafka a lot more. One of the key pieces here was the consumer. I wanted to have a base knowledge of how this thing worked. Obviously, there's a ton of other questions I have here around, how do I ensure it's exactly once? How does rebalancing work? How chatty is the consumer? What is the system impact? These are all things I care about. I feel like there, I could build on from a base concept of, how does it work? How much time do I have to spend in this? Let's just say I have a day.

Gathering context, I spend a good chunk of my time commitment on gathering context. There were so many good articles around that explained how a Kafka consumer works, even exactly at the code level. It's very likely you don't have to go that deep unless you're wanting to contribute or understand specific aspects within. There's great resources. Just looking at a simple example here of entry points. A simple example of a Kafka consumer involves creating a consumer, subscribing to topics, consuming records, doing something with them, and then closing. If you've read context about the consumer, there's a lot of internal things going on around it: committing offsets, joining the group, metadata, and all that. I want to understand why that happens.

Looking at the first entry point here, which is just the constructor. My thought here is that it's just going to be some simple instantiation, storing some of the configuration I passed in. Yes, no I/O, hopefully. Looking at the code, it is initially a bit overwhelming. You don't have to read all this. It's around 150, 160 lines of instantiation logic. I wanted to start tagging these interests and breaking it down a bit. Doing a few passes over this code, the first pass I'm going to do is to tag interests of state that is being managed in the Kafka consumer. If I do one pass through and identify each piece of state, I now have 19 different things that this thing manages. If I do another pass, and say, this time, I want to identify interests of configuration, so what inputs that I can provide change the behavior here? I can go through and list each one. You could see that just at a surface level, there's 34 different configurations. I'm listing which variables they're tied to so I can get an idea of what state exists here, what state is managed by this Kafka consumer, which is a lot.
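The passes in the talk were done by hand. As a hedged sketch of how a first pass could be roughed out with a script, the Python below scans constructor-like text for field assignments and configuration lookups. The `SOURCE` snippet, the `ExampleConfig` name, and the two regular expressions are assumptions for illustration; the snippet is not real Kafka code, and a regex pass like this will miss anything that does not match these simple shapes.

```python
import re

# Hypothetical constructor body, loosely shaped like Java; not real Kafka code.
SOURCE = """
this.clientId = config.getString(ExampleConfig.CLIENT_ID);
this.retryBackoffMs = config.getLong(ExampleConfig.RETRY_BACKOFF_MS);
this.metrics = buildMetrics(config);
long timeout = config.getInt(ExampleConfig.REQUEST_TIMEOUT_MS);
"""

STATE = re.compile(r"this\.(\w+)\s*=")
CONFIG = re.compile(r"config\.get\w*\(\s*\w+\.(\w+)\s*\)")


def tag_state_and_config(source):
    """Return (state fields, {config key: field it is stored in, or None})."""
    state, config = [], {}
    for line in source.splitlines():
        field = STATE.search(line)
        if field:
            state.append(field.group(1))
        for key in CONFIG.findall(line):
            # Tie the config key to the field assigned on the same line, if any.
            config[key] = field.group(1) if field else None
    return state, config


state, config = tag_state_and_config(SOURCE)
```

The output is only a starting list to review by eye; the value of the pass still comes from reading what each field and configuration is for.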

Doing another pass, we're going to say which things are passive interests and which things are key to the Kafka consumer. In the first page of code here, I saw mostly three things of passive interest. I'm going to list those and why. The GroupRebalanceConfig, enableAutoCommit, and metrics are things that I mostly care about just because I'm passing the configuration or input into them. Of these 34 configs I had, there's some that I'm not including. I'd like to understand those.

Going into the second page, I start to see things that are key to polling records. What topics am I subscribed to? The information about the Kafka cluster and topic ownership, which is the metadata. I see a piece here around how we get the bootstrap servers to resolve the hosts. There is a question here of what I/O happens, which ends up just being a DNS resolution. Then this K4 is interesting. There's this metadata bootstrap, and you're giving it the broker hosts. There's a question here of, does this do Kafka API calls? We could set it as a point of interest and dig deeper. Just to give you the brief answer, it does not. It just sets some internal state on nodes.

Moving on, there's this.client on K5 down below. Just glancing at it, it appears to be the main I/O client that talks between the consumer and Kafka brokers. If we move on to the next page, you could see that that client is passed into K7 and K8. Going back a bit, at K6, we're talking about how we assign topic partitions to different consumer group instances. That's of interest. Then the last two. First, the coordinator. The assumption here is it does everything minus actually getting records and deserializing them. Whereas the fetcher below is what's actually polling records from the topic and doing deserialization. One thing that I'll pause on that's of interest here is on K7, there's this check of whether the group ID is present. If it's not present, the coordinator is null. There's a question here of, what is the behavior of the Kafka consumer if you don't have a coordinator? How many places now need to assume that the coordinator could be null? I'm sure there's a good story for that.

Moving on, looking at the next section, we're looking at the subscribe. Here again, we're looking for, is there any I/O going on? Is it state based? What's going on? If we take a quick look, this piece of code is a lot more readable than the last one. There's not a lot of state being set here. Scanning through for passive and key interests. Of interest here is that at K1 and K5, we're acquiring and releasing a light lock. Since Kafka consumer is not thread safe, a lot of these operations lock, so they can't be interleaved. The next part is around validation. There's this K2, the fetcher clearing the buffer for unassigned topics. I don't think this applies to us. It's something that we should maybe dig into and understand why this happens. K3 and K4, we're talking about subscribing to a topic, and then requesting a metadata update. There's questions there of, does this actually reach out to Kafka and update the metadata, or assign subscription metadata? Does it do anything with Kafka itself? The answer, if you set a depth limit of four and you go three levels deep, is that the subscription and metadata calls are pretty much just updating state on instances within there, and there's no I/O. It's pretty simple.
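To make the "light lock" idea concrete, here is a rough Python analogue: a guard that detects use from a second thread and fails fast instead of blocking. It is a sketch under stated assumptions, not the Kafka client's Java code; the `LightLock` and `Subscriber` classes and the error message are hypothetical names chosen for illustration.

```python
import threading


class LightLock:
    """Detects concurrent use instead of blocking; reentrant for the owner."""

    def __init__(self):
        self._guard = threading.Lock()  # protects the two fields below
        self._owner = None
        self._depth = 0

    def acquire(self):
        me = threading.get_ident()
        with self._guard:
            if self._owner not in (None, me):
                raise RuntimeError("not safe for multi-threaded access")
            self._owner = me
            self._depth += 1

    def release(self):
        with self._guard:
            self._depth -= 1
            if self._depth == 0:
                self._owner = None


class Subscriber:
    """Hypothetical client whose public methods are wrapped by the light lock."""

    def __init__(self):
        self._lock = LightLock()
        self.topics = set()

    def subscribe(self, topics):
        self._lock.acquire()
        try:
            self.topics = set(topics)  # state update only, no I/O
        finally:
            self._lock.release()
```

The design choice is that a second thread gets an immediate error rather than waiting, so misuse shows up as a loud failure instead of a silent interleaving of operations.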

Moving on to the last one, which is the poll. This is the beast of the three. We won't go too deep on it. Taking a quick look here, there's not a lot of state being set that we see. However, there are a few things of interest. Again, at K1 and K7, we're looking at obtaining the light lock and releasing it at the end. There's some metrics around this process, and there's some validation. We have a passive interest on this client TriggerWakeup. We don't need to dig into it right now, I don't think. Then K3 and K4 are of interest, because it seems like those are the two core things. UpdateAssignmentMetadataIfNeeded, my assumption here, especially by its WaitForJoinGroup, is that it does everything with the coordinator minus polling actual records. This is part of a branch logic where one is in a while loop and one's a single call. Bubbling up to our initial entry point, we'll see that it always falls into this first one, and that the second one is deprecated. Then K5 and K6 are around transforming the data before you return it, and an optimization to call forward. Really, what we care about here is K3 and K4.

Kafka API Calls

Just giving you an idea here, if we dig a few levels deeper in those two methods, one thing we'll bubble up is what API and I/O calls are done between the consumer and Kafka. UpdateAssignmentMetadataIfNeeded makes up to 11 different API calls to Kafka, depending on what exists already, the first run versus the nth run. Then pollForFetches is primarily just around fetching and getting metadata.

State and Config by Class

Stepping back a little bit, and I'll be able to scan through these quickly. One of the things to do is to create some visualizations of what you know, and what's valuable. One of those is creating a simple class diagram from the Kafka consumer of the 19 items that we cared about, tying them to the main classes that we found of key interest. Then, the main piece of interest here is I'm tying all of the configuration state to where it's being used. If I care about where request timeout milliseconds is used, I know it's in both the Kafka consumer and the consumer network client. I think that's it, yes. If I want to understand how it's used in each one, I can dig deeper. Now I can summarize my overall learnings from that reading, which is that there's a lot of state, which is risky. It's not thread safe, so be careful. There's a lot of configurations that can turn so many knobs. The logic in the creation of the consumer and in the subscribe is very minimal and just sets state. Then the poll, there is a huge amount of logic there that happens sequentially every run. We know the two pieces that we really want to dig into more if we need more context.
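The "configuration by class" view can also be kept as plain data and inverted on demand. The Python sketch below assumes a hand-maintained mapping; only the `request.timeout.ms` rows reflect what the talk states (used by the Kafka consumer and the consumer network client), while `ExampleHelper` and the `example.setting.*` keys are hypothetical placeholders.

```python
from collections import defaultdict

# Which classes read which configuration keys. Only the request timeout rows
# come from the talk; the other entries are hypothetical placeholders.
CONFIG_BY_CLASS = {
    "KafkaConsumer": ["request.timeout.ms", "example.setting.a"],
    "ConsumerNetworkClient": ["request.timeout.ms"],
    "ExampleHelper": ["example.setting.b"],
}


def classes_by_config(config_by_class):
    """Invert {class: [config keys]} into {config key: sorted class names}."""
    index = defaultdict(set)
    for cls, keys in config_by_class.items():
        for key in keys:
            index[key].add(cls)
    return {key: sorted(classes) for key, classes in index.items()}


index = classes_by_config(CONFIG_BY_CLASS)
```

Looking up `index["request.timeout.ms"]` then answers "where is this knob used?" directly, which is the question the class diagram was drawn to answer.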

What's Next?

Rolling it back. What is next? We did a little bit of verbalizing and visualizing, but continue doing more, iterate on any new entry points to get more context. Make sure that you stop when you have enough context. I think that's pretty important, since you don't need to read every line of code. Then, as you do this over and over, you'll improve your own process, have better tools, remove or add steps.

Questions and Answers

Van Couvering: Definitely a theme that maybe you can talk to more is the tooling around this. How do you actually do the annotation is one big question. Then another one is about overall management of your notes, so you don't get lost in your own notes. Maybe you can talk more to those questions.

Bragg: I think that there's going to be no perfect answer on the tooling piece. As you were saying, David, you looked around a little bit, and there's not a ton. There's obviously room for improvement here. I think in the past, I've locked myself into writing some of these notes, and then summarizing them into more formal documentation in Lucidcharts and things like that. I think that there's room for improvement. I would love to see more like IDE tools that you can use that add some associated metadata, and then you can upload that into Confluence or something.

Van Couvering: Yes, or even be able to annotate, and have those annotations be part of the code itself that gets checked into GitHub, some standard annotation. The closest you have is comments.

Bragg: You have comments. I think that we also try to document code with like BDD tests and things like that. As we've seen, those are hard to keep up to date too.

Van Couvering: One person did mention that VS Code has bookmarks they can add to the code. We talked about using breakpoints to annotate code and then doing notes associated with that.

I was going to ask you about the visual stuff. You had one example of the class diagram. Another thing that I know I've done is flow diagrams, whether it's sequence diagrams or the diagram where you have different entities and they just show the steps using one through four, the step numbers, things like that, to show the flow.

Bragg: I think that depends on the type of entry point that you're looking at. The example that we did, the first one was building up state, which I feel a little bit like a sequence diagram is not great for. I care a lot about all the configuration and so the class diagram made sense there. Then the later parts where you can have procedural, that's where those flow diagrams and sequence diagrams can be really helpful.

Van Couvering: Did you use a tool to autogenerate the class diagram, or did you create it by hand just for the properties you note as important?

Bragg: I tried to use some of the tools, but the tools brought in a bunch of other variables that weren't part of the initial instance. In time crunch, I manually created that one.

Van Couvering: I actually have found that tools can be almost as overwhelming as the code, because the class diagrams, they don't know how to separate the details that are important from those that aren't, and so you end up with a class diagram that can be as confusing as the code. I actually learn better when I draw them by hand. It helps it sink in more when I actually draw it by hand.

Bragg: IntelliJ has an autogenerated class diagram thing. If you say like, add variables or add methods, it adds everything, and you don't have any way of saying, I only want to see some of the stuff I care about.

Van Couvering: There's a couple of questions related to instrumentation, tracing tools, and static analysis. Have any of these tools helped you in the past, or have you found them to be useful?

Bragg: I think that it depends on identifying some of the things you actually want to read. Sometimes I'm like, I don't even know where to start with some of these entry points, if I'm looking at a library or whatever. Observability tools can be good to point me in a certain direction, or tell me what state exists. I'm a big proponent of tracing, instrumentation, debugging, things like that to help you walk through. Even in this example, it probably would have been good if I could have used a debugger to see what the state was, because I ended up having to spend a bunch of time looking at the instantiation, because there was that much state.

Van Couvering: How do you share what you learned with the rest of the team to make sure no one has to dive into that same code base or knowledge again?

Bragg: For me, I'm a very visual person. To me, it's always about creating visualizations, explaining with examples. When I write these notes, a lot of times, it's very rough for me, and that's why the tagging is really important in summarizing what the code does, and not trying to document every line. Because then I want to take that and make it readable for everybody, whoever the audience is, and create visualizations and things like that. It's an iterative process.

Van Couvering: Once you've done that, there is still the risk that it will also get out of date, like any other documentation.

Bragg: It is true. That is the documentation dilemma where, as soon as you write it, it's stale. I think that that's where it would be good to have good tooling and have a way to associate it with the code in Git, because then if they change the code in merge reviews, then you could have it at least notify you that your documentation is outdated.

Van Couvering: I just was curious what has been one of your more challenging code bases. Then when you conquered it, what benefits did you see out of finally getting it figured out?

Bragg: Some of the most challenging code bases are usually the ones internal, because there's zero documentation and stuff. I think that, to me, I like understanding, getting at least a base understanding of how something works. That helps me understand when I see issues or if I need to do reviews or rewrite it. It's really useful. I'm trying to give, like you said, one big win. I think that I do this all the time. To me, it's just like, it was codifying something that I already do all the time, because I find it necessary. I can never just jump into these code bases and just write code without understanding it somewhat. For this Kafka example, we started using it pretty heavily and there's so many different weird issues that pop up. Having at least even like the base understanding like we walked through, is really valuable to understand like, "I understand that happens because there's all this internal state that it managed, and the metadata is out of date," or whatever. I feel like you save a lot of time on your day to day, once you understand it.

Van Couvering: I do want to share it, because part of the reason I invited Jordan is because he and I worked together at Castlight, and I saw incredible value when you did all that work studying Kafka, and how it enabled us to be much more successful with it. I do think it is a superpower to have these skills and be able to quickly learn a code base and then apply it to what you're trying to build in-house. I found it really valuable. I've always been impressed with Jordan's ability to quickly understand code. Now I know the secrets too.

Bragg: It seems straightforward to me. Like I said, I do wish there was more tooling. I think that where internally I find it most useful is, at Castlight, we've been going more towards microservices and breaking up our domains and stuff. Being able to jump into these big monoliths, understand the domains, and break them apart is really useful, especially if you own some table and you don't know that somebody else is going directly to said table. It's good to dig in and understand those things.




Recorded at:

Sep 11, 2022