Why and How to Test Logging

Key takeaways

  • Troubleshooting distributed, scalable systems (often composed of volatile infrastructure) depends on adequate logging and search facilities
  • Techniques such as unique event IDs, transaction tracing, and structured log output give us visibility into application behaviour and health
  • Logging no longer “slows down” performance; instead, there are important speed gains in system failure recovery, especially when using log aggregation
  • We need to test core operational requirements such as logging
  • We can test logging in a similar way to functional requirements, with user stories and BDD scenarios

Modern log aggregation and search tools provide significant new capabilities for teams building, testing, and running software systems. By treating logging as a core system component, and using techniques such as unique event IDs, transaction tracing, and structured log output, we gain rich insights into application behaviour and health, especially cross-component visibility. This article explains why it is valuable to test aspects of logging and how to do this with modern log aggregation tooling. This approach turns logging into a channel for making distributed systems more testable.

Logging speeds things up overall

Historically, many people considered that logging 'slowed down' software. With synchronous file I/O, slow disk storage, and even slower network speeds, this view had some merit, and so we tended to be judicious about what we logged from software running in the live environment. However, with asynchronous file I/O and SSD storage becoming the norm, and 1Gbps, 10Gbps, and even 100Gbps Ethernet increasingly common, the performance characteristics of logging now look different.

These days, outside time-critical applications such as financial trading and other algorithmically-sensitive situations, pure runtime execution speed is rarely something we should optimize for in software systems. Particularly in the context of distributed systems, cloud, and IoT, we need to consider the time it takes to restore service (often termed 'Mean Time to Recovery', or MTTR) following a failure, as well as the time it takes to determine the cause of problems in upstream (testing) environments.

Modern log aggregation and search tools – such as Elasticsearch/Logstash/Kibana, LogEntries, Loggly, Sematext, and so on – provide hugely empowering ways of interacting with our software as it runs: rich user interfaces for exploring application behaviour, and programmable REST APIs for searching and correlating events across multiple servers.

Although additional logging may result in a 5%-10% drop in raw runtime execution speed, the detailed information available in a single searchable location helps us to diagnose problems much more quickly, speeding up our incident response and often significantly reducing the time to find that awkward bug!

The combination of fast I/O and storage with modern log tooling – particularly when the tooling is available to all testers and developers – enables us to treat logging as a key part of our software systems; this leads us to ask: if logging is a key part of our software system, how can we test it?

Parcel tracking as an analogy

Most of us are familiar with online parcel tracking tools. Given a tracking ID, these allow us to see where our parcel is. These tools have two interesting features: they can trace a specific parcel through the delivery network, and they show the various states (or events) that can relate to a parcel at different times.

When tracking a parcel, we can see states such as ‘Arrived at depot’, ‘In transit’, and ‘Delivered’; these represent specific states or events and each has an internal identifier (ID) within the system – an event ID.

Within modern asynchronous distributed software systems, we can use a similar technique to trace the execution of operations across different component boundaries. To help us do this, we define some of our own event IDs that relate to the system that we’re working with.
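For instance (a purely illustrative log excerpt – the component names, the DocumentSaveRequested event, and the correlation value are hypothetical), three components handling the same operation might each log their own event ID, tied together by a shared identifier that plays the role of the parcel tracking ID:

2016-05-06 17:07:41 web-01   INFO DocumentSaveRequested corr=7f3acd10
2016-05-06 17:07:42 queue-01 INFO MessageQueued         corr=7f3acd10
2016-05-06 17:07:43 audit-01 INFO AuditRecordCreated    corr=7f3acd10

Searching for the shared identifier returns the whole journey, just as a tracking ID does for a parcel.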

Test for expected events and correlation IDs

We should not spend time testing the logging subsystem itself, such as log4net, log4j, etc.; we should assume that the mechanics of logging (writing to disk, rotating log files, flushing buffers, etc.) are already handled. Instead, we should concentrate on ensuring three separate but related things:

  1. Events that we expect to occur appear correctly in the log stream
  2. Transaction identifiers (aka correlation IDs) flow through the log stream as expected
  3. Events are logged at the appropriate level (Info, Error, Debug, etc.) – if we're using configurable log levels

Of course, by checking for these things, we exercise the logging subsystem and implicitly test that too. By addressing logging as a testable system component, we also tend to reduce the ‘time-to-detect’ for problems, increase team engagement, enhance collaboration, and increase software operability.

We need to define a set of event type IDs that correspond to useful and interesting actions or execution points in our software. Exactly how many of these IDs you use depends on your software, but at a minimum we have ApplicationStarted and things like DatabaseConnectionFailed or DocumentStoreUnavailable (define additional IDs when these are useful – don’t try to define all possible events up front).

For instance, if we’re building an ecommerce application in C#, we might have:

public enum EventID
{
    // Badly-initialised logging data
    NotSet = 0,
    // An unrecognised event has occurred
    UnexpectedError = 10000,

    ApplicationStarted = 20000,
    ApplicationShutdownNoticeReceived = 20001,

    PageGenerationStarted = 30000,
    PageGenerationCompleted = 30001,

    MessageQueued = 40000,
    MessagePeeked = 40001,

    BasketItemAdded = 60001,
    BasketItemRemoved = 60002,

    CreditCardDetailsSubmitted = 70001,

    // ...
}

We use human-readable names for the event IDs along with unique integer values that group together related or similar events: here, all event types relating to ‘Basket’ have integer IDs between 60000 and 69999. When our software reaches the corresponding point in the code, it writes the relevant event type ID to the log file along with the other log data. This in turn gets picked up by the log aggregation system and made available for searching (via a browser and an API).
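For instance, here is a minimal sketch of how the application code might emit one of these events, assuming log4net is the logging subsystem (the BasketService class and its parameters are hypothetical):

using log4net;

public class BasketService
{
    private static readonly ILog Log = LogManager.GetLogger(typeof(BasketService));

    public void AddItem(string basketId, string sku)
    {
        // ... add the item to the basket ...

        // Write the integer event type ID and its human-readable name into
        // the log message so the aggregation system can index and search both
        Log.InfoFormat("{0} {1} basket={2} sku={3}",
            (int)EventID.BasketItemAdded, EventID.BasketItemAdded, basketId, sku);
    }
}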

If we want, say, to automate a test that checks whether expected or unexpected events occurred in the log stream, we can query for the event via a simple API call using curl.

For example, we might want to check that a database query occurred (expected events DatabasePreQuery and DatabasePostQuery) and there was no connection issue (unexpected event DatabaseConnectionFailed).

This is the curl command to query the Elasticsearch API (running on localhost) for DatabasePreQuery events (you can find it on GitHub as well):

$ curl -XGET 'http://localhost:9200/_search?q=message:DatabasePreQuery&pretty'

The result of this query could be, for example (formatted for clarity, with line numbers added for reference):

1  {
2    "took" : 19,
3    "timed_out" : false,
4    "_shards" : {
5      "total" : 20,
6      "successful" : 20,
7      "failed" : 0
8    },
9    "hits" : {
10     "total" : 1,
11     "max_score" : 11.516103,
12     "hits" : [ {
13       "_index" : "logstash-2016.05.06",
14       "_type" : "logs",
15       "_id" : "gHEKyHasRb6GaUhM1gywpg",
16       "_score" : 11.516103,
17       "_source":{"message":
18                    "[2016-05-06 17:07:42] slim-skeleton.INFO: DatabasePreQuery [] []",
19                    "@version":"1",
20                    "@timestamp":"2016-05-06T16:07:42.749Z",
21                    "host":"vagrant-ubuntu-trusty-64",
22                    "path":"/var/www/wibble/logs/app.log"}
23    } ]
24  }
25 }

Line #10 above shows us that the query matched exactly one log stream entry (total hits was 1); the matched log entry begins at line #13, with the actual log message starting at line #18.

We can then feed these search results into our database query test, using our tooling of choice to parse the JSON response and determine which events did or did not occur.

For example, here is a basic Ruby implementation of the test (you can find it on GitHub as well):

1 require 'json'
2 
3 file = open("Database_prequery_search.json")
4 prequery = JSON.parse(file.read)
5
6 file = open("Database_postquery_search.json")
7 postquery = JSON.parse(file.read)
8
9 file = open("Database_connectionfailed_search.json")
10 connectionfailed = JSON.parse(file.read)
11
12 expected_prequery_event  = (prequery["hits"]["total"] == 1) 
13 expected_postquery_event = (postquery["hits"]["total"] == 1) 
14 unexpected_connectionfailed_event = (connectionfailed["hits"]["total"] == 0)
15
16 expected_prequery_event && expected_postquery_event && unexpected_connectionfailed_event

We're using the 'json' Ruby gem to parse the results of the pre- and post-query curl searches that we previously saved to suitably named files (first 10 lines). Lines #12 to #14 state our expectations for the test result (i.e. the log stream should contain a single DatabasePreQuery hit, a single DatabasePostQuery hit, and no DatabaseConnectionFailed hits). The last line is the actual test (Ruby returns true if all of our expectations were met, and false otherwise).

More complicated tests (or analysis of incidents) might, for instance, require searching for all Database events in a given time period and counting how many queries, failures, and so on occurred. The approach, however, is the same as described above, with slightly more complicated code for iterating over a larger JSON response (a sketch of the counting step follows the query below).

Here's an example of a curl query for that class of tests (you can find it on GitHub as well):

$ curl -XGET 'http://localhost:9200/_search?pretty' -d '
{
   "query" : {
      "bool" : {
         "must" : [
            { "query_string" : { "query" : "message:Database*" } },
            { "range" : { "@timestamp" : { "gte" : "now-10m/m" } } }
         ]
      }
   }
}'
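The iteration itself is straightforward. Here is a minimal sketch in C# that counts the occurrences of each Database event type, assuming the response above was saved to a file named Database_events_search.json (a hypothetical name) and using the Newtonsoft.Json library:

using System;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using Newtonsoft.Json.Linq;

class EventCounter
{
    static void Main()
    {
        // Hypothetical file containing the saved output of the query above
        var json = File.ReadAllText("Database_events_search.json");

        // Extract the event name (e.g. DatabasePreQuery) from each log message
        // and count how many times each one occurred; a real test would page
        // through results or set a large 'size' parameter on the query
        var counts = JObject.Parse(json)["hits"]["hits"]
            .Select(hit => Regex.Match((string)hit["_source"]["message"], @"Database\w+").Value)
            .GroupBy(name => name)
            .ToDictionary(group => group.Key, group => group.Count());

        foreach (var pair in counts)
            Console.WriteLine($"{pair.Key}: {pair.Value}");
    }
}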

We can also trace a single path of execution through the system using a correlation ID that we inject into the system when the activity is first initiated: a user clicking on a button, or a batch job starting, for instance. As long as the correlation ID is sufficiently unique for our searches in the log aggregation tool, we will see only results that relate to that query.
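For example, with a hypothetical correlation ID of 7f3acd10, the same kind of curl query we used earlier returns every log entry for that one request:

$ curl -XGET 'http://localhost:9200/_search?q=message:7f3acd10&pretty'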

By searching for the correlation ID, we can determine exactly which servers or containers were involved in processing the requests and, in effect, reconstruct the request journey. There are commercial tools that provide features like this, but by building in a few of these features ourselves, we gain valuable insight into the operation of the system.

User Stories for testing logging

Let’s say we have a browser-based system for the legal market that allows legal professionals to edit and save documents. The document data is saved to a document store database and a message is placed on a message queue. However, as part of the regulatory regime in one country, we need to ensure that the document meets certain requirements, so an ‘audit’ service listens for the messages on the message queue and then inspects the newly-updated document.

Here we have an asynchronous, distributed system where apparent success in the browser app – the document was successfully updated – might actually need further workflow to be initiated (say, if the document audit finds problems with the document).

By using log aggregation with event IDs and transaction tracing, we gain the ability to assert that specific log messages should appear in the log aggregation system based on actions in the main transactional system.

Specifically, if we know that the audit service should write a log message with the event ID ‘AuditRecordCreated’, we can set up a test to search for that ID after the web app user journey has completed.

With this ability in place to assert our expectations about system behaviour, we can write behaviour-level tests similar to this:

Given I run a scenario as a Lawyer
  And I create a document
  [And I wait 5 seconds]
When I search the logs API
Then I should find a recent entry for "AuditRecordCreated"
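As a minimal sketch of how the final step might be implemented – here using SpecFlow (Cucumber for .NET) with NUnit assertions against the Elasticsearch endpoint from earlier; the same step could equally be written with the Ruby tooling shown above:

using System.Net.Http;
using Newtonsoft.Json.Linq;
using NUnit.Framework;
using TechTalk.SpecFlow;

[Binding]
public class LogSearchSteps
{
    [Then(@"I should find a recent entry for ""(.*)""")]
    public void ThenIShouldFindARecentEntryFor(string eventId)
    {
        using (var client = new HttpClient())
        {
            // Query the log aggregation API for the expected event ID;
            // a fuller implementation would restrict the search to a recent
            // time window using the range query shown earlier
            var json = client.GetStringAsync(
                "http://localhost:9200/_search?q=message:" + eventId).Result;
            var hits = (int)JObject.Parse(json)["hits"]["total"];

            Assert.IsTrue(hits >= 1, "No log entry found for " + eventId);
        }
    }
}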

This means that logging acts as a way for us to expand our testing of distributed systems: we are specific about the actions and events that we expect to be logged, and we search for these at specific times.

Conclusion

Troubleshooting distributed, scalable systems (often composed of volatile infrastructure) depends on adequate logging and search facilities. We need to log interesting events in our system and associate them with a given business transaction (such as a parcel delivery) via a unique correlation ID. Log aggregation and search tools allow us to trace a business transaction end-to-end with a simple ID query. We can also query a category of events to investigate component or system failures (e.g. find all Database events leading up to an incident). Finally, we saw that we can and should test these operational requirements in a similar way to functional requirements, namely via user stories and BDD scenarios.

About the Authors

Matthew Skelton has been building, deploying, and operating commercial software systems since 1998. Co-founder and Principal Consultant at Skelton Thatcher Consulting, he specialises in helping organisations to adopt and sustain good practices for building and operating software systems: Continuous Delivery, DevOps, aspects of ITIL, and software operability. Matthew curates the well-known DevOps team topologies patterns and is co-author of the books Database Lifecycle Management (Redgate) and Continuous Delivery with Windows and .NET (O’Reilly). @matthewpskelton

Manuel Pais is a Dev+Build+QA = DevOps advocate (and InfoQ lead editor). People-friendly technologist at Skelton Thatcher Consulting. Jack of all trades, master of continuous improvement, Manuel enjoys helping organizations adopt DevOps, Continuous Delivery and Cloud, from both technical and human perspectives. He has worked on a range of technologies (Java, .Net, web, mobile) and industries (banking, telecom, defense and aviation). @manupaisable 
Quick facts: favourite race: fim da Europa (end of Europe), favourite wine: red, favourite accomplishment: being a dad.
