On Test Levels and Coverage

I'm lucky to work with some of the most talented engineers I've ever met. As a member of the Canonical QA department, we frequently talk about automated code testing, things like: What's the best strategy to test declarative UI languages? How do we test difficult-to-reach-place X? How do we measure how many tests we should have at each level? etc. etc.

There is, however, one discussion (some might say 'argument') that I find myself frequently engaged in, and in which I don't seem to be able to gain much traction. So, instead of repeating my viewpoint ad nauseam, I'll explain myself here, and I can point people to it in the future.

Note: While all the code snippets in this post are either python or pseudo-python, the content of this post applies to any language.

The Question

The question we're often trying to answer is:

How many tests do I have to write before I can confidently release my software to the public when all the test suites pass?

And a common answer is:

When you have 100% test coverage.

Now, when you ask what "test coverage" means, you'll often be told:

Line and branch coverage, as measured during your unit test run.

I think this is wrong, and I'm going to explain why I think it's wrong in a fairly roundabout manner.

Example Code

Typical software is not written in isolation. For anything but the simplest of software, you need to talk to external components ("external" meaning "outside your source tree"). Here's one such example from autopilot, the acceptance test tool I maintain:

import json
import subprocess


def _get_click_app_id(package_id, app_name=None):
    for pkg in _get_click_manifest():
        if pkg['name'] == package_id:
            if app_name is None:
                # py3 dict.keys isn't indexable.
                app_name = list(pkg['hooks'].keys())[0]
            elif app_name not in pkg['hooks']:
                raise RuntimeError(
                    "Application '{}' is not present within the click "
                    "package '{}'.".format(app_name, package_id))

            return "{0}_{1}_{2}".format(package_id, app_name, pkg['version'])
    raise RuntimeError(
        "Unable to find package '{}' in the click manifest."
        .format(package_id)
    )


def _get_click_manifest():
    """Return the click package manifest as a python list."""
    # get the whole click package manifest every time - it seems fast enough
    # but this is a potential optimisation point for the future:
    click_manifest_str = subprocess.check_output(
        ["click", "list", "--manifest"],
        universal_newlines=True
    )
    return json.loads(click_manifest_str)

This isn't the best, or the worst, code I've ever seen; it's reasonably readable, and it comes with a suite of tests that provide 100% test coverage. We generate branch and line coverage with the wonderful coverage.py tool, and the report shows every line and branch of this module covered.

So, based on the common answer of "If all your tests pass you should be confident to release", should I be able to release this code when all my unit tests pass? Hell no!

The misunderstanding

The problem is that in the original answer,

When you have 100% test coverage

we have a misunderstanding about the meaning of "test coverage".

An Alternative Definition

I propose that "test coverage" is actually a rather useless term. To me, it means "100% coverage for all tests at all levels". The important thing to note here is that this does not apply only to unit test coverage. So, what other types of tests are there?

This is where the discussion usually runs into trouble. The terminology around software testing is hopelessly fragmented, with the same terms often meaning different things to different people. What follows is a non-exhaustive list of tests at different levels.

This list is in order from the lowest test levels to the highest. A general rule of thumb is that you need more individual tests at a lower level, and fewer at a higher level. Similarly, lower level tests will be smaller and simpler, while higher level tests will be larger, and tend to be more complex.

Unit Tests

Purpose

The purpose of a unit test is to verify the correctness of low level algorithms. Unit tests tend to concentrate on parts of the code that manipulate data.

Definition

A test at the lowest level. It must test a single unit of code. Another way of thinking about this is that a test must be able to fail in a single way only. It really must test only one thing. Sometimes a 'unit' of code will be a single method or function, most often it'll be an even smaller unit of code.

Depending on the structure of the code under test, unit tests may need to resort to mocking or patching out other parts of the code under test. Reliance on mocks and patches is a sign of poor code structure though, so excessive use of these techniques at this level should make test authors nervous.

Example:

Given the following code:

def _get_property_file_path():
    return '/system/build.prop'


def _get_property_file():
    """Return a file-like object that contains the contents of the build
    properties file, if it exists, or None.

    """
    path = _get_property_file_path()
    try:
        return open(path)
    except IOError:
        return None

We need to write unit tests for the _get_property_file function. This function is nice and simple. It:

  1. Calls another function for a path to open.
  2. Opens the path for reading, and returns a file-like object to the caller.
  3. If the open failed, catch the exception and return None.

Since this function does more than one thing, we need more than one unit test.

The first test verifies that the path returned by the _get_property_file_path function is correctly opened and returned. We do this by patching the _get_property_file_path function to return the path to a file that we create in the test and populate with a unique string. (Note: I've edited the test a bit, to remove some things that would otherwise detract from my example.)

def test_get_property_file_opens_path(self):
    token = self.getUniqueString()
    with tempfile.NamedTemporaryFile(mode='w+') as f:
        f.write(token)
        f.flush()
        with patch('_get_property_file_path') as p:
            p.return_value = f.name
            observed = platform._get_property_file().read()
    self.assertEqual(token, observed)

There are a few things to take notice of here:

  • We patch the _get_property_file_path function so we can override what it returns. We could also have restructured the code to accept the path as a parameter, which would, arguably, be cleaner (see the sketch after this list).
  • We're cheating a little bit: this test verifies that the function returns a file-like object (at least, one that supports read()), and that the file-like object contains the string we wrote to the file. Unit test purists may suggest splitting this into two separate tests, but I'm slightly more pragmatic than that.
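
To illustrate that alternative, here's a hedged sketch of the restructured version (the default value is my own choice; it just preserves the current behaviour):

def _get_property_file(path='/system/build.prop'):
    """Return a file-like object for the build properties file, or None.

    Accepting the path as a parameter means tests can pass a temporary
    file directly, with no patching required.
    """
    try:
        return open(path)
    except IOError:
        return None

With that signature, the test above could simply call platform._get_property_file(path=f.name) and drop the patch() block entirely.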

Then we need several tests to check the error conditions. This first one makes sure that the function returns None when we don't have permissions to open the file:

def test_get_property_file_returns_none_without_read_permission(self):
    with tempfile.NamedTemporaryFile(mode='w+') as f:
        os.chmod(f.name, 0)
        with patch('_get_property_file_path') as p:
            p.return_value = f.name
            observed = platform._get_property_file()
    self.assertIsNone(observed)

We also need similar unit tests that check that the function returns None when the file does not exist, and possibly in a few other cases.
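
A sketch of the 'file does not exist' case might look like this (simplified in the same way as the examples above, so the patch target isn't a full import path):

def test_get_property_file_returns_none_when_file_missing(self):
    with patch('_get_property_file_path') as p:
        # A path that should not exist on any sane system.
        p.return_value = '/nonexistent/build.prop'
        observed = platform._get_property_file()
    self.assertIsNone(observed)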

Summary

As the example shows, unit tests are concerned with verifying the correctness of a single unit of code. In this case, the calculation we're performing is simple: "if we can open file X, then return the open file, otherwise return None". However, the unit test does not test how this unit of code interacts with any other unit of code. In my opinion, this is a major source of software defects. For that, we need a higher level of tests...

Integration Tests - Internal

Purpose

While unit tests test a single unit of code, integration tests test how two or more units work together.

Definition

Integration tests are tests that cover two or more functions (or methods, or any other callable), and concentrate on how they integrate with each other. "Internal" integration tests test functions within the same codebase.

When writing integration tests, we start to consider the architecture of the software. If your code is well written, this is something that we have not had to do until now - unit tests shouldn't have to know or care about much outside the function they're testing.

Integration tests are primarily concerned with two things:

  1. The format of data being passed between functions. This includes:
    • Format/contents of parameters passed from one function to another.
    • Format/contents of data returned from one function to another.
    • Set of possible error codes returned from one function to another (this is really just a subset of the point above).
  2. For languages that support this, the exceptions that can be raised by a function.

Example

Consider the following code:

def _get_click_app_id(package_id, app_name=None):
    for pkg in _get_click_manifest():
        if pkg['name'] == package_id:
            if app_name is None:
                # py3 dict.keys isn't indexable.
                app_name = list(pkg['hooks'].keys())[0]
            elif app_name not in pkg['hooks']:
                raise RuntimeError(
                    "Application '{}' is not present within the click "
                    "package '{}'.".format(app_name, package_id))

            return "{0}_{1}_{2}".format(package_id, app_name, pkg['version'])
    raise RuntimeError(
        "Unable to find package '{}' in the click manifest."
        .format(package_id)
    )


def _get_click_manifest():
    """Return the click package manifest as a python list."""
    click_manifest_str = _load_click_manifest_content()
    return json.loads(click_manifest_str)


def _load_click_manifest_content():
    """Load the click manifest file from disk, and return it as a string."""
    click_path = _get_click_manifest_path()
    return open(click_path, 'r').read()

This code isn't wonderful - but it serves to illustrate the purpose of integration tests. Let us assume that the _get_click_app_id function has been fully unit tested. As in the unit test example, the unit tests would mock out the _get_click_manifest function, and return a python list of dictionaries containing what we expect to be returned by the json.loads call.
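
Such a unit test might look something like this - a sketch, with an invented package name and the patch target simplified as before:

def test_get_click_app_id_builds_id_from_manifest_entry(self):
    manifest = [{
        'name': 'com.example.foo',
        'version': '1.0',
        'hooks': {'foo-app': {}},
    }]
    with patch('_get_click_manifest', return_value=manifest):
        observed = _get_click_app_id('com.example.foo')
    self.assertEqual('com.example.foo_foo-app_1.0', observed)

Note that this is still a unit test: _get_click_manifest is mocked out, so nothing here tells us whether the two functions actually agree with each other.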

We have here three separate functions, one of which opens an external resource (we'll talk about that separately, later).

There are many possible bugs that could occur here, even with 100% unit test coverage. To name a few:

  1. What happens when _load_click_manifest_content cannot open its file? It raises an IOError (which would have been verified in a unit test) - is that error condition handled by its calling functions? (nope!)
  2. What happens when the string returned by _load_click_manifest_content is not valid json, or cannot be parsed by the json module for some other reason? It will raise a ValueError - do callers handle that error case? (nope!)
  3. What happens when the python object returned by _get_click_manifest does not contain the items expected, or contains the right items, but in the wrong format?

Architecturally, we have code that looks like this:

Caller <--> _get_click_app_id <--> _get_click_manifest <--> _load_click_manifest_content <--> external resource

If we were writing unit tests, we'd care about the units themselves. Here, we care about the communication between them. In order to test this, we need to control both ends of the call chain. In this case, we control the external dependency by mocking _load_click_manifest_content to return data we set in the test, and the test is the caller, so we control that end as well.

For example, we might write a test that looks something like this:

def test_cannot_open_manifest_file_results_in_RuntimeError(self):
    with patch('_load_click_manifest_content', side_effect=IOError):
        self.assertRaises(
            RuntimeError,
            lambda: _get_click_app_id('foo')
        )

This is the integration test you'd write to test scenario 1, above. This test will fail, exposing the fact that somewhere between _get_click_app_id and _load_click_manifest_content, that particular error condition is not covered (of course, somewhere else in the code we'd have to handle that raised RuntimeError, and that would have to be tested again, but in a higher-level integration test).
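
A similar test for scenario 2 (again a sketch) mocks only the external-facing function and lets the real json.loads run against invalid content:

def test_invalid_manifest_content_results_in_RuntimeError(self):
    with patch('_load_click_manifest_content', return_value='not json'):
        self.assertRaises(
            RuntimeError,
            lambda: _get_click_app_id('foo')
        )

Like the first test, this will fail until the ValueError from json.loads is caught somewhere in the chain and converted into an error the caller actually knows about.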

Measuring Coverage

Measuring "test coverage" for unit tests is easy: you run your coverage tool, and make sure that all lines of code and branches of execution are covered by at least one test (assuming your code is well structured).

Measuring test coverage for integration tests is harder. First, it's important to note that your integration tests need to be measured separately from your unit tests. The reason is simple: if you have 100% unit test coverage, you won't spot any missing integration test coverage, since all your lines will already be covered.

Even with a separate coverage report, you cannot simply look at the overall coverage percentage. Instead, you need to make sure that all of the following are covered:

  1. Any point where function A calls function B: you need to verify that A passes the correct parameters to B (see the sketch after this list).
  2. Any point where function B returns data to function A: you need to verify that B returns data in the format that A expects. If A simply returns the data to its caller, then you need an integration test that covers that caller as well.
  3. Any point where a function returns an error code or raises an exception: you must make sure that the error code is handled correctly by the calling function. Exceptions can be harder to deal with, since they will propagate up the stack, meaning that you may need to write integration tests that cover many functions (more on that later).
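
For point 1 (and part of point 2), the test can assert on how the lower-level function was called and on what the caller did with its return value; a sketch, reusing the click example:

def test_get_click_manifest_parses_loaded_content(self):
    with patch('_load_click_manifest_content', return_value='[]') as loader:
        manifest = _get_click_manifest()
    # _get_click_manifest takes no arguments itself, so the loader should be
    # called exactly once, with no arguments, and its string return value
    # should come back parsed as a python list.
    loader.assert_called_once_with()
    self.assertEqual([], manifest)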

I have not yet found a way of producing a nice report that makes it easy to see where integration tests are missing. My general process is to start from a unit test coverage report and look at call sites, return statements, and places where exceptions are raised.
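
In the absence of a proper tool, the closest I can offer is a rough sketch of that manual process using python's standard ast module - it simply lists call sites and raise statements as candidate integration points to tick off by hand:

import ast


def list_integration_points(source_path):
    """Print call sites and raise statements found in a source file."""
    with open(source_path) as f:
        tree = ast.parse(f.read(), filename=source_path)
    for node in ast.walk(tree):
        # Only plain function calls are reported here; method and attribute
        # calls are skipped to keep the sketch short.
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            print("line {}: call to {}".format(node.lineno, node.func.id))
        elif isinstance(node, ast.Raise):
            print("line {}: raise statement".format(node.lineno))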

High Level Tests

Before any integration tests can be written, you need to decide where to cut the call chain. A piece of complex software will typically have hundreds of calls between the highest and lowest level functions. Obviously we do not want to write an integration test that covers that much code.

Hopefully, your software has been written in concentric shells of functionality, which makes the decision about where to cut the chain easy: write integration tests between the exposed shell interface, and the shell beneath it, or the external interface, whichever comes first. If you're writing a functional core with an imperative shell, your life will be much easier. Writing code in this style also helps because the exposed interface of each shell can be a sensible, compact API. This gives us a good place to consolidate raised exceptions, and to return a few, well known errors.
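
As a hedged illustration of what that kind of shell interface could look like for the click example (the public function and the exception type are my own invention, not autopilot's API):

class ClickPackageError(Exception):
    """The single, well-known error this shell exposes to its callers."""


def get_click_app_id(package_id, app_name=None):
    """Public shell entry point: consolidate low-level failures."""
    try:
        return _get_click_app_id(package_id, app_name)
    except (IOError, ValueError, RuntimeError) as e:
        raise ClickPackageError(str(e))

Integration tests can then be written against get_click_app_id and the shell beneath it, and callers only ever need to handle ClickPackageError.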

Integration Tests - External

Purpose

Verify that your external data sources behave the way you expect them to.

Definition

Similar to internal integration tests, external integration tests focus on data flowing between units of code. In this case, however, one of the units comes from an external data source. These external data sources include:

  • Libraries that you link to (python packages / modules, for example).
  • Some physically remote system with an API (for example, an HTTP server with a restful API, or an XMLRPC server).
  • A file on disk that is supposed to contain data in a certain format.
  • A separate process that is supposed to print data in a certain format to one of its file descriptors.
  • ...and many many more examples.

Example

Well structured code tends to exist in concentric shells of functionality, where each layer contains more specific, low-level code. Let's continue with the example from the previous section. We have three units of code that look like this:

_get_click_app_id <--> _get_click_manifest <--> _load_click_manifest_content <--> external resource

The "external resource" in this case is a file on disk. We care about the communication between _load_click_manifest_content and this external data source. Note how the communication with the external data source is in the lowest-level code. To test this communication, we need to cover several possible error conditions:

  • The data source does not exist.
  • The data source exists, but we don't have permissions to read it.
  • The data source exists, but is empty.
  • The data source exists, but contains data that's different from what we were expecting.

Obviously, the exact data source in question will determine the tests that need to be written: external libraries might raise exceptions, for example, while files on disk cannot. A common source of errors when dealing with third party libraries is changing APIs. External integration tests make sure that you'll know about these changes as soon as possible. Files on disk are about the simplest of all data sources we can deal with.

An external integration test for _load_click_manifest_content might look like this:

def test_can_open_click_manifest_file(self):
    content = _load_click_manifest_content()
    self.assertNotEqual("", content)

Notice that we don't mock out the external data source - we want to get the real thing, or we're not testing anything at all.
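
We could push the same idea a little further and check the format contract as well; a sketch (it assumes json is imported in the test module):

def test_click_manifest_content_is_valid_json_list(self):
    content = _load_click_manifest_content()
    # The rest of the code iterates over the manifest, so it must parse as
    # a json list - if the external format ever changes, this test tells us.
    self.assertIsInstance(json.loads(content), list)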

Running External Integration Tests

Why separate internal from external integration tests? The reason is simple: it can often be difficult to obtain the data source on a development machine. For example, the file on disk in the previous example might only be created on certain platforms, or with certain software installed.

Generally speaking, you want to keep tests that require external resources separated from tests that do not. With this separation in place, we can run all the tests that don't require any external resources on developer machines, and tests that do require external resources can be run in a QA lab, where these external resources can be provided in a controlled manner.
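
One simple way to enforce that separation with python's unittest is to skip the external tests when the resource isn't available - a sketch (the skip condition is just an example; in the QA lab you'd expect it to always hold):

import shutil
import unittest


@unittest.skipUnless(
    shutil.which('click'), "Requires the click tool to be installed.")
class ClickManifestExternalTests(unittest.TestCase):

    def test_can_open_click_manifest_file(self):
        content = _load_click_manifest_content()
        self.assertNotEqual("", content)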

Coverage

How do you tell when you've covered all your external integration points? It's tough. The only real way is to have well structured code, where communication to these external components is reduced to a few small units of code. If anyone has any ideas regarding how to track these integration points, I'd love to hear them!

Acceptance Tests

Purpose

Ensure that the user workflow behaves as designed.

Definition

Even with the best unit and integration test coverage, it's still possible (nay, likely!) that your software will contain bugs. This is especially true for graphical applications, where the layer of code closest to the user (the UI) is essentially untested, since you don't control the UI toolkit code. It's reasonably common to see bugs creep in due to mistakes made when coding the UI. For example, toolbar buttons that do nothing when clicked, model views that don't render the underlying data as intended, pagination that's broken, etc. etc. These bugs are almost always impossible to test for in one of the lower levels, often due to the fact that the UI toolkit does not allow you to call into it without it actually starting the application UI.

Good acceptance tests try to emulate the user, and treat the software as a black box. The test does not know or care about the underlying algorithms or architecture of the code. The test generates keyboard and mouse events, and uses a toolkit (such as the excellent autopilot) to figure out what the application under test does.

Example

Acceptance tests are difficult to demonstrate, since they tend to be written at a very high level, and rely on a lot of external components in order to work. This is an example from the Ubuntu clock app acceptance test suite (which uses autopilot as its test framework):

def test_delete_alarm_must_delete_from_alarm_list(self):
    """Test to delete an alarm and check if it has been properly removed."""

    # (set-up) Create a recurring alarm
    self.page.add_recurring_alarm(
        'Test alarm to delete', ['Monday', 'Tuesday'])

    old_alarm_count = self.page.get_num_of_alarms()
    self.page.delete_alarm(index=0)

    # Check that the alarm is deleted
    self.assertThat(self.page.get_num_of_alarms,
                    Eventually(Equals(old_alarm_count - 1)))

Other Types of Testing

That concludes the list of the three essential test levels. However, even with good test coverage at every level, there's a good chance you will release defective software. There are several types of testing we have not covered. For example, we've not mentioned security testing, memory leak detection, performance tests, etc. Those are topics for another day! This is not supposed to be an exhaustive list, but rather a minimum list of the tests you need before you can use your test suite as an indicator of software quality.

Summary

Before you can use your test suite as an indicator of your software quality, you need to have tests that cover the following areas:

  • Algorithmic Correctness: You need 100% (or as close to it as is realistically achievable) unit test coverage. These really do need to be proper unit tests though. Don't fall into the general trap of writing "unit" tests that test several units of code at the same time, or you'll end up masking potential errors in your coverage report.
  • Data Flow and Error Cases: You need integration tests, both internal and external, to ensure that the individual units of your code work well with each other, and understand how to handle errors raised by lower level components.
  • User Facing Functionality: Finally, you need acceptance tests that test the presentation layer between the underlying core algorithms and the user.

I have not yet seen a piece of software that meets these requirements (several come close). Writing software is hard work, and it's even harder when you add the burden of exhaustively writing tests. However, it's only a short term cost. Any software that hopes to survive a decade or more of real-world requirements changes cannot skimp on the tests and expect to survive.

My impression of the software industry is that when it comes to quality, we're still figuring out what we ought to be doing. I suspect that, while my suggested requirements may seem outlandish and extreme today, in 10 years' time they'll seem like the lowest possible level of testing required... but I might be wrong. What do you think? Let me know in the comments!

