Inter-service testing with Acceptable

The Ubuntu Core snap store is architected as a number of smallish, independent services. In this week's post I want to talk about some of the challenges that come from adopting a distributed architecture, and how we're working to resolve them.

Overview

In a typical monolithic architecture, the functionality that you ship is contained within a single codebase. Before deploying a new version of your service you probably run an extensive suite of tests that cover everything from high-level functional tests to low-level unit tests. Writing these tests is mostly a solved problem at this point: if you want to write a functional test that covers a feature that resides in several units of code you can do that - it will probably involve mocks or doubles, but the techniques are well understood.

[Diagram: functional tests covering the interaction between many units in one codebase.]

Functional tests are relatively easy to write when all the code being covered lives in the same codebase.

Similarly, tests that cover the interaction between just two components in a system are easy to write, since everything is in the same codebase: we can write a test that calls the first component, causing it to call the second and return a result. To determine whether the interaction was successful we can make assertions on the result of the operation, as well as rely on the language's run-time checks (although this is one area where Python is not as helpful as a statically typed language).

[Diagram: unit tests covering the interaction between two units in one codebase.]

Testing interactions between two components in a single codebase is easy too.
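For instance, a test that covers two collaborating components in the same codebase can be as small as the following sketch (the greet_user and format_greeting functions are invented purely for illustration):

# Two collaborating components in the same codebase (hypothetical example).
def format_greeting(name):
    return 'Hello, %s!' % name


def greet_user(name):
    # The first component calls the second directly; no network is involved.
    return format_greeting(name.title())


def test_greet_user():
    # The test exercises both components and asserts on the final result.
    assert greet_user('alice') == 'Hello, Alice!'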

The key point to realise here is that even in a dynamically-typed language such as Python, we still rely on the language to do a lot of run-time checks for us. For example:

  • When calling a function you must provide the correct number of arguments.
  • When calling a function with keyword arguments you must spell the argument names correctly.
  • When unpacking the result from a function you must know how many values to unpack.

There are exceptions to all of these, but the general point remains: even in a dynamically-typed language, the language runtime catches a lot of low-level issues for you. (In terms of connascence, Python mostly takes care of connascence of name and connascence of position for us.)
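As a quick, contrived illustration of the kind of mistake the runtime catches for us - the create_user function below is invented for this example, and the test uses pytest to assert that the errors are raised:

import pytest


def create_user(username, email):
    return {'username': username, 'email': email}


def test_runtime_checks():
    # Wrong number of positional arguments is caught at call time...
    with pytest.raises(TypeError):
        create_user('alice')

    # ...as is a misspelled keyword argument name...
    with pytest.raises(TypeError):
        create_user(username='alice', emial='alice@example.com')

    # ...and unpacking the wrong number of values fails immediately.
    with pytest.raises(ValueError):
        username, email, age = create_user('alice', 'alice@example.com')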

The problem arises when we want to test a feature that has units of code in more than one service, and where the communication protocol between those services is something reasonably low-level. For the snap store, we've standardised on HTTP to transport JSON payloads between services. All of a sudden, testing inter-service communications starts to look like this:

[Diagram: calls across codebases over HTTP.]

Consider a function that makes a request to a remote service with a query, and returns the result. This simplified code-snippet shows how such a function might look:

import requests

def set_thing_on_service(username, thing_name):
    url = 'https://remote-service.internal/api/path'
    json_payload = {
        'username': username,
        'thing': thing_name,
    }
    response = requests.post(url, json=json_payload)
    return response.json()['status']

Of course, a real function would contain a lot more error-checking code. The key thing to notice here is that since we're using HTTP, there's nothing stopping us from sending completely the wrong payload from this function. How do we know that json_payload has been created correctly? How do we know that the service response contains a status key? Clearly we need to write some tests to gain confidence that this code does what we want it to. There are several approaches that can be used here...

Mocking

This is the approach that we've followed up to this point. The basic idea is that we mock out the code that sends data over HTTP, and instead our mock returns a response as if the remote server had responded to us.

The author of the code above writes a test that looks like this:

import io
from unittest import mock
from requests.packages.urllib3.response import HTTPResponse
import production_code  # contains the production code we're testing.


def test_set_thing_on_service():
    # 1.- Set up a mock so we don't actually send anything out on the wire.
    #     instead, our mock will always return a 200 response with the
    #     response payload we expect.
    with mock.patch('production_code.requests.adapters.HTTPAdapter.send') as mock_requests:
        mock_requests.return_value = HTTPResponse(
            status=200,
            reason='OK',
            body=io.BytesIO(b'{"status": "OK"}'),
            headers={},
            preload_content=False,
        )

        # 2.- Call the code under test, get the returned value.
        return_value = production_code.set_thing_on_service(
            'username', 'thing')

        # 3.- Make assertions on the returned value and the calls that were
        #     made to the remote service:
        mock_requests.assert_called_once_with(...)
        assert return_value == 'OK'

If this were a real test (and not just an example in a blog post) I'd want to refactor this to hide some of the ugly setup code. Hopefully this example adequately illustrates some of the issues with this approach:

  1. While this test asserts that the requests mock was called with arguments that we expected, we have no guarantee that these are the arguments that the remote service expected. Even if we ensure this is the case when this code is written, if the remote service ever changes its API this test will continue to pass, despite the code now being incorrect.
  2. Similarly, we're making assumptions about what the remote service sends in response to our query.
  3. We're even assuming that the API exists in the first place - our mock will catch all HTTP requests, regardless of destination URL or HTTP method.

Finally, it's likely that the developer writing the test case is the same one who wrote the production code. This makes it much more likely that any faulty assumptions the developer had while writing the production code will make it into the test as well, in effect hiding the issues present.

This approach does catch some issues, particularly when the logic to determine exactly what to send to the remote service is complex. However, with refactoring, that complex code can usually be extracted and tested in isolation. In my experience, the vast majority of bugs in inter-service communication code end up being issues in the format of the requests and responses, rather than in any higher-level code.
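For example, one hypothetical way of doing that refactoring is to pull the payload construction out of the earlier snippet into a pure function that can be unit-tested without mocking HTTP at all (build_payload is an invented name):

import requests


def build_payload(username, thing_name):
    # Pure function: no HTTP involved, so it's trivial to unit-test.
    return {
        'username': username,
        'thing': thing_name,
    }


def set_thing_on_service(username, thing_name):
    url = 'https://remote-service.internal/api/path'
    response = requests.post(url, json=build_payload(username, thing_name))
    return response.json()['status']


def test_build_payload():
    assert build_payload('alice', 'widget') == {
        'username': 'alice',
        'thing': 'widget',
    }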

No Mocking

At the other extreme from mocking everything is... mocking nothing. If the service is small enough, then instead of mocking the remote service, we can simply run it during our test run. This is particularly useful when:

  • The remote service is stateless - i.e. it has no database or other forms of persistence. Even stateful services can be run if the database is lightweight and easy to set up.
  • The service itself is reasonably lightweight. This means: quick to start, has a reasonably small memory and CPU footprint.
  • The service itself does not require any other services to be running in order to be useful (that way madness lies).

A typical test might look like this:

import production_code  # contains the production code we're testing.


def test_set_on_service():
    # 1.- Start the remote service if it's not running. Configure local DNS
    #     such that the code under test will talk to this running process.
    ensure_service_is_running()

    # 2.- Call the code under test, grab the returned value...
    return_value = production_code.set_thing_on_service(
        'username', 'thing')

    # 3.- Make assertions on the returned data. There is no mock, so we
    #     can't (easily) check what data was sent, but we know that the
    #     communication happened with the _real_ remote service, so that's
    #     probably good enough.
    assert return_value == 'OK'

    # 4.- Stop the service at the end of the test. Probably use 'addCleanup'
    #     or whatever your test framework of choice supports to ensure this
    #     always runs, even if the above assertion fails.
    ensure_service_is_stopped()

In practice, using something like the excellent testresources to manage the remote service process allows you to minimise the overhead of having to start and stop the service for every test.
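The details will vary, but a rough sketch of that idea using the fixtures library (rather than testresources) might look like the following - the RemoteServiceFixture class and the run_service.py launcher are invented for illustration:

import subprocess
import time

import fixtures


class RemoteServiceFixture(fixtures.Fixture):
    """Run the remote service as a subprocess for the duration of a test."""

    def _setUp(self):
        # 'run_service.py' stands in for however the real service is started.
        self.process = subprocess.Popen(['python', 'run_service.py'])
        self.addCleanup(self.process.terminate)
        # A real fixture would poll a health-check endpoint instead of sleeping.
        time.sleep(1)


# In a testtools-style test case: self.useFixture(RemoteServiceFixture())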

If you can get away with it, this is an excellent approach to testing interactions between components. In reality, this is rarely a practical option.

A third option

We've found that mocking everything doesn't work, and mocking nothing is great, but rarely practical. Is there a third option, a compromise between the two above approaches?

The first insight towards understanding this third option is that the most common causes of problems in inter-service communication are at the "presentation layer". Most issues are with how the data sent between services is formatted - a few examples that I'm sure I've committed personally over the course of my career include:

  • Typo-ing a value in the production code, and then typo-ing it in the tests as well (editors are particularly bad at enabling this, especially those that "learn" your typos and then helpfully offer them as completions later on).
  • Forgetting the fact that the remote API requires an additional value (perhaps an HTTP header) to be set, and ignoring that fact in my tests as well.
  • Forgetting that some services are more delicate than others about whether a trailing '/' is present on a URL, and requesting the wrong resource in both my production and test code.
  • Confusing two remote APIs and calling the wrong one in my production code, then writing the unit test to the same assumptions.

A good middle ground between "mocking everything" and "mocking nothing", then, would be to somehow ensure that the format of the data being sent to a remote service is correct, while still mocking the actual operation of the remote service.

Layered services

Once we start thinking about data validation as being separate from the logic of a service, we can start to structure our services using a more layered approach, where presentation, logic, and persistence code are separated within the codebase:

[Diagram: a layered service.]

  • The presentation layer deals with the external API the service is exposing. In our case this means HTTP, Flask, and JSON.
  • The logic layer deals with the business logic in the service. This is where the actual work happens, but it deals with data types that are internal to the service (for example, we never pass a Flask request object into the business layer).
  • The persistence layer is where state is stored. Many services have some sort of database (in our case it's almost always PostgreSQL, but the specific database doesn't matter much). Some services may talk to other micro-services in their persistence layer.

The typical flow of a web request is:

  1. The presentation layer receives the request and validates the payload.
    • The presentation layer might return a response right away. For example, the service might be in maintenance mode, the request might be invalid, etc.
    • If the request is valid, the presentation layer typically converts the request into an internal data-type, and then forwards it to the logic layer.
  2. The logic layer receives a request from the presentation layer and actions the request.
    • Sometimes the logic layer can return a response right away - in the case of a stateless service the response is often something that's calculated on the fly (for example, imagine a signing service that verifies GPG signatures on signed documents).
    • Sometimes this means retrieving something from the database or from a third-party service. In both these cases this involves making a call into the persistence layer.
  3. The persistence layer retrieves the data requested from the database or service in question. The persistence layer usually has to deal with some concerns that are separate from the rest of the system. For example, it might have to talk to an ORM like sqlalchemy, or it might have to speak HTTP to some remote service.

The stack unwinds as you'd expect it to. The last thing that happens is that the presentation layer finally converts the response into something that Flask understands, and that response is sent out over the wire.
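To make that concrete, here is a minimal (and entirely hypothetical) sketch of the three layers for the earlier "set thing" API; the names and internal data type are invented for illustration:

from dataclasses import dataclass

from flask import jsonify, request


@dataclass
class SetThingRequest:
    # Internal data type: the logic layer never sees the Flask request object.
    username: str
    thing: str


def set_thing_view():
    # Presentation layer: validate and convert the HTTP request, then turn
    # the logic layer's answer back into an HTTP response.
    payload = request.get_json()
    internal_request = SetThingRequest(payload['username'], payload['thing'])
    status = set_thing(internal_request)
    return jsonify({'status': status})


def set_thing(internal_request):
    # Logic layer: business rules only - no HTTP, no SQL.
    save_thing(internal_request.username, internal_request.thing)
    return 'OK'


def save_thing(username, thing):
    # Persistence layer: talks to the database (e.g. via sqlalchemy) or to
    # another service over HTTP.
    ...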

This architecture isn't particularly surprising or controversial, but it gets us one step closer to being able to extract the presentation layer validation code so it can be used elsewhere.

Declarative validation

The next step is to make the presentation layer validation declarative, rather than imperative. That is, we want to transform the usual imperative presentation layer validation:

from flask import request


def my_flask_view():
    payload = request.get_json()
    required_keys = {'thing_one', 'thing_two'}
    for key in required_keys:
        if key not in payload:
            return "Error, missing required key: %s" % key, 400
    # ...MUCH more code to completely validate the payload here.

...into a declarative form:

from flask import request
from acceptable import validate_body

@validate_body({
    'type': 'object',
    'properties': {
        'thing_one': {'type': 'string'},
        'thing_two': {'type': 'string'},
    },
    'required': ['thing_one', 'thing_two'],
    'additionalProperties': False,
})
def my_flask_view():
    payload = request.get_json()
    # payload is validated if we get here...

Keen-eyed observers will notice that the declarative specification here is jsonschema. This allows us to create some incredibly powerful specifications - much more so than the above example which simply states that both keys must exist, and their values must be strings.
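For instance, jsonschema lets us constrain values far beyond "these keys exist and are strings". A slightly richer (entirely made-up) schema might look like this:

RICHER_SCHEMA = {
    'type': 'object',
    'properties': {
        # Constrain the format of a value, not just its type.
        'username': {'type': 'string', 'pattern': '^[a-z0-9-]{1,32}$'},
        # Restrict a value to a fixed set of choices.
        'channel': {'type': 'string', 'enum': ['stable', 'candidate', 'beta', 'edge']},
        # Arrays and nested objects are easy to describe too.
        'tags': {
            'type': 'array',
            'items': {'type': 'string'},
            'maxItems': 10,
        },
    },
    'required': ['username', 'channel'],
    'additionalProperties': False,
}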

Introducing Acceptable

The from acceptable import validate_body line is the first import from the acceptable package we've seen so far. What is Acceptable? It's a wrapper around Flask that contains some of the new technology we've had to build for the new snap store. Of particular interest to this blog post is the fact that it contains the mechanism we're using for inter-service testing.

The validate_body decorator takes a jsonschema specification, and will validate all incoming requests against that schema. Only requests that validate successfully cause the view function to be called. There exists a similar decorator named validate_output that operates on the output of an API view instead of the input.
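Assuming validate_output takes a jsonschema specification in the same way (I'm sketching the usage here rather than quoting acceptable's documentation), a view using both decorators might look something like this:

from acceptable import validate_body, validate_output


@validate_body({
    'type': 'object',
    'properties': {
        'username': {'type': 'string'},
        'thing': {'type': 'string'},
    },
    'required': ['username', 'thing'],
})
@validate_output({
    'type': 'object',
    'properties': {
        'status': {'type': 'string'},
    },
    'required': ['status'],
})
def set_thing_view():
    # Both the incoming payload and the data we return are checked against
    # their respective schemas.
    return {'status': 'OK'}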

So far we've built a nicely structured service, but we haven't yet improved the situation for testing code that wants to integrate with this service. The final piece of the puzzle is that acceptable includes a command that can scan a codebase (or multiple codebases) for the validate_body and validate_output decorators, extract them, and build "service mocks" from them.

This gives us a library of mock objects that we can instantiate during a test. Setting up a service mock requires only that the test author specify a response they want the mock to return. Setting up one of these mocks causes the following things to be configured in the background:

  1. The correct URL that the remote service exposes for the API you want to integrate against is mocked. No other URLs are mocked, and only the correct HTTP method is mocked, so you can't accidentally mock out more than you intended to.
  2. Any requests to that URL will be validated against the jsonschema specification. Payloads that fail validation will result in an error response.
  3. The response the test author passes to the service mock will be validated against the validate_output schema. If it fails validation the mock will raise an error, and your test will fail. This prevents you from making faulty assumptions about what a remote service returns in your test code.
  4. All calls to the target service are recorded, so you can still make assertions on what was sent to the target service, as well as what the service responded with.

A test author integrating with an acceptable-enabled service gets to write tests that look like this:

import production_code  # contains the production code we're testing.
from service_mocks import remote_service


# Written as a method on a fixtures-enabled test case (hence 'self').
def test_set_on_service(self):
    # 1.- configure the service mock for the specific API we're integrating
    #     against. We pass in the response data we want from that service,
    #     and this step will fail if the response data we provide does not
    #     match the response specification in the target service:
    service_mock = self.useFixture(
        remote_service.api_name(output={'status': 'OK'}))

    # 2.- Call our function under test. If this function sends an invalid
    #     request to the correct url, or sends any request to a different
    #     url, then the response you get will be what you'd expect: a 400
    #     or a 404, respectively.
    return_value = production_code.set_thing_on_service(
        'username', 'thing')

    # 3.- Assertions based on the code-under-test's return value work as
    #     you'd expect:
    assert return_value == 'OK'
    # ... and you can also assert based on what calls were made during the
    # test:
    assert len(service_mock.calls) == 1

This is far from perfect, but it's a big step forward compared to anything else I've seen. I'm now much more confident when writing integration code between multiple services - I can be reasonably sure that if my tests pass then at least the presentation of the requests and responses in my tests is accurate. Of course, these tests are only as good as the jsonschema specifications on the target service.

Questions and Answers

Can I use it?

You can, but you probably shouldn't, at least not yet. While we're using it in production, and it's performed well for us, we're still not ready to make any promises about keeping APIs stable, even between minor versions. Additionally, there are still several bugs and issues that need more investigation and engineering work.

Having said that, it's open source, and there's nothing to stop you from using the ideas expressed in acceptable in your own codebases. Today's blog post can be summarised as "If you make your presentation layer validation declarative, then you can extract it and use it to build better service mocks for inter-service testing". There's nothing in the implementation that's particularly tricky. Acceptable does a lot more than just presentation layer validation, and I look forward to writing more about it in the future.

What are the known issues?

The largest known issue today is that writing assertions against the configured service mocks is somewhat unpleasant. This is due to the fact that the service mocks all share a single responses mock object, which in turn means the 'calls' list is shared across all service mock instances. Fixing this is on my personal TODO list, probably by moving away from responses.

What are the plans for the future?

Acceptable is still being evaluated. It has certainly proved useful, but we need to see how useful it's going to be long-term, and whether it's worth the cost of having another codebase to maintain.

If we decide to keep investing in it, there are a few things I'd like to see fixed:

  • We need some basic documentation. Acceptable isn't hard to use once you know it, but the lack of good documentation is a little unfriendly.
  • The script that extracts jsonschema specifications could be easier to use, and the library it generates could be easier to release (there's currently a few manual steps involved).
  • It would be nice to support HTTP responses with Content-Types other than application/json. For example, some of our APIs expose application/hal+json, and acceptable currently has no support for this.
  • Ideally I'd like to see acceptable converge on a more cohesive set of APIs. It's currently a bit of a "grab-bag" of tools. Ideally we'd turn it into a flask-compatible framework with capital-o Opinions about how to build a service.

Conclusion

Writing and testing robust inter-service communications has been one of the main challenges involved in pursuing a more distributed, less monolithic architecture. Acceptable is very much experimental software at this stage, but it's already providing some benefit, and the ideas are transportable to other web frameworks or even other languages. I hope that while acceptable itself might not be useful to others in its current form, the ideas expressed in this post are interesting, and perhaps spur others into developing similar tools for themselves.

As always, if you have any questions, please do not hesitate to ask.