Contributed"> Why Distributed Testing Is the Best Way to Test Microservices - The New Stack
Modal Title
Microservices / Observability / Software Development

Why Distributed Testing Is the Best Way to Test Microservices

Why I used to give up on testing for microservices-based apps (and what’s changed since then).
Aug 25th, 2022 10:00am by Aviv Kerbel
Feature image via Pixabay. 

Aviv Kerbel
Aviv is the developer advocate at Helios. An experienced developer and R&D leader, Aviv is passionate about development in modern environments and trending dev tools and technologies. Prior to Helios, Aviv was a senior software developer at multiple startups and an associate at Amiti, an Israel-based venture capital firm, where he was involved in due diligence and investment processes for tech startups.

It’s been almost a decade since I started developing full time, and almost four of those years were spent developing in a microservices-based environment. While we were all educated about the importance of testing, over time I found it increasingly complicated to write tests for my code, until it became almost impossible, or maybe just not worth it, because writing the tests took more time than writing the code itself!

After all, with so many asynchronous calls between services, it was easy to miss exceptions — so the preparations for the tests and building the infrastructure became tedious and time consuming, not to mention having to prepare data for those kinds of tests.

Think about a flow that includes Kafka topics, Postgres, DynamoDB, third-party APIs and multiple microservices that all depend on each other somehow. Any change might affect many others. This is not monolith-land anymore, where we would simply get an HTTP 500 status code from the server. In this scenario, we might have 100 microservices, and an exception thrown in one of them can easily go unnoticed because nothing ever notifies us about it.

When we set out to build backend E2E tests, these are the options we tried and tested:

Option one: Log-based testing

The first option we tried was fetching the logs added by the developers of the feature and using them to validate the relevant service data.

For example:
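A minimal sketch of what such a log-based check might look like, assuming the order service writes an application log file (the path and the expected message are placeholders, not from the original article):

import unittest


class OrderLogTest(unittest.TestCase):
    # Log-based check: fetch the service's logs and assert the expected line exists.
    LOG_PATH = "/var/log/order-service/app.log"  # assumed location

    def test_order_processed_log_line_exists(self):
        # After triggering the order flow, the service is expected to log this line.
        with open(self.LOG_PATH) as log_file:
            logs = log_file.read()
        self.assertIn("order 1234 processed successfully", logs)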

Pros: This method is always available and doesn’t require additional tools.

Cons:

  1. Logs aren’t always reliable, and they depend on developers actually adding them (which doesn’t always happen).
  2. Tests that are based on logs are limited by design, since they only test the logs and not the operation itself.

Option two: DB querying

If we save the operation indication in the DB, we can query the DB during the test and then create our test based on DB updates:

import unittest


class OrderTest(unittest.TestCase):
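    # The rest of this example is not shown in the original article, so the test
    # body below is a minimal sketch under assumed names: it expects the order
    # service to have written a row to an "orders" table (queried here through
    # SQLite for simplicity; the idea is the same for Postgres, DynamoDB, etc.).
    def test_order_was_persisted(self):
        import sqlite3  # imported inline to keep the sketch self-contained

        db = sqlite3.connect("orders.db")
        # After triggering the order flow, assert on the row the service wrote.
        row = db.execute(
            "SELECT status FROM orders WHERE order_id = ?", ("1234",)
        ).fetchone()
        self.assertIsNotNone(row)
        self.assertEqual(row[0], "COMPLETED")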

Pros: 

  1. It’s less flaky than log-based testing.
  2. Side effect — this data can later help us generate analytics in real environments.

Cons: 

  1. It’s not always easy to expose the DB to testing projects, since it requires redesigning DB schemas, which is wrong from a DB design perspective.
  2. Like the former option — it’s kind of “second-hand testing.” We test the DB object instead of the actual operation, so we might miss some issues. 

There are probably more options, but they all suffer from similar issues — they are not reliable enough and are too time consuming. 

This brings us to the bottom line: this is why most of the developers I know don’t test their code properly.

Option three: Contract testing

Contract testing is a method based on contracts between two systems to ensure compatibility. The interactions between the two services are stored in a contract, which both sides are then verified against during communication. Pact.io is a popular open source solution for contract testing.
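As a library-agnostic sketch of the idea (not Pact itself), the test below verifies a provider’s response against a stored contract; the endpoint, required fields and provider URL are assumptions:

import unittest

import requests

# A stored "contract": the response shape the consumer relies on.
ORDER_CONTRACT = {
    "endpoint": "/orders/1234",
    "expected_status": 200,
    "required_fields": {"id": str, "status": str},
}


class OrderProviderContractTest(unittest.TestCase):
    # Provider-side verification: does the real service still honor the contract?
    PROVIDER_URL = "http://localhost:8080"

    def test_provider_honors_order_contract(self):
        response = requests.get(self.PROVIDER_URL + ORDER_CONTRACT["endpoint"])
        self.assertEqual(response.status_code, ORDER_CONTRACT["expected_status"])

        body = response.json()
        for field, field_type in ORDER_CONTRACT["required_fields"].items():
            self.assertIn(field, body)
            self.assertIsInstance(body[field], field_type)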

Pros:

  1. It ensures consistency.
  2. It’s a single place for storing architectural changes.

Cons:

  1. It does not cover all use cases, since tests are created based only on actual consumer behavior.
  2. It requires mocking.

Option four: end-to-end testing

End-to-end (e2e) testing provides a comprehensive overview of the system components and how messages pass between them. With e2e testing, developers can make sure the application is working as expected and that no misconfigurations prevent it from performing in production.
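A minimal e2e sketch might drive the flow through the public API, just like a user would; the base URL, endpoints and response fields below are assumptions:

import unittest

import requests


class CheckoutE2ETest(unittest.TestCase):
    # Runs against a real (e.g., staging) environment with all services deployed.
    BASE_URL = "https://staging.example.com/api"

    def test_order_flow_end_to_end(self):
        # Trigger the flow through the public API, exactly as a user would.
        create = requests.post(
            f"{self.BASE_URL}/orders", json={"item_id": "sku-42", "quantity": 1}
        )
        self.assertEqual(create.status_code, 201)
        order_id = create.json()["id"]

        # Verify the final, user-visible result after all services have run.
        order = requests.get(f"{self.BASE_URL}/orders/{order_id}")
        self.assertEqual(order.status_code, 200)
        self.assertEqual(order.json()["status"], "CONFIRMED")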

Pros:

  1. It provides powerful tests.
  2. It ensures that microservices are really working for users.

Cons:

  1. It’s difficult to maintain and debug.
  2. It has high costs.

Option five: Trace-based testing — a new test paradigm

In the last few years, distributed tracing technologies have been gaining momentum. Standards like OpenTelemetry enable a different way to look at microservices: from a trace standpoint.

Distributed tracing lets us observe everything that happens when a single operation is triggered. Using this method, we can take any operation and get a holistic view of everything it triggers across the application, instead of looking at each part separately.

In OpenTelemetry terminology, a trace is built of spans. A span represents an operation — sending an HTTP request, making a DB query, handling an asynchronous event, etc. Ideally, spans let us review any attribute we want to validate in a test. Moreover, the trace also shows a continuous flow and the relations between operations, as opposed to a single operation, which can be powerful for test purposes.
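For example, with the OpenTelemetry Python SDK you can route spans to an in-memory exporter during a test and assert on them directly; the span name and the order.id attribute below are assumptions:

import unittest

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter

# Route all spans to an in-memory exporter so the test can inspect them.
exporter = InMemorySpanExporter()
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)


class OrderTraceTest(unittest.TestCase):
    def test_order_flow_emits_expected_span(self):
        # In a real test this span would come from the instrumented application
        # code; it is created inline here only to keep the sketch short.
        with tracer.start_as_current_span("process_order") as span:
            span.set_attribute("order.id", "1234")

        spans = exporter.get_finished_spans()
        order_spans = [s for s in spans if s.name == "process_order"]
        self.assertTrue(order_spans)
        self.assertEqual(order_spans[0].attributes["order.id"], "1234")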

How to Create Trace-Based Tests

When using traces for tests, you first need to deploy a tracing solution. As mentioned before, the common tracing solution in the industry is OpenTelemetry, or OTel, which also enables searching through trace data.

Let’s review the things you should do if you want to succeed in testing with OTel:

  1. Deploy the OTel SDK (a minimal setup sketch for this step follows the list).
  2. Build processor services that will take OTel data and transform it into assets you can work with.
  3. Create an ELK service that can save all the traces.
  4. Create a test framework that will allow you to search for a specific span inside traces.
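For step one, a minimal OTel SDK setup in a Python service might look like the sketch below, assuming the opentelemetry-sdk and opentelemetry-exporter-otlp packages; the service name and collector endpoint are placeholders:

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Identify this service in every trace it emits.
resource = Resource.create({"service.name": "order-service"})

provider = TracerProvider(resource=resource)
# Ship spans to a local collector (placeholder endpoint) that forwards them to storage.
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Any operation wrapped in a span becomes searchable later from your tests.
with tracer.start_as_current_span("handle_checkout") as span:
    span.set_attribute("order.id", "1234")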

And that’s it. Note that we didn’t discuss any visualization tools that would help you investigate traces, such as Zipkin or Jaeger, but you can use them as well.

But at the end of the day, another tool was still missing.

Looking at the relevant open source tools in the domain that are based on OpenTelemetry, we observed that most of them require the implementation of OpenTelemetry as a prerequisite, which left us with the same headaches.

Helios

Helios is a free tool that instruments OpenTelemetry and allows developers to generate test code for each trace.

Through spans, Helios collects all the payloads of each request and response. Using this capability, developers can generate tests without changing a single line of code.

It’s also built in a smart way that enables the creation of a test from a bug. Even better, using the trace visualization tool, a developer can visually create a test directly from a trace.

Bottom line: The way developers test microservices today is broken. It’s time consuming and inefficient. Testing based on distributed tracing is the right answer — it allows developers to properly test their microservices, since it provides a holistic view of all microservices. Whichever tool you choose, make sure you take advantage of its abilities.
