Observability

In any complex application, at some point something will go wrong. In a microservices application, you need to track what’s happening across dozens or even hundreds of services. At the network level we have already gained some insight by making use of what Service Meshes have to offer. But to make sense of what’s happening, you must collect telemetry from the application(s) themselves. Telemetry can be divided into logs, metrics, and traces.

Of course, even further insight can be gained on the client side by embedding JavaScript snippets into an application that measure, e.g., the page load time experienced by real users, but such End User Monitoring (EUM) / Real User Measurement (RUM) is beyond the scope of these exercises. Examples of such products are Google Analytics, the open-source Matomo Web Analytics, or Akamai’s mPulse (based on Boomerang) as part of an APM toolchain. However, as these require adjusting the application for best results, we will limit ourselves to the aspects mentioned above.

Logs

Logs are text-based records of events that occur while the application is running. They include things like application logs (possibly including trace statements) or web server logs. Logs are primarily useful for forensics and root cause analysis. We will investigate Logs below.
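
As a minimal illustration (a sketch, not part of the exercise setup): logs are easiest to analyze downstream when they are structured. The following Python snippet, using only the standard library, emits one JSON object per line so that a log collector can index individual fields; all names are purely illustrative.

    import json
    import logging
    import sys

    class JsonFormatter(logging.Formatter):
        # Render each log record as a single JSON line.
        def format(self, record: logging.LogRecord) -> str:
            return json.dumps({
                "ts": self.formatTime(record),
                "level": record.levelname,
                "logger": record.name,
                "message": record.getMessage(),
            })

    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logging.basicConfig(level=logging.INFO, handlers=[handler])

    logging.getLogger("orders").info("order received")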

Metrics

Metrics are numerical values that can be analyzed. You can use them to observe the system in real time (or close to real time), or to analyze performance trends over time. To understand the system holistically, you must collect metrics at various levels of the architecture, from the physical infrastructure to the application, including:

  • Node-level metrics, including CPU, memory, network, disk, and file system usage. System metrics help you understand resource allocation for each node in the cluster and troubleshoot outliers.
  • Container metrics. For containerized applications, you need to collect metrics at the container level, not just at the VM level.
  • Application metrics. This includes any metrics that are relevant to understanding the behavior of a service. Examples include the number of queued inbound HTTP requests, request latency, or message queue length. Applications can also create custom metrics that are specific to the domain, such as the number of business transactions processed per minute (see the sketch after this list).
  • Dependent service metrics. Services may call external services or endpoints, such as managed PaaS services or SaaS services. Third-party services may or may not provide any metrics. If not, you’ll have to rely on your own application metrics to track statistics for latency and error rate.

We will investigate the first three types of Metrics below.
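
To make the application-metrics bullet concrete, here is a minimal sketch using the Prometheus Python client (prometheus_client); the metric names and the scrape port are assumptions for illustration, not part of the exercises:

    import random
    import time

    from prometheus_client import Counter, Gauge, Histogram, start_http_server

    # Illustrative application-level metrics (names are hypothetical).
    REQUESTS = Counter("http_requests_total", "Inbound HTTP requests", ["method"])
    QUEUE_LENGTH = Gauge("work_queue_length", "Current length of the work queue")
    LATENCY = Histogram("request_latency_seconds", "Request latency in seconds")

    def handle_request() -> None:
        REQUESTS.labels(method="GET").inc()
        with LATENCY.time():              # records the handler's duration
            time.sleep(random.uniform(0.01, 0.1))

    if __name__ == "__main__":
        start_http_server(8000)           # expose /metrics for the scraper
        while True:
            QUEUE_LENGTH.set(random.randint(0, 10))
            handle_request()

Node- and container-level metrics, in contrast, are usually collected by agents such as the kubelet's built-in cAdvisor or a node exporter rather than by the application itself.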

Traces

Traces are correlated events that span several services but belong to a single client request.

A significant challenge of microservices is to understand the flow of events across services. A single transaction may involve calls to multiple services. To reconstruct the entire sequence of steps, each service could propagate a correlation ID that acts as a unique identifier for that operation. The correlation ID enables distributed tracing across services.

The first service that receives a client request should generate the correlation ID. If the service makes an HTTP call to another service, it puts the correlation ID in a request header. Downstream services continue to propagate the correlation ID, so that it flows through the entire system. In addition, all code that writes application metrics or log events should include the correlation ID.
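
A minimal sketch of this propagation, assuming Flask and requests; the header name X-Correlation-ID is a common convention (any header agreed upon by all services works), and the downstream URL is hypothetical:

    import logging
    import uuid

    import requests
    from flask import Flask, request

    app = Flask(__name__)
    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("orders-service")

    HEADER = "X-Correlation-ID"  # illustrative; use whatever all services agree on

    @app.route("/orders")
    def orders():
        # Reuse the caller's correlation ID, or generate one at the edge.
        cid = request.headers.get(HEADER) or str(uuid.uuid4())
        log.info("[%s] handling /orders", cid)  # include the ID in every log line
        # Propagate the ID on every downstream call.
        resp = requests.get("http://inventory.local/stock", headers={HEADER: cid})
        return resp.text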

When service calls are correlated, you can calculate operational metrics such as the end-to-end latency for a complete transaction, the number of successful transactions per second, and the percentage of failed transactions. Including correlation IDs in application logs makes it possible to perform root cause analysis. If an operation fails, you can find the log statements for all of the service calls that were part of the same operation.

If you are using Istio or Linkerd as a Service Mesh, these technologies automatically generate certain correlation headers when HTTP calls are routed through the Service Mesh data plane proxies. We have already investigated Istio’s standard tracing, whereas Linkerd’s tracing first requires specific configuration and has therefore not been investigated so far.

However, we will cover Traces independently of any Service Mesh, to showcase the possibilities.

OpenTelemetry

OpenTelemetry strives to unify all these different aspects of observability under one common roof.

We will approach OpenTelemetry separately to explore its features.
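
As a first taste, the following sketch uses the OpenTelemetry Python SDK (opentelemetry-sdk) to create two nested spans and print them to the console; in a real setup, an exporter to a tracing backend would replace the ConsoleSpanExporter:

    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

    # Wire up a tracer that prints every finished span to stdout.
    provider = TracerProvider()
    provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)

    tracer = trace.get_tracer("demo")

    # Nested spans model a request that fans out to an inner operation;
    # both share the same trace ID, which is what enables correlation.
    with tracer.start_as_current_span("handle-request"):
        with tracer.start_as_current_span("query-database") as span:
            span.set_attribute("db.system", "postgresql")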