Observability helps us understand the internals of a software from outside. Every enterprise software which is scalable and resilient to new changes needs to built using the principle of observability. A high quality software with no bugs, eventually with growing usage and complexity will present unknown problems. That is the hard truth we must accept. We should recognise what the SRE model emphasises: systems will continue failing. Observability will help us in understanding those failures better.
Logs, metrics, and traces are often regarded as the three pillars of observability. People might disagree on relative importance of each of these, but it is a general consensus that these 3 sources make a system more observable. Tracing is highly valued when we have complex distributed system where a single request flows through multiple services and layers before it can be served.
Tracing involves trace and spans. Trace is an execution path through the system. A trace has a name, an ID, and a time value. When you combine different traces from a distributed system, you can see an end-to-end flow of an execution path. These traces can help in analysing which part of the code in the system is taking how much time, how the data flows through various services and how the services are interdependent on each other. Zipkin and Jaeger are two of the most popular OpenTracing-compliant open source distributed tracing solutions. (OpenTracing is a vendor-neutral spec and instrumentation libraries for distributed tracing APIs.). Below is a screenshot from the Jaeger UI for a sample HTTP request.
A visualisation like this really helps in understanding a lot about the system. We see how requests flow from one service to another and between service and database layer. Traces with errors are highlighted for our help. We also get information about the time spent in each layer. This information is gold for someone trying to debug latency issues or performance issues.
One of the most common challenges when using any tracing library is the overhead of producing the tracing data and the inter-wining of core business logic with tracing specific code. While the first problem is claimed to be solved by vendor specific techniques like data buffering and batching, the second problem is slightly more complicated. Especially in a functional language like Scala , where we would like to keep composition and modularity while meeting our tracing requirements.
Let’s take an example to understand how tracing logic can easily overtake our code base and can make it less readable. Our sample code is a simple user service that fetches user information.
Our program returns a
UserSearchResult. The function
program call two other functions internally,
getUserData. The function
getUserData calls two other functions,
getUserNetwork internally to fetch the complete data. All the functions return the result in an
IO container which indicates wrapping a side effect. Let’s assume that
getUser fetches data from a database and
getUserNetwork call another service via
REST to get data.
It’s now time make our program more observable. We would like to trace the database and the rest api calls to the other services and see how much time it takes for each. Any errors , slowness and latency in fetching
UserSearchResult must be because of an issue with either the database call or the rest api call.
In the world of tracing , span is a building block. It represents an individual unit of work done. A span can also be started from another span representing the parent-child relationship between them. Spans also contain a Span-context that represents the state of a span and can be used to pass information across span boundaries. In our example we pass the root application span (this could be the first span / root span) to the
program method. All observable points call the
createSpan method that creates a new span as the child. The newly created span is passed by all methods to underlying methods so that the span information can be propagated across thread boundaries. Running our program reveals a trace like below
The above trace depicts the control flow of our system. There is a root span which has program span as its only child. Program span has two child spans to get user from database and get user data. User data span has further two spans to get user network and get user profile. Each span also has a flight time which indicates how much time was spent in that particular execution flow. This insight into our system would be very helpful in debugging and also to root cause bottlenecks. However, our code is now reeking of spans everywhere. Not only did all the method signature change to accommodate for span passing, we also have methods like
getUser where the new span is created but not used inside.
A closer look at the changed code reveals that the changes made to our original code were mainly focused on passing the parent span. Each point of trace called the create span method and used the parent span as a reference. A block of code that was previously returning an
IO[A] , now also needs a
Span. This can be represented by a datatype that we are well aware of — Kleisli. A
Kleisli[F, A, B] is a wrapper for a function that takes an
A and returns an
We basically create an alias for anything that takes a
Span and returns an
IO[A] . That alias is called a
Trace[A]. We create an overloaded
createSpan method that accepts an operation name and a
Trace[A]. The passed trace is a handle to the function that must be run by creating a new span which is what we do. We create a new span and the use that to run the passed trace. All our methods are now returning a
Trace[A] which is a Kleisli. Kleisli’s composes well and this can be seen in methods
program which have regained there original look. Running our program returns a similar trace to what we saw earlier.
So , we have managed to achieve our goal of making our program observable and also preserve the modularity and composability that we so much crave for. A further improvement to this code is to create a type class for our
Tracer and provide an
Kleisli instance for that type class.
All this functionality is available in an awesome open source library known as natchez. There is a nice example provided using Jaeger. If we were to re write our program using natchez it would look like
And the trace in Jaeger UI would look like ..
To summarise, tracing is an essential commodity in the world of highly complex and distributed applications. It might not be the silver bullet that solves all the problems but using it would help in gaining insight to how various components interact and how data flows in a system across a myriad of micro-services.
The challenge with introducing tracing in any code base is the amount of footprint the tracing code would leave. However if we use a library like natchez , we can still preserve readability and composability , while achieving the aim of passing span information across functions and thread boundaries.