Beyond simple benchmarks—A practical guide to optimizing code with BenchmarkDotNet


The performance loop—A practical guide to profiling and benchmarking

Code executed at scale must perform well. But how do we know if our performance optimizations actually make a difference?

Taking a step back, how do we even know what we need to optimize? First, we might need to discover hidden assumptions in the code, or figure out how to isolate the performance-critical bits. Even once we know what must be optimized, it's challenging to create reliable before-and-after benchmarks. We can only tell if our changes helped by profiling, improving, measuring, and profiling again. Without these steps, we might make things slower without realizing it.

In this talk, you'll learn how to:

  • Identify the areas whose effort-to-value ratio makes them worth improving
  • Isolate code to make its performance measurable without excessive refactoring
  • Apply the "performance loop" to ensure performance actually improves and nothing breaks
  • Become more “performance-aware” without getting bogged down in performance theater

Previous abstract

Beyond simple benchmarks—a practical guide to optimizing code

We know it’s vital that code executed at scale performs well. But how do we know if our performance optimizations actually make it faster? Fortunately, we have powerful tools which help—BenchmarkDotNet is a .NET library for benchmarking optimizations, with many simple examples to help get started.

In most systems, the code we need to optimize is rarely straightforward. It contains assumptions we need to discover before we even know what to improve. The code is hard to isolate. It has dependencies, which may or may not be relevant to optimization. And even when we’ve decided what to optimize, it’s hard to reliably benchmark the before and after. Only measurements can tell us if our changes actually make things faster. Without them, we could even make things slower, without realizing.

In this talk you’ll learn how to:

  • Identify areas of improvement which optimize the effort-to-value ratio
  • Isolate code to make its performance measurable without extensive refactoring
  • Apply the performance loop of measuring, changing and validating to ensure performance actually improves and nothing breaks
  • Gradually become more “performance aware” without costing an arm and a leg

Slides

Introduction

I remember the first time I started benchmarking my code changes to verify whether the things I thought would accelerate the code really made an impact. I had already seen quite a few benchmarks similar to the one below, written with BenchmarkDotNet, and felt quite certain it wouldn't take long.

using System;
using System.Linq;
using System.Text;
using BenchmarkDotNet.Attributes;

[SimpleJob]
[MemoryDiagnoser]
public class StringJoinBenchmarks {

  [Benchmark]
  public string StringJoin() {
    return string.Join(", ", Enumerable.Range(0, 10).Select(i => i.ToString()));
  }

  [Benchmark]
  public string StringBuilder() {
    var sb = new StringBuilder();
    for (int i = 0; i < 10; i++)
    {
        sb.Append(i);
        sb.Append(", ");
    }

    return sb.ToString(0, sb.Length - 2);
  }

  [Benchmark]
  public string ValueStringBuilder() {
    // ValueStringBuilder is internal to the BCL; this assumes a copy of the
    // implementation (with an int Append overload) has been pulled into the project.
    var separator = new ReadOnlySpan<char>(new char[] { ',', ' ' });
    using var sb = new ValueStringBuilder(stackalloc char[30]);
    for (int i = 0; i < 10; i++)
    {
        sb.Append(i);
        sb.Append(separator);
    }
    }

    return sb.AsSpan(0, sb.Length - 2).ToString();
  }
}
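The class above still needs an entry point to run. A minimal sketch, assuming the benchmarks live in a console project that references the BenchmarkDotNet NuGet package and is started in Release mode (`dotnet run -c Release`):

```csharp
using BenchmarkDotNet.Running;

public class Program
{
    // Dispatches to BenchmarkDotNet; command-line args can filter which benchmarks run.
    public static void Main(string[] args) =>
        BenchmarkRunner.Run<StringJoinBenchmarks>(args: args);
}
```

While iterating, swapping `[SimpleJob]` for `[ShortRunJob]` trades statistical rigor for much faster feedback, which fits the "short runs" idea discussed later.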

Oh, how wrong I was. Writing the skeleton of the benchmark was indeed simple. The mind-boggling part was figuring out what to include in the benchmark, how to isolate the code without a crazy amount of refactoring, what to deliberately cut away to make sure the envisioned changes were going in the right direction, and how to measure, change, and measure again without burning through the allotted budget. But why even bother with all this hassle?

For code that is executed at scale, the overall throughput and memory characteristics matter. Code that wastes CPU cycles or memory eats away resources that could be used to serve requests. With modern cloud-native approaches, scalable code is even more important than before because we are often billed for the resources we consume. The more efficient the code is, the smaller the bill, or the more requests we can execute for the same amount of money. And let's not forget that more efficient code execution also means consuming less energy, which is an important cornerstone of GreenIT too.

"We were able to see Azure Compute cost reduction of up to 50% per month; on average, we observed 24% monthly cost reduction after migrating to .NET 6. The reduction in cores reduced Azure spend by 24%." (Microsoft Teams' Infrastructure and Azure Communication Services' Journey to .NET 6)

In this talk, I have summarized my personal lessons on how to make performance optimizations actionable. I will show you a practical process to identify common bottlenecks, isolate components, and measure, change, and measure again without breaking current behavior. Let's not waste more time and get to the essence of this talk.

The performance loop

For me, one of the key principles I try to apply to almost everything in software is making tradeoffs and decisions explicit as we go. This also applies to performance. A reasonably mature team should be "performance aware". My friend Maarten Balliauw once famously said that in some countries you have to be bear aware: when you are hiking in Canada, it is good to be prepared for the likelihood of a bear crossing your hiking path. Not so much in Switzerland, though ;) I digress...

When it comes to performance, when you are performance aware, it doesn't mean you have to always go all the way in. Not at all. In fact, I always start with the simplest solutions that just work first and get some reasonably good test coverage in place. Once I have a working solution with good coverage, I start asking myself questions like:

  • How is this code going to be executed at scale, and what will its memory characteristics be (gut feeling)?
  • Are there simple low-hanging fruits I can pick to accelerate this code?
  • Are there things I can move away from the hot path by simply restructuring my code a bit?
  • What parts are under my control, and what parts aren't?
  • What optimizations can I apply, and when should I stop?
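As a concrete illustration of the low-hanging-fruit and hot-path questions, here is a hypothetical example (the names and pattern are illustrative, not taken from any real codebase): a check that re-interprets a regex pattern on every call, versus a compiled instance created once, moving that work off the hot path.

```csharp
using System.Text.RegularExpressions;

public static class Matcher
{
    // Before: every call goes through the static Regex cache lookup and
    // interprets the pattern at match time.
    public static bool StartsWithIdSlow(string input) =>
        Regex.IsMatch(input, "^[0-9a-f]{8}-");

    // After: the pattern is compiled once, up front, and the instance is reused.
    private static readonly Regex IdPrefix =
        new Regex("^[0-9a-f]{8}-", RegexOptions.Compiled);

    public static bool StartsWithIdFast(string input) => IdPrefix.IsMatch(input);
}
```

Both variants must of course agree on every input; whether the change is worth it is exactly what the benchmark has to show.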

I have covered some of these nuances further in my talk "Performance Tricks I learned from contributing to the Azure .NET SDK". Once I have a better understanding of the context of the code, depending on the outcome, I start applying the following performance loop.

  • Write a simple "sample" or harness that makes it possible to observe the component under inspection with a memory profiler and a performance profiler. The profiler snapshots and flame graphs give me an indication of the different subsystems at play, allowing me to make an explicit decision on what to focus on and what to ignore.
  • Then I select the hot path, for example the one responsible for the majority of allocations or the biggest slowdown (or where I feel I can make a good enough impact without sinking days and weeks into it). If the code path in question is not well covered, I get some tests in place first to make sure my tweaks will not break existing assumptions and behavior; it doesn't help when something is super fast but utterly wrong :)
  • Then I experiment with the changes I have in mind and check whether they pass the tests. Once everything works functionally, I put things into a performance harness.
  • To save time, I extract the code as cleanly as possible into a dedicated repository and do a series of "short runs" to see if I'm heading in the right direction. Once I'm reasonably happy with the outcome, I do a full job run to verify the before and after.
  • Then I ship the code and focus my attention on other parts.
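The "tests first" step matters even for micro-optimizations like the string-joining benchmark above. A minimal sketch of such a guard (plain assertions here; in practice this would live in the existing test suite): the optimized StringBuilder variant must agree with the straightforward string.Join baseline, including edge cases such as an empty input.

```csharp
using System;
using System.Linq;
using System.Text;

public static class JoinCheck
{
    // Baseline: the obvious, known-correct implementation.
    public static string WithJoin(int count) =>
        string.Join(", ", Enumerable.Range(0, count).Select(i => i.ToString()));

    // Optimized variant: appends a trailing separator and trims it at the end.
    public static string WithBuilder(int count)
    {
        if (count == 0)
        {
            return string.Empty; // without this guard, Length - 2 throws below
        }

        var sb = new StringBuilder();
        for (int i = 0; i < count; i++)
        {
            sb.Append(i);
            sb.Append(", ");
        }

        return sb.ToString(0, sb.Length - 2);
    }
}
```

The empty-input guard is exactly the kind of assumption a benchmark alone never surfaces: the trimmed variant is faster and, without the test, wrong for zero elements.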

But enough of the overview of the process. Let's dive into a practical example.

NServiceBus Pipeline

NServiceBus Pipeline Overview

NServiceBus is the heart of a distributed system and the Particular Service Platform. It helps create systems that are scalable, reliable, and flexible. At its core, NServiceBus works by routing messages between endpoints. Messages are plain C# classes that contain meaningful data for the business process that is being modeled. Endpoints can be running in different processes on different machines, even at different times. NServiceBus makes sure that each message reaches its intended destination and is processed. NServiceBus accomplishes this by providing an abstraction over existing queuing technologies. While it's possible to work directly with queuing systems, NServiceBus provides extra features to make applications more reliable and scalable.

The most critical infrastructure piece inside an NServiceBus endpoint is the NServiceBus pipeline. The pipeline is the engine that makes sure all the required steps involved (serialization, deserialization, transactions, data access...) in sending or receiving messages are executed as efficiently as possible. As such, it is crucial for the pipeline to not get in the way of our customers' code.
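To make the idea concrete, here is a deliberately tiny sketch of such a behavior pipeline; this is not the actual NServiceBus implementation (which is heavily optimized), just the shape of it. Each step receives the context plus a delegate that invokes the remainder of the chain, so it can run code both before and after the rest of the pipeline.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// A pipeline step sees the context and a delegate that invokes the remaining steps.
public delegate Task Step(List<string> context, Func<Task> next);

public static class MiniPipeline
{
    // Composes the steps right-to-left into a single callable chain.
    public static Func<List<string>, Task> Build(params Step[] steps)
    {
        Func<List<string>, Task> pipeline = _ => Task.CompletedTask;
        for (int i = steps.Length - 1; i >= 0; i--)
        {
            var step = steps[i];
            var rest = pipeline;
            pipeline = ctx => step(ctx, () => rest(ctx));
        }
        return pipeline;
    }
}
```

A step that does work before and after `await next()` wraps every later step, which is the same shape as the ASP.NET Core middleware shown below.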


This is conceptually very similar to the ASP.NET Core middleware pipeline,

ASP.NET Core Middleware

or expressed in code

app.Use(async (context, next) => {
    // Do work that can write to the Response.
    await next();
    // Do logging or other work that doesn't write to the Response.
});

or as classes

public class RequestCultu