Measuring latency

Introduction

The first step in improving the performance of our code is to measure it, so we don’t start hammering at the wrong places. Besides the usual profiling techniques, we sometimes want to measure live performance while our code is running. I had a problem where I needed to measure the latency of a certain hot path in our infrastructure while introducing minimal overhead. This led me on a journey to understand how we actually measure time on a CPU and what errors we can realistically expect.

Goal

My goal here is to understand how these measuring tools operate, what resolution we can measure time at on modern CPUs, and what average overhead we introduce by measuring live running code. Let us start with some definitions, and then we will explore how this works on Intel and Apple M2 CPUs.

Time Stamp Counter (TSC)

The exact purpose and usage of this CPU register has changed over time. Initially it was a pure counter of CPU cycles and wasn’t meant to be used for measuring time. But, as often happens, people realized that it can be used for measuring time under certain conditions, such as knowing that the frequency of the CPU is constant over time and that there are no suspends. Then came the era of multi-core and multi-CPU systems, which introduced new problems: CPUs could enter boost phases that modify their frequencies, and multiple cores and CPUs would each run at their own frequency, so depending on which core you queried, and when, you would get different results. But because people liked this ability to easily measure time for performance work, CPU manufacturers set out to solve the problem. There are many sources online on how this is solved hardware-wise, but we will not get into that here; we can simply assume (at least for most modern CPUs) that all of our cores and CPUs share the same reference counter no matter their frequency.

Intel TSC

Let us take a look at the Intel Software Developer’s Manual (it’s a huge document, consisting of 4 volumes and more than 5000 pages!). Section 19.17 describes exactly what a TSC is. It says:

The Intel 64 and IA-32 architectures (beginning with the Pentium processor) define a time-stamp counter mechanism that can be used to monitor and identify the relative time occurrence of processor events.

 

For Pentium 4 processors, Intel Xeon processors (family [0FH], models [03H and higher]); for Intel Core Solo and Intel Core Duo processors (family [06H], model [0EH]); for the Intel Xeon processor 5100 series and Intel Core 2 Duo processors (family [06H], model [0FH]); for Intel Core 2 and Intel Xeon processors (family [06H], DisplayModel [17H]); for Intel Atom processors (family [06H], DisplayModel [1CH]): the time-stamp counter increments at a constant rate. That rate may be set by the maximum core-clock to bus-clock ratio of the processor or may be set by the maximum resolved frequency at which the processor is booted.

— Intel Software Developer's Manual
19.17 p3792

So we know that in most modern CPUs there is going to be a solution at the hardware level that will give us a constant rate of counter increments that we can use for measuring time. Our only job is to determine what this frequency is but that shouldn’t be a hard thing to do. The same guide also mentions this:

The time stamp counter in newer processors may support an enhancement, referred to as invariant TSC. Processor’s support for invariant TSC is indicated by CPUID.80000007H:EDX[8]. The invariant TSC will run at a constant rate in all ACPI P-, C-. and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource.
— Intel Software Developer's Manual
19.17.1 p3793

So that means that if the CPU goes into any of those states, our counter will continue to tick at the same rate. We will assume that this is the case for our application. How we actually read this counter is given in the next section:
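
As an aside, the CPUID bit mentioned in the quote above can be queried directly; a minimal sketch (x86_64 only) using Rust’s stabilized `__cpuid` intrinsic, checking CPUID.80000007H:EDX[8]:

```rust
// Checks the invariant-TSC flag described in the SDM quote above (x86_64 only).
#[cfg(target_arch = "x86_64")]
fn has_invariant_tsc() -> bool {
    // Leaf 0x8000_0007 holds advanced power-management info; bit 8 of EDX
    // is set when the TSC ticks at a constant rate in all P-, C- and T-states.
    let leaf = unsafe { core::arch::x86_64::__cpuid(0x8000_0007) };
    (leaf.edx >> 8) & 1 == 1
}

#[cfg(target_arch = "x86_64")]
fn main() {
    println!("invariant TSC: {}", has_invariant_tsc());
}

#[cfg(not(target_arch = "x86_64"))]
fn main() {}
```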

In 64-bit mode, RDTSC operates the same as in protected mode. The count in the time-stamp counter is stored in EDX:EAX (or RDX[31:0]:RAX[31:0] with RDX[63:32]:RAX[63:32] cleared).
— Intel Software Developer's Manual
Volume 3 Chapter 2.8.6.1

This means that we read our 64-bit counter into two 32-bit registers, which we need to combine to get the final number.
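
The combination itself is just a shift and an OR; a quick sketch of the arithmetic (`hi` and `lo` are my names for the EDX and EAX values):

```rust
// Combine the two halves RDTSC leaves in EDX (high 32 bits) and EAX
// (low 32 bits) into the full 64-bit count.
fn combine(lo: u32, hi: u32) -> u64 {
    ((hi as u64) << 32) | (lo as u64)
}

fn main() {
    // EDX = 0x00000001, EAX = 0xDEADBEEF -> 0x1_DEAD_BEEF
    println!("{:#x}", combine(0xDEAD_BEEF, 0x0000_0001));
}
```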

One more thing (rather important one) we need to consider before we start writing any code is the following:

The RDTSC instruction is not serializing or ordered with other instructions. It does not necessarily wait until all previous instructions have been executed before reading the counter. Similarly, subsequent instructions may begin execution before the RDTSC instruction operation is performed.
— Intel Software Developer's Manual
19.17 p3792

The issue is that, in order to improve performance, the CPU can look for instructions that are independent and execute them out of the order in which we programmed them, so that it doesn’t have to wait; a serializing instruction forbids this reordering across it. In general we want the reordering, but when we have a piece of code like this:

let start = now();
let result = some_computation();
let end = now();

The CPU can look at the instructions of now() and some_computation() and decide to start executing some of the instructions from some_computation() before it starts executing our now() instructions. This can lead to misleading measurements, because we would be measuring partial executions of our code and failing to account for certain instructions. Luckily, we can solve this by using one of the serialization barriers that Intel has to offer. Traditionally the cpuid instruction was used for this, but cpuid has a rather large overhead, so we will use lfence instead. From the same manual:

Performs a serializing operation on all load-from-memory instructions that were issued prior the LFENCE instruction. Specifically, LFENCE does not execute until all prior instructions have completed locally, and no later instruction begins execution until LFENCE completes. In particular, an instruction that loads from memory and that precedes an LFENCE receives data from memory prior to completion of the LFENCE. (An LFENCE that follows an instruction that stores to memory might complete before the data being stored have become globally visible.) Instructions following an LFENCE may be fetched from memory before the LFENCE, but they will not execute (even speculatively) until the LFENCE completes.
— Intel Software Developer's Manual
Volume 2 Chapter 3.3 p1240

So before we read the counter with rdtsc we will issue an lfence barrier, and we will do the same just after the ending rdtsc, to prevent any following instructions from leaking in before our final read.
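
Putting this together, the measurement pattern looks roughly like the sketch below. It uses the stabilized _rdtsc and _mm_lfence intrinsics instead of inline assembly (the benchmark later in this post uses raw asm, but the ordering of fences and reads is the same):

```rust
// Sketch of the fenced measurement pattern (x86_64 only).
#[cfg(target_arch = "x86_64")]
fn measure_cycles<F: FnOnce()>(f: F) -> u64 {
    use core::arch::x86_64::{_mm_lfence, _rdtsc};
    unsafe {
        _mm_lfence(); // earlier instructions must complete before the first read
        let start = _rdtsc();
        f(); // the code under measurement
        _mm_lfence(); // f()'s instructions must complete before the second read
        let end = _rdtsc();
        end - start
    }
}

#[cfg(target_arch = "x86_64")]
fn main() {
    let cycles = measure_cycles(|| {
        std::hint::black_box((0..10_000u64).sum::<u64>());
    });
    println!("{cycles} cycles");
}

#[cfg(not(target_arch = "x86_64"))]
fn main() {}
```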

Apple / Arm system counter

As I do much of my development on a MacBook, I wanted to understand how the same thing is done on Arm processors. A nice explanation of the concepts can be found in Arm’s Learn the architecture - Generic Timer guide. The story here is mostly the same as with Intel: Arm provides the hardware architecture that ensures a stable and consistent counter is present across multiple cores:

The System Counter is an always-on device, which provides a fixed frequency incrementing system count. The system count value is broadcast to all the cores in the system, giving the cores a common view of the passage of time. The system count value is between 56 bits and 64 bits in width. From Armv8.6-A and Armv9.1-A, the frequency of the count is fixed at 1GHz. Pre-Armv8.6-A, the count frequency was a system design choice, typically in the range of 1MHz to 50MHz.
— Learn the architecture - Generic Timer

One thing that is different is that Arm has a concept of a virtual and a physical counter. The physical one, as the name suggests, is the value produced by the actual hardware counter. The virtual counter is calculated as Virtual Count = Physical Count - CNTVOFF_EL2, where CNTVOFF_EL2 is an offset value that, per the Arm documentation, can be used as follows:

The virtual count allows a hypervisor to show virtual time to a Virtual Machine (VM). For example, a hypervisor could use the offset to hide the passage of time when the VM was not scheduled. This means that the virtual count can represent time experienced by the VM, rather than wall clock time.
— Learn the architecture - Generic Timer
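
The relationship is plain arithmetic; a toy illustration with made-up values (real code reads the registers, it does not subtract literals):

```rust
fn main() {
    // Hypothetical numbers: suppose the hypervisor hides 2_000_000 ticks
    // during which the VM was not scheduled.
    let physical_count: u64 = 10_000_000; // what CNTPCT_EL0 would show
    let cntvoff_el2: u64 = 2_000_000; // offset programmed by the hypervisor
    let virtual_count = physical_count - cntvoff_el2; // what CNTVCT_EL0 would show
    println!("virtual count: {virtual_count}");
}
```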

We access these counters by reading either the CNTPCT_EL0 or the CNTVCT_EL0 register. Here we have the same out-of-order execution problem we had with Intel processors, and to solve it we can use the isb instruction (Instruction Synchronization Barrier). Per the Arm documentation:

Instruction Synchronization Barrier(ISB) is used to guarantee that any subsequent instructions are fetched, so that privilege and access are checked with the current MMU configuration. It is used to ensure any previously executed context-changing operations, such as writes to system control registers, have completed by the time the ISB completes.
— ARMv8-A memory systems

Writing benchmarks

Intel

The first thing we want to do is check at what frequency our TSC is ticking. We can find that by inspecting the kernel’s diagnostic messages; I am running this on my ThinkPad with an Intel i5-6440HQ:

> sudo dmesg | grep -i tsc
[    0.000000] tsc: Detected 2600.000 MHz processor
[    0.000000] tsc: Detected 2599.992 MHz TSC
[    0.045536] TSC deadline timer available
[    0.156280] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x257a34a6eea, max_idle_ns: 440795264358 ns
[    0.487330] clocksource: Switched to clocksource tsc-early
[    1.588731] tsc: Refined TSC clocksource calibration: 2591.999 MHz
[    1.588763] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x255cb5c6a11, max_idle_ns: 440795249002 ns
[    1.588829] clocksource: Switched to clocksource tsc

We see that the kernel initially detected a frequency of 2599.992 MHz, but after some calibration it settled at 2591.999 MHz. This frequency translates to roughly 0.39 ns of resolution: every 0.39 ns our counter increases by one, and effectively every operation faster than that is invisible to us. Now let us write a simple benchmark to see how much latency we can expect from reading rdtsc.

use std::fs::File;
use std::hint::black_box;
use std::io::{BufWriter, Write};

const BASE_FREQ: f64 = 2_591_999_000.0;
const NS_PER_TICK: f64 = 1_000_000_000.0 / BASE_FREQ;
const NUM_OF_TRIALS: usize = 1_000;
const NUM_OF_MEASUREMENTS: usize = 1_000_000;

// we use no_mangle so that it is easier to find this function in the generated assembly
#[unsafe(no_mangle)]
fn now() -> u64 {
    unsafe {
        let val: u64;
        core::arch::asm!(
        "lfence", //sync barrier
        "rdtsc",
        "shl rdx, 32", // per documentation high 32 bits are in rdx so we shift them left
        "or rax, rdx", // and then combine with rax to get full 64bit count
        out("rax") val,
        out("rdx") _,
        options(nostack)
        );
        val
    }
}

fn main() {
    // to avoid any statistical fluctuations we will measure many times and avg over that
    let mut measurements = [0.0; NUM_OF_TRIALS];
    for trial in measurements.iter_mut() {
        let start = now();
        for _ in 0..NUM_OF_MEASUREMENTS {
            // std::hint::black_box tells the compiler to be maximally
            // pessimistic about this value, which stops it from optimizing
            // the call away. Otherwise the compiler could see that we call
            // now() on every iteration without ever using the result, and
            // remove the whole loop. To actually measure the latency of
            // reading the counter we need these `useless` calls to really
            // happen, over and over again.
            black_box(now());
        }
        let end = now();
        unsafe { core::arch::asm!("lfence") };
        *trial = (end - start) as f64 / NUM_OF_MEASUREMENTS as f64 * NS_PER_TICK;
    }
    println!(
        "avg: {}ns",
        measurements.iter().sum::<f64>() / measurements.len() as f64
    );
    let file = File::create("intel.txt").unwrap();
    let mut writer = BufWriter::new(file);
    for val in &measurements {
        writeln!(writer, "{}", val).unwrap();
    }
}

This version of the code has a sync barrier; for the fun of it, I also tried it without the barrier to see how much overhead the synchronization adds:

intel
Diagram 1: Histogram of rdtsc latency with and without sync barrier

We can see that the barrier introduces around 4.5 ns of latency on average, but as we discussed, not having a barrier makes our measurements meaningless.

M2

Let us repeat the same steps on the MacBook. To see the frequency of our counter we can use sysctl:

> sysctl hw.tbfrequency
hw.tbfrequency: 24000000

That is quite unexpected: our counter runs at only 24 MHz, which translates to a resolution of around 41.67 ns. That is quite a contrast to the Intel CPU. The Arm documentation says that all newer Arm CPUs should have a 1 GHz timer, so why Apple decided to run the M2 counter at 24 MHz I don’t know. The benchmark code looks almost the same:

use std::fs::File;
use std::hint::black_box;
use std::io::{BufWriter, Write};

const BASE_FREQ: f64 = 24_000_000.0;
const NS_PER_TICK: f64 = 1_000_000_000.0 / BASE_FREQ;
const NUM_OF_TRIALS: usize = 1_000;
const NUM_OF_MEASUREMENTS: usize = 1_000_000;

#[unsafe(no_mangle)]
fn now() -> u64 {
    unsafe {
        let val: u64;
        core::arch::asm!(
        "isb", //sync barrier before we do the measurement
        "mrs {}, cntvct_el0",
        out(reg) val,
        options(nostack)
        );
        val
    }
}

fn main() {
    // to avoid any statistical fluctuations we will measure many times and avg over that
    let mut measurements = [0.0; NUM_OF_TRIALS];
    for trial in measurements.iter_mut() {
        let start = now();
        for _ in 0..NUM_OF_MEASUREMENTS {
            black_box(now());
        }
        let end = now();
        unsafe { core::arch::asm!("isb") };
        *trial = (end - start) as f64 / NUM_OF_MEASUREMENTS as f64 * NS_PER_TICK;
    }
    println!(
        "avg: {}ns",
        measurements.iter().sum::<f64>() / measurements.len() as f64
    );
    let file = File::create("arm.txt").unwrap();
    let mut writer = BufWriter::new(file);
    for val in &measurements {
        writeln!(writer, "{}", val).unwrap();
    }
}

And the resulting histogram:

arm
Diagram 2: Histogram of cntvct_el0 latency with sync barrier

The reason I didn’t plot the no-barrier case is that, because of the very coarse resolution, the loop would often finish before the counter even changed value, so the final average was very small and not really usable. But even if that weren’t the case, it still stands that we must synchronize before taking the measurements. We can plot all three on the same graph for comparison:

all
Diagram 3: Histogram of all latency measurements
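
As a side note, the per-trial values written to intel.txt and arm.txt can be summarized without a plotting tool; a small sketch (the choice of statistics here is my own addition):

```rust
use std::fs;

// Returns (mean, median, p99) of the newline-separated f64 values in `path`,
// or None if the file is missing or empty.
fn summarize(path: &str) -> Option<(f64, f64, f64)> {
    let mut vals: Vec<f64> = fs::read_to_string(path)
        .ok()?
        .lines()
        .filter_map(|l| l.trim().parse().ok())
        .collect();
    if vals.is_empty() {
        return None;
    }
    vals.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let mean = vals.iter().sum::<f64>() / vals.len() as f64;
    let median = vals[vals.len() / 2];
    let p99 = vals[(vals.len() * 99) / 100];
    Some((mean, median, p99))
}

fn main() {
    for path in ["intel.txt", "arm.txt"] {
        if let Some((mean, median, p99)) = summarize(path) {
            println!("{path}: mean {mean:.2} ns, median {median:.2} ns, p99 {p99:.2} ns");
        }
    }
}
```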

Summary

Even though in these tests the latency on the M2 is smaller by a couple of nanoseconds, the very coarse resolution caused by running the counter at 24 MHz limits how precisely we can measure. On the other hand, Intel provides much better resolution at a slight increase in latency, and I would say that for all but the fastest applications this will suffice. If we get to the point where single nanoseconds matter, then we probably need dedicated measuring hardware as well. I hope you enjoyed this small foray into the world of CPU counters and how they work. See you soon!

Table 1. Latency summary

CPU (counter freq)               With barrier [ns]   Without barrier [ns]
Intel i5-6440HQ (2591.999 MHz)   12.81               8.37
Apple M2 Pro (24 MHz)            9.94                0.32 (not usable)