Tracer

An experimental tracing framework for Python.

Some code snippets: FillPages, an AVX2-optimized assembly routine for filling pages with a particular byte pattern. InjectionThunk, an assembly routine used for remote process code injection -- includes support for injection events and shared memory as well. PrefixSearchStringTable, an AVX2-optimized routine for searching whether or not a given string prefix-matches a set of strings in a custom string table. RegisterFrame, the main frame registration routine used by the Python tracing component.

Introduction

Motivation, Existing Works

The Tracer project was inspired by consulting work in New York city. Often, I was tasked with joining a team with an existing, large code base, and improving performance. Alternatively, new code would be written that would exhibit poor performance, and assistance was required to determine why.

I had been using three tools: the existing Python profiling functionality (cProfile), a line profiler (kernprof), and Visual Studio's Python profiling feature.

cProfile is cumulative in nature. Call sites are recorded and functions are depicted by way of both number of calls and cumulative call time. It is useful for detecting "red flag" performance problems -- an example I saw in real life was 30% of execution time being spent in gc.collect(). This is indicative of pathological performance problems, and was traced to some pandas behavior regarding temporary assignments.

For long-running, batch-oriented programs, though, where there are no pathological corner cases, the usefulness of cProfile diminishes. It becomes hard to distinguish signal from noise with regards to the cumulative call counts and function time. Additionally, cProfile only captures function call entry and exit times -- it doesn't do line profiling, and thus, won't be able to depict abnormally expensive lines within a routine (this can be common with NumPy and Pandas; a seemingly innocuous line will actually have a very high overhead behind the scenes).

A line profiling tool exists called kernprof. This enables line tracing in the Python interpreter, which can depict the cost of each line within a given function call. This is useful for identifying why a large function is exhibiting poor performance. However, it requires knowing the problematic function ahead of time. Functions must be programmatically "registered" for line-tracing at runtime. This quickly becomes very tedious for large code bases, especially if you're unfamiliar with the code base, and thus, not sure which functions should be inspected closer and which should be ignored.

The Python Tools for Visual Studio component has a custom Python profiling feature which actually leverages the underlying Visual Studio performance profiling functionality. It depicts richer call stack information, which allows visually identifying problematic areas in a much easier fashion than the other two tools. However, the overhead incurred by the profiler is significant -- something that took 5 minutes to run could take hours to run when being profiled. Additionally, the time taken to process the saved profile data would also be quite significant.

Each tool has various pros and cons. The source code is actually available for all three, and subsequent study proved quite beneficial in identifying approaches to various problems that were worth adopting, as well as areas where some improvements could be made if a more sophisticated architecture was put in place.

The use of splay-trees by the cProfile component interested me. The code responsible for registering new Python functions heavily uses the NT kernel's generic table APIs, which use splay trees behind the scenes.

I wanted to design something that supported line-level tracing at the Python module level. For a given project, there were only ever a couple of top-level Python module names I was interested in. Being able to specify a module prefix name (e.g. numpy, pandas and myprojectfoo) and ignoring everything else was important. This lead to the development of the StringTable component, which is an AVX2 accelerated prefix string matching algorithm geared specifically toward the needs of the tracing project.

I also knew I wanted to leverage memory maps extensively. The tracing facilities should be able to request blocks of memory and then be given back a pointer that they're free to write to. The underlying operating system's memory and file system manager would take care of ensuring the data was sync'd to disk as necessary. Additionally, it should be possible to load the data files in a read-only capacity (whilst the tracing data was being captured) to monitor performance in a live fashion.

This lead to the development of the TraceStore component, which underpins the vast majority of the tracing infrastructure.

C without the C Runtime Library

The Tracer project is implemented in C. However, it does not link to, use, or otherwise depend upon the C runtime library (CRT). Instead, it relies on equivalent functionality present in the NT kernel, by way of the ntdll.dll, kernel32.dll and ntoskrnl.exe images. Extensive use is made of the NT runtime library -- in fact, the main Tracer runtime library component is named "Rtl" after the similarly named NT kernel component.

The Tracer project's Rtl module defines many function pointer types -- some mirror identically-named NT procedures (e.g. RtlSplayLinks has a function pointer typedef named PRTL_SPLAY_LINKS), the rest are common procedures implemented internally (i.e. they will have .c implementations living in the Rtl/ directory) that are of use to other components in the project. For example, path and environment variable helper procedures.

The interface to the Rtl module's functionality is by way of a monolothic structure named RTL. This structure is predominantly composed of function pointers to either NT-implemented or internally-implemented functionality. A single RTL structure is typically initialized once per program instance at the very start by the bootstrap sequence (initialize an allocator via DefaultHeapInitializeAllocator(), then call CreateAndInitializeTracerConfigAndRtl()).

A pointer to the structure (of type PRTL) is then commonly passed around between downstream components. It will frequently be observed as the first parameter of a procedure, or captured as a function pointer field (PRTL Rtl) in a downstream C structure somewhere (for example, a runtime context).

No module currently uses more than one instance of an RTL structure, although there is technically nothing preventing LoadRtl() from being called to initialize multiple instances. It is assumed that once loaded and initialized, the RTL structure will stay valid until program termination. There is a DestroyRtl() procedure, however, as LoadRtl() requires the caller to provide the memory backing the structure, no memory allocation is done by this routine -- it simply calls FreeLibrary() on ntdll.dll, kernelbase.dll, kernel32.dll, and ntoskrnl.exe images. As these images are typically permanently mapped into a process's address space, the FreeLibrary() call is technically unnecessary, at least from the perspective that it won't ever result in the image being unloaded. It is good discipline, though, to pair LoadLibrary() calls with matching FreeLibrary() calls, so this is adhered to by DestroyRtl().

From the perspective of the downstream module author, the Rtl pointer (obtained either via a procedure parameter or embedded in a structure) can be assumed to always be non-NULL and valid.

Motivation Behind Not Relying on the C Runtime Library

1. Multiple C runtimes.
2. A runtime dependency on the CRT (i.e. the binary has an imports section with msvcrt stuff).
3. Avoid peturbing the traced environment.

The Mechanics of C without the C Runtime Library

1. Visual Studio setup.
2. __chkstack
3. __C_runtime_handler

Style

The Tracer project uses the Cutler Normal Form (CNF) coding style for the C language. This is the style used for the NT kernel; the name Cutler Normal Form parallels the Kernel Normal Form style that has guided UNIX kernel development over the years. For those with exposure to the NT kernel style (e.g. via Windows headers or NT driver development), the Tracer project will have a familiar feel. For those coming from a UNIX or Linux background, the style can appear quite jarring, as it shares almost no traits with the type of C style programs in that domain. The only commonality between the two styles is that they're both very consistent internally, and produce very clean looking code. Each style is an artefact of the environment it was created in.

Intrinsic C Types

Type names are always upper case. Intrinsic C types map to NT types as follows:

C Type NT Type (typedef) char CHAR unsigned char BYTE xxx finish me

Pointer typedefs are used to capture pointer types, e.g. PULONG is used instead of unsigned long *.

Structures, Unions and Enums

Structures, unions and enumerations all follow the same pattern:

typedef struct _FOO {
	USHORT SizeOfStruct;
	USHORT Padding[3];

	ULONG SequenceId;
	ULONG Checksum;

	PRUNTIME_CONTEXT Context;
} FOO;
typedef FOO *PFOO;
typedef FOO **PPFOO;

The structure name is always prefixed with an underscore, and defined via a typedef, with the type name matching the structure name, sans underscore. The structure is then followed by at least one pointer typedef, the name for which always matches the target type

Tracer

Install / Use

README

Tracer

Introduction

Motivation, Existing Works

C without the C Runtime Library

Motivation Behind Not Relying on the C Runtime Library

The Mechanics of C without the C Runtime Library

Style

Intrinsic C Types

Structures, Unions and Enums