Gsym
Python code that can parse mach-o and ELF files, parse DWARF debug info and generate a new symbolication format.
Install / Use
/learn @YtnbFirewings/GsymREADME
gsym
The code in this repository is designed to help finalized the "gsym" file format. It currently uses python to generate the format from DWARF. This allows us to quickly iterate on the design of the file format. Eventually native tools, hopefully the compiler or linker, will generate and parse this information.
What is gsym?
The gsym format is designed to store address lookup information efficiently. Many tools do symbolication of crash logs and stack back tracing. These tools require quick and efficient address lookups where the fewest number of pages are touched during the lookup itself. Many tools currently parse DWARF debug information for this. DWARF is a highly compressed format that requires a lot of parsing, extra memory, generation of lookup tables and manual searching for address information. This makes DWARF expensive and complex to use for address lookups. A symbolication tool that uses DWARF needs to be able to parse the debug info in .debug_info and .debug_abbrev in order to get to the line tables that are encoded in the .debug_line section. The address accelerator tables included in DWARF are random indexes of addresses where the table entries direct you to a compile unit that contains the address. It doesn't point to the exact function or variable within the compile unit, you must manually parse all information in each compile unit to find the function information you are looking for. To make matters worse, the DWARF accelerator tables are not always included, which means you must parse all DWARF from all compile units and create your own address lookup tables. Once you find your function information, you can get to the line table for the compile unit. This line table contains all of the source line information for all functions in the compile unit. This means you must search all line information for all functions to find the source file and line information you need for your address. The DWARF line tables for each compile unit, in DWARF 4 and earlier versions, store the files used in the line table, and for other DWARF info outside of the line table, by duplicating the paths from other line tables in each compile unit's line table prologue. If you want to extract the inline call stack for a given address, you must return the the function's debug info and parse all of the children of that debug information to discover the inlined call stack.
The gsym file format is designed to solve all of the above issues by designing a file format that can be mapped into read only memory and can be shared between multiple processes. The file format requires minimal setup to map and use and doesn't require any parsing or sorting prior to doing any lookups. The information is sorted and doesn't require any manual computation or indexing prior to doing lookups. The file format also stores the line table information for a function in a line table that is only for the source lines in that function. All file path information is stored efficiently and is shared between all line tables.
Introduction
The current code is in Python and can parse DWARF information found in mach-o and ELF file generates a ELF or mach-o files that contain the new gsym format. The gsym file format can be a part of an existing mach-o or ELF object file, or it can be a stand alone file format.
The gsym.py file contains a class named gsym.Symbolicator that is a class for a new symbolication format that uses sections within object files (ELF, Mach-o or COFF) to contain the symbolication data. The gsym file format efficiently stores information for addresses within an executable or shared library.
The information is designed to efficiently lookup addresses and provide function name, source file and source line information for a given address. To save space many things have been done:
- binary file format
- store the address lookup table as a base address + address offsets where the address offsets can be smaller than a full sized 32 or 64 bit address. The address offsets will be dynamic stored as uint16_t, uint32_t, or uint64_t depending on the max address difference in the file. We can look into storing 24 bit offsets in the future if needed but I didn't want to have to do any processing to decode 24 bit integers as it might affect the address lookup speed.
- compress C++ strings by removing default parameters for STL code and changing names into shorter forms (see shortencpp.py).
- use a string table for all strings that contains uniqued strings. The string table can be shared with the current executable or debug info which allows this information to be efficiently encoded in existing debug info files or as a stand alone file.
- store file paths by separating the directory and basename. These are stored as offsets into the string table. This allows file directories to share the same string in the string table. Unlike DWARF, all files are stored in one table for all line tables for all functions.
- store line tables efficiently using a technique similar to DWARF line tables.
- store inline information efficiently to allow inline call stacks to be extracted from a single address
File Format Details
The file format is designed to be able to be memory mapped into memory and used as is. Each lookup will only expand any information for a given address on demand. This means the address lookup information is all grouped together and touches as few pages as possible when doing the lookups.
If the gsym data is in a stand alone file, the file starts with the gsym header. If the gsym data is part of an object file, the data is placed into sections in the object file. The main section contains the gsym header, address table, address info offsets, files and the actual address info data. This section is named "__gsym" in mach-o and ".gsym" in all other file formats.
The string table is specified in the GSYM header using a file offset and byte size. This allows the string table to share strings with existing string tables (like ".strtab" or ".debug_str") or it can have its own stand alone string table. It also allows a file offset and size that encompasses multiple string tables in object files.
The format of the main section section is:
HEADER
Data layout on disk:
.align(16)
uint32_t magic;
uint16_t version;
uint8_t addr_off_size; // Size of addr_off_t
uint8_t pad;
uint64_t base_address;
uint32_t num_addrs;
uint32_t strtab_offset; // File offset for string table
uint32_t strtab_size; // File size for string table
.align(addr_off_size)
addr_off_t addr_offsets[num_addrs];
// Each address in addr_offsets has a corresponding entry in
// addr_info_offsets which points to the location of the
// information for that address like the function name and size,
// and the address to file and line mappings.
.align(sizeof(uint32_t))
uint32_t addr_info_offsets[num_addrs];
FILES TABLE
Definitions:
typedef struct {
uint32_t directory; // String table offset in the string table
uint32_t basename; // String table offset in the string table
} file_t;
Data layout on disk:
.align(sizeof(uint32_t))
uint32_t num_files;
file_t files[num_files];
ADDRESS INFOS
Each offset in the addr_info_offsets[] array points to the data for a given address. The data is designed to carry one or more types of information for an address range. This type of the information block is specified by the InfoType enumeration below. Each Information block is preceded by the 32 bit InfoType enumeration and followed by a 32 bit size in bytes of the data for that type. This allows parsers to quickly look for the data they care about for a given address and skip any data they don't need.
Definitions:
enum class uint32_t {
EndOfList = 0,
LineTable = 1,
InlineInfo = 2
} InfoType;
typedef struct {
InfoType type;
uint32_t length;
uint8_t data[length];
} InfoEntry;
Data layout on disk:
.align(sizeof(uint32_t))
uint32_t size; // Size in bytes of this function or symbol
uint32_t name; // String table offset in the string table
InfoEntry info[]; // Variable size chunks if formatted data
// associated with this address. Terminated by
// a InfoEntry struct with a type of InfoType::EndOfList.
// Each InfoEntry struct is aligned on a 32 bit
// boundary.
ADDRESS LOOKUPS
Address lookups involve taking the address passed in, subtracting the base_address from the header, then looking up the address offset using a binary search in addr_offsets to find the index of the matching address. Using this index, grab the offset for the address information from the addr_info_offsets[index], and then parse the address info for the address. The address info contains the byte size, so we must ensure that the address info contians the address before reporting the result. The address info contains the function name, function size and the line tables entries for all addresses in the function if there was debug info for the function. If the function came from the symbol table, there might not be file and line information available.
The address queries are very efficient as the address search is searching an array of offsets and we will touch a minimal number of cache lines and pages when doing address lookups. The address offset index that is found is then used access the offset of the data and we go straight to the data that contains the address info.
INFOTYPE SPECIFICATIONS
gsym can support an expandable amount of information types. This allows this file format to be used to store more information in the future for a given address range. The gsym.py script currently supports generating two types of data blocks: InfoType::LineTable and InfoType::InlineInfo.
