How To Retarget the GNU Toolchain in 21 Patches

Preamble to the github Edition

The text below was originally written for a series of daily blog posts I wrote in 2008 about porting the GNU Toolchain to a new target. The old blog site is down, so I have collected and re-hosted the articles here at github with all of the original source patches.

In retrospect, some of the original thinking behind this project was naive. However, the end result was an interesting and useful project that I hope others may learn from.

The source patches apply to 2008-era GCC and SRC trees for the GNU Toolchain. They may not be useful from a development perspective today, but as a teaching tool they still have value.

The experimental 'ggx' target was eventually renamed to 'moxie', and work continues here:

http://moxielogic.org/blog
http://github.com/atgreen/moxie-cores

I'd love to hear from people who found this useful or who have questions. I can be reached at green@moxielogic.com.

Happy Hacking!

Anthony Green

Preamble: Top-Down ISA Evolution

The design process of an Instruction Set Architecture (ISA) has always struck me as being backwards. I imagine that they are mostly designed by hardware engineers whose decisions are largely influenced by the medium in which they work - hardware description languages. Every decision is influenced to some degree by implementation difficulty or limitations of the hardware.

Only the lucky compiler developers have a chance to review an ISA before it is fixed in stone. The few times I've seen this happen it has always been to everyone's benefit. Perhaps there are instructions that the compiler could never use, or perhaps the compiler would generate better code if only it had some overlooked addressing mode.

I've often wondered what would happen if we reversed this process. What would an ISA look like that was solely influenced by the needs of the compiler writer, without regard to the hardware side of the house? Instead of designing from the hardware up, start from the compiler and design top-down.

As an experiment, I've developed the start of a GNU tools port to a new architecture that was wholly undefined at the start of the project. There was no instruction set architecture definition, no register file definition, no defined ABI or OS interfaces. Instead, I defined these things on the fly, letting things evolve naturally based on the needs of the compiler. The C compiler, assembler, linker, binutils and simulator were developed concurrently. So, for instance, when the compiler required an "add" instruction, each tool would be taught about "add" and so on.

I'm going to blog a patch a day to show my progress. Each patch will be small and buildable. I hope to show that you can get from nothing to running real programs using this top-down ISA design method in a surprisingly short amount of time.

A couple of caveats first... I am not an experienced compiler writer. I've been involved in many new GNU ports over the years at Cygnus and Red Hat, but mostly in a bizdev capacity. I have a good idea how all the pieces go together, but this will be my first attempt at it and I'm certain to botch some things. Second, I'm by no means an expert in ISA design. I'm taking the lazy man's route to ISA design:

1. Try to build something with gcc
2. See what the compiler is complaining about
3. Implement it
4. Go to 1.

Every design decision is driven by whatever is easiest to implement. What I expect I'll end up with is a simple although wildly inefficient architecture. Then perhaps we can optimize this toy ISA based on real-world code generation. At the end of the day, however, my goal is for this to be a fun and interesting experiment.

I'll post my first patch later today.

Patch 1: Naming the Target

The GNU tools are maintained in two separate repositories: src and gcc. The src tree contains the GNU binutils, gdb, instruction set simulators, cgen, newlib (a C runtime library), winsup (cygwin runtime support) and more. The gcc tree contains the GNU compiler collection and related utilities. These trees are designed to be merged into a single tree (sometimes called a Cygnus tree) so you can configure and build the entire toolchain in one go.

Those of you interested in some history might want to read the mostly accurate article and thread here: http://www.sourceware.org/ml/gdb/2000-09/msg00009.html.

I'm not going to build everything from a merged tree. The unfortunate truth is that shared top level files in the two trees often get out of sync with one another. For instance, sometimes the trees depend on incompatible versions of the autotools, and keeping everything happy together is more work than I care to do at this point.

Now that we've decided on two trees, I'm going to start in the src tree.

The only thing we need to decide at this point is the architecture's name. I'm calling this "ggx" for now.

The patch linked below adds the top level configury for our port. It just tells the build system to recognize our target name, and to not configure or build any of the subdirectories requiring target specific work.

Once patched, you should be able to configure and build the toolchain like so:

$ ./configure --target=ggx-elf
$ make

The "ggx-elf" target simply tells the build system that we want a toolchain to generate ELF object files for the ggx architecture.

All this builds is a handful of support libraries for the host system. It's not much, but it's a start!

Patches:

http://github.com/atgreen/ggx/blob/master/ggx-01-src.patch

Patch 2: BFD!

Today we'll build BFD, which is a library for reading and writing object files. We're mostly concerned with ELF, but BFD handles a number of other formats as well.

BFD is a backronym for Binary File Descriptor, but we all know it originally stood for Big F'ing Deal. The BFD manual puts it thusly:

The name came from a conversation David Wallace was having with Richard Stallman about the library: RMS said that it would be quite hard--David said "BFD". Stallman was right, but the name stuck.

David Henkel-Wallace (aka Gumby) was one of Cygnus' founders, and, at least as I recall, his California license plate at one time was "GNU BFD".

Now we have to make some real decisions about the ggx architecture, but not very difficult ones. The most significant features we're committing to in this patch are 32-bit words, 32-bit addresses and 8-bit bytes (see cpu-ggx.c). Two reasons for picking these values:

the GNU tools are really good at targeting 32-bit systems and
8- and 16-bit systems won't run any interesting Free Software.

Also note that we're defining the ELF machine number for ggx (0xFEED in include/elf/common.h). This number is encoded in all ggx object files and executables so other tools (objdump, gdb, etc) can identify them as such. There's a standards body somewhere that maintains the master list of ELF machine numbers. I just picked a random number right now, but it's something you need to worry about if you start working with other tool vendors for your processor (JTAG hardware debuggers, for instance).
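To make the role of the machine number concrete, here is a minimal sketch of how a tool might recognize a ggx object file from its ELF header alone. It is not code from the patch: the `sketch_` names are hypothetical, and only the value 0xFEED comes from the text above. The field offsets are the standard ELF ones (e_machine is the 16-bit field at byte 18, and byte 5 of e_ident gives the byte order). Nothing so far pins down the byte order of ggx, so the sketch accepts either.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name for this sketch. The value is the one the patch
   adds to include/elf/common.h. */
#define SKETCH_EM_GGX 0xFEED

/* Layout constants from the ELF specification. */
enum {
  SKETCH_EI_DATA = 5,           /* index of the byte-order marker in e_ident */
  SKETCH_ELFDATA2LSB = 1,       /* little-endian file */
  SKETCH_ELFDATA2MSB = 2,       /* big-endian file */
  SKETCH_E_MACHINE_OFFSET = 18, /* start of the 16-bit e_machine field */
  SKETCH_MIN_HEADER = 20        /* bytes needed to reach past e_machine */
};

/* Return the e_machine value of an ELF image, or -1 if the buffer is
   too short, lacks the ELF magic, or has an unknown byte-order marker. */
long sketch_elf_machine(const unsigned char *buf, size_t len)
{
  assert(buf != NULL);
  if (len < SKETCH_MIN_HEADER)
    return -1;
  if (buf[0] != 0x7f || buf[1] != 'E' || buf[2] != 'L' || buf[3] != 'F')
    return -1;

  const unsigned char *p = buf + SKETCH_E_MACHINE_OFFSET;
  if (buf[SKETCH_EI_DATA] == SKETCH_ELFDATA2LSB)
    return (long)(p[0] | (p[1] << 8));
  if (buf[SKETCH_EI_DATA] == SKETCH_ELFDATA2MSB)
    return (long)((p[0] << 8) | p[1]);
  return -1;
}

/* Nonzero if the image claims to be a ggx object file. */
int sketch_is_ggx(const unsigned char *buf, size_t len)
{
  return sketch_elf_machine(buf, len) == SKETCH_EM_GGX;
}
```

A caller would read the first 20 bytes of a file into a buffer and pass them to `sketch_is_ggx()`. Real tools go through BFD rather than poking at raw bytes, which is the point of today's patch.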

Apply this patch to your src tree, rebuild, and you'll have a libbfd. There's really not much to it. Most of this patch is made up of configury changes. Click on the link below to see the patch.

We're very close to having a working assembler for ggx. There's just one more infrastructure step to take care of first.

Patches:

http://github.com/atgreen/ggx/blob/master/ggx-02-src.patch

Patch 3: Bad Instructions

Much like yesterday's BFD patch, the patch below is mostly configury. It builds the opcodes library for our new target: the ggx machine.

The opcodes library describes how instructions are encoded in memory and how to disassemble them into text. It's used by tools like objdump, gdb and gas.

And, finally, details of our future architecture are beginning to emerge. This patch defines the first ggx instruction: "bad". This instruction represents an illegal opcode, and should issue something like an illegal instruction trap if we ever attempt to execute it.

There's no real reason to ever write a program with a "bad" instruction, but there's no avoiding defining it either. We need well defined behaviour when the core attempts to decode an undefined instruction. In truth, there are many "bad" instruction variants: one for each undefined opcode. It just so happens that they are all "bad" right now!

We haven't written anything yet about how opcodes are encoded, or even how wide the instructions are. If you look carefully at the disassembler in ggx-dis.c you'll see that we're just pulling single byte "instructions" out of memory and looking them up in an opcode table. This is just scaffolding and will be replaced with a real instruction decoder shortly.

Before we do that, we'll take our first major step tomorrow by creating a working assembler.
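For readers who don't want to open the patch, the byte-at-a-time scaffolding described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of ggx-dis.c: the `sketch_` names and the table layout are hypothetical, and the real disassembler plugs into the opcodes library's own interfaces instead of writing into a caller-supplied string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical opcode table: one slot per possible opcode byte.
   No real instructions exist yet, so every slot is left empty. */
static const char *const sketch_opcode_names[256] = { 0 };

/* Map an opcode byte to its mnemonic. Any undefined opcode is "bad",
   which today means all 256 of them. */
const char *sketch_opcode_name(unsigned char opcode)
{
  const char *name = sketch_opcode_names[opcode];
  return name != NULL ? name : "bad";
}

/* Disassemble the single-byte "instruction" at buf[offset] into out.
   Returns the number of bytes consumed (always 1 in this scaffolding),
   or 0 if offset is past the end of the buffer. */
size_t sketch_disassemble_one(const unsigned char *buf, size_t len,
                              size_t offset, char *out, size_t out_size)
{
  assert(buf != NULL && out != NULL);
  if (offset >= len)
    return 0;
  snprintf(out, out_size, "%s", sketch_opcode_name(buf[offset]));
  return 1;
}
```

Calling `sketch_disassemble_one()` in a loop, advancing the offset by the return value each time, walks a buffer one byte per instruction. Once the ISA has real encodings, the return value is where a wider instruction would report its actual length.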

Patches:

http://github.com/atgreen/ggx/blob/master/ggx-03-src.patch

Patch 4: Cooking with GAS

Today's patch to the src tree will let us build the GNU Assembler, gas, for the ggx architecture. You may recall from yesterday's post that we've defined our first and only instruction: "bad". This assembler will let us create our first object file of bad code.

Two routines of note are md_begin(), which populates a hashtable with all of our opcodes (only one so far), and md_assemble(), which parses ggx assembly text t
