Compilerbook For macOS arm64

Original reference: https://www.sigbus.info/compilerbook

This repository is a native Apple Silicon/macOS rewrite track for compilerbook. The original book targets Linux x86-64; this track keeps the incremental compiler-building style, but changes the backend to AArch64 assembly assembled and linked by Apple clang.

The important rule for this rewrite: the compiler must not evaluate the input program in C and print a constant result. It should tokenize, parse, build an AST, and generate assembly that computes the result at runtime.

Target

  • Host OS: macOS
  • CPU: Apple Silicon arm64 / AArch64
  • Assembler/linker driver: Apple clang
  • Output path: the compiler emits a .s file, which clang tmp.s -o tmp assembles and links
  • Executable format: Mach-O
  • First practical goal: compile small C-like snippets into a runnable program whose exit status is the expected result

Basic workflow

sh
make
./armcc tmp.c > tmp.s
clang tmp.s -o tmp
./tmp
echo $?

For early calculator stages, direct source-string input is also useful:

sh
./armcc 'main(){ return 1+2*3; }' > tmp.s
clang tmp.s -o tmp
./tmp
echo $?

Expected exit code: 7.

Run the current test set:

sh
make test

Run every phase snapshot:

sh
make phase-test

Important differences from the original target

The original project assumes Linux x86-64. On macOS arm64, adjust these areas:

  • Use AArch64 instructions instead of x86-64 instructions.
  • Use the Apple arm64 calling convention.
  • Prefix C symbols with _, such as _main and _printf.
  • Keep the stack 16-byte aligned.
  • Use Mach-O section names and relocation syntax.
  • Use Apple clang as the assembler/linker driver.
  • Do not rely on Linux ELF details.
  • Do not rely on Linux-style static linking.

Minimal generated program

The smallest useful generated assembly is:

asm
.globl _main
_main:
    mov x0, #42
    ret

Build and run:

sh
clang tmp.s -o tmp
./tmp
echo $?

Expected exit code: 42.
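
On the compiler side, the first phase only needs to print those four lines. A minimal sketch in C, assuming a hypothetical helper named emit_return_const (the name and the FILE * parameter are illustrative, not taken from the repository):

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: writes a complete program whose exit status is `value`.
   A process exit status only keeps the low 8 bits, so this sketch
   rejects anything outside 0..255. */
void emit_return_const(FILE *out, int value) {
    assert(value >= 0 && value <= 255);
    fprintf(out, ".globl _main\n");
    fprintf(out, "_main:\n");
    fprintf(out, "    mov x0, #%d\n", value);   /* return value goes in x0 */
    fprintf(out, "    ret\n");
}
```

Calling emit_return_const(stdout, 42) prints the assembly shown above. Passing a FILE * rather than always using printf is a design choice made here so the output can be redirected in tests.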

Apple arm64 ABI notes

General-purpose registers:

  • x0 to x7: integer/pointer arguments
  • x0: integer/pointer return value
  • x29: frame pointer
  • x30: link register
  • sp: stack pointer

Rules to respect:

  • Stack alignment at call boundaries must be 16 bytes.
  • Function calls use bl _function_name.
  • Return uses ret.
  • C-visible symbols generally need a leading underscore.

Example call:

asm
    mov x0, #3
    mov x1, #4
    bl _add

The result is in x0.
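
The call sequence can be produced by a small emitter. The sketch below assumes a hypothetical helper, emit_call_const_args, that only handles small constant arguments; a real code generator would evaluate each argument expression first. It also assumes the caller has already saved x30 in its frame, because bl overwrites it.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: loads up to eight small constants into x0..x7,
   then calls the C-visible function `name` (the underscore is added here). */
void emit_call_const_args(FILE *out, const char *name,
                          const int *args, int nargs) {
    assert(nargs >= 0 && nargs <= 8);   /* only x0..x7 carry arguments here */
    for (int i = 0; i < nargs; i++) {
        /* 0..65535 always fits a single mov immediate */
        assert(args[i] >= 0 && args[i] <= 65535);
        fprintf(out, "    mov x%d, #%d\n", i, args[i]);
    }
    fprintf(out, "    bl _%s\n", name);
}
```

With args = {3, 4} and name "add", this prints the three instructions from the example call.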

Function frame

Once local variables or nested calls exist, use a normal frame:

asm
.globl _main
_main:
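    stp x29, x30, [sp, #-16]!   // save frame pointer and return address
    mov x29, sp                 // establish the new frame pointer
    sub sp, sp, #32             // reserve 32 bytes for locals

    mov x0, #42                 // function body: result in x0

    mov sp, x29                 // release the locals
    ldp x29, x30, [sp], #16     // restore frame pointer and return address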
    ret

Local variables can live at negative offsets from x29:

asm
    str x0, [x29, #-8]
    ldr x0, [x29, #-8]

Round the local stack area up to a multiple of 16 so that sp is still 16-byte aligned after the sub.
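
The rounding and the frame setup can be sketched in C as follows. The helper names (align_to, emit_prologue, emit_epilogue) are illustrative, and the sketch assumes the frame stays below 4096 bytes so that the size fits the immediate field of a single sub instruction.

```c
#include <assert.h>
#include <stdio.h>

/* Round n up to the next multiple of align (align must be a power of two). */
int align_to(int n, int align) {
    assert(n >= 0 && align > 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/* Hypothetical helper: function label plus frame setup for
   `locals_size` bytes of local variables. */
void emit_prologue(FILE *out, const char *name, int locals_size) {
    int frame = align_to(locals_size, 16);
    assert(frame < 4096);               /* assumption: one sub is enough */
    fprintf(out, ".globl _%s\n", name);
    fprintf(out, "_%s:\n", name);
    fprintf(out, "    stp x29, x30, [sp, #-16]!\n");
    fprintf(out, "    mov x29, sp\n");
    if (frame > 0)
        fprintf(out, "    sub sp, sp, #%d\n", frame);
}

/* Hypothetical helper: undo the frame and return. */
void emit_epilogue(FILE *out) {
    fprintf(out, "    mov sp, x29\n");
    fprintf(out, "    ldp x29, x30, [sp], #16\n");
    fprintf(out, "    ret\n");
}
```

For example, three 8-byte locals need 24 bytes, and align_to(24, 16) yields 32, which matches the sub sp, sp, #32 in the frame above.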

Stack-machine expression codegen

The compilerbook uses a stack-machine style for early expression codegen. The same idea works on arm64.

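Use x0 as the current expression result. Each push takes a full 16-byte slot, even though the value is only 8 bytes, so that sp stays 16-byte aligned.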

Push:

asm
    sub sp, sp, #16
    str x0, [sp]

Pop:

asm
    ldr x1, [sp]
    add sp, sp, #16

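Arithmetic (emit the one instruction that matches the operator; x1 holds the left operand, x0 the right):

asm
    add x0, x1, x0    // +
    sub x0, x1, x0    // -
    mul x0, x1, x0    // *
    sdiv x0, x1, x0   // / (signed)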

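Comparison (emit cmp, then the one cset that matches the operator):

asm
    cmp x1, x0
    cset x0, eq    // ==
    cset x0, ne    // !=
    cset x0, lt    // <  (signed)
    cset x0, le    // <= (signed)

For > and >=, either swap the operands and reuse lt and le, or keep the operand order and use gt and ge after cmp.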
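
Put together, the push, pop, arithmetic, and comparison fragments form one recursive generator. The sketch below assumes a hypothetical Node type and a gen function; the real compiler's AST may be shaped differently. It also assumes the parser has already rewritten a > b as b < a, so no ND_GT kind is needed.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical AST shape; the real compiler's node type may differ. */
typedef enum { ND_NUM, ND_ADD, ND_SUB, ND_MUL, ND_DIV,
               ND_EQ, ND_NE, ND_LT, ND_LE } NodeKind;

typedef struct Node {
    NodeKind kind;
    struct Node *lhs, *rhs;
    int val;                       /* used only when kind == ND_NUM */
} Node;

/* Emit code that leaves the value of `node` in x0. */
void gen(FILE *out, const Node *node) {
    if (node->kind == ND_NUM) {
        assert(node->val >= 0 && node->val <= 65535);  /* single mov immediate */
        fprintf(out, "    mov x0, #%d\n", node->val);
        return;
    }

    gen(out, node->lhs);
    fprintf(out, "    sub sp, sp, #16\n");   /* push lhs */
    fprintf(out, "    str x0, [sp]\n");
    gen(out, node->rhs);                     /* rhs is now in x0 */
    fprintf(out, "    ldr x1, [sp]\n");      /* pop lhs into x1 */
    fprintf(out, "    add sp, sp, #16\n");

    switch (node->kind) {
    case ND_ADD: fprintf(out, "    add x0, x1, x0\n");  break;
    case ND_SUB: fprintf(out, "    sub x0, x1, x0\n");  break;
    case ND_MUL: fprintf(out, "    mul x0, x1, x0\n");  break;
    case ND_DIV: fprintf(out, "    sdiv x0, x1, x0\n"); break;
    case ND_EQ:  fprintf(out, "    cmp x1, x0\n    cset x0, eq\n"); break;
    case ND_NE:  fprintf(out, "    cmp x1, x0\n    cset x0, ne\n"); break;
    case ND_LT:  fprintf(out, "    cmp x1, x0\n    cset x0, lt\n"); break;
    case ND_LE:  fprintf(out, "    cmp x1, x0\n    cset x0, le\n"); break;
    default:     assert(0 && "unexpected node kind");
    }
}
```

For the tree of 1+2*3, this emits the code for 2*3 between the push and pop of 1, so the multiplication runs at runtime rather than being folded by the compiler.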

Branches and labels

Use cmp plus conditional branches:

asm
    cmp x0, #0
    b.eq .L.else.0
    ...
    b .L.end.0
.L.else.0:
    ...
.L.end.0:

Generate unique labels with a monotonically increasing counter.
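
One way to keep labels unique is to take a single id per statement and use it for all of that statement's labels. The following sketch uses hypothetical helpers (next_label_id, emit_if_begin, emit_if_else, emit_if_end); the code for the condition and the two branches is assumed to be generated by the caller between these calls.

```c
#include <stdio.h>

static int label_counter = 0;

/* Return a fresh id; ids never repeat within one compiler run. */
int next_label_id(void) {
    return label_counter++;
}

/* Call after the condition's value is in x0. Returns the id to pass on. */
int emit_if_begin(FILE *out) {
    int id = next_label_id();
    fprintf(out, "    cmp x0, #0\n");
    fprintf(out, "    b.eq .L.else.%d\n", id);
    return id;
}

/* Call after the then-branch has been generated. */
void emit_if_else(FILE *out, int id) {
    fprintf(out, "    b .L.end.%d\n", id);
    fprintf(out, ".L.else.%d:\n", id);
}

/* Call after the else-branch (possibly empty) has been generated. */
void emit_if_end(FILE *out, int id) {
    fprintf(out, ".L.end.%d:\n", id);
}
```

Because the id is taken before the branches are generated, a nested if inside either branch receives a different id and its labels cannot collide with the outer ones.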

Function definitions

For a function:

c
int add(int x, int y) {
  return x + y;
}

Emit:

asm
.globl _add
_add:
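    stp x29, x30, [sp, #-16]!   // prologue
    mov x29, sp
    sub sp, sp, #16             // two 8-byte slots for x and y

    str x0, [x29, #-8]          // x
    str x1, [x29, #-16]         // y

    ldr x0, [x29, #-8]          // load x
    sub sp, sp, #16             // push x
    str x0, [sp]
    ldr x0, [x29, #-16]         // load y
    ldr x1, [sp]                // pop x into x1
    add sp, sp, #16
    add x0, x1, x0              // x + y

    mov sp, x29                 // epilogue
    ldp x29, x30, [sp], #16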
    ret

Store incoming arguments into local stack slots first. This keeps later codegen simple because parameters and local variables are accessed the same way.
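
The argument stores follow a regular pattern, so they can be emitted in a loop. This sketch assumes a hypothetical layout in which parameter i occupies the 8-byte slot at [x29, #-8*(i+1)], matching the add example above, and that at most eight parameters arrive in registers.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical layout: parameter 0 at -8, parameter 1 at -16, and so on. */
int param_offset(int index) {
    return -8 * (index + 1);
}

/* Copy the register arguments x0..x(nparams-1) into their stack slots. */
void emit_store_params(FILE *out, int nparams) {
    assert(nparams >= 0 && nparams <= 8);   /* register arguments only */
    for (int i = 0; i < nparams; i++)
        fprintf(out, "    str x%d, [x29, #%d]\n", i, param_offset(i));
}
```

After this runs, the code generator can treat a parameter like any other local: it only needs the slot offset, not the register the value arrived in.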

Globals

Writable global integer:

asm
.data
.globl _g
_g:
    .quad 3

Load global address:

asm
    adrp x0, _g@PAGE
    add x0, x0, _g@PAGEOFF

Load global value:

asm
    adrp x0, _g@PAGE
    add x0, x0, _g@PAGEOFF
    ldr x0, [x0]

Store global value:

asm
    adrp x1, _g@PAGE
    add x1, x1, _g@PAGEOFF
    str x0, [x1]
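
All three sequences share the same adrp/add pair, so one helper can produce it for any register. The helper names below (emit_global_addr, emit_global_load, emit_global_store) are hypothetical, and the sketch assumes 8-byte globals as in the .quad example.

```c
#include <assert.h>
#include <stdio.h>

/* Put the address of global `name` into register x<reg>. */
void emit_global_addr(FILE *out, int reg, const char *name) {
    assert(reg >= 0 && reg <= 28);
    fprintf(out, "    adrp x%d, _%s@PAGE\n", reg, name);
    fprintf(out, "    add x%d, x%d, _%s@PAGEOFF\n", reg, reg, name);
}

/* Load the 8-byte value of global `name` into x0. */
void emit_global_load(FILE *out, const char *name) {
    emit_global_addr(out, 0, name);
    fprintf(out, "    ldr x0, [x0]\n");
}

/* Store x0 into global `name`, using x1 as the address register. */
void emit_global_store(FILE *out, const char *name) {
    emit_global_addr(out, 1, name);
    fprintf(out, "    str x0, [x1]\n");
}
```

The store uses x1 for the address so that the value in x0 survives and can serve as the result of the assignment expression.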

String literals

String literals should go in a Mach-O string section:

asm
.section __TEXT,__cstring
.L.str.0:
    .asciz "hello"

Load string address:

asm
    adrp x0, .L.str.0@PAGE
    add x0, x0, .L.str.0@PAGEOFF

Then switch back to text before functions:

asm
.text
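
When the compiler emits a literal, characters such as quotes, backslashes, and newlines have to be escaped so that the assembler reads the string back correctly. A minimal sketch, assuming a hypothetical emit_string_literal helper and octal escapes for everything outside printable ASCII:

```c
#include <stdio.h>

/* Hypothetical helper: emit string literal number `id` into the cstring
   section, then switch back to .text for the code that follows. */
void emit_string_literal(FILE *out, int id, const char *s) {
    fprintf(out, ".section __TEXT,__cstring\n");
    fprintf(out, ".L.str.%d:\n", id);
    fprintf(out, "    .asciz \"");
    for (const unsigned char *p = (const unsigned char *)s; *p; p++) {
        if (*p == '"' || *p == '\\')
            fprintf(out, "\\%c", *p);               /* escape quote and backslash */
        else if (*p < 0x20 || *p >= 0x7f)
            fprintf(out, "\\%03o", (unsigned)*p);   /* three-digit octal escape */
        else
            fputc(*p, out);
    }
    fprintf(out, "\"\n");
    fprintf(out, ".text\n");
}
```

Taking the id from the same counter that numbers branch labels is one way to keep the .L.str names unique.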

Suggested phase map

The phase directories intentionally track compilerbook's step granularity more closely than a conventional project milestone plan.

  • phase-01-int: integer literal, e.g. 42
  • phase-02-add-sub: 5+20-4
  • phase-03-tokenizer: tokenizer and whitespace handling
  • phase-04-errors: source-location errors
  • phase-05-mul-div-parens: *, /, precedence, parentheses
  • phase-06-unary: unary + and -
  • phase-07-comparisons: ==, !=, <, <=, >, >=
  • phase-08-file-split: split compiler source files
  • phase-09-single-letter-locals: one-letter local variables
  • phase-10-multiletter-locals: multi-letter local variables
  • phase-11-return: return
  • phase-12-control-flow: if, else, while, for
  • phase-13-blocks: { ... }
  • phase-14-function-calls: function calls
  • phase-15-function-definitions: function definitions and parameters

Next phase: types, pointers, arrays

Add int, pointer types, &, *, pointer arithmetic, arrays, indexing, and sizeof.

Tests:

c
int main() { int x; x=3; int *p; p=&x; return *p; }
int main() { int a[3]; a[0]=3; a[1]=4; return a[0]+a[1]; }
int main() { int a[3]; return sizeof(a); }

Later phase: globals and strings

Add global variables, global arrays, the char type, and string literals.

Tests:

c
int g; int main() { g=3; return g; }
int g=5; int main() { return g; }
int main() { return *"A"; }
char *s="B"; int main() { return *s; }

Later phase: richer file input and C tests

Read source from a file and run a C-snippet test harness.

Test harness shape:

sh
printf '%s\n' "$input" > tmp.c
./armcc tmp.c > tmp.s
clang tmp.s -o tmp
./tmp

Common macOS arm64 pitfalls

  • Forgetting the leading _ on exported C symbols.
  • Misaligning the stack before bl.
  • Treating Mach-O sections like ELF sections.
  • Trying to use Linux static-linking assumptions.
  • Using x86-64 stack/register examples directly.
  • Loading addresses without @PAGE and @PAGEOFF.
  • Forgetting that x30 must be preserved when making nested calls.
  • Letting generated labels collide.

Repository layout

text
Makefile
compilerbook-macos-arm64.md
phases/
  phase-01-int/
  phase-02-add-sub/
  phase-03-tokenizer/
  phase-04-errors/
  phase-05-mul-div-parens/
  phase-06-unary/
  phase-07-comparisons/
  phase-08-file-split/
  phase-09-single-letter-locals/
  phase-10-multiletter-locals/
  phase-11-return/
  phase-12-control-flow/
  phase-13-blocks/
  phase-14-function-calls/
  phase-15-function-definitions/

Keep every phase runnable. A phase should be small enough that make phase-test proves its behavior without depending on unfinished later work.