gcc / ld madness

November 7, 2012

So, I started reading The Definitive Guide to the Xen Hypervisor (again :P), and I thought it would be fun to start with the example guest kernel provided by the author and extend it a bit (yeah, there’s mini-os already in extras/, but I wanted to struggle with all the peculiarities of extended inline asm, x86_64 asm, linker scripts, C macros etc. myself :P).

After doing some reading about x86_64 asm, I ‘ported’ the example kernel to 64-bit and gave it a try. And of course, it crashed. While I was responsible for the first couple of crashes (about which, btw, I could write at least 2-3 additional blog posts :P), I got stuck with this error:

traps.c:470:d100 Unhandled bkpt fault/trap [#3] on VCPU 0 [ec=0000]
RIP:    e033:<0000000000002271>

when trying to boot the example kernel as a domU (under xen-unstable).

0x2000 is the address where Xen maps the hypercall page inside the domU’s address space. The guest crashed when trying to issue any hypercall (HYPERVISOR_console_io in this case). At first, I thought I had screwed up the x86_64 extended inline asm used to perform the hypercall, so I checked how the hypercall macros were implemented both in the Linux kernel (wow btw, it’s pretty scary) and in the mini-os kernel. But I got the same crash with both of them.

After some more debugging, I made it work. In my Makefile, I used gcc to link all of the object files into the guest kernel. When I switched to ld, it worked. Apparently, when you use gcc to link object files, it calls the linker with a lot of options you might not want. Invoking gcc with the -v option reveals that gcc calls collect2 (a wrapper around the linker), which then calls ld with various options (certainly not only the ones I was passing to my ‘linker’). One of them was --build-id, which generates a .note.gnu.build-id ELF note section in the output file, containing a hash that identifies the linked file.

Apparently, this note changes the layout of the resulting ELF file, and ‘shifts’ the .text section to 0x30 from 0x0, and hypercall_page ends up at 0x2030 instead of 0x2000. Thus, when I ‘called’ into the hypercall page, I ended up at some arbitrary location instead of the start of the specific hypercall handler I was going for. But it took me quite some time of debugging before I did an objdump -dS [kernel] (and objdump -x [kernel]), and found out what was going on.

The code from bootstrap.x86_64.S looks like this (notice the .org 0x2000 before the hypercall_page global symbol):

        .text
        .code64
	.globl	_start, shared_info, hypercall_page
_start:
	cld
	movq stack_start(%rip),%rsp
	movq %rsi,%rdi
	call start_kernel

stack_start:
	.quad stack + 8192
	
	.org 0x1000
shared_info:
	.org 0x2000

hypercall_page:
	.org 0x3000	

One solution, mentioned earlier, is to switch to ld (which probably makes more sense) instead of using gcc. The other solution is to tweak the ELF file layout through the linker script (this is actually pretty much what the Linux kernel does to work around this):

OUTPUT_FORMAT("elf64-x86-64", "elf64-x86-64", "elf64-x86-64")
OUTPUT_ARCH(i386:x86-64)
ENTRY(_start)

PHDRS {
	text PT_LOAD FLAGS(5);		/* R_E */
	data PT_LOAD FLAGS(7);		/* RWE */
	note PT_NOTE FLAGS(0);		/* ___ */
}

SECTIONS
{
	. = 0x0;			/* Start of the output file */
	_text = .;			/* Text and ro data */
	.text : {
		*(.text)
	} :text = 0x9090 

	_etext = .;			/* End of text section */

	.rodata : {			/* ro data section */
		*(.rodata)
		*(.rodata.*)
	} :text

	.note : { 
		*(.note.*)
	} :note

	_data = .;
	.data : {			/* Data */
		*(.data)
	} :data

	_edata = .;			/* End of data section */	
}

And now that my kernel boots, I can go back to copy-pasting code from the book … erm hacking. :P

Disclaimer: I’m not very familiar with lds scripts or x86_64 asm, so don’t trust this post too much. :P

Here are some things I learned while reading the Linux kernel source code (some of which took me a couple of hours of googling and searching through documentation, git commit messages, threads on lkml etc. :P).

1) You cannot write extended inline assembly at the top level, i.e. when you want to use extended inline assembly to pass the value of some C variables or constants, you can only do it inside a function. And as I found out, someone had filed a bug about it in the GCC Bugzilla. So something like this

static const char foo[] = "Hello, world!";
enum { bar = 17 };
asm(".pushsection baz; .long %c0, %c1, %c2; .popsection"
    : : "i" (foo), "i" (sizeof(foo)), "i" (bar));

won’t work.

2) I didn’t dig very deep into the inline asm documentation, but I couldn’t find the difference between %c0 and %0. %c0 is used in the example code above, and in a kernel macro I saw. I understood that it had to do with some kind of ‘constant casting’, but I couldn’t find the exact difference anywhere. So I wrote a simple piece of code to clarify that:

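int main(void)
{
	/* Only meant for gcc -S: if run, the second mov would load from address 255. */
	asm("movl %0, %%eax; movl %c0, %%eax"
		:: "i" (0xff) : "eax");
	return 0;
}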

and after

gcc -S foo.c

I get:

movl $255, %eax
movl 255, %eax

So %0 is for when we want an integer constant as an immediate operand in instructions like mov, add etc., which means it gets the $ prefix, while %c0 gives the bare number, for directives like .long, .size etc. which demand an absolute expression/value as ‘arguments’. (In the second mov above, the bare 255 is treated as a memory address, not as an immediate.)

3) When using the section attribute on a variable in order to change the section it belongs to, you cannot change the section’s type to nobits; it’ll be progbits by default. progbits means that the section will actually get space allocated inside the executable (like the text and data sections), in contrast to nobits sections such as bss.
i.e. you can’t do this

static char foo __attribute__((section("bar", nobits)));

4) I also found out about the .pushsection and .popsection asm directives, which manipulate the ELF section stack and seem to be very useful on certain occasions. .pushsection pushes the current section onto the section stack and replaces it with the section passed as argument, while .popsection replaces the current section with the one on top of the section stack.

5) Finally, the ‘used’ attribute, which indicates that the symbol (a function in our case) is actually used/called/referenced even if the compiler can’t ‘see’ it (otherwise I think compiler optimizations would omit code generation for that function).

And now a kernel macro which includes all of the above:

/*
 * Reserve space in the brk section.  The name must be unique within
 * the file, and somewhat descriptive.  The size is in bytes.  Must be
 * used at file scope.
 *
 * (This uses a temp function to wrap the asm so we can pass it the
 * size parameter; otherwise we wouldn't be able to.  We can't use a
 * "section" attribute on a normal variable because it always ends up
 * being @progbits, which ends up allocating space in the vmlinux
 * executable.)
 */
#define RESERVE_BRK(name,sz)						\
	static void __section(.discard.text) __used			\
	__brk_reservation_fn_##name##__(void) {				\
		asm volatile (						\
			".pushsection .brk_reservation,\"aw\",@nobits;" \
			".brk." #name ":"				\
			" 1:.skip %c0;"					\
			" .size .brk." #name ", . - 1b;"		\
			" .popsection"					\
			: : "i" (sz));					\
	}

And a slightly more detailed explanation from the git commit:

The C definition of RESERVE_BRK() ends up being more complex than
one would expect to work around a cluster of gcc infelicities:

The first attempt was to simply try putting __section(.brk_reservation)
on a variable. This doesn’t work because it ends up making it a
@progbits section, which gets actual space allocated in the vmlinux
executable.

The second attempt was to emit the space into a section using asm,
but gcc doesn’t allow arguments to be passed to file-level asm()
statements, making it hard to pass in the size.

The final attempt is to wrap the asm() in a function to allow
it to have arguments, and put the function itself into the
.discard section, which vmlinux*.lds drops entirely from the
emitted vmlinux.

Another thing to notice is that the wrapper function is put in the .discard.text section, which according to vmlinux.lds (the linker script used to link the vmlinux executable) will be discarded and thus not included in the executable.
From scripts/module-common.lds:

/*
 * Common module linker script, always used when linking a module.
 * Archs are free to supply their own linker scripts.  ld will
 * combine them automatically.
 */
SECTIONS {
	/DISCARD/ : { *(.discard) }
}

The purpose of the RESERVE_BRK macro, and the brk-like allocator for very early memory allocations needed during the kernel boot process, is an interesting story too (which means another post coming soon)! ;)

I wanted to understand how the code for the early page tables, which the kernel builds just before paging is enabled, works. x86 AT&T assembly isn’t very easy to understand, but it’s fun. :P

page_pde_offset = (__PAGE_OFFSET >> 20);

        movl $pa(__brk_base), %edi
        movl $pa(swapper_pg_dir), %edx
        movl $PTE_IDENT_ATTR, %eax
10:
        leal PDE_IDENT_ATTR(%edi),%ecx          /* Create PDE entry */
        movl %ecx,(%edx)                        /* Store identity PDE entry */
        movl %ecx,page_pde_offset(%edx)         /* Store kernel PDE entry */
        addl $4,%edx
        movl $1024, %ecx
11:
        stosl
        addl $0x1000,%eax
        loop 11b
        /*
         * End condition: we must map up to the end + MAPPING_BEYOND_END.
         */
        movl $pa(_end) + MAPPING_BEYOND_END + PTE_IDENT_ATTR, %ebp
        cmpl %ebp,%eax
        jb 10b
        addl $__PAGE_OFFSET, %edi
        movl %edi, pa(_brk_end)
        shrl $12, %eax
        movl %eax, pa(max_pfn_mapped)

Here’s the piece of code.
Have fun! :)
