gcc / ld madness

November 7, 2012

So, I started reading [The Definitive Guide to the Xen Hypervisor] (again :P), and I thought it would be fun to start with the example guest kernel, provided by the author, and extend it a bit (ye, there’s mini-os already in extras/, but I wanted to struggle with all the peculiarities of extended inline asm, x86_64 asm, linker scripts, C macros etc, myself :P).

After doing some reading about x86_64 asm, I ‘ported’ the example kernel to 64bit, and gave it a try. And of course, it crashed. While I was responsible for the first couple of crashes (for which btw, I can write at least 2-3 additional blog posts :P), I got stuck with this error:

traps.c:470:d100 Unhandled bkpt fault/trap [#3] on VCPU 0 [ec=0000]
RIP:    e033:<0000000000002271>

when trying to boot the example kernel as a domU (under xen-unstable).

0x2000 is the address where Xen maps the hypercall page inside the domU’s address space. The guest crashed when trying to issue any hypercall (HYPERCALL_console_io in this case). At first, I thought I had screwed up the x86_64 extended inline asm used to perform the hypercall, so I checked how the hypercall macros were implemented both in the Linux kernel (wow btw, it’s pretty scary), and in the mini-os kernel. But I got the same crash with both of them.

After some more debugging, I made it work. In my Makefile, I used gcc to link all of the object files into the guest kernel. When I switched to ld, it worked. Apparently, when you link object files with gcc, it calls the linker with a lot of options you might not want. Invoking gcc with the -v option reveals that gcc calls collect2 (a wrapper around the linker), which then calls ld with various options (certainly not only the ones I was passing to my ‘linker’). One of them was --build-id, which generates a .note.gnu.build-id ELF note section in the output file, containing a hash that identifies the linked file.

Apparently, this note changes the layout of the resulting ELF file, and ‘shifts’ the .text section to 0x30 from 0x0, so hypercall_page ends up at 0x2030 instead of 0x2000. Thus, when I ‘called’ into the hypercall page, I ended up at some arbitrary location instead of the start of the specific hypercall handler I was going for. But it took me quite some time of debugging before I did an objdump -dS [kernel] (and objdump -x [kernel]), and found out what was going on.
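To make the failure concrete, here is a small sketch of the address arithmetic. It assumes Xen's usual hypercall page layout, one 32-byte stub per hypercall number, and that console_io is hypercall number 18 (as in Xen's public xen.h). Neither number appears in this post, so treat both as assumptions.

```python
# Assumptions (from Xen's public ABI, not from this post):
HYPERCALL_STUB_SIZE = 32      # each hypercall gets a 32-byte stub in the page
CONSOLE_IO_NR = 18            # __HYPERVISOR_console_io

PAGE_BASE = 0x2000            # where Xen actually fills in the hypercall page
SHIFT = 0x30                  # extra bytes the build-id note put before .text


def stub_address(hypercall_page: int, nr: int) -> int:
    """Address a guest calls for hypercall number `nr`."""
    return hypercall_page + nr * HYPERCALL_STUB_SIZE


intended = stub_address(PAGE_BASE, CONSOLE_IO_NR)
actual = stub_address(PAGE_BASE + SHIFT, CONSOLE_IO_NR)
offset_in_stub = (actual - PAGE_BASE) % HYPERCALL_STUB_SIZE

print(hex(intended))      # 0x2240
print(hex(actual))        # 0x2270
print(offset_in_stub)     # 16
```

Under those assumptions the call lands 16 bytes into the neighbouring stub rather than at the start of the console_io one. If the byte there happens to be a one-byte int3 used as padding (again an assumption, but it would fit the bkpt trap), the reported RIP of 0x2271 is exactly the address right after it.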

The code from bootstrap.x86_64.S looks like this (notice the .org 0x2000 before the hypercall_page global symbol):

        .text
        .code64
	.globl	_start, shared_info, hypercall_page
_start:
	cld
	movq stack_start(%rip),%rsp
	movq %rsi,%rdi
	call start_kernel

stack_start:
	.quad stack + 8192
	
	.org 0x1000
shared_info:
	.org 0x2000

hypercall_page:
	.org 0x3000	

One solution, mentioned earlier, is to switch to ld (which probably makes more sense) instead of using gcc. The other solution is to tweak the ELF file layout through the linker script (actually this is pretty much what the Linux kernel does to work around this):

OUTPUT_FORMAT("elf64-x86-64", "elf64-x86-64", "elf64-x86-64")
OUTPUT_ARCH(i386:x86-64)
ENTRY(_start)

PHDRS {
	text PT_LOAD FLAGS(5);		/* R_E */
	data PT_LOAD FLAGS(7);		/* RWE */
	note PT_NOTE FLAGS(0);		/* ___ */
}

SECTIONS
{
	. = 0x0;			/* Start of the output file */
	_text = .;			/* Text and ro data */
	.text : {
		*(.text)
	} :text = 0x9090 

	_etext = .;			/* End of text section */

	.rodata : {			/* ro data section */
		*(.rodata)
		*(.rodata.*)
	} :text

	.note : { 
		*(.note.*)
	} :note

	_data = .;
	.data : {			/* Data */
		*(.data)
	} :data

	_edata = .;			/* End of data section */	
}

And now that my kernel boots, I can go back to copy-pasting code from the book … erm hacking. :P

Disclaimer: I’m not very familiar with lds scripts or x86_64 asm, so don’t trust this post too much. :P

Update: Corrected fallocate and parted commands, and removed diratime mount option. Thanks to axil

Long time, no post.

For about a year now, I’ve been working at GRNET on its (OpenStack API compliant) open source IaaS cloud platform Synnefo, which powers the ~okeanos service.

Since ~okeanos is mainly aimed towards the Greek academic community (and thus has restrictions on who can use the service), we set up a ‘playground’ ‘bleeding-edge’ installation (okeanos.io) of Synnefo, where anyone can get a free trial account, experiment with the Web UI, and have fun scripting with the kamaki API client. So, you get to try the latest features of Synnefo, while we get valuable feedback. Sounds like a fair deal. :)

Unfortunately, since I’m the only one in our team who actually uses Gentoo Linux, up until recently Gentoo VMs were not available. So, a couple of days ago I decided it was about time to get a serious distro running on ~okeanos (the load of our servers had been ridiculously low after all :P). For future reference, and in case anyone wants to upload their own image on okeanos.io or ~okeanos, I’ll briefly describe the steps I followed.

1) Launch a Debian-base (who needs a GUI?) VM on okeanos.io

Everything from here on is done inside our Debian-base VM.

2) Use fallocate or dd seek= to create an (empty) file large enough to hold our image (5GB)

fallocate -o $((5 * 1024 * 1024 *1024)) -l 1 gentoo.img
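For what it's worth, both the fallocate invocation above and the dd seek= variant boil down to the same thing: touch a single byte at a 5 GiB offset, so the file ends up one byte longer than the offset. A minimal Python sketch of the same idea follows (the function and constant names are mine, and whether the result is actually sparse depends on the filesystem):

```python
import os
import tempfile

IMAGE_OFFSET = 5 * 1024 * 1024 * 1024   # same 5 GiB offset as the command above


def create_image(path: str, offset: int, length: int = 1) -> int:
    """Write `length` zero bytes at `offset`; return the resulting file size."""
    with open(path, "wb") as f:
        f.seek(offset)              # like dd seek= / fallocate -o
        f.write(b"\0" * length)     # like fallocate -l 1
    return os.path.getsize(path)


if __name__ == "__main__":
    # Cheap demo with a 1 MiB offset; pass IMAGE_OFFSET for the real image.
    with tempfile.TemporaryDirectory() as tmp:
        print(create_image(os.path.join(tmp, "demo.img"), 1024 * 1024))  # 1048577
```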

3) Losetup the image, partition and mount it

losetup -f gentoo.img
parted /dev/loop0 mklabel msdos
parted /dev/loop0 mkpart primary ext4 2048s 5G
kpartx -a /dev/loop0
mkfs.ext4 /dev/mapper/loop0p1
losetup /dev/loop1 /dev/mapper/loop0p1    # trick needed for the grub2 installation later on
mount /dev/loop1 /mnt/gentoo -t ext4 -o noatime

4) Chroot and install Gentoo in /mnt/gentoo. Just follow the handbook. At a minimum you’ll need to extract the base system and portage, and set up some basic configs, like networking. It’s up to you how much you want to customize the image. For the Linux Kernel, I just copied directly the Debian /boot/[vmlinuz|initrd|System.map] and /lib/modules/ of the VM (and it worked! :)).

5) Install sys-boot/grub-2.00 (I had some *minor* issues with grub-0.97 :P).

6) Install grub2 in /dev/loop0 (this should help). Make sure your device.map inside the Gentoo chroot looks like this:

(hd0) /dev/loop0
(hd1) /dev/loop1

and make sure you have a sane grub.cfg (I’d suggest replacing all references to UUIDs in grub.cfg and /etc/fstab with /dev/vda[1]).
Now, outside the chroot, run:

grub-install --root-directory=/mnt/gentoo --grub-mkdevicemap=/mnt/gentoo/boot/grub/device.map /dev/loop0

Clean up everything (umount, losetup -d, kpartx -d etc), and we’re ready to upload the image with snf-image-creator.

snf-image-creator takes a diskdump as input, launches a helper VM, cleans up the diskdump / image (cleanup of sensitive data etc), and optionally uploads and registers our image with ~okeanos.

For more information on how snf-image-creator and the Synnefo image registry work, visit the relevant docs [1][2][3].

0) Since snf-image-creator will use qemu/kvm to spawn a helper VM, and we’re inside a VM, let’s make sure that nested virtualization (OSDI ’10 Best Paper award btw :)) works.

First, we need to make sure that kvm_[amd|intel] is modprobe’d on the host machine / hypervisor with the nested=1 parameter, and that the vcpu that qemu/kvm creates thinks it has ‘virtual’ virtualization extensions (that’s actually our responsibility, and it’s enabled on the okeanos.io servers).

Inside our Debian VM, let’s verify that everything is ok.

grep -E 'vmx|svm' /proc/cpuinfo
modprobe -v -a kvm kvm_intel    # or kvm_amd on AMD hosts
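The same check can be scripted. This is just a sketch that looks for the vmx (Intel) or svm (AMD) flag in text taken from /proc/cpuinfo; the function name and the sample string are made up for illustration.

```python
def virt_extensions(cpuinfo: str) -> set:
    """Return the hardware virtualization flags found in /proc/cpuinfo text."""
    found = set()
    for line in cpuinfo.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # Keep only the two flags we care about.
            found |= {"vmx", "svm"} & set(value.split())
    return found


# Made-up sample; on a real VM, pass open("/proc/cpuinfo").read() instead.
sample = "processor\t: 0\nflags\t\t: fpu vme de pse vmx sse2\n"
print(virt_extensions(sample))  # {'vmx'}
```

An empty set would mean the vcpu does not expose virtualization extensions, so the helper VM would not get hardware acceleration.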

1) Clone snf-image-creator repo

git clone https://code.grnet.gr/git/snf-image-creator

2) Install snf-image-creator using setuptools (./setup.py install) and optionally virtualenv. You’ll need to install (pip install / aptitude install etc) setuptools, (python-)libguestfs and python-dialog manually. setuptools will take care of the rest of the deps.

3) Use snf-image-creator to prepare and upload / register the image:

snf-image-creator -u gentoo.diskdump -r "Gentoo Linux" -a [okeanos.io username] -t [okeanos.io user token] gentoo.img -o gentoo.img --force

If everything goes as planned, after snf-image-creator terminates, you should be able to see your newly uploaded image in https://pithos.okeanos.io, inside the Images container. You should also be able to choose your image to create a new VM (either via the Web UI, or using the kamaki client).

And, let’s install kamaki to spawn some Gentoo VMs:

git clone https://code.grnet.gr/git/kamaki

and install it using setuptools (just like snf-image-creator). Alternatively, you could use our Debian repo (you can find the GPG key here).

Modify .kamakirc to match your credentials:

[astakos]
enable = on
url = https://astakos.okeanos.io
[compute]
cyclades_extensions = on
enable = on
url = https://cyclades.okeanos.io/api/v1.1
[global]
colors = on
token = [token]
[image]
enable = on
url = https://cyclades.okeanos.io/plankton
[storage]
account = [username]
container = pithos
enable = on
pithos_extensions = on
url = https://pithos.okeanos.io/v1
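Since the file above still contains [token] and [username] placeholders, a quick sanity check before running kamaki can save a confusing auth error. The sketch below only assumes the file is INI-style and parses it with Python's configparser; kamaki's own config handling may differ, and the helper name is hypothetical.

```python
import configparser

# Trimmed copy of the config above, placeholders left in on purpose.
SAMPLE = """\
[compute]
cyclades_extensions = on
enable = on
url = https://cyclades.okeanos.io/api/v1.1
[global]
colors = on
token = [token]
[storage]
account = [username]
container = pithos
enable = on
url = https://pithos.okeanos.io/v1
"""


def unfilled_placeholders(text: str) -> list:
    """List (section, option) pairs whose value is still a [placeholder]."""
    cfg = configparser.ConfigParser(interpolation=None)
    cfg.read_string(text)
    return [
        (section, option)
        for section in cfg.sections()
        for option, value in cfg.items(section)
        if value.startswith("[") and value.endswith("]")
    ]


print(unfilled_placeholders(SAMPLE))
# [('global', 'token'), ('storage', 'account')]
```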

Now, let’s create our first Gentoo VM:

# assumes the image ID is the first space-separated field of the matching line
kamaki server create LarryTheCow 37 `kamaki image list | grep Gentoo | cut -f 1 -d ' '` --personality /root/.ssh/authorized_keys

That’s all for now. Hopefully, I’ll return soon with another more detailed post on scripting with kamaki (vkoukis has a nice script using kamaki python lib to create from scratch a small MPI cluster on ~okeanos :)).

Cheers!

Coolest hack/trick ever!

February 17, 2011

Some time ago, I wrote about lguest, a minimal x86 hypervisor for the Linux Kernel, which is mainly used for experimentation, and for learning stuff about hypervisors, operating systems, even computer architecture/ISA (x86 in particular).

Today I cloned the git repo for the lguest64 port, and I started browsing through the documentation and the code. In the launcher code (the program that initializes/sets up and launches a new guest kernel), I saw the coolest programming hack/trick I’ve seen in a long time. :P

/*L:170 Prepare to be SHOCKED and AMAZED.  And possibly a trifle nauseated.
 *
 * We know that CONFIG_PAGE_OFFSET sets what virtual address the kernel expects
 * to be.  We don't know what that option was, but we can figure it out
 * approximately by looking at the addresses in the code.  I chose the common
 * case of reading a memory location into the %eax register:
 *
 *  movl <some-address>, %eax
 *
 * This gets encoded as five bytes: "0xA1 <4-byte-address>".  For example,
 * "0xA1 0x18 0x60 0x47 0xC0" reads the address 0xC0476018 into %eax.
 *
 * In this example can guess that the kernel was compiled with
 * CONFIG_PAGE_OFFSET set to 0xC0000000 (it's always a round number).  If the
 * kernel were larger than 16MB, we might see 0xC1 addresses show up, but our
 * kernel isn't that bloated yet.
 *
 * Unfortunately, x86 has variable-length instructions, so finding this
 * particular instruction properly involves writing a disassembler.  Instead,
 * we rely on statistics.  We look for "0xA1" and tally the different bytes
 * which occur 4 bytes later (the "0xC0" in our example above).  When one of
 * those bytes appears three times, we can be reasonably confident that it
 * forms the start of CONFIG_PAGE_OFFSET.
 *
 * This is amazingly reliable. */
static unsigned long intuit_page_offset(unsigned char *img, unsigned long len)
{
	unsigned int i, possibilities[256] = { 0 };

	for (i = 0; i + 4 < len; i++) {
		/* mov 0xXXXXXXXX,%eax */
		if (img[i] == 0xA1 && ++possibilities[img[i+4]] > 3)
			return (unsigned long)img[i+4] << 24;
	}
	errx(1, "could not determine page offset");
}

It’s very well commented, so I don’t think there’s anything I could explain any better.
Very very nice trick!
‘Prepare to be shocked and amazed!’ ;)
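If you want to play with the heuristic outside the launcher, here is a rough Python transliteration of intuit_page_offset above. The test image is synthetic (just the example instruction from the comment, repeated), not a real kernel.

```python
def intuit_page_offset(img: bytes) -> int:
    """Guess PAGE_OFFSET from the top byte of 'mov <addr>,%eax' operands."""
    possibilities = [0] * 256
    for i in range(len(img) - 4):          # same bound as i + 4 < len in C
        if img[i] == 0xA1:                 # opcode of mov 0xXXXXXXXX,%eax
            top = img[i + 4]               # most significant address byte
            possibilities[top] += 1
            if possibilities[top] > 3:
                return top << 24
    raise ValueError("could not determine page offset")


# "0xA1 0x18 0x60 0x47 0xC0" reads address 0xC0476018 into %eax.
INSN = bytes([0xA1, 0x18, 0x60, 0x47, 0xC0])
print(hex(intuit_page_offset(INSN * 4)))   # 0xc0000000
```

Note that, as written, the `> 3` test fires on the fourth matching 0xA1, so three copies of the instruction are not enough and the function gives up.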

~6k lines of C code, and you have an x86 hypervisor in the Linux kernel!

lguest is a very minimal x86 hypervisor for Linux, very easy to set up, and due to its very small code base, it can be fun to hack/tweak.

Besides its small code base, it’s very well documented, thus ‘studying’ the lguest code is probably the best way to start learning a few things about virtualization and hypervisors.

Actually, the Makefile in drivers/lguest/ can be used to ‘extract’ the comments from the lguest code in a very organized way, which makes it even easier to understand how lguest works (just cd to the directory and type make Beer, or make Puppy, if you want some ascii puppies : )

lguest requires some kernel configuration. Besides lguest itself, you must include some virtio modules, particularly virtio_blk, and virtio_net if you need networking inside the guest.

Afterwards, you must compile the Documentation/lguest/lguest.c, which you’ll run in userspace every time you want to ‘launch’ a new guest.

In order to run a new guest, you just run the lguest launcher you compiled, with some parameters, like the amount of memory the guest should have, the kernel image (usually the same image that the host runs) and/or the initrd needed, a rootfs (you can either download a minimal rootfs for testing, or create one using qemu), and the virtual block device (/dev/vda) where the rootfs will be mounted inside the guest.
e.g.
lguest 64m vmlinux --initrd=initrd.img --block=rootfile root=/dev/vda
That’s it! You have a guest running.

Of course lguest lacks features, included in other hypervisors/virtualization solutions, but that’s the point. A small ‘experimental’ hypervisor, easy to understand and maybe tweak. ;)
