Update: Corrected fallocate and parted commands, and removed diratime mount option. Thanks to axil

Long time, no post.

For about a year now, I’ve been working at GRNET on its (OpenStack API compliant) open source IaaS cloud platform Synnefo, which powers the ~okeanos service.

Since ~okeanos is mainly aimed towards the Greek academic community (and thus has restrictions on who can use the service), we set up a ‘playground’ ‘bleeding-edge’ installation (okeanos.io) of Synnefo, where anyone can get a free trial account, experiment with the Web UI, and have fun scripting with the kamaki API client. So, you get to try the latest features of Synnefo, while we get valuable feedback. Sounds like a fair deal. :)

Unfortunately, being the only one on our team who actually uses Gentoo Linux, up until recently Gentoo VMs were not available. So, a couple of days ago I decided it was about time to get a serious distro running on ~okeanos (the load on our servers had been ridiculously low after all :P). For future reference, and in case anyone wants to upload their own image on okeanos.io or ~okeanos, I’ll briefly describe the steps I followed.

1) Launch a Debian-base (who needs a GUI?) VM on okeanos.io

Everything from here on is done inside our Debian-base VM.

2) Use fallocate or dd seek= to create an (empty) file large enough to hold our image (5GB)

fallocate -o $((5 * 1024 * 1024 * 1024)) -l 1 gentoo.img
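The dd seek= variant mentioned above would be something along these lines (it creates a sparse file of the same size without writing any data):

dd if=/dev/zero of=gentoo.img bs=1M count=0 seek=$((5 * 1024))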

3) Losetup the image, partition and mount it

losetup -f gentoo.img
parted /dev/loop0 mklabel msdos
parted /dev/loop0 mkpart primary ext4 2048s 5G
kpartx -a /dev/loop0
mkfs.ext4 /dev/mapper/loop0p1
losetup /dev/loop1 /dev/mapper/loop0p1   # trick needed for the grub2 installation later on
mkdir -p /mnt/gentoo
mount /dev/loop1 /mnt/gentoo -t ext4 -o noatime

4) Chroot and install Gentoo in /mnt/gentoo. Just follow the handbook; at a minimum you’ll need to extract the base system and portage, and set up some basic configs, like networking. It’s up to you how much you want to customize the image. For the Linux kernel, I just copied the Debian /boot/[vmlinuz|initrd|System.map] and /lib/modules/ of the VM directly (and it worked! :)).
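For completeness, a rough sketch of those steps, assuming the stage3 and portage tarballs have already been downloaded to /root (the filenames are illustrative; grab whatever the handbook currently points to):

tar xjpf /root/stage3-amd64-*.tar.bz2 -C /mnt/gentoo
tar xjf /root/portage-latest.tar.bz2 -C /mnt/gentoo/usr
cp -L /etc/resolv.conf /mnt/gentoo/etc/
cp -a /boot/vmlinuz* /boot/initrd* /boot/System.map* /mnt/gentoo/boot/
cp -a /lib/modules /mnt/gentoo/lib/
mount -t proc proc /mnt/gentoo/proc
mount --bind /dev /mnt/gentoo/dev
chroot /mnt/gentoo /bin/bash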

5) Install sys-boot/grub-2.00 (I had some *minor* issues with grub-0.97 :P).

6) Install grub2 in /dev/loop0 (this should help). Make sure your device.map inside the Gentoo chroot looks like this:

(hd0) /dev/loop0
(hd1) /dev/loop1

and make sure you have a sane grub.cfg (I’d suggest replacing all references to UUIDs in grub.cfg and /etc/fstab with /dev/vda[1]).
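For illustration (this is an assumption on my part about the final layout; adapt it to your own grub.cfg), the root filesystem line in /etc/fstab would look like:

/dev/vda1   /   ext4   noatime   0 1

and the kernel line in grub.cfg like:

linux /boot/vmlinuz root=/dev/vda1 ro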
Now, outside the chroot, run:

grub-install --root-directory=/mnt/gentoo --grub-mkdevicemap=/mnt/gentoo/boot/grub/device.map /dev/loop0

Clean up everything (umount, losetup -d, kpartx -d etc), and we’re ready to upload the image with snf-image-creator.
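In case it’s not obvious, the cleanup is just the setup in reverse (including the chroot bind mounts, if you used any):

umount /mnt/gentoo/proc /mnt/gentoo/dev
umount /mnt/gentoo
losetup -d /dev/loop1
kpartx -d /dev/loop0
losetup -d /dev/loop0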

snf-image-creator takes a diskdump as input, launches a helper VM, cleans up the diskdump / image (removing sensitive data etc.), and optionally uploads and registers our image with ~okeanos.

For more information on how snf-image-creator and the Synnefo image registry work, visit the relevant docs [1][2][3].

0) Since snf-image-creator will use qemu/kvm to spawn a helper VM, and we’re inside a VM, let’s make sure that nested virtualization (OSDI ’10 Best Paper award btw :)) works.

First, we need to make sure that kvm_[amd|intel] is modprobe’d on the host machine / hypervisor with the nested=1 parameter, and that the vcpu that qemu/kvm creates thinks it has ‘virtual’ virtualization extensions (that’s actually our responsibility, and it’s enabled on the okeanos.io servers).

Inside our Debian VM, let’s verify that everything is ok.

grep -E 'vmx|svm' /proc/cpuinfo
modprobe -v kvm kvm_intel
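If both commands succeed, /dev/kvm should also be there, which is what the snf-image-creator helper VM will end up using:

ls -l /dev/kvm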

1) Clone snf-image-creator repo

git clone https://code.grnet.gr/git/snf-image-creator

2) Install snf-image-creator using setuptools (./setup.py install) and optionally virtualenv. You’ll need to install (pip install / aptitude install etc) setuptools, (python-)libguestfs and python-dialog manually. setuptools will take care of the rest of the deps.
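On the Debian-base VM this translates to something like the following (package names from memory, so double-check them):

aptitude install python-setuptools python-guestfs python-dialog
cd snf-image-creator
./setup.py install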

3) Use snf-image-creator to prepare and upload / register the image:

snf-image-creator -u gentoo.diskdump -r "Gentoo Linux" -a [okeanos.io username] -t [okeanos.io user token] gentoo.img -o gentoo.img --force

If everything goes as planned, after snf-image-creator terminates, you should be able to see your newly uploaded image in https://pithos.okeanos.io, inside the Images container. You should also be able to choose your image to create a new VM (either via the Web UI, or using the kamaki client).

And, let’s install kamaki to spawn some Gentoo VMs:

git clone https://code.grnet.gr/git/kamaki

and install it using setuptools (just like snf-image-creator). Alternatively, you could use our Debian repo (you can find the GPG key here).

Modify .kamakirc to match your credentials:

[astakos]
enable = on
url = https://astakos.okeanos.io
[compute]
cyclades_extensions = on
enable = on
url = https://cyclades.okeanos.io/api/v1.1
[global]
colors = on
token = [token]
[image]
enable = on
url = https://cyclades.okeanos.io/plankton
[storage]
account = [username]
container = pithos
enable = on
pithos_extensions = on
url = https://pithos.okeanos.io/v1

Now, let’s create our first Gentoo VM:

kamaki server create LarryTheCow 37 `kamaki image list | grep Gentoo | cut -d ' ' -f 1` --personality /root/.ssh/authorized_keys
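If the call goes through, the new VM should show up shortly in the server listing (and in the Web UI):

kamaki server list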

That’s all for now. Hopefully, I’ll return soon with another more detailed post on scripting with kamaki (vkoukis has a nice script using kamaki python lib to create from scratch a small MPI cluster on ~okeanos :)).

Cheers!

Abusing the C preprocessor

August 29, 2011

Both tricks shown here are related to a change (by Peter Zijlstra) in the kmap_atomic() and kunmap_atomic() macros/functions. LWN has an excellent article about what that change involved. It basically ‘dropped’ support for atomic kmap slots, switching to a more general stack-based approach.

Now, with this change, the number of arguments passed to the kmap_atomic() function changed too, and thus you end up with a huge patch covering the whole tree, which fixed the issue (changing kmap_atomic(p, KM_TYPE) to kmap_atomic(p)).

But there was another way to go. Some C preprocessor magic.

#define kmap_atomic(page, args...) __kmap_atomic(page)

Yes, the C preprocessor supports variadic arguments. :)
(which I found out while going through the reptyr code, but I’ll talk about that in another post.)

Today, I saw a thread on the LKML which actually did the cleanup I described. Andrew Morton responded:

I’m OK with cleaning all these up, but I suggest that we leave the back-compatibility macros in place for a while, to make sure that various stragglers get converted. Extra marks will be awarded for working out how to make unconverted code generate a compile warning

And Nick Bowler responded with a very clever way to do this (which involved heavily abusing the C preprocessor :P):

  #include <stdio.h>

  int foo(int x)
  {
     return x;
  }

  /* Deprecated; call foo instead. */
  static inline int __attribute__((deprecated)) foo_unconverted(int x, int unused)
  {
     return foo(x);
  }

  #define PASTE(a, b) a ## b
  #define PASTE2(a, b) PASTE(a, b)
  
  #define NARG_(_9, _8, _7, _6, _5, _4, _3, _2, _1, n, ...) n
  #define NARG(...) NARG_(__VA_ARGS__, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0)

  #define foo1(...) foo(__VA_ARGS__)
  #define foo2(...) foo_unconverted(__VA_ARGS__)
  #define foo(...) PASTE2(foo, NARG(__VA_ARGS__)(__VA_ARGS__))

  int main(void)
  {
    printf("%d\n", foo(42));
    printf("%d\n", foo(54, 42));
    return 0;
  }

The actual warning is printed due to the deprecated attribute of the foo_unconverted() function.

The fun part, however, is how we get to use the foo ‘identifier’/name to call either foo() or foo_unconverted() depending on the number of arguments given. :)

The trick is that __VA_ARGS__ ‘shifts’ the numbers 9-0 in the NARG macro, so that when the NARG_ macro is called, _9 matches the first argument, _8 the second, and so on, leaving n to match the actual number of arguments. For example, NARG(54, 42) expands to NARG_(54, 42, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0), so _9=54, _8=42, _7=9, and so on, and n ends up bound to 2.

Now that we have the number of arguments given to foo, we use the PASTE2 macro to concatenate the function name with that number, append the original argument list, and thus call the appropriate wrapper macro (foo1, foo2 etc).

Another interesting thing, which I didn’t know about, concerns argument expansion in macros: arguments that are concatenated (##) or stringified (#) are not expanded beforehand. That’s why we have to use PASTE2 as a wrapper, so that the NARG() argument/macro is fully expanded before the concatenation happens (pasting directly with PASTE would produce something like fooNARG(42)(42) instead of foo1(42)).

Ok, C code can get at times a bit obfuscated, and yes you don’t have type safety etc etc, but, man, you can be really creative with the C language (and the C preprocessor)!
And the Linux kernel development(/-ers) prove just that. :)

For some reason, whenever I open up Wikipedia, I end up with tons of tabs in my web browser, and usually the tabs are completely unrelated to each other. :P

Yesterday, I ended up looking at the xargs Wikipedia article, and there I found an interesting note:

Under the Linux kernel before version 2.6.23, arbitrarily long lists of parameters could not be passed to a command,[1] so xargs breaks the list of arguments into sublists small enough to be acceptable.

Along with a link to the GNU coreutils FAQ.

And from there a link to the Linux Kernel mainline git repository.

After a bit of googling, I found a very nice article describing in great detail the ARG_MAX variable, which defines the maximum length of the arguments passed to execve.

Traditionally Linux used a hardcoded:

#define MAX_ARG_PAGES 32

to limit the total size of the arguments passed to execve() (including the size of the ‘environment’). That limited the maximum length of the arguments to about 128KB (32 pages of 4KB each), minus the size of the ‘environment’.

(Note: actually, very early Linux kernels did not have support for ARG_MAX and didn’t use MAX_ARG_PAGES, but back then I was probably 2-3 years old, so it’s ancient history for me :P)

With Linux-2.6.23, this hardcoded limit was removed. Actually, it was replaced by a more ‘flexible’ limit: the maximum length of the arguments can now be as big as 1/4th of the user-space stack size. For example, on my desktop, ulimit -s reports a stack size of 8192KB, which means a maximum length of 2097152 bytes for the arguments passed. You can obtain the same value using getconf. Now, if I increase the soft limit on the stack size, the maximum length allowed will also increase, although with an 8192KB soft limit, ‘ARG_MAX’ is already big enough. Two new limits were also introduced: one on the maximum length of each individual argument (equal to PAGE_SIZE * 32), and one on the total number of arguments, equal to 0x7FFFFFFF, or as big as a signed integer can be.
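A quick sanity check of that relation (ulimit -s reports the stack limit in KB, so multiply by 1024 and divide by 4):

ulimit -s
getconf ARG_MAX
echo $(( $(ulimit -s) * 1024 / 4 ))   # should match the getconf value on a recent kernel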

Linux headers, however, use MAX_ARG_STRLEN, I think, as the ARG_MAX limit, which forces libc to #undef it in its own header files. I’m not sure, since I haven’t looked into the code yet, but at least for Linux, ARG_MAX is no longer statically defined by libc (i.e. in a header file); instead, libc computes its value from the userspace stack size.
(edit: that’s indeed how it works for >=linux-2.6.23 — code in sysdeps/unix/sysv/linux/sysconf.c:

    case _SC_ARG_MAX:
  #if __LINUX_KERNEL_VERSION < 0x020617
        /* Determine whether this is a kernel 2.6.23 or later.  Only
           then do we have an argument limit determined by the stack
           size.  */
        if (GLRO(dl_discover_osversion) () >= 0x020617)
  #endif
          {
            /* Use getrlimit to get the stack limit.  */
            struct rlimit rlimit;
            if (__getrlimit (RLIMIT_STACK, &rlimit) == 0)
              return MAX (legacy_ARG_MAX, rlimit.rlim_cur / 4);
          }
  
        return legacy_ARG_MAX;

).

And the kernel code that enforces that limit:

               struct rlimit *rlim = current->signal->rlim;
               unsigned long size = bprm->vma->vm_end - bprm->vma->vm_start;

               /*
                * Limit to 1/4-th the stack size for the argv+env strings.
                * This ensures that:
                *  - the remaining binfmt code will not run out of stack space,
                *  - the program will have a reasonable amount of stack left
                *    to work from.
                */
               if (size > rlim[RLIMIT_STACK].rlim_cur / 4) {
                       put_page(page);
                       return NULL;
               }

The whole kernel patch is a bit complicated for me to understand, since I haven’t dug much into the kernel mm code, but from what I understand, instead of copying the arguments into pages and then mapping those pages into the new process address space, it sets up a new mm_struct and populates it with a stack VMA. It then copies the arguments into this VMA (expanding it as needed), and then takes care to ‘position’ it correctly in the new process. But since I’m not very familiar with the Linux kernel mm API, it’s very likely that what I said is totally wrong (I really have to read the mm chapters from “Understanding the Linux Kernel” :P).

A couple of months ago I found out about ketchup (credits to Daniel Drake, and his blog).

ketchup is an awesome utility/script, written by Matt Mackall in Python, which makes it very easy to manage kernel sources. You can very easily upgrade to a newer kernel version, downgrade to older releases, and even switch between different patchsets. The ketchup ebuild I found in Portage (and in every Linux distro I know of) was fetching the original, out-of-date version of ketchup. Steven Rostedt had pulled the original ketchup code (v0.9) into his git repo @ kernel.org. However, there had been no commits/updates to ketchup for 1-2 years, I think.

So, I decided to clean up some of the old trees that ketchup supported but were no longer maintained, and add support for new trees (or updated ‘versions’ of the old trees). I sent the patches to Steven Rostedt, and he proposed that I take over and maintain ketchup. :)

I cloned the ketchup git repo to Github, applied the patches I’d written, plus quite a lot of patches that the Debian ketchup package provided.

Now, with the Linux-3.0 release approaching, I tried to add (at least) partial support for the new two-digit version numbers, but there are still some issues, which will hopefully get resolved once Linux-3.0 gets released and the new versioning scheme gets standardized (for example, the EXTRAVERSION Makefile variable will probably not get removed in 3.0, as that breaks some userspace utilities, like uptime from procps, causes some depmod issues, etc).

The new code for 3.x kernels is currently in the linux-3 branch, from which I took a snapshot and pushed it to Portage as dev-util/ketchup-1.1_beta. I will hopefully merge it back with master, after the first -stable release comes out (Linux-3.0.1), just to make sure that everything works.

Feel free to give it a try, and report any bugs/issues.
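If you haven’t used it before, a typical session looks roughly like this (run from inside a kernel source directory; the exact tree names depend on the ketchup version):

ketchup -l          # list the trees/patchsets ketchup knows about
ketchup 2.6.39.3    # patch the tree in the current directory up/down to 2.6.39.3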

Hey!

July 9, 2011

I finally became a Gentoo Developer. :)

I’ll be helping the Gentoo Kernel Project, with bug fixing at first, and help with the maintenance of some of the kernel sources in the tree.

Many thanks to mpagano for mentoring me, tampakrap for his help with the quizzes, and of course hwoarang, who had no problem doing all of the review sessions during his vacation. :)
