Abusing the C preprocessor

August 29, 2011

Both tricks shown here are related with a change (by Peter Zijlstra) in the kmap_atomic() and kunmap_atomic() macros/functions. LWN has an excellent article about what that change involved. It basically ‘dropped’ support for atomic kmap slots, switching to a more general stack-based approach.

Now with this change, the number of arguments passed to the kmap_atomic() function changed too, and thus you end up with a huge patch covering all the tree, which fixed the issue (changing kmap_atomic(p, KM_TYPE) to kmap_atomic(p)).

But there was another way to go. Some C preprocessor magic.

#define kmap_atomic(page, args...) __kmap_atomic(page)

Yes, the C preprocessor supports va_args. :)
(which I found out when going through the reptyr code, but I’ll talk about it in an other post.)

Today, I saw a thread at the lkml, which actually did the cleanup I described. Andrew Morton responded:

I’m OK with cleaning all these up, but I suggest that we leave the back-compatibility macros in place for a while, to make sure that various stragglers get converted. Extra marks will be awarded for working out how to make unconverted code generate a compile warning

And Nick Bowler responded with a very clever way to do this (which involved abusing heavily the C preprocessor :P):

  #include <stdio.h>

  int foo(int x)
  {
     return x;
  }

  /* Deprecated; call foo instead. */
  static inline int __attribute__((deprecated)) foo_unconverted(int x, int unused)
  {
     return foo(x);
  }

  #define PASTE(a, b) a ## b
  #define PASTE2(a, b) PASTE(a, b)
  
  #define NARG_(_9, _8, _7, _6, _5, _4, _3, _2, _1, n, ...) n
  #define NARG(...) NARG_(__VA_ARGS__, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0)

  #define foo1(...) foo(__VA_ARGS__)
  #define foo2(...) foo_unconverted(__VA_ARGS__)
  #define foo(...) PASTE2(foo, NARG(__VA_ARGS__)(__VA_ARGS__))

  int main(void)
  {
    printf("%d\n", foo(42));
    printf("%d\n", foo(54, 42));
    return 0;
  }

The actual warning is printed due to the deprecated attribute of the foo_unconverted() function.

The fun part, however, is how we get to use the foo ‘identifier’/name to call either foo() or foo_uncoverted() depending on the number of arguments given. :)

The trick is to use the __VA_ARGS__ to ‘shift’ the numbers 9-0 in the NARG macro, so that when calling the NARG_ macro, _9 will match with the first __VA_ARGS__ argument, _8 with the second etc, and so n will match with actual number of arguments (I’m not sure I described it very well, but if you try doing it by hand, you’ll understand how it’s working).

Now that we have the number of arguments given to foo, we use the PASTE macro to ‘concatenate’ the number of the arguments with the function name, and the actual arguments given, and call the appropriate wrapper macro (foo1, foo2 etc).

Another interesting thing, which I didn’t know, is about argument expansion in macros. For macros that concatenate (##) or stringify (#) the arguments are not expanded beforehand. That’s why we have to use PASTE2 as a wrapper, to get the NARG() argument/macro fully expanded before concatenating.

Ok, C code can get at times a bit obfuscated, and yes you don’t have type safety etc etc, but, man, you can be really creative with the C language (and the C preprocessor)!
And the Linux kernel development(/-ers) prove just that. :)

For some reason, whenever I open up the Wikipedia, I end up with tons of tabs in my web browser, and usually the tabs are completely unrelated to each other. :P

Yesterday, I ended up looking the xargs Wikipedia article, and there I found an interesting note:

Under the Linux kernel before version 2.6.23, arbitrarily long lists of parameters could not be passed to a command,[1] so xargs breaks the list of arguments into sublists small enough to be acceptable.

Along with a link to the GNU coreutils FAQ.

And from there a link to the Linux Kernel mainline git repository.

After a bit of googling, I found a very nice article describing in great detail the ARG_MAX variable, which defines the maximum length of the arguments passed to execve.

Traditionally Linux used a hardcoded:

#define MAX_ARG_PAGES 32

to limit the total size of the arguments passed to the execve() (including the size of the ‘environment’). That limited the maxlen of the arguments passed to about 128KB (minus the size of the ‘environment’).

(Note: actually, very early Linux kernels did not have support for ARG_MAX and didn’t use MAX_ARG_PAGES, but back then I was probably 2-3 years old, so it’s ancient history for me :P)

With Linux-2.6.33, this hardcoded limit was removed. Actually it was replaced by a more ‘flexible’ limit. The maximum length of the arguments can now be as big as the 1/4th of the user-space stack. For example, in my desktop, using ulimit -s I get a stack size of 8192KB, which means 2097152 maxlength for the arguments passed. The same value you can obtain by using getconf. Now, if I increase the soft limit on the stack size, the maxlength allowed will also increase, although with a 8192KB soft limit, the ‘ARGS_MAX’ is already big enough. Two new limits where also introduced, one on the maxlength of each argument (equal to PAGE_SIZE * 32), and the total number of arguments, equal to 0x7FFFFFFF, or as big as a signed integer can be.

Linux headers however use the MAX_ARG_STRLEN, I think, as the ARG_MAX limit, which forces libc to #undef it in its own header files. I’m not sure, since I haven’t looked into code yet, but at least for Linux, ARG_MAX is not statically defined anymore by libc (ie in a header file), but libc computes its value from the userspace stack size.
(edit: that’s indeed how it works for >=linux-2.6.33 — code in sysdeps/unix/sysv/linux/sysconf.c:

    case _SC_ARG_MAX:
  #if __LINUX_KERNEL_VERSION < 0x020617
        /* Determine whether this is a kernel 2.6.23 or later.  Only
           then do we have an argument limit determined by the stack
           size.  */
        if (GLRO(dl_discover_osversion) () >= 0x020617)
  #endif
          {
            /* Use getrlimit to get the stack limit.  */
            struct rlimit rlimit;
            if (__getrlimit (RLIMIT_STACK, &rlimit) == 0)
              return MAX (legacy_ARG_MAX, rlimit.rlim_cur / 4);
          }
  
        return legacy_ARG_MAX;

).

And the kernel code that enforces that limit:

               struct rlimit *rlim = current->signal->rlim;
               unsigned long size = bprm->vma->vm_end - bprm->vma->vm_start;

               /*
                * Limit to 1/4-th the stack size for the argv+env strings.
                * This ensures that:
                *  - the remaining binfmt code will not run out of stack space,
                *  - the program will have a reasonable amount of stack left
                *    to work from.
                */
               if (size > rlim[RLIMIT_STACK].rlim_cur / 4) {
                       put_page(page);
                       return NULL;
               }

The whole kernel patch is a bit complicated for me to understand, since I don’t have digged much into kernel mm code, but from what I understand, instead of copying the arguments into pages, and then mapping those pages into the new process address space, it setups a new mm_struct, and populates it with a stack VMA. It then copies the arguments into this VMA (expanding it as needed), and then takes care to ‘position’ it correctly into the new process. But since I’m not very familiar with the Linux Kernel mm API, it’s very likely that what I said is totally wrong (I really have to read the mm chapters from “Understanding the Linux Kernel” :P).

A couple of months ago I found out about ketchup (credits to Daniel Drake, and his blog).

ketchup is an awesome utility/script, written by Matt Mackall in Python, which makes it very easy to manage kernel sources. You can very easily upgrade to a newer kernel version, downgrade to older releases, and even switch between different patchsets. The ketchup ebuild I found in Portage (and in every Linux distro I know about) was fetching the original and out-of-date version of ketchup. Steven Rostedt had pulled the original ketchup code (v0.9) into his git repo @ kernel.org. However, there were no commits/updates to ketchup for 1-2 years, I think.

So, I decided to cleanup some of the old trees that ketchup supported, but were no longer maintained, and add support for new trees (or some updated ‘versions’ of the old trees). I sent the patches to Steven Rostedt, and he proposed that I take over and maintain ketchup. :)

I cloned the ketchup git repo to Github, applied the patches I’d written, plus quite a lot of patches that the Debian ketchup package provided.

Now, with the Linux-3.0 release approaching, I tried to add (at least) partial support for the new 2 digit version numbers, but there are still some issues, which will hopefully get resolved once Linux-3.0 gets released, and the new versioning scheme gets standarized (for example the EXTRAVERSION Makefile variable will probably not get removed from 3.0, as it breaks some userspace utils, like uptime etc from procps utils, some depmod issues etc).

The new code for 3.x kernels is currently in the linux-3 branch, from which I took a snapshot and pushed it to Portage as dev-util/ketchup-1.1_beta. I will hopefully merge it back with master, after the first -stable release comes out (Linux-3.0.1), just to make sure that everything works.

Feel free to give it a try, and report any bugs/issues.

I had blogged about this some time ago. The configuration I described in that post worked fine on my laptop, with Debian installed, but when I tried it on my Desktop, where I use Gentoo, it wouldn’t work.

It took me *3 days* of ‘debugging’, until I was able to find why that happened!

I tried various changes to the iptables and iproute2 configuration, giving more hints to both utilities in order to use the correct routing table, mark the packets correctly etc, but it still wouldn’t work.

After a lot of time tweaking the configuration, without results, I saw that, although ping -Ieth0 ${VPN_SERVER}, didn’t ‘work’ (with openvpn running, and tap0 configured with the correct address/netmask), I could see with tcpdump the ‘ECHO REPLY’ packets sent by the VPN server, with correct source and destination addresses.

After stracing the ping command, I saw that when ping issued a recvmsg syscall, recvmsg returned with -EAGAIN. So, now I know that the packets do arrive to the interface with correct addresses, but they couldn’t ‘reach’ the upper network stacks of the kernel.

The problem was that both machines were running vanilla kernels, so I couldn’t blame any Debian or Gentoo specific patches. But since I knew that the problem was in the kernel, I tried to see if any kernel .config options, regarding NETFILTER, and multiple routing tables didn’t match between the two configs. But I couldn’t find anything that could cause that ‘bug’.

So since the kernel sources are the same, and I can’t find anything in the .configs that could cause the problem, I try tweaking some /proc/sys/net ‘files’, although I couldn’t see why these would differ between the two machines. And then I saw some /proc/sys/net/ipv4/ files in Gentoo, that didn’t show up in Debian (/proc/sys/net/ipv4/cipso*).

I googled to find what cipso is, and I finally found out that it was part of the NetLabel project. CIPSO (Common IP Security Option) is an IETF draft (it’s quite old actually) and is implemented like a ‘security module’ in the Linux Kernel, and it was what it caused the problem, probably because it tried to do some verification on the inbound packets, which failed, and therefore the packets were ‘silently’ dropped. LWN has an article with more infromation about packet labeling and CIPSO, and there’s also related Documentation in the Linux Kernel.

make defconfig enbales Netlabel, but Debian’s default configuration had it disabled, and that’s why Openvpn/iproute2/iptables configuration worked with Debian, but failed on Gentoo.

Instead of compiling a new kernel, one can just do

echo 0 > /proc/sys/net/ipv4/cipso_rbm_strict_valid

and disable CIPSO verification on inbound packets, so that multiple routing tables and packet marking work as expected.

my first kernel patch!

March 11, 2011

Here it is! :P

Well, the patch itself isn’t a big deal, since I didn’t write any code. It was a cleanup of asm-offsets.

Afaik, Linux has quite a lot of assembly code, which needs the offsets of various struct members. Of course, assembly code(or even toplevel inline assembly) cannot use the offsetof marco. That’s also the case for some C constants, eg:

#define PAGE_SIZE (1UL << PAGE_SHIFT)) 

Thus, these offsets and constants are ‘calcualted’ at build time as ‘absolute values’ so that gas will be ok. :P
Using the macros defined in incldue/linux/kbuild.h

#define DEFINE(sym, val) \
        asm volatile("\n->" #sym " %0 " #val : : "i" (val))

#define BLANK() asm volatile("\n->" : : )

#define OFFSET(sym, str, mem) \
	DEFINE(sym, offsetof(struct str, mem))

#define COMMENT(x) \
	asm volatile("\n->#" x)

the asm-offsets.c is used to create a ‘special’ asm fle, which is then parsed by the kernel build system, in order to create the include/generated/asm-offsets.h file.

However, a patch introduced two new macros in include/linux/const.h

#ifdef __ASSEMBLY__
#define _AC(X,Y)	X
#define _AT(T,X)	X
#else
#define __AC(X,Y)	(X##Y)
#define _AC(X,Y)	__AC(X,Y)
#define _AT(T,X)	((T)(X))
#endif

so that some constants defined in C work with gas too.

And now, the PAGE_SIZE is defined as

#define PAGE_SIZE       (_AC(1,UL) << PAGE_SHIFT

and thus, the PAGE_SIZE_asm defined in x86/asm-offsets.c, and used in some places in x86 code was no longer needed(that’s also the case with PAGE_SHIFT and THREAD_SIZE).

So, I deleted PAGE_SIZE_asm/PAGE_SHIFT_asm/THREAD_SIZE_asm from x86/kernel/asm-offsets.c, and replaced them with their non-asm counterparts, in the code that used them.

I posted it to lkml(after some hours experimenting with git format-patch and git send-email :P), and it got accepted. :)

Here are some things I learned while reading the Linux kernel source code(some of which took me a couple of hours of googling and searching through documentation, git commit posts, threads on lkml etc etc :P).

1)You cannot write extended toplevel inline assembly, ie when you want to use extended inline assembly to pass the value of some C variables or constants, you can only do it inside a function. And as I found out, someone had filed a bug at the GCC bugzilla. So something like this

static const char foo[] = "Hello, world!";
enum { bar = 17 };
asm(".pushsection baz; .long %c0, %c1, %c2; .popsection"
    : : "i" (foo), "i" (sizeof(foo)), "i" (bar));

won’t work.

2)I didn’t search very much the documentation about inline asm, but I couldn’t find what’s the difference between %c0 and %0. It’s used at the example code above, and in a kernel macro I saw. I understood that it had to do with some ‘constant casting’, but I couldn’t find anywhere the exact difference. So I wrote a simple piece of code to clarify that:

main() {
	asm("movl %0, %%eax; movl %c0, %%eax"
		:: "i" (0xff) );
}

and after

gcc -S foo.c

I get:

movl $255, %eax
movl 255, %eax

So %0 is used when we want an integer constant to be used as an immediate value in instructions like mov, add etc, which means that it should be prefixed with $, while %c0 is used when we want the number itself for instructions like .long, .size etc which demand an absolute expression/value as ‘arguments’.

3) When using the section attribute on a variable, in order to change the section it belongs, you cannot change the section’s type to nobits, it’ll be progbits by default. progbits means that the section will actually get space allocated inside the executable(like text and data sections), in contrast to nobits sections like bss for example.
i.e. you can’t do this

static char foo __attribute__(section("bar", nobits));

4)I also found out about the pushsection and popsections asm directives, which manipulate the ELF section stack, and seem to be very useful in certain occasions. pushsection obviously pushes the current section to the section stack, and replace it with the argument passed to the directive, while popsection replaces the current section with the section on top of the section stack.

5)Finally the ‘used’ attribute, which indicates that the symbol(function in our case) is actually used/called/referenced even if the compiler can’t ‘see’ it(otherwise I think that the compiler optimizations would omit code generation for that function).

And now a kernel macro which includes all of the above:

/*
 * Reserve space in the brk section.  The name must be unique within
 * the file, and somewhat descriptive.  The size is in bytes.  Must be
 * used at file scope.
 *
 * (This uses a temp function to wrap the asm so we can pass it the
 * size parameter; otherwise we wouldn't be able to.  We can't use a
 * "section" attribute on a normal variable because it always ends up
 * being @progbits, which ends up allocating space in the vmlinux
 * executable.)
 */
#define RESERVE_BRK(name,sz)						\
	static void __section(.discard.text) __used			\
	__brk_reservation_fn_##name##__(void) {				\
		asm volatile (						\
			".pushsection .brk_reservation,\"aw\",@nobits;" \
			".brk." #name ":"				\
			" 1:.skip %c0;"					\
			" .size .brk." #name ", . - 1b;"		\
			" .popsection"					\
			: : "i" (sz));					\
	}

And a bit more detailed explanation from the git commit

The C definition of RESERVE_BRK() ends up being more complex than
one would expect to work around a cluster of gcc infelicities:

The first attempt was to simply try putting __section(.brk_reservation)
on a variable. This doesn’t work because it ends up making it a
@progbits section, which gets actual space allocated in the vmlinux
executable.

The second attempt was to emit the space into a section using asm,
but gcc doesn’t allow arguments to be passed to file-level asm()
statements, making it hard to pass in the size.

The final attempt is to wrap the asm() in a function to allow
it to have arguments, and put the function itself into the
.discard section, which vmlinux*.lds drops entirely from the
emitted vmlinux.

Another thing to notice is that the wrapper function is put in the .discard.text section, which according to the vmlinux.lds(the linker script used to generate/link the vmlinux executable) will be discarded and thus not included in the executable.
From scripts/module-common.lds:

/*
 * Common module linker script, always used when linking a module.
 * Archs are free to supply their own linker scripts.  ld will
 * combine them automatically.
 */
SECTIONS {
	/DISCARD/ : { *(.discard) }
}

The purpose of the RESERVE_BRK macro, and the brk-like allocator for very early memory allocations needed during the kernel boot process is an interesting story too(which means another post coming soon)! ;)

I wanted to understand how the code for the early page tables that the kernel builds just before paging is enabled, works. x86 AT&T assembly isn’t very easy to understand, but it’s fun though. :P

page_pde_offset = (__PAGE_OFFSET >> 20);

        movl $pa(__brk_base), %edi
        movl $pa(swapper_pg_dir), %edx
        movl $PTE_IDENT_ATTR, %eax
10:
        leal PDE_IDENT_ATTR(%edi),%ecx          /* Create PDE entry */
        movl %ecx,(%edx)                        /* Store identity PDE entry */
        movl %ecx,page_pde_offset(%edx)         /* Store kernel PDE entry */
        addl $4,%edx
        movl $1024, %ecx
11:
        stosl
        addl $0x1000,%eax
        loop 11b
        /*
         * End condition: we must map up to the end + MAPPING_BEYOND_END.
         */
        movl $pa(_end) + MAPPING_BEYOND_END + PTE_IDENT_ATTR, %ebp
        cmpl %ebp,%eax
        jb 10b
        addl $__PAGE_OFFSET, %edi
        movl %edi, pa(_brk_end)
        shrl $12, %eax
        movl %eax, pa(max_pfn_mapped)

Here’s the piece of code.
Have fun! :)

Linux uses a rather complicated(for me) driver model, which is built upon the kobject abstratction. There’s a lot of documentation about the linux driver model, and describing it in detail is out of the scope of this post(well, actually I can’t do it : P ).

In short:
kobjects are structs that hold some information(a name, a reference count, a parent pointer etc), and are usually embedded into other structs, typically structs for devices/device drivers(ie struct cdev, for a character device), creating a hieararchy which is ‘exported’ to userspace through the sysfs(mounted on /sys).
The question is how the code that works with kobjects can reference the struct that contains the kobject. The kernel provides a macro, (not surprisingly) called container_of, which does exactly that.
The definition of the macro can be found in include/linux/kernel.h:

/**
* container_of - cast a member of a structure out to the containing structure
* @ptr: the pointer to the member.
* @type: the type of the container struct this is embedded in.
* @member: the name of the member within the struct.
*
*/
#define container_of(ptr, type, member) ({ \
const typeof( ((type *)0)->member ) *__mptr = (ptr); \
(type *)( (char *)__mptr - offsetof(type,member) );})

In order to understand what this macro does, we have to be somewhat familiar with the C Preprocessor, and some non-standard GCC extensions:
1)A parenthesis followed by a brace. This is called by the GCC statement expression. It lets us use a compound statement as an expression. Here, we want to use the container_of macro as an expression, but we want to declare a local variable(__mptr) inside the macro, so we need a compound statement.
2)The typeof GCC extension that lets us refer to the type of an expression, and can be used to declare variables.

The previous two extensions let us write safe macros(ie side effects of the operands are calculated only once) that work for any type(some kind of polymorphism), and can be used as expressions.

This macro also uses the offsetof, which computes the byte offset of a field within a structure. Linux uses the compiler-provided offsetof, if the compiler provides one, else it defines the offsetof macro as

#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)

A few words about offsetof. offsetof is a valid ANSI C macro. Cfaq gives a possible implementation of offsetof(though non-portable) which is a bit different than the one that the kernel defines. I’m not sure about this, but as far as I can understand, the cfaq offsetof subtracts a NULL pointer from the ((type *)0)->member, to ensure that the offset is correct, even if the internal representation of the NULL pointers isn’t actually zero. I guess Linux has good reasons to assume that’s OK to ommit that subtraction.

And the last trick is the ((type *)0) cast. Actually, we pretend that there’s an instance of the struct at address 0. If we tried to reference it, we would be in big trouble, but that never happens. So we trick the compiler and we can legally get the type of the struct member, which is used to declare the __mptr as a pointer to that struct member. It’s also used by offsetof to get the byte offset of the mebmer within the struct(since it uses as a ‘base address’ for the struct the address 0).

Now we can understand(at least partially) what the macro does. It declares a pointer to the member of the struct that ptr points to, and assigns ptr to it. Now __mptr points to the same address as ptr. Then it gets the offset of that mebmer within the struct, and subtracts it from the actual address of the member of the struct ‘instance’(ie __mptr). The (char *)__mptr cast is necessary, so that ‘pointer arithmetic’ will work as intended, ie subtract from __mptr exactly the (size_t) bytes that offsetof ‘returns’.

At this point, I really can’t understand why we couldn’t use the ptr pointer directly. We could ommit the first line, and the macro could be

#define container_of(ptr, type, member) (type *)( (char *)(ptr) - offsetof(type,member) )

ptr is used only once — we don’t need to worry about side effects.
Maybe it’s just good coding practice.

EDIT: Apparently, the first line is there for ‘type checking’. It ensures that type has a member called member(howerver this is done by offsetof macro too, I think), and if ptr isn’t a pointer to the correct type(the type of the member), the compiler will print a warning, which can be useful for debuging.

mmap for Linux drivers

June 26, 2009

Due to an assignment for the Operating Systems Lab, which I’m attending this semester, I started reading about character device drivers for the Linux Kernel. We were a given an incomplete driver, which handled the communication with the hardware, as well as some other issues, and we had to implement the ‘upper layer’ of the driver, which did the communication with the userspace, and handled some locking issues.

Linux Device Drivers(LDD) was very helpful to start with(trying to read kernel code for the first time can be a terrifying experience). However, when I got to the point, where I had to implement the mmap method for the driver, LDD was a bit dated. After some googling and with the help of the Linux Cross Reference, I found out the changes in the Linux Kernel API, and sucessfully implemented mmap(or so I hope :P).

Due to a number of reasons, the nopage method, as well as the populate and the nopfn methods, have been completely removed from the vm_opeartions struct, and the new fault method is used instead for handling page faults.

Besides the changes in the fault handlers, recent kernel releases added another method to map pages to userspace(remap_pfn_range is often used for that purpose). The method is called vm_insert_page, which allows the driver to map a single page to userspace, and can be very useful when you need to map just one page-aligned buffer, which was allocated inside the driver, to userspace.

So, replacing nopage with fault, and using vm_insert_page (simpler code and better suited to the needs of the driver I was writing) instead of remap_pfn_range, did the job.
;)

Btw, in LDD Ch15, it states that it’s not possible to map conventional RAM with remap_pfn_range(ie pages you get from get_free_page) because remap_pfn_range only maps pages with the PG_reserved flag set. However, drivers would manually set the PG_reserved flag to make remap_pfn_range work, although the proper way of remapping ordinary RAM was with the nopage method(described in LDD). Tweaking the PG_reserved flag was considered bad practice, and so Linux 2.6.15 practically removed the PG_reserved flag. A couple of changes were made to the VMA flags as well, including the new vm_insert_page method, and it was made possible to map ‘ordinary’ RAM with remap_pfn_range. Since my knowledge of the memory management in Linux and the VM subsystem is minimal, I can’t explain much more(maybe in another post after a few months ; ).

Follow

Get every new post delivered to your Inbox.

Join 276 other followers