‘weird’ kernel macros — container_of

July 1, 2009

Linux uses a rather complicated (for me) driver model, which is built upon the kobject abstraction. There’s a lot of documentation about the Linux driver model, and describing it in detail is out of the scope of this post (well, actually I can’t do it :P).

In short:
kobjects are structs that hold some information (a name, a reference count, a parent pointer, etc.) and are usually embedded into other structs, typically structs for devices or device drivers (e.g. struct cdev for a character device), creating a hierarchy which is ‘exported’ to userspace through sysfs (mounted on /sys).
The question is how the code that works with kobjects can reference the struct that contains the kobject. The kernel provides a macro (not surprisingly) called container_of, which does exactly that.
The definition of the macro can be found in include/linux/kernel.h:

/**
 * container_of - cast a member of a structure out to the containing structure
 * @ptr: the pointer to the member.
 * @type: the type of the container struct this is embedded in.
 * @member: the name of the member within the struct.
 */
#define container_of(ptr, type, member) ({ \
        const typeof( ((type *)0)->member ) *__mptr = (ptr); \
        (type *)( (char *)__mptr - offsetof(type,member) );})

In order to understand what this macro does, we have to be somewhat familiar with the C Preprocessor, and some non-standard GCC extensions:
1) A parenthesis followed by a brace. GCC calls this a statement expression. It lets us use a compound statement as an expression. Here, we want to use the container_of macro as an expression, but we also want to declare a local variable (__mptr) inside the macro, so we need a compound statement.
2) The typeof GCC extension, which lets us refer to the type of an expression and can be used to declare variables.

The previous two extensions let us write safe macros (i.e. each operand is evaluated only once, so its side effects happen only once) that work for any type (some kind of polymorphism) and can be used as expressions.

This macro also uses offsetof, which computes the byte offset of a field within a structure. Linux uses the compiler-provided offsetof if the compiler provides one; otherwise it defines the offsetof macro as

#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)

A few words about offsetof. offsetof is a standard ANSI C macro, defined in <stddef.h>. The C FAQ gives a possible (though non-portable) implementation of offsetof which is a bit different from the one that the kernel defines. I’m not sure about this, but as far as I can understand, the C FAQ version subtracts a null pointer from the address of ((type *)0)->member, to ensure that the offset is correct even if the internal representation of a null pointer isn’t actually zero. I guess Linux has good reasons to assume that it’s OK to omit that subtraction.

And the last trick is the ((type *)0) cast. We pretend that there’s an instance of the struct at address 0. If we tried to access it, we would be in big trouble, but that never happens. So we trick the compiler and can legally get the type of the struct member, which is used to declare __mptr as a pointer to that struct member. The same cast is also used by offsetof to get the byte offset of the member within the struct (since it uses address 0 as the ‘base address’ of the struct).

Now we can understand (at least partially) what the macro does. It declares __mptr, a pointer to the type of the struct member that ptr points to, and assigns ptr to it. Now __mptr points to the same address as ptr. Then it gets the offset of that member within the struct, and subtracts it from the actual address of the member of the struct ‘instance’ (i.e. __mptr). The (char *)__mptr cast is necessary so that pointer arithmetic works in units of bytes, i.e. it subtracts from __mptr exactly the number of bytes (a size_t) that offsetof ‘returns’.

At this point, I really can’t understand why we couldn’t use the ptr pointer directly. We could omit the first line, and the macro could be

#define container_of(ptr, type, member) ((type *)( (char *)(ptr) - offsetof(type,member) ))

ptr is used only once — we don’t need to worry about side effects.
Maybe it’s just good coding practice.

EDIT: Apparently, the first line is there for ‘type checking’. It ensures that type has a member called member (however, this is done by the offsetof macro too, I think), and if ptr isn’t a pointer to the correct type (the type of the member), the compiler will print a warning, which can be useful for debugging.


11 Responses to “‘weird’ kernel macros — container_of”

  1. scaryreasoner Says:

    Check out the linked list implementation for more macro fun:

    BUILD_BUG_ON is a nice one too.

  2. Nakul Says:

    How come ((type *)0)->member does not generate a segmentation fault? Since 0 is being used in a pointer context (due to the typecast) and we are dereferencing the value pointed to by 0 (since it is equivalent to doing (*0).member), we would be dereferencing a NULL pointer, right?

    • psomas Says:

      actually, you’re not dereferencing a null pointer…
      you are dereferencing the pointer at address 0… it may or may not be a null pointer…
      however, generally, dereferencing any pointer which is located at page zero will generate a segmentation fault in userspace (there are ways to bypass that)…
      in kernel space (and that’s what I’m talking about), things are different, but I think that dereferencing a pointer located at page zero would probably raise an error or something (there are ways to “bypass” that too :P, and many kernel exploits are based on both “bypasses”)…
      the point is that with the way gcc handles structs, you don’t dereference anything…
      it’s probably just a matter of additions/subtractions to get the address of the member of the struct…
      if you wanted the actual value of that member, you would get a seg fault, because you would dereference an “invalid pointer”…
      here’s example code to demonstrate that:

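      #include <stdio.h>

      struct foo {
              int i;
              int j;
      };

      int main(void) {
              struct foo *a = NULL;
              /* only the member's address is computed; nothing is read through a */
              printf("%p %p\n", (void *)&(a->j), (void *)&((*a).j));
              return 0;
      }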

      no seg fault here…

      now, if you try to remove the &, you get a seg fault…
      if you try gcc -S, you’ll see that gcc has already calculated the 0x4, and it just copies it to a register for the printf…
      so, I assume that gcc takes the pointer (the address of the struct) and adds sizeof(int)…
      thus, you don’t get a seg fault…

      • V-ille Says:

        The “dereference” inside container_of is inside typeof, which is not evaluated at runtime, so there’s no chance of a segfault there.

  3. Gamer Girls Says:

    Thankfully some bloggers can write. Thanks for this blog post

  4. czar Says:

    This reference also provides more insight

  5. Siyuan Hua Says:

    really helpful!! GCC has a lot of non-traditional C extensions that I’m not familiar with, but your post helps a lot!!

  6. Indeed a thorough explanation on container_of(). I really appreciate your post here.

  7. Devendra Singh Says:

    “Apparently, the first line is there for ‘type checking’. It ensures that type has a member called member (however, this is done by the offsetof macro too, I think), and if ptr isn’t a pointer to the correct type (the type of the member), the compiler will print a warning, which can be useful for debugging.”

    Regarding the above statement:
    In that case, why are we type casting it again in the second statement, as “(type *)( (char *)__mptr - ….. );”, when it is already taken care of in the first statement?

    • psomas Says:

      It’s been a long time since I wrote this. 🙂 In the second statement, we’re casting the result to the required (return) type, no type-checking, just casting to return a correctly typed pointer. In the first statement, we cast the member pointer (ptr) to a pointer (mptr) to the type of the member (plus the const qualifier), in order to type check the member pointer (ptr). Btw, there was an issue with this implementation, and it was rewritten. The current version probably is even more interesting. [1]

      [1] https://patchwork.kernel.org/patch/9742995/
