January 19, 2010

Here’s the deal.
We have an OpenVPN server, part of a network, for instance network, and server’s IP
We connect to the OpenVPN server using UDP and a virtual tap interface, let’s say tap0.
After we’ve connected successfully with the vpn server, we run a dhcp client on the tap0 interface, and get an IP inside the network, let’s say
Along with the IP assignment, a new route will be added in the routing table, a route to the network, with no gateway(=link), through the tap0 interface.
However, now we can’t contact the OpenVPN server. After the new route is added, all of our packets to the VPN server, including the VPN packets themselves, will be routed through the tap0 interface, and therefore the VPN will stop working.
So, we add to the routing table a route to the vpn server(, via our local gateway(for instance, through our physical network interface(for instance eth0).
Now, we can communicate with every other host inside the network over a VPN encrypted channel. But all of our connections to the VPN server will go through the unencrypted channel( route, bypassing the VPN/tap0 interface).
But that’s not what we actually want.
Actually, we want to communicate with the VPN server over the VPN ‘tunnel’(and through tap0) for all the connections we make, except for the VPN connection.
That’s possible if we use iptables and iproute2.
We’ll mark the packets of the VPN connection using iptables(ie the packets using UDP, with destination address the VPN server, and destination port the port to which the server listens — port 1194 most likely).
iptables -t mangle -A OUTPUT -p udp -d --dport 1194 -j MARK --set-mark 1
Now, we’ll create a rule with iproute2, which will route the marked packets using a different routing table.
First we create the new table.
echo 200 vpn.out >> /etc/iproute2/rt_tables
We add the rule.
ip rule add fwmark 1 lookup vpn.out
And we add the route for the vpn server to the vpn.out table.
ip route add via dev eth0 table vpn.out
One last thing.
With this configuration, there’s a problem in the selection of the source address for the vpn packets to the vpn server. Because the marking and the change of the route happen later, OpenVPN will see the “, no gateway, dev tap0” route in the main routing table, and will select the tap0 IP as the source address, which is obviously wrong since we want to get routed through the eth0 interface(with IP for instance). This is fixed if we add the local option in our vpn client configuration file, so that OpenVPN binds to that address and selects it correctly as the source address.
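To see what the local option buys us, here is a minimal sketch in C of what binding does to the source address of a UDP socket. This is not OpenVPN’s code; the helper names udp_socket_bound_to and bound_address are made up for this example, and it assumes a POSIX system with IPv4.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Create a UDP socket whose source address is fixed to local_ip.
 * Hypothetical helper: returns the socket descriptor, or -1 on error.
 */
int udp_socket_bound_to(const char *local_ip)
{
    struct sockaddr_in local;
    int fd;

    assert(local_ip != NULL);

    memset(&local, 0, sizeof(local));
    local.sin_family = AF_INET;
    local.sin_port = htons(0);              /* let the kernel pick a port */
    if (inet_pton(AF_INET, local_ip, &local.sin_addr) != 1)
        return -1;                          /* not a valid IPv4 address */

    fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    /* After bind(), the kernel no longer picks the source address
     * from the routing table lookup; it uses the one given here. */
    if (bind(fd, (struct sockaddr *)&local, sizeof(local)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Write the address the socket is bound to into buf; 0 on success. */
int bound_address(int fd, char *buf, size_t len)
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof(addr);

    if (getsockname(fd, (struct sockaddr *)&addr, &alen) < 0)
        return -1;
    return inet_ntop(AF_INET, &addr.sin_addr, buf, len) ? 0 : -1;
}
```

Binding to the loopback address and asking bound_address() for the result gives back the same address, no matter what the routing table says. With the eth0 address instead, the packets leave with the source address we want.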
That’s it!
We send only vpn packets through the route, and everything else, including all other connections to the VPN server, are sent over vpn.
This ‘trick’ is very useful when you want to be able to ssh to the VPN server, but you want to prohibit ssh from IPs outside the local network.

~6k lines of C code, and you have an x86 hypervisor in the Linux kernel!

lguest is a very minimal x86 hypervisor for Linux, very easy to set up, and due to its very small code base, it can be fun to hack/tweak.

Besides its small code base, it’s very well documented, thus ‘studying’ the lguest code is probably the best way to start learning a few things about virtualization and hypervisors.

Actually, the Makefile in drivers/lguest/ can be used to ‘extract’ the comments from the lguest code, in a very organized way, which makes it even easier to understand how lguest works(just cd to the directory and type make Beer, or make Puppy, if you want some ascii puppies : )

lguest requires some kernel configuration. Besides lguest itself, you must include some virtio modules, particularly virtio_blk, and virtio_net if you need networking inside the guest.

Afterwards, you must compile the Documentation/lguest/lguest.c, which you’ll run in userspace every time you want to ‘launch’ a new guest.

In order to run a new guest, you just run the lguest launcher you just compiled, with some parameters, like the amount of memory the guest should have, the kernel image(usually the same image that the host runs) and/or the initrd needed, a rootfs(you can either download a minimal rootfs for testing, or create one using qemu), and the virtual block device(/dev/vda), where the rootfs will be mounted inside the guest.
lguest 64m vmlinux --initrd=initrd.img --block=rootfile root=/dev/vda
That’s it! You have a guest running.

Of course lguest lacks features, included in other hypervisors/virtualization solutions, but that’s the point. A small ‘experimental’ hypervisor, easy to understand and maybe tweak. ;)

rsnapshot tips&tricks

July 19, 2009

rsnapshot is a great application for taking backups. It uses rsync and hard links, and makes backup management very easy. It comes with a nice perl script, rsnapreport, which reads the output of the rsync commands used by rsnapshot, and prints a useful report, with the stats of each rsync command.

For detailed info about rsnapshot, you can visit the website of the project.

A typical configuration would set up a cronjob, in which the output of the rsnapshot sync command(or rsnapshot daily, if the sync_first option is not enabled), is piped to rsnapreport, and the output of rsnapreport is piped to a CLI SMTP client, to send us a mail with the stats of the sync operation.(sendEmail is a very nice SMTP client ;) ).

However, if you’re taking backups from multiple machines, the sync operation can last longer than expected. So, if the datablock timeout in our SMTP server isn’t large enough, we will never get an email.

This is solved if we use a wrapper script for the rsnapshot sync operation. We use that script for the cronjob, and inside the script we have something like this:

rsnapshot sync > /tmp/rsync_stats 2>&1
cat /tmp/rsync_stats | | sendEmail -f backup@foo -t user@bar -u rsnapshot report
rm -f /tmp/rsync_stats

And problem solved! ;)

Linux uses a rather complicated(for me) driver model, which is built upon the kobject abstraction. There’s a lot of documentation about the linux driver model, and describing it in detail is out of the scope of this post(well, actually I can’t do it : P ).

In short:
kobjects are structs that hold some information(a name, a reference count, a parent pointer etc), and are usually embedded into other structs, typically structs for devices/device drivers(ie struct cdev, for a character device), creating a hierarchy which is ‘exported’ to userspace through sysfs(mounted on /sys).
The question is how the code that works with kobjects can reference the struct that contains the kobject. The kernel provides a macro, (not surprisingly) called container_of, which does exactly that.
The definition of the macro can be found in include/linux/kernel.h:

/**
 * container_of - cast a member of a structure out to the containing structure
 * @ptr: the pointer to the member.
 * @type: the type of the container struct this is embedded in.
 * @member: the name of the member within the struct.
 */
#define container_of(ptr, type, member) ({ \
	const typeof( ((type *)0)->member ) *__mptr = (ptr); \
	(type *)( (char *)__mptr - offsetof(type,member) );})

In order to understand what this macro does, we have to be somewhat familiar with the C Preprocessor, and some non-standard GCC extensions:
1) A parenthesis followed by a brace. GCC calls this a statement expression. It lets us use a compound statement as an expression. Here, we want to use the container_of macro as an expression, but we want to declare a local variable(__mptr) inside the macro, so we need a compound statement.
2) The typeof GCC extension, which lets us refer to the type of an expression, and can be used to declare variables.

The previous two extensions let us write safe macros(ie each operand is evaluated only once, so its side effects happen only once) that work for any type(some kind of polymorphism), and can be used as expressions.
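A small sketch of why these two extensions matter. The macro names max_once and max_naive and the two helper functions are made up for this example; it uses __typeof__, the spelling GCC and Clang accept even in strict ISO modes, where the kernel simply writes typeof.

```c
#include <assert.h>

/* GCC/Clang only: a statement expression plus __typeof__.
 * Each operand is copied into a local first, so it is evaluated once. */
#define max_once(a, b) ({              \
    __typeof__(a) _a = (a);            \
    __typeof__(b) _b = (b);            \
    _a > _b ? _a : _b; })

/* Plain C version for comparison: the winning operand is evaluated twice. */
#define max_naive(a, b) ((a) > (b) ? (a) : (b))

/* Returns how many times the operand's side effect ran with max_once. */
int side_effects_with_max_once(void)
{
    int calls = 0;
    int m = max_once(++calls, 0);      /* ++calls runs exactly once */

    assert(m == 1);
    return calls;
}

/* Same experiment with the naive macro. */
int side_effects_with_max_naive(void)
{
    int calls = 0;
    int m = max_naive(++calls, 0);     /* ++calls runs in the comparison and again in the result */

    (void)m;
    return calls;
}
```

The first function returns 1 and the second returns 2, which is exactly the kind of surprise the statement expression protects us from. The same macro also works unchanged for long, double or pointer operands, because the locals take their types from the arguments.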

This macro also uses the offsetof, which computes the byte offset of a field within a structure. Linux uses the compiler-provided offsetof, if the compiler provides one, else it defines the offsetof macro as

#define offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)

A few words about offsetof. offsetof is a valid ANSI C macro. Cfaq gives a possible implementation of offsetof(though non-portable) which is a bit different from the one that the kernel defines. I’m not sure about this, but as far as I can understand, the cfaq offsetof subtracts a NULL pointer from the address of ((type *)0)->member, to ensure that the offset is correct, even if the internal representation of NULL pointers isn’t actually zero. I guess Linux has good reasons to assume it’s OK to omit that subtraction.
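A quick way to convince yourself that the fallback definition and the compiler-provided offsetof agree. This is only a sketch: struct sample and the function names are invented for the example, the fallback is renamed to null_base_offsetof so that it can’t clash with the one from stddef.h, and it relies on GCC/Clang tolerating the null-pointer trick, which the C standard doesn’t promise.

```c
#include <assert.h>
#include <stddef.h>

/* The kernel's fallback definition, under a different (made-up) name. */
#define null_base_offsetof(TYPE, MEMBER) ((size_t) &((TYPE *)0)->MEMBER)

/* Hypothetical struct, for illustration only. */
struct sample {
    char tag;
    int value;
    double weight;
};

/* Offset of 'value' according to the compiler-provided offsetof. */
size_t value_offset_standard(void)
{
    return offsetof(struct sample, value);
}

/* Offset of 'value' computed by pretending a struct lives at address 0. */
size_t value_offset_null_base(void)
{
    return null_base_offsetof(struct sample, value);
}

/* Both definitions should agree for every member. */
void check_offsets_agree(void)
{
    assert(null_base_offsetof(struct sample, tag) == offsetof(struct sample, tag));
    assert(null_base_offsetof(struct sample, value) == offsetof(struct sample, value));
    assert(null_base_offsetof(struct sample, weight) == offsetof(struct sample, weight));
}
```

The actual numbers depend on the padding the compiler inserts, so the example only compares the two definitions instead of hard-coding offsets.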

And the last trick is the ((type *)0) cast. Actually, we pretend that there’s an instance of the struct at address 0. If we tried to dereference it, we would be in big trouble, but that never happens. So we trick the compiler and we can legally get the type of the struct member, which is used to declare __mptr as a pointer to that struct member. It’s also used by offsetof to get the byte offset of the member within the struct(since it uses address 0 as the ‘base address’ of the struct).

Now we can understand(at least partially) what the macro does. It declares a pointer to the member of the struct that ptr points to, and assigns ptr to it. Now __mptr points to the same address as ptr. Then it gets the offset of that member within the struct, and subtracts it from the actual address of the member of the struct ‘instance’(ie __mptr). The (char *)__mptr cast is necessary, so that ‘pointer arithmetic’ will work as intended, ie subtract from __mptr exactly the (size_t) bytes that offsetof ‘returns’.
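Here is a small userspace sketch of the macro at work. The structs fake_kobject and fake_device are stand-ins I made up, not the real kernel structs, and the macro is copied with __typeof__ and the offsetof from stddef.h so that it builds outside the kernel(GCC or Clang assumed).

```c
#include <assert.h>
#include <stddef.h>

/* Userspace copy of the kernel macro (GCC/Clang extensions). */
#define container_of(ptr, type, member) ({                      \
    const __typeof__(((type *)0)->member) *__mptr = (ptr);      \
    (type *)((char *)__mptr - offsetof(type, member)); })

/* Hypothetical stand-in for struct kobject. */
struct fake_kobject {
    const char *name;
    int refcount;
};

/* Hypothetical device struct with the kobject embedded in it. */
struct fake_device {
    int minor;
    struct fake_kobject kobj;   /* embedded, not a pointer */
};

/* Code that is only handed the embedded member can still reach the container. */
struct fake_device *device_from_kobject(struct fake_kobject *kobj)
{
    return container_of(kobj, struct fake_device, kobj);
}

/* Read a field of the container, starting from the embedded member. */
int minor_from_kobject(struct fake_kobject *kobj)
{
    assert(kobj != NULL);
    return device_from_kobject(kobj)->minor;
}
```

Given a struct fake_device dev, device_from_kobject(&dev.kobj) gives back &dev. Note that this only works because kobj is embedded in the struct; if it were a pointer to a separately allocated kobject, there would be no fixed offset to subtract.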

At this point, I really can’t understand why we couldn’t use the ptr pointer directly. We could omit the first line, and the macro could be

#define container_of(ptr, type, member) (type *)( (char *)(ptr) - offsetof(type,member) )

ptr is used only once — we don’t need to worry about side effects.
Maybe it’s just good coding practice.

EDIT: Apparently, the first line is there for ‘type checking’. It ensures that type has a member called member(however, this is done by the offsetof macro too, I think), and if ptr isn’t a pointer to the correct type(the type of the member), the compiler will print a warning, which can be useful for debugging.

mmap for Linux drivers

June 26, 2009

Due to an assignment for the Operating Systems Lab, which I’m attending this semester, I started reading about character device drivers for the Linux Kernel. We were given an incomplete driver, which handled the communication with the hardware, as well as some other issues, and we had to implement the ‘upper layer’ of the driver, which did the communication with the userspace, and handled some locking issues.

Linux Device Drivers(LDD) was very helpful to start with(trying to read kernel code for the first time can be a terrifying experience). However, when I got to the point where I had to implement the mmap method for the driver, LDD was a bit dated. After some googling and with the help of the Linux Cross Reference, I found out the changes in the Linux Kernel API, and successfully implemented mmap(or so I hope :P).

Due to a number of reasons, the nopage method, as well as the populate and the nopfn methods, have been completely removed from vm_operations_struct, and the new fault method is used instead for handling page faults.

Besides the changes in the fault handlers, recent kernel releases added another method to map pages to userspace(remap_pfn_range is often used for that purpose). The method is called vm_insert_page, which allows the driver to map a single page to userspace, and can be very useful when you need to map just one page-aligned buffer, which was allocated inside the driver, to userspace.

So, replacing nopage with fault, and using vm_insert_page (simpler code and better suited to the needs of the driver I was writing) instead of remap_pfn_range, did the job.
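For context, this is roughly what the userspace side of such a mapping looks like. It is only a sketch and it doesn’t show the driver code at all: the helper write_through_mapping is made up for the example, and an ordinary file stands in for the device node, so that it can run without the driver(POSIX system assumed).

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map one page of the open file fd, copy msg into it through the mapping,
 * and unmap it again. Hypothetical helper: returns 0 on success, -1 on error.
 */
int write_through_mapping(int fd, const char *msg)
{
    long page = sysconf(_SC_PAGESIZE);
    char *p;

    assert(msg != NULL);
    if (page <= 0 || strlen(msg) >= (size_t)page)
        return -1;
    if (ftruncate(fd, page) < 0)            /* make the file one page long */
        return -1;

    /* With a device node, this call is what ends up in the driver's mmap method. */
    p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return -1;

    strcpy(p, msg);                          /* a plain memory write, no write() call */
    return munmap(p, (size_t)page);
}
```

With a regular file, the data written through the mapping can be read back with read() afterwards. With a driver that maps a single page-aligned buffer, the same memory write would land directly in the driver’s buffer.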

Btw, in LDD Ch15, it states that it’s not possible to map conventional RAM with remap_pfn_range(ie pages you get from get_free_page) because remap_pfn_range only maps pages with the PG_reserved flag set. However, drivers would manually set the PG_reserved flag to make remap_pfn_range work, although the proper way of remapping ordinary RAM was with the nopage method(described in LDD). Tweaking the PG_reserved flag was considered bad practice, and so Linux 2.6.15 practically removed the PG_reserved flag. A couple of changes were made to the VMA flags as well, including the new vm_insert_page method, and it was made possible to map ‘ordinary’ RAM with remap_pfn_range. Since my knowledge of the memory management in Linux and the VM subsystem is minimal, I can’t explain much more(maybe in another post after a few months ; ).

