January 07, 2004 Edition

By Jorge Castro (mailto:jorge@whiprush.org), Amit Gurdasani (mailto:amit@arslinux.com)



Welcome back to Linux.Ars, as we break in the New Year with a supersized issue. First, we've got some goodies for the KDE folk. Following that, we have part two of our diskless compute farms piece: no more excuses, get crunching for Lamb Chop. Finally, we wrap it up as we always do: with /dev/random.


Getting ready for KDE 3.2

This month should be a good one for KDE users. First off, KDE-apps (http://kde-apps.org/) gets off the ground, bringing the same quality that you've come to expect from its sister site, KDE-look (http://www.kde-look.org/), which generally concentrates on themes for KDE.

With 3.2 around the corner, now is a good time to start playing with the betas. While scripts like konstruct (http://developer.kde.org/build/konstruct/) are good for experienced Linux users, many times new users become frustrated attempting to build projects like this, or maybe they don't want to spend the time building a beta. So what's an impatient KDE lover to do?

Enter SLAX (http://www.slax.org/), a LiveCD that brings KDE 3.2 Beta 2 and KOffice 1.3 Beta 2. All this on a mini-CD at less than 180 megabytes. We tried the CD on a few machines, and it's a great way to play with the latest KDE betas without disturbing the host system. This disc is a must for any KDE fan. With nearly a year gone by since 3.1, this new release brings plenty to the table, so make sure you snag (http://www.slax.org/) a copy.


TTT: Tools, Tips and Tweaks. Diskless compute farms, or how to keep Team Lamb Chop on top

In this issue, we show you how to set up a network of inexpensive systems to crunch numbers for a distributed computing project. These days, hardware is cheap, with newer and better hardware released ever faster and at continually dropping prices. The wide availability of inexpensive processors and memory, as well as low-cost motherboards that build in everything from the display device to the LAN interface, enables some (http://episteme.arstechnica.com/eve/ubb.x?a=tpc&s=50009562&f=122097561&m=4400994775) hobbyists (http://episteme.arstechnica.com/eve/ubb.x?a=tpc&s=50009562&f=122097561&m=4220982775) to create "farms" meant for a single purpose: to process data brought in by distributed computing project clients. For reasons of both cost and manageability, it's advantageous to do this without a hard disk in each machine. Fortunately, this isn't hard to do.

We covered the network boot process in an earlier issue (http://www.arstechnica.com/etc/linux/2003/linux.ars-12172003-1.html). In summary, the computer's BIOS will pass control to the boot ROM on the network adapter, which will obtain a DHCP lease and then load a bootloader (in our case, pxelinux) via TFTP. The bootloader will then load the kernel and an initial RAM disk (initrd) into memory, and then boot the kernel. The kernel will then initialize devices, use the BOOTP protocol to get a lease (again, since it cannot use information from the boot ROM or the bootloader) and run a script in the initrd that will mount the root filesystem via NFS. The root filesystem's init program will take care of starting system services and the distributed computing client.

Client filesystem

For our purposes, we will construct the initrd and root filesystem from a small umsdos Linux distribution such as ZipSlack (ftp://ftp.slackware.com/pub/slackware/slackware-9.1/zipslack/). ZipSlack comes in the form of a Zip archive of a umsdos filesystem. umsdos is a naming and metadata storage scheme layered on top of FAT filesystems that allows file permissions, symbolic links, hard links, case-sensitive long filenames and the like to work on top of FAT's simple 8.3 naming scheme. It works by mangling filenames to the 8.3 format and storing the extra filesystem information in special files. The upshot is that the files need to be placed on a FAT filesystem, the filesystem then mounted as umsdos, and finally the files copied off it. Fortunately, this is rather easy to do, as ZipSlack is sized to fit on an Iomega 100 MB Zip disk, so the filesystem size is bounded.

amitg@athena:~$ dd if=/dev/zero of=zipslack.img count=1 bs=100M
1+0 records in
1+0 records out
104857600 bytes transferred in 22.765074 seconds (4606073 bytes/sec)
amitg@athena:~$ /sbin/mkdosfs -F 16 zipslack.img
mkdosfs 2.8 (28 Feb 2001)
amitg@athena:~$ su -
root@athena:~# mkdir -p /mnt/zipslack /var/clients/default
root@athena:~# mount -t msdos -o loop zipslack.img /mnt/zipslack
root@athena:~# cd /mnt/zipslack
root@athena:/mnt/zipslack# unzip -qq ~amitg/zipslack.zip
root@athena:/mnt/zipslack# cd
root@athena:~# umount /mnt/zipslack
root@athena:~# mount -t umsdos -o loop zipslack.img /mnt/zipslack
root@athena:~# cd /mnt/zipslack
root@athena:/mnt/zipslack# find . -xdev -print0 | cpio -pa0Vdmu --sparse /var/clients/default
<lots of dots later>
root@athena:/mnt/zipslack# cd
root@athena:~# umount /mnt/zipslack
root@athena:~# rmdir /mnt/zipslack

The above snippet creates a 100 MB disk file called zipslack.img, then creates a FAT16 filesystem on it. The resulting filesystem is mounted at /mnt/zipslack, and the zipslack.zip archive downloaded from a Slackware mirror is unpacked onto it. The filesystem is then unmounted and mounted again as umsdos. (For fans of -o remount, we should point out that the filesystem type cannot be changed on a remount.) Finally, all the files are copied off the umsdos filesystem into a directory called /var/clients/default, and the filesystem is unmounted and deleted. (Thanks to Kyle "greenfly" Rankin for the technique to copy over all the files efficiently.)

Constructing the initrd

Some of you may be aware that the kernel has the ability to boot straight off an NFS root filesystem. Why aren't we doing this? The answer: manageability.

In order to boot off an NFS root filesystem directly, the kernel must know the NFS export to mount at startup time. That is, this information must be provided to it by the bootloader. The bootloader, in turn, must somehow know what NFS root filesystem to mount and, more importantly, a filesystem must preexist for each client machine, which in turn must be able to mount it consistently. This makes managing the systems a nightmare. For one thing, we would need a unique identifier for each machine so that its filesystem can be identified. That part is easy: just use the MAC address of the network adapter, which is unlikely to match another MAC address on the same LAN. However, this also means that the bootloader must be configured to provide the name of the NFS export to the kernel for each machine, and this in turn means that there must be a separate bootloader configuration for each machine. Every single machine's MAC address would have to be catalogued, and the administrator would need to create a bootloader configuration and an appropriate NFS export for each machine before it could be connected.

There is a better way that avoids all this tedium. We can create an initial RAM disk that is the same for all computers, loaded by the bootloader. Since the initial RAM disk is the same, so is the bootloader configuration. The initial RAM disk can incorporate a script that can detect the MAC address of the network adapter, see if an NFS export already exists corresponding to it, and if not, a copy of a template installation can be made. Then, once the NFS export is known to exist, it can be mounted as the root filesystem and the initrd can go away. This allows for far fewer management headaches.

Our means of constructing the initrd is simple: we copy over files corresponding to various Slackware packages from the ZipSlack directory. This is something of a hack and results in a large initrd (about 20 MB uncompressed), so it isn't the most efficient approach. However, other methods that could strip it down to about 2 MB, such as using uClibc instead of the full GNU C library and a BusyBox binary instead of the regular Slackware programs, are fairly involved and beyond the scope of this write-up.

We copy a few packages: aaa_base (the "skeleton" directory structure of the filesystem), etc (the basic configuration files), glibc-solibs (the GNU C runtime library), bin (basic Linux executables), coreutils (the GNU core utilities), util-linux (programs needed for booting, mounting, etc.), sed, grep and gawk (text-processing utilities needed for various scripts), bash (the Bourne-again shell, needed as a POSIX-compliant shell), elflibs (bash depends on this) and tcpip (for network interface and route management).

amitg@athena:~$ dd if=/dev/zero of=initrd count=1 bs=20M
1+0 records in
1+0 records out
20971520 bytes transferred in 1.172311 seconds (17889042 bytes/sec)
amitg@athena:~$ su -
root@athena:~# cd ~amitg
root@athena:/home/amitg# mke2fs initrd
mke2fs 1.33 (21-Apr-2003)
initrd is not a block special device.
Proceed anyway? (y,n) y
<lots of stuff>
root@athena:/home/amitg# tune2fs -c0 -i0 initrd
tune2fs 1.33 (21-Apr-2003)
Setting maximal mount count to -1
Setting interval between checks to 0 seconds
root@athena:/home/amitg# mkdir -p /mnt/loop
root@athena:/home/amitg# mount -o loop -t ext2 initrd /mnt/loop
root@athena:/home/amitg# cd /var/clients/default
root@athena:/var/clients/default# for pkg in aaa_base etc glibc-solibs bin coreutils \
> util-linux sed gawk grep bash elflibs tcpip
> do grep -A10000 '^FILE LIST:$' var/log/packages/${pkg}* \
>  | sed 's/\/incoming\//\//' | grep -v '^FILE LIST:$' | \
> sed 's/\.new$//' | grep -v '^install\/' | cpio -paVdmu --sparse /mnt/loop
> done
<lots of dots later>
root@athena:/var/clients/default# cp bin/bash /mnt/loop/bin
root@athena:/var/clients/default# mkdir mnt/initrd
root@athena:/var/clients/default# cd /mnt/loop
root@athena:/mnt/loop# ln bin/bash bin/sh
root@athena:/mnt/loop# ldconfig -r .
root@athena:/mnt/loop# mkdir mnt/nfsroot

What the heck was that?!

We created a 20 MB disk file called initrd, created an ext2 filesystem on it, and mounted it on /mnt/loop. Then the contents of the selected packages were copied over to it from /var/clients/default, where we placed our ZipSlack installation. That really long command parses out the file listing for each selected package and copies its contents over. You might notice that a number of files are not found. These were either culled from Slackware to get ZipSlack to fit in 100 MB, or were named differently in the original package and then renamed by the package installation scripts to something else. (bash did this, so we needed to copy it over afterward.) Finally, ldconfig was used to create the necessary symbolic links to libraries.

Now, we need to create a script that'll mount the NFS root filesystem, creating it if necessary. Here's a script that'll do just that. We place it in the initrd filesystem mounted at /mnt/loop, and call it linuxrc.

#!/bin/sh
# linuxrc: locate (or create) this client's NFS root, then pivot to it.
PATH=/usr/bin:/bin:/sbin:/usr/sbin; export PATH
# Paths to the tools on the initrd.
ECHO=/bin/echo
SH=/bin/sh
MKDIR=/bin/mkdir
CP=/bin/cp
MOUNT=/bin/mount
UMOUNT=/bin/umount
CHROOT=/usr/bin/chroot
PIVOT_ROOT=/sbin/pivot_root
IFCONFIG=/sbin/ifconfig
INIT=/sbin/init
# Site-specific settings; adjust the server address for your network.
NFSSERVER=192.168.0.1
NFSEXPORT=/var/clients
MOUNTPOINT=/mnt/nfsroot
OLDROOTMTPT=/mnt/initrd

# Determine the MAC address of the first Ethernet interface.
MACADDRESS=`$IFCONFIG eth0 | /usr/bin/sed -n 's/.*HWaddr \([0-9A-Fa-f:]*\).*/\1/p'`
$ECHO MAC address found to be ${MACADDRESS}.

# We need to mount the client export first to see if our intended root directory
# exists. If it doesn't, we need to create it.
$MOUNT -t nfs -o nolock ${NFSSERVER}:${NFSEXPORT} ${MOUNTPOINT}
if [ ! -d ${MOUNTPOINT}/${MACADDRESS} ]; then
        if [ ! -d ${MOUNTPOINT}/default ]; then
                # No template, exit to shell.
                $ECHO The template directory was not found. Dropping to an emergency shell.
                exec $SH
                exit 127
        fi
        # Create NFS root filesystem from the template.
        $MKDIR -p ${MOUNTPOINT}/${MACADDRESS}
        cd ${MOUNTPOINT}/default
        # First the hard links.
        $CP -ldpR bin boot lib mnt proc sbin usr ../${MACADDRESS}
        # Then the actual file copy.
        $CP -dpR opt etc dev home root tmp var ../${MACADDRESS}
        cd /
fi

# Now that we know our intended root filesystem exists, let's mount it.
$UMOUNT ${MOUNTPOINT}
$MOUNT -t nfs -o nolock ${NFSSERVER}:${NFSEXPORT}/${MACADDRESS} ${MOUNTPOINT}

# Switch to new root using pivot_root. This part is inspired by the man page.
cd ${MOUNTPOINT}
${PIVOT_ROOT} . .${OLDROOTMTPT}
exec ${CHROOT} . ${SH} -c "${UMOUNT} ${OLDROOTMTPT}/dev; ${UMOUNT} ${OLDROOTMTPT}; \
    exec ${INIT}" < /dev/console > /dev/console 2>&1
# (Hopefully) never reached.
$ECHO Switching to new root failed. Trying to drop to an emergency shell.
exec $SH
exit 126

Now, an explanation. The first thing the script does is determine the MAC address of the first network interface (eth0). Next, it mounts /var/clients off the NFS server and checks whether a directory whose name matches the MAC address exists under /var/clients. If not, it copies files from /var/clients/default into a new directory with that name. (Note that it creates hard links for most of the files, since we don't expect them to be written to; only directories containing files that are expected to change are actually copied.) Then /var/clients is unmounted, the intended root filesystem (/var/clients/MA:CA:DD:RE:SS:xx) is mounted, and the pivot_root tool is used to switch root filesystems. Finally, the old root filesystem, the initrd, is unmounted and its memory freed.

You might have noticed that one of the mount options is "nolock." Why aren't we using NFS locking? Isn't that dangerous?

Locking requires quite a bit more complexity. For one thing, portmap and statd are required on the client-side. These must be started by the initrd, killed before switching the NFS root directory, then started up again. Additionally, they require the loopback interface, lo, to be set up to function correctly. And then, realize that we don't really need locking, since we don't expect two or more clients to write to the same export at the same time (each client gets its own exclusive area to write in anyway). It's left as an exercise to the paranoid reader to modify the script and the initrd to add in NFS locking. (Besides, ZipSlack doesn't seem to come with statd, so for the sake of simplicity, we left out locking. ;) )

root@athena:/mnt/loop# chmod 700 linuxrc
root@athena:/mnt/loop# cd
root@athena:~# umount /mnt/loop
root@athena:~# exit
amitg@athena:~$ gzip -9 initrd

The next thing to do is to create a script on the template ZipSlack installation that will start the distributed computing client, as well as to set up the distributed computing client itself. As an example, we use the Distributed Folding (http://www.distributedfolding.org/) client for Linux.

amitg@athena:~$ su -
root@athena:~# cd /var/clients/default
root@athena:/var/clients/default# mkdir opt
root@athena:/var/clients/default# cd opt
root@athena:/var/clients/default/opt# tar xzf ~amitg/distribfold-current-linux-i386-icc.tar.gz
root@athena:/var/clients/default/opt# cd distribfold
root@athena:/var/clients/default/opt/distribfold# ./foldtrajlite -f protein -n native -if
<configuration happens here. Get it to be as quiet as possible. Enable things like automatic
updates, etc.>
root@athena:/var/clients/default/opt/distribfold# cd ../../etc/rc.d

The next thing is to create an etc/rc.d/rc.local script that can start up the client at boot time. This usually involves cleaning up (e.g., any lock files left behind if the client or machine crashed the last time, running any scripts to start up the client). Here's an example for Distributed Folding:

# /etc/rc.d/rc.local:  Local system initialization script.
# Put any local setup commands in here:
cp /proc/mounts /etc/mtab
cd /opt/distribfold
./foldit &

(The cp command essentially gets /etc/mtab to reflect the actual mounted filesystems.)

The next thing on our plate is building an appropriate kernel to do a network boot.


Kernel setup

The kernel used in this diskless setting should be able to initialize the Ethernet adapter, use BOOTP to set up IP information, and boot off the initrd by itself. Often, distribution-supplied kernels are capable of this; however, this is not always the case. (Some such kernels are unable to even boot off the hard disk by themselves, since all their drivers are built as loadable kernel modules that are stuffed into an initial RAM disk that is loaded into RAM by the boot loader along with the kernel.) We build a stripped-down kernel that's capable of providing enough application support to be able to run the distributed computing client, and also able to boot and run off the network. This means that, at minimum, it must support (built-in, rather than as a module) the PCI bus (or whatever the network device sits on), the network device itself, NFS client support (preferably with NFSv3), IP kernel autoconfiguration support, NFS root filesystem support and, of course, RAM disk support along with support for an initial RAM disk. Since the initrd has an ext2 filesystem, it must support this (called "Second extended filesystem support" in the kernel configuration system). Also, several base scripts and the like use /proc and /dev/pts, so enable support for the proc and devpts filesystems, as well as UNIX98 pty support. Since we didn't copy any files for /dev into the initrd to save space, we will also enable devfs support, as well as the ability to mount devfs at boot time.
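As a checklist, the requirements above correspond roughly to the following kernel configuration symbols. These names are taken from 2.4-era kernel trees; verify them against your own source, and substitute the driver for your actual network adapter:

```
CONFIG_PCI=y
CONFIG_NET_PCI=y
CONFIG_8139TOO=y        # example NIC driver; use the one for your adapter
CONFIG_IP_PNP=y
CONFIG_IP_PNP_BOOTP=y
CONFIG_NFS_FS=y
CONFIG_NFS_V3=y
CONFIG_ROOT_NFS=y
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_EXT2_FS=y
CONFIG_PROC_FS=y
CONFIG_DEVPTS_FS=y
CONFIG_UNIX98_PTYS=y
CONFIG_DEVFS_FS=y
CONFIG_DEVFS_MOUNT=y
```

Everything here must be built in (=y, not =m): there is no filesystem to load modules from until the NFS root is mounted.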

We won't go into building a kernel in detail here, especially since there are plenty of excellent guides to do this on the Web. This one (http://www.justlinux.com/nhf/Compiling_Kernels/Kernel_Configuration_and_Compilation.html) at JustLinux (http://www.justlinux.com/) (formerly LinuxNewbie.org) ought to help. Don't install it on your server; stop after the make bzImage step.

After the kernel is built, we set up the network boot facilities, starting with a DHCP server.

DHCPd setup

The most commonly available and widely used DHCP server on Linux is dhcpd (http://www.isc.org/products/DHCP/) from the Internet Software Consortium (http://www.isc.org/). It can handle both DHCP and BOOTP requests.

Here is a basic configuration for DHCPd, in /etc/dhcpd.conf:

option domain-name "localdomain";
ddns-update-style none;
default-lease-time 600;
max-lease-time 7200;
# Addresses below are examples for a 192.168.0.0/24 network; adjust to your site.
subnet 192.168.0.0 netmask 255.255.255.0 {
        deny duplicates;
        one-lease-per-client true;
        option domain-name-servers 192.168.0.1;
        option broadcast-address 192.168.0.255;
        option routers 192.168.0.1;
        range dynamic-bootp 192.168.0.10 192.168.0.200;
        filename "pxelinux.0";
        allow unknown-clients;
}

This essentially allows requesting DHCP and BOOTP clients to get IP addresses from the dynamic-bootp range, forcing one IP address per MAC address. Network-booting clients that perform DHCP or BOOTP requests will be told to ask the TFTP server (here, the boot server itself) for a file called pxelinux.0 and to execute its contents.

If you use Etherboot, tell it to load a kernel NBI image instead, called, say, vmlinuz.nbi.

filename "vmlinuz.nbi";
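If only some of the nodes use Etherboot, dhcpd can override the filename on a per-host basis instead. A hypothetical host entry (the name and MAC address are examples) would look like this:

```
host etherboot-node {
        hardware ethernet 00:11:22:33:44:55;
        filename "vmlinuz.nbi";
}
```

Hosts without such an entry still fall through to the subnet-wide pxelinux.0.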

We start up DHCPd.

root@athena:~# /etc/init.d/dhcp start

TFTPd setup

The PXE BIOS, pxelinux, and Etherboot all use the TFTP protocol to obtain the next-stage files during the boot process. For this, we use H. Peter Anvin's version of tftpd. This is packaged with most Linux distributions, and you can obtain source code at kernel.org (ftp://ftp.kernel.org/pub/software/network/tftp/).

tftpd is typically started by inetd or xinetd; setup is often done by the package installer itself. If not, edit /etc/inetd.conf or /etc/xinetd.d/tftp appropriately, as may be the case, and then get the daemon to reread the configuration file.

root@athena:~# /etc/init.d/inetd reload

You can test whether tftpd is working as expected by using the tftp command-line client.

amitg@athena:~$ fgrep tftp /etc/inetd.conf 
tftp           dgram   udp     wait    root  /usr/sbin/in.tftpd in.tftpd -s /usr/local/tftpboot
amitg@athena:~$ netstat -a | fgrep tftp
udp        0      0 *:tftp                  *:*                                 
amitg@athena:~$ ls -ld /usr/local/tftpboot
drwxr-xr-x    3 nobody   nogroup      4096 Dec 14 15:29 /usr/local/tftpboot/
amitg@athena:~$ ls -l /usr/local/tftpboot/bzImage
-r--r--r--    1 root     root       638620 Dec 14 15:29 /usr/local/tftpboot/bzImage
amitg@athena:~$ tftp
tftp> get bzImage
Received 643442 bytes in 2.2 seconds
tftp> quit

Here, we've placed the TFTP directory in /usr/local/tftpboot. Note that tftpd drops root privileges and runs as the user nobody, so any files we place in that directory must be readable by the user nobody, and the directory itself must be owned by that user.

We place the bzImage of the kernel we built (retrieved from arch/i386/boot) in the TFTP directory, along with the compressed initrd (initrd.gz) we built earlier, since the bootloader will fetch that via TFTP as well. If using PXE, we also place the pxelinux.0 that is installed with SYSLINUX there. (On Debian, the syslinux package puts it in /usr/lib/syslinux.) This should be sufficient for a diskless machine to retrieve the pxelinux bootloader and attempt to boot from it.

In the case of Etherboot, we build an NBI from bzImage using the mknbi tool that comes with Etherboot, supplying the kernel boot parameters needed. Etherboot will perform a BOOTP request and load the NBI using TFTP, providing the kernel the parameters specified in the NBI.

root@athena:/usr/local/tftpboot# ls -l bzImage
-r--r--r--    1 root     root       855693 2003-12-17 06:10 bzImage
root@athena:/usr/local/tftpboot# mknbi-linux --rootdir=/dev/ram0 \
--rootmode=ro --ip=bootp bzImage > vmlinuz.nbi

Otherwise, if using PXE, we configure pxelinux to load the kernel.

pxelinux setup

The first thing pxelinux does is look for its configuration file. It looks in the directory pxelinux.cfg under the TFTP directory, taking the IP address assigned to the system and encoding it in hexadecimal. For instance, the IP address 10.0.0.0 would become 0A000000, and 192.168.0.253 would become C0A800FD. It looks for a file with that name first. If it cannot find that file, it drops the last character and looks again (C0A800F). It continues until it's dropped all of them, at which point it looks for a file named default. So the easiest thing is to create a file called /usr/local/tftpboot/pxelinux.cfg/default that contains configuration information for pxelinux. (Since we're using the 192.168.0.0/24 network block, we could also name it C0A800, matching the 192.168.0 portion, so it boots faster.)
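That search order is easy to reproduce. Here's a small helper function (ours, not part of pxelinux) that prints the config filenames pxelinux will try, in order, for a given IPv4 address:

```shell
# List pxelinux.cfg lookup names for an IPv4 address, most specific first.
pxe_cfg_names() {
    # Hex-encode the dotted quad, e.g. 192.168.0.253 -> C0A800FD.
    n=$(printf '%02X%02X%02X%02X' $(echo "$1" | tr '.' ' '))
    while [ -n "$n" ]; do
        echo "$n"
        n=${n%?}          # drop the last hex digit and try again
    done
    echo default          # final fallback
}

pxe_cfg_names 192.168.0.253
# C0A800FD, then C0A800F, C0A800, ... down to C, then default
```

Running it for 192.168.0.253 shows why a file named C0A800 catches every host on 192.168.0.0/24.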

This is what it can look like:

default linux
label linux
      kernel bzImage
      append vga=3847 root=/dev/ram0 ip=bootp init=/linuxrc initrd=initrd.gz ramdisk_size=20480

This tells pxelinux to load the file bzImage as the kernel image, and to provide parameters that tell the kernel to mount its root filesystem off an initrd and to run the script /linuxrc. By default, a RAM disk is 4 MB in size. Since our initrd is 20 MB uncompressed, we tell the kernel to size its RAM disk appropriately (ramdisk_size is given in kilobytes, so 20480 KB is 20 MB). The initrd parameter is also recognized by the bootloader, which will load initrd.gz via TFTP for the kernel.

Why BOOTP? Why not DHCP?

DHCP leases are of finite length; after a while, they expire and the IP address may be reassigned. The trouble is that usually, the DHCP client will invoke a number of scripts that do the work of changing the IP address. This will involve bringing the interface down and up again. Once the interface is down, the scripts become inaccessible, and the interface can never be brought up again. BOOTP, however, provides leases for an indefinite period, so they never expire. This suits us, since we don't want the interface to be brought down for a fresh lease. So we tell the kernel to obtain a BOOTP lease that DHCPd will not expire. (This is also why we got the installer not to set up the network interface automatically; it'll already have been set up by the time the startup process gets that far.)
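For reference, the kernel's IP autoconfiguration parameter (documented in the kernel source as Documentation/nfsroot.txt) takes these forms; the addresses in the static example are made up:

```
ip=<client-ip>:<server-ip>:<gw-ip>:<netmask>:<hostname>:<device>:<autoconf>

ip=bootp                                                      (what we use)
ip=dhcp                                             (lease-expiry caveat above)
ip=192.168.0.253:192.168.0.1::255.255.255.0:node1:eth0:none   (fully static)
```

The fully static form sidesteps lease expiry too, but it brings back exactly the per-machine configuration burden we set out to avoid.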

Lastly, we need to set up the NFS exports on the boot server.

NFSd setup

The kernel NFS server has far outrun the user-space NFS server in terms of feature set and performance, so it's better to use it over the user-space server. Usually, distributions' package managers will set up a usable NFS server configuration; if not, you can probably follow the various NFS HOWTO documents on The Linux Documentation Project (http://www.tldp.org/) to set up the NFS kernel server, portmap and mountd. Once this is done, we can define the export in /etc/exports:

/var/clients    192.168.0.0/255.255.255.0(rw,async,no_root_squash)

Start the NFS kernel server and check the system logs to see if it has been set up correctly.

At this point, all the pieces to run the diskless machines have been put together. Chances are something broke or was misconfigured along the way; however, most such problems are not hard to track down and fix, and once it's working, it'll work smoothly. Clients can auto-update themselves, as well as draw down data and send up finished work units. This setup does not include any means of managing updates so that the same file isn't redownloaded by every node, but a transparent caching proxy such as Squid, set up on the main Internet gateway, should solve most such issues. It also doesn't handle situations where Internet connectivity is sporadic ("nonetting"), nor does it allow for system management using SNMP or a similar system. We also haven't covered sending logs to a syslog server. These are left as exercises for the reader. ;)

Note that this technique is fairly generic; you can build diskless clusters of machines like this for any purpose. You could build a load-balanced, high-availability cluster with the help of openMosix (http://openmosix.sourceforge.net/) and/or the Linux Virtual Server project (http://www.linuxvirtualserver.org/). You could also build diskless display terminals cheaply. Indeed, there is a project called the Linux Terminal Server Project (http://www.ltsp.org/) that aims to make this sort of thing easy to do (which it does), and there are projects based on it, such as K12LTSP (http://www.k12ltsp.org/). (Unfortunately, LTSP is not conducive to running applications on the diskless stations themselves, so we couldn't simply use it here.)