January 23, 2004 Edition

By Jorge Castro (mailto:jorge@whiprush.org)



Just a few weeks ago the Linux.Ars (http://arstechnica.com/2004/linux.ars-20040331-1.html) crew started delving (http://arstechnica.com/etc/linux/2003/linux.ars-12242003.html) into the world of the new Linux Kernel, version 2.6. Since that time they have received a number of questions about other parts of the kernel, particularly the work done with preemption. Rather than attempt to answer these questions themselves, they decided to ask one of the most prominent Linux kernel hackers of today to answer a few questions about the kernel, and boy, did he ever answer them.

Recently hired by Ximian (http://www.ximian.com/) (now a subsidiary of Novell (http://www.novell.com/)) to further improve the Linux kernel, Robert Love has another, more interesting task ahead of him: integration of all this low-level work into the Linux desktop, specifically the GNOME Desktop and Developer Platform (http://www.gnome.org/). The work is already coming to fruition, as developer releases of "Project Utopia" (as it has been dubbed) are now available.

So what exactly does Project Utopia bring to the Linux desktop? As an example of its benefits, all sorts of devices like cameras, MP3 players, and memory sticks will not only work out of the box when plugged in, but will be fully integrated into the desktop to provide the user with a transparent experience. No manual mounting, no driver disc from a third party, and no arcane knowledge of Linux is required. So sit back and let's see how Robert Love plans to make the Linux desktop "Just Work".


Why "Linux on the desktop" is no longer a joke

Those of you who have tried the new 2.6 Linux kernels will undoubtedly have noticed how much more responsive the system feels under interactive use than earlier kernels. Others who have tried the kernel preemption patches (ftp://ftp.kernel.org/pub/linux/kernel/people/rml/preempt-kernel/v2.4/) or Con Kolivas' patches for interactive use (http://www.plumlocosoft.com/kernel/) will appreciate the difference as well. A large part of the credit for this work goes to Robert M. Love.

The Linux desktop continues to evolve at a rapid pace. Now that kernel 2.6 has been released with its many improvements in latency, other integration work has begun. Some of the most interesting work toward these goals (improved latency, and integration of the kernel with the rest of the desktop) is Project Utopia (http://primates.ximian.com/~rml/project_utopia/), a project to improve the way the Linux desktop deals with device management and event notification from the kernel. Its major components are HAL (http://hal.freedesktop.org/), an abstraction layer for hardware that provides a unified model of the devices in the system to interested applications, along with notification of any hardware changes; D-BUS (http://dbus.freedesktop.org/), a means for applications to communicate with one another, used by HAL and desktop environments, for example, to talk to each other about things like device discovery and changes; udev (http://www.kernel.org/pub/linux/utils/kernel/hotplug/udev-FAQ) by Greg Kroah-Hartman (http://www.kroah.com/), which maintains the special files in /dev based on the devices and drivers present in the system; and the GNOME volume manager (http://primates.ximian.com/~rml/blog/archives/000315.html), which automatically mounts hot-plugged storage volumes.

We asked Mr. Love some questions, and he answered them comprehensively. If you find the content going over your head, try hitting the footnotes (linked in superscript).

Ars Technica: [1] (http://arstechnica.com/#rml1) People have been voicing some concerns on the linux-kernel mailing list (http://vger.kernel.org/vger-lists.html#linux-kernel) about how the new thread scheduler tends to make the effects of setting static priorities (e.g. using nice or renice) less predictable (so, for instance, a thread with a nice level of -10 may not get all that much more attention than a thread with a nice level of 0 if they demand CPU time in certain patterns, even if these usage patterns change frequently). Is this true, and if so, are there any workarounds that would make things more predictable? Is Nick Piggin's work (http://www.kerneltrap.org/~npiggin/) on this going to figure in future kernels, and if so, how?

Robert Love: The 2.6 process scheduler intentionally and dynamically modifies the priority of processes to better optimize the system for I/O and interactive use. This is done via an "interactivity estimator" that gives a small priority bonus to I/O-bound processes and a small priority punishment to CPU-bound processes. Processes can receive as much as five nice levels in either direction from their given static priority. Processes at some theoretical equilibrium of I/O-vs-CPU usage receive zero points and thus remain at their given static priority.

The intention behind this is twofold. First, optimizing for I/O is usually a good thing to do. I/O-bound processes, by definition, spend much of their time sleeping and waiting on I/O (whether it be disk I/O, keyboard activity, sound buffers, etc.). Giving preference to an I/O-bound process allows it to quickly run, dispatch more I/O, and continue to wait. This enhances the overall performance of the system.

Second, favoring I/O-bound processes implies favoring interactive processes (such as your text editor, mailer, web browser, and so on) since interactive processes are I/O-bound (usually blocked on keyboard or mouse input). Favoring interactive processes improves the smoothness and "feel" of the desktop, improving the user experience.

Other operating systems accomplish these goals in other ways. For example, last I checked, the default timeslice in Windows was ridiculously small — like 10ms. This favors I/O-bound processes. Both Windows and Solaris also give a priority bonus to the process that has window focus in the GUI. The kernel developers, myself included, feel these approaches have shortcomings and that our approach is more robust.

I have not heard complaints that the interactivity estimator is "unpredictable" in the sense that it does all sorts of wild things. Since it can only reward or punish a task by five nice levels, the effect should never be too dramatic a departure from what the user intended. The interactivity estimator can incorrectly estimate a task's interactivity, however, and that could result in diminished system performance. A lot of work went into tuning the estimator late in the 2.6 prereleases, and these issues are hopefully resolved.

As far as Nick Piggin's work goes, I watch it closely. He is a great hacker and definitely has some good ideas in his policy changes. If it tends to work out for the better, then I certainly think his work may end up in 2.6 proper.

Ars: What's going on with explicit hyperthreading [2] (http://arstechnica.com/#rml2) support for Pentium 4? As we understand it, the 2.6 scheduler treats logical processor pairs as independent entities with independent caches and independent functional units. There's a batch scheduler (http://kerneltrap.org/node/view/1877) in the works that promises to schedule things with an awareness that resources are shared, as well as scheduling similar-priority threads together. What's planned, ultimately, for this work?

Love: The batch scheduler is altogether unrelated (it implements the equivalent of a SCHED_IDLE class with batch-scheduling-like behavior).

Optimizing for HT — or, actually, SMT in general — is a different problem, although not a huge one. The issue is that the scheduler treats each logical processor as a separate processor and thus gives each logical processor its own runqueue. This is what one would expect, actually.

In a multiprocessor system, however, with multiple physical processors each with SMT, load balancing among the processors is not perfect with this layout. For example, consider the load-balancing situation in a dual P4 (which has four virtual processors, total) where three virtual processors are free and the other one has two processes. The goal of the load balancer is to "balance" the load, evening out the distribution of processes. Ideally, we would want one of the processes moved to a virtual processor on a different physical processor. Moving it to the free virtual processor on the same physical package provides only a small performance increase, since the HT units share so many chip resources.

The easiest way to solve this is to just stick some logic in the load balancer to understand SMT and try to load balance across different physical processors more readily than other local virtual processors. But this is a hack.

The better solution is to introduce the concept of shared runqueues, where the SMT units in a given physical package can all share a runqueue. This means that the load balancer automatically only balances between physical processors and that we can get a better understanding of the cost of balancing, since cache is shared among the local virtual processors.

I think that this work, too, will eventually find its way into the kernel.


1 The Linux 2.5/2.6 series kernels introduced huge changes to thread scheduling. We went over the 2.6 scheduler in December 2003 (http://www.arstechnica.com/etc/linux/2003/linux.ars-12242003.html). A number of enhancements in the scheduler were meant to dole out processor time preferentially to tasks deemed interactive (I/O-bound) over those deemed CPU hogs (processor-bound or memory bandwidth–bound). This makes programs like media players and UI components, whose behavior is largely I/O-bound (waiting for user input), receive CPU time much sooner than, say, a Folding@Home (http://fah.stanford.edu/) task, which spends most of its time doing calculations. This works really well, but a few people (e.g. this individual (http://lkml.org/lkml/boring/2004/1/4/66), among a number of others) found that the prioritizing behavior wasn't to their tastes. This prompted Nick Piggin, author of the anticipatory I/O scheduler that is now the default in Linux 2.6, to make some changes to scheduler policy (http://kerneltrap.org/node/view/754). These changes seem to be appreciated by the critics.

2 While enumerating and enabling the logical processors on a Pentium 4 Hyperthreading processor is supported in Linux 2.6, the current scheduler in the stock kernel doesn't do anything special to distinguish the shared resources on logical processors from resources on different physical processors. The result is greater pressure on shared resources such as caches and execution units than is strictly necessary. A number of people are working on a Hyperthreading-aware scheduler (notably Ingo Molnar and Nick Piggin). Con Kolivas produced a patch that added Hyperthreading support to the aforementioned batch scheduler to deal with the fact that the Pentium 4 and Xeon cannot honor priorities in Hyperthreading.


CFQ and Project Utopia

Ars: How will the new CFQ [3] (http://lwn.net/Articles/22429/) I/O scheduler work? In what cases does it improve upon the anticipatory and deadline I/O schedulers currently in 2.6? Is it intended to go into a future 2.6 kernel?

Love: The CFQ (complete fair queuing) I/O scheduler is something that I am very interested in. It is going to be part of the desktop kernel package I am putting together at Ximian. I think it is very well suited to desktop systems.

The idea behind it is to round-robin I/O requests from each process, evenly distributing the disk's bandwidth among processes on the system, thus being "fair" on a per-process scale. This ensures that no one process can hog the disk's bandwidth. Thus disk latency is greatly improved, at a potential cost to overall throughput. The CFQ I/O scheduler is best suited when disk response is the primary concern, such as with desktop and multimedia workloads.

I think the CFQ I/O scheduler will definitely make it into 2.6 very soon. There is no reason not to, as I/O schedulers are now pluggable components in the 2.6 kernel.

I/O schedulers are a complicated subject. I wrote a primer to I/O schedulers for this month's Linux Journal (http://www.linuxjournal.com/modules.php?op=modload&name=NS-lj-issues/issue118&file=index). My book, Linux Kernel Development (http://www.amazon.com/exec/obidos/tg/detail/-/0672325128/qid=1074703009/sr=1-1/ref=sr_1_1/002-5805941-7672855?v=glance&s=books), also discusses this topic.

Ars: [4] (http://arstechnica.com/#rml4) A lot of work is going into Project Utopia to provide a user-space framework for dynamic device management (things like device detection, automatic driver loading, even things like filesystem mounts and notification via D-BUS or similar to subscribed apps so they can do something about it), and you're involved in this deeply. How much is this work tied into the GNOME desktop, or, for that matter, any desktop environment? For instance, if we wanted to have a network daemon running on a headless machine (no desktop environment installed) deal with something like additional storage attached via FireWire, could the daemon use this framework to deal with detection, driver loading, mounting, etc. easily, even if there was no trace of a Freedesktop.org (http://www.freedesktop.org/)-compliant desktop environment on the computer? Would distributions be able to pick up these pieces and integrate them in their base system and their initscripts in place of (or along with) things like kudzu (http://rhlinux.redhat.com/kudzu/), mdetect (http://packages.debian.org/unstable/utils/mdetect) and hotplug (http://linux-hotplug.sourceforge.net/)?

Love: Project Utopia's goal is to fully integrate the Linux system, from the kernel on up the stack, through the GNOME desktop, its applications, and finally to the user. Therefore, Project Utopia is very GNOME-specific.

But Project Utopia is composed of many small components, and each component is intentionally being developed separately and abstractly. Thus, a GNOME desktop (or any desktop) is not required for much of the functionality and another desktop environment could (and should!) provide the missing pieces.

The system is architected in such a way that the only components actually at the desktop layer are policy mechanisms, such as gnome-volume-manager, and glue layers/libraries, such as any forthcoming notification system.

Components such as udev and hotplug are obviously entirely agnostic to the rest of the system, as they are (or will be) required pieces of nearly any Linux system. Other components, such as D-BUS and HAL, can likewise fit into any system. I very much hope that both of those projects find wide adoption.

In response to your example, I think that a server with no desktop environment would still benefit from this work. In fact, it would just use Project Utopia as far up the stack as needed, definitely making use of udev, D-BUS, and HAL.

Ars: Regarding system status changes — you've indicated that this is all done in userspace, without polling. This suggests that a filesystem change notification framework such as dnotify [5] (http://arstechnica.com/#rml5) is being used as the basis. People are considering replacing dnotify, though, since it's a bit clunky (not to mention inefficient) to monitor entire directories when the app is only interested in one or a few files. What is intended as a replacement, if anything, and will the projects you're working on be adapted to use it?

Love: Indeed, dnotify is one of the mechanisms used to avoid polling. Others include good ol' blocking on read and the forthcoming kernel events layer that I am working on.

I also agree that — let's be honest here — dnotify sucks. Calling it clunky is nice. It is cumbersome and awkward to use, although at the end of the day it does get the job done.

I think a better person to ask about a replacement for dnotify is someone who uses the API extensively. I know firsthand that the Nautilus maintainers could readily describe a more ideal interface. Unfortunately there is no replacement under development, although I am sure people would be happy to use a sane replacement.


3 Jens Axboe's "Complete Fair Queueing" I/O scheduler is a recent development, and one that is much-anticipated for desktop systems. An I/O scheduler is essentially a policy for the order in which the kernel should dispatch its requests to disk devices and the like, in order to maximize aspects of I/O performance. The primary I/O schedulers in the 2.6 Linux kernel are the anticipatory scheduler and the deadline scheduler.

4 Project Utopia is slated to be a big part of future GNOME desktops, giving the GUI the ability to do things like automatically mount filesystems on iPods, USB keychain drives and the like, automatically start a media player to play a DVD inserted in the drive, and so on. We were wondering if the framework could be used for things other than the desktop.

5 dnotify is a mechanism to let applications monitor directories on the filesystem for changes, including file insertion, deletion, rename, update, etc. Readers who are more interested can find information in the file Documentation/dnotify.txt in their copy of the Linux kernel source tree.



Ars: The 2.6 kernel was made fully preemptible [6] (http://arstechnica.com/#rml6) thanks largely to your efforts (http://www.tech9.net/rml/linux/). However, there still remain some bits of preemption-unfriendly code. Do you think that preemption is progressing well, and in general are you happy with the state of the 2.6 scheduler? Specifically, are you satisfied with the interactivity of the new scheduler and the preemptible kernel? What changes do you think need to be made, if any?

Love: Yes, I am very happy with the state of scheduling latency in the 2.6 kernel.

A lot of tuning is still needed, but also a lot of bad areas have been fixed and the kernel is overall much more fair than before. Some specific areas of tuning are in filesystem code and RCU [7] (http://arstechnica.com/#rml7). People are working on the RCU issues now.

Ars: The push for device management in user space is now in full swing it would seem; udev, D-BUS and HAL are all progressing. What advantages does pushing the device management into user space actually bring? Do you see a complete transition from kernel-based solutions in the future? What implications for Linux on the desktop does user space device management carry, and specifically what implications for GNOME?

Love: A user-space device naming solution, in particular udev, offers six main benefits.

First, and foremost, it provides a mechanism for persistent device naming. Using the logic in udev and a simple configuration file, a given disk partition can always be "hda5" and your favorite joystick can always be named "snake_eyes," regardless of where, when, or in what order the device was connected to the system.
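As a hypothetical illustration of what such a configuration file might contain (the bus names, attribute keys, and serial numbers here are invented, and the exact syntax varies between udev versions):

```
# hypothetical udev rule file entries
# match a USB joystick by its serial number; always name it "snake_eyes"
BUS="usb", SYSFS{serial}="0123456789AB", NAME="snake_eyes"

# match a disk by its SCSI serial; always name it "backup_disk"
BUS="scsi", SYSFS{serial}="SABCDEF12345", NAME="backup_disk"
```

Because the match is on attributes exported through sysfs rather than on probe order, the name stays stable no matter which port the device lands on.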

Second, and also important, we no longer have to worry about major/minor numbers. They no longer matter one bit, whatsoever. They simply become an arbitrary cookie that user-space uses to communicate with the kernel. We can, and will, randomly generate them.

Third, we no longer have to manually maintain /dev with hacks like the MAKEDEV script, which have to be updated whenever a new type of device is added to the kernel. And we get a /dev tree that only contains the valid devices on the system, and not a big pile of stink that is the current /dev.

Fourth, udev can do neat things via its configuration script and the fact that it emits a D-BUS signal. This means HAL can listen to it and know all about device node additions and removals.

Fifth, udev is a small and simple binary, in user-space. Unlike kernel memory, user-space memory is abundant, swappable, and protected. Also, since udev is in user-space, policy is entirely up to the user — if there is not a good reason for something to be in the kernel, then it should not be.

Finally, this is all done elegantly and without any hacks, simply by leveraging information and mechanism that already exists, today, in the form of hotplug and sysfs [8] (http://arstechnica.com/#rml8). It is just the Right Thing to do.

Ars: The addition of gnome-volume-manager will certainly make GNOME more media friendly, and should prove that udev is what it claims to be. Have you been satisfied with D-BUS and udev while working on this addition to GNOME? You mentioned in your blog (http://primates.ximian.com/~rml/blog/) that parts of gnome-volume-manager had to use the kernel events interface. Do you think that in the near future gnome-volume-manager and other applications that monitor hardware events will be able to move completely away from the kernel events layer [9] (http://arstechnica.com/#rml9), or is it here for at least a while more?

Love: I am very satisfied with all levels of the Project Utopia stack, including udev and D-BUS. I am most satisfied, however, with HAL. HAL is definitely the shining centerpiece of Project Utopia. HAL made gnome-volume-manager a simple policy engine, implementable as a finite-state machine, which simply listens for certain HAL events and reacts with user-configured policy. Any information that gnome-volume-manager needs it gets from HAL. In fact, it keeps no internal state whatsoever, aside from its configuration settings. HAL seriously rocks.

I do not think that needing the kernel events layer is a bad thing. It is a good thing, that is why I am writing it. It allows us to asynchronously send event-related messages to user-space, via D-BUS signals. I hope more things move to it, not away from it!

Ars: There is obviously a great deal of work going into Linux at the moment to get it more usable on the desktop; however, the overwhelming majority of Linux users are either in the server or embedded market. Do you think that any of the changes that have been occurring to both the Linux kernel and the user-space device management could benefit the server market or are they specifically targeted for the desktop user?

Love: No, this stuff is very important for both of these markets, too.

Things like udev are needed both in the embedded space and the server space. Removal of major/minor limitations, persistent device naming, and so on are crucial to many facets of Linux. HAL, too, will greatly simplify and improve both end-user management of Linux and application development under Linux.



6 In kernels 2.4 and earlier, if a task performed a system call that required the kernel to do something time-consuming, the kernel would keep processing the system call on behalf of the task, on the processor where it was running, until it was done with that portion of processing. Mr. Love's work enables even kernel system calls to be preempted in favor of other tasks and then continued later. Tasks therefore spend less time waiting to run while the kernel is busy, and the system feels more responsive. Also, a lot of work was done to make long-running system calls yield the processor to other tasks on their own.

7 See this Linux Journal article for a nice overview of the whats, hows and whys of read-copy updates (RCUs).

8 For those who are not familiar with sysfs, a lot of system device and bus information was moved from /proc into a new filesystem called sysfs, typically mounted at /sys, in the Linux 2.5/2.6 kernels.

9 Mr. Love is working on a new kernel interface that programs can use to be notified of system events, particularly device change/enumeration events. Parts of Project Utopia are based on this.