Monday, March 5, 2012

openSUSE on Panda

For the last couple of months, we have been working hard to get openSUSE ported over to ARMv7, now that ARM devices are finally out that are powerful enough to host a full-fledged distribution like openSUSE.

The really interesting part of this project is the way we build. We compile everything using QEMU on x86, and we have a "don't treat this differently from native" rule: if there is a bug in QEMU, it needs to be fixed. This differs from previous approaches, where people modified packages so they would work in their build environment rather than fixing the actual underlying issues. If a package breaks in the QEMU build but works in a native build, we have to fix QEMU.

As a result of this, we have contributed about 40 patches to QEMU recently - all in the Linux user emulation code.

So this weekend, I figured I would like to actually see what we're currently producing and started installing the current build of openSUSE with XFCE on a Pandaboard:

Unfortunately, the graphics situation there is not perfect. The Xorg omapfb driver thinks an HDMI display is connected in parallel to the DVI one that is actually attached, and decides it should run on that one instead. Ergo, I get a black screen.

I haven't found a reliable way of fixing that, other than forcing X to use the DVI output as its main display with these two lines in various Xorg startup scripts:

/usr/bin/xrandr --output hdmi --off
/usr/bin/xrandr --output dvi --primary --auto

With this, I at least get output. For a while. After a few minutes of using the system, I ran into an old upstream bug where the display driver thinks it's a good idea to just disable all output when it gets too many requests. In good old Linaro/Ubuntu fashion, this has been fixed in the Ubuntu tree, but not upstream. Oh well. Let's hope that changes soon. I already nagged them about it ;).

But despite this, things work really smoothly. All the packages that build successfully already work just fine, and I was even able to run bb - my favorite demo of all time!

We're still trying hard to get the full distribution compiled - without any build errors - to enable us to ship openSUSE 12.2 with an ARM Beta. Let's keep our fingers crossed this works out!

Friday, October 22, 2010

Making an unstable clock source (reasonably) stable

Why is the KVM clock unstable?

To understand why people call the KVM clock unstable, we first have to learn how it works. The first version of the KVM clock is basically a 1:1 copy of the Xen PV clock. The only difference is the initialization sequence.

For every guest CPU, the guest registers a page that is shared with the hypervisor. This page is accessible from both the hypervisor and the guest without traps, so it can be used for live updates of information. The layout of that page is as follows:

struct pvclock_vcpu_time_info {
        u32   version;
        u32   pad0;
        u64   tsc_timestamp;
        u64   system_time;
        u32   tsc_to_system_mul;
        s8    tsc_shift;
        u8    pad[3];
} __attribute__((__packed__)); /* 32 bytes */

On registration of this structure or whenever the guest vcpu switches to another host cpu, this structure gets updated by the host. It then contains a timestamp of the last update time, the TSC frequency of the current CPU and the TSC value at the time of the last update.

Why use the TSC? Simple answer: Reading the TSC doesn't trap, so the guest stays within its context. It's also very fast to read in general.

Using that information, we can compute system_time + scale(tsc_now - tsc_timestamp) and we know where we are now on the system time scale.

This is all nice and great in theory. The issue, however, is that we now rely 100% on the TSC - and the TSC is not always reliable. On older systems, it's not guaranteed to be synced between different CPUs, neither in value nor in speed. Older x86 systems even stopped the TSC in deeper power-saving modes.

So imagine we have 2 virtual CPUs, each running on its own host core. One CPU's TSC is slightly faster than the other's. For the sake of simplicity, let's say they have the same sync point.

After running for a while, the TSC values will have skewed. One vcpu is further ahead in time than the other.

If your user space process now gets scheduled from the vcpu that's ahead to the one that's behind and it reads the time on both, time appears to go backwards!

Recent CPUs (AMD Barcelona, Intel Nehalem) have nice mechanisms in place to keep the TSC in sync between different CPUs in the system. The problem is that these mechanisms only work across a single board. If you have a huge NUMA system with several boards interconnected, you end up getting the very same problems again.

How do we get it stable then?

Looking at the issue at hand, two main points become clear:

  1. Time is a system wide property. You can't leave it to every vcpu in the system to evaluate it. There has to be a single controlling instance in the system that ensures time is monotonically increasing.
  2. The longer the gap between the sync point and "now", the more inaccurate our time is.

So in order to get the clock at least remotely stable, we have to at first make sure we don't let the TSC delta become too big. What I did there was to add a check on every clock read that makes sure the TSC delta does not exceed 1/5th of a jiffie.

#define PVCLOCK_DELTA_MAX ((1000000000ULL / HZ) / 5)

static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow,
                                   u64 swing)
{
        u64 delta, r;

        delta = native_read_tsc() - shadow->tsc_timestamp;
        r = scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift);

        /* Don't use suspicious values, better rerequest them */
        if (r > swing) {
                /* fetch new time value, this makes the current value invalid */
                r = swing;
        }

        return r;
}

To ensure that time goes forward in a system-wide fashion, I did something very ugly. I added a locked section that keeps a system-wide variable of the last time value and ensures that we're always increasing. If now is older than any last read value, we set it to the last read value. Because we keep the maximum TSC skew window at 1/5th of a jiffie, that is the biggest jitter we can receive. And if your hardware is broken, that's what we have to live with.

At least now time definitely doesn't go backwards anymore and is reasonably accurate. What more can we wish for?

Thursday, October 14, 2010

Timekeeping is hard

I finally received my first Level-3 support call for KVM. Timekeeping in SLES10 is off.

For those of you who don't know about SLES10 - it's the most widely used version of SLES these days. It's based on 2.6.16 (read: ancient) and especially when it comes to timekeeping it's a lot of fun.

The thing is that SLES10 has two completely different frameworks for time in place on x86_32 and x86_64. The x86_32 one looks almost like what we have in recent kernels with structures describing clock sources:

static struct timer_opts kvm_clock = {
        .name            = "kvm-clock",
        .mark_offset     = mark_offset_kvm,
        .get_offset      = get_offset_kvm,
        .monotonic_clock = monotonic_clock_kvm,
        .delay           = delay_kvm,
};

This is very convenient, as we can just take the current time from the KVM clocksource and pass it on to the framework which takes care of the rest.

On x86_64 however, there is no framework. There is only a big pile of mess that has a bunch of ifs in there to decide what to do. And the worst of all this is that time is calculated as offset since the last tick.

static inline unsigned int do_gettimeoffset_kvm(void)
{
        unsigned long t;
        unsigned long last_kvm;

        /* returns time in ns */
        t = kvm_clock_read();

        last_kvm = vxtime.last_kvm;
        if (t < last_kvm)
                return 0;
        t -= last_kvm;

        return t;
}

void do_gettimeofday(struct timeval *tv)
{
        unsigned long seq, t;
        unsigned int sec, usec;

        do {
                seq = read_seqbegin(&xtime_lock);

                sec = xtime.tv_sec;
                usec = xtime.tv_nsec / 1000;

                t = (jiffies - wall_jiffies) * (1000000L / HZ) +
                        (ignore_lost_ticks ?
                         min((unsigned int)USEC_PER_SEC / HZ,
                             do_gettimeoffset()) :
                         do_gettimeoffset());
                usec += t;

        } while (read_seqretry(&xtime_lock, seq));

        tv->tv_sec = sec + usec / 1000000;
        tv->tv_usec = usec % 1000000;
}

Can you spot the bug in there?

do_gettimeoffset is supposed to return the offset in usecs, not nsecs - but do_gettimeoffset_kvm returns nanoseconds. So there we go with the first stupid bug I introduced. Imagine the following scenario:

tv_usec is 100
do_gettimeoffset_kvm returns 100000 ns == 100 usec
tv_usec gets adjusted to 100100

Now the next tick arrives, the following happens:

tv_usec gets bumped to 200
do_gettimeoffset_kvm now returns 0, because the last tick was just now
tv_usec gets adjusted to 200

Tadaa~. We have successfully warped back in time! Awesome, eh? That's not all there is to it, though. Expect more subtle clock fun posts to come.