Friday, October 22, 2010

Making an unstable clock source (reasonably) stable

Why is the KVM clock unstable?

To understand why people call the KVM clock unstable, we first have to learn how it works. The first version of the KVM clock is basically a 1:1 copy of the Xen PV clock. The only difference is the initialization sequence.

For every guest CPU, the guest registers a page that it shares with the hypervisor. Both sides can access this page without trapping, so it can be used for live updates of time information. The layout of that page is as follows:

struct pvclock_vcpu_time_info {
        u32   version;
        u32   pad0;
        u64   tsc_timestamp;
        u64   system_time;
        u32   tsc_to_system_mul;
        s8    tsc_shift;
        u8    pad[3];
} __attribute__((__packed__)); /* 32 bytes */

The host updates this structure when it is registered and whenever the guest vcpu switches to another host cpu. After an update it contains the system time at the last update, the TSC frequency of the current host CPU (encoded as the fixed-point multiplier tsc_to_system_mul plus the shift tsc_shift) and the TSC value at the time of the last update.
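
The registration side is tiny, by the way. Here's a minimal sketch modeled on the early upstream kvmclock guest code (the per_cpu plumbing, the MSR name and the enable bit are taken from there and may differ between versions): the guest simply tells the host where its structure lives.

static DEFINE_PER_CPU(struct pvclock_vcpu_time_info, hv_clock);

static int kvm_register_clock(char *txt)
{
        int cpu = smp_processor_id();
        /* guest physical address of this vcpu's time info;
         * bit 0 tells the host to start updating the page */
        u64 pa = __pa(&per_cpu(hv_clock, cpu)) | 1;

        return native_write_msr_safe(MSR_KVM_SYSTEM_TIME,
                                     (u32)pa, (u32)(pa >> 32));
}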

Why use the TSC? Simple answer: Reading the TSC doesn't trap, so the guest stays within its context. It's also very fast to read in general.

Using that information, we can compute system_time + scale(tsc_now - tsc_timestamp) and know where we are in the system time scale.
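
Put into code, the read path looks roughly like this. This is a sketch under a few assumptions: rdtsc() and rmb() stand in for the usual raw-TSC-read and read-barrier helpers, the wide multiply uses GCC's unsigned __int128, and scale_delta() mirrors the kernel's fixed-point scaling, where tsc_to_system_mul is a 32.32 fixed-point factor applied after shifting by tsc_shift. The version field is the guard against reading a half-updated page.

static u64 scale_delta(u64 delta, u32 mul_frac, s8 shift)
{
        if (shift < 0)
                delta >>= -shift;
        else
                delta <<= shift;

        /* multiply by the 32.32 fixed-point frequency factor and
         * keep the integer nanoseconds */
        return ((unsigned __int128)delta * mul_frac) >> 32;
}

static u64 pvclock_read(struct pvclock_vcpu_time_info *info)
{
        u32 version;
        u64 time;

        /* The host bumps version to an odd value before it updates
         * the page and to an even value afterwards, so retry while
         * an update is in flight or happened underneath us. */
        do {
                version = info->version;
                rmb();
                time = info->system_time +
                       scale_delta(rdtsc() - info->tsc_timestamp,
                                   info->tsc_to_system_mul,
                                   info->tsc_shift);
                rmb();
        } while ((version & 1) || version != info->version);

        return time;
}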

This is all nice and great in theory. The issue, however, is that we now rely 100% on the TSC, and the TSC is not always reliable. On older systems it's not guaranteed to be synced between different CPUs, neither in value nor in speed. Older x86 systems even stopped the TSC in deeper power saving modes.

So imagine we have 2 virtual CPUs, each running on its own host core. One core's TSC ticks slightly faster than the other's. For the sake of simplicity, let's say they start from the same sync point.

After running for a while, the TSC values will have skewed: one vcpu is further ahead in time than the other. With a mere 100 ppm frequency difference, for example, the two readings drift apart by 100 µs for every second the guest runs.

If your user space process now gets scheduled from the vcpu that's ahead to the one that's behind and it reads the time on both, time appears to go backwards!

Recent CPUs (AMD Barcelona, Intel Nehalem) have nice mechanisms in place to keep the TSC in sync between different CPUs in the system. The problem is that these mechanisms only work across a single board. If you have a huge NUMA system with several interconnected boards, you run into the very same problems again.


How do we get it stable then?

Looking at the issue at hand, two main points become clear:

  1. Time is a system-wide property. You can't leave it to every vcpu in the system to evaluate it. There has to be a single controlling instance in the system that ensures time is monotonically increasing.
  2. The longer the gap between the sync point and "now", the more inaccurate our time becomes.

So in order to get the clock at least remotely stable, we first have to make sure the TSC delta doesn't grow too big. What I did was add a check on every clock read that makes sure the TSC delta does not exceed 1/5th of a jiffy.

#define PVCLOCK_DELTA_MAX ((1000000000ULL / HZ) / 5)

static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow,
                                   int swing)
{
        u64 delta, r;

        /* TSC ticks elapsed since the last host update ... */
        rdtscll(delta);
        delta -= shadow->tsc_timestamp;

        /* ... converted to nanoseconds */
        r = scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift);

        /* Don't use suspicious values, better rerequest them */
        if (r > swing) {
                /* fetch new time value, this makes the current value invalid */
                kvm_register_clock(NULL);
                r = swing;
        }

        return r;
}

To ensure that time goes forward in a system-wide fashion, I did something very ugly: I added a locked section that keeps a system-wide variable holding the last returned time value and ensures that we're always increasing. If "now" is older than the last value anyone read, we return that last value instead. Because we cap the TSC skew window at 1/5th of a jiffy, that is also the biggest jitter we can receive. And if your hardware is broken, that's what we have to live with.
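
Stripped of its surroundings, the idea looks like this. A minimal sketch using Linux's atomic64 helpers (shown lock-free here; a spinlock around a plain variable achieves the same, and upstream's pvclock code contains a very similar global clamp):

static atomic64_t last_value = ATOMIC64_INIT(0);

static u64 clamp_monotonic(u64 now)
{
        u64 last;

        do {
                last = atomic64_read(&last_value);

                /* Someone already handed out a later time: return
                 * that one so time never moves backwards. */
                if (now < last)
                        return last;

                /* Publish our value; retry if another vcpu raced us
                 * between the read and the cmpxchg. */
        } while (atomic64_cmpxchg(&last_value, last, now) != last);

        return now;
}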

At least now time definitely doesn't go backwards anymore and is reasonably accurate. What more can we wish for?

Thursday, October 14, 2010

Timekeeping is hard

I finally received my first Level-3 support call for KVM. Timekeeping in SLES10 is off.

For those of you who don't know about SLES10 - it's the most widely used version of SLES these days. It's based on 2.6.16 (read: ancient), and especially when it comes to timekeeping it's a lot of fun.

The thing is that SLES10 has two completely different timekeeping frameworks for x86_32 and x86_64. The x86_32 one looks almost like what we have in recent kernels, with structures describing clock sources:

static struct timer_opts kvm_clock = {
        .name            = "kvm-clock",
        .mark_offset     = mark_offset_kvm,
        .get_offset      = get_offset_kvm,
        .monotonic_clock = monotonic_clock_kvm,
        .delay           = delay_kvm,
};

This is very convenient, as we can just take the current time from the KVM clocksource and pass it on to the framework, which takes care of the rest.
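
Two of those hooks end up being almost trivial. A hypothetical sketch (assuming kvm_clock_read() returns nanoseconds since guest boot, as in the x86_64 snippet below; last_tick_ns is made-up bookkeeping, not SLES10 source):

static unsigned long long last_tick_ns;

/* called on every timer tick, with xtime_lock held */
static void mark_offset_kvm(void)
{
        last_tick_ns = kvm_clock_read();
}

/* monotonic time for the framework, in nanoseconds */
static unsigned long long monotonic_clock_kvm(void)
{
        return kvm_clock_read();
}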

On x86_64, however, there is no framework. There is only a big pile of mess with a bunch of ifs to decide what to do. And the worst part of it all is that time is calculated as an offset since the last tick.

static inline unsigned int do_gettimeoffset_kvm(void)
{
        unsigned long t;
        unsigned long last_kvm;

        /* returns time in ns */
        t = kvm_clock_read();

        last_kvm = vxtime.last_kvm;
        if (t < last_kvm)
                return 0;
        t -= last_kvm;

        return t;
}

void do_gettimeofday(struct timeval *tv)
{
        unsigned long seq, t;
        unsigned int sec, usec;
        unsigned long time;

        do {
                seq = read_seqbegin(&xtime_lock);

                sec = xtime.tv_sec;
                usec = xtime.tv_nsec / 1000;

                t = (jiffies - wall_jiffies) * (1000000L / HZ) +
                    (ignore_lost_ticks ?
                     min((unsigned int)USEC_PER_SEC / HZ, do_gettimeoffset()) :
                     do_gettimeoffset());
                usec += t;

        } while (read_seqretry(&xtime_lock, seq));

        tv->tv_sec = sec + usec / 1000000;
        tv->tv_usec = usec % 1000000;
}

Can you spot the bug in there?

do_gettimeoffset is supposed to return the offset in usecs, not nsecs. So there we go with the first stupid bug I introduced. Imagine the following scenario:

tv_usec is 100
do_gettimeoffset_kvm returns 100000 ns == 100 usec
tv_usec gets adjusted to 100100

Now the next tick arrives, the following happens:

tv_usec gets bumped to 200
do_gettimeoffset_kvm now returns 0, because the last tick was just now
tv_usec gets adjusted to 200

Tadaa~. We have successfully warped back in time! Awesome, eh?
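
The fix is as simple as the bug: scale the offset down to microseconds before returning it. A minimal sketch of the corrected function (NSEC_PER_USEC is assumed to be available; otherwise a plain / 1000 does the job):

static inline unsigned int do_gettimeoffset_kvm(void)
{
        unsigned long t;
        unsigned long last_kvm;

        /* kvm_clock_read() returns time in ns */
        t = kvm_clock_read();

        last_kvm = vxtime.last_kvm;
        if (t < last_kvm)
                return 0;
        t -= last_kvm;

        /* callers expect usecs, not nsecs */
        return t / NSEC_PER_USEC;
}

That's not all there is to it though. Expect more subtle clock fun posts to come.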