To understand why people call the KVM clock unstable, we first have to learn how it works. The first version of the KVM clock is basically a 1:1 copy of the Xen PV clock. The only difference is the initialization sequence.
For every guest CPU, the guest registers a page that it shares with the hypervisor. This page is accessible from both the hypervisor and the guest without traps, so it can be used for live updates of information. The layout of that page is as follows:
struct pvclock_vcpu_time_info {
	u32 version;
	u32 pad0;
	u64 tsc_timestamp;
	u64 system_time;
	u32 tsc_to_system_mul;
	s8  tsc_shift;
	u8  pad[3];
} __attribute__((__packed__)); /* 32 bytes */
On registration of this structure, or whenever the guest vcpu switches to another host CPU, the host updates it. It then contains the system time at the last update (system_time), the TSC frequency of the current CPU, encoded as a multiplier and a shift (tsc_to_system_mul, tsc_shift), and the TSC value at the time of the last update (tsc_timestamp).
Why use the TSC? Simple answer: Reading the TSC doesn't trap, so the guest stays within its context. It's also very fast to read in general.
Using that information, the guest can compute system_time + scale(tsc_now - tsc_timestamp), which should give the current system time.
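As a minimal sketch of that computation, assuming the usual pvclock scaling (apply the shift first, then multiply by a 32.32 fixed-point factor), the read path could look roughly like this in plain userspace C. The helper pvclock_now_ns and the tsc_now parameter are hypothetical and only here so the arithmetic can be tried without rdtsc. The sketch also assumes GCC or Clang on a 64-bit target for unsigned __int128, and it leaves out the retry logic a real reader needs around the version field.

```c
#include <assert.h>
#include <stdint.h>

/* Same layout as the shared page described above, with stdint types. */
struct pvclock_vcpu_time_info {
	uint32_t version;
	uint32_t pad0;
	uint64_t tsc_timestamp;
	uint64_t system_time;
	uint32_t tsc_to_system_mul;
	int8_t   tsc_shift;
	uint8_t  pad[3];
} __attribute__((__packed__));

/*
 * Convert a TSC delta to nanoseconds: apply the binary shift, then
 * multiply by a 32.32 fixed-point factor and drop the fraction.
 */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	assert(shift > -64 && shift < 64);
	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;
	/* 128-bit intermediate (GCC/Clang extension) avoids overflow. */
	return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

/*
 * Hypothetical helper: tsc_now is passed in by the caller instead of
 * being read with rdtsc, so the function is easy to exercise.
 */
static uint64_t pvclock_now_ns(const struct pvclock_vcpu_time_info *ti,
			       uint64_t tsc_now)
{
	return ti->system_time +
	       scale_delta(tsc_now - ti->tsc_timestamp,
			   ti->tsc_to_system_mul, ti->tsc_shift);
}
```

With tsc_to_system_mul = 0x80000000 and tsc_shift = 1, one TSC tick maps to one nanosecond, which corresponds to a 1 GHz TSC.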
This is all nice and great in theory. The issue, however, is that we now rely 100% on the TSC, and the TSC is not always reliable. On older systems, it's not guaranteed to be in sync between different CPUs, neither in value nor in speed. Older x86 systems even stopped the TSC in deeper power-saving modes.
So imagine we have 2 virtual CPUs, each running on its own host core. One CPU's TSC is slightly faster than the other. For the sake of simplicity, let's say they have the same sync point.
If your user space process now gets scheduled from the vcpu that's ahead to the one that's behind and it reads the time on both, time appears to go backwards!
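To put rough numbers on this, here is a small sketch of a toy model. It is not taken from any real code: guest_time_ns and apparent_step_ns are hypothetical helpers, and the TSC speed error is expressed in parts per million relative to what the guest was told. Both vcpus share the same sync point at time zero.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model: time computed by the guest on a vcpu whose TSC runs
 * `ppm` parts per million faster than advertised, `real_ns` of real
 * time after the shared sync point.
 */
static int64_t guest_time_ns(int64_t real_ns, int64_t ppm)
{
	return real_ns + real_ns * ppm / 1000000;
}

/*
 * A process reads the clock on vcpu A at real_ns, migrates for
 * migrate_ns, then reads it on vcpu B. Returns how far the second
 * reading is ahead of the first; a negative result means time
 * appeared to go backwards.
 */
static int64_t apparent_step_ns(int64_t real_ns, int64_t migrate_ns,
				int64_t ppm_a, int64_t ppm_b)
{
	assert(migrate_ns >= 0);
	return guest_time_ns(real_ns + migrate_ns, ppm_b) -
	       guest_time_ns(real_ns, ppm_a);
}
```

In this model, one second after the sync point a vcpu that is 100 ppm fast is 100,000 ns ahead. If the migration to an accurate vcpu takes 10,000 ns, apparent_step_ns(1000000000, 10000, 100, 0) yields -90000: the process sees time jump back by 90 microseconds.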
Recent CPUs (AMD Barcelona, Intel Nehalem) have nice mechanisms in place to keep the TSC in sync between different CPUs in the system. The problem is that these mechanisms only work across a single board. If you have a huge NUMA system with several boards interconnected, you end up getting the very same problems again.
How do we get it stable then?
Looking at the issue at hand, two main points become clear:
- Time is a system wide property. You can't leave it to every vcpu in the system to evaluate it. There has to be a single controlling instance in the system that ensures time is monotonically increasing.
- The longer the gaps are between the sync point and "now", the more inaccurate our time is.
So in order to get the clock at least remotely stable, we first have to make sure we don't let the TSC delta become too big. What I did there was to add a check on every clock read that makes sure the TSC delta does not exceed 1/5th of a jiffy.
#define PVCLOCK_DELTA_MAX ((1000000000ULL / HZ) / 5)

static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow,
				   int swing)
{
	u64 delta, r;

	rdtscll(delta);
	delta -= shadow->tsc_timestamp;
	r = scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift);

	/* Don't use suspicious values, better rerequest them */
	if (r > swing) {
		/* fetch new time value, this makes the current value invalid */
		kvm_register_clock(NULL);
		r = swing;
	}

	return r;
}
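For a sense of scale, the limit can be evaluated for a few tick rates. The sketch below simply restates the PVCLOCK_DELTA_MAX formula as a function. The name pvclock_delta_max_ns is hypothetical, and the HZ values mentioned afterwards are only common examples, not something the patch prescribes.

```c
#include <assert.h>
#include <stdint.h>

/* Same arithmetic as PVCLOCK_DELTA_MAX, with HZ passed as a parameter. */
static uint64_t pvclock_delta_max_ns(unsigned int hz)
{
	assert(hz > 0);
	return (1000000000ULL / hz) / 5;
}
```

With HZ=100 this allows a delta of 2,000,000 ns, with HZ=250 it is 800,000 ns, and with HZ=1000 it is 200,000 ns.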
To ensure that time goes forward in a system-wide fashion, I did something very ugly. I added a locked section that keeps a system-wide variable holding the last time value handed out and ensures that we're always increasing. If the newly computed time is older than the last value read anywhere in the system, we return that last value instead. Because we keep the maximum TSC skew window at 1/5th of a jiffy, that is the largest jitter we can get. And if your hardware is broken, that's what we have to live with.
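The idea can be sketched in userspace C as follows. This is not the code from the patch: clamp_monotonic, last_lock and last_value are hypothetical names, and a C11 atomic_flag spinlock stands in for whatever lock the kernel code actually takes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Stand-in for the kernel lock; guards last_value. */
static atomic_flag last_lock = ATOMIC_FLAG_INIT;
/* Largest time value handed out so far, system wide. */
static uint64_t last_value;

/*
 * Return `now` if it is not older than the last value handed out,
 * otherwise return that last value, so callers never see a step back.
 */
static uint64_t clamp_monotonic(uint64_t now)
{
	while (atomic_flag_test_and_set_explicit(&last_lock,
						 memory_order_acquire))
		; /* spin until the lock is free */

	if (now < last_value)
		now = last_value;	/* stale reading: reuse the maximum */
	else
		last_value = now;	/* new maximum */

	/* Invariant: what we return equals the stored maximum. */
	assert(now == last_value);

	atomic_flag_clear_explicit(&last_lock, memory_order_release);
	return now;
}
```

The price of this approach is that every clock read on every vcpu serializes on one lock, and a vcpu that is behind reports the same value repeatedly until its own clock catches up.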
At least now time definitely doesn't go backwards anymore and is reasonably accurate. What more can we wish for?