Thursday, October 14, 2010

Timekeeping is hard

I finally received my first Level-3 support call for KVM. Timekeeping in SLES10 is off.

For those of you who don't know SLES10: it's the most widely used version of SLES these days. It's based on kernel 2.6.16 (read: ancient), and when it comes to timekeeping in particular, it's a lot of fun.

The thing is that SLES10 has two completely different frameworks for time in place on x86_32 and x86_64. The x86_32 one looks almost like what we have in recent kernels with structures describing clock sources:

    static struct timer_opts kvm_clock = {
        .name = "kvm-clock",

        .mark_offset = mark_offset_kvm,
        .get_offset = get_offset_kvm,
        .monotonic_clock = monotonic_clock_kvm,
        .delay = delay_kvm,
    };

This is very convenient, as we can just take the current time from the KVM clocksource and pass it on to the framework which takes care of the rest.
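To make the "framework takes care of the rest" idea concrete, here is a minimal user-space sketch of that kind of ops-table dispatch. It is not the SLES10 code: `struct clock_ops`, the `fake_*` backend, `framework_tick` and `framework_now_usec` are hypothetical names invented for illustration, and the fake counter stands in for a real clock read.

```c
#include <stdint.h>

/* Simplified, hypothetical mirror of an ops table such as timer_opts. */
struct clock_ops {
    const char *name;
    void (*mark_offset)(void);          /* called on every timer tick */
    unsigned long (*get_offset)(void);  /* usec elapsed since the last mark */
};

/* Fake counter in ns; a real backend would read the KVM clock instead. */
static uint64_t fake_counter_ns;
static uint64_t last_mark_ns;

static void mark_offset_fake(void)
{
    last_mark_ns = fake_counter_ns;
}

static unsigned long get_offset_fake(void)
{
    /* The backend converts to the unit the framework expects (usec). */
    return (unsigned long)((fake_counter_ns - last_mark_ns) / 1000);
}

static struct clock_ops fake_clock = {
    .name        = "fake-clock",
    .mark_offset = mark_offset_fake,
    .get_offset  = get_offset_fake,
};

/* The "framework": generic code that only talks to the ops table. */
static struct clock_ops *cur_clock = &fake_clock;
static uint64_t wall_usec;

static void framework_tick(uint64_t tick_usec)
{
    wall_usec += tick_usec;
    cur_clock->mark_offset();
}

static uint64_t framework_now_usec(void)
{
    return wall_usec + cur_clock->get_offset();
}
```

The point of the design is that the backend only supplies a few hooks, while the arithmetic that combines tick time and offset lives in one generic place.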

On x86_64, however, there is no framework. There is only a big pile of mess with a bunch of ifs deciding what to do. Worst of all, time is calculated as an offset since the last tick.

    static inline unsigned int do_gettimeoffset_kvm(void)
    {
        unsigned long t;
        unsigned long last_kvm;

        /* returns time in ns */
        t = kvm_clock_read();

        last_kvm = vxtime.last_kvm;
        if (t < last_kvm)
            return 0;
        t -= last_kvm;

        return t;
    }

    void do_gettimeofday(struct timeval *tv)
    {
        unsigned long seq, t;
        unsigned int sec, usec;
        unsigned long time;

        do {
            seq = read_seqbegin(&xtime_lock);

            sec = xtime.tv_sec;
            usec = xtime.tv_nsec / 1000;

            t = (jiffies - wall_jiffies) * (1000000L / HZ) +
                (ignore_lost_ticks ?
                 min((unsigned int)USEC_PER_SEC / HZ, do_gettimeoffset()) :
                 do_gettimeoffset());
            usec += t;

        } while (read_seqretry(&xtime_lock, seq));

        tv->tv_sec = sec + usec / 1000000;
        tv->tv_usec = usec % 1000000;
    }

Can you spot the bug in there?

do_gettimeoffset is supposed to return the offset in usecs, not nsecs. So there we go with the first stupid bug I introduced. Imagine the following scenario:

tv_usec is 100
do_gettimeoffset_kvm returns 100000 ns == 100 usec
tv_usec gets adjusted to 100100

Now the next tick arrives, the following happens:

tv_usec gets bumped to 200
do_gettimeoffset_kvm now returns 0, because the last tick was just now
tv_usec gets adjusted to 200

Tadaa~. We have successfully warped back in time! Awesome, eh? That's not all there is to it though. Expect more subtle clock fun posts to come.
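The two-step scenario can be replayed in user space. The following is a minimal sketch, not the SLES10 code: `struct fake_clock`, `offset_buggy`, `offset_fixed`, `read_usec` and `tick` are hypothetical names made up for illustration, and locking, jiffies and lost-tick handling are left out. The fixed variant assumes that converting to microseconds before returning is the intended correction.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the kernel state; names are illustrative. */
struct fake_clock {
    uint64_t now_ns;        /* what kvm_clock_read() would return */
    uint64_t last_tick_ns;  /* stand-in for vxtime.last_kvm */
    uint64_t xtime_usec;    /* wall time as of the last tick, in usec */
};

/* Buggy variant: hands nanoseconds to a caller that expects microseconds. */
static uint64_t offset_buggy(const struct fake_clock *c)
{
    if (c->now_ns < c->last_tick_ns)
        return 0;
    return c->now_ns - c->last_tick_ns;
}

/* Corrected variant: convert to microseconds before returning. */
static uint64_t offset_fixed(const struct fake_clock *c)
{
    if (c->now_ns < c->last_tick_ns)
        return 0;
    return (c->now_ns - c->last_tick_ns) / 1000;
}

/* Simplified gettimeofday: time at the last tick plus offset since then. */
static uint64_t read_usec(const struct fake_clock *c,
                          uint64_t (*offset)(const struct fake_clock *))
{
    return c->xtime_usec + offset(c);
}

/* A tick folds the elapsed time into xtime and resets the offset base. */
static void tick(struct fake_clock *c)
{
    c->xtime_usec += (c->now_ns - c->last_tick_ns) / 1000;
    c->last_tick_ns = c->now_ns;
}
```

Starting from `xtime_usec = 100` with 100000 ns elapsed since the last tick, `read_usec` with `offset_buggy` yields 100100 before the tick and 200 after it, which is the backwards jump described above. With `offset_fixed`, both reads yield 200, so time never runs backwards.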


  1. Wow, this blog is like a journey inside the head of Alex Graf! If it continues along that path, it's going to be a favorite of mine.