[Valgrind-developers] valgrind and subtle floating point problem

Discussion:

Vince Weaver

2007-03-22 21:46:49 UTC

Hello,

I know that in general I shouldn't expect floating point to be _exact_, but
I've found a problem where valgrind is just slightly off and it
significantly affects the results.

I've made a valgrind plugin that calculates Basic Block Vectors for use
with the Simpoint analysis tool. It gets an instruction count using
methods similar to cachegrind's, and I've validated it against performance
counters on a P3 system (one special case has to be added: with the
"rep" prefix and string instructions, an actual machine counts up to 4096
reps as one instruction retired, not as 4096 separate ones).

In any case, I've run this on the spec2k benchmarks, and all of them are
close except for art. The art benchmark finishes in half the number of
instructions it should.

It turns out that art is using the "==" operator to compare two floating
point numbers. And valgrind returns values that have the LSB wrong
on 64-bit fmul and fadd instructions. This is enough to make the program
finish early.

Looking through the valgrind code, I am guessing maybe this is a problem
with the rounding mode, but I haven't been able to track down a good fix.

I've attached code after this that shows the problem.
On a native system I get

xr=0.426335 qr=0.505253
3fdb4914520a783a
3fe02b07a0efb19b
v=5.478862 4015ea5ace4c4585

Under valgrind with --tool=none I get

xr=0.426335 qr=0.505253
3fdb4914520a783a
3fe02b07a0efb19b
v=5.478862 4015ea5ace4c4586

Notice that only the very last bit of the result is off, which is why I
think it might be rounding related.

Any help with this problem would be appreciated. I am using
valgrind 3.2.3.

Thanks,

Vince

#include <stdio.h>
#include <string.h>

/* Print the raw 64-bit pattern of a double in hex. */
void print_hex(double value) {

    unsigned long long bits;

    /* memcpy avoids the strict-aliasing problem of casting the pointer */
    memcpy(&bits,&value,sizeof(bits));
    printf("%llx\n",bits);
}

int main(int argc, char **argv) {

    unsigned long long xr_l=0x3fdb4914520a783aULL;
    unsigned long long qr_l=0x3fe02b07a0efb19bULL;

    double xr,qr,v;

    /* Load the exact bit patterns that trigger the problem. */
    memcpy(&xr,&xr_l,sizeof(xr));
    memcpy(&qr,&qr_l,sizeof(qr));

    printf("xr=%lg qr=%lg\n",xr,qr);
    print_hex(xr); print_hex(qr);

    /* Optional: dump the x87 control word (rounding control is bits 10-11). */
    // unsigned short cw;
    // asm ("fstcw %0":"=m"(cw)::"memory");
    // printf("cw=%x, rounding=%d\n",cw,(cw>>10)&3);

    v=xr+10.0*qr;
    printf(" v=%lf ",v);
    print_hex(v);

    return 0;
}

Julian Seward

2007-03-23 14:32:38 UTC

Nicholas Nethercote

2007-03-23 23:00:18 UTC

Post by Julian Seward
Valgrind's handling of x86 FP is something of a kludge. In short
it regards all operations internally as 64-bit, to increase commonality
of Valgrind's internals with other platforms and reduce overall
engineering effort. Unfortunately this can give rise to the kinds
of problem you saw. Fixing it properly would take a significant amount
of time hacking around the internals of VEX.

To augment Julian's comment: for x86, Valgrind simulates a machine with
64-bit FP registers. The important question here: is that 64-bit
implementation correct? I.e. is 'art' relying on the extra 16 bits of
accuracy? If so, then 'art' is arguably (some would disagree) at fault,
because it's doing non-portable things. Given that it is in SPEC, that
would be surprising. Or, Valgrind's 64-bit FP implementation may have bugs.

Nick

Julian Seward

2007-03-24 02:07:30 UTC

Post by Nicholas Nethercote

To augment Julian's comment: for x86, Valgrind simulates a machine with
64-bit FP registers. The important question here: is that 64-bit
implementation correct?

Having looked at Vince's test case, I didn't see any place where
Valgrind incorrectly double-rounds the value. At least that's one
good thing, even though it doesn't help Vince.

I did notice that valgrind's register allocator was using 64-bit loads/
stores to spill FP registers, which isn't really right -- it
means a spill-reload event isn't "transparent" to the value. I
fixed it to do 80-bit spilling. This made no difference whatsoever
to the big FP suite I use for testing (GNU gsl 1.6), alas.

J

Bart Van Assche

2007-03-24 08:23:18 UTC

Post by Nicholas Nethercote
To augment Julian's comment: for x86, Valgrind simulates a machine with
64-bit FP registers. The important question here: is that 64-bit
implementation correct? I.e. is 'art' relying on the extra 16 bits of
accuracy? If so, then 'art' is arguably (some would disagree) at fault,
because it's doing non-portable things. Given that it is in SPEC, that
would be surprising. Or, Valgrind's 64-bit FP implementation may have bugs.

My opinion is that the art program is flawed: it is never a good idea to
compare floating point numbers with the "==" or "!=" operator. Floating
point numbers must be compared via fabs(... - ...) < ... or
fabs(... - ...) > ... This is something you can find in any decent FAQ
about numerical computing.

Bart.

Julian Seward

2007-03-24 12:15:53 UTC

Post by Bart Van Assche
My opinion is that the art program is flawed: it is never a good idea to
compare floating point numbers with the "==" or "!=" operator. [...]

I agree, comparing floating point numbers for equality is not good. On the other hand,
SPEC CPU is designed to be portable and I would be amazed if the SPEC
folks had not looked into these problems in depth. Perhaps they
fixed all problems they encountered in testing, but this one did not
happen at that time, and it is only triggered by Valgrind's extra
inaccuracy on x86. Who knows.

J

Vince Weaver

2007-03-24 01:52:00 UTC

I think the problem does lie with 'art', but unfortunately it's a bit late
to do anything about this (it's the spec2k version of art, I don't think
it is even included in spec2k6).

I've done the same experiment with art compiled with
-msse2 and found that the problem goes away; since the sse2
floating point code only uses 64-bit math this seems to indicate valgrind
is behaving properly with regards to 64-bit math.

Unfortunately for me the machine I am using to do the performance
counter measurements is an older pentium3 that doesn't have sse2 support,
so I am going to have to find another way to work around this for now.

Thanks for the help,

Vince

Julian Seward

2007-03-24 02:03:00 UTC

Post by Vince Weaver
Unfortunately for me the machine I am using to do the performance
counter measurements is an older pentium3 that doesn't have sse2 support,
so I am going to have to find another way to work around this for now.

If this comparison is not in an inner loop, can you do some nasty
kludge like masking off the lowest couple of mantissa bits before
doing the comparison?

J

Vince Weaver

2007-03-24 02:20:47 UTC

Post by Julian Seward
If this comparison is not in an inner loop, can you do some nasty
kludge like masking off the lowest couple of mantissa bits before
doing the comparison?

I was thinking about trying that as a last resort.

I came across an interesting paper:
http://www.wrcad.com/linux_numerics.txt

Maybe I can try forcing art to use FPU_DOUBLE mode instead of FPU_EXTENDED
in the manner described in the paper...

Vince

Vince Weaver

2007-03-24 18:27:30 UTC

Post by Julian Seward

Post by Bart Van Assche
My opinion is that the art program is flawed: it is never a good idea to
compare floating point numbers with the "==" or "!=" operator. [...]

While you would think SPEC would have done a good job picking portable
benchmarks, in actual fact they are a mess. What passed for acceptable
code ~1998 when the spec2000 codes were frozen just wouldn't fly today.
They've had to release a number of service packs along the way because
probably at least half the original spec2k code release won't compile
with a gcc more recent than 2.8 or so.

Even now, with the most recent spec2k release, you can't compile the
'vortex' benchmark with optimizations turned on with gcc 4.0 or it will
crash on x86-linux.

I do wonder if any of the compiler vendors noticed this problem with art..
you could in theory make your compiler look better on the FP score by
having art finish in half the time if you made sure it ran in 64-bit
rather than 80-bit mode on x86...

For my purposes I hacked the art code and added a few lines of code
at the beginning to force the x87 fpu state to be FPU_DOUBLE (instead of
FPU_EXTENDED) and that is enough to make the valgrind runs match the
actual perf counter runs.

Nicholas Nethercote

2007-03-24 23:45:52 UTC

Post by Vince Weaver
I do wonder if any of the compiler vendors noticed this problem with art..
you could in theory make your compiler look better on the FP score by
having art finish in half the time if you made sure it ran in 64-bit
rather than 80-bit mode on x86...

Surely the SPEC output checking would catch this, if you are doing proper,
reportable runs?

Nick

Vince Weaver

2007-03-25 20:04:25 UTC

Post by Nicholas Nethercote

Surely the SPEC output checking would catch this, if you are doing proper,
reportable runs?

This is rapidly getting more and more off-topic, for which I apologize...

The output from the 'art' benchmark is identical in all cases...
the difference is that it converges twice as fast when using 64-bit
math rather than 80-bit math.

I only noticed this problem because the experiments I am doing depend on
the instructions_retired metric being roughly the same across all the
tools I am testing.

Vince

Nicholas Nethercote

2007-03-26 01:57:13 UTC

Post by Vince Weaver
The output from the 'art' benchmark is identical in all cases...
the difference is that it converges twice as fast when using 64-bit
math rather than 80-bit math.

That's awful. Is SPEC2006 any better than SPEC2000?

Nick