Sqrtf(), Ow!

soylent

New member
I was writing a little function to crank out points along a cubic Bezier spline in 2 dimensions, using SSE and SSE2 in a block of inline assembler (often much faster than intrinsics, at least in MSVC).

I wanted to make sure that the last point ends up exactly at the last control point. Because I take constant steps along the curve, rounding error builds up and may be significant by the end. So I copied the last point over using regular C after the block of inline assembler; it was only one point, so there's no use coding that in assembly.
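For anyone following along, here is a minimal scalar sketch of that idea in plain C. It is not the SSE inline assembler version described above. The Vec2 type and the function names bezier_eval and bezier_points are made up for illustration. It steps t by a constant amount, which is where the drift comes from, and then overwrites the final point with the last control point.

```c
#include <stddef.h>

/* Hypothetical 2D point type, used only for this sketch. */
typedef struct { float x, y; } Vec2;

/* Evaluate a cubic Bezier curve at parameter t (Bernstein form). */
Vec2 bezier_eval(const Vec2 p[4], float t)
{
    float u  = 1.0f - t;
    float b0 = u * u * u;
    float b1 = 3.0f * u * u * t;
    float b2 = 3.0f * u * t * t;
    float b3 = t * t * t;
    Vec2 r;
    r.x = b0 * p[0].x + b1 * p[1].x + b2 * p[2].x + b3 * p[3].x;
    r.y = b0 * p[0].y + b1 * p[1].y + b2 * p[2].y + b3 * p[3].y;
    return r;
}

/* Fill out[0..n-1] with points along the curve. Assumes n >= 2. */
void bezier_points(const Vec2 p[4], Vec2 *out, size_t n)
{
    float dt = 1.0f / (float)(n - 1);
    float t  = 0.0f;
    size_t i;
    for (i = 0; i + 1 < n; ++i) {
        out[i] = bezier_eval(p, t);
        t += dt;            /* constant step: rounding error accumulates in t */
    }
    out[n - 1] = p[3];      /* copy the last control point so the end is exact */
}
```

Calling bezier_points(p, out, 512) gives 512 points whose last entry is bit-for-bit equal to p[3], no matter how far t has drifted by then.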

I also made a version of this function that figures out the tangent and normal to the curve. So I thought, why don't I determine the normal from the last two control points and copy that over so I can be sure it's exact? (Which is a little silly, since that doesn't need to be nearly as exact for my drawing purposes, but I wasn't thinking that far.) So I thought, hey, it's only one point, I can get away with using sqrtf() for calculating the norm; how much harm can it do?

I tested this assumption later using the high-precision timer. It turns out that I can crank out 512 points, with their normals and tangents (using the SSE instruction rsqrtps to find the norm), in the same amount of time it takes to call a single sqrtf(). :nuts:

4 microseconds for a single sqrtf in a retail build? That's absolutely insane! That's 8000 cycles!
 
sqrtf() is a C library function, so I doubt that it uses SSE out of the box (to remain compatible with older, non-SSE CPUs); it probably uses an expensive algorithm to calculate the square root.

When you're already working with SSE, why not use the SQRTSS instruction, for example?
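A rough sketch of what that suggestion could look like, using compiler intrinsics from xmmintrin.h rather than inline assembler so it stays portable across x86 compilers. The wrapper names sse_sqrtf and vec2_length are hypothetical. _mm_sqrt_ss is the intrinsic that maps to SQRTSS. Whether this beats the library sqrtf() on a given compiler is something to measure, not assume.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Scalar square root computed with the SQRTSS instruction. */
float sse_sqrtf(float x)
{
    /* Load x into the low lane, take the square root, read it back. */
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x)));
}

/* Length (norm) of a 2D vector, using the wrapper above. */
float vec2_length(float x, float y)
{
    return sse_sqrtf(x * x + y * y);
}
```

For example, vec2_length(3.0f, 4.0f) returns 5.0f, since SQRTSS produces a correctly rounded single-precision result.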
 
If the cost is only a few percent of performance, then wasting less time and avoiding bugs matter more to me than optimal performance.

You don't optimize everything, because that leads to lots of hard-to-debug, ugly code and massively increased development time. You don't optimize before profiling unless you absolutely know from prior experience that something will need optimization; intuition is usually wrong about what needs optimizing and what does not.

(BTW, you'd use RSQRTSS to avoid a divss instruction, which is rather slow compared to a mulss, AFAIK.)
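To make the RSQRTSS point concrete, here is a small sketch of normalizing a 2D vector with a reciprocal square root and two multiplies, with no divide. It uses the _mm_rsqrt_ss intrinsic instead of inline assembler. The names sse_rsqrt and vec2_normalize_fast are hypothetical. Keep in mind that RSQRTSS is only an approximation (roughly 12 bits of precision), which is fine for drawing but not if you need an exact result.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Approximate 1/sqrt(x) with the RSQRTSS instruction (about 12 bits). */
float sse_rsqrt(float x)
{
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}

/* Normalize (x, y) in place using only multiplies.
   Assumes the vector has non-zero length. */
void vec2_normalize_fast(float *x, float *y)
{
    float inv_len = sse_rsqrt(*x * *x + *y * *y);
    *x *= inv_len;   /* multiply by 1/length instead of dividing by length */
    *y *= inv_len;
}
```

Normalizing (3, 4) this way gives approximately (0.6, 0.8); the low bits differ from what an exact sqrtf() and divide would give.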
 