soylent
I was writing a little function to crank out points along a 2D cubic Bezier spline, using SSE and SSE2 in a block of inline assembler (often much faster than intrinsics, at least in MSVC).
I wanted to make sure the last point ends up exactly at the last control point. Because I take constant steps along the curve, rounding error builds up and may be significant by the end. So after the inline assembler block I copied that point over in plain C; it was only one point, so there was no reason to code it in assembly.
I also made a version of this function that computes the tangent and normal to the curve. So I thought: why not determine the normal from the last two control points and copy that over too, so I can be sure it's exact? (That's a little silly, since the normal doesn't need to be nearly that exact for my drawing purposes, but I wasn't thinking that far.) It's only one point, I figured, so I can get away with using sqrtf() to calculate the norm; how much harm can it do?
Later I tested this assumption with the high-precision timer. It turns out I can crank out 512 points along with their normals and tangents (using the SSE instruction rsqrtps for the norm) in the same time it takes to call sqrtf() once.
4 microseconds for a single sqrtf() in a retail build? That's absolutely insane! At 2 GHz, that's 8000 cycles!