Author: Alex 'AlexV' Voicu
Editor: Charles 'Lupine' Oliver
Date: December 24th, 2008
Micro-Stuttering: meet the grinch and fear it ... or give it a kick in the nether regions
In Dead Space, a rather nifty game, one of the plot devices used involves a certain recording that the protagonist watches on repeated occasions. In it a certain missus shares her thoughts, amongst which the following passage: “It's strange ... such a little thing”, and a more adequate opener for a micro-stuttering discussion may never be found. Because a little thing it is ... that can, under very specific circumstances, have significant effects.
A lot of virtual ink has been used treating this topic, with many interesting and novel interpretations showing up. However, since we have to justify our existence by generating some sort of competitive advantage, we'll provide a detailed and (hopefully) helpful description of this often invisible yet always “deadly” (going by some energetic reports) beast. The goal of this action is two-fold: the pages get filled, and hopefully some of the overly creative theories will be put on ice until a future time when something else apparently fitting them arises.
The AFR Link
Point number one, and perhaps the most important: if you're using Alternate Frame Rendering (AFR), the potential for micro-stuttering exists. Now, before setting your multi-GPU setups on fire and swearing undying loyalty to single instances of everything, do consider the fact that as long as you're breathing air the potential for becoming ill exists, yet that seldom happens. Potential for manifestation does not necessarily mean experiencing the effects.
Why is micro-stuttering intrinsically tied to AFR? Quite simple really ... let's consider how normal, sequential, single-GPU rendering works:
the GPU takes its time to render the frame - we'll call this time/frame from now on
the GPU presents the rendered frame
the cycle repeats itself
This is obviously simplified since we're ignoring page flipping, aka double buffering, and its impact. However, the main idea should be rather clear: rendering is work-present-work-present in a constant cycle, with the work interval being consistent between frames, since that work is done on the same GPU with the same resources. Of course, full consistency can only exist for limited periods because it relies on workloads/frame being equal, whereas these vary: rendering the skybox may take only 1 millisecond (ms) whereas rendering a complex material may take 33ms. Both of these cases can happen in somewhat rapid succession, for example when something crosses the sky/the camera is moved rapidly, and the time/frame will differ between separate frames. In practice, things would look something like this:
Single GPU Rendering
Now, let's introduce frame-level parallelism by adding a second GPU:
2-GPU AFR Rendering
Here's what happens in the above situation:
GPU1 starts rendering frame 1, which will take 15ms
GPU2 starts rendering frame 2, which will also require 15ms of render time
frame 1 is presented, this takes 1ms
frame 2 is also done, and is sent from GPU2 to GPU1 via the Crossfire Bridge Interconnect (CFBI)
and now a dilemma ensues: normally you'd expect 15ms between frames 1 and 2, but frame 2 is already finished, so it can be presented right away, immediately after frame 1 ... which is neat since it means a doubling of framerate, but there's a problem:
GPU1 only started working on frame 3, so frame 3 won't be done immediately after frame 2 is presented ... it'll be done in 10ms, and here an inconsistency in time/frame shows up - this inconsistency is the much maligned micro-stuttering
And the cycle continues
Any attentive reader will notice that adding coarse frame-level parallelism introduces the risk that micro-stuttering will exist, since you get (at least) one extra frame ready and available within the same time/frame interval but can't ensure that this rhythm is maintained. Why? Because the (n+1)th frame (n=number of GPUs) will require more time/frame due to the reasons described above.
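To make the pattern above concrete, here is a minimal Python sketch of a toy AFR model. It is a simplification built on assumptions, not a description of how any driver actually schedules work: every frame costs the same render_ms, presenting a frame (including any inter-GPU transfer) costs a flat present_ms, frames are shown strictly in order, and a GPU only starts its next frame once its previous one has been presented. The function and parameter names are made up for illustration, and the numbers are not meant to match the diagram exactly.

```python
def simulate_afr(n_gpus, render_ms, present_ms, n_frames):
    """Return the time (ms) at which each frame finishes being presented.

    Toy model (assumption): frame f is rendered by GPU f % n_gpus, and a GPU
    may only start its next frame after its previous one was presented.
    """
    gpu_free_at = [0.0] * n_gpus  # when each GPU may start its next frame
    last_shown = 0.0              # when the previous frame finished presenting
    shown_at = []
    for frame in range(n_frames):
        gpu = frame % n_gpus
        rendered = gpu_free_at[gpu] + render_ms
        # A frame cannot be presented before it is rendered,
        # nor before the previous frame has been presented.
        shown = max(rendered, last_shown) + present_ms
        shown_at.append(shown)
        gpu_free_at[gpu] = shown
        last_shown = shown
    return shown_at


def intervals(shown_at):
    """Time between consecutive presented frames (the time/frame values)."""
    return [b - a for a, b in zip(shown_at, shown_at[1:])]


single = intervals(simulate_afr(1, 15, 1, 6))
dual = intervals(simulate_afr(2, 15, 1, 6))
print("1 GPU :", single)  # [16.0, 16.0, 16.0, 16.0, 16.0]
print("2 GPUs:", dual)    # [1.0, 15.0, 1.0, 15.0, 1.0]
```

In this toy model the single GPU delivers a steady 16 ms per frame, while the dual-GPU setup alternates between 1 ms and 15 ms gaps: the average time/frame is halved, but the rhythm is uneven.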
Now, all of this sounds rather evil, and certain sources of information ensured that it equaled Cruella on the scale of infamy. However, few things are that simple, or that complex, depending on your take on the issue. Even if, in the above example, it would seem that the second frame is delivered instantly, that's not quite the case - there is a certain latency associated with swapping it via the CFBI, as well as some driver resistance to micro-stuttering. What this comes down to is that, for reasonable time/frame intervals, the differences end up in the realm of less than 5 ms, which is to say that you won't notice them. However, reasonable time/frame intervals is another way of saying good performance ... if you're rendering at 20 FPS average, with a high-end multi-GPU setup (think an X2, for example, or, more “creatively”, dual X2s), two things are safe to assume:
you'll be affected by micro-stuttering, since the deltas grow to >25 ms, which equates to quite perceivable stutter/inter-frame asynchronisms
your expectations for your setup are somewhat too optimistic: 20 FPS on an X2 would translate to roughly 12 FPS on a single 1GB 4870, assuming a reasonable 1.6-1.7x scaling factor - that's unplayable, and a situation in which going multi-GPU is not going to save the day ... also, considering the performance of the single 4870 itself, if you're getting only 12 FPS on it (and assuming no driver or application bug/other limitation is involved), nothing on the market at this point in time will bring you significantly better performance at those settings, so lowering them may be the more prudent course of action.
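The single-GPU estimate in the second point is plain division, and a few lines of Python show where it lands. The helper name is hypothetical, and the 1.6-1.7x scaling factor is the assumption quoted in the text, not a measured value.

```python
def estimated_single_gpu_fps(multi_gpu_fps, scaling_factor):
    """Back out single-GPU performance from a multi-GPU result.

    Assumes multi_gpu_fps = single_gpu_fps * scaling_factor.
    """
    return multi_gpu_fps / scaling_factor


low = estimated_single_gpu_fps(20, 1.7)   # better scaling -> weaker single GPU
high = estimated_single_gpu_fps(20, 1.6)  # worse scaling -> stronger single GPU
print(f"{low:.1f} to {high:.1f} FPS")     # 11.8 to 12.5 FPS
```

Either way the single card sits at about 12 FPS, well below anything playable.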
The above two points should also underline two further aspects: solutions like Hybrid SLI or Hybrid Crossfire are very poor as performance enhancers, since the building blocks themselves are weak GPUs that will not produce sufficient performance, and Multi-GPUs are a high-end proposition, meant to turn a good experience into a great one: good performance goes to great performance, AA levels get upped within the same performance envelope etc, not a universal panacea for performance issues.
Finally, before detailing how we tried to measure micro-stuttering across today's testing suite, another aspect needs to be underlined: by increasing GPU count, the risk of perceivable micro-stuttering is reduced. That may seem counterintuitive to some, since more GPUs should mean more evil but, when analyzing the facts, it's obvious why this happens: the more frames are queued and subsequently presented, the more time GPU1 gets for rendering the (n+1)th frame, while frames 2 through n are quite tightly packed between themselves. You'll see this rather intensely in two of today's tests ... but let's not get ahead of ourselves.
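A rough way to see the GPU-count argument is to write out one steady-state cycle of a toy AFR model. The sketch below rests on assumptions of our own, not on measured driver behaviour: each frame takes the same render_ms on every GPU, each presentation takes a flat present_ms, a GPU starts its next frame as soon as its previous one is presented, and rendering dominates presentation. Under those assumptions a cycle consists of n-1 tightly packed gaps of present_ms followed by one long gap of render_ms - (n-2)*present_ms. All names are hypothetical.

```python
def afr_cycle(n_gpus, render_ms, present_ms):
    """Frame-to-frame intervals (ms) over one steady-state cycle of the toy model."""
    if n_gpus < 2:
        raise ValueError("AFR needs at least two GPUs")
    if render_ms < (n_gpus - 1) * present_ms:
        raise ValueError("toy model assumes rendering dominates presentation")
    tight = [present_ms] * (n_gpus - 1)               # frames 2..n, packed together
    long_gap = render_ms - (n_gpus - 2) * present_ms  # wait for frame n+1
    return tight + [long_gap]


def share_of_uneven_steps(cycle):
    """Fraction of consecutive interval pairs that differ, wrapping around the cycle."""
    n = len(cycle)
    uneven = sum(1 for i in range(n) if cycle[i] != cycle[(i + 1) % n])
    return uneven / n


for gpus in (2, 3, 4):
    cycle = afr_cycle(gpus, render_ms=15, present_ms=1)
    print(gpus, cycle, round(share_of_uneven_steps(cycle), 2))
```

With 15 ms of render time and 1 ms per presentation, two GPUs give a cycle of [1, 15] in which every step changes pace, while four GPUs give [1, 1, 1, 13]: the long gap is a little shorter and only half of the frame-to-frame steps involve a change of pace.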
When it comes to presenting micro-stuttering related data, the currently employed solution involves the well known jigsaw graph. Whilst this is strong visually, it has certain limitations: trying to represent a large data set in this fashion (think thousands of frames, since that is the typical count within a longer testing run, at adequate framerates) may result in losing out details as the small differences between discrete frames get compressed, and the graph itself may be hard to manage.
Another solution would be to simply chart the progression of frametimes, and postulate that this should be a linear function of constant slope for perfectly distributed frames ... however, we just determined that in practice there are a number of non micro-stuttering related causes for unevenness of times/frame. Also, for large frame counts, the graph once again gets compressed and any supposed unevenness of the frametimes in the multi-GPU case is less than apparent.
What we chose to do is slightly different, and it tries to ... umm ... feed two birds with one grain (this is the PETA friendly interpretation, feel free to replace it with the traditional formulation):
calculate time/frame intervals as the difference between the timestamps of two subsequent frames
calculate the absolute difference between consecutive time/frame intervals (|1-2|,|2-3|, etc)
calculate the frequency repartition for the data set formed from the above computed absolute differences - which means, simplifying and further detailing, finding out how many samples are present within the following intervals [0,5),[5,10),[10,15),[15,20),[20,25) and [25,inf)
express the frequencies in relative form, as a percentage of total frames in the data set
make a nice, colorful graph
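The steps above, minus the colorful graph, can be sketched in a few lines of Python. This is an illustration under assumptions, not the script used for the article: the input is taken to be a list of per-frame timestamps in milliseconds (the kind of data a frametime log provides), the function name is hypothetical, and percentages are expressed relative to the number of computed differences rather than the raw frame count.

```python
from collections import Counter

# Interval labels as listed in the text; every bucket is 5 ms wide except the last.
LABELS = ["[0,5)", "[5,10)", "[10,15)", "[15,20)", "[20,25)", "[25,inf)"]


def stutter_histogram(timestamps_ms):
    """Percentage of frame-to-frame time/frame differences falling in each bucket."""
    if len(timestamps_ms) < 3:
        raise ValueError("need at least three timestamps")
    # Step 1: time/frame intervals from consecutive timestamps.
    intervals = [b - a for a, b in zip(timestamps_ms, timestamps_ms[1:])]
    # Step 2: absolute difference between consecutive intervals.
    deltas = [abs(b - a) for a, b in zip(intervals, intervals[1:])]
    # Step 3: count samples per bucket; anything >= 25 ms lands in the last one.
    counts = Counter(min(int(d // 5), len(LABELS) - 1) for d in deltas)
    # Step 4: express the counts as percentages.
    return {label: 100.0 * counts.get(i, 0) / len(deltas)
            for i, label in enumerate(LABELS)}


# Made-up timestamps alternating between short and long gaps.
sample = [0, 16, 17, 32, 33, 48, 49]
for label, pct in stutter_histogram(sample).items():
    print(f"{label:>9}: {pct:5.1f}%")
```

For the made-up sample, four of the five differences are 14 ms and one is 15 ms, so 80% of the samples fall in [10,15) and 20% in [15,20).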
This process allows us to analyze large runs, and show exactly how things panned out, without losing detail. The intervals are somewhat on the fine end, but a good rule of thumb would be that under 20 ms things are OK, between 20-25 ms it depends on the person and over 25 ms it'll be very perceptible. This is, by no means, written in stone tablets with fiery letters, but it is based on experimenting we've done, and how people reacted.
For all the measurements we'll be relying on FRAPS ... from this you should infer that whilst we're using a good tool, it's not surgically accurate. Sadly, due to time constraints and the complete disdain modern games showed for more intrusive profiling tools (in fact, none of the titles we worked with played nicely with more involved and detailed methods of logging and profiling), the alternative of delaying it indefinitely, until a way of solving the aforementioned “couple-issues” was found, was quite unfeasible. In practice, the accuracy exhibited was more than sufficient for this particular case.
One final consideration, in case someone was wondering: total frame-count differs between separate configurations/settings, like QuadCF vs single X2 or Edge-detect versus Box AA, with it being lower for the lower performing part. This is another thing that is rather logical: if you're only drawing at a hypothetical 30 FPS, you'll only draw 7200 frames across a 240 second (4 minute) run, whereas if you're running at 60 FPS the count doubles. So, whilst the run time is constant, and run contents are held as constant as possible, frame-counts will differ.
Test Configuration & Setup
Before heading to the test firing range, let's review the testing rig's specifications:
All games are patched to the latest available versions. We've decided to go straight for the jugular with regards to settings used, which means using maximal in-game quality settings and liberal use of anisotropic filtering (AF) and anti-aliasing (AA), the levels for each specified on a per-game basis.