RV970 - true next generation

SirkoZ

So - with the RV870 launch out of the way and the speculation around it settled - let's start guessing what the next big thing from ATI will be - the RV970!

I say they will keep the number of ALUs pretty much unchanged from RV870 but decouple them further - to get a truly universal design that is 99.9% efficient with any kind of shader/program.
So instead of one big and four little units in RV870 there will be 5 big, decoupled units in RV970, plus a more complex scheduler, so that the chip acts as a real 1600-shader GPU and not the 320x 5D machine that "R500" through R800 are.
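
To put the 5D point in concrete terms (a made-up snippet, not real shader code): a VLIW5 unit can only co-issue independent operations, so a dependent chain like the one below leaves four of the five slots idle, which is exactly why a fully decoupled design would behave like a 'real' 1600-shader GPU:

/* Hypothetical dependent math: each line needs the previous result,
   so a 5-wide VLIW bundle can fill only one slot per instruction,
   while a decoupled/scalar machine keeps its ALUs busy with work
   from other threads instead. */
#include <stdio.h>

static float dependent_chain(float x, float y, float z)
{
    float a = x * y;   /* could go in slot 0 of a bundle */
    float b = a + z;   /* has to wait for a              */
    float c = b * b;   /* has to wait for b              */
    float d = c - x;   /* has to wait for c              */
    return d;
}

int main(void)
{
    printf("%f\n", dependent_chain(1.0f, 2.0f, 3.0f));
    return 0;
}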

Of course there are other options, such as a tile-based renderer or a ray-tracer... :)
 
There will be way more than 320 ALUs in RV970. 640+ easily. Keeping it at 320, regardless of what kind of shaders they are or how much better the dispatch processor is, would be a failure.

What do I think?
- It's going to be more programmable
- The ALU:TEX ratio is going to increase
- It's going to stay at 32 pixels per clock
- Ray tracing is going to be emphasized
- 28 nm process
- 1 GHz core
- 512-bit bus
 
'More programmable' would be a good trick, as we are running out of areas to improve here without going down the Larrabee route. While that is a possibility, until Larrabee is out the door I don't think it's realistic, and we've got at least one more generation before AMD starts heading in that direction.

Right now, only blending and one of the three tessellation stages remain unprogrammable beyond flipping a few options on and off. Opening up either wouldn't be massively interesting... programmable blending might allow for a few interesting tricks, but that's about it.

I suspect, given the nature of compute shaders, ray tracing is pretty much already possible. Maybe I'll give it a quick go next weekend if I'm not stuck at work.

I suspect this will be an incremental increase, maybe some DX11.1 features but nothing more. Maybe some extra compute features like being able to read/write a "multi-sampled buffer" which is a "missing" feature right now.

Going beyond that, if AMD want to compete with NV in the GPGPU market they will need to:
- add ECC support
- improve the scheduler to let them run multiple kernels at once
- improve double-precision processing so that it is only a 50% loss instead of running at 1/8th speed (the only viable option while games remain 32-bit FP, as trying to design a core around higher precision will just leave 50% of your transistors idle)

I would also expect to see the introduction of technology not too far from Larrabee to allow the GPU to do more work on its own, making it almost a daughterboard to the CPU. Partly because by the time we get to this state, GPUs for the next generation of consoles will either be in production or very close to it, and I wouldn't be surprised if MS at least would like something like this. I could see it being part of the DX12 spec: a 'shader' system which allows you to basically script what the GPU does based on what it renders and some initial input data.

So, RV9xx is unlikely to be much beyond an incremental increase; I would hope that the "RV10xx" would bring some innovation like the above to the table (from both NV and AMD).
 
512-bit is very expensive; more likely they'll go with ramped memory clocks, IMO.

And I think the HD 6870 will be a somewhat expensive SKU. It depends on the direction they go, but I have a feeling they are going to hit GPGPU hard next round, and they are going to make sure they have something high-end enough to scale down accordingly.

bobvodka@

I actually think RV9 is their new arch.
 
bobvodka@

I actually think RV9 is their new arch.

I'm not convinced; unless there is a major change coming in D3D, a whole new arch would be overkill at this point. AMD right now have the advantage that they don't have to risk a whole new arch, as their DX10 arch was already very close to DX11 and so required little in the way of change, and that will serve them for a while yet.

If they do go after GPGPU then what they need to do is match NV in double-precision processing, memory interface and kernel scheduling. None of this requires a 'new' arch, just improvements on the current one (in the same way that Fermi is very much a G80 under the hood, just with a bunch of improvements).
 
Or maybe some "integration" decisions? Because the next gen will definitely find its way into Fusion APUs, I suppose some sort of HyperTransport ports - that would be killer for GPGPU (nVidia is working on some sort of InfiniBand solution). Because the GPUs can get as powerful as they want, but data still needs to be transferred over PCI-Express, which is a deal breaker for many GPGPU applications. A general rule in GPGPU programming:

if (transfer_time > cpu_execution_time)
    execute_on_cpu();   // regardless of how fast it is on the GPU
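
To make that rule of thumb concrete, here's a minimal host-side sketch in plain C. The bandwidth figure, the byte count and the helper names are all hypothetical placeholders for illustration, not anything from a real API:

#include <stddef.h>

/* Rough effective PCIe 2.0 x16 figure, one direction - purely an assumption. */
#define PCIE_BANDWIDTH_BYTES_PER_SEC 5.0e9

/* Stub for "run it on the GPU" - in reality an OpenCL/Brook+ dispatch. */
static void run_on_gpu(const float *in, float *out, size_t n) { (void)in; (void)out; (void)n; }

/* Plain CPU fallback with a placeholder workload. */
static void run_on_cpu(const float *in, float *out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] * 2.0f;
}

void dispatch(const float *in, float *out, size_t n, double cpu_time_estimate) {
    /* Round trip over the bus: upload the input, download the result. */
    double bytes_moved   = 2.0 * n * sizeof(float);
    double transfer_time = bytes_moved / PCIE_BANDWIDTH_BYTES_PER_SEC;

    if (transfer_time > cpu_time_estimate)
        run_on_cpu(in, out, n);   /* regardless of how fast it is on the GPU */
    else
        run_on_gpu(in, out, n);
}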

A more aggressive scheduler is a better way to go than "decoupling" the ALUs, since games are sooo SIMD, and I believe that when they were designing the ISA, 4+1 (VLIW) was the "sweet spot" at the time - and it still holds, so it's in the compiler's and scheduler's hands! Which is good! I don't have the numbers, but the utilization of such a design is definitely much better now than in the R600 days!

You can go 99% compatible with everything, but you will end up losing die space (that's exactly what nVidia is doing with their superscalar "decoupled" architecture - gigantic, yet with something like one third/fourth the number of ALUs).

In fact, I guess we might just see an increase in the number of ALUs per unit, though I can't speculate on the number (8+2??) - that needs gathering statistics by profiling near-future games. Or maybe just "fatter" ALUs (FP64 ALUs?)? That would really help double-precision throughput, but do games need double precision? ATi is not geared towards GPGPU like the other camp.

Many things could happen with a new architecture, and frankly it's quite early to speculate - I guess even ATi's engineers haven't finalized their designs yet! With no R800 "refresh" coming up, they should be quite busy with the R900s.

As for me, to hell with it! I just want my 5970 first! Then I can care about whatever comes next.. ;(
 
Or maybe some "integration" decisions? Because the next gen will definitely find its way into Fusion APUs, I suppose some sort of HyperTransport ports - that would be killer for GPGPU (nVidia is working on some sort of InfiniBand solution). Because the GPUs can get as powerful as they want, but data still needs to be transferred over PCI-Express, which is a deal breaker for many GPGPU applications. A general rule in GPGPU programming:

if (transfer_time > cpu_execution_time)
    execute_on_cpu();   // regardless of how fast it is on the GPU

I dare say some integration choices will be made; however, the issue is at what level. They will still need chips with PCIe connections on them, so do they have a PCIe -> HT translation with HT going into the chip, or do they have two types of core, one with direct HT logic and one with PCIe-based logic, both of which feed into a common interface?

That said, 'integration' isn't, imo, the solution to the problem. As soon as you bring the GPU closer to the CPU you hit issues with memory access/contention, which is the biggest stumbling block. This is part of why the Xbox 360 has that 10 MB eDRAM buffer: the GPU can do what it does best while the CPU does what it does best, each with a slightly different memory access pattern, and favoring one over the other will diminish the performance of the other.

Generally, HPC is going to be dealing with large chunks of data anyway, so the travel time for the data isn't a huge issue. This would also benefit from the system I outlined earlier, as you could batch a lot of operations to go (even depending on what they return) and leave the system to crunch through the numbers.

On other issues:
- I don't think a fatter ALU will help, certainly not for games, as you are going to have half your resources lying around 'spare'.
- More ALUs might help, but only if they can execute two different instructions at once; otherwise you again run the risk of leaving resources spare. I think more cores is slightly more logical here.
 
And I think the HD 6870 will be a somewhat expensive SKU. It depends on the direction they go, but I have a feeling they are going to hit GPGPU hard next round, and they are going to make sure they have something high-end enough to scale down accordingly.

bobvodka@

I actually think RV9 is their new arch.

The sweet-spot strategy aims for a ~$300 part and scales up and down accordingly. It has been very successful for the 4-series and 5-series. For GPGPU they already have lots of MAD/MUL and CRC RAM; what they need is not hardware but software, to drive market penetration above single figures in the workstation and HPC markets.
 
Does anyone see R870 migrating to GF before the next generation? It seems like putting too many eggs in one basket, with a new architecture and a new foundry to boot. It would mean getting GF ramped up, sorting supply lines afterwards, or at least building the infrastructure for GPU production (which I am assuming has already started).
 
You know, I've never understood this fascination people have with a 'dual core' GPU; the fact that calling a chip which is already heavily multi-cored 'dual core' is a bit of a nonsense is a side issue ;)

Unless people mean they want Crossfire on a single die, which would give marginal benefits at most while doing 'interesting' things to the data bus.

I guess the thing people are most drawn to is the idea of 'sharing' the RAM, but unless both GPUs are working at precisely the same time on well-defined sets of data this becomes a problem as well; let's say GPU core 1 is rendering to a texture while GPU core 2 is reading from the same texture - unless there is a 2nd buffer in the mix so each core sees its own copy of the data, you are going to get the joys of undefined behaviour. A problem made worse by compute shaders, because once bound as a read/write buffer a compute shader can write anywhere in that chunk of memory; the operation can be synchronised so that other threads can see the change, but if two GPU cores are working on the same data at once then you have a situation whereby one GPU has to signal the other.

Then we get back to the bus again, because two cores both sitting on a 256-bit bus would be laughable even with GDDR5, as you are now demanding twice the data without some of the benefits of cache (unless you introduce a 3rd-level cache shared between the GPUs, which would help in some cases but not all, so you'd still see a large data increase, all of which increases costs), so you are going to have to bump to 512-bit anyway (or maybe a halfway house of something like 384-bit, depending on the cache), which is going to push costs up again.

Maybe there is something I've missed, but the whole idea seems like a lot of hassle and cost increase for very little real gain.
 
I was thinking maybe the way Intel made Core 2 Quads - just two duals with adjustments, transparent to the OS.

Or just do Crossfire from a different angle (possible surprises coming in a couple of months).
 
I was thinking maybe the way Intel made Core 2 Quads - just two duals with adjustments, transparent to the OS.

Or just do Crossfire from a different angle (possible surprises coming in a couple of months).

Yeah, the Intel Core 2 Quad has two dual-core dies in a single package, so that's two dies in a single processor.
 
The thing is, the Core 2 Quad, while it did the job, suffered from the problems I outlined above; bus contention and cache sync problems between cores.

Now, on a CPU, while it's a problem, it isn't such a big one, as normally you design your algorithms in such a way that you avoid needing to sync the cache across cores too often. So core 1 will work on its data set in a different section of memory to core 2, and it's not until they are done with a chunk of work that you'd need to sync things up (see the sketch below). Even this isn't perfect, as the scheduler can bounce threads between cores, which causes all manner of cache-related problems.
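
As a very rough illustration of that partition-then-sync pattern on the CPU side (plain C with pthreads, all names and sizes made up), each thread chews on its own disjoint slice of memory and the only synchronisation is the join at the end of the chunk:

#include <pthread.h>
#include <stddef.h>

#define NUM_CORES  2
#define CHUNK_SIZE 1024

static float data[NUM_CORES * CHUNK_SIZE];   /* one disjoint slice per core */

struct slice { float *base; size_t count; };

/* Each worker touches only its own slice, so the caches never fight
   over the same lines while the chunk is being processed. */
static void *worker(void *arg) {
    struct slice *s = (struct slice *)arg;
    for (size_t i = 0; i < s->count; ++i)
        s->base[i] *= 2.0f;                  /* placeholder workload */
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_CORES];
    struct slice slices[NUM_CORES];

    for (int c = 0; c < NUM_CORES; ++c) {
        slices[c].base  = &data[c * CHUNK_SIZE];
        slices[c].count = CHUNK_SIZE;
        pthread_create(&threads[c], NULL, worker, &slices[c]);
    }

    /* The only sync point: wait until every core has finished its chunk. */
    for (int c = 0; c < NUM_CORES; ++c)
        pthread_join(threads[c], NULL);

    return 0;
}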

I can't recall where the Core 2 Duo was limited, but there is a chance that it wasn't data fetch but throughput that was the problem, which is part of why they could get away with sticking both of them on what is in effect a slow bus.

The fact that this wasn't the best way to go about it, however, is shown by Nehalem/Core i7/i5/i3, where a shared L3 cache was added between all four cores to get around some of these problems. It also sits on a faster bus.

This is just a CPU, where the programmer has taken the time to tune his code to work with X number of cores (task-based in the current thinking, which fits the model outlined above) and can query how many cores he has and split the work accordingly. In short: while there might be two pairs of Core 2 cores in a Core 2 Quad, and while this arrangement doesn't matter 'as such', you still need to be aware of it.

The same thing would apply to trying to just glue two cores together for a GPU; while the system might not care, those writing the code certainly do. Querying for SLI and Crossfire to make sure your code does The Right Thing algorithm-wise is perfectly sane; some things just aren't going to work well when split-rendered across two GPUs.

With pixel shaders this wasn't a huge problem, beyond needing to sync the memory of, say, a render-to-texture operation, simply because the pixel shader could only write to one location at a time. D3D11, however, totally relaxes that: the compute shader, and indeed the pixel shader, can write to any location in specific output buffers as they want AND read those changes back again - changes which need to be seen across various levels of thread granularity. This means an effective sync between both "cores" of the GPU.
(This is still likely to be an 'issue' with Crossfire/SLI'd GPUs; however, as long as the app can enforce some sort of control over how the GPU is allocated, the problem goes away a bit. I admit, I need to look into this control aspect of things a bit more.)

I still remain unconvinced it would be a good idea; it would just cost more, use more power and create more headaches.
 
I was banging my head against the wall trying to optimize an OpenCL program and I suddenly remembered this thread!

I WANT MORE FRIGGIN' MEMORY BANDWIDTH!!!!! Maybe something like 12 Gbps GDDR5 @ 512-bit would be "sufficient" for RV970. Either ATi does something about how to get more efficiency out of the "global" memory, or we're going to need 3.072 TB/s to cope with RV970 (given that it would probably be 2x RV870). In fact, I say they should not 2x the ALUs and should keep the die space for the memory controller - it's about time we got more of that! It's like having the fastest jet fighter with not enough runway to take off!
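
For what it's worth, the raw number behind that wish is easy to check - the pin rate and bus width below are just the figures quoted above, nothing announced:

#include <stdio.h>

int main(void) {
    /* Figures from the post above, purely speculative. */
    double gbits_per_pin = 12.0;    /* hypothetical 12 Gbps GDDR5 */
    int    bus_width     = 512;     /* bits */

    /* bandwidth (GB/s) = pin rate (Gb/s) * bus width (bits) / 8 (bits per byte) */
    double gbytes_per_sec = gbits_per_pin * bus_width / 8.0;

    printf("%.1f GB/s\n", gbytes_per_sec);   /* prints 768.0 GB/s */
    return 0;
}

That works out to 768 GB/s - roughly five times the HD 5870's 153.6 GB/s, yet still nowhere near the 3.072 TB/s wished for above.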

It's quite frustrating, as one can't make use of all that power if one can't feed it - at least not without some programming workouts that even aliens would love to license!

[DSV's brain temps in critical range, need to underclock for a while]
 
You know, I've never understood this fascination people have with a 'dual core' GPU; the fact that calling a chip which is already heavily multi-cored 'dual core' is a bit of a nonsense is a side issue ;)

Unless people mean they want Crossfire on a single die, which would give marginal benefits at most while doing 'interesting' things to the data bus.

I guess the thing people are most drawn to is the idea of 'sharing' the RAM, but unless both GPUs are working at precisely the same time on well-defined sets of data this becomes a problem as well; let's say GPU core 1 is rendering to a texture while GPU core 2 is reading from the same texture - unless there is a 2nd buffer in the mix so each core sees its own copy of the data, you are going to get the joys of undefined behaviour. A problem made worse by compute shaders, because once bound as a read/write buffer a compute shader can write anywhere in that chunk of memory; the operation can be synchronised so that other threads can see the change, but if two GPU cores are working on the same data at once then you have a situation whereby one GPU has to signal the other.

Then we get back to the bus again, because two cores both sitting on a 256-bit bus would be laughable even with GDDR5, as you are now demanding twice the data without some of the benefits of cache (unless you introduce a 3rd-level cache shared between the GPUs, which would help in some cases but not all, so you'd still see a large data increase, all of which increases costs), so you are going to have to bump to 512-bit anyway (or maybe a halfway house of something like 384-bit, depending on the cache), which is going to push costs up again.

Maybe there is something I've missed, but the whole idea seems like a lot of hassle and cost increase for very little real gain.

i guess the "dual-core" rumor came from a misunderstood pic of the RV870 having "dual banks, RV770 each". think about it, u see two identical "banks" each looks so damn close to a single RV770, what would you think?

On another note, I guess we need some kind of Super-HyperTransport kind of thing now, since I guess Crossfire over PCI-E is not sufficient for an effective multi-GPU platform (exactly the Core 2 Quad case you stated) without taking special care of that issue (some sort of "macro" strip mining... :nuts:), and with GPUs getting inside the CPUs, I guess something is already being done about that.

EDIT: I was just reading a white paper:
"Although there is a memory crossbar on the GPU chips, it cannot be used for communication among cores."
How about this?
 