Eric Demers and GCN, Part II

Company: AMD
Author: James Prior
Date: December 22nd, 2011

GDDR5 Memory

R3D: The grey screen of death problem appeared to be related to memory timing and link retraining; there was an investigation, a driver fix, and new BIOSes from certain partners. Was there anything you learned from that for the chip design that carried into the new memory controller?

Tom: Memory clock switching, training G5, is tricky. It's per lane, and the real time it takes to do that training is challenging. We've made improvements to cut that time down, which should improve our memory state training.

Eric: I know we had some p-state issues which I think were fixed with our FIs; our sequencers for the memory are programmable with firmware associated with them, we made some changes there. This is a whole new design, [laughing] so it's a whole new set of problems is one way to look at it! Fundamentally, to Tom's point, we've made it a lot faster and a lot more stable and it's being leveraged in other parts of AMD's products. I don't expect the same problems to occur again. I don't remember hearing 'Grey Screen of Death' - catchy! It's a completely different design so I assume it wouldn't apply to these products.

R3D: You mentioned in your presentation -

Eric: I lied! :D

R3D: - that ECC support will be available on compute products, does that mean it's only going to be turned on for the FirePro/FireStream products?

Eric: At this point, yes. We thought about 'could it be used to improve yield on consumer products' and things like that, and we may decide to do that kind of thing; well, we reserve the right to do anything we want, I guess! Right now those kinds of features would actually hurt performance for consumers because they do take away from memory storage (maybe not the internal, but the external DRAM), and it would certainly make the drivers more complex. The FirePro driver team are doing that because some of their customers desire it. I wouldn't say it's a requirement, but Oil & Gas, Medical, they need it for liability reasons; and the whole server play, these guys need it. For now that's our plan: not to enable it for consumer. It's not necessary to destroy it or burn a fuse or anything, but just not to enable it for consumer.
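To put a rough number on "take away from memory storage": the interview does not say which error-correcting code AMD uses, so the sketch below assumes a generic Hamming SECDED code (single error correct, double error detect) purely for illustration. The function names are hypothetical.

```python
def secded_check_bits(data_bits):
    """Check bits needed by a generic Hamming SECDED code (an assumption,
    not AMD's stated scheme): the smallest r with 2**r >= data_bits + r + 1,
    plus one overall parity bit for double-error detection."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1


def storage_overhead(data_bits):
    """Fraction of total stored bits spent on check bits."""
    check = secded_check_bits(data_bits)
    return check / (data_bits + check)


if __name__ == "__main__":
    for width in (32, 64):
        print(width, secded_check_bits(width), round(storage_overhead(width), 3))
```

Under that assumption, protecting 64-bit words costs 8 check bits per word, so roughly one ninth of the DRAM holds check bits instead of data.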

Tom: The basic consumer analysis is the reliability and the susceptibility to random bitflipping and all that, it's still very acceptable without using ECC for normal applications.

Eric: The bitflip rate for a single part is still measured in hundreds of thousands of hours if not millions of hours, so you really need tens of thousands of these cards to start making a difference. An individual user will not see it; it's something that's once or twice in a lifetime ... well, maybe if you live in Denver ...
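The arithmetic behind Eric's point can be sketched in a few lines. The one-million-hour interval and the 20,000-card fleet below are assumed round numbers picked from the ranges he mentions, not measured figures, and the model treats bitflips as independent random events.

```python
import math

# Assumed for illustration only: mean time between bitflips on one card.
ASSUMED_MTBF_HOURS = 1_000_000
HOURS_PER_YEAR = 24 * 365


def expected_bitflips(mtbf_hours, cards, hours):
    """Expected number of bitflip events across `cards` cards in `hours` hours."""
    return cards * hours / mtbf_hours


def probability_of_at_least_one(mtbf_hours, cards, hours):
    """Chance of one or more events, treating events as a Poisson process."""
    return 1.0 - math.exp(-expected_bitflips(mtbf_hours, cards, hours))


if __name__ == "__main__":
    for cards in (1, 20_000):
        events = expected_bitflips(ASSUMED_MTBF_HOURS, cards, HOURS_PER_YEAR)
        chance = probability_of_at_least_one(ASSUMED_MTBF_HOURS, cards, HOURS_PER_YEAR)
        print(f"{cards} card(s): {events:.5f} expected events/year, P(>=1) = {chance:.4f}")
```

Under those assumed numbers, a single card running around the clock expects about 0.009 events per year, while a 20,000-card installation expects about 175, which is why the feature matters at server scale and not on one desktop.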

Tom: In some markets, they don't want to correct your data, they just want it [ECC as a feature]; it's a very important checkmark.

Eric: It's for reliability, it's psychological.

R3D: What about the CRC checking, that Cypress had?

Eric: Yes, that's still part of the GDDR5 protocol. There's a built-in (at the interface level) two-dimensional parity check. There's one per write and there is an accumulation over eight writes; that's still part of the G5 protocol. Somebody had done a review where they kept increasing mClk and saw the performance keep going up until they went past a certain level, and the ECC [here, the link-level error detection and retry] kicked in and began reducing performance. We're below that level, and that's all still there. The interfaces themselves are all protected by ECC; this is actually memory ECC on top of that, stored to deal with bitflips from alpha and neutron particles that can happen.
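As an illustration of the general idea of a two-dimensional parity check (one parity per write, plus a parity accumulated across eight writes), here is a minimal sketch. It is not the actual GDDR5 link code; the 8-bit word width and all function names are assumptions made for the example.

```python
def row_parity(word):
    """Parity bit (0 or 1) of a single 8-bit write."""
    return bin(word & 0xFF).count("1") & 1


def column_parity(words):
    """Bitwise XOR accumulated over a group of writes (one parity bit per column)."""
    acc = 0
    for w in words:
        acc ^= w & 0xFF
    return acc


def protect(words):
    """Return (per-write parities, accumulated column parity) for a group of writes."""
    return [row_parity(w) for w in words], column_parity(words)


def locate_single_bit_error(received, sent_rows, sent_cols):
    """Return (write index, bit index) of a single flipped bit, or None if clean."""
    rows, cols = protect(received)
    bad_rows = [i for i, (a, b) in enumerate(zip(rows, sent_rows)) if a != b]
    col_diff = cols ^ sent_cols
    if not bad_rows and col_diff == 0:
        return None
    if len(bad_rows) == 1 and bin(col_diff).count("1") == 1:
        return bad_rows[0], col_diff.bit_length() - 1
    raise ValueError("more than one bit in error; a retry would be needed")


if __name__ == "__main__":
    writes = [0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0]
    rows, cols = protect(writes)
    damaged = list(writes)
    damaged[3] ^= 1 << 5  # flip one bit in the fourth write
    print(locate_single_bit_error(damaged, rows, cols))
```

The per-write parity says which write was hit and the accumulated parity says which bit position, so the two dimensions together pin down a single flipped bit.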

Tom: That's what is nice about G5: if it fails, if you push it too hard, it fails gracefully. It retries more and you get less performance; it doesn't just start corrupting everything and make the screen go to hell.
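The "fails gracefully" behaviour can be modelled with a toy calculation. Everything numeric here is hypothetical: the 1500 MHz stable limit, the 200 MHz ramp, and the linear error curve are invented for the sketch, and transfers are assumed to fail independently and be retried until they succeed.

```python
def retry_error_rate(mclk_mhz, stable_limit_mhz=1500.0, ramp_mhz=200.0):
    """Hypothetical fraction of transfers that fail their check at a given clock.

    Zero up to the assumed stable limit, then rising linearly, capped at 0.99.
    """
    if mclk_mhz <= stable_limit_mhz:
        return 0.0
    return min(0.99, (mclk_mhz - stable_limit_mhz) / ramp_mhz)


def effective_throughput(mclk_mhz):
    """Useful throughput in arbitrary units (raw throughput taken as the clock).

    With failure probability p and retry-until-success, each transfer needs
    1 / (1 - p) attempts on average, so useful throughput is raw * (1 - p).
    """
    p = retry_error_rate(mclk_mhz)
    return mclk_mhz * (1.0 - p)


if __name__ == "__main__":
    for mclk in (1400, 1500, 1550, 1600):
        print(mclk, effective_throughput(mclk))
```

In this model throughput climbs with the clock up to the assumed limit and then falls as retries eat into it, but the data that does get through is still correct, which is the trade-off Tom describes.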