Nvidia clients/cores use CUDA, which is Nvidia's native tongue, so to speak. It's a lot more mature (meaning they've had more time to work on it), so it requires very little CPU time. Nvidia people can run full SMP with little downside.
AMD uses OpenCL (they used to use something called Brook but dropped it for OpenCL). The AMD core isn't as mature as Nvidia's, hence the performance difference. Stanford is working on it though, and it got much better with v7. We can only go up from here!
Void4ever