While trying to improve the performance of my collision detection class, I found that roughly 80% of the GPU time was spent on if/else conditions, just trying to figure out the bounds of the buckets it should loop through.
More precisely:
Each thread gets an ID; by that ID it fetches its triangle from memory (3 integers) and by those 3 indices it fetches its vertices (3 floats each).
Then it transforms the vertices into integer grid points (currently on an 8x8x8 grid) and turns those into the triangle's bounds on that grid.
To transform the 3 points into bounds, it finds the min/max of each dimension across the three points.
Since the programming language I am using is missing a minmax intrinsic, I made one myself; it looks like this:
procedure MinMax(a, b, c):
    local min, max
    if a > b:
        max = a
        min = b
    else:
        max = b
        min = a
    if c > max:
        max = c
    else:
        if c < min:
            min = c
    return (min, max)
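Putting it together, the per-thread work is roughly the following. (This is a sketch written in CUDA purely for illustration; my actual code is in a different language, and the buffer names, grid origin and cell size here are placeholders.)

    #include <cuda_runtime.h>

    // Direct translation of the MinMax routine above, applied per dimension.
    __device__ void MinMax(int a, int b, int c, int* outMin, int* outMax)
    {
        int mn, mx;
        if (a > b) { mx = a; mn = b; } else { mx = b; mn = a; }
        if (c > mx) mx = c;
        else if (c < mn) mn = c;
        *outMin = mn;
        *outMax = mx;
    }

    __global__ void ComputeTriangleBounds(const int*   indices,    // 3 vertex indices per triangle
                                          const float* vertices,   // 3 floats per vertex
                                          int3* boundsMin, int3* boundsMax,
                                          int numTriangles,
                                          float3 gridOrigin, float cellSize)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= numTriangles) return;

        // Fetch the triangle's three vertices and quantize each to an 8x8x8 grid cell.
        int gx[3], gy[3], gz[3];
        for (int k = 0; k < 3; ++k) {
            int v = indices[3 * tid + k];
            gx[k] = (int)((vertices[3 * v + 0] - gridOrigin.x) / cellSize);
            gy[k] = (int)((vertices[3 * v + 1] - gridOrigin.y) / cellSize);
            gz[k] = (int)((vertices[3 * v + 2] - gridOrigin.z) / cellSize);
        }

        // Per-dimension bounds: this is where ~80% of the time goes.
        int3 bmin, bmax;
        MinMax(gx[0], gx[1], gx[2], &bmin.x, &bmax.x);
        MinMax(gy[0], gy[1], gy[2], &bmin.y, &bmax.y);
        MinMax(gz[0], gz[1], gz[2], &bmin.z, &bmax.z);

        boundsMin[tid] = bmin;
        boundsMax[tid] = bmax;
    }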
So on average it should be 2.5 * 3 * 3 = 22.5 comparisons, which ends up eating far more time than the actual triangle-edge intersection tests (around 100 * 11-50 instructions).
In fact, I found that pre-calculating the required buckets on the CPU (single-threaded, no vectorization), stacking them in a GPU view along with the bucket definition, and making the GPU do ~4 extra reads per thread was 6 times faster than trying to figure out the bounds on the spot. (Note that they get recalculated before every execution, since I'm dealing with dynamic meshes.)
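For reference, the CPU-side precomputation is essentially just this (again a sketch with placeholder names; in my real code the results go into a GPU view rather than a raw device buffer):

    #include <cuda_runtime.h>
    #include <algorithm>
    #include <climits>
    #include <vector>

    struct BucketBounds { int minX, minY, minZ, maxX, maxY, maxZ; };

    // Single-threaded host pass, redone before every launch because the mesh is
    // dynamic; each GPU thread then only reads its precomputed bounds.
    void PrecomputeAndUpload(const std::vector<int>& indices,     // 3 vertex indices per triangle
                             const std::vector<int>& gridPoints,  // 3 quantized coordinates per vertex
                             BucketBounds* d_bounds)              // device buffer, already allocated
    {
        size_t numTriangles = indices.size() / 3;
        std::vector<BucketBounds> bounds(numTriangles);
        for (size_t t = 0; t < numTriangles; ++t) {
            BucketBounds b = { INT_MAX, INT_MAX, INT_MAX, INT_MIN, INT_MIN, INT_MIN };
            for (int k = 0; k < 3; ++k) {
                const int* p = &gridPoints[3 * indices[3 * t + k]];
                b.minX = std::min(b.minX, p[0]);  b.maxX = std::max(b.maxX, p[0]);
                b.minY = std::min(b.minY, p[1]);  b.maxY = std::max(b.maxY, p[1]);
                b.minZ = std::min(b.minZ, p[2]);  b.maxZ = std::max(b.maxZ, p[2]);
            }
            bounds[t] = b;
        }
        cudaMemcpy(d_bounds, bounds.data(), numTriangles * sizeof(BucketBounds),
                   cudaMemcpyHostToDevice);
    }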
So why is the comparison so horrendously slow on a GPU?
Asked By : user29075
Answered By : Wandering Logic
GPUs are SIMD architectures. In SIMD architectures every instruction needs to be executed for every element that you process. (There's an exception to this rule, but it rarely helps).
So in your MinMax routine, not only does every call need to fetch all three branch instructions (even if on average only 2.5 are evaluated), but every assignment statement takes up a cycle as well (even if it doesn't actually get "executed").
This problem is sometimes called thread divergence. If your machine has something like 32 SIMD execution lanes, it will still have only a single fetch unit. (Here the term "thread" basically means "SIMD execution lane".) So internally each SIMD execution lane has an "I'm enabled/disabled" bit, and the branches actually just manipulate that bit. (The exception is that at the point where every SIMD lane becomes disabled, the fetch unit will generally jump directly to the "else" clause.)
So in your code, every SIMD execution lane is doing:
    compare (a > b)
    assign  (max = a   if a > b)
    assign  (min = b   if a > b)
    assign  (max = b   if not(a > b))
    assign  (min = a   if not(a > b))
    compare (c > max)
    assign  (max = c   if c > max)
    compare (c < min   if not(c > max))
    assign  (min = c   if not(c > max) and c < min)
It may be the case that on some GPUs this conversion of conditionals to predication is slower if the GPU is doing it itself. As pointed out by @PaulA.Clayton, if your programming language and architecture have a predicated conditional-move operation (especially one of the form if (c) x = y else x = z), you might be able to do better. (But probably not much better.)
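For instance, writing the selections directly might look like the following (a sketch in CUDA; whether the compiler really emits predicated moves rather than branches is not guaranteed and depends on the target):

    // Each ternary selects between two already-computed values, which compilers
    // typically lower to a conditional move/select instruction instead of a branch.
    __device__ void MinMaxSelect(int a, int b, int c, int* outMin, int* outMax)
    {
        int lo = (a < b) ? a : b;
        int hi = (a < b) ? b : a;
        *outMin = (c < lo) ? c : lo;
        *outMax = (c > hi) ? c : hi;
    }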
Also, placing the c < min conditional inside the else of c > max is unnecessary. It certainly isn't saving you anything, and (given that the GPU has to automatically convert it to predication) having it nested inside two different conditionals may actually be hurting you.
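A flat version along those lines (again just a sketch, using CUDA's built-in integer min()/max(), which typically compile down to single min/max instructions) could be:

    // No nesting and no divergence: every lane runs the same short, branch-free sequence.
    __device__ void MinMaxFlat(int a, int b, int c, int* outMin, int* outMax)
    {
        *outMin = min(min(a, b), c);
        *outMax = max(max(a, b), c);
    }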
Question Source : http://cs.stackexchange.com/questions/39871