World's most popular travel blog for travel bloggers.

How to "convert" nested loops into code taking advantage of parallel computing?

, , No Comments
Problem Detail: 

Let's say that an algorithm uses several nested loops to accomplish it's task, for instance :

  • An algorithm treating voxels on several frames, so there would be four dimensions : t (for time), x, y and z.
  • An algorithm evaluating a function that takes n entries, searching for which entries there is a certain result. For instance if we search f(a, b, c, d, e) = 2001 where each variable is tested between 0 and 100.

From what I understand, several APIs make you model the way your kernel is executed by the GPU with a 1-2-3d grid, but many of them don't provide n-dimensional grids for execution.

So if I had to treat a simple 2D image of 1024x2048 pixels, I could have simply made my kernel execute itself over a grid of that size instead of having to nest loops, but since most grids seem to be limited to 3 dimensions, what to do when I need to execute work for more than 3 dimensions (like in my second example) ?
Should I put loops directly in my kernels ?

Thank you.

Some pseudocode : Let's say that I want to test for which inputs a given function has a certain value, I could have some brute force approach like that :

for a in 1...100 { for b in 1...100 { for c in 1...100 { for d in 1...100 {      if f(a,b,c,d) == 1984 {         equationSolutions.append((a,b,c,d))      } } 

Now with grids, a certain kernel can be executed over the grid (sorry, I may not use the right term, there is a much better explaination about grids here), so if you want to treat each pixel of an image of 2048x512 pixels, instead of doing :

for x in 0...imageWidth { for y in 0...imageHeight {    treatPixel(x,y) } 

You will define grid as a 2D grid of size 2048x512 and then your kernel won't contain any loop, it will just get it's position on the grid as an argument : func myPixelFunction(positionInGrid) { treatPixel(positionInGrid.x, positionInGrid.y) }

My question is that, since most grids are limited to 3 dimensions, what to do if you want to execute code that requires more than 3 dimensions ? (like my code trying to solve a function ?).

From what I understand, in cases like the ones I showed, grids make more sense than loops on GPUs since grids enable to parallelise the work, making several threads work in parallel on the same function/kernel, but while it's really easy to simply use position in grid when you are treating data that is representable by the grid, I don't understand how to write code using grids when I needs to word on "additional dimensions".

EDIT #2 : This SE question is similar to mine.

Asked By : Trevör Anne Denise
Answered By : D.W.

Let's say you want to iterate over a,b,c,d each ranging from 0..99. There are two ways to do that.

Approach #1: fewer nested loops

Set up a 100x100 grid, which will handle a,b. Handle c,d by explicit nested loops:

treatPixel(a,b): for c in 0..99:     for d in 0..99:         doSomething(a,b,c,d) 

Approach #2: bigger grid

Set up a 10,000x10,000 grid. Then decode the x-coordinate to a,b and the y-coordinate to c,d, like this:

treatPixel(x,y): a := floor(x/100); b := x mod 100 c := floor(y/100); d := y mod 100 doSomething(a,b,c,d) 

Which to choose?

Either one should work.

Assuming the number of iterations greatly exceeds the number of compute elements in the GPU, the difference between the two probably comes down to different memory locality effects. You can try implementing both of them, benchmarking, and seeing if one is faster. If there is little or no memory access and doSomething() is compute-bound, they'll probably both run at about the same speed.

Best Answer from StackOverflow

Question Source :

3200 people like this

 Download Related Notes/Documents


Post a Comment

Let us know your responses and feedback