… that common GPGPU optimization tips, such as breaking your problem down into as many threads as possible, avoiding branches, or using local memory to cache reused data, are not always useful for improving performance and increasing energy efficiency?

Computing more than one output per work-item can often improve performance: parts of the calculation, such as index computations, can be reused across outputs; having fewer active threads allows the GPU to allocate more registers per thread; and multiple outputs per thread can also increase instruction-level parallelism. Vasily Volkov explains this in more detail in his talk "Better Performance at Lower Occupancy".

People are also often told to avoid branches in GPGPU code, but branches are frequently not as expensive as they seem. Divergent branches do not reduce performance if you are limited by memory bandwidth, and many mobile GPUs do not use wide SIMD anyway: ARM recommends workgroup sizes that are multiples of just 4 threads on its Mali GPUs, and the optimal workgroup size on Qualcomm Adreno GPUs depends on the GPU series and the OpenCL driver.

Local memory can be very limited or even emulated via DRAM accesses, so on some platforms using local memory can actually hurt performance.

To pinpoint these performance, and often also energy efficiency, issues, a profiler can be very useful. LPGPU2 CodeXL can be used for profiling on many different platforms.
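The "multiple outputs per work-item" idea can be sketched on the CPU. The example below is hypothetical (a made-up kernel that scales a row-major matrix by 2, with invented function names): the four-output variant computes the row base index once and reuses it, and its four stores are independent, which is the kind of reuse and instruction-level parallelism the paragraph above describes.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: one output per "work-item". The full linear index is
 * recomputed for every single output element. */
static void scale_one(const float *in, float *out, size_t width,
                      size_t x, size_t y) {
    size_t idx = y * width + x;       /* index math done per output */
    out[idx] = 2.0f * in[idx];
}

/* Multiple outputs per "work-item": one call produces four consecutive
 * outputs, so the base index is computed once and reused, and the four
 * multiplies are independent of each other. */
static void scale_four(const float *in, float *out, size_t width,
                       size_t x4, size_t y) {
    size_t base = y * width + 4 * x4; /* computed once for 4 outputs */
    for (int i = 0; i < 4; ++i)
        out[base + i] = 2.0f * in[base + i];
}
```

In a real OpenCL kernel the same transformation also shrinks the global work size by a factor of four, which is where the register-allocation benefit mentioned above comes from.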