Category Archives: High Performance Computing

LPGPU2 Featured in HiPEAC Newsletter

The LPGPU2 project was described in an article in the HiPEAC newsletter (edition #49), which is now available online. The printed publication was handed out to the 500+ attendees of HiPEAC 2017 and also distributed electronically to the 1700+ members of the HiPEAC network.


TUB Paper Accepted at IPDPS 2017

The paper “E²MC: Entropy Encoding based Memory Compression for GPUs” by Sohan Lal, Jan Lucas and Ben Juurlink has been accepted for publication at the 31st IEEE International Parallel & Distributed Processing Symposium (IPDPS), to be held in Orlando, Florida, USA from May 29 – June 2, 2017. The paper proposes an entropy-encoding-based memory compression technique for GPUs. The proposed technique addresses the key challenges of probability estimation, choosing an appropriate symbol length for encoding, and low-latency decompression. It achieves a higher compression ratio and performance gain than the state of the art.
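As a rough illustration of why probability estimation and symbol length matter for entropy coding, the sketch below estimates the Shannon entropy of a data block for a given symbol size and derives the best-case compression ratio an ideal entropy coder could reach. It is not the E²MC design: the function names and the sample data are assumptions made for illustration only.

```python
import math
from collections import Counter


def symbol_entropy_bits(data: bytes, symbol_bytes: int) -> float:
    """Shannon entropy in bits per symbol, using fixed-size symbols."""
    if symbol_bytes <= 0 or len(data) == 0 or len(data) % symbol_bytes != 0:
        raise ValueError("data must be non-empty and a multiple of the symbol size")
    # Split the block into symbols and estimate probabilities from frequencies.
    symbols = [data[i:i + symbol_bytes] for i in range(0, len(data), symbol_bytes)]
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def ideal_compression_ratio(data: bytes, symbol_bytes: int) -> float:
    """Upper bound on the ratio for an ideal entropy coder (ignores table overhead)."""
    entropy = symbol_entropy_bits(data, symbol_bytes)
    if entropy == 0.0:
        return math.inf  # a single repeated symbol carries no information
    return (8 * symbol_bytes) / entropy


if __name__ == "__main__":
    block = bytes([0, 0, 0, 1] * 32)  # hypothetical 128-byte memory block
    for size in (1, 2):
        print(size, ideal_compression_ratio(block, size))
```

On this sample block the bound differs between 1-byte and 2-byte symbols, which is the kind of trade-off a symbol-length choice has to weigh against table size and decoding cost.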

Figure: Speedup with increased memory bandwidth


TU Berlin Paper to Appear at MTAGS13 Workshop, Co-located with SC 2013

The paper “FPGA-Based Prototype of Nexus++ Task Manager”, by Tamer Dallou, Ahmed Elhossini and Ben Juurlink, has been accepted to appear at the 6th Workshop on Many-Task Computing on Clouds, Grids, and Supercomputers, which is co-located with Supercomputing/SC 2013 and takes place on November 17th, 2013, in Denver, Colorado, USA.

The Nexus++ task manager is designed for task-based programming models. Furthermore, it will be ported to GPGPUSim as an extension to add dependency-awareness to GPUs, at block-level granularity.

Abstract: StarSs is one of several programming models that try to ease parallel programming. In StarSs, the programmer has to identify pieces of code that can be executed as tasks, as well as their inputs and outputs. Thereafter, the runtime system (RTS) determines the dependencies between tasks and schedules ready tasks onto worker cores. Previous work has shown, however, that the StarSs RTS may constitute a bottleneck that limits the scalability of the system and proposed a hardware task management system called Nexus++ to eliminate this bottleneck. The first prototype of Nexus++ was implemented in SystemC. Its architecture also had a nondeterministic multi-cycle search algorithm in its critical path, potentially limiting its scalability. In this paper, we improved the architecture of Nexus++ and employed multi-way set-associative cache-like data structures to optimize its search algorithm and increase task throughput. We also modeled the new architecture in VHDL and targeted a Virtex-5 FPGA from Xilinx. Experimental results show that the new architecture is very resource-efficient, utilizing only 19% of the target FPGA. They also show that Nexus++ achieves a speedup of up to 81x using some synthetic benchmarks modeled after H.264 decoding. Hence, Nexus++ significantly enhances the scalability of applications parallelized using StarSs.
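The dependency-determination step described in the abstract can be sketched in software as follows. This is a minimal illustration, assuming tasks declare their inputs and outputs as named data items. It uses plain dictionaries rather than the set-associative tables of Nexus++, and the `Task` class and `build_dependencies` function are hypothetical names, not part of StarSs or Nexus++.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    inputs: tuple = ()
    outputs: tuple = ()


def build_dependencies(tasks):
    """Map each task name to the set of earlier tasks it must wait for."""
    last_writer = {}  # data item -> name of the last task that wrote it
    readers = {}      # data item -> tasks that read it since the last write
    deps = {}
    for task in tasks:  # tasks are processed in submission order
        waits = set()
        for item in task.inputs:
            if item in last_writer:
                waits.add(last_writer[item])        # read-after-write
        for item in task.outputs:
            if item in last_writer:
                waits.add(last_writer[item])        # write-after-write
            waits.update(readers.get(item, ()))     # write-after-read
        waits.discard(task.name)
        # Record this task's accesses for the tasks submitted after it.
        for item in task.inputs:
            readers.setdefault(item, set()).add(task.name)
        for item in task.outputs:
            last_writer[item] = task.name
            readers[item] = set()
        deps[task.name] = waits
    return deps


if __name__ == "__main__":
    graph = build_dependencies([
        Task("t1", outputs=("a",)),
        Task("t2", inputs=("a",), outputs=("b",)),
        Task("t3", inputs=("a",), outputs=("c",)),
        Task("t4", inputs=("b", "c"), outputs=("a",)),
    ])
    for name, waits in graph.items():
        print(name, sorted(waits))
```

A task whose wait set is empty, or whose predecessors have all finished, is ready to be scheduled onto a worker core.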


Uppsala paper at ICS 2013

Towards more efficient execution: a decoupled access-execute approach.

ICS ’13 Proceedings of the 27th international ACM conference on International conference on supercomputing

Abstract: The end of Dennard scaling is expected to shrink the range of DVFS in future nodes, limiting the energy savings of this technique. This paper evaluates how much we can increase the effectiveness of DVFS by using a software decoupled access-execute approach. Decoupling the data access from execution allows us to apply optimal voltage-frequency selection for each phase and therefore improve energy efficiency over standard coupled execution.

The underlying insight of our work is that by decoupling access and execute we can take advantage of the memory-bound nature of the access phase and the compute-bound nature of the execute phase to optimize power efficiency, while maintaining good performance. To demonstrate this we built a task based parallel execution infrastructure consisting of: (1) a runtime system to orchestrate the execution, (2) power models to predict optimal voltage-frequency selection at runtime, (3) a modeling infrastructure based on hardware measurements to simulate zero-latency, per-core DVFS, and (4) a hardware measurement infrastructure to verify our model’s accuracy.

Based on real hardware measurements, we project that the combination of decoupled access-execute and DVFS has the potential to improve EDP by 25% without hurting performance. On memory-bound applications we significantly improve performance due to increased MLP in the access phase and ILP in the execute phase. Furthermore, we demonstrate that our method can achieve high performance both in the presence and in the absence of a hardware prefetcher.
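To make the energy-delay product (EDP) argument concrete, the following toy model compares running both phases at one frequency with running the memory-bound access phase at a lower frequency. Everything in it is an assumption for illustration: the cubic dynamic-power term, the static power value, the normalized frequencies and the phase parameters are not taken from the paper.

```python
def phase_time(compute_work, memory_time, freq):
    """Time of one phase: compute part scales with 1/f, memory part does not."""
    return compute_work / freq + memory_time


def phase_power(freq, static_power=0.5, dynamic_coeff=1.0):
    """Toy power model: static power plus a dynamic term proportional to f^3."""
    return static_power + dynamic_coeff * freq ** 3


def edp(phases, freqs):
    """Energy-delay product for a sequence of (compute_work, memory_time) phases."""
    total_time = 0.0
    total_energy = 0.0
    for (compute_work, memory_time), freq in zip(phases, freqs):
        t = phase_time(compute_work, memory_time, freq)
        total_time += t
        total_energy += phase_power(freq) * t
    return total_energy * total_time


if __name__ == "__main__":
    # Hypothetical task: a memory-bound access phase, then a compute-bound execute phase.
    phases = [(0.1, 1.0), (1.0, 0.0)]
    coupled = edp(phases, [1.0, 1.0])    # one frequency for both phases
    decoupled = edp(phases, [0.5, 1.0])  # low frequency for the access phase only
    print(coupled, decoupled)
```

In this model the access phase loses little time at the lower frequency because most of it is memory latency, so the energy saved outweighs the small slowdown.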
