Author Archives: Uppsala

Uppsala paper accepted in HPCA 21, 2015


Hierarchical Private/Shared Classification: the Key to Simple and Efficient Coherence for Clustered Cache Hierarchies

Alberto Ros and Stefanos Kaxiras


Hierarchical clustered cache designs are becoming an appealing alternative for multicores. Grouping cores and their caches in clusters reduces network congestion by localizing traffic among several hierarchical levels, potentially enabling much higher scalability. While such architectures can be formed recursively by replicating a base design pattern, keeping the whole hierarchy coherent requires more effort and consideration. The reason is that, in hierarchical coherence, even basic operations must be recursive. As a consequence, intermediate-level caches behave both as directories and as leaf caches. This leads to an explosion of states, protocol races, and protocol complexity. While there have been previous efforts to extend directory-based coherence to hierarchical designs, their increased complexity and verification cost are a serious impediment to their adoption.

We aim to address these concerns by encapsulating all hierarchical complexity in a simple function: that of determining when a data block is shared entirely within a cluster (sub-tree of the hierarchy) and is private from the outside. This allows us to eliminate complex recursive operations that span the hierarchy and instead employ simple coherence mechanisms such as self-invalidation and write-through—now restricted to operate within the cluster where a data block is shared.

We examine two inclusivity options and discuss the relation of our approach to the recently proposed Hierarchical-Race-Free (HRF) memory models. Finally, comparisons to both a hierarchical directory-based MOESI protocol and a TokenCMP protocol show that, despite its simplicity, our approach results in competitive performance and significantly decreased network traffic.
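The classification idea in the abstract above (finding the sub-tree of the hierarchy inside which a block is shared) can be illustrated with a small sketch. This is not the paper's mechanism, which the abstract does not detail. The function name `sharing_scope`, the numbering of cores as consecutive leaf ids, and the `fanouts` list describing a regular hierarchy are all assumptions made for illustration.

```python
def sharing_scope(cores, fanouts):
    """Return (level, cluster_id) of the smallest sub-tree holding all `cores`.

    Hypothetical model: cores are leaves numbered 0..N-1, and fanouts[i] is the
    number of children grouped under each node when moving from level i to i+1.
    Level 0 means the block is private to a single core.
    """
    ids = set(cores)
    if not ids:
        raise ValueError("at least one accessing core is required")
    level = 0
    for fanout in fanouts:
        if len(ids) == 1:
            break  # every accessor already falls in the same sub-tree
        # Map each node id to the id of its parent cluster one level up.
        ids = {node // fanout for node in ids}
        level += 1
    if len(ids) != 1:
        raise ValueError("cores do not fit inside the described hierarchy")
    return level, next(iter(ids))


# 16 cores, grouped as 4 clusters of 4 cores under a single root.
FANOUTS = [4, 4]
private_block = sharing_scope({5}, FANOUTS)        # one core only
cluster_block = sharing_scope({4, 6}, FANOUTS)     # both cores in cluster 1
global_block = sharing_scope({1, 9}, FANOUTS)      # cores in different clusters
```

In this toy model, a block whose scope is below the root is "private from the outside" of that cluster, so simple mechanisms such as self-invalidation and write-through would only need to act within that sub-tree.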


Uppsala paper to appear at CGO 2014

Title: Fix the code. Don’t tweak the hardware: A new compiler approach to Voltage-Frequency scaling

About the conference:

The 2014 International Symposium on Code Generation and Optimization (CGO) provides a premier venue to bring together researchers and practitioners working at the interface of hardware and software on a wide range of optimization and code generation techniques and related issues. The conference spans the spectrum from purely static to fully dynamic approaches, including techniques ranging from pure software-based methods to architectural features and support.

Abstract: Traditional compiler approaches to optimize power efficiency aim to adjust voltage and frequency at runtime to match the code characteristics to the hardware (e.g., low frequency for memory-bound code and high frequency for compute-bound code). However, such approaches are constrained by three factors: i) voltage-frequency transitions are too slow to apply at a very fine scale, ii) larger code regions are seldom unequivocally memory- or compute-bound, and iii) the usable voltage range for future technologies is rapidly shrinking. These factors necessitate new approaches to address power efficiency at the code-generation level. This paper proposes one such approach to automatically generate power-efficient code for a decoupled access/execute model in which a program is separated into coarse-grained phases focused on data prefetch (access) and computation (execute). This generates sufficiently large regions of distinctly memory- and compute-bound code to enable effective Dynamic Voltage Frequency Scaling (DVFS). ….

Uppsala paper at ICS 2013

Towards more efficient execution: a decoupled access-execute approach.

ICS ’13: Proceedings of the 27th International ACM Conference on Supercomputing

Abstract: The end of Dennard scaling is expected to shrink the range of DVFS in future nodes, limiting the energy savings of this technique. This paper evaluates how much we can increase the effectiveness of DVFS by using a software decoupled access-execute approach. Decoupling the data access from execution allows us to apply optimal voltage-frequency selection for each phase and therefore improve energy efficiency over standard coupled execution.

The underlying insight of our work is that by decoupling access and execute we can take advantage of the memory-bound nature of the access phase and the compute-bound nature of the execute phase to optimize power efficiency, while maintaining good performance. To demonstrate this, we built a task-based parallel execution infrastructure consisting of: (1) a runtime system to orchestrate the execution, (2) power models to predict optimal voltage-frequency selection at runtime, (3) a modeling infrastructure based on hardware measurements to simulate zero-latency, per-core DVFS, and (4) a hardware measurement infrastructure to verify our model’s accuracy.

Based on real hardware measurements, we project that the combination of decoupled access-execute and DVFS has the potential to improve the energy-delay product (EDP) by 25% without hurting performance. On memory-bound applications we significantly improve performance due to increased memory-level parallelism (MLP) in the access phase and instruction-level parallelism (ILP) in the execute phase. Furthermore, we demonstrate that our method can achieve high performance both in the presence and in the absence of a hardware prefetcher.
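The per-phase voltage-frequency selection described in this abstract can be illustrated with a toy model. This is only a sketch under stated assumptions, not the paper's power model: the frequency list, the power constants, the cubic dynamic-power term, and the workload numbers are all hypothetical, and minimizing EDP (energy times delay) separately for each phase is a simplification.

```python
# Toy model only: every constant below is hypothetical, not measured data.
FREQS_GHZ = [1.0, 1.5, 2.0, 2.5, 3.0]  # assumed available frequency steps
STATIC_W = 20.0                        # assumed frequency-independent power (W)
DYN_W_PER_GHZ3 = 1.0                   # assumed dynamic-power coefficient


def phase_time(gcycles, mem_seconds, f_ghz):
    """Core work scales with frequency; memory stall time is assumed not to."""
    return gcycles / f_ghz + mem_seconds


def phase_power(f_ghz):
    """Static power plus a dynamic term that grows with the cube of frequency."""
    return STATIC_W + DYN_W_PER_GHZ3 * f_ghz ** 3


def phase_edp(gcycles, mem_seconds, f_ghz):
    """EDP = energy * delay = power * time^2 for a single phase."""
    t = phase_time(gcycles, mem_seconds, f_ghz)
    return phase_power(f_ghz) * t * t


def best_frequency(gcycles, mem_seconds):
    """Pick the frequency step with the lowest EDP for one phase."""
    return min(FREQS_GHZ, key=lambda f: phase_edp(gcycles, mem_seconds, f))


# Access phase: little computation, dominated by memory stalls.
access_f = best_frequency(gcycles=1.0, mem_seconds=4.0)
# Execute phase: data already prefetched, so almost pure computation.
execute_f = best_frequency(gcycles=4.0, mem_seconds=0.0)
print(access_f, execute_f)
```

Under these assumed constants the memory-bound access phase selects a low frequency (1.5 GHz) and the compute-bound execute phase selects the highest one (3.0 GHz), which is the qualitative behavior that decoupling is meant to expose.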

Uppsala paper to appear in ISCA 2013

A New Perspective for Efficient Virtual-Cache Coherence. By Stefanos Kaxiras and Alberto Ros.

Accepted to appear in the 40th International Symposium on Computer Architecture (ISCA), 2013.

This paper proposes simple and efficient coherent shared virtual memory (cSVM) for heterogeneous architectures (CPU + GPU) based on the VIPS coherence technology developed in Uppsala. cSVM is highly coveted for heterogeneous architectures as it will simplify programming across different cores and manycore accelerators. In this context, virtual L1 caches can be used to great advantage, e.g., saving energy by eliminating address translation for hits. Unfortunately, multicore virtual-cache coherence is complex and costly because it requires reverse translation for any coherence request directed towards a virtual L1. The reason is the ambiguity of the virtual address due to the possibility of synonyms. In this paper, we take a radically different approach from all prior work, which is focused on reverse translation. This results in a new solution for virtual-cache coherence, significantly less complex and more efficient than prior proposals. Significant area, energy, and performance benefits (43.4%, 19.5%, and 5.4%, respectively) ensue as a result of simplifying the entire multicore memory organization, making this the cutting-edge approach for virtualizing the GPU cores.
