Shared Cache on Multi-core Systems

Students: Jason Cade, Suman Vara, Pawan Gogad

The emergence of multi-core systems opens new opportunities for thread-level parallelism and dramatically increases the performance potential of applications running on these systems. However, the state of the art in performance-enhancing software is far from adequate when it comes to exploiting the hardware features of this complex new architecture. As a result, much of the performance capability of multi-core systems is yet to be realized. Our research addresses one facet of this problem by exploring the relationship between data locality and parallelism in the context of multi-core architectures where one or more levels of cache are shared among the different cores. We are developing a compiler model for determining a profitable synchronization interval for concurrent threads that interact in a producer-consumer fashion.
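To make the notion of a synchronization interval concrete, here is a minimal sketch in Python. It is not our compiler model: the function name `run_pipeline` and the squared-integer workload are hypothetical stand-ins, and Python threads do not expose real shared-cache behavior. The sketch only shows the knob being tuned: the producer hands data to the consumer once every `interval` items, so a larger interval means fewer synchronizations but a longer delay before the consumer touches the produced data.

```python
import queue
import threading


def run_pipeline(n_items, interval):
    """Produce n_items values and consume them in chunks of `interval` items.

    Returns (total, syncs): the consumer's running sum and the number of
    producer-to-consumer handoffs (synchronization points).
    """
    handoff = queue.Queue()
    result = {"total": 0, "syncs": 0}

    def producer():
        chunk = []
        for i in range(n_items):
            chunk.append(i * i)           # stand-in for real producer work
            if len(chunk) == interval:    # synchronize once per interval
                handoff.put(chunk)
                chunk = []
        if chunk:                         # flush a final partial chunk
            handoff.put(chunk)
        handoff.put(None)                 # sentinel: no more data

    def consumer():
        while True:
            chunk = handoff.get()
            if chunk is None:
                break
            result["syncs"] += 1
            result["total"] += sum(chunk)  # stand-in for real consumer work

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result["total"], result["syncs"]
```

Calling `run_pipeline(1000, 100)` and `run_pipeline(1000, 10)` yields the same sum with a different number of handoffs. On a real shared-cache machine, the profitable interval is roughly the largest one for which the produced chunk is still cache-resident when the consumer reads it.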

Autotuning for Scientific Applications

Student: Santosh Sarangkar, Collaboration: Qing Yi, John Mellor-Crummey

Over the last several decades we have witnessed tremendous change in the landscape of computer architecture. New architectures have emerged at a rapid pace with computing capabilities that have often exceeded our expectations. However, the rapid rate of architectural innovation has also been a source of major concern for the high-performance computing community. Each new architecture, or even a new model of a given architecture, has brought with it new features that have added to the complexity of the target platform. As a result, it has become increasingly difficult to exploit the full potential of modern architectures for complex scientific applications. The gap between the theoretical peak and the actual achievable performance has increased with every step of architectural innovation. As multi-core platforms become more pervasive, this performance gap is likely to increase. To deal with the changing nature of computer architecture and its ever-increasing complexity, application developers laboriously re-target code by hand, which often costs many person-months even for a single application. To address this problem, we are developing a software-based strategy that can automatically tune applications to different architectures to deliver portable high performance. Our autotuning strategy combines architecture-aware cost models with heuristic search to find the most suitable optimization parameters for the target platform. The key contribution of this work is a novel strategy for pruning the search space of transformation parameters. By searching over architecture-dependent model parameters instead of the transformation parameters themselves, we show that we can dramatically reduce the size of the search space and still achieve most of the benefits of the best tuning possible with exhaustive search.
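The idea of searching over a model parameter rather than over transformation parameters can be illustrated with a small sketch. Everything here is an assumption made for illustration: the two-dimensional tile-size space, the "three square tiles fit in cache" model in `tile_from_capacity`, and the `synthetic_cost` function, which stands in for actually running and timing a tiled kernel. It is not the cost model or search heuristic used in our work.

```python
import math


def tile_from_capacity(capacity_elems):
    """Toy model: choose a square tile so that three tiles fit in the cache."""
    return max(int(math.sqrt(capacity_elems / 3)), 1)


def synthetic_cost(ti, tj, true_capacity_elems=3 * 48 * 48):
    """Hypothetical stand-in for a timed run of a kernel tiled by (ti, tj)."""
    loop_overhead = 10_000 / ti + 10_000 / tj        # small tiles add overhead
    footprint = 3 * ti * tj
    spill_penalty = max(0, footprint - true_capacity_elems)  # tiles overflow cache
    return loop_overhead + spill_penalty


def exhaustive_search(max_tile=64):
    """Evaluate every (ti, tj) pair: max_tile * max_tile measurements."""
    points = [(ti, tj) for ti in range(1, max_tile + 1)
              for tj in range(1, max_tile + 1)]
    best = min(points, key=lambda p: synthetic_cost(*p))
    return best, synthetic_cost(*best), len(points)


def model_guided_search(candidate_capacities):
    """Search only the effective cache capacity; derive tile sizes from the model."""
    best, evaluations = None, 0
    for cap in candidate_capacities:
        tile = tile_from_capacity(cap)
        cost = synthetic_cost(tile, tile)
        evaluations += 1
        if best is None or cost < best[1]:
            best = ((tile, tile), cost)
    return best[0], best[1], evaluations
```

With candidate capacities `range(1024, 16385, 1024)`, the model-guided search makes 16 measurements, while the exhaustive search over tile sizes 1 to 64 makes 4096. In this synthetic setting both find the same tile; on real hardware the model-guided result would generally be close to, rather than equal to, the exhaustive optimum.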

Compiler-driven Superpage Allocation

Student: Josh Magee

Most modern microprocessor-based systems provide support for superpages at both the hardware and software levels. Judicious use of superpages can significantly cut down the number of TLB misses and improve overall system performance. However, indiscriminate superpage allocation results in page fragmentation and an increased application footprint, which often outweigh the benefits of reduced TLB misses. Previous research has explored policies for smart allocation of superpages from an operating systems perspective. Our work proposes a compiler-based strategy for automatic and profitable memory allocation via superpages. A significant advantage of a compiler-based approach is the availability of data-reuse information within an application. Our strategy employs data-locality analysis to estimate the TLB demands of a program and uses this metric to determine whether the program will benefit from superpage allocation. Apart from its obvious utility in improving TLB performance, this strategy can be used to improve the effectiveness of certain data-layout transformations and can be a useful tool in benchmarking and empirical tuning.
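The trade-off between TLB coverage and wasted memory can be sketched as a simple decision rule. This is an illustrative simplification, not our allocation strategy: the function `recommend_superpages`, the 64-entry TLB, and the 25% waste threshold are hypothetical, and the working-set size is passed in directly where a compiler would derive it from data-locality analysis. The 4 KiB and 2 MiB page sizes are the common x86-64 values.

```python
BASE_PAGE = 4 * 1024           # 4 KiB base page (common on x86-64)
SUPERPAGE = 2 * 1024 * 1024    # 2 MiB superpage (common on x86-64)


def pages_needed(nbytes, page_size):
    """Number of pages of `page_size` bytes needed to hold `nbytes` (ceiling)."""
    return -(-nbytes // page_size)


def recommend_superpages(working_set_bytes, tlb_entries=64,
                         max_waste_fraction=0.25):
    """Return True if superpages look profitable for this working set.

    tlb_entries and max_waste_fraction are assumed, illustrative values.
    """
    base_pages = pages_needed(working_set_bytes, BASE_PAGE)
    if base_pages <= tlb_entries:
        return False           # base pages already fit in the TLB

    super_pages = pages_needed(working_set_bytes, SUPERPAGE)
    allocated = super_pages * SUPERPAGE
    waste = (allocated - working_set_bytes) / allocated
    return waste <= max_waste_fraction   # reject if the footprint grows too much
```

For example, a 128 KiB working set needs only 32 base pages and is rejected, an 8 MiB working set fills four superpages exactly and is accepted, and a working set just over 2 MiB is rejected because rounding up to two superpages would waste nearly half of the allocation.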

OR Algorithms on CMPs

Student: Hammad Rashid, Collaboration: Clara Novoa

News and Activity

03/2009

Paper on exploiting shared caches on CMPs accepted at HPCC09

02/2009

Superpage optimization paper accepted at ACMSE09

12/2008

Josh Magee defends his thesis, Automated Compiler Driven Superpage Allocation and its Applications

12/2008

Jason Cade defends his thesis, Balancing Data Locality and Parallelism for Improved Application Performance on Multi-core Platforms

12/2008

REP funds proposal on mapping OR algorithms onto multi-core systems

11/2008

Paper on model-guided tuning accepted by Journal of High-Performance Systems Architecture

09/2008

Dr. Qasem awarded IBM Faculty Award for work on locality optimizations on shared-cache CMPs

Funding

The HPC research group has received funding from IBM, Rice University, and the Research Enhancement Program (REP) at Texas State University.

Contact

Apan Qasem
Department of Computer Science
Texas State University
601 University Dr
San Marcos, TX 78666

Office: Nueces 218
Phone: (512) 245-0347
Fax: (512) 245-8750
E-mail: apan "AT" txstate · edu