Performance Evaluation 88–89 (2015) 18–36
Characterizing communication and page usage of parallel applications for thread and data mapping
Matthias Diener a,∗, Eduardo H.M. Cruz a, Laércio L. Pilla c, Fabrice Dupros b, Philippe O.A. Navaux a

a Informatics Institute, Federal University of Rio Grande do Sul, Porto Alegre, Brazil
b BRGM, Orléans, France
c Department of Informatics and Statistics, Federal University of Santa Catarina, Florianópolis, Brazil
Received 23 June 2014
Received in revised form 5 February 2015
Accepted 12 March 2015
Available online 20 March 2015
Keywords: NUMA

Abstract
The parallelism in shared-memory systems has increased significantly with the advent and evolution of multicore processors. Current systems include several multicore and multithreaded processors with Non-Uniform Memory Access (NUMA) characteristics.
These architectures require the adoption of two strategies for the efficient execution of parallel applications: (i) threads that share data should be placed in the memory hierarchy such that they execute on cores with shared caches; and (ii) the data that a thread accesses should be placed on the NUMA node where that thread is executing. We refer to these techniques as thread and data mapping, respectively. Both strategies require knowledge of the application’s memory access behavior to identify the communication between threads and processes as well as their usage of memory pages.
In this paper, we introduce a profiling method to establish the suitability of parallel applications for improved mappings that take the memory hierarchy into account, based on a mathematical description of their memory access behaviors. Experiments with a large set of parallel workloads that are based on a variety of parallel APIs (MPI, OpenMP, Pthreads, and MPI+OpenMP) show that most applications can benefit from improved mappings.
We provide a mechanism to compute optimized thread and data mappings. Experimental results with this mechanism showed performance improvements of up to 54% (20% on average), as well as reductions of the energy consumption of up to 37% (11% on average), compared to the default mapping by the operating system. Furthermore, our results show that thread and data mapping have to be performed jointly in order to achieve optimal improvements.

© 2015 Elsevier B.V. All rights reserved.

1. Introduction
Since reaching the limits of Instruction Level Parallelism (ILP), Thread Level Parallelism (TLP) has become important for continuing to increase the performance of shared-memory computer systems. Increases in TLP are accompanied by more complex memory hierarchies, consisting of several private and shared cache levels, as well as multiple memory controllers that introduce Non-Uniform Memory Access (NUMA) characteristics. As a result, the performance of memory accesses depends on the location of the data [1,2]. Accesses to data located in local caches and on the local NUMA node have a higher bandwidth and lower latency than accesses to remote caches or nodes. Improving the locality of memory accesses is therefore an important way to achieve optimal performance on modern architectures.

∗ Corresponding author. E-mail address: firstname.lastname@example.org (M. Diener). http://dx.doi.org/10.1016/j.peva.2015.03.001
For parallel applications, locality can be improved in two ways. First, by executing threads that access shared data close to each other in the memory hierarchy, they can benefit from shared caches and faster intra-chip interconnections [5,6].
We refer to accesses to shared data as communication in this paper, and call an optimized mapping of threads to processing units that takes communication into account a communication-aware thread mapping. Most parallel programming APIs for shared memory, such as OpenMP and Pthreads, directly use memory accesses to communicate. Even implementations of the Message Passing Interface (MPI), which uses explicit functions to communicate, contain optimizations to communicate via shared memory, such as Nemesis for MPICH2 and KNEM for Open MPI. Second, the memory pages that a thread accesses should be placed on NUMA nodes close to where it is executing to reduce the inter-node traffic, as well as to increase the performance of accesses to the main memory. We call this technique data mapping.
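The notion of communication through shared memory accesses can be made concrete with a small sketch. The trace format, block size, and function below are illustrative assumptions, not the paper's actual instrumentation: two threads that touch the same memory block are counted as communicating, and the counts form a thread-to-thread communication matrix.

```python
from collections import defaultdict

BLOCK_SIZE = 64  # hypothetical cache-line-sized memory blocks

def communication_matrix(trace, num_threads):
    """Build a symmetric communication matrix from a trace of
    (thread_id, address) pairs. Two threads communicate when they
    access the same memory block."""
    accessors = defaultdict(set)  # block -> threads that touched it
    for tid, addr in trace:
        accessors[addr // BLOCK_SIZE].add(tid)
    matrix = [[0] * num_threads for _ in range(num_threads)]
    for threads in accessors.values():
        ts = sorted(threads)
        for i, a in enumerate(ts):
            for b in ts[i + 1:]:
                matrix[a][b] += 1
                matrix[b][a] += 1
    return matrix

# Threads 0 and 1 touch the same block, as do threads 2 and 3.
trace = [(0, 0x1000), (1, 0x1008), (2, 0x2000), (3, 0x2010), (0, 0x3000)]
m = communication_matrix(trace, 4)
```

In this sketch, `m[0][1]` and `m[2][3]` are nonzero while `m[0][2]` stays zero, which is exactly the structure a communication-aware thread mapping would exploit.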
The goal of this paper is to characterize the memory access behavior of parallel applications to determine their suitability for mapping and to evaluate their performance improvements using mappings that optimize locality. To characterize the communication, we introduce metrics that describe the spatial, temporal and volume properties of memory accesses to shared memory areas. In contrast to previous work that uses a logical definition of communication [12,13], we use a broader definition that focuses on the architectural impact of these accesses. We characterize the memory page usage of the applications by analyzing the distribution of accesses from different NUMA nodes during the execution. The characterizations are then used to perform an optimized thread and data mapping. Related work in this area mostly treats thread and data mapping as separate problems and only handles one of them [14,15]. We make the case that mapping has to be performed in an integrated way to achieve maximum benefits.
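The page-usage analysis described above can be sketched in a few lines. This is a minimal illustration under assumed inputs (a trace of per-thread accesses, a fixed thread-to-node assignment, and 4 KiB pages), not the paper's mechanism: accesses to each page are counted per NUMA node, and a simple majority policy places each page on the node that accesses it most.

```python
from collections import defaultdict

PAGE_SIZE = 4096  # assume 4 KiB pages

def page_placement(trace, thread_to_node):
    """Count per-page accesses from each NUMA node and place each page
    on the node that accesses it most (a simple majority policy)."""
    counts = defaultdict(lambda: defaultdict(int))  # page -> node -> accesses
    for tid, addr in trace:
        counts[addr // PAGE_SIZE][thread_to_node[tid]] += 1
    return {page: max(nodes, key=nodes.get) for page, nodes in counts.items()}

# Hypothetical layout: threads 0,1 run on node 0; threads 2,3 on node 1.
thread_to_node = {0: 0, 1: 0, 2: 1, 3: 1}
trace = [(0, 0x0), (1, 0x10), (2, 0x20), (2, 0x1000), (3, 0x1040)]
placement = page_placement(trace, thread_to_node)
```

Here page 0 is accessed mostly from node 0 and page 1 exclusively from node 1, so the majority policy places them accordingly; pages whose accesses are spread evenly across nodes are the ones with little to gain from data mapping.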
The main contributions of this paper are:

• We introduce metrics and a methodology to evaluate the communication and page usage of parallel applications running on shared memory architectures and use them to analyze the applications’ potential for thread and data mapping.
• We present a mechanism to employ this information and calculate thread and data mappings that optimize memory access locality.
• We characterize a large set of parallel applications and evaluate their performance and energy consumption improvements using the optimized mappings.
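To illustrate how a communication matrix can drive a thread mapping, the following sketch greedily places the most heavily communicating thread pairs on adjacent cores. The topology is an assumption for the example (consecutive core pairs share a cache), and the greedy strategy is a simplification, not the mechanism evaluated in the paper.

```python
def greedy_thread_mapping(matrix):
    """Greedily assign pairs of heavily communicating threads to
    consecutive cores, assuming each consecutive core pair shares
    a cache (hypothetical topology)."""
    n = len(matrix)
    pairs = sorted(((matrix[a][b], a, b)
                    for a in range(n) for b in range(a + 1, n)),
                   reverse=True)
    placed, mapping, core = set(), {}, 0
    for amount, a, b in pairs:
        if a not in placed and b not in placed:
            mapping[a], mapping[b] = core, core + 1
            placed.update((a, b))
            core += 2
    for t in range(n):  # any thread left without a partner
        if t not in placed:
            mapping[t] = core
            core += 1
    return mapping

# Threads 0/1 and 2/3 communicate heavily with each other.
m = [[0, 10, 1, 0],
     [10, 0, 0, 2],
     [1, 0, 0, 8],
     [0, 2, 8, 0]]
mapping = greedy_thread_mapping(m)
```

With this matrix, threads 0 and 1 end up on one cache-sharing core pair and threads 2 and 3 on another, so the heavy communication stays within shared caches. Combining such a thread mapping with the page placement sketched earlier is the integrated approach the paper argues for.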