This article first appeared in Computer magazine and is brought to you by InfoQ & IEEE Computer Society.
The complexity of large-scale parallel systems necessitates the simultaneous optimization of multiple hardware and software components to meet performance, efficiency, and fault-tolerance goals. A codesign methodology using modeling can benefit systems on the path to exascale computing.
Given the complexity of current systems in terms of scale, memory hierarchy, and interconnection network topology, optimizing for performance alone is a monumental task. Systems with hundreds of thousands of processor cores are now commonplace, memory hierarchies routinely consist of three levels of cache, and interconnection networks with high-dimensional meshes, fat trees, and hierarchical fully connected topologies will dominate in the near term.
Optimization to an architecture is just one of many phases in an application’s life span. However, this common process is considered to be “one way” in that it works for already designed and implemented architectures. Future exascale systems and applications will have additional performance, power, and resiliency requirements that represent a multidimensional optimization challenge. A codesign process can optimize two or more factors in concert to achieve a better solution, ultimately leading to highly tuned exascale systems and workloads.
The Codesign Process
As Figure 1 shows, five factors in the codesign process contribute to the complexity of extreme-scale systems:
- Multiple algorithms used for a calculation can exhibit different computational characteristics. For example, using a uniform resolution of a data grid could lead to an implementation with memory and communication characteristics requiring computation exceeding that of a more complex adaptive mesh refinement implementation.
- The application represents the implementation of a particular method and comprises a component of the overall workload of interest. Using several applications in concert can explore multiple aspects of a physical system simultaneously, such as climate simulations that consider land, sea, and atmospheric components together.
- The programming model underlies the application and defines the way it expresses computation. The two commonly used approaches for expressing parallelism are process-centric, in which interprocess communication is expressed explicitly—for example, the message-passing interface (MPI)—or data-centric, in which access to any data across the system can occur from any location—for example, Global Arrays, Unified Parallel C (UPC), and Co-Array Fortran (CAF).
- The runtime system is responsible for ensuring that application requirements are dynamically satisfied and mapped onto system resources. It includes process and data management and migration.
- The architecture includes the processor core’s microarchitecture, the arrangement of cores within a chip, memory hierarchy, system interconnect, and storage subsystems.
(Click on the image to enlarge it)
No codesign process to date has covered all these factors comprehensively, but some notable cases have addressed a subset of them and their corresponding tradeoffs:
- Optimization of an application to an architecture. For an already implemented system architecture, this process requires mapping application workloads to architecture characteristics. It is commonplace in application development and software engineering but is not considered codesign.
- Optimization of an architecture to an application. Given an already implemented application, this process optimizes architecture design to achieve high performance. An example is the performance-guided design of large-scale IBM Power7 systems;[1] again, this is not considered codesign.
- Codesign for performance. Enabling application and architecture to best match each other unlocks the potential to achieve the highest performance in a new system. An example is the design of an application with the first petaflops system—the IBM Roadrunner.[2]
- Codesign for energy efficiency. Energy consumption by extreme-scale systems will increasingly become a design constraint and a notable cost factor. The largest systems today consume more than 10 megawatts, incurring an operational cost of roughly US$10 million per year. Our experiences in codesign for energy involved an application built to provide information about expected periods of idleness and a runtime aimed at lower power consumption.[3]
- Codesign for fault tolerance. A critical factor in extreme-scale system operation is fault tolerance. Traditional methods that use checkpoint-restart mechanisms, which store partial results to disk and can be pulled back if a fault occurs, might not scale well as system sizes increase. However, the use of selective methods, such as replicating just the critical data across system memory, can help reconstruct state from failed nodes and enable job execution to continue. We use codesign involving the application, programming model, and runtime system to improve resiliency.
Several of our own experiences with codesign methodologies have resulted in improved performance, power efficiency, and reliability.
Performance
We draw on our past experience to design and deploy the IBM Roadrunner, the first petaflops system, which was also the first extreme-scale hybrid architecture design.[4] From the outset, we envisioned Roadrunner as an accelerated system using conventional compute nodes, each one hosting accelerators capable of high processing rates. Three years prior to Roadrunner’s deployment, we explored three different classes of accelerators: Clearspeed multi-SIMD processors, GPUs, and the IBM Cell Broadband-Engine.[5] Although we ultimately used the IBM Cell, it was not a foregone conclusion.
The use of quantitative and accurate performance modeling of full applications facilitated Roadrunner’s codesign.[6] Each model encapsulated the first-order perOptimize formance characteristics of an application and was parameterized in both system and application factors. These factors form the variables that allow exploring the application and system design space simultaneously. Typical model parameters include the application’s processing requirements (computation, memory, and communication) and hardware resource capabilities such as processing rate, data movement to memory, interconnection of accelerator to host, and interconnections between nodes.
Many model inputs use empirical data when system components are available for measurement or simulated results for codesign in future systems. Because they are analytically based, models can readily explore the design space, and they can be used for full applications with low overhead at runtime.
The key aspect in the Roadrunner design was to determine how useful, if at all, accelerators would be on the workload of interest; we did not want to consider just the peak speeds and feeds or simulate kernels that might not be correlated to actual application performance.
The key architectural characteristics include the number of cores available on each accelerator, the capabilities of each core, and the high communication costs incurred when data moves between host and accelerator memories. Figure 2 shows an overview of the architecture. The increased cost of communication becomes clearer, given the transfer path from one accelerator memory to another on a distance node that requires three steps: transfer from accelerator to host, host to host, and host to accelerator. Many current large-scale accelerated systems correspond to the architecture shown in Figure 2, but Roadrunner is the only system with two accelerators, each containing two processors in each compute node.
We used a particular application of interest—deterministic transport, which uses wavefront algorithms—in the codesign.[7] In this application, the processing of a data block can only happen after a processor core receives data from upstream neighbors. Figure 3a shows an overview of this for a small, 16-processor-core example arranged in a logical 4 × 4 two-dimensional array. The colors denote that processors on the same diagonal are handling the same block of their local data, and arrows denote the dependencies (communications) between processors. Each step in the operation involves the processing of a data block and the communication of data to downstream neighbors. In Roadrunner, each accelerator core is considered to be a member of the logical two-dimensional processor array.
To design a wavefront application for an accelerated system, we had to significantly alter the communication structure so that it could overcome the high costs of data transfer between accelerators. We reduced the number of messages between accelerators by combining the required communications within a processor-core domain that occur once for several steps, as Figure 3b illustrates. This also resulted in extra computation steps, which dwarfed the communication savings at a small scale.[8]
As Table 1 shows, running the wavefront application on Roadrunner resulted in performance improvements at larger scales.
(Click on the table to enlarge it)
Although the codesign process determined both the best accelerator configuration and the best implementation of the application for Roadrunner, a further complementary activity happened in the development of the reverse acceleration model[9], also guided by performance modeling. Using this programming model with a small runtime system, we programmed each accelerator core directly with a subset of the MPI. Thus, each accelerator core became an MPI rank, and activities that the accelerator could not handle, such as I/O, were offloaded to the host. The design and implementation of the reverse acceleration model was not part of the Roadrunner codesign process, but its capabilities were assumed to be available during the design exploration.
Power
Current techniques to reduce energy consumption include reducing the power state and throttling down system components such as processor cores during idle periods in the processing flow. The power state is a function of the frequency at which the component operates and is often coupled with the supply voltage. Runtime systems can use dynamic voltage and frequency scaling (DVFS) to alter the power state dynamically but at a cost of thousands of processor-clock cycles.
A key issue is how and where to take advantage of these power-saving features. Researchers have proposed several methods, but their effectiveness depends on the parallel structure of applications—for example, application-transparent techniques for saving energy have successfully exploited the idle periods prior to global operations.[10] Figures 4b and 4c show such idleness, which is caused by placing different computational requirements on individual processors.
In certain applications, idle periods are not associated with load imbalance—for example, the parallel activity of deterministic transport in Figure 4d has a defined pattern in which idleness occurs at different times and for different durations due to the wait for incoming data from other processors. This is not due to any inefficiency in the application but is algorithmic, caused by data dependencies that must be satisfied before computation can proceed. Runtime systems have particular difficulty in automatically identifying this activity as it is not associated with global synchronizations; this is especially true when the behavior is time varying. Application-specific information can identify a priori when processor cores will wait for incoming data and thus can be placed in a low-power state to save energy.
The recently proposed energy template approach[11] is a codesign between the application that quantifies expected periods of idleness on a per-core basis and the runtime system that affects changes in the processor power state. Central to an energy template is an analytical model that guides the runtime system in making informed decisions about when to idle a processor core. The energy template principles are to
- represent an application-specific sequence of active and idle states for each processor core,
- contain the rules associated with the transition from one state to another using a model,
- use triggers that enable state transitions by transparently monitoring application activity, and
- enable the runtime system to make informed decisions about when to alter an individual processor core’s power state.
NW-ICE, a power-instrumented testbed at the Pacific Northwest National Laboratory, illustrates the value of the energy template approach when using a wavefront application. As Figure 4d shows, idleness naturally occurs in this application because processor cores need to wait for data from upstream neighbors. Figure 5 shows the execution time, power consumption, and energy used in the wavefront application on a single rack of NW-ICE containing 28 nodes, each with two sockets of the quad-core Intel Harpertown processor. Only two power states for each core are available on NW-ICE, an idle and an active state, with a differential of 11 watts per core.
At most, we find a 4 percent difference in the time the wavefront application takes running with and without the energy template. This variation is small in performance analysis terms and occurs due to the extra overhead induced by the energy template and runtime. On the other hand, the magnitude of a power savings of 8 percent is significant and must be contrasted with the peak possible savings of 23 percent on the test system. The energy saved—the product of the time and the power—increases with processor count and should further increase in larger systems and with increased difference in power states.[12]
Reliability
Fault tolerance is imperative to realizing the dream of sustainable computing on extreme-scale systems: failure rates increase with system size, yet the cost of fault recovery must be independent of scale. To address these challenges, we designed a fault-tolerance system that involves the application, programming model, and runtime—a clear codesign. Its primary focus is the replication of critical application data across the system to enable continued job execution in the presence of node failures.
The codesign utilizes a task-based approach that applies the Global Arrays (GA) programming model[13], in which a task is designed as a unit of computation with input, output, and dependencies to other tasks. A task’s self-contained properties make one-sided communication a perfect model for describing data dependencies, simplifying the implementation of many applications. GA itself was codesigned in conjunction with the NW-Chem computational chemistry application in the 1990s and is under active development at the Pacific Northwest National Laboratory[14].
The self-containment and data-centric nature of task-based execution models has important implications for fault tolerance. Our approach leverages these properties because it is scalable and the cost of recovery is proportional to the degree of failure. There are several requirements for achieving fault tolerance:
- Critical application data must be accessible even if the compute nodes that contain a portion of the global data become inaccessible. This is possible by using selective replication of critical data that is both read from and written to, and by using Reed-Solomon encoding of read-only data.
- A fault-tolerance management infrastructure must support continued execution[15]. FTMI includes highly reliable fault detection that leverages the features of modern high-performance interconnects, a fault-resilient process manager that enables continued application execution, fault-tolerant synchronization—a necessary and sufficient collective communication primitive for partitioned global address space (PGAS) models, and fault information propagation that reduces detection overhead.
- To ensure correctness, data must be in a consistent state at the time of recovery. This is possible by adding metadata to each task that records state transitions and that the system distributes and replicates as critical data to ensure the task state can be recovered. Equally critical is the component that maintains consistency during data writes. FTMI handles this by synchronizing individual writes to remote locations in memory. Many components use this fault detection, including collective communication, the PGAS data store layer, and the application layer for state transition.
Figure 6 shows an overview of a fault-tolerant system we designed in conjunction with computational chemistry domain requirements; it is also applicable to many applications that use PGAS programming models. Many methods commonly used for computational chemistry trade off time and accuracy, including the coupled-cluster (CC) method[16]. CC’s noniterative nature makes it inherently difficult to save intermediate states and use traditional checkpoint-restart methods for fault tolerance. In a CC calculation, the total amount of critical data is proportional to N4—several orders of magnitude smaller than the complexity of the computation proportional to N7, where N is equal to the number of basis functions and typically ranges between 100 and 500. The computation within CC is task-based and utilizes GA. We codesigned FTMI with the fault-tolerant version of CC, providing necessary tools for fault detection, containment, and large-scale recovery.
FTMI is currently in use on large-scale Cray and InfiniBand systems. Figure 7 shows the execution time of fault-tolerant CC for the Uracil molecule, using 4,096 processes of an AMD/InfiniBand 2310 compute node system at the Pacific Northwest National Laboratory. In the absence of failures, overhead is negligible, making the fault-tolerant implementation highly effective compared with traditional checkpoint-restart methods. In the presence of one node failure, the overhead is 15 percent; the overall recovery cost in checkpoint-restart includes the time to restart all the processes proportional to the system’s size.
A major challenge for high-performance computing as it marches to exascale levels is in providing practical and integrated approaches for codesign that consider performance, power, and fault tolerance in concert, as well as algorithms, applications, programming models, runtime systems, and hardware architecture. Although codesign methodologies represent a challenge for the architect, the exponentially increased degrees of freedom to consider the complexities, scale, and costs will warrant using these methodologies to achieve maximum system productivity, as our examples have demonstrated. Accurate predictive tools including analytical modeling have proven to be ideal vehicles to use in such codesign.
Acknowledgments
This research is supported by the US Department of Energy’s Office of Advanced Scientific Computing Research, grants #59493 and #59542. The Pacific Northwest National Laboratory is operated by Battelle for the US Department of Energy under contract DE-AC05-76RL01830.
About the Authors
Darren J. Kerbyson is a Laboratory Fellow at the Pacific Northwest National Laboratory. His research interests include the analysis and modeling of performance, power, and fault resiliency of future systems. Kerbyson received a PhD in computer science from the University of Warwick, UK. He is a member of IEEE. Contact him at darren.kerbyson@pnnl.gov.
Abhinav Vishnu is a member of the High-Performance Computing Group at the Pacific Northwest National Laboratory. His research interests include designing scalable, energy-efficient, fault-tolerant programming models and communication runtime systems on high-speed interconnects. Vishnu received a PhD in computer science from the Ohio State University. He is a member of IEEE. Contact him at abhinav.vishnu@pnnl.gov.
Kevin J. Barker is a member of the High-Performance Computing Group at the Pacific Northwest National Laboratory. His research interests include developing performance modeling methodologies and tools for high-performance computing systems and workloads as well as understanding how current and future architectures impact performance. Barker received a PhD in computer science from the College of William and Mary. Contact him at kevin.barker@pnnl.gov.
Adolfy Hoisie is a Laboratory Fellow and director of the Center for Advanced Architectures at the Pacific Northwest National Laboratory. His research focuses on performance analysis and modeling of systems and applications, areas in which he has published extensively. He is a member of IEEE. Contact him at adolfy.hoisie@pnnl.gov.
Computer, the flagship publication of the IEEE Computer Society, publishes highly acclaimed peer-reviewed articles written for and by professionals representing the full spectrum of computing technology from hardware to software and from current research to new applications. Providing more technical substance than trade magazines and more practical ideas than research journals. Computer delivers useful information that is applicable to everyday work environments.
References
[1] K.J. Barker, A. Hoisie, and D.J. Kerbyson, “An Early Performance Evaluation of Power7-IH HPC Systems,” Proc. ACM/IEEE Conf. Supercomputing (SC 11), IEEE CS, 2011, pp. 1-11
[2] K.J. Barker et al., “Entering the Petaflop Era: The Architecture and Performance of Roadrunner,” Proc. ACM/IEEE Conf. Supercomputing (SC 08), IEEE CS, 2008, pp. 1-11
[3] D.J. Kerbyson, A. Vishnu, and K.J. Barker, “Energy Templates: Exploiting Application Information to Save Energy,” Proc. IEEE Int’l Conf. Cluster Computing (Cluster 11), IEEE CS, 2011, pp. 1-9.
[4] K.J. Barker et al., “Entering the Petaflop Era: The Architecture and Performance of Roadrunner,” Proc. ACM/IEEE Conf. Supercomputing (SC 08), IEEE CS, 2008, pp. 1-11
[5] D.J. Kerbyson and A. Hoisie, “A Performance Analysis of Two-Level Heterogeneous Processing Systems on Wavefront Algorithms,” Unique Chips and Systems, E. John and J. Rubio, eds., CRC Press, 2007, pp. 259-279
[6] K.J. Barker et al., “Using Performance Modeling to Design Large-Scale Systems,” Computer, Nov. 2009, pp. 42-49.
[7] K.R. Koch, R.S. Baker, and R.E. Alcouffe, “Solution of the First-Order Form of the 3D Discrete Ordinates Equation on a Massively Parallel Processor,” Trans. Am. Nuclear Soc., vol. 65, 1992, pp. 198-199.
[8] D.J. Kerbyson, M. Lang, and S. Pakin, “Adapting Wave-Front Algorithms to Efficiently Utilize Systems with Deep Communication Hierarchies,” Parallel Computing, vol. 6, 2011, pp. 550-561
[9] S. Pakin, M. Lang, and D.J. Kerbyson, “The Reverse Acceleration Model for Programming Petascale Hybrid Systems,” IBM J. Research and Development, vol. 53, no. 5, 2009, pp. 8:1-8:15
[10] B. Rountree et al., “Adagio: Making DVS Practical for Complex HPC Applications,” Proc. 23rd Int’l Conf. Supercomputing (ICS 09), ACM, 2009, pp. 460-469
[11] D.J. Kerbyson, A. Vishnu, and K.J. Barker, “Energy Templates: Exploiting Application Information to Save Energy,” Proc. IEEE Int’l Conf. Cluster Computing (Cluster 11), IEEE CS, 2011, pp. 1-9.
[12] B. Rountree et al., “Adagio: Making DVS Practical for Complex HPC Applications,” Proc. 23rd Int’l Conf. Supercomputing (ICS 09), ACM, 2009, pp. 460-469.
[13] J. Nieplocha, R.J. Harrison, and R.J. Littlefield, “Global Arrays: A Nonuniform Memory Access Programming Model for High-Performance Computers,” J. Supercomputing, vol. 10, no. 2, 1996, pp. 169-189
[14] J. Nieplocha et al., “Advances, Applications and Performance of the Global Arrays Shared Memory Programming Toolkit,” Int’l J. High-Performance Computing and Applications, vol. 20, no. 2, 2006, pp. 203-231
[15] A. Vishnu et al., “Fault-Tolerant Communication Runtime Support for Data Centric Programming Models,” Proc. Int’l Conf. High-Performance Computing (HiPC 10), IEEE, 2010, pp. 1-9.
[16] H.V. Dam, A. Vishnu, and W.D. Jong, “Designing a Scalable Fault Tolerance Model for Computational Chemistry: A Case Study with Coupled Cluster Perturbative Triples,” J. Chemical Theory and Computation, vol. 7, no. 1, 2011, pp. 66-75.