Contents
Foreword
Preface
1 Introduction
1.1 Why Parallel Architecture
1.1.1 Application Trends
1.1.2 Technology Trends
1.1.3 Architectural Trends
1.1.4 Supercomputers
1.1.5 Summary
1.2 Convergence of Parallel Architectures
1.2.1 Communication Architecture
1.2.2 Shared Address Space
1.2.3 Message Passing
1.2.4 Convergence
1.2.5 Data Parallel Processing
1.2.6 Other Parallel Architectures
1.2.7 A Generic Parallel Architecture
1.3 Fundamental Design Issues
1.3.1 Communication Abstraction
1.3.2 Programming Model Requirements
1.3.3 Communication and Replication
1.3.4 Performance
1.3.5 Summary
1.4 Concluding Remarks
1.5 Historical References
1.6 Exercises
2 Parallel Programs
2.1 Parallel Application Case Studies
2.1.1 Simulating Ocean Currents
2.1.2 Simulating the Evolution of Galaxies
2.1.3 Visualizing Complex Scenes Using Ray Tracing
2.1.4 Mining Data for Associations
2.2 The Parallelization Process
2.2.1 Steps in the Process
2.2.2 Parallelizing Computation versus Data
2.2.3 Goals of the Parallelization Process
2.3 Parallelization of an Example Program
2.3.1 The Equation Solver Kernel
2.3.2 Decomposition
2.3.3 Assignment
2.3.4 Orchestration under the Data Parallel Model
2.3.5 Orchestration under the Shared Address Space Model
2.3.6 Orchestration under the Message-Passing Model
2.4 Concluding Remarks
2.5 Exercises
3 Programming for Performance
3.1 Partitioning for Performance
3.1.1 Load Balance and Synchronization Wait Time
3.1.2 Reducing Inherent Communication
3.1.3 Reducing the Extra Work
3.1.4 Summary
3.2 Data Access and Communication in a Multimemory System
3.2.1 A Multiprocessor as an Extended Memory Hierarchy
3.2.2 Artifactual Communication in the Extended Memory Hierarchy
3.2.3 Artifactual Communication and Replication: The Working Set Perspective
3.3 Orchestration for Performance
3.3.1 Reducing Artifactual Communication
3.3.2 Structuring Communication to Reduce Cost
3.4 Performance Factors from the Processor's Perspective
3.5 The Parallel Application Case Studies: An In-Depth Look
3.5.1 Ocean
3.5.2 Barnes-Hut
3.5.3 Raytrace
3.5.4 Data Mining
3.6 Implications for Programming Models
3.6.1 Naming
3.6.2 Replication
3.6.3 Overhead and Granularity of Communication
3.6.4 Block Data Transfer
3.6.5 Synchronization
3.6.6 Hardware Cost and Design Complexity
3.6.7 Performance Model
3.6.8 Summary
3.7 Concluding Remarks
3.8 Exercises
4 Workload-Driven Evaluation
4.1 Scaling Workloads and Machines
4.1.1 Basic Measures of Multiprocessor Performance
4.1.2 Why Worry about Scaling?
4.1.3 Key Issues in Scaling
4.1.4 Scaling Models and Speedup Measures
4.1.5 Impact of Scaling Models on the Equation Solver Kernel
4.1.6 Scaling Workload Parameters
4.2 Evaluating a Real Machine
4.2.1 Performance Isolation Using Microbenchmarks
4.2.2 Choosing Workloads
4.2.3 Evaluating a Fixed-Size Machine
4.2.4 Varying Machine Size
4.2.5 Choosing Performance Metrics
4.3 Evaluating an Architectural Idea or Trade-off
4.3.1 Multiprocessor Simulation
4.3.2 Scaling Down Problem and Machine Parameters for Simulation
4.3.3 Dealing with the Parameter Space: An Example Evaluation
4.3.4 Summary
4.4 Illustrating Workload Characterization
4.4.1 Workload Case Studies
4.4.2 Workload Characteristics
4.5 Concluding Remarks
4.6 Exercises
5 Shared Memory Multiprocessors
5.1 Cache Coherence
5.1.1 The Cache Coherence Problem
5.1.2 Cache Coherence through Bus Snooping
5.2 Memory Consistency
5.2.1 Sequential Consistency
5.2.2 Sufficient Conditions for Preserving Sequential Consistency
5.3 Design Space for Snooping Protocols
5.3.1 A Three-State (MSI) Write-Back Invalidation Protocol
5.3.2 A Four-State (MESI) Write-Back Invalidation Protocol
5.3.3 A Four-State (Dragon) Write-Back Update Protocol
5.4 Assessing Protocol Design Trade-offs
5.4.1 Methodology
5.4.2 Bandwidth Requirement under the MESI Protocol
5.4.3 Impact of Protocol Optimizations
5.4.4 Trade-Offs in Cache Block Size
5.4.5 Update-Based versus Invalidation-Based Protocols
5.5 Synchronization
5.5.1 Components of a Synchronization Event
5.5.2 Role of the User and System
5.5.3 Mutual Exclusion
5.5.4 Point-to-Point Event Synchronization
5.5.5 Global (Barrier) Event Synchronization
5.5.6 Synchronization Summary
5.6 Implications for Software
5.7 Concluding Remarks
5.8 Exercises
6 Snoop-Based Multiprocessor Design
6.1 Correctness Requirements
6.2 Base Design: Single-Level Caches with an Atomic Bus
6.2.1 Cache Controller and Tag Design
6.2.2 Reporting Snoop Results
6.2.3 Dealing with Write Backs
6.2.4 Base Organization
6.2.5 Nonatomic State Transitions
6.2.6 Serialization
6.2.7 Deadlock
6.2.8 Livelock and Starvation
6.2.9 Implementing Atomic Operations
6.3 Multilevel Cache Hierarchies
6.3.1 Maintaining Inclusion
6.3.2 Propagating Transactions for Coherence in the Hierarchy
6.4 Split-Transaction Bus
6.4.1 An Example Split-Transaction Design
6.4.2 Bus Design and Request-Response Matching
6.4.3 Snoop Results and Conflicting Requests
6.4.4 Flow Control
6.4.5 Path of a Cache Miss
6.4.6 Serialization and Sequential Consistency
6.4.7 Alternative Design Choices
6.4.8 Split-Transaction Bus with Multilevel Caches
6.4.9 Supporting Multiple Outstanding Misses from a Processor
6.5 Case Studies: SGI Challenge and Sun Enterprise 6000
6.5.1 SGI Powerpath-2 System Bus
6.5.2 SGI Processor and Memory Subsystems
6.5.3 SGI I/O Subsystem
6.5.4 SGI Challenge Memory System Performance
6.5.5 Sun Gigaplane System Bus
6.5.6 Sun Processor and Memory Subsystem
6.5.7 Sun I/O Subsystem
6.5.8 Sun Enterprise Memory System Performance
6.5.9 Application Performance
6.6 Extending Cache Coherence
6.6.1 Shared Cache Designs
6.6.2 Coherence for Virtually Indexed Caches
6.6.3 Translation Lookaside Buffer Coherence
6.6.4 Snoop-Based Cache Coherence on Rings
6.6.5 Scaling Data and Snoop Bandwidth in Bus-Based Systems
6.7 Concluding Remarks
6.8 Exercises
7 Scalable Multiprocessors
7.1 Scalability
7.1.1 Bandwidth Scaling
7.1.2 Latency Scaling
7.1.3 Cost Scaling
7.1.4 Physical Scaling
7.1.5 Scaling in a Generic Parallel Architecture
7.2 Realizing Programming Models
7.2.1 Primitive Network Transactions
7.2.2 Shared Address Space
7.2.3 Message Passing
7.2.4 Active Messages
7.2.5 Common Challenges
7.2.6 Communication Architecture Design Space
7.3 Physical DMA
7.3.1 Node-to-Network Interface
7.3.2 Implementing Communication Abstractions
7.3.3 A Case Study: nCUBE/2
7.3.4 Typical LAN Interfaces
7.4 User-Level Access
7.4.1 Node-to-Network Interface
7.4.2 Case Study: Thinking Machines CM-5
7.4.3 User-Level Handlers
7.5 Dedicated Message Processing
7.5.1 Case Study: Intel Paragon
7.5.2 Case Study: Meiko CS-2
7.6 Shared Physical Address Space
7.6.1 Case Study: CRAY T3D
7.6.2 Case Study: CRAY T3E
7.6.3 Summary
7.7 Clusters and Networks of Workstations
7.7.1 Case Study: Myrinet SBUS Lanai
7.7.2 Case Study: PCI Memory Channel
7.8 Implications for Parallel Software
7.8.1 Network Transaction Performance
7.8.2 Shared Address Space Operations
7.8.3 Message-Passing Operations
7.8.4 Application-Level Performance
7.9 Synchronization
7.9.1 Algorithms for Locks
7.9.2 Algorithms for Barriers
7.10 Concluding Remarks
7.11 Exercises
8 Directory-Based Cache Coherence
8.1 Scalable Cache Coherence
8.2 Overview of Directory-Based Approaches
8.2.1 Operation of a Simple Directory Scheme
8.2.2 Scaling
8.2.3 Alternatives for Organizing Directories
8.3 Assessing Directory Protocols and Trade-Offs
8.3.1 Data Sharing Patterns for Directory Schemes
8.3.2 Local versus Remote Traffic
8.3.3 Cache Block Size Effects
8.4 Design Challenges for Directory Protocols
8.4.1 Performance
8.4.2 Correctness
8.5 Memory-Based Directory Protocols: The SGI Origin System
8.5.1 Cache Coherence Protocol
8.5.2 Dealing with Correctness Issues
8.5.3 Details of Directory Structure
8.5.4 Protocol Extensions
8.5.5 Overview of the Origin2000 Hardware
8.5.6 Hub Implementation
8.5.7 Performance Characteristics
8.6 Cache-Based Directory Protocols: The Sequent NUMA-Q
8.6.1 Cache Coherence Protocol
8.6.2 Dealing with Correctness Issues
8.6.3 Protocol Extensions
8.6.4 Overview of NUMA-Q Hardware
8.6.5 Protocol Interactions with SMP Node
8.6.6 IQ-Link Implementation
8.6.7 Performance Characteristics
8.6.8 Comparison Case Study: The HAL S1 Multiprocessor
8.7 Performance Parameters and Protocol Performance
8.8 Synchronization
8.8.1 Performance of Synchronization Algorithms
8.8.2 Implementing Atomic Primitives
8.9 Implications for Parallel Software
8.10 Advanced Topics
8.10.1 Reducing Directory Storage Overhead
8.10.2 Hierarchical Coherence
8.11 Concluding Remarks
8.12 Exercises
9 Hardware/Software Trade-Offs
9.1 Relaxed Memory Consistency Models
9.1.1 The System Specification
9.1.2 The Programmer's Interface
9.1.3 The Translation Mechanism
9.1.4 Consistency Models in Real Multiprocessor Systems
9.2 Overcoming Capacity Limitations
9.2.1 Tertiary Caches
9.2.2 Cache-Only Memory Architectures (COMA)
9.3 Reducing Hardware Cost
9.3.1 Hardware Access Control with a Decoupled Assist
9.3.2 Access Control through Code Instrumentation
9.3.3 Page-Based Access Control: Shared Virtual Memory
9.3.4 Access Control through Language and Compiler Support
9.4 Putting It All Together: A Taxonomy and Simple COMA
9.4.1 Putting It All Together: Simple COMA and Stache
9.5 Implications for Parallel Software
9.6 Advanced Topics
9.6.1 Flexibility and Address Constraints in CC-NUMA Systems
9.6.2 Implementing Relaxed Memory Consistency in Software
9.7 Concluding Remarks
9.8 Exercises
10 Interconnection Network Design
10.1 Basic Definitions
10.2 Basic Communication Performance
10.2.1 Latency
10.2.2 Bandwidth
10.3 Organizational Structure
10.3.1 Links
10.3.2 Switches
10.3.3 Network Interfaces
10.4 Interconnection Topologies
10.4.1 Fully Connected Network
10.4.2 Linear Arrays and Rings
10.4.3 Multidimensional Meshes and Tori
10.4.4 Trees
10.4.5 Butterflies
10.4.6 Hypercubes
10.5 Evaluating Design Trade-Offs in Network Topology
10.5.1 Unloaded Latency
10.5.2 Latency under Load
10.6 Routing
10.6.1 Routing Mechanisms
10.6.2 Deterministic Routing
10.6.3 Deadlock Freedom
10.6.4 Virtual Channels
10.6.5 Up*-Down* Routing
10.6.6 Turn-Model Routing
10.6.7 Adaptive Routing
10.7 Switch Design
10.7.1 Ports
10.7.2 Internal Datapath
10.7.3 Channel Buffers
10.7.4 Output Scheduling
10.7.5 Stacked Dimension Switches
10.8 Flow Control
10.8.1 Parallel Computer Networks versus LANs and WANs
10.8.2 Link-Level Flow Control
10.8.3 End-to-End Flow Control
10.9 Case Studies
10.9.1 CRAY T3D Network
10.9.2 IBM SP-1, SP-2 Network
10.9.3 Scalable Coherent Interface
10.9.4 SGI Origin Network
10.9.5 Myricom Network
10.10 Concluding Remarks
10.11 Exercises
11 Latency Tolerance
11.1 Overview of Latency Tolerance
11.1.1 Latency Tolerance and the Communication Pipeline
11.1.2 Approaches
11.1.3 Fundamental Requirements, Benefits, and Limitations
11.2 Latency Tolerance in Explicit Message Passing
11.2.1 Structure of Communication
11.2.2 Block Data Transfer
11.2.3 Precommunication
11.2.4 Proceeding Past Communication in the Same Thread
11.2.5 Multithreading
11.3 Latency Tolerance in a Shared Address Space
11.3.1 Structure of Communication
11.4 Block Data Transfer in a Shared Address Space
11.4.1 Techniques and Mechanisms
11.4.2 Policy Issues and Trade-Offs
11.4.3 Performance Benefits
11.5 Proceeding Past Long-Latency Events
11.5.1 Proceeding Past Writes
11.5.2 Proceeding Past Reads
11.5.3 Summary
11.6 Precommunication in a Shared Address Space
11.6.1 Shared Address Space without Caching of Shared Data
11.6.2 Cache-Coherent Shared Address Space
11.6.3 Performance Benefits
11.6.4 Summary
11.7 Multithreading in a Shared Address Space
11.7.1 Techniques and Mechanisms
11.7.2 Performance Benefits
11.7.3 Implementation Issues for the Blocked Scheme
11.7.4 Implementation Issues for the Interleaved Scheme
11.7.5 Integrating Multithreading with Multiple-Issue Processors
11.8 Lockup-Free Cache Design
11.9 Concluding Remarks
11.10 Exercises
12 Future Directions
12.1 Technology and Architecture
12.1.1 Evolutionary Scenario
12.1.2 Hitting a Wall
12.1.3 Potential Breakthroughs
12.2 Applications and System Software
12.2.1 Evolutionary Scenario
12.2.2 Hitting a Wall
12.2.3 Potential Breakthroughs
Appendix: Parallel Benchmark Suites
A.1 ScaLAPACK
A.2 TPC
A.3 SPLASH
A.4 NAS Parallel Benchmarks
A.5 PARKBENCH
A.6 Other Ongoing Efforts
References
Index