Preface
Chapter 1: Parallel Machines and Computations
1.1 The Evolution of Parallel Architectures
1.1.1 Parallelism in Sequential Computers
1.1.2 Vector or SIMD Computers
1.1.3 Multiprocessors or MIMD Computers
1.2 Interconnection Networks
1.3 Application of Architectural Parallelism
1.4 Getting Started in SIMD and MIMD Programming
1.5 Parallelism in Algorithms
1.6 Conclusion
1.7 Bibliographic Notes
Chapter 2: Potential for Parallel Computations
2.1 Parameters Characterizing Algorithm Parallelism
2.2 Prefix Problem
2.3 Parallel Prefix Algorithms
2.3.1 Upper/Lower Parallel Prefix Algorithm
2.3.2 Odd/Even Parallel Prefix Algorithm
2.3.3 Ladner and Fischer's Parallel Prefix
2.4 Characterizing Algorithm Behavior for Large Problem Size
2.5 Programming Parallel Prefix
2.6 Speedup and Efficiency of Parallel Algorithms
2.7 The Performance Perspective
2.7.1 Factors That Influence Performance
2.7.2 A Simple Performance Model--Amdahl's Law
2.7.3 Average Execution Rate
2.8 Conclusion
2.9 Bibliographic Notes
Chapter 3: Vector Algorithms and Architectures
3.1 Vector and Matrix Algorithms
3.2 A Vector Architecture--Single Instruction Multiple Data
3.3 An SIMD Instruction Set
3.3.1 Registers and Memories of an SIMD Computer
3.3.2 Vector, Control Unit, and Cooperative Instructions
3.3.3 Data-Dependent Conditional Operations
3.3.4 Vector Length and Strip Mining
3.3.5 Routing Data Among the PEs
3.4 The Prime Memory System
3.5 Use of the PE Index to Solve Storage Layout Problems
3.6 SIMD Language Constructs--Fortran 90
3.6.1 Arrays and Array Sections
3.6.2 Array Assignment and Array Expressions
3.6.3 Fortran 90 Array Intrinsic Functions
3.6.4 Examples of SIMD Operations in Fortran 90
3.7 Pipelined SIMD Vector Computers
3.7.1 Pipelined SIMD Processor Structure
Processor/Memory Interaction
Number and Types of Pipelines
Implementation of Arithmetic
3.7.2 The Memory Interface of a Pipelined SIMD Computer
3.7.3 Performance of Pipelined SIMD Computers
3.8 Vector Architecture Summary
3.9 Bibliographic Notes
Chapter 4: MIMD Computers or Multiprocessors
4.1 Shared Memory and Message-Passing Architectures
4.1.1 Mixed-Type Multiprocessor Architectures
4.1.2 Characteristics of Shared Memory and Message Passing
4.1.3 Switching Topologies for Message Passing Architectures
4.1.4 Direct and Indirect Networks
4.1.5 Classification of Real Systems
4.2 Overview of Shared Memory Multiprocessor Programming
4.2.1 Data Sharing and Process Management
4.2.2 Synchronization
4.2.3 Atomicity and Synchronization
4.2.4 Work Distribution
4.2.5 Many Processes Executing One Program
4.3 Shared Memory Programming Alternatives and Scope
4.3.1 Process Management--Starting, Stopping, and Hierarchy
4.3.2 Data Access by Parallel Processes
4.3.3 Work Distribution
4.3.4 Multiprocessor Synchronization
Atomicity
Hardware and Software Synchronization Mechanisms
Fairness and Mutual Exclusion
4.4 A Shared Memory Multiprocessor Programming Language
4.4.1 The OpenMP Language Extension
Execution Model
Process Control
Parallel Scope of Variables
Work Distribution
Synchronization and Memory Consistency
4.4.2 The OpenMP Fortran Application Program Interface (API)
Constructs of the OpenMP Fortran API
4.4.3 OpenMP Fortran Examples and Discussion
4.5 Pipelined MIMD--Multithreading
4.6 Summary and Conclusions
4.7 Bibliographic Notes
Chapter 5: Distributed Memory Multiprocessors
5.1 Distributing Data and Operations Among Processor/Memory Pairs
5.2 Programming with Message Passing
5.2.1 The Communicating Sequential Processes (CSP) Language
5.2.2 A Distributed Memory Programming Example: Matrix Multiply
5.3 Characterization of Communication
5.3.1 Point-to-Point Communications
5.3.2 Variable Classes in a Distributed Memory Program
5.3.3 High-Level Communication Operations
5.3.4 Distributed Gauss Elimination with High-Level Communications
5.3.5 Process Topology Versus Processor Topology
5.4 The Message Passing Interface, MPI
5.4.1 Basic Concepts in MPI
Communicator Structure
The Envelope
The Data
Point-to-Point Communication Concepts
Collective Communications Concepts
5.4.2 An Example MPI Program--Matrix Multiplication
5.5 Hardware Managed Communication--Distributed Cache
5.5.1 Cache Coherence
5.5.2 Shared Memory Consistency
5.6 Conclusion--Shared Versus Distributed Memory Multiprocessors
5.7 Bibliographic Notes
Chapter 6: Interconnection Networks
6.1 Network Characteristics
6.2 Permutations
6.3 Static Networks
6.3.1 Mesh
6.3.2 Ring
6.3.3 Tree
6.3.4 Cube Networks
6.3.5 Performance
6.4 Dynamic Networks
6.4.1 Bus
6.4.2 Crossbar
6.4.3 Multistage Interconnection Networks (MINs)
Benes Network
Butterfly Network
Omega Network
6.4.4 Combining Networks--Mutual Exclusion Free Synchronization
6.4.5 Performance
6.5 Conclusion
6.6 Bibliographic Notes
Chapter 7: Data Dependence and Parallelism
7.1 Discovering Parallel Operations in Sequential Code
7.2 Variables with Complex Names
7.2.1 Nested Loops
7.2.2 Variations on the Array Reference Disambiguation Problem
7.3 Sample Compiler Techniques
7.3.1 Loop Transformations
7.3.2 Loop Restructuring
7.3.3 Loop Replacement Transformations
7.3.4 Anti- and Output Dependence Removal Transformations
7.4 Data Flow Principles
7.4.1 Data Flow Concepts
7.4.2 Graphical Representation of Data Flow Computations
7.4.3 Data Flow Conditionals
7.4.4 Data Flow Iteration
7.4.5 Data Flow Function Application and Recursion
7.4.6 Structured Values in Data Flow--Arrays
7.5 Data Flow Architectures
7.5.1 The MIT Static Data Flow Architecture
7.5.2 Dynamic Data Flow Computers
Manchester Data Flow Computer
The MIT Tagged-Token Data Flow Machine
7.5.3 Issues to Be Addressed by Data Flow Machines
7.6 Systolic Arrays
7.7 Conclusion
7.8 Bibliographic Notes
Chapter 8: Implementing Synchronization and Data Sharing
8.1 The Character of Information Conveyed by Synchronization
8.2 Synchronizing Different Kinds of Cooperative Computations
8.2.1 One Producer with One or More Consumers
8.2.2 Global Reduction
8.2.3 Global Prefix
8.2.4 Cooperative Update of a Partitioned Structure
8.2.5 Managing a Shared Task Set
8.2.6 Cooperative List Manipulation
8.2.7 Parallel Access Queue Using Fetch&Add
8.2.8 Histogram--Fine Granularity, Data-Dependent Synchronization
8.3 Waiting Mechanisms
8.3.1 Hardware Waiting
8.3.2 Software Waiting
8.3.3 Multilevel Waiting
8.4 Mutual Exclusion Using Atomic Read and Write
8.5 Proving a Synchronization Implementation Correct
8.5.1 Implementing Produce/Consume Using Locks
8.5.2 Temporal Logic
8.5.3 Proof of Correctness
8.6 Alternative Implementations of Synchronization--Barrier
8.6.1 Features of Barrier Synchronization
8.6.2 Characterization of Barrier Implementations
8.7 Conclusion
8.8 Bibliographic Notes
Chapter 9: Parallel Processor Performance
9.1 Amdahl's Law Revisited
9.1.1 The Effect of Work Granularity on Amdahl's Law
9.1.2 Least Squares Estimation of Amdahl's Law Parameters
9.2 Parameterized Execution Time
9.2.1 Pipelined Vector Machine Performance
9.2.2 Pipelined Multiprocessor Performance
Program Sections with Restricted Parallelism
Programs Requiring Critical Sections for Parallel Execution
Granularity Correction to the Critical Section Model
9.2.3 Multiple Pipelined Multiprocessors
9.3 Performance of Barrier Synchronization
9.3.1 Accounting for Barrier Performance
9.3.2 Instrumentation for Barrier Measurement
9.3.3 Examples of Barrier Performance Measurement
9.4 Statistical Models for Static and Dynamic Parallel Loops
9.4.1 Dynamic Scheduling Model
Dynamically Scheduled Iterations of Equal Length
Dynamically Scheduled Iterations of Different Lengths
9.4.2 Static Scheduling Model
Prescheduled Iterations of Equal Length
Prescheduled Iterations of Different Lengths
9.4.3 Comparison with Experimental Results
9.5 Conclusions
9.6 Bibliographic Notes
Chapter 10: Temporal Behavior of Parallel Programs
10.1 Temporal Characterization of Cache Behavior
10.1.1 A Temporal Locality Metric for Cache Behavior
10.1.2 Example Application of the Locality Metric to Bubble Sort
10.2 Read Sharing in Multiprocessors with Distributed Caches
10.2.1 A Simple Example of Read Sharing
10.2.2 The KSR-1 Architecture
10.2.3 Read Multiplicity Metric
10.2.4 Experiments
10.2.5 Programmed Poststore and Prefetch
10.3 Message Waiting in Message Passing Multiprocessors
10.4 Conclusion
10.5 Bibliographic Notes
Chapter 11: Parallel I/O
11.1 The Parallel I/O Problem
11.1.1 Data Dependence and I/O
11.1.2 I/O Format Conversion
11.1.3 Numerical Examples of I/O Latency and Bandwidth Requirements
11.2 Hardware for Parallel I/O
11.2.1 Control of the Primary Memory Side of the Transfer
11.2.2 I/O Channel Concurrency
11.2.3 Peripheral Device Parallelism
11.3 Parallel Access Disk Arrays--RAID
11.4 Parallel Formatted I/O in Shared Memory Multiprocessors
11.4.1 Parallel Input Using C I/O Routines fread() and sscanf()
Application Independent Input Operations
Application Dependent Input Operations
11.4.2 Parallel Output Using C I/O Routines sprintf() and fwrite()
Application Dependent Output Code
Application Independent Part of the Output Code
11.5 Collective I/O in Multiprocessors--MPI-IO
11.5.1 MPI-2 I/O Concepts
11.5.2 MPI-IO Examples
Cyclically Mapped Read from a Striped Disk Array
MPI-IO for Distributed Matrix Multiply
11.6 Conclusions
11.7 Bibliographic Notes
Appendix A: Routines of the MPI Message Passing Library
A.1 Point-to-Point Communications Routines
A.2 Collective Communications Routines
A.3 MPI Data Types and Constructors
A.4 Communicators, Process Groups, and Topologies
A.5 MPI Environment and Error Handling
A.6 Summary and MPI-2 Extensions
Appendix B: Synchronization Mechanisms
B.1 Hardware Level Synchronization
B.2 Language Level Synchronization
B.3 Waiting Mechanisms
Bibliography
Index