Contents
Foreword
Preface
1 Introduction
1.1 Why Parallel Architecture
1.1.1 Application Trends
1.1.2 Technology Trends
1.1.3 Architectural Trends
1.1.4 Supercomputers
1.1.5 Summary
1.2 Convergence of Parallel Architectures
1.2.1 Communication Architecture
1.2.2 Shared Address Space
1.2.3 Message Passing
1.2.4 Convergence
1.2.5 Data Parallel Processing
1.2.6 Other Parallel Architectures
1.2.7 A Generic Parallel Architecture
1.3 Fundamental Design Issues
1.3.1 Communication Abstraction
1.3.2 Programming Model Requirements
1.3.3 Communication and Replication
1.3.4 Performance
1.3.5 Summary
1.4 Concluding Remarks
1.5 Historical References
1.6 Exercises
2 Parallel Programs
2.1 Parallel Application Case Studies
2.1.1 Simulating Ocean Currents
2.1.2 Simulating the Evolution of Galaxies
2.1.3 Visualizing Complex Scenes Using Ray Tracing
2.1.4 Mining Data for Associations
2.2 The Parallelization Process
2.2.1 Steps in the Process
2.2.2 Parallelizing Computation versus Data
2.2.3 Goals of the Parallelization Process
2.3 Parallelization of an Example Program
2.3.1 The Equation Solver Kernel
2.3.2 Decomposition
2.3.3 Assignment
2.3.4 Orchestration under the Data Parallel Model
2.3.5 Orchestration under the Shared Address Space Model
2.3.6 Orchestration under the Message-Passing Model
2.4 Concluding Remarks
2.5 Exercises
3 Programming for Performance
3.1 Partitioning for Performance
3.1.1 Load Balance and Synchronization Wait Time
3.1.2 Reducing Inherent Communication
3.1.3 Reducing the Extra Work
3.1.4 Summary
3.2 Data Access and Communication in a Multimemory System
3.2.1 A Multiprocessor as an Extended Memory Hierarchy
3.2.2 Artifactual Communication in the Extended Memory Hierarchy
3.2.3 Artifactual Communication and Replication: The Working Set Perspective
3.3 Orchestration for Performance
3.3.1 Reducing Artifactual Communication
3.3.2 Structuring Communication to Reduce Cost
3.4 Performance Factors from the Processor's Perspective
3.5 The Parallel Application Case Studies: An In-Depth Look
3.5.1 Ocean
3.5.2 Barnes-Hut
3.5.3 Raytrace
3.5.4 Data Mining
3.6 Implications for Programming Models
3.6.1 Naming
3.6.2 Replication
3.6.3 Overhead and Granularity of Communication
3.6.4 Block Data Transfer
3.6.5 Synchronization
3.6.6 Hardware Cost and Design Complexity
3.6.7 Performance Model
3.6.8 Summary
3.7 Concluding Remarks
3.8 Exercises
4 Workload-Driven Evaluation
4.1 Scaling Workloads and Machines
4.1.1 Basic Measures of Multiprocessor Performance
4.1.2 Why Worry about Scaling?
4.1.3 Key Issues in Scaling
4.1.4 Scaling Models and Speedup Measures
4.1.5 Impact of Scaling Models on the Equation Solver Kernel
4.1.6 Scaling Workload Parameters
4.2 Evaluating a Real Machine
4.2.1 Performance Isolation Using Microbenchmarks
4.2.2 Choosing Workloads
4.2.3 Evaluating a Fixed-Size Machine
4.2.4 Varying Machine Size
4.2.5 Choosing Performance Metrics
4.3 Evaluating an Architectural Idea or Trade-off
4.3.1 Multiprocessor Simulation
4.3.2 Scaling Down Problem and Machine Parameters for Simulation
4.3.3 Dealing with the Parameter Space: An Example Evaluation
4.3.4 Summary
4.4 Illustrating Workload Characterization
4.4.1 Workload Case Studies
4.4.2 Workload Characteristics
4.5 Concluding Remarks
4.6 Exercises
5 Shared Memory Multiprocessors
5.1 Cache Coherence
5.1.1 The Cache Coherence Problem
5.1.2 Cache Coherence through Bus Snooping
5.2 Memory Consistency
5.2.1 Sequential Consistency
5.2.2 Sufficient Conditions for Preserving Sequential Consistency
5.3 Design Space for Snooping Protocols
5.3.1 A Three-State (MSI) Write-Back Invalidation Protocol
5.3.2 A Four-State (MESI) Write-Back Invalidation Protocol
5.3.3 A Four-State (Dragon) Write-Back Update Protocol
5.4 Assessing Protocol Design Trade-offs
5.4.1 Methodology
5.4.2 Bandwidth Requirement under the MESI Protocol
5.4.3 Impact of Protocol Optimizations
5.4.4 Trade-Offs in Cache Block Size
5.4.5 Update-Based versus Invalidation-Based Protocols
5.5 Synchronization
5.5.1 Components of a Synchronization Event
5.5.2 Role of the User and System
5.5.3 Mutual Exclusion
5.5.4 Point-to-Point Event Synchronization
5.5.5 Global (Barrier) Event Synchronization
5.5.6 Synchronization Summary
5.6 Implications for Software
5.7 Concluding Remarks
5.8 Exercises
6 Snoop-Based Multiprocessor Design
6.1 Correctness Requirements
6.2 Base Design: Single-Level Caches with an Atomic Bus
6.2.1 Cache Controller and Tag Design
6.2.2 Reporting Snoop Results
6.2.3 Dealing with Write Backs
6.2.4 Base Organization
6.2.5 Nonatomic State Transitions
6.2.6 Serialization
6.2.7 Deadlock
6.2.8 Livelock and Starvation
6.2.9 Implementing Atomic Operations
6.3 Multilevel Cache Hierarchies
6.3.1 Maintaining Inclusion
6.3.2 Propagating Transactions for Coherence in the Hierarchy
6.4 Split-Transaction Bus
6.4.1 An Example Split-Transaction Design
6.4.2 Bus Design and Request-Response Matching
6.4.3 Snoop Results and Conflicting Requests
6.4.4 Flow Control
6.4.5 Path of a Cache Miss
6.4.6 Serialization and Sequential Consistency
6.4.7 Alternative Design Choices
6.4.8 Split-Transaction Bus with Multilevel Caches
6.4.9 Supporting Multiple Outstanding Misses from a Processor
6.5 Case Studies: SGI Challenge and Sun Enterprise 6000
6.5.1 SGI Powerpath-2 System Bus
6.5.2 SGI Processor and Memory Subsystems
6.5.3 SGI I/O Subsystem
6.5.4 SGI Challenge Memory System Performance
6.5.5 Sun Gigaplane System Bus
6.5.6 Sun Processor and Memory Subsystem
6.5.7 Sun I/O Subsystem
6.5.8 Sun Enterprise Memory System Performance
6.5.9 Application Performance
6.6 Extending Cache Coherence
6.6.1 Shared Cache Designs
6.6.2 Coherence for Virtually Indexed Caches
6.6.3 Translation Lookaside Buffer Coherence
6.6.4 Snoop-Based Cache Coherence on Rings
6.6.5 Scaling Data and Snoop Bandwidth in Bus-Based Systems
6.7 Concluding Remarks
6.8 Exercises
7 Scalable Multiprocessors
7.1 Scalability
7.1.1 Bandwidth Scaling
7.1.2 Latency Scaling
7.1.3 Cost Scaling
7.1.4 Physical Scaling
7.1.5 Scaling in a Generic Parallel Architecture
7.2 Realizing Programming Models
7.2.1 Primitive Network Transactions
7.2.2 Shared Address Space
7.2.3 Message Passing
7.2.4 Active Messages
7.2.5 Common Challenges
7.2.6 Communication Architecture Design Space
7.3 Physical DMA
7.3.1 Node-to-Network Interface
7.3.2 Implementing Communication Abstractions
7.3.3 A Case Study: nCUBE/2
7.3.4 Typical LAN Interfaces
7.4 User-Level Access
7.4.1 Node-to-Network Interface
7.4.2 Case Study: Thinking Machines CM-5
7.4.3 User-Level Handlers
7.5 Dedicated Message Processing
7.5.1 Case Study: Intel Paragon
7.5.2 Case Study: Meiko CS-2
7.6 Shared Physical Address Space
7.6.1 Case Study: CRAY T3D
7.6.2 Case Study: CRAY T3E
7.6.3 Summary
7.7 Clusters and Networks of Workstations
7.7.1 Case Study: Myrinet SBUS Lanai
7.7.2 Case Study: PCI Memory Channel
7.8 Implications for Parallel Software
7.8.1 Network Transaction Performance
7.8.2 Shared Address Space Operations
7.8.3 Message-Passing Operations
7.8.4 Application-Level Performance
7.9 Synchronization
7.9.1 Algorithms for Locks
7.9.2 Algorithms for Barriers
7.10 Concluding Remarks
7.11 Exercises
8 Directory-Based Cache Coherence
8.1 Scalable Cache Coherence
8.2 Overview of Directory-Based Approaches
8.2.1 Operation of a Simple Directory Scheme
8.2.2 Scaling
8.2.3 Alternatives for Organizing Directories
8.3 Assessing Directory Protocols and Trade-Offs
8.3.1 Data Sharing Patterns for Directory Schemes
8.3.2 Local versus Remote Traffic
8.3.3 Cache Block Size Effects
8.4 Design Challenges for Directory Protocols
8.4.1 Performance
8.4.2 Correctness
8.5 Memory-Based Directory Protocols: The SGI Origin System
8.5.1 Cache Coherence Protocol
8.5.2 Dealing with Correctness Issues
8.5.3 Details of Directory Structure
8.5.4 Protocol Extensions
8.5.5 Overview of the Origin2000 Hardware
8.5.6 Hub Implementation
8.5.7 Performance Characteristics
8.6 Cache-Based Directory Protocols: The Sequent NUMA-Q
8.6.1 Cache Coherence Protocol
8.6.2 Dealing with Correctness Issues
8.6.3 Protocol Extensions
8.6.4 Overview of NUMA-Q Hardware
8.6.5 Protocol Interactions with SMP Node
8.6.6 IQ-Link Implementation
8.6.7 Performance Characteristics
8.6.8 Comparison Case Study: The HAL S1 Multiprocessor
8.7 Performance Parameters and Protocol Performance
8.8 Synchronization
8.8.1 Performance of Synchronization Algorithms
8.8.2 Implementing Atomic Primitives
8.9 Implications for Parallel Software
8.10 Advanced Topics
8.10.1 Reducing Directory Storage Overhead
8.10.2 Hierarchical Coherence
8.11 Concluding Remarks
8.12 Exercises
9 Hardware/Software Trade-Offs
9.1 Relaxed Memory Consistency Models
9.1.1 The System Specification
9.1.2 The Programmer's Interface
9.1.3 The Translation Mechanism
9.1.4 Consistency Models in Real Multiprocessor Systems
9.2 Overcoming Capacity Limitations
9.2.1 Tertiary Caches
9.2.2 Cache-Only Memory Architectures (COMA)
9.3 Reducing Hardware Cost
9.3.1 Hardware Access Control with a Decoupled Assist
9.3.2 Access Control through Code Instrumentation
9.3.3 Page-Based Access Control: Shared Virtual Memory
9.3.4 Access Control through Language and Compiler Support
9.4 Putting It All Together: A Taxonomy and Simple COMA
9.4.1 Putting It All Together: Simple COMA and Stache
9.5 Implications for Parallel Software
9.6 Advanced Topics
9.6.1 Flexibility and Address Constraints in CC-NUMA Systems
9.6.2 Implementing Relaxed Memory Consistency in Software
9.7 Concluding Remarks
9.8 Exercises
10 Interconnection Network Design
10.1 Basic Definitions
10.2 Basic Communication Performance
10.2.1 Latency
10.2.2 Bandwidth
10.3 Organizational Structure
10.3.1 Links
10.3.2 Switches
10.3.3 Network Interfaces
10.4 Interconnection Topologies
10.4.1 Fully Connected Network
10.4.2 Linear Arrays and Rings
10.4.3 Multidimensional Meshes and Tori
10.4.4 Trees
10.4.5 Butterflies
10.4.6 Hypercubes
10.5 Evaluating Design Trade-Offs in Network Topology
10.5.1 Unloaded Latency
10.5.2 Latency under Load
10.6 Routing
10.6.1 Routing Mechanisms
10.6.2 Deterministic Routing
10.6.3 Deadlock Freedom
10.6.4 Virtual Channels
10.6.5 Up*-Down* Routing
10.6.6 Turn-Model Routing
10.6.7 Adaptive Routing
10.7 Switch Design
10.7.1 Ports
10.7.2 Internal Datapath
10.7.3 Channel Buffers
10.7.4 Output Scheduling
10.7.5 Stacked Dimension Switches
10.8 Flow Control
10.8.1 Parallel Computer Networks versus LANs and WANs
10.8.2 Link-Level Flow Control
10.8.3 End-to-End Flow Control
10.9 Case Studies
10.9.1 CRAY T3D Network
10.9.2 IBM SP-1, SP-2 Network
10.9.3 Scalable Coherent Interface
10.9.4 SGI Origin Network
10.9.5 Myricom Network
10.10 Concluding Remarks
10.11 Exercises
11 Latency Tolerance
11.1 Overview of Latency Tolerance
11.1.1 Latency Tolerance and the Communication Pipeline
11.1.2 Approaches
11.1.3 Fundamental Requirements, Benefits, and Limitations
11.2 Latency Tolerance in Explicit Message Passing
11.2.1 Structure of Communication
11.2.2 Block Data Transfer
11.2.3 Precommunication
11.2.4 Proceeding Past Communication in the Same Thread
11.2.5 Multithreading
11.3 Latency Tolerance in a Shared Address Space
11.3.1 Structure of Communication
11.4 Block Data Transfer in a Shared Address Space
11.4.1 Techniques and Mechanisms
11.4.2 Policy Issues and Trade-Offs
11.4.3 Performance Benefits
11.5 Proceeding Past Long-Latency Events
11.5.1 Proceeding Past Writes
11.5.2 Proceeding Past Reads
11.5.3 Summary
11.6 Precommunication in a Shared Address Space
11.6.1 Shared Address Space without Caching of Shared Data
11.6.2 Cache-Coherent Shared Address Space
11.6.3 Performance Benefits
11.6.4 Summary
11.7 Multithreading in a Shared Address Space
11.7.1 Techniques and Mechanisms
11.7.2 Performance Benefits
11.7.3 Implementation Issues for the Blocked Scheme
11.7.4 Implementation Issues for the Interleaved Scheme
11.7.5 Integrating Multithreading with Multiple-Issue Processors
11.8 Lockup-Free Cache Design
11.9 Concluding Remarks
11.10 Exercises
12 Future Directions
12.1 Technology and Architecture
12.1.1 Evolutionary Scenario
12.1.2 Hitting a Wall
12.1.3 Potential Breakthroughs
12.2 Applications and System Software
12.2.1 Evolutionary Scenario
12.2.2 Hitting a Wall
12.2.3 Potential Breakthroughs
Appendix: Parallel Benchmark Suites
A.1 ScaLAPACK
A.2 TPC
A.3 SPLASH
A.4 NAS Parallel Benchmarks
A.5 PARKBENCH
A.6 Other Ongoing Efforts
References
Index