List of Figures
List of Tables
Preface
Acknowledgments
Trademarks
Chapter 1 Architecture and Implementation
1.1 Analogy: Piano Architecture
1.2 Types of Computer Languages
1.3 Why Study Assembly Language?
1.4 Prefixes for Binary Multiples
1.5 Instruction Set Architectures
1.6 The Life Cycle of Computer Architectures
1.6.1 The 32-Bit Intel Architecture and Its Predecessors
1.6.2 The Alpha™ Architecture and Its Predecessors
1.6.3 The Itanium Architecture and Its Predecessors
1.6.4 The Naming of Architectures and Implementations
1.7 SQUARES: A First Programming Example
1.7.1 C, FORTRAN, and COBOL
1.7.2 Assembly Language for Itanium Architecture
1.8 Review of Number Systems
1.8.1 Positional Coefficients and Weights
1.8.2 Binary and Hexadecimal Representations
1.8.3 Signed Integers
Summary
References
Exercises
Chapter 2 Computer Structures and Data Representations
2.1 Computer Structures
2.1.1 The Central Processing Unit
2.1.2 The Memory
2.1.3 The Input/Output System
2.2 Instruction Execution
2.3 Classes of Instruction Set Architectures
2.4 Migration to 64-Bit Architectures
2.5 Itanium Information Units and Data Types
2.5.1 Integers
2.5.2 Floating-Point Numbers
2.5.3 Alphanumeric Characters
Summary
References
Exercises
Chapter 3 The Program Assembler and Debugger
3.1 Programming Environments
3.2 Program Development Steps
3.3 Comparing Variants of a Source File
3.4 Assembler Statement Types
3.4.1 Statement Format
3.4.2 Symbolic Addresses
3.4.3 Classes of Assembly Language Operators
3.5 The Functions of a Symbolic Assembler
3.5.1 Constants
3.5.2 Symbols or Identifiers
3.5.3 Storage Allocation
3.5.4 The Location Counter
3.5.5 Expressions
3.5.6 Control Statements
3.5.7 Elements of a Listing File
3.6 The Assembly Process
3.7 The Linking Process
3.8 The Program Debugger
3.8.1 Capabilities of Debugger Programs
3.8.2 Running SQUARES using gdb (Linux® and HP-UX®)
3.8.3 Running SQUARES using adb (HP-UX)
3.8.4 Examples of Debugger Commands
3.9 Conventions for Writing Programs
Summary
References
Exercises
Chapter 4 Itanium Instruction Formats and Addressing
4.1 Overview of Itanium Instruction Formats
4.1.1 Instruction Bundles
4.1.2 Instruction Bit-Field Layouts
4.1.3 Classes of Itanium Instructions
4.2 Integer Arithmetic Instructions
4.2.1 Addition and Subtraction
4.2.2 Arithmetic Overflow
4.2.3 Shift Left and Add Instruction
4.2.4 Special-Case Arithmetic Operations
4.2.5 Multiplication of 16-Bit Signed Integers
4.2.6 Full-Width Multiplication and Division
4.3 Bit Encoding for Itanium Instructions
4.4 HEXNUM: Using Arithmetic Instructions
4.5 Data Access Instructions
4.5.1 Itanium Cache Structures
4.5.2 Integer Store Instructions
4.5.3 Integer Load Instructions
4.5.4 Move Long Immediate Instruction
4.5.5 Accessing Simple Record Structures
4.5.6 Access to Specialized CPU Registers
4.6 Other ALU Instructions
4.6.1 Sign-Extend Instruction
4.6.2 Zero-Extend Instruction
4.6.3 Instructions for Quantities Less Than 64 Bits in Width
4.7 DOTPROD: Using Data Access Instructions
4.8 Itanium Addressing Modes
4.8.1 Immediate Addressing
4.8.2 Register Direct Addressing
4.8.3 Register Indirect Addressing
4.8.4 Autoincrement Addressing
4.8.5 Summary of Itanium Addressing Modes
4.8.6 Addressing Details in Previous Programs
4.9 Addressing in Other Architectures
4.9.1 Modes Built on Register Indirect Addressing
4.9.2 Modes Built on Displacement Addressing
4.9.3 Comparison of Modes Across Architectures
Summary
References
Exercises
Chapter 5 Comparison, Branches, and Predication
5.1 Hardware Basis for Control of Flow
5.1.1 Condition Codes
5.1.2 State-Management Approaches
5.1.3 Predicate Registers
5.2 Integer Compare Instructions
5.2.1 Signed Comparison and Equality
5.2.2 Unsigned Comparison
5.3 Program Branching
5.3.1 Ordinary Branch Instructions
5.3.2 Timing Considerations for Branches
5.3.3 If...Then...Else Structures
5.3.4 Loop Structures
5.3.5 Branch Addressing Range
5.3.6 Locality and Program Performance
5.4 DOTLOOP: Using a Counted Loop
5.5 Stops, Instruction Groups, and Performance
5.5.1 Study of Stops and Groups in DOTLOOP
5.5.2 Simplified Rules for Data Dependency
5.5.3 How Itanium Assemblers Handle Stops
5.5.4 Local Labels for Loops
5.5.5 Loops, Branches, and Overall Performance
5.6 DOTCLOOP: Using the Loop Count Register
5.7 Other Structured Programming Constructs
5.7.1 Unconditional Compare Instructions
5.7.2 Nested If...Then...Else Structures
5.7.3 Multiway Branching
5.7.4 Simple Case Structures
5.8 MAXIMUM: Using Conditional Instructions
Summary
References
Exercises
Chapter 6 Logical Operations, Bit-Shifts, and Bytes
6.1 Logical Functions
6.1.1 Boolean Functions of Two Variables
6.1.2 Logical Instructions
6.1.3 Applications of Logical Functions
6.1.4 The Single-Bit Test Instruction
6.1.5 Parallel (Logical) Conditions
6.1.6 The Logical Basis of Addition
6.2 HEXNUM2: Using Logical Masks
6.3 Bit and Field Operations
6.3.1 Shift Instructions
6.3.2 Applications of Shift Operations
6.3.3 The Shift Right Pair Instruction
6.3.4 Extract and Deposit Instructions
6.4 SCANTEXT: Processing Bytes
6.5 Integer Multiplication and Division
6.5.1 Booth's Algorithm for Multiplication
6.5.2 Unsigned Multiplication
6.5.3 Division Using Known Reciprocals
6.6 DECNUM: Converting an Integer to Decimal Format
6.7 Using C for ASCII Input and Output
6.7.1 GETPUT: Encapsulating C Functions
6.7.2 IO_C: A Simple Test Program
6.7.3 Additional Concepts
6.8 BACKWARD: Using Byte Manipulations
Summary
References
Exercises
Chapter 7 Subroutines, Procedures, and Functions
7.1 Memory Stacks
7.1.1 Stack Addressing for CISC Architectures
7.1.2 Stack Addressing for Load/Store Architectures
7.1.3 Stack Addressing for Itanium Architecture
7.1.4 User-Defined Stacks
7.2 DECNUM2: Using Stack Operations
7.3 Register Stacks
7.3.1 SPARC Register Windows
7.3.2 Itanium Register Stack
7.3.3 The alloc Instruction
7.3.4 The Register Stack Engine (RSE)
7.3.5 Banked Registers
7.4 Program Segmentation
7.4.1 Source-Level Modularity
7.4.2 Traditional Subroutines
7.4.3 Coroutines
7.4.4 Procedures and Functions
7.4.5 Shared Library Functions
7.5 Calling Conventions
7.5.1 Register Contention and Conventions
7.5.2 Call and Return Branch Instructions
7.5.3 Argument Passing: Locations
7.5.4 Argument Passing: Methods
7.5.5 Prologues and Epilogues
7.5.6 The regstk Directive
7.6 DECNUM3 and BOOTH: Making a Function
7.6.1 Defining the Interface
7.6.2 BOOTH: The Callable Function
7.6.3 DECNUM3: The Test Program
7.6.4 Position-Independent Code
7.7 Integer Quotients and Remainders
7.7.1 Routines Used by a High-Level Language
7.7.2 Open-Source Routines from Intel Corporation
7.8 RANDOM: A Callable Function
7.8.1 Choosing an Algorithm
7.8.2 RANDOM: Developing the Function
7.8.3 High-Level Language Calling Programs
Summary
References
Exercises
Chapter 8 Floating-Point Operations
8.1 Parallels Between Integer and Floating-Point Instructions
8.2 Representations of Floating-Point Values
8.2.1 IEEE Special Values
8.2.2 Values in Itanium Floating-Point Registers
8.3 Copying Floating-Point Data
8.3.1 Floating-Point Store Instructions
8.3.2 Floating-Point Load Instructions
8.3.3 Floating-Point Load Pair Instruction
8.3.4 Floating-Point Pseudoinstructions for Register-Register Copying
8.3.5 Floating-Point Merge Instruction
8.4 Floating-Point Arithmetic Instructions
8.4.1 Addition, Subtraction, and Multiplication
8.4.2 Fused Multiply-Add and Multiply-Subtract Instructions
8.4.3 Normalization as Another Special Case
8.4.4 Maximum and Minimum Operations
8.4.5 Rounding, Exceptions, and Floating-Point Control
8.5 HORNER: Evaluating a Polynomial
8.6 Predication Based on Floating-Point Values
8.6.1 Floating-Point Compare Instruction
8.6.2 Floating-Point Class Instruction
8.7 Integer Operations in Floating-Point Execution Units
8.7.1 Data Conversion Instructions
8.7.2 Integer Multiplication Instructions
8.7.3 Multiplication Strategies
8.7.4 Floating-Point Logical Instructions
8.8 Approximations for Reciprocals and Square Roots
8.8.1 Floating-Point Reciprocal Approximation
8.8.2 Reciprocal Square Root Approximation
8.8.3 Floating-Point Division
8.8.4 Open-Source Routines from Intel Corporation
8.9 APPROXPI: Using Floating-Point Instructions
Summary
References
Exercises
Chapter 9 Input and Output of Text
9.1 File Systems
9.1.1 Unix® I/O Software
9.1.2 Linux® I/O Software
9.2 Keyboard and Display I/O
9.2.1 Unformatted Line I/O
9.2.2 Formatted I/O
9.3 SCANTERM: Using C Standard I/O
9.4 SORTSTR: Sorting Strings
9.5 Text File I/O
9.5.1 Directory-Level Access
9.5.2 Unformatted Line I/O
9.5.3 Formatted I/O
9.6 SCANFILE: Input and Output with Files
9.7 SORTINT: Sorting Integers from a File
9.8 Binary Files
Summary
References
Exercises
Chapter 10 Performance Considerations
10.1 Processor-Level Parallelism
10.1.1 Simplified Instruction Pipeline
10.1.2 Superscalar Pipelining
10.1.3 Itanium 2 Processor Pipelines
10.1.4 Pipeline Hazards
10.2 Instruction-Level Parallelism
10.2.1 RISC Approaches
10.2.2 The VLIW Idea
10.2.3 EPIC as a Way Forward
10.3 Explicit Parallelism in the Itanium Processors
10.3.1 Instruction Templates
10.3.2 Data Dependency and Speculation
10.3.3 Control Dependency and Speculation
10.3.4 Combined Control and Data Speculation
10.4 Software-Pipelined Loops
10.4.1 Traditional Loop Unrolling
10.4.2 Software Pipelining
10.4.3 Rotating Registers
10.4.4 Loop Phases
10.4.5 Branch Instructions for Software Pipelines
10.5 Modulo Scheduling a Loop
10.5.1 DOTCTOP: Implementation-Independent Schedule
10.5.2 DOTCTOP2: Itanium 2 Processor Schedule
10.5.3 Further Considerations
10.6 Program Optimization Factors
10.6.1 Instruction Size
10.6.2 Addressing Mode
10.6.3 Instruction Power
10.6.4 Program Size
10.6.5 Prefetching Lines into Cache
10.6.6 Use of Inline Functions
10.6.7 Instruction Reordering
10.6.8 Recursion and Related Factors
10.7 Fibonacci Numbers
10.7.1 FIB1: Function Using Recursion
10.7.2 FIB2: Function Without Recursion
10.7.3 FIB3: Function Using the Register Stack
10.7.4 TESTFIB: Showing the Cost of Recursion
Summary
References
Exercises
Chapter 11 Looking at Output from Compilers
11.1 Compilers for RISC-like Systems
11.1.1 Optimization Levels for Open-Source Compilers
11.1.2 Optimization Levels for Intel Compilers
11.1.3 Optimization Levels for HP-UX Compilers
11.1.4 Additional Optimization Possibilities
11.2 Compiling a Simple Program
11.2.1 Comparing Output from gcc and ecc (Linux)
11.2.2 Comparing Output from gcc and g77 (Linux)
11.2.3 Comparing Output from cc_bundled and f90 (HP-UX)
11.3 Optimizing a Simple Program
11.3.1 Comparing Levels -O1 and -O2 for g77 (Linux)
11.3.2 Compiler Messages
11.3.3 Loop Length and Optimization with f90 (HP-UX)
11.4 Inline Optimizations
11.5 Profile-Guided or Other Optimizations
11.6 Debugging Optimized Programs
11.7 Recursion for Fibonacci Numbers Revisited
Summary
References
Exercises
Chapter 12 Parallel Operations
12.1 Classification of Computing Systems
12.2 Integer Parallel Operations
12.3 Applications to Integer Multiplication
12.3.1 32x32-Bit Sources Giving 32-Bit Unsigned Product
12.3.2 32x32-Bit Sources Giving 64-Bit Unsigned Product
12.4 Opportunities and Challenges
12.5 Floating-Point Parallel Operations
12.6 Semaphore Support for Parallel Processes
12.6.1 Previous Architectures
12.6.2 Itanium Architecture
Summary
References
Exercises
Chapter 13 Variations Among Implementations
13.1 Why Implementations Change
13.1.1 Demands and Opportunities
13.1.2 Implications of Moore's Law
13.1.3 Anticipating a Long Lifetime for an Architecture
13.2 How Implementations Change
13.3 The Original Itanium Processor
13.3.1 Comparison to the Itanium 2 Processor
13.3.2 Cache Hierarchy
13.3.3 Execution Units and Issue Ports
13.3.4 Pipelines
13.3.5 Latency Factors
13.3.6 Branch Prediction
13.3.7 Other Differences and Features
13.4 A Major Role for Software
13.4.1 New Architectures
13.4.2 New Implementations
13.4.3 New Instructions or More Registers
13.5 IA-32 Instruction Set Mode
13.6 Determining Extensions and Implementation Version
Summary
References
Exercises
Appendix A Command-Line Environments
References
Exercises
Appendix B Suggested System Resources
B.1 System Hardware
B.1.1 Itanium Workstation or Server
B.1.2 Ski Simulator on an IA-32 Linux System
B.1.3 Ski Simulator on a Linux Virtual Machine
B.1.4 Other Simulators
B.2 System Software
B.2.1 Linux
B.2.2 HP-UX
B.2.3 The Ski Simulator
B.2.4 64-Bit Windows
B.2.5 FreeBSD
B.2.6 OpenVMS
B.3 Desktop Client Access Software
B.3.1 Linux Personal Computers
B.3.2 Macintosh Personal Computers
B.3.3 Windows Personal Computers
References
Appendix C Itanium Instruction Set
C.1 Instructions Listed by Function
C.2 Instructions Listed by Assembler Opcode
References
Appendix D Itanium Registers and Their Uses
D.1 Instruction Pointer
D.2 General Registers and NaT Bits
D.3 Predicate Registers
D.4 Branch Registers
D.5 Floating-Point Registers
D.6 Application Registers
D.7 State Management Registers
D.8 System Information Registers
D.9 System Control Registers
References
Appendix E Conditional Assembly and Macros (GCC Assembler)
E.1 Interference from Explicit Stops
E.2 Repeat Blocks
E.2.1 Simple Repeat Blocks
E.2.2 Indefinite Repeat Blocks Using the .irp Directive
E.2.3 Indefinite Repeat Blocks Using the .irpc Directive
E.3 Conditional Assembly
E.4 Macro Processing
E.4.1 Defining a Macro
E.4.2 Invoking a Macro
E.4.3 Processing of Positional Parameters
E.4.4 Processing of Default Values and Keyword Parameters
E.4.5 Processing of String Parameters
E.5 Using Labels with Macros
E.6 Recursive Macros
E.7 Object File Sections
E.8 MONEY: A Macro Illustrating Sections
Summary
References
Exercises
Appendix F Inline Assembly
F.1 HP-UX C Compilers
F.2 GCC Compiler for Linux
F.3 Intel Compilers for Linux
References
Bibliography
Answers and Hints for Selected Exercises
Chapter 1
Chapter 2
Chapter 3
Chapter 4
Chapter 5
Chapter 6
Chapter 7
Chapter 8
Chapter 9
Chapter 10
Chapter 11
Chapter 12
Chapter 13
About the Authors
Index