CS252 Graduate Computer Architecture Lecture 10 Dependency Prediction
Review: Branch Prediction/Speculation
Review: Memory Disambiguation
Memory Dependence Prediction
Said another way: Could we do better
Premise: Past indicates Future
How well does an infinite tracking work
How to track Store Sets in reality
How well does this do
Load Value Predictability
Load Value Prediction Table
Load Classification Table (LCT)
Accuracy of LCT
Constant Value Unit
Load Value Architecture
Administrivia
One important tool is RAMP Gold: FAST Emulation of new Hardware
Tessellation: The Exploded OS
Implementing the Space-Time Graph
Sample of what could make good projects
Projects using Quantum CAD Flow
Data Value Prediction
Data Value Predictability
Computational Predictors
Context Based Predictors
Which is better
How predictable are data items
Correlation of Predicted Sets
Number of unique values
General Idea: Confidence Prediction
Discussion of papers: The Alpha 21264 Microprocessor
Limits to ILP
Overcoming Limits
Limits to ILP
Limits to ILP HW Model comparison
Upper Limit to ILP: Ideal Machine (Figure 3.1)
Limits to ILP HW Model comparison
More Realistic HW: Window Impact Figure 3.2
Limits to ILP HW Model comparison
More Realistic HW: Branch Impact Figure 3.3
Misprediction Rates
Limits to ILP HW Model comparison
More Realistic HW: Renaming Register Impact (N int + N fp) Figure 3.5
Limits to ILP HW Model comparison
More Realistic HW: Memory Address Alias Impact Figure 3.6
Limits to ILP HW Model comparison
Realistic HW: Window Impact (Figure 3.7)
How to Exceed ILP Limits of this study
HW v. SW to increase ILP
Discussion of papers: Complexity-effective superscalar processors
Performance beyond single thread ILP
Thread Level Parallelism (TLP)
And in conclusion

CS252 Graduate Computer Architecture, Lecture 10: Dependency Prediction (Cont), Data Prediction and Confidence, ILP Limits. February 23rd, 2011

John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~kubitron/cs252

Review: Branch Prediction/Speculation

[Pipeline diagram: PC, Fetch, Decode & Rename, Reorder Buffer, Commit; Execute stage with Branch Unit, ALU, MEM, Store Buffer, D$, and Reg. File; branch prediction steers fetch, and predictors are updated as branches resolve.]

2

2/23/2011

CS252-S11 lecture 10

Review: Memory Disambiguation

Question: given a load that follows a store in program order, are the two related? We are trying to detect RAW hazards through memory. Stores commit in order (through the ROB), so there are no WAR/WAW memory hazards.
Implementation: keep a queue of stores in program order, and watch the position of new loads relative to existing stores. Typically this is a different buffer than the ROB! (It could be the ROB, which has the right properties, but that is too expensive.)
When the address for the load is available, check the store queue:
If any store prior to the load is still waiting for its address, stall the load.
If the load address matches an earlier store address (associative lookup), then we have a memory-induced RAW hazard: if the store value is available, return the value; if not, return the ROB number of the source.
Otherwise, send the request out to memory.
We will relax exact dependency checking in a later lecture.
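The store-queue check above can be sketched in a few lines. This is a toy software model with illustrative names (StoreEntry, check_load), not the actual associative hardware:

```python
# Toy model of the store-queue check: each store entry carries a
# program-order tag, an address (None until computed), and a value
# (None until the store data arrives).

from dataclasses import dataclass
from typing import Optional

@dataclass
class StoreEntry:
    rob_id: int                  # ROB tag of the store
    addr: Optional[int] = None   # unknown until the address resolves
    value: Optional[int] = None  # unknown until the data arrives

def check_load(store_queue, load_addr):
    """Decide what a load does once its address is known.
    store_queue holds prior stores, oldest first.
    Returns ('stall', None), ('forward', value), ('wait', rob_id),
    or ('memory', None)."""
    match = None
    for st in store_queue:
        if st.addr is None:
            return ('stall', None)      # earlier store still awaiting address
        if st.addr == load_addr:
            match = st                  # youngest matching store wins
    if match is None:
        return ('memory', None)         # no conflict: go to memory
    if match.value is not None:
        return ('forward', match.value) # RAW hazard, value ready: forward it
    return ('wait', match.rob_id)       # RAW hazard, wait on the source
```

Note this is the conservative policy from the slide: any earlier store with an unresolved address stalls the load outright.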

Memory Dependence Prediction

Is it important to speculate? Two extremes: Naïve Speculation (always let the load go forward) and No Speculation (always wait for dependencies to be resolved). Comparing the two: a False Dependency means waiting when we don't have to; an Order Violation is the result of speculating incorrectly. Goal of prediction: avoid false dependencies and order violations.

From Memory Dependence Prediction using Store Sets, Chrysos and Emer.

Said another way: Could we do better

Results from the same paper show the performance improvement with an oracle predictor: we can get significantly better performance if we find a good predictor. Question: how do we build a good predictor?

Premise: Past indicates Future

The basic premise is that past dependencies indicate future dependencies. Not always true! Hopefully true most of the time.
Store set: the set of store instructions that affect a given load. Example:
Addr  Inst
0     Store C
4     Store A
8     Store B
12    Store C
28    Load B  -> store set { PC 8 }
32    Load D  -> store set { (null) }
36    Load C  -> store set { PC 0, PC 12 }
40    Load B  -> store set { PC 8 }
Idea: the store set for a load starts empty. If the load ever goes forward and this causes a violation, add the offending store to the load's store set.
Approach, for each indeterminate load: if a store from its store set is in the pipeline, stall; else let it go forward. Does this work?
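The store-set idea above can be sketched as a toy predictor. Names and structures here are illustrative (an unbounded dict, not the SSIT/LFST hardware described later):

```python
# Toy store-set tracker: map each load PC to the set of store PCs that
# have ever violated against it; stall the load whenever any store in
# its set is still in flight.

store_sets = {}          # load PC -> set of store PCs

def on_violation(load_pc, store_pc):
    """A load went forward too early: remember the offending store."""
    store_sets.setdefault(load_pc, set()).add(store_pc)

def should_stall(load_pc, stores_in_flight):
    """Stall iff some store from the load's store set is in the pipeline."""
    return bool(store_sets.get(load_pc, set()) & set(stores_in_flight))
```

Using the example above: after a violation between Load B (PC 28) and Store B (PC 8), the load stalls whenever PC 8 is in flight, and proceeds otherwise.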

How well does an infinite tracking work

Infinite here means placing no limits on the number of store sets or the number of stores in a given set. This seems to do pretty well. Note: "Not Predicted" means the load had an empty store set. Only applu and xlisp seem to have false dependencies.

How to track Store Sets in reality

SSIT: assigns loads and stores to a Store Set ID (SSID). Notice that this requires each store to be in only one store set!
LFST: maps SSIDs to the most recently fetched store. When a load is fetched, this lets it find the most recent store in its store set that is still executing (if any), allowing it to stall until that store finishes. When a store is fetched, this lets it wait for the previous store in its store set; this is pretty much the same ordering as enforced by the ROB anyway. Transitivity: loads end up waiting for all active stores in their store set.
What if a store needs to be in two store sets? Allow store sets to be merged together deterministically, so two loads and multiple stores get the same SSID.
We want periodic clearing of the SSIT to avoid problems with aliasing across the program and out-of-control merging.
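A minimal sketch of the SSIT/LFST pair described above. Table size, the hash function, and the in-flight tags are placeholder choices, not the paper's parameters:

```python
# Sketch of SSIT (PC -> SSID) and LFST (SSID -> most recent in-flight store).

SSIT_SIZE = 1024
ssit = [None] * SSIT_SIZE     # PC-indexed -> Store Set ID (SSID), or None
lfst = {}                     # SSID -> tag of most recently fetched store

def ssit_index(pc):
    return pc % SSIT_SIZE     # placeholder hash of the PC

def fetch_store(pc, inflight_tag):
    """On fetching a store: return the previous store in its set to wait
    on (or None), then become the set's most recent store."""
    ssid = ssit[ssit_index(pc)]
    if ssid is None:
        return None                     # store in no set: nothing to wait for
    prev = lfst.get(ssid)
    lfst[ssid] = inflight_tag           # this store is now the most recent
    return prev

def fetch_load(pc):
    """On fetching a load: return the in-flight store it must wait for."""
    ssid = ssit[ssit_index(pc)]
    return None if ssid is None else lfst.get(ssid)

def store_completed(ssid, inflight_tag):
    """Clear the LFST entry, but only if this store is still most recent."""
    if lfst.get(ssid) == inflight_tag:
        del lfst[ssid]
```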

How well does this do

Comparison against a Store Barrier Cache, which marks individual stores as tending to cause memory violations; it is not specific to particular loads. Problem with applu? Analyzed in the paper: it has a complex 3-level inner loop in which loads occasionally depend on stores, forcing overly conservative stalls (i.e., false dependencies).

Load Value Predictability

Try to predict the result of a load before going to memory Paper: Value locality and load value prediction Mikko H. Lipasti, Christopher B. Wilkerson and John Paul Shen Notion of value locality Fraction of instances of a given load that match last n different values Is there any value locality in typical programs? Yes! With history depth of 1: most integer programs show over 50% repetition With history depth of 16: most integer programs show over 80% repetition Not everything does well: see cjpeg, swm256, and tomcatv Locality varies by type: Quite high for inst/data addresses Reasonable for integer values Not as high for FP values

Load Value Prediction Table

Load Value Prediction Table (LVPT): untagged, direct mapped; maps instructions to predicted data. Contains a history of the last n unique values from a given instruction, and can contain aliases since it is untagged. How to predict? When n=1, easy. When n=16? Use an oracle. Is every load predictable? No! Why not? We must identify predictable loads somehow.

Load Classification Table (LCT)

Load Classification Table (LCT): untagged, direct mapped; maps an instruction to a single bit saying whether or not to predict. How to implement? Use saturating counters (2 or 1 bit): increment when a prediction is correct, decrement when it is incorrect. With a 2-bit counter: 0, 1 = not predictable; 2 = predictable; 3 = constant (very predictable). With a 1-bit counter: 0 = not predictable; 1 = constant (very predictable).
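The 2-bit counter policy above can be sketched directly. One counter is shown; the real LCT is an untagged, direct-mapped table of such counters indexed by instruction address:

```python
# One 2-bit saturating counter of the LCT:
# 0/1 = not predictable, 2 = predictable, 3 = constant.

class LoadClassifier:
    def __init__(self):
        self.count = 0

    def classify(self):
        if self.count >= 3:
            return 'constant'
        if self.count == 2:
            return 'predictable'
        return 'not predictable'

    def update(self, prediction_correct):
        # Increment on a correct prediction, decrement on an incorrect one,
        # saturating at both ends.
        if prediction_correct:
            self.count = min(3, self.count + 1)
        else:
            self.count = max(0, self.count - 1)
```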

Accuracy of LCT

The question of accuracy is about how well we avoid predicting unpredictable loads and failing to predict predictable ones. How well does this work? The difference between Simple and Limit is history depth: Simple uses depth 1, Limit depth 16. Limit tends to classify more things as predictable (since prediction works more often with the deeper history). Basic principle: it often works better to have one structure decide on the basic predictability of a load, independent of the prediction structure itself.

Constant Value Unit

Idea: Identify a load instruction as constant Can ignore cache lookup (no verification) Must enforce by monitoring result of stores to remove constant status How well does this work? Seems to identify 6-18% of loads as constant Must be unchanging enough to cause LCT to classify as constant

Load Value Architecture

The LCT/LVPT sit in the fetch stage, the CVU in the execute stage; it is used to bypass the cache entirely (we know the result is good). Results: some speedups. The 21264 seems to do better than the PowerPC; the authors think this is because the small first-level cache and in-order execution make the CVU more useful.

Administrivia

Exam: Week after 3/16? CS252 First Project proposal due by Friday 3/4 Need two people/project (although can justify three for right project) Complete Research project in 9 weeks Typically investigate hypothesis by building an artifact and measuring it against a base case Generate conference-length paper/give oral presentation Often, can lead to an actual publication.

One important tool is RAMP Gold: FAST Emulation of new Hardware

RAMP emulation model for Parlab manycore SPARC v8 ISA -> v9 Considering ARM model Single-socket manycore target Split functional/timing model, both in hardware Functional model: Executes ISA Timing model: Capture pipeline timing detail (can be cycle accurate) Host multithreading of both functional and timing models Built for Virtex-5 systems (ML505 or BEE3) Have Tessellation OS currently running on RAMP system!

Tessellation: The Exploded OS

Normal Components split into pieces Device drivers (Security/Reliability) Network Services (Performance) TCP/IP stack Firewall Virus Checking Intrusion Detection Persistent Storage (Performance, Security, Reliability) Monitoring services Performance counters Introspection Identity/Environment services (Security) Biometric, GPS, Possession Tracking Applications Given Larger Partitions Freedom to use resources arbitrarily

Implementing the Space-Time Graph


Partition Policy layer (allocation, the resource allocator): allocates resources to cells based on global policies; produces only implementable space-time resource graphs; may deny resources to a cell that requests them (admission control).
Mapping layer (distribution): makes no decisions; time-slices at a coarse granularity (when time-slicing is necessary) and performs a bin-packing-like operation to implement the space-time graph. In the limit of many processors, there is no time multiplexing of processors, merely distribution of resources.
Partition Mechanism layer: implements hardware partitions and secure channels over paravirtualized hardware. Device dependent: makes use of more or less hardware support for QoS and partitions.


Sample of what could make good projects

Implement new resource partitioning mechanisms on RAMP and integrate into OS You can actually develop a new hardware mechanism, put into the OS, and show how partitioning gives better performance or real-time behavior You could develop new message-passing interfaces and do the same Virtual I/O devices RAMP-Gold runs in virtual time Develop devices and methodology for investigating real-time behavior of these devices in Tessellation running on RAMP Energy monitoring and adaptation How to measure energy consumed by applications and adapt accordingly Develop and evaluate new parallel communication model Target for Multicore systems New Message-Passing Interface, New Network Routing Layer Investigate applications under different types of hardware CUDA vs MultiCore, etc New Style of computation, tweak on existing one Better Memory System, etc.

Projects using Quantum CAD Flow

Use the quantum CAD flow developed in Kubiatowicz's group to investigate quantum circuits: the tradeoff of area vs. performance for Shor's algorithm, or other interesting algorithms (quantum simulation).

Data Value Prediction

Why do it? It can break the dataflow boundary. Before: critical path = 4 operations (probably worse). After: critical path = 1 operation (plus verification).

Data Value Predictability

"The Predictability of Data Values," Yiannakis Sazeides and James Smith, Micro 30, 1997. Three different types of patterns: Constant (C): 5 5 5 5 5 5 5 5 5 5. Stride (S): 1 2 3 4 5 6 7 8 9. Non-Stride (NS): 28 13 99 107 23 456. Combinations: Repeated Stride (RS): 1 2 3 1 2 3 1 2 3 1 2 3. Repeated Non-Stride (RNS): 1 -13 -99 7 1 -13 -99 7.

Computational Predictors

Last Value Predictors: predict that an instruction will produce the same value as last time. Requires some form of hysteresis; two subtle alternatives: (1) a saturating counter incremented/decremented on success/failure, replacing the value when the count falls below a threshold, or (2) keep the old value until the new value has been seen frequently enough. The second version predicts a constant when the value appears temporarily constant.
Stride Predictors: predict the next value by adding the most recent value to the difference of the two most recent values: if v_{n-1} and v_{n-2} are the two most recent values, predict the next value will be v_{n-1} + (v_{n-1} - v_{n-2}). The value (v_{n-1} - v_{n-2}) is called the stride. Important variations in hysteresis: change the stride only if a saturating counter falls below a threshold, or use the two-delta method: two strides are maintained; the first (S1) is always updated by the difference between the two most recent values, while the other (S2) is used for computing predictions; when S1 is seen twice in a row, S1 is copied into S2.
More complex predictors: multiple strides for nested loops; complex computations for complex loops (polynomials, etc.!).
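The two-delta stride scheme above can be sketched as follows. This shows the state for a single instruction; a real predictor keeps one such entry per table slot, and the class name is illustrative:

```python
# Two-delta stride predictor: S1 is always updated with the latest
# difference; S2 (used for predictions) only changes when the same
# stride is seen twice in a row.

class TwoDeltaStride:
    def __init__(self):
        self.last = None   # v_{n-1}, most recent value
        self.s1 = None     # always-updated stride
        self.s2 = None     # prediction stride

    def predict(self):
        if self.last is None or self.s2 is None:
            return None                 # not enough history yet
        return self.last + self.s2      # v_{n-1} + stride

    def update(self, value):
        if self.last is not None:
            new_s1 = value - self.last  # difference of two most recent values
            if new_s1 == self.s1:       # same stride twice in a row:
                self.s2 = new_s1        # promote S1 -> S2
            self.s1 = new_s1
        self.last = value
```

The hysteresis shows up after a one-off jump in the value stream: S1 changes immediately, but S2 (and hence the prediction) keeps the established stride until the new stride repeats.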

Context Based Predictors

A context-based predictor relies on tables to do the trick. Predictors are classified according to order: an n-th order model takes the last n values and uses them to produce a prediction, so a 0th-order predictor is entirely frequency based. Consider the sequence: a a a b c a a a b c a a a. The next value is? (For a 3rd-order model: b, since the context "a a a" has been followed by b each time.) Blending: use the prediction of the highest order available.
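An n-th order context predictor of the kind described above can be sketched as a software toy; real hardware would hash the context into a fixed-size table rather than use an unbounded dict:

```python
# n-th order context-based predictor: the last n values form a context
# that indexes a table counting which value followed that context.

from collections import defaultdict, Counter

class ContextPredictor:
    def __init__(self, order):
        self.order = order
        self.history = []                  # last `order` values seen
        self.table = defaultdict(Counter)  # context -> {next value: count}

    def predict(self):
        if len(self.history) < self.order:
            return None                    # context not yet full
        ctx = tuple(self.history)
        if not self.table[ctx]:
            return None                    # context never seen before
        return self.table[ctx].most_common(1)[0][0]

    def update(self, value):
        if len(self.history) == self.order:
            self.table[tuple(self.history)][value] += 1
        self.history = (self.history + [value])[-self.order:]
```

On the slide's sequence, a 3rd-order instance learns that "a a a" is followed by b: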

Which is better

Stride-based: learns faster, uses less state, and is much cheaper in terms of hardware, but runs into errors for any pattern that is not an infinite stride. Context-based: takes much longer to train and needs much more expensive hardware, but performs perfectly once trained.

How predictable are data items

Assumptions (we are looking for limits): prediction is done with no table aliasing (every instruction has its own set of tables/strides/etc.). Only instructions that write into registers are measured; this excludes stores, branches, jumps, etc. Overall predictability: L = last value; S = stride (delta-2); FCMx = order-x context-based predictor.

Correlation of Predicted Sets

Way to interpret: l = last val s = stride f = fcm3 Combinations: ls = both l and s Etc. Conclusion? Only 18% not predicted correctly by any model About 40% captured by all predictors A significant fraction (over 20%) only captured by fcm Stride does well! Over 60% of correct predictions captured Last-Value seems to have very little added value

Number of unique values

Data Observations: Many static instructions (>50%) generate only one value Majority of static instructions (>90%) generate fewer than 64 values Majority of dynamic instructions (>50%) correspond to static insts that generate fewer than 64 values Over 90% of dynamic instructions correspond to static insts that generate fewer than 4096 unique values Suggests that a relatively small number of values would be required for actual context prediction

General Idea: Confidence Prediction

Separate mechanisms for data and confidence prediction Data predictor keeps track of values via multiple mechanisms Confidence predictor tracks history of correctness (good/bad) Confidence prediction options: Saturating counter History register (like branch prediction)
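The split described above can be sketched with a last-value data predictor gated by a saturating confidence counter. The class name and the threshold of 2 are illustrative choices, not from the lecture:

```python
# Value prediction gated by a separate confidence mechanism: the data
# predictor here is simple last-value; the confidence predictor is a
# 2-bit saturating counter tracking past correctness.

class ConfidentValuePredictor:
    def __init__(self, threshold=2):
        self.last_value = None
        self.confidence = 0          # 2-bit saturating counter (0..3)
        self.threshold = threshold

    def predict(self):
        """Speculate only when confidence is high enough."""
        if self.confidence >= self.threshold:
            return self.last_value
        return None                  # no speculation: wait for real result

    def resolve(self, actual):
        """Called when the real value arrives; trains both predictors."""
        correct = (actual == self.last_value)
        if correct:
            self.confidence = min(3, self.confidence + 1)
        else:
            self.confidence = max(0, self.confidence - 1)
        self.last_value = actual
        return correct
```

A history-register confidence scheme (as in branch prediction) would replace the counter with a shift register of recent correct/incorrect outcomes indexing a table.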

[Diagram: a Data Predictor and a Confidence Prediction unit sit alongside the pipeline (PC, Fetch, Decode, Reorder Buffer, Execute, Complete, Commit); predicted results are checked against the actual result at completion.]

Discussion of papers: The Alpha 21264 Microprocessor

BTB replaced by a line/set predictor, trained by the branch predictor (a tournament predictor). Renaming: 80 integer registers, 72 floating-point registers. Clustered architecture for integer ops (2 clusters). Speculative loads: dependency speculation and cache-miss speculation.

Limits to ILP

Conflicting studies of the amount: benchmarks (vectorized Fortran FP vs. integer C programs), hardware sophistication, compiler sophistication. How much ILP is available using existing mechanisms with increasing HW budgets? Do we need to invent new HW/SW mechanisms to keep on the processor performance curve? Intel MMX, SSE (Streaming SIMD Extensions): 64-bit ints. Intel SSE2: 128 bit, including 2 64-bit Fl. Pt. per clock. Motorola AltiVec: 128-bit ints and FPs. SuperSPARC multimedia ops, etc.

Overcoming Limits

Advances in compiler technology plus significantly new and different hardware techniques may be able to overcome the limitations assumed in these studies. However, it is unlikely that such advances, coupled with realistic hardware, will overcome these limits in the near future.

Limits to ILP

Initial HW model here; MIPS compilers. Assumptions for the ideal/perfect machine to start:
1. Register renaming: infinite virtual registers, so all register WAW & WAR hazards are avoided.
2. Branch prediction: perfect; no mispredictions.
3. Jump prediction: all jumps perfectly predicted (returns, case statements). 2 & 3 together mean no control dependencies: perfect speculation and an unbounded buffer of instructions available.
4. Memory-address alias analysis: addresses are known, and a load can be moved before a store provided the addresses are not equal; 1 & 4 eliminate all but RAW hazards.
Also: perfect caches; 1-cycle latency for all instructions (including FP *, /); unlimited instructions issued per clock cycle.

Limits to ILP HW Model comparison

                              | Model    | Power 5
Instructions issued per clock | Infinite | 4
Instruction window size       | Infinite | 200
Renaming registers            | Infinite | 48 integer + 40 Fl. Pt.
Branch prediction             | Perfect  | 2% to 6% misprediction (Tournament Branch Predictor)
Cache                         | Perfect  | 64KI, 32KD, 1.92MB L2, 36 MB L3
Memory alias analysis         | Perfect  | ??

Upper Limit to ILP: Ideal Machine (Figure 3.1)

Instructions per clock on the ideal machine: FP programs 75-150, integer programs 18-60.

Limits to ILP HW Model comparison

                              | New Model                  | Model    | Power 5
Instructions issued per clock | Infinite                   | Infinite | 4
Instruction window size       | Infinite, 2K, 512, 128, 32 | Infinite | 200
Renaming registers            | Infinite                   | Infinite | 48 integer + 40 Fl. Pt.
Branch prediction             | Perfect                    | Perfect  | 2% to 6% misprediction (Tournament Branch Predictor)
Cache                         | Perfect                    | Perfect  | 64KI, 32KD, 1.92MB L2, 36 MB L3
Memory alias                  | Perfect                    | Perfect  | ??

More Realistic HW: Window Impact Figure 3.2

IPC as the window shrinks from infinite to 2048, 512, 128, 32: FP 9-150, integer 8-63.

Limits to ILP HW Model comparison

                              | New Model                                             | Model    | Power 5
Instructions issued per clock | 64                                                    | Infinite | 4
Instruction window size       | 2048                                                  | Infinite | 200
Renaming registers            | Infinite                                              | Infinite | 48 integer + 40 Fl. Pt.
Branch prediction             | Perfect vs. 8K Tournament vs. 512 2-bit vs. profile vs. none | Perfect | 2% to 6% misprediction (Tournament Branch Predictor)
Cache                         | Perfect                                               | Perfect  | 64KI, 32KD, 1.92MB L2, 36 MB L3
Memory alias                  | Perfect                                               | Perfect  | ??

More Realistic HW: Branch Impact Figure 3.3

IPC with a 2048-entry window and maximum issue of 64 instructions per clock cycle, across branch predictors (Perfect, Tournament, BHT (512), Profile, No prediction): FP 15-45, integer 6-12.

Misprediction Rates

Limits to ILP HW Model comparison

                              | New Model                          | Model    | Power 5
Instructions issued per clock | 64                                 | Infinite | 4
Instruction window size       | 2048                               | Infinite | 200
Renaming registers            | Infinite v. 256, 128, 64, 32, none | Infinite | 48 integer + 40 Fl. Pt.
Branch prediction             | 8K 2-bit                           | Perfect  | Tournament Branch Predictor
Cache                         | Perfect                            | Perfect  | 64KI, 32KD, 1.92MB L2, 36 MB L3
Memory alias                  | Perfect                            | Perfect  | Perfect

More Realistic HW: Renaming Register Impact (N int + N fp) Figure 3.5

IPC with a 2048-instruction window, 64-instruction issue, and 8K 2-level prediction, as renaming registers vary (Infinite, 256, 128, 64, 32, None): FP 11-45, integer 5-15.

Limits to ILP HW Model comparison

                              | New Model                          | Model    | Power 5
Instructions issued per clock | 64                                 | Infinite | 4
Instruction window size       | 2048                               | Infinite | 200
Renaming registers            | 256 Int + 256 FP                   | Infinite | 48 integer + 40 Fl. Pt.
Branch prediction             | 8K 2-bit                           | Perfect  | Tournament
Cache                         | Perfect                            | Perfect  | 64KI, 32KD, 1.92MB L2, 36 MB L3
Memory alias                  | Perfect v. Stack v. Inspect v. none | Perfect | Perfect

More Realistic HW: Memory Address Alias Impact Figure 3.6

IPC with a 2048-instruction window, 64-instruction issue, 8K 2-level prediction, and 256 renaming registers, across alias-analysis models (Perfect; Global/stack perfect with heap conflicts; Inspection of assembly; None): FP 4-45 (Fortran, no heap), integer 4-9.

Limits to ILP HW Model comparison

                              | New Model                   | Model    | Power 5
Instructions issued per clock | 64 (no restrictions)        | Infinite | 4
Instruction window size       | Infinite vs. 256, 128, 64, 32 | Infinite | 200
Renaming registers            | 64 Int + 64 FP              | Infinite | 48 integer + 40 Fl. Pt.
Branch prediction             | 1K 2-bit                    | Perfect  | Tournament
Cache                         | Perfect                     | Perfect  | 64KI, 32KD, 1.92MB L2, 36 MB L3
Memory alias                  | HW disambiguation           | Perfect  | Perfect

Realistic HW: Window Impact (Figure 3.7)

IPC with perfect HW disambiguation, 1K selective prediction, a 16-entry return stack, 64 registers, and issue as many as the window, as window size varies (Infinite, 256, 128, 64, 32, 16, 8, 4): FP 8-45, integer 6-12.

How to Exceed ILP Limits of this study

These are not laws of physics, just practical limits for today, and perhaps they can be overcome via research; compiler and ISA advances could change the results. WAR and WAW hazards through memory: the study eliminated WAW and WAR hazards through register renaming, but not for memory. Such conflicts can arise via allocation of stack frames, as a called procedure reuses the memory addresses of a previous frame on the stack.

HW v. SW to increase ILP

Memory disambiguation: HW best. Speculation: HW best when dynamic branch prediction is better than compile-time prediction; exceptions are easier for HW; HW doesn't need bookkeeping or compensation code, which is very complicated to get right. Scheduling: SW can look ahead to schedule better. Compiler independence: HW does not require a new compiler, or recompilation, to run well.

Discussion of papers: Complexity-effective superscalar processors

"Complexity-effective superscalar processors," Subbarao Palacharla, Norman P. Jouppi and J. E. Smith. Several data structures analyzed for complexity with respect to issue width (IW): Rename is roughly linear in IW, with a steeper slope at smaller feature sizes; Wakeup is roughly linear in IW, but quadratic in window size; Bypass is strongly quadratic in IW. Overall results: bypass is significant at high window size/issue width; wakeup+select delay dominates otherwise. Proposed complexity-effective design: replace the issue window with FIFOs and steer dependent instructions to the same FIFO.

Performance beyond single thread ILP

There can be much higher natural parallelism in some applications (e.g., database or scientific codes): explicit Thread Level Parallelism or Data Level Parallelism. Thread: a process with its own instructions and data; a thread may be one process of a parallel program of multiple processes, or it may be an independent program. Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute. Data Level Parallelism: perform identical operations on data, and lots of data.

Thread Level Parallelism (TLP)

ILP exploits implicit parallel operations within a loop or straight-line code segment TLP explicitly represented by the use of multiple threads of execution that are inherently parallel Goal: Use multiple instruction streams to improve Throughput of computers that run many programs Execution time of multi-threaded programs TLP could be more cost-effective to exploit than ILP

And in conclusion

Last Value Prediction: predict that the value of a load will be similar (the same?) as its previous value; works better than one might expect. Dependence Prediction: try to predict whether a load depends on stores before addresses are known; a store set is the set of stores that have had dependencies with the load in the past. Computation-based predictors: try to construct a prediction based on some actual computation; last value is a trivial prediction, stride-based prediction slightly more complex. Context-based predictors: table driven; when a given sequence is seen, repeat what was seen last time; can reproduce complex patterns. Limits to ILP (power efficiency, compilers, dependencies ...) seem to limit practical options to 3 to 6 issue. We must start to utilize other types of parallelism!
