<<  Aluminium Research for Road Transport Applications Lecture 4: An Introduction to the Vector Data Model and Map Layout Techniques  >>
Memory Dependence Prediction
Memory Dependence Prediction
Said another way: Could we do better
Said another way: Could we do better
How well does an infinite tracking work
How well does an infinite tracking work
How to track Store Sets in reality
How to track Store Sets in reality
How well does this do
How well does this do
Load Value Predictability
Load Value Predictability
Accuracy of LCT
Accuracy of LCT
Constant Value Unit
Constant Value Unit
Load Value Architecture
Load Value Architecture
Load Value Architecture
Load Value Architecture
One important tool is RAMP Gold: FAST Emulation of new Hardware
One important tool is RAMP Gold: FAST Emulation of new Hardware
Tessellation: The Exploded OS
Tessellation: The Exploded OS
Tessellation: The Exploded OS
Tessellation: The Exploded OS
Tessellation: The Exploded OS
Tessellation: The Exploded OS
Tessellation: The Exploded OS
Tessellation: The Exploded OS
Tessellation: The Exploded OS
Tessellation: The Exploded OS
Tessellation: The Exploded OS
Tessellation: The Exploded OS
Tessellation: The Exploded OS
Tessellation: The Exploded OS
Projects using Quantum CAD Flow
Projects using Quantum CAD Flow
Context Based Predictors
Context Based Predictors
Context Based Predictors
Context Based Predictors
Context Based Predictors
Context Based Predictors
Context Based Predictors
Context Based Predictors
Context Based Predictors
Context Based Predictors
Which is better
Which is better
How predictable are data items
How predictable are data items
Correlation of Predicted Sets
Correlation of Predicted Sets
Number of unique values
Number of unique values
Discussion of papers: The Alpha 21264 Microprocessor
Discussion of papers: The Alpha 21264 Microprocessor
Upper Limit to ILP: Ideal Machine (Figure 3.1)
Upper Limit to ILP: Ideal Machine (Figure 3.1)
More Realistic HW: Memory Address Alias Impact Figure 3.6
More Realistic HW: Memory Address Alias Impact Figure 3.6
Realistic HW: Window Impact (Figure 3.7)
Realistic HW: Window Impact (Figure 3.7)
Discussion of papers: Complexity-effective superscalar processors
Discussion of papers: Complexity-effective superscalar processors
CS252 Graduate Computer Architecture Lecture 10 Dependency Prediction (Cont) Data Prediction and Confidence ILP Limits February 23rd, 2011

: John Kubiatowicz. , . , CS252 Graduate Computer Architecture Lecture 10 Dependency Prediction (Cont) Data Prediction and Confidence ILP Limits February 23rd, 2011.ppt zip- 1943 .

CS252 Graduate Computer Architecture Lecture 10 Dependency Prediction (Cont) Data Prediction and Confidence ILP Limits February 23rd, 2011

CS252 Graduate Computer Architecture Lecture 10 Dependency Prediction (Cont) Data Prediction and Confidence ILP Limits February 23rd, 2011.ppt
1CS252 Graduate Computer Architecture 24computing predictions When S1 seen twice
Lecture 10 Dependency Prediction (Cont) in a row, then S1?S2 More complex
Data Prediction and Confidence ILP Limits predictors: Multiple strides for nested
February 23rd, 2011. John Kubiatowicz loops Complex computations for complex
Electrical Engineering and Computer loops (polynomials, etc!). 24. 2/23/2011.
Sciences University of California, CS252-S11 lecture 10.
Berkeley 25Context Based Predictors. Context
http://www.eecs.berkeley.edu/~kubitron/cs2 Based Predictor Relies on Tables to do
2. trick Classified according to the order:
2Review: Branch Prediction/Speculation. an n-th order model takes last n values
PC. Fetch. Decode & Rename. Reorder and uses this to produce prediction So
Buffer. Commit. Reg. File. MEM. Branch 0th order predictor will be entirely
Unit. ALU. Store Buffer. D$. Execute. frequency based Consider sequence: a a a b
Update predictors. Branch Prediction. 2. c a a a b c a a a Next value is?
2/23/2011. CS252-S11 lecture 10. Blending: Use prediction of highest
3Review: Memory Disambiguation. order available. 25. 2/23/2011. CS252-S11
Question: Given a load that follows a lecture 10.
store in program order, are the two 26Which is better? Stride-based: Learns
related? Trying to detect RAW hazards faster less state Much cheaper in terms of
through memory Stores commit in order hardware! runs into errors for any pattern
(ROB), so no WAR/WAW memory hazards. that is not an infinite stride
Implementation Keep queue of stores, in Context-based: Much longer to train
program order Watch for position of new Performs perfectly once trained Much more
loads relative to existing stores expensive hardware. 26. 2/23/2011.
Typically, this is a different buffer than CS252-S11 lecture 10.
ROB! Could be ROB (has right properties), 27How predictable are data items?
but too expensive When have address for Assumptions looking for limits
load, check store queue: If any store Prediction done with no table aliasing
prior to load is waiting for its (every instruction has own set of
address?????? If load address matches tables/strides/etc. Only instructions that
earlier store address (associative write into registers are measured Excludes
lookup), then we have a memory-induced RAW stores, branches, jumps, etc Overall
hazard: store value available ? return Predictability: L = Last Value S = Stride
value store value not available ? return (delta-2) FCMx = Order x context based
ROB number of source Otherwise, send out predictor. 27. 2/23/2011. CS252-S11
request to memory Will relax exact lecture 10.
dependency checking in later lecture. 3. 28Correlation of Predicted Sets. Way to
2/23/2011. CS252-S11 lecture 10. interpret: l = last val s = stride f =
4Memory Dependence Prediction. fcm3 Combinations: ls = both l and s Etc.
Important to speculate? Two Extremes: Conclusion? Only 18% not predicted
Na?ve Speculation: always let load go correctly by any model About 40% captured
forward No Speculation: always wait for by all predictors A significant fraction
dependencies to be resolved Compare Na?ve (over 20%) only captured by fcm Stride
Speculation to No Speculation False does well! Over 60% of correct predictions
Dependency: wait when dont have to Order captured Last-Value seems to have very
Violation: result of speculating little added value. 28. 2/23/2011.
incorrectly Goal of prediction: Avoid CS252-S11 lecture 10.
false dependencies and order violations. 29Number of unique values. Data
From Memory Dependence Prediction using Observations: Many static instructions
Store Sets, Chrysos and Emer. 4. (>50%) generate only one value Majority
2/23/2011. CS252-S11 lecture 10. of static instructions (>90%) generate
5Said another way: Could we do better? fewer than 64 values Majority of dynamic
Results from same paper: performance instructions (>50%) correspond to
improvement with oracle predictor We can static insts that generate fewer than 64
get significantly better performance if we values Over 90% of dynamic instructions
find a good predictor Question: How to correspond to static insts that generate
build a good predictor? 5. 2/23/2011. fewer than 4096 unique values Suggests
CS252-S11 lecture 10. that a relatively small number of values
6Premise: Past indicates Future. Basic would be required for actual context
Premise is that past dependencies indicate prediction. 29. 2/23/2011. CS252-S11
future dependencies Not always true! lecture 10.
Hopefully true most of time Store Set: Set 30? General Idea: Confidence Prediction.
of store insts that affect given load Separate mechanisms for data and
Example: Addr Inst 0 Store C 4 Store A 8 confidence prediction Data predictor keeps
Store B 12 Store C 28 Load B ? Store set { track of values via multiple mechanisms
PC 8 } 32 Load D ? Store set { (null) } 36 Confidence predictor tracks history of
Load C ? Store set { PC 0, PC 12 } 40 Load correctness (good/bad) Confidence
B ? Store set { PC 8 } Idea: Store set for prediction options: Saturating counter
load starts empty. If ever load go forward History register (like branch prediction).
and this causes a violation, add offending Data Predictor. Confidence Prediction.
store to loads store set Approach: For Result. Commit. Fetch. Decode. Reorder
each indeterminate load: If Store from Buffer. PC. Complete. Execute. 30.
Store set is in pipeline, stall Else let 2/23/2011. CS252-S11 lecture 10.
go forward Does this work? 6. 2/23/2011. 31Discussion of papers: The Alpha 21264
CS252-S11 lecture 10. Microprocessor. BTB?Line/set predictor
7How well does an infinite tracking Trained by branch predictor (Tournament
work? Infinite here means to place no predictor) Renaming: 80 integer registers,
limits on: Number of store sets Number of 72 floating-point registers Clustered
stores in given set Seems to do pretty architecture for integer ops (2 clusters)
well Note: Not Predicted means load had Speculative Loads: Dependency speculation
empty store set Only Applu and Xlisp seems Cache-miss speculation. 31. 2/23/2011.
to have false dependencies. 7. 2/23/2011. CS252-S11 lecture 10.
CS252-S11 lecture 10. 32Limits to ILP. Conflicting studies of
8How to track Store Sets in reality? amount Benchmarks (vectorized Fortran FP
SSIT: Assigns Loads and Stores to Store vs. integer C programs) Hardware
Set ID (SSID) Notice that this requires sophistication Compiler sophistication How
each store to be in only one store set! much ILP is available using existing
LFST: Maps SSIDs to most recent fetched mechanisms with increasing HW budgets? Do
store When Load is fetched, allows it to we need to invent new HW/SW mechanisms to
find most recent store in its store set keep on processor performance curve? Intel
that is executing (if any) ? allows MMX, SSE (Streaming SIMD Extensions): 64
stalling until store finished When Store bit ints Intel SSE2: 128 bit, including 2
is fetched, allows it to wait for previous 64-bit Fl. Pt. per clock Motorola AltaVec:
store in store set Pretty much same type 128 bit ints and FPs Supersparc Multimedia
of ordering as enforced by ROB anyway ops, etc. 32. 2/23/2011. CS252-S11 lecture
Transitivity? loads end up waiting for all 10.
active stores in store set What if store 33Overcoming Limits. Advances in
needs to be in two store sets? Allow store compiler technology + significantly new
sets to be merged together and different hardware techniques may be
deterministically Two loads, multiple able to overcome limitations assumed in
stores get same SSID Want periodic studies However, unlikely such advances
clearing of SSIT to avoid: problems with when coupled with realistic hardware will
aliasing across program Out of control overcome these limits in near future. 33.
merging. 8. 2/23/2011. CS252-S11 lecture 2/23/2011. CS252-S11 lecture 10.
10. 34Limits to ILP. Initial HW Model here;
9How well does this do? Comparison MIPS compilers. Assumptions for
against Store Barrier Cache Marks ideal/perfect machine to start: 1.
individual Stores as tending to cause Register renaming infinite virtual
memory violations Not specific to registers ? all register WAW & WAR
particular loads. Problem with APPLU? hazards are avoided 2. Branch prediction
Analyzed in paper: has complex 3-level perfect; no mispredictions 3. Jump
inner loop in which loads occasionally prediction all jumps perfectly predicted
depend on stores Forces overly (returns, case statements) 2 & 3 ? no
conservative stalls (i.e. false control dependencies; perfect speculation
dependencies). 9. 2/23/2011. CS252-S11 & an unbounded buffer of instructions
lecture 10. available 4. Memory-address alias analysis
10Load Value Predictability. Try to addresses known & a load can be
predict the result of a load before going moved before a store provided addresses
to memory Paper: Value locality and load not equal; 1&4 eliminates all but RAW
value prediction Mikko H. Lipasti, Also: perfect caches; 1 cycle latency for
Christopher B. Wilkerson and John Paul all instructions (FP *,/); unlimited
Shen Notion of value locality Fraction of instructions issued/clock cycle; 34.
instances of a given load that match last 2/23/2011. CS252-S11 lecture 10.
n different values Is there any value 35Limits to ILP HW Model comparison.
locality in typical programs? Yes! With Model. Power 5. Instructions Issued per
history depth of 1: most integer programs clock. Infinite. 4. Instruction Window
show over 50% repetition With history Size. Infinite. 200. Renaming Registers.
depth of 16: most integer programs show Infinite. 48 integer + 40 Fl. Pt. Branch
over 80% repetition Not everything does Prediction. Perfect. 2% to 6%
well: see cjpeg, swm256, and tomcatv misprediction (Tournament Branch
Locality varies by type: Quite high for Predictor). Cache. Perfect. 64KI, 32KD,
inst/data addresses Reasonable for integer 1.92MB L2, 36 MB L3. Memory Alias
values Not as high for FP values. 10. Analysis. Perfect. ?? 35. 2/23/2011.
2/23/2011. CS252-S11 lecture 10. CS252-S11 lecture 10.
11Load Value Prediction Table. Load 36Upper Limit to ILP: Ideal Machine
Value Prediction Table (LVPT) Untagged, (Figure 3.1). Instructions Per Clock. FP:
Direct Mapped Takes Instructions ? 75 - 150. Integer: 18 - 60. 36. 2/23/2011.
Predicted Data Contains history of last n CS252-S11 lecture 10.
unique values from given instruction Can 37Limits to ILP HW Model comparison. New
contain aliases, since untagged How to Model. Model. Power 5. Instructions Issued
predict? When n=1, easy When n=16? Use per clock. Infinite. Infinite. 4.
Oracle Is every load predictable? No! Why Instruction Window Size. Infinite, 2K,
not? Must identify predictable loads 512, 128, 32. Infinite. 200. Renaming
somehow. 11. 2/23/2011. CS252-S11 lecture Registers. Infinite. Infinite. 48 integer
10. + 40 Fl. Pt. Branch Prediction. Perfect.
12Load Classification Table (LCT). Perfect. 2% to 6% misprediction
Instruction Addr. Load Classification (Tournament Branch Predictor). Cache.
Table (LCT) Untagged, Direct Mapped Takes Perfect. Perfect. 64KI, 32KD, 1.92MB L2,
Instructions ? Single bit of whether or 36 MB L3. Memory Alias. Perfect. Perfect.
not to predict How to implement? Uses ?? 37. 2/23/2011. CS252-S11 lecture 10.
saturating counters (2 or 1 bit) When 38More Realistic HW: Window Impact
prediction correct, increment When Figure 3.2. IPC. FP: 9 - 150. Integer: 8 -
prediction incorrect, decrement With 2 bit 63. Change from Infinite window 2048, 512,
counter 0,1 ? not predictable 2 ? 128, 32. 38. 2/23/2011. CS252-S11 lecture
predictable 3 ? constant (very 10.
predictable) With 1 bit counter 0 ? not 39Limits to ILP HW Model comparison. New
predictable 1 ? constant (very Model. Model. Power 5. Instructions Issued
predictable). 12. 2/23/2011. CS252-S11 per clock. 64. Infinite. 4. Instruction
lecture 10. Window Size. 2048. Infinite. 200. Renaming
13Accuracy of LCT. Question of accuracy Registers. Infinite. Infinite. 48 integer
is about how well we avoid: Predicting + 40 Fl. Pt. Branch Prediction. Perfect
unpredictable load Not predicting vs. 8K Tournament vs. 512 2-bit vs.
predictable loads How well does this work? profile vs. none. Perfect. 2% to 6%
Difference between Simple and Limit: misprediction (Tournament Branch
history depth Simple: depth 1 Limit: depth Predictor). Cache. Perfect. Perfect. 64KI,
16 Limit tends to classify more things as 32KD, 1.92MB L2, 36 MB L3. Memory Alias.
predictable (since this works more often) Perfect. Perfect. ?? 39. 2/23/2011.
Basic Principle: Often works better to CS252-S11 lecture 10.
have one structure decide on the basic 40More Realistic HW: Branch Impact
predictability of structure Independent Figure 3.3. IPC. Change from Infinite
of prediction structure. 13. 2/23/2011. window to examine to 2048 and maximum
CS252-S11 lecture 10. issue of 64 instructions per clock cycle.
14Constant Value Unit. Idea: Identify a FP: 15 - 45. Integer: 6 - 12. Perfect.
load instruction as constant Can ignore Tournament. BHT (512). Profile. No
cache lookup (no verification) Must prediction. 40. 2/23/2011. CS252-S11
enforce by monitoring result of stores to lecture 10.
remove constant status How well does 41Misprediction Rates. 41. 2/23/2011.
this work? Seems to identify 6-18% of CS252-S11 lecture 10.
loads as constant Must be unchanging 42Limits to ILP HW Model comparison. New
enough to cause LCT to classify as Model. Model. Power 5. Instructions Issued
constant. 14. 2/23/2011. CS252-S11 lecture per clock. 64. Infinite. 4. Instruction
10. Window Size. 2048. Infinite. 200. Renaming
15Load Value Architecture. LCT/LVPT in Registers. Infinite v. 256, 128, 64, 32,
fetch stage CVU in execute stage Used to none. Infinite. 48 integer + 40 Fl. Pt.
bypass cache entirely (Know that result is Branch Prediction. 8K 2-bit. Perfect.
good) Results: Some speedups 21264 seems Tournament Branch Predictor. Cache.
to do better than Power PC Authors think Perfect. Perfect. 64KI, 32KD, 1.92MB L2,
this is because of small first-level cache 36 MB L3. Memory Alias. Perfect. Perfect.
and in-order execution makes CVU more Perfect. 42. 2/23/2011. CS252-S11 lecture
useful. 15. 2/23/2011. CS252-S11 lecture 10.
10. 43More Realistic HW: Renaming Register
16Administrivia. Exam: Week after 3/16? Impact (N int + N fp) Figure 3.5. IPC. FP:
CS252 First Project proposal due by Friday 11 - 45. Change 2048 instr window, 64
3/4 Need two people/project (although can instr issue, 8K 2 level Prediction.
justify three for right project) Complete Integer: 5 - 15. Infinite. 256. 128. 64.
Research project in 9 weeks Typically 32. None. 43. 2/23/2011. CS252-S11 lecture
investigate hypothesis by building an 10.
artifact and measuring it against a base 44Limits to ILP HW Model comparison. New
case Generate conference-length Model. Model. Power 5. Instructions Issued
paper/give oral presentation Often, can per clock. 64. Infinite. 4. Instruction
lead to an actual publication. 16. Window Size. 2048. Infinite. 200. Renaming
2/23/2011. CS252-S11 lecture 10. Registers. 256 Int + 256 FP. Infinite. 48
17One important tool is RAMP Gold: FAST integer + 40 Fl. Pt. Branch Prediction. 8K
Emulation of new Hardware. RAMP emulation 2-bit. Perfect. Tournament. Cache.
model for Parlab manycore SPARC v8 ISA Perfect. Perfect. 64KI, 32KD, 1.92MB L2,
-> v9 Considering ARM model 36 MB L3. Memory Alias. Perfect v. Stack
Single-socket manycore target Split v. Inspect v. none. Perfect. Perfect. 44.
functional/timing model, both in hardware 2/23/2011. CS252-S11 lecture 10.
Functional model: Executes ISA Timing 45More Realistic HW: Memory Address
model: Capture pipeline timing detail (can Alias Impact Figure 3.6. IPC. Change 2048
be cycle accurate) Host multithreading of instr window, 64 instr issue, 8K 2 level
both functional and timing models Built Prediction, 256 renaming registers. FP: 4
for Virtex-5 systems (ML505 or BEE3) Have - 45 (Fortran, no heap). Integer: 4 - 9.
Tessellation OS currently running on RAMP Perfect. Global/Stack perf; heap
system! 17. 2/23/2011. CS252-S11 lecture conflicts. Inspec. Assem. None. 45.
10. 2/23/2011. CS252-S11 lecture 10.
18Tessellation: The Exploded OS. Normal 46Limits to ILP HW Model comparison. New
Components split into pieces Device Model. Model. Power 5. Instructions Issued
drivers (Security/Reliability) Network per clock. 64 (no restrictions). Infinite.
Services (Performance) TCP/IP stack 4. Instruction Window Size. Infinite vs.
Firewall Virus Checking Intrusion 256, 128, 64, 32. Infinite. 200. Renaming
Detection Persistent Storage (Performance, Registers. 64 Int + 64 FP. Infinite. 48
Security, Reliability) Monitoring services integer + 40 Fl. Pt. Branch Prediction. 1K
Performance counters Introspection 2-bit. Perfect. Tournament. Cache.
Identity/Environment services (Security) Perfect. Perfect. 64KI, 32KD, 1.92MB L2,
Biometric, GPS, Possession Tracking 36 MB L3. Memory Alias. HW disambiguation.
Applications Given Larger Partitions Perfect. Perfect. 46. 2/23/2011. CS252-S11
Freedom to use resources arbitrarily. 18. lecture 10.
2/23/2011. CS252-S11 lecture 10. 47Realistic HW: Window Impact (Figure
19Implementing the Space-Time Graph. 3.7). IPC. Perfect disambiguation (HW), 1K
Partition Policy Layer (Resource Selective Prediction, 16 entry return, 64
Allocator) Reflects Global Goals. registers, issue as many as window. FP: 8
Partition Policy layer (allocation) - 45. Integer: 6 - 12. Infinite. 256. 128.
Allocates Resources to Cells based on 64. 32. 16. 8. 4. 47. 2/23/2011. CS252-S11
Global policies Produces only lecture 10.
implementable space-time resource graphs 48How to Exceed ILP Limits of this
May deny resources to a cell that requests study? These are not laws of physics; just
them (admission control) Mapping layer practical limits for today, and perhaps
(distribution) Makes no decisions overcome via research Compiler and ISA
Time-Slices at a course granularity (when advances could change results WAR and WAW
time-slicing necessary) performs hazards through memory: eliminated WAW and
bin-packing like operation to implement WAR hazards through register renaming, but
space-time graph In limit of many not in memory usage Can get conflicts via
processors, no time multiplexing allocation of stack frames as a called
processors, merely distributing resources procedure reuses the memory addresses of a
Partition Mechanism Layer Implements previous frame on the stack. 48.
hardware partitions and secure channels 2/23/2011. CS252-S11 lecture 10.
Device Dependent: Makes use of more or 49HW v. SW to increase ILP. Memory
less hardware support for QoS and disambiguation: HW best Speculation: HW
Partitions. Mapping Layer (Resource best when dynamic branch prediction better
Distributer). Partition Mechanism Layer than compile time prediction Exceptions
ParaVirtualized Hardware To Support easier for HW HW doesnt need bookkeeping
Partitions. 19. 2/23/2011. CS252-S11 code or compensation code Very complicated
lecture 10. to get right Scheduling: SW can look ahead
20Sample of what could make good to schedule better Compiler independence:
projects. Implement new resource does not require new compiler,
partitioning mechanisms on RAMP and recompilation to run well. 49. 2/23/2011.
integrate into OS You can actually develop CS252-S11 lecture 10.
a new hardware mechanism, put into the OS, 50Discussion of papers:
and show how partitioning gives better Complexity-effective superscalar
performance or real-time behavior You processors. Complexity-effective
could develop new message-passing superscalar processors, Subbarao
interfaces and do the same Virtual I/O Palacharla, Norman P. Jouppi and J. E.
devices RAMP-Gold runs in virtual time Smith. Several data structures analyzed
Develop devices and methodology for for complexity WRT issue width Rename:
investigating real-time behavior of these Roughly Linear in IW, steeper slope for
devices in Tessellation running on RAMP smaller feature size Wakeup: Roughly
Energy monitoring and adaptation How to Linear in IW, but quadratic in window size
measure energy consumed by applications Bypass: Strongly quadratic in IW Overall
and adapt accordingly Develop and evaluate results: Bypass significant at high window
new parallel communication model Target size/issue width Wakeup+Select delay
for Multicore systems New Message-Passing dominates otherwise Proposed
Interface, New Network Routing Layer Complexity-effective design: Replace issue
Investigate applications under different window with FIFOs/steer dependent Insts to
types of hardware CUDA vs MultiCore, etc same FIFO. 50. 3/1/2010. cs252-S10,
New Style of computation, tweak on Lecture 11.
existing one Better Memory System, etc. 51Performance beyond single thread ILP.
20. 2/23/2011. CS252-S11 lecture 10. There can be much higher natural
21Projects using Quantum CAD Flow. Use parallelism in some applications (e.g.,
the quantum CAD flow developed in Database or Scientific codes) Explicit
Kubiatowiczs group to investigate Quantum Thread Level Parallelism or Data Level
Circuits Tradeoff in area vs performance Parallelism Thread: process with own
for Shors algorithm Other interesting instructions and data thread may be a
algorithms (Quantum Simulation). 21. process part of a parallel program of
2/23/2011. CS252-S11 lecture 10. multiple processes, or it may be an
22Data Value Prediction. Why do it? Can independent program Each thread has all
Break the DataFlow Boundary Before: the state (instructions, data, PC,
Critical path = 4 operations (probably register state, and so on) necessary to
worse) After: Critical path = 1 operation allow it to execute Data Level
(plus verification). 22. 2/23/2011. Parallelism: Perform identical operations
CS252-S11 lecture 10. on data, and lots of data. 51. 2/23/2011.
23Data Value Predictability. The CS252-S11 lecture 10.
Predictability of Data Values Yiannakis 52Thread Level Parallelism (TLP). ILP
Sazeides and James Smith, Micro 30, 1997 exploits implicit parallel operations
Three different types of Patterns: within a loop or straight-line code
Constant (C): 5 5 5 5 5 5 5 5 5 5 Stride segment TLP explicitly represented by the
(S): 1 2 3 4 5 6 7 8 9 Non-Stride (NS): use of multiple threads of execution that
28 13 99 107 23 456 Combinations: are inherently parallel Goal: Use multiple
Repeated Stride (RS): 1 2 3 1 2 3 1 2 3 1 instruction streams to improve Throughput
2 3 Repeadted Non-Stride (RNS): 1 -13 -99 of computers that run many programs
7 1 -13 -99 7. 23. 2/23/2011. CS252-S11 Execution time of multi-threaded programs
lecture 10. TLP could be more cost-effective to
24Computational Predictors. Last Value exploit than ILP. 52. 2/23/2011. CS252-S11
Predictors Predict that instruction will lecture 10.
produce same value as last time Requires 53And in conclusion Last Value
some form of hysteresis. Two subtle Prediction Predict that value of load will
alternatives: Saturating counter be similar (same?) as previous value Works
incremented/decremented on success/failure better than one might expect Dependence
replace when the count is below threshold Prediction: Try to predict whether load
Keep old value until new value seen depends on stores before addresses are
frequently enough Second version predicts known Store set: Set of stores that have
a constant when appears temporarily had dependencies with load in past
constant Stride Predictors Predict next Computational Based Predictors Try to
value by adding the sum of most recent construct prediction based on some actual
value to difference of two most recent computation Last Value is trivial
values: If vn-1 and vn-2 are the two most Prediction Stride Based Prediction is
recent values, then predict next value slightly more complex Context Based
will be: vn-1 + (vn-1 vn-2) The value Predictors Table Driven When see given
(vn-1 vn-2) is called the stride sequence, repeat what was seen last time
Important variations in hysteresis: Change Can reproduce complex patterns Limits to
stride only if saturating counter falls ILP (power efficiency, compilers,
below threshold Or two-delta method. Two dependencies ) seem to limit to 3 to 6
strides maintained. First (S1) always issue for practical options Must start to
updated by difference between two most utilize other types of parallelism! 53.
recent values Other (S2) used for 2/23/2011. CS252-S11 lecture 10.
CS252 Graduate Computer Architecture Lecture 10 Dependency Prediction (Cont) Data Prediction and Confidence ILP Limits February 23rd, 2011.ppt
http://900igr.net/kartinka/anglijskij-jazyk/cs252-graduate-computer-architecture-lecture-10-dependency-prediction-cont-data-prediction-and-confidence-ilp-limits-february-23rd-2011-233063.html
c

CS252 Graduate Computer Architecture Lecture 10 Dependency Prediction (Cont) Data Prediction and Confidence ILP Limits February 23rd, 2011

CS252 Graduate Computer Architecture Lecture 10 Dependency Prediction (Cont) Data Prediction and Confidence ILP Limits February 23rd, 2011

Computer - I love my job. in this business a lot of competitors who are trying to thee . manypeoplebuycomputers. this is a very popular technique. on the earned money I help the children's kindergartens. Iverylikeacomputer. andIwanttoopencomputerbusiness. Computer equipment enjoys good success. and I will be happy to open such a beneficial.

Data Mining - . . . . Data Mining. , . Data Mining. . . Data Mining. . .

Invention - The term bicycle was coined in France in the 1860s. he inventors of the first airplane were Orville and Wilbur Wright. He is considered by many Italians as the inventor of the telephone. The BBC begins regular TV transmissions. John Baird opens the first TV studio, however, the image quality was poor.

Word 2010 - Picture quick styles. Tables. Calculate with table formulas. Display a document in different views. Using an access database. Sorting and filter records. Styles. Select a main document. Printing options. Merging to E-mail. Cropping. Document properties. Create a data source. Formatting a table. Collaboration.

House - HOUSE. : . I live (2) ______ Oxford. : . . What is your address. I am (1) ______ England. table. Where do you live. What is your name. Where are you from. computer. AIR MAIL. in, on, at, from.

Youth subcultures - rocker. hippie. skinhead. raver. mod. subculture. biker. hacker. goth. punk.

46

29
900igr.net > > > CS252 Graduate Computer Architecture Lecture 10 Dependency Prediction (Cont) Data Prediction and Confidence ILP Limits February 23rd, 2011