
Presentation: «Будет ли стрим lg g4». Author: James Lebak. File: «Будет ли стрим lg g4.ppt». ZIP archive size: 2138 KB.


Contents of the presentation «Будет ли стрим lg g4.ppt»
Slide Text
1 Kernel Benchmarks and Metrics for Polymorphous Computer Architectures

Hank Hoffmann, James Lebak (Presenter), Janice McMahon
MIT Lincoln Laboratory
Seventh Annual High-Performance Embedded Computing Workshop (HPEC)
24 September 2003

This work is sponsored by the Defense Advanced Research Projects Agency under Air Force Contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.

2 Future Warfighting Scenarios Examples

Figure labels: Surveillance Satellite, Communication Satellite, SIGINT, Airborne Vehicle, Aegis Cruiser, Communication Antenna, Micro UAV, Personal Terminals

3 Polymorphous Computing

Stream processing: regular, deterministic operations; constant flow of input data

Threaded processing: complex operations; dynamic data movement

Set of homogeneous computing tiles (tile labels in figure: P, M)

morph, vt: to restructure tiles for optimized processing

4 Architectural Flexibility

Radar Processing Flow: Front-end Signal Processing; Detection/Estimation; Back-end Discrimination/Identification; Command Control

Benchmarks: Signal Processing Benchmark 1; Signal Processing Benchmark 2; Information Processing Benchmark; Knowledge Processing Benchmark; Intelligence Processing Benchmark

Operation types: Structured Bit-operations; Vectors/Streaming; Dynamic/Threading; Symbolic Operations

Axis label: Performance

5 Outline

Introduction
Kernel Benchmarks and Metrics
Programming PCA Architectures
Case Study: SVD Kernel
Conclusions

6 Kernel Synthesis from Application Survey

Specific Application Areas: Sonar, Radar, Infrared, Hyper-Spectral, SIGINT, Communication, Data Fusion

Broad Processing Categories:
  “Front-end Processing”: data independent, stream-oriented
    Signal processing, image processing, high-speed network communication
    Examples: pulse compression, adaptive beamforming, target detection
  “Back-end Processing”: data dependent, thread-oriented
    Information processing, knowledge processing
    Examples: workload optimization, target classification

Specific Kernels

MIT-LL surveyed DoD applications to provide:
  Kernel benchmark definitions
  Example requirements and data sets

7 Kernel Performance Evaluation

Kernel Benchmarks

Performance Metrics

Definitions

Floating point and integer ops
Latency
Throughput
Efficiency
Stability
Density and cost: size, weight, power

Architectures: PowerPC (G4), RAW, Smart Memory, TRIPS, MONARCH

8 Throughput-Stability Product: A New Kernel Metric

Throughput × Stability

For a given application, PCA processors should achieve a higher product of throughput and stability than conventional processors

The product rewards consistent high performance and penalizes lack of performance or lack of consistency

9 Outline

Introduction
Kernel Benchmarks and Metrics
Programming PCA Architectures
Case Study: SVD Kernel
Conclusions

10 High Performance Programming: Conventional vs. PCA Processors

Conventional: PowerPC (G4)
Characteristics: rigid memory hierarchy; rigid datapath; specialized structures
High-performance programming: change the algorithm to match the memory hierarchy; one degree of freedom; can only work with the blocking factor

PCA
Characteristics: flexible memory hierarchy; flexible datapath(s); generic structures
High-performance programming: co-optimize algorithm and architecture; many degrees of freedom; optimize the time/space tradeoff

PCA provides more degrees of freedom, and thus greater flexibility (morphability) and greater performance over a range of applications

11 Kernel Benchmarks and the PowerPC G4

Two predictors of kernel performance:
  Programmer’s maximization of data reuse and locality (blocking factor)
  Memory hierarchy of G4
Blocking factor determines max achieved performance
Memory hierarchy determines shape of performance curve
Want to maximize blocking factor to limit memory hierarchy bottleneck

PowerPC G4 7410 specs:
  500 MHz clock rate
  4 Gflop/s peak
  125 MHz main memory bus
  L1 cache: 32 kB, on chip
  L2 cache: 2 MB, 250 MHz bus
  Mercury daughtercard

Figure labels: Main Memory, L2 Cache
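
The specs above fix two numbers that the later slides rely on: the peak rate per clock cycle and the cache sizes that a working set has to fit into. The following is a minimal sketch of that arithmetic. The function names are illustrative, and treating "kB" and "MB" as powers of 1024 is an assumption.

```python
# Figures taken from the G4 7410 spec list; names are hypothetical.
CLOCK_HZ = 500e6           # 500 MHz clock rate
PEAK_FLOPS = 4e9           # 4 Gflop/s peak
L1_BYTES = 32 * 1024       # 32 kB L1 cache (assumed 1 kB = 1024 bytes)
L2_BYTES = 2 * 1024 ** 2   # 2 MB L2 cache


def flops_per_cycle(peak_flops=PEAK_FLOPS, clock_hz=CLOCK_HZ):
    """Peak floating-point operations issued per clock cycle."""
    return peak_flops / clock_hz


def memory_level(working_set_bytes):
    """Smallest level of the hierarchy that can hold the whole working set."""
    if working_set_bytes <= L1_BYTES:
        return "L1"
    if working_set_bytes <= L2_BYTES:
        return "L2"
    return "main memory"
```

With these numbers the peak works out to 8 floating-point operations per cycle. A larger blocking factor means more of those operations are performed on data that is already in the level `memory_level` reports.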

12 FIR Filter (G4)

Plots: FIR Filter Throughput (Mflop/s) and FIR Throughput × Stability, each versus Vector Length, with the Level 1 Cache and Level 2 Cache boundaries marked. Number of filters = 4; filter size = 16. Implemented with the VSIPL Real FIR Filter.

Caches are performance bottlenecks
Performance curve changes when cache is full
Product metric penalizes G4 for performance drop at cache boundaries

Mean Efficiency: 29%

PowerPC G4 (Mercury), 500 MHz, peak: 4 Gflop/s
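
For readers unfamiliar with the kernel, the sketch below shows a plain time-domain FIR filter together with the operation count and efficiency ratio that numbers such as "29%" are based on. It is not the VSIPL implementation that was measured. The output convention (one output per input sample, zeros before the start of the input) and the count of 2 flops per tap are assumptions.

```python
def fir_filter(x, h):
    """Direct-form FIR: y[n] = sum_k h[k] * x[n - k], with x[n] = 0 for n < 0."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k in range(len(h)):
            if n - k >= 0:
                acc += h[k] * x[n - k]
        y.append(acc)
    return y


def fir_flops(vector_length, taps, num_filters=1):
    """Assumed cost model: one multiply and one add per tap per output."""
    return 2 * taps * vector_length * num_filters


def efficiency(achieved_flops_per_s, peak_flops_per_s):
    """Fraction of peak throughput actually achieved."""
    return achieved_flops_per_s / peak_flops_per_s
```

Under this definition, a mean efficiency of 29% on a 4 Gflop/s peak corresponds to roughly 1.16 Gflop/s sustained.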

13 Baseline Performance Measurements: Throughput and Stability

Plots: Throughput; Data Set and Overall Stability

PowerPC G4 (Mercury), 500 MHz, 32 KB L1, 2 MB L2, peak: 4 Gflop/s

Data Set Stability: ratio of minimum to maximum throughput over all data set sizes for a particular kernel

Overall Stability: ratio of minimum to maximum throughput over all floating-point kernels and all data set sizes
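
The two stability definitions, and the throughput-stability product introduced earlier in the deck, can be sketched as follows. The slides do not spell out how the product curve is formed. Scaling each measured throughput by the kernel's data set stability is an assumption, and all function names are illustrative.

```python
def data_set_stability(throughputs):
    """Min/max ratio of one kernel's throughput over all data set sizes."""
    return min(throughputs) / max(throughputs)


def overall_stability(throughputs_by_kernel):
    """Min/max ratio over every kernel and every data set size."""
    all_values = [t for ts in throughputs_by_kernel.values() for t in ts]
    return min(all_values) / max(all_values)


def throughput_stability_product(throughputs):
    """Assumed form: each throughput scaled by the data set stability."""
    stability = data_set_stability(throughputs)
    return [t * stability for t in throughputs]
```

A stability of 1.0 means the throughput is the same at every size, so the product equals the raw throughput. Any drop at a cache boundary lowers the minimum and therefore scales the whole curve down.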

14 Stream Algorithms for Tiled Architectures

Stream algorithms achieve high efficiency by optimizing the time/space tradeoff: tailoring memory hierarchy and datapaths to the specific needs of the application

Systolic Morph (figure): tile array with edge length R
M(R) edge tiles are allocated to memory management
P(R) inner tiles perform computation systolically using registers and static network

Stream Algorithm Efficiency:
where N = problem size, R = edge length of tile array, C(N) = number of operations, T(N,R) = number of time steps, P(R) + M(R) = total number of processors
where s = N/R
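
The efficiency expression itself did not survive in the extracted text. One reading that is consistent with the symbol definitions above, offered as an assumption and not as the authors' exact formula, is the number of operations performed divided by the number of operation slots available (time steps multiplied by processors):

```latex
E(N, R) = \frac{C(N)}{T(N, R)\,\bigl(P(R) + M(R)\bigr)}
```

Under this reading, E approaches 1 when nearly every tile performs a useful operation on nearly every time step, and the memory tiles M(R) count against efficiency because they do no arithmetic.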

15 Time Domain Convolution on RAW

Stream algorithms achieve high performance by removing the memory access bottleneck from the computational critical path

RAW chip with R rows and R+2 columns:
  Number of filters = R
  Number of memory tiles: M = 2*R
  Number of processing tiles: P = R^2

Systolic Array for K Tap Filter: each row performs a number of K tap filters

Figure labels: Manage Input Vectors, Manage Output Vectors
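
The tile counts on this slide can be checked with a few lines of arithmetic. The sketch below only restates the R-row, (R+2)-column layout. The function name and the returned dictionary are illustrative.

```python
def raw_fir_tiles(r):
    """Tile counts for an array with r rows and r + 2 columns."""
    total = r * (r + 2)
    memory = 2 * r      # M = 2*R tiles managing input and output vectors
    compute = r * r     # P = R^2 tiles forming the systolic array
    assert memory + compute == total
    return {
        "total": total,
        "memory": memory,
        "compute": compute,
        "compute_fraction": compute / total,
    }
```

The compute fraction simplifies to R / (R + 2), so the share of tiles spent on memory management shrinks as the array grows.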

16 FIR Filter (RAW)

Plots: Throughput (Gflop/s) and Throughput × Stability, each versus Vector Length (128 to 512K). Number of filters = 4.

Raw implements the appropriate memory hierarchy for the problem
Raw’s Throughput × Stability score stays high

RAW: 250 MHz, 4 Gflop/s
G4: 500 MHz, 4 Gflop/s

17 Outline

Introduction
Kernel Benchmarks and Metrics
Programming PCA Architectures
Case Study: SVD Kernel
Conclusions

18 Singular Value Decomposition (SVD)

X = U S V^H, where U and V are unitary and S is diagonal

SVD is becoming more widely used in signal and image processing
  Important for spectral analysis
  Can also be used for adaptive beamforming, especially for ill-conditioned problems
SVD kernel implementation is a Reduced SVD that begins with a QR factorization if M > N
  Uses Modified Gram-Schmidt QR factorization
  Many possible optimizations, especially block factorization

Figure: Input Matrix (M rows, N columns), then Upper-Triangular Matrix, then Bidiagonal Matrix, then Diagonal Matrix S
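
The first stage of the kernel, the Modified Gram-Schmidt QR factorization, can be sketched in plain Python as follows. This is a textbook version for illustration, not the benchmarked code. It assumes the input has full column rank and M >= N, and the function name is hypothetical.

```python
def mgs_qr(a):
    """Modified Gram-Schmidt QR of an M x N matrix given as a list of rows.

    Returns (q, r): q is M x N with orthonormal columns, r is N x N upper
    triangular, and a = q * r. Assumes full column rank and M >= N.
    """
    m, n = len(a), len(a[0])
    # Work column by column; complex() handles real and complex input alike.
    v = [[complex(a[i][j]) for i in range(m)] for j in range(n)]
    q_cols = []
    r = [[0j] * n for _ in range(n)]
    for j in range(n):
        norm = sum(abs(x) ** 2 for x in v[j]) ** 0.5
        r[j][j] = complex(norm)
        qj = [x / norm for x in v[j]]
        q_cols.append(qj)
        # Remove the q_j component from every remaining column immediately;
        # this eager update is what separates MGS from classical Gram-Schmidt.
        for k in range(j + 1, n):
            r[j][k] = sum(qc.conjugate() * x for qc, x in zip(qj, v[k]))
            v[k] = [x - r[j][k] * qc for x, qc in zip(v[k], qj)]
    q = [[q_cols[j][i] for j in range(n)] for i in range(m)]
    return q, r
```

The algorithm keeps both the working copy of A and the N x N factor R live in the inner loop, which is the "A+R" working set the G4 results refer to.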

19 SVD Results (G4)

Reduced SVD of a 16-column complex matrix
Begins with MGS QR factorization (needs A+R)
L1 cache drives inner loop performance
  1: A+R fills L1 cache
  2: One column of A is half of L1 cache

Plots: SVD Throughput (Mflop/s) and SVD Throughput × Stability, each annotated with points 1 and 2

Mean Efficiency: 16%

PowerPC G4 (Mercury), 500 MHz, peak: 4 Gflop/s
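
The row counts at which points 1 and 2 occur are not given in the extracted text, but they can be estimated from the cache size. The sketch below assumes 8-byte single-precision complex elements and a 32 KB L1 cache of 32 * 1024 bytes. Both are assumptions, so the results are estimates, not values read from the plots.

```python
L1_BYTES = 32 * 1024
BYTES_PER_ELEMENT = 8   # assumption: single-precision complex
N_COLS = 16             # 16-column matrix, so R is 16 x 16


def rows_where_a_plus_r_fills_l1():
    """Point 1: (M * 16 + 16 * 16) elements occupy the whole L1 cache."""
    return (L1_BYTES // BYTES_PER_ELEMENT - N_COLS * N_COLS) // N_COLS


def rows_where_column_is_half_l1():
    """Point 2: one column of A (M elements) occupies half of L1."""
    return (L1_BYTES // 2) // BYTES_PER_ELEMENT
```

Under these assumptions point 1 falls at about M = 240 rows and point 2 at about M = 2048 rows.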

20 Modified Gram-Schmidt QR Results (G4)

Modified Gram-Schmidt QR factorization of a 16-column complex matrix
MGS is about 60% of SVD time
L1 cache drives inner loop performance
  1: A+R fills L1 cache
  2: One column of A is half of L1 cache

Plots: MGS Throughput (Mflop/s) and MGS Throughput × Stability, each annotated with points 1 and 2

Mean Efficiency: 12%

PowerPC G4 (Mercury), 500 MHz, peak: 4 Gflop/s

21 SVD for RAW Architecture

Goal is to match problem size and architecture
Use 2D systolic morph
  maximizes time/space efficiency
  uses architecture in a scalable way
Uses efficient QR/LQ approach to get to banded form
  Fast Givens approach for QR/LQ
  Decoupled algorithm with good parallelism
Banded form matches array dimension of systolic morph
  provides high locality for reduction to bidiagonal form

Raw implementation seeks to efficiently match the many possible algorithms to the many possible architectural configurations

Figure: Input Matrix (M rows, N columns), then Banded Matrix, then Bidiagonal Matrix, then Diagonal Matrix S; tiles labeled Memory Tiles and Compute Tiles
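
To show what a Givens-based QR step looks like, the sketch below reduces a real matrix to upper-triangular form with standard Givens rotations. It deliberately uses the ordinary rotation with square roots, not the square-root-free Fast Givens variant named on the slide, and it says nothing about how the Raw implementation maps the work onto tiles. The function name is hypothetical.

```python
import math


def givens_qr_r(a):
    """Return the upper-triangular factor R of a real M x N matrix.

    Uses standard Givens rotations (not the Fast Givens variant).
    """
    r = [[float(x) for x in row] for row in a]
    m, n = len(r), len(r[0])
    for j in range(n):
        # Zero column j from the bottom row up to just below the diagonal.
        for i in range(m - 1, j, -1):
            x, y = r[i - 1][j], r[i][j]
            h = math.hypot(x, y)
            if h == 0.0:
                continue
            c, s = x / h, y / h
            # Rotate rows i-1 and i; entries left of column j are already 0.
            for k in range(j, n):
                top, bot = r[i - 1][k], r[i][k]
                r[i - 1][k] = c * top + s * bot
                r[i][k] = -s * top + c * bot
            r[i][j] = 0.0  # exact zero, removing rounding residue
    return r
```

Each rotation reads and writes only two adjacent rows, which is one reason Givens-based factorizations are commonly paired with systolic arrays.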

22 RAW and G4 Results: Fast Givens QR Factorization

The QR is a key sub-kernel of the SVD

Plots: Throughput (Gflop/s) and Throughput × Stability, each versus N (for N by N matrices)

The QR performance demonstrates the benefit of the PCA approach on matrix algebra operations

23 Lincoln Laboratory PCA Testbed

Test Bed Objectives:
  Kernel performance evaluation
  Application morphing demonstration
  High-level software prototyping

Test Bed Architecture (figure labels): Unit under test; Ethernet LAN; SBC; Intel PC (dual processor, 66 MHz/64-bit wide PCI bus, running Linux); G4; DSP; PCI bus; Annapolis Wildstar; DSP/FPGA; High Speed I/O; Mercury RACE/VME (Solaris/MCOS); Clusters on LLAN

24 Outline

Introduction
Kernel Benchmarks and Metrics
Programming PCA Architectures
Case Study: SVD Kernel
Conclusions

25 Conclusions

MIT Lincoln Laboratory has defined kernel benchmarks for the PCA program
  Multiple categories of processing
  Based on DoD application needs
Establishing a performance baseline on conventional architectures
  Performance is limited by the blocking factor and by the memory hierarchy
  Example: CFAR (low ops/byte) reaches 3% efficiency; FIR (high ops/byte) reaches 29% efficiency
PCA processors allow opportunities for high performance
  Performance achieved through co-optimization of the algorithm and the architecture
  Example: unusual SVD algorithm leads to high performance on Raw
  The greater degree of freedom allows greater optimization across a variety of problem domains

26 MIT Lincoln Laboratory PCA Team

Hector Chan, Bill Coate, Jim Daly, Ryan Haney, Hank Hoffmann, Preston Jackson, James Lebak, Janice McMahon, Eddie Rutledge, Glenn Schrader, Edmund Wong

http://900igr.net/prezentacija/anglijskij-jazyk/budet-li-strim-lg-g4-261718.html