1. Boosting Mobile GPU Performance with a Decoupled Access/Execute Fragment Processor

José-María Arnau, Joan-Manuel Parcerisa (UPC), Polychronis Xekalakis (Intel)

2. Focusing on Mobile GPUs

Energy-efficient mobile GPUs:
- Market demands
- Technology limitations

[1] http://www.digitalversus.com/mobile-phone/samsung-galaxy-note-p11735/test.html (Samsung Galaxy SII vs Samsung Galaxy Note when running the game Shadow Gun 3D)
[2] http://www.ispsd.com/02/battery-psd-templates/

3. GPU Performance and Memory

A mobile single-threaded GPU with perfect caches achieves a speedup of 3.2x on a set of commercial Android games.

Graphical workloads:
- Large working sets not amenable to caching
- Texture memory accesses are fine-grained and unpredictable

Traditional techniques to deal with memory:
- Caches
- Prefetching
- Multithreading

4. Outline

- Background
- Methodology
- Multithreading & Prefetching
- Decoupled Access/Execute
- Conclusions

5. Assumed GPU Architecture

6. Assumed Fragment Processor

Warp: a group of threads executed in lockstep (a SIMD group).
- 4 threads per warp
- 4-wide vector registers (16 bytes)
- 36 registers per thread
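As a quick sanity check, these per-warp parameters reproduce the 2304 bytes per warp listed in the methodology slide. A minimal arithmetic sketch in Python (illustrative only, not code from the presentation):

```python
# Register storage needed per warp, from the slide's parameters.
THREADS_PER_WARP = 4
REGISTERS_PER_THREAD = 36
BYTES_PER_REGISTER = 16  # one 4-wide vector register of 32-bit components

bytes_per_warp = THREADS_PER_WARP * REGISTERS_PER_THREAD * BYTES_PER_REGISTER
print(bytes_per_warp)    # 2304, matching the "Register file size" row in the methodology
```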


7. Methodology

Simulation parameters:
- Main memory: latency = 100 cycles, bandwidth = 4 bytes/cycle
- Pixel/texture caches: 2 KB, 2-way, 2-cycle access
- L2 cache: 32 KB, 8-way, 12-cycle access
- Number of cores: 4 vertex processors, 4 pixel processors
- Warp width: 4 threads
- Register file size: 2304 bytes per warp
- Number of warps: 1-16 warps/core
- Power model: CACTI 6.5 and Qsilver
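For experimentation, the parameters above map naturally onto a simulator configuration object. The sketch below is illustrative only; the class and field names are assumptions of mine, not the interface of the authors' simulator:

```python
from dataclasses import dataclass

# Hypothetical container for the simulation parameters listed above;
# names are illustrative, not the authors' simulator interface.
@dataclass
class GPUSimConfig:
    mem_latency_cycles: int = 100
    mem_bandwidth_bytes_per_cycle: int = 4
    l1_size_bytes: int = 2 * 1024      # pixel/texture caches
    l1_ways: int = 2
    l1_latency_cycles: int = 2
    l2_size_bytes: int = 32 * 1024
    l2_ways: int = 8
    l2_latency_cycles: int = 12
    vertex_processors: int = 4
    pixel_processors: int = 4
    warp_width_threads: int = 4
    register_file_bytes_per_warp: int = 2304
    warps_per_core: int = 2            # swept from 1 to 16 in the study

config = GPUSimConfig()
```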


8. Workload Selection

2D games:
- Small/medium sized textures
- Texture filtering: 1 memory access
- Small fragment programs

Simple 3D games:
- Small/medium sized textures
- Texture filtering: 1-4 memory accesses
- Small/medium fragment programs

Complex 3D games:
- Medium/big sized textures
- Texture filtering: 4-8 memory accesses
- Big, memory-intensive fragment programs

9. Improving Performance Using Multithreading

- Very effective, but carries a high energy cost (25% more energy)
- Requires a huge register file to maintain the state of all the threads: a 36 KB MRF for a GPU with 16 warps/core, bigger than the L2 cache (see the arithmetic sketch below)
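The 36 KB figure follows directly from the 2304 bytes per warp given in the methodology slide. A minimal arithmetic sketch (illustrative, not from the presentation):

```python
# Main register file (MRF) size per core when 16 warps are kept in flight.
BYTES_PER_WARP = 2304          # from the methodology table
WARPS_PER_CORE = 16

mrf_bytes = BYTES_PER_WARP * WARPS_PER_CORE
print(mrf_bytes)               # 36864 bytes = 36 KB
print(mrf_bytes > 32 * 1024)   # True: the MRF exceeds the 32 KB L2 cache
```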


10. Employing Prefetching

Hardware prefetchers:
- Global History Buffer: K. J. Nesbit and J. E. Smith, "Data Cache Prefetching Using a Global History Buffer", HPCA 2004.
- Many-Thread Aware: J. Lee, N. B. Lakshminarayana, H. Kim and R. Vuduc, "Many-Thread Aware Prefetching Mechanisms for GPGPU Applications", MICRO 2010.

Prefetching is effective, but there is still ample room for improvement.
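The Global History Buffer prefetcher cited above correlates recent miss addresses to predict future ones. The sketch below is a much-simplified, PC-localized delta variant written for illustration; the class name and interface are my own, and it is not the exact mechanism or configuration evaluated in the study:

```python
from collections import defaultdict, deque

# Much-simplified delta prefetcher in the spirit of the Global History Buffer
# (Nesbit & Smith, HPCA 2004). A real GHB chains miss entries through one
# circular buffer; here each PC simply keeps its last few miss addresses.
class SimpleGHBPrefetcher:
    def __init__(self, history_per_pc=4, degree=2):
        self.history = defaultdict(lambda: deque(maxlen=history_per_pc))
        self.degree = degree  # prefetches issued per trigger

    def on_miss(self, pc, addr):
        """Record a cache miss and return the addresses to prefetch."""
        hist = self.history[pc]
        hist.append(addr)
        if len(hist) < 3:
            return []
        d1 = hist[-1] - hist[-2]
        d2 = hist[-2] - hist[-3]
        if d1 != d2 or d1 == 0:
            return []          # no stable delta detected for this PC
        # Stable delta: predict the next `degree` addresses in the stream.
        return [addr + d1 * i for i in range(1, self.degree + 1)]

# Example: a texture-fetch PC missing with a regular 64-byte stride.
pf = SimpleGHBPrefetcher()
for a in (0x1000, 0x1040, 0x1080):
    hints = pf.on_miss(pc=0x400, addr=a)
print([hex(h) for h in hints])  # ['0x10c0', '0x1100'] once the delta repeats
```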


11. Decoupled Access/Execute

- Use the fragment information to compute the addresses that will be requested when processing the fragment
- Issue the memory requests while the fragments are waiting in the tile queue (sketched below)
- Tile queue size: too small and timeliness is not achieved; too big and cache conflicts appear
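A minimal Python sketch of this decoupling, under simplifying assumptions of my own (the queue, fragment, and cache interfaces are hypothetical stand-ins, not the hardware described in the presentation): texture addresses are computed and prefetched when a fragment enters the tile queue, so the data is likely to be cache-resident by the time the fragment is shaded.

```python
from collections import deque

class DecoupledFragmentQueue:
    """Illustrative tile queue that prefetches texture data ahead of execution.

    `compute_texture_addrs(fragment)` and `cache` are hypothetical stand-ins
    for the address-generation logic and the texture cache of a real design.
    """
    def __init__(self, cache, compute_texture_addrs, max_size=16):
        self.cache = cache
        self.compute_texture_addrs = compute_texture_addrs
        self.queue = deque()
        self.max_size = max_size  # too small: no timeliness; too big: cache conflicts

    def enqueue(self, fragment):
        # Access phase: generate and prefetch the fragment's texture addresses
        # as soon as it enters the tile queue, long before it is shaded.
        addrs = self.compute_texture_addrs(fragment)
        for addr in addrs:
            self.cache.prefetch(addr)
        self.queue.append((fragment, addrs))

    def dequeue(self):
        # Execute phase: by now the prefetches have (hopefully) filled the
        # cache, so the shader's texture reads hit instead of stalling.
        return self.queue.popleft()

# Minimal usage with a dummy cache that just records prefetched addresses.
class DummyCache:
    def __init__(self): self.prefetched = set()
    def prefetch(self, addr): self.prefetched.add(addr)

q = DecoupledFragmentQueue(DummyCache(), compute_texture_addrs=lambda frag: frag["texels"])
q.enqueue({"x": 0, "y": 0, "texels": [0x2000, 0x2040]})
fragment, addrs = q.dequeue()
```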


12. Inter-Core Data Sharing

- 66.3% of cache misses are requests for data that is already available in the L1 cache of another fragment processor
- Use the prefetch queue to detect inter-core data sharing (sketched below)
- Saves bandwidth to the L2 cache
- Saves power (the L1 caches are smaller than the L2)
- The associative comparisons require additional energy
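A sketch of how an L1 miss might be steered to a remote L1 instead of the L2. The interfaces are hypothetical stand-ins of mine (the presentation detects sharing through the prefetch queues; here a plain lookup over the other cores' L1 contents stands in for that hardware):

```python
def service_l1_miss(addr, local_core, cores, l2):
    """Illustrative miss handler: prefer a remote L1 that already holds `addr`.

    `core.l1_has(addr)`, `core.l1_read(addr)` and `l2.read(addr)` are
    hypothetical interfaces, not the hardware described in the presentation.
    """
    for core in cores:
        if core is local_core:
            continue
        # Associative comparison against the other core's tracked addresses;
        # this check itself costs some energy, as the slide notes.
        if core.l1_has(addr):
            # Remote L1 hit: avoids an L2 access, saving L2 bandwidth and
            # power (the L1 caches are much smaller than the L2).
            return core.l1_read(addr)
    # No sharer found: fall back to the shared L2 cache.
    return l2.read(addr)
```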


13. Decoupled Access/Execute

- 33% faster than hardware prefetchers, with 9% energy savings
- DAE with 2 warps/core achieves 93% of the performance of a bigger GPU with 16 warps/core, while providing 34% energy savings

14. Benefits of Remote L1 Cache Accesses

Single-threaded GPU, with the Global History Buffer prefetcher as the baseline:
- 30% speedup
- 5.4% energy savings

15. Conclusions

- High-performance, energy-efficient GPUs can be architected around the decoupled access/execute concept
- Combining decoupled access/execute (to hide memory latency) with multithreading (to hide functional-unit latency) provides the most energy-efficient solution
- Allowing remote L1 cache accesses saves L2 cache bandwidth and energy
- The decoupled access/execute architecture outperforms hardware prefetchers: 33% speedup and 9% energy savings

16. Boosting Mobile GPU Performance with a Decoupled Access/Execute Fragment Processor

Thank you! Questions?

José-María Arnau (UPC), Joan-Manuel Parcerisa (UPC), Polychronis Xekalakis (Intel)
