The most important quantitative fact about modern computing is that floating-point arithmetic is fast, while memory access is slow.
The computational throughput of a processor is the rate at which it can perform floating-point or integer arithmetic operations. In recent computer architectures, the computational throughput of processors has increased approximately 2x every 2 years. Over the same period, other components of computer architecture, such as DRAM, network fabrics and persistent storage, have improved much more slowly. These trends have important consequences for how computational applications should be programmed.
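One common way to quantify this imbalance is the Roofline bound: a kernel cannot run faster than the processor's peak arithmetic rate, nor faster than memory bandwidth multiplied by the kernel's arithmetic intensity (operations per byte of memory traffic). The sketch below is a minimal illustration of that formula; the function names and units are chosen here for illustration and are not part of any Intel tool.

```c
#include <assert.h>

/* Roofline bound: the attainable rate (GFLOP/s) for a kernel with the given
 * arithmetic intensity (FLOP per byte moved to or from memory).
 * All names here are hypothetical, chosen for this illustration. */
double roofline_gflops(double peak_gflops, double bandwidth_gbs, double intensity)
{
    assert(peak_gflops > 0.0 && bandwidth_gbs > 0.0 && intensity >= 0.0);
    double memory_roof = bandwidth_gbs * intensity; /* bandwidth-limited rate */
    return memory_roof < peak_gflops ? memory_roof : peak_gflops;
}

/* Ridge point: the arithmetic intensity above which a kernel can be
 * compute-bound rather than memory-bound. */
double ridge_intensity(double peak_gflops, double bandwidth_gbs)
{
    assert(peak_gflops > 0.0 && bandwidth_gbs > 0.0);
    return peak_gflops / bandwidth_gbs;
}
```

For example, on a hypothetical machine with a 1000 GFLOP/s peak and 100 GB/s of memory bandwidth, a kernel performing 0.25 FLOP per byte is capped at 25 GFLOP/s, while any kernel above 10 FLOP per byte can in principle reach the arithmetic peak.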
In this course, you will learn to use the Roofline model to identify applications that can be compute-bound. You will see how to use Intel® Parallel Studio XE tools (Intel® Advisor, Intel® VTune™ Amplifier and its Application Performance Snapshot tool) to detect optimization opportunities that make your application:
- vectorized with the appropriate instruction set
- free of inner loop setup overheads
- efficient in ALU pipeline utilization through the chaining of multiple independent instructions
- economical in terms of memory access thanks to loop tiling (or cache blocking)
- more efficient through specialization for fixed problem parameters and sizes
- portable and future-proof through the use of Intel’s performance libraries and explicit threading/SIMD constructs
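As a small illustration of one item in this list, the chaining of independent instructions, the sketch below contrasts a reduction with a single dependency chain against one with four independent accumulators. It is a generic example written for this description, not code from the course; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: each addition depends on the previous one, so the loop forms a
 * single dependency chain and cannot keep the ALU pipeline full. */
float sum_single_chain(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Four independent accumulators: the four additions in the loop body do not
 * depend on each other, so the processor can overlap them in the pipeline. */
float sum_four_chains(const float *x, size_t n)
{
    assert(x != NULL || n == 0);
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    for (; i < n; i++)      /* remainder when n is not a multiple of 4 */
        a0 += x[i];
    return (a0 + a1) + (a2 + a3);
}
```

Because floating-point addition is not associative, the two functions can differ in the last bits of the result for general inputs. That is also why compilers typically do not apply this transformation on their own unless relaxed floating-point semantics are enabled.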
The course is structured as a combination of instructional material (30%) and hands-on guided exercises (70%). After enrolling in the course, you can download the source code of the hands-on exercise, a compute-bound 64-tap FIR filter. You can compile the application on your computer and proceed with the lessons. You will run and analyze the application in Intel® Parallel Studio XE tools and bring it close to the theoretical peak arithmetic performance limit. You can also download and open the Advisor and VTune results collected on a two-way server based on Intel® Xeon® Scalable processors with a total of 12 CPU cores.
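For readers unfamiliar with FIR filters, the following is a minimal, unoptimized sketch of what a 64-tap filter computes. It is not the course's source code: the function name, the single-precision data type and the input layout (the first 63 samples of `x` serve as history) are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

#define FIR_TAPS 64

/* Direct-form FIR filter: y[i] = sum over k of h[k] * x[i + FIR_TAPS - 1 - k].
 * The input x must hold n_out + FIR_TAPS - 1 samples; the first FIR_TAPS - 1
 * of them are the history preceding the first output. (Hypothetical layout.) */
void fir64(const float *x, const float h[FIR_TAPS], float *y, size_t n_out)
{
    assert(x != NULL && h != NULL && y != NULL);
    for (size_t i = 0; i < n_out; i++) {
        float acc = 0.0f;
        for (size_t k = 0; k < FIR_TAPS; k++)
            acc += h[k] * x[i + FIR_TAPS - 1 - k]; /* one multiply-add per tap */
        y[i] = acc;
    }
}
```

Under these assumptions, each output sample costs 64 multiply-adds (128 floating-point operations), while in steady state only one new 4-byte input is read and one 4-byte output is written. If the coefficients and the sliding window stay in cache or registers, that is roughly 16 operations per byte of memory traffic, which is why a kernel of this kind can be compute-bound.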