The Python® programming language occupies a diverse habitat ranging from web applications and gaming to scientific computing and artificial intelligence (AI). The use of Python in AI has driven rapid growth in its popularity in recent years, making it one of the most widely used software development tools. With the release of the Intel® Distribution for Python*, the language has gained significant performance improvements for general-purpose computing and machine learning, making it a practical choice for production applications in a high-performance environment.
However, controlling performance in Python applications is not trivial. Python's concise syntax rests on a vast forest of libraries, so much of the work a program does is hidden behind layers of abstraction and implicit library calls. This hidden complexity makes it difficult to see where time is spent and to improve the performance of computing applications built in Python.
This is where Intel® Parallel Studio XE tools come in. One of the tools, the Intel® Math Kernel Library, has been integrated into the key computational libraries of the Intel® Distribution for Python, including NumPy, SciPy, TensorFlow, and others. Another tool, the Intel® C++ Compiler, provides a way to build high-performance functions that are implemented in C/C++ and callable from Python programs. Finally, Intel® VTune™ Amplifier supports hotspot detection in Python code and also allows general-purpose analysis of performance events that occur while a Python application runs.
In this course, you will learn to use Intel® VTune™ Amplifier to analyze and improve the performance of computational applications written in Python and running under the Intel® Distribution for Python. We will demonstrate and discuss:
- detecting hotspots (most time-consuming functions) in a Python application;
- what vectorization means in Python code, and how to diagnose and improve it;
- the limitations on multi-threading imposed by the global interpreter lock (GIL);
- using multiple cores in workloads through multiprocessing and OpenMP-enabled functions;
- finding common threading issues such as synchronization, serialization, and load imbalance;
- identifying subtle performance issues, such as type conversions and sub-optimal transcendental math.
The course combines instructional material (30%) with hands-on guided exercises (70%). After enrolling in the course, you can download the source code of the hands-on exercise, which is an image normalization workload. You can run the application on your own computer as you proceed through the lessons. You will run and analyze the application in Intel® VTune™ Amplifier and accelerate it substantially. You can also download performance analysis results collected on a server based on a two-way Intel® Xeon® Scalable processor with a total of 12 CPU cores, and open them in Intel® VTune™ Amplifier.
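To give a flavor of the kind of code the exercise deals with, here is a minimal sketch of an image normalization routine written two ways: with explicit Python loops and with vectorized NumPy operations. This is not the course's actual source code. The function names `normalize_loop` and `normalize_vectorized`, and the choice of min-max scaling to the range [0, 1], are assumptions made for illustration only.

```python
import numpy as np


def normalize_loop(image):
    """Min-max scale a 2-D image to [0, 1] using explicit Python loops.

    Hypothetical example: every pixel is handled by the interpreter,
    so a profiler would be expected to report this function as a hotspot.
    """
    rows, cols = image.shape
    lo = float(image.min())
    hi = float(image.max())
    span = hi - lo if hi > lo else 1.0  # avoid division by zero on flat images
    out = np.empty((rows, cols), dtype=np.float64)
    for i in range(rows):
        for j in range(cols):
            out[i, j] = (image[i, j] - lo) / span
    return out


def normalize_vectorized(image):
    """Same min-max scaling, expressed as whole-array NumPy operations.

    The arithmetic runs inside NumPy's compiled routines instead of
    the Python interpreter loop.
    """
    lo = float(image.min())
    hi = float(image.max())
    span = hi - lo if hi > lo else 1.0
    # One explicit conversion up front avoids repeated per-element type conversions.
    return (image.astype(np.float64) - lo) / span


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
    same = np.allclose(normalize_loop(image), normalize_vectorized(image))
    print("results match:", same)
```

Both functions should produce the same output. The difference lies in where the time goes, which is the kind of question a hotspot analysis is meant to answer.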