OpenAI has released its newest language, Triton. This open-source programming language, which enables researchers to write highly efficient GPU code for AI workloads, is Python-compatible and allows newcomers to GPU programming to achieve expert-quality results in as few as 25 lines of code.
Triton uses Python as its base. Developers write code in Python using Triton’s libraries, and that code is then JIT-compiled to run on the GPU. This allows integration with the rest of the Python ecosystem, currently the dominant environment for developing machine-learning solutions.
Triton’s libraries, reminiscent of NumPy, provide a variety of matrix operations as well as, for instance, functions that perform reductions on arrays according to some criterion. The user combines these primitives in their own code and adds the @triton.jit decorator so that the function is compiled to run on the GPU. In this sense Triton also resembles Numba, the project that allows numerically intensive Python code to be JIT-compiled to machine-native assembly for speed.
Simple examples of Triton at work include a vector addition kernel and a fused softmax operation. The latter example can run many times faster than the native PyTorch fused softmax for operations that can be done entirely in GPU memory.
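To give a flavor of the programming model, here is a minimal sketch of a vector-addition kernel in the style of Triton’s own tutorials; the tensor sizes and the block size of 1,024 are illustrative choices, not details from the announcement.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one block of BLOCK_SIZE elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements          # guard against out-of-bounds accesses
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    # Launch a 1D grid with one program per block of elements.
    grid = (triton.cdiv(n_elements, 1024),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out


x = torch.rand(98432, device="cuda")
y = torch.rand(98432, device="cuda")
print(torch.allclose(add(x, y), x + y))
```

Note that the kernel is written in terms of blocks of values rather than individual threads: indexing, masking, and the launch grid are all expressed in Python, and the @triton.jit decorator compiles the function for the GPU. Because the launch accepts PyTorch tensors directly, a kernel like this slots into existing Python machine-learning code.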
According to OpenAI, Triton simplifies the development of specialized kernels that can be much faster than those in general-purpose libraries. Its compiler simplifies the code and automatically optimizes and parallelizes it, converting it into code that executes on recent Nvidia GPUs.
The architecture of modern GPUs can be broken down into three major components: DRAM, SRAM, and ALUs. Each must be considered when optimizing CUDA code; one cannot overlook the challenges that come with GPU programming, such as making sure that memory transfers from DRAM are coalesced to leverage the large bus widths of today’s memory interfaces. Data must be manually stashed in SRAM before it is re-used, and managed so as to minimize shared-memory bank conflicts when it is retrieved.
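Triton’s compiler takes over much of this bookkeeping: the developer describes operations over blocks of data, and the compiler handles memory coalescing, on-chip staging, and scheduling. The fused softmax mentioned earlier illustrates the benefit. The sketch below, again modeled on Triton’s tutorials, keeps each row on-chip for the entire normalization; it assumes the row fits in SRAM and that the input tensor is contiguous.

```python
import torch
import triton
import triton.language as tl


@triton.jit
def softmax_kernel(out_ptr, in_ptr, row_stride, n_cols, BLOCK_SIZE: tl.constexpr):
    # One program instance normalizes one row; the row stays on-chip throughout.
    row_idx = tl.program_id(0)
    col_offsets = tl.arange(0, BLOCK_SIZE)
    mask = col_offsets < n_cols
    row = tl.load(in_ptr + row_idx * row_stride + col_offsets,
                  mask=mask, other=-float("inf"))
    row = row - tl.max(row, axis=0)          # numerically stable softmax
    numerator = tl.exp(row)
    denominator = tl.sum(numerator, axis=0)
    tl.store(out_ptr + row_idx * row_stride + col_offsets,
             numerator / denominator, mask=mask)


def softmax(x: torch.Tensor) -> torch.Tensor:
    n_rows, n_cols = x.shape
    # BLOCK_SIZE must be a power of two large enough to cover a full row.
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    out = torch.empty_like(x)  # same (contiguous) layout assumed for input and output
    softmax_kernel[(n_rows,)](out, x, x.stride(0), n_cols, BLOCK_SIZE=BLOCK_SIZE)
    return out
```

Because each row is loaded once and reused for the max, exponential, and sum, the kernel avoids the repeated round trips to DRAM that a sequence of separate operations would incur.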
OpenAI claims that automating these optimizations makes it possible to reach peak hardware performance without much effort, making it easier than before to build more complex workflows. When intermediate data is too large to stay in fast on-chip memory, a general-purpose implementation can end up transferring almost double the necessary amount of memory; Triton’s fused kernels avoid that extra traffic, which is why they can be faster. OpenAI also argues that Triton code is easier to write than hand-tuned CUDA kernels such as those inside PyTorch, and easier to understand and maintain.
Triton is intended to become a community-driven project, and its repository is available to fork on GitHub.