Magika 1.0: Smarter, Faster File Detection with Rust and AI

Google has just released version 1.0 of Magika, a substantial rewrite of its open-source file type detection system. The new version leverages AI to support a broader range of file types and is built in Rust for maximum speed and security.

Magika 1.0 brings the number of supported file types to over 200, up from 100 in the previous Python version.

Google highlights that many of the newly added file types are specialized text-based formats that were previously difficult to detect. These include Dockerfiles, TOML, HCL, Bazel files, and many more. Magika 1.0 can also distinguish between source code files written in Swift, Kotlin, TypeScript, Dart, WebAssembly, and Zig. Additionally, it supports file types commonly used in data science, such as Jupyter Notebooks, NumPy arrays, PyTorch models, ONNX files, and others.

In addition to supporting a wider range of file types, Magika 1.0 offers greater granularity, distinguishing similar formats that were previously grouped together, such as TypeScript and JavaScript, C++ and C, and TSV and CSV.

To enable the tool to detect this wide diversity of formats, Google engineers created a large dataset of file format samples to train a specialized AI model. The sheer volume of data represented a challenge in itself:

"Our training dataset grew to over 3TB when uncompressed, which required an efficient processing pipeline. To handle this, we leveraged our recently released SedPack dataset library. This tool allows us to stream and decompress this large dataset directly to memory during training, bypassing potential I/O bottlenecks and making the process feasible."
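As a rough illustration of streaming compressed data straight into memory, the following sketch uses only Python's standard library. It does not use SedPack's API: the shard layout (gzip files holding one record per line) and the names write_shard and stream_records are assumptions made for this example.

```python
import gzip
import tempfile
from pathlib import Path
from typing import Iterable, Iterator


def write_shard(path: Path, records: Iterable[bytes]) -> None:
    """Write records to a gzip-compressed shard, one record per line."""
    with gzip.open(path, "wb") as f:
        for record in records:
            f.write(record + b"\n")


def stream_records(shards: Iterable[Path]) -> Iterator[bytes]:
    """Yield records one at a time, decompressing each shard on the fly.

    Nothing is unpacked to disk: only the decompression buffer and the
    current record are held in memory.
    """
    for shard in shards:
        with gzip.open(shard, "rb") as f:
            for line in f:
                yield line.rstrip(b"\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        shard_a = Path(tmp) / "shard-0.gz"
        shard_b = Path(tmp) / "shard-1.gz"
        write_shard(shard_a, [b"sample 1", b"sample 2"])
        write_shard(shard_b, [b"sample 3"])
        for record in stream_records([shard_a, shard_b]):
            print(record)
```

Because stream_records is a generator, a training loop could consume records as they are decompressed instead of waiting for a full unpacked copy of the dataset.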

At the same time, several formats, including recent, legacy, and specialized ones, were significantly underrepresented. Google addressed this by using Gemini to "create a high-quality, synthetic training set by translating existing code and other structured files from one format to another".

Google says that Magika achieves ~99% average precision and recall, outperforming existing approaches, especially on textual content types.

Another significant advantage of Magika 1.0 is its completely rewritten core, which uses Rust to maximize performance and enhance memory safety. The new Rust-based engine is at the heart of Magika's command-line tool, which can scan hundreds of files per second on a single core:

"Magika is able to identify hundreds of files per second on a single core and easily scale to thousands per second on modern multi-core CPUs thanks to the use of the high-performance ONNX Runtime for model inference and Tokio for asynchronous parallel processing."
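The general pattern of scanning many files concurrently with a bound on in-flight work can be sketched in Python with asyncio. This is only an analogy to the Tokio-based design described above, not Magika's implementation: classify_stub is a hypothetical stand-in for model inference, and scan and max_in_flight are names invented for the example.

```python
import asyncio
from pathlib import Path
from typing import Callable


def classify_stub(path: Path) -> str:
    """Hypothetical stand-in for inference: guesses from the first bytes."""
    with path.open("rb") as f:
        head = f.read(4)
    if head.startswith(b"%PDF"):
        return "pdf"
    if head.startswith(b"PK"):
        return "zip"
    return "unknown"


async def scan(
    paths: list[Path],
    classify: Callable[[Path], str] = classify_stub,
    max_in_flight: int = 8,
) -> dict[Path, str]:
    """Classify many files concurrently, at most max_in_flight at a time."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def worker(path: Path) -> tuple[Path, str]:
        async with semaphore:
            # Run the blocking read and classification in a worker thread
            # so the event loop stays free to schedule other files.
            label = await asyncio.to_thread(classify, path)
        return path, label

    results = await asyncio.gather(*(worker(p) for p in paths))
    return dict(results)


if __name__ == "__main__":
    files = [p for p in Path(".").iterdir() if p.is_file()]
    for path, label in asyncio.run(scan(files)).items():
        print(f"{path}: {label}")
```

The semaphore keeps the number of files being processed at once bounded, so scanning a very large directory does not open every file at the same time.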

Based on Google's benchmarks, this approach makes it possible to process nearly 1,000 files per second on a MacBook Pro (M4). As Reddit user robertknight2 explains, in this workflow:

"Rust is used for extracting feature vectors from files using a small subset of the content and driving the scanning process via a tokio-based loop. The ML inference which predicts the file type based on extracted features is however done in C++ by ONNX Runtime (via the ort crate)."

The tool incurs a one-time performance cost when it first loads the model, but afterwards it takes around 5ms per file, with nearly constant inference time regardless of file size.
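One way to see why inference time can stay nearly constant is that the model only needs a fixed-size input, however large the file is. The sketch below assumes that the "small subset of the content" is taken from the start and end of the file; BLOCK_SIZE, PADDING, and extract_features are hypothetical names and values chosen for illustration, not Magika's actual parameters.

```python
import os

BLOCK_SIZE = 1024  # hypothetical: bytes sampled from each end of the file
PADDING = 256      # hypothetical padding token, outside the 0-255 byte range


def extract_features(path: str, block_size: int = BLOCK_SIZE) -> list[int]:
    """Return a fixed-length vector built from the file's first and last bytes."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        head = f.read(block_size)
        # Seek so that at most block_size bytes remain before end of file.
        f.seek(max(0, size - block_size))
        tail = f.read(block_size)
    # Pad short files so the vector length never depends on file size.
    head_part = list(head) + [PADDING] * (block_size - len(head))
    tail_part = [PADDING] * (block_size - len(tail)) + list(tail)
    return head_part + tail_part


if __name__ == "__main__":
    features = extract_features(__file__)
    print(len(features))  # always 2 * BLOCK_SIZE
```

A 3-byte file and a multi-gigabyte file both yield a vector of 2 * BLOCK_SIZE values, so the work done per file is bounded. As a sanity check on the quoted figures, 5ms per file corresponds to 1 / 0.005 = 200 files per second on one core, which matches the "hundreds of files per second" claim.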

Although some have viewed the adoption of Rust negatively, X user Caleb Maclennan noted that "the security implications of heuristic guessing how to handle inputs make Rust a good pick". User Mazzarito added:

"When file extensions are missing, or when they cannot be trusted such as during file uploads, this type of program is actually quite valuable. File types are simply conventions, but there is no standard way to determine its type other than trying to read it with its decoder."

You can install Magika's command-line tool by executing:

curl -LsSf https://securityresearch.google/magika/install.sh | sh

Alternatively, you can install the Python package, which includes the CLI tool, by running pipx install magika.
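For use as a library, the package would typically be added to a project's environment (for example with pip) instead of through pipx. The following minimal sketch assumes a recent release of the magika package in which results expose output.label; older releases may name this field differently. The helper detect_label is a name introduced here for illustration.

```python
def detect_label(data: bytes) -> str:
    """Return Magika's predicted content-type label for a byte string."""
    # Imported inside the function so this sketch can be loaded even
    # where the magika package is not installed.
    from magika import Magika

    # Creating the instance loads the model; in real code, create it once
    # and reuse it for many files to pay that cost a single time.
    magika = Magika()
    result = magika.identify_bytes(data)
    return result.output.label


if __name__ == "__main__":
    print(detect_label(b"function log(msg) { console.log(msg); }"))
```

Reusing a single Magika instance across calls matters because, as noted above, model loading is a one-time cost while each subsequent identification is fast.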

About the Author

Sergio De Simone
