DMTK, a Machine Learning Toolkit from Microsoft

Around the same time that Google announced it was open-sourcing TensorFlow, Microsoft published DMTK, its Distributed Machine Learning Toolkit, on GitHub. While Google released a single-machine version of TensorFlow, DMTK runs on a cluster of machines.

DMTK is a parameter server framework for training machine learning models on large amounts of data across a cluster of machines. It takes care of data storage and operations as well as inter-process and inter-thread communication. DMTK is written in C++, comes with a client API and SDK, and uses ZeroMQ and/or MPI for communication. To emphasize its capabilities, Microsoft said that “using the toolkit one can train a topic model with one million topics and a 20-million word vocabulary, or a word-embedding model with 1000 dimensions and a 20-million word vocabulary, on a web document collection with 200 billion tokens utilizing a cluster of just 24 machines. That workload would previously have required thousands of machines.”
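
To make the parameter server idea concrete, the following is a minimal single-process sketch in Python. It is not DMTK's API: the `ParameterServer` class, its `pull` and `push` methods, the `worker` and `train` functions, and the toy linear-regression data are all hypothetical names chosen for illustration, and threads stand in for the machines of a real cluster. Each worker repeatedly pulls the shared weights, computes a gradient on its own data shard, and pushes an update back.

```python
import random
import threading


class ParameterServer:
    """Toy in-process stand-in for a parameter server (hypothetical, not DMTK's API)."""

    def __init__(self, dim):
        self._weights = [0.0] * dim
        self._lock = threading.Lock()

    def pull(self):
        # Workers fetch a snapshot of the current shared parameters.
        with self._lock:
            return list(self._weights)

    def push(self, delta):
        # Workers send additive updates; the server merges them into the model.
        with self._lock:
            for i, d in enumerate(delta):
                self._weights[i] += d


def worker(server, shard, steps, lr):
    """Run gradient descent for a linear model on one data shard."""
    for _ in range(steps):
        w = server.pull()
        grad = [0.0] * len(w)
        for x, y in shard:
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for i, xi in enumerate(x):
                grad[i] += err * xi / len(shard)
        server.push([-lr * g for g in grad])


def train(num_workers=4, steps=200, lr=0.1, seed=0):
    rng = random.Random(seed)
    true_w = [2.0, -3.0, 0.5]  # weights the toy model should recover
    data = []
    for _ in range(400):
        x = [rng.gauss(0.0, 1.0) for _ in true_w]
        y = sum(wi * xi for wi, xi in zip(true_w, x))
        data.append((x, y))
    # Each worker sees only its own slice of the data.
    shards = [data[i::num_workers] for i in range(num_workers)]
    server = ParameterServer(dim=len(true_w))
    threads = [
        threading.Thread(target=worker, args=(server, shard, steps, lr))
        for shard in shards
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return server.pull(), true_w


if __name__ == "__main__":
    learned, expected = train()
    print("learned:", [round(v, 3) for v in learned])
    print("expected:", expected)
```

In a real deployment the `pull` and `push` calls would travel over the network (DMTK uses ZeroMQ and/or MPI for that), and the parameters themselves would typically be split across several server nodes rather than held in one object.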

When open-sourcing the framework, Microsoft made a number of tools available:

DMTK – the underlying machine learning framework.
LightLDA – an algorithm for training topic models on large-scale data. According to the LightLDA paper, it can be used to train “1 million topics and a 1-million-word vocabulary (for a total of 1 trillion parameters), on a document collection with 200 billion tokens” on an 8-machine cluster. Microsoft uses it to train models for Bing.
Distributed Word Embedding (DWE) – a parallelization of the Word2Vec algorithm.
Distributed Multi-sense Word Embedding (DMWE) – a parallelization of the Skip-Gram Mixture algorithm used for polysemous words.
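
The parameter count quoted for LightLDA follows directly from the size of the topic-word table: one parameter per (topic, word) pair. The short Python sketch below reproduces that arithmetic. The function names are hypothetical, and the 4-bytes-per-parameter dense storage figure is an assumption made purely for illustration; it says nothing about how LightLDA actually stores its model.

```python
def parameter_count(num_topics, vocabulary_size):
    """One parameter per (topic, word) pair in a topic-word table."""
    return num_topics * vocabulary_size


def dense_size_terabytes(num_parameters, bytes_per_parameter=4):
    """Size of a dense table in decimal terabytes (4 bytes each is an assumption)."""
    return num_parameters * bytes_per_parameter / 10**12


if __name__ == "__main__":
    params = parameter_count(10**6, 10**6)
    print(f"{params:,} parameters")              # 1,000,000,000,000 parameters
    print(dense_size_terabytes(params), "TB")   # 4.0 TB
```

Under that dense-storage assumption the table alone would be far larger than a single machine's memory, which helps explain why models of this size are trained across a cluster.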

While DMTK has so far been used for topic modeling and word embedding, it can also be used for “computer vision, speech recognition and textual understanding,” according to Microsoft.

The source code is available on GitHub; alternatively, Microsoft provides binaries for Windows and Linux.
