
DMTK, a Machine Learning Toolkit from Microsoft

by Abel Avram on Nov 13, 2015. Estimated reading time: 1 minute

At about the same time that Google announced it was open sourcing TensorFlow, Microsoft published DMTK, a Distributed Machine Learning Toolkit, on GitHub. While Google released a single-machine version of TensorFlow, DMTK runs on a cluster of machines.

DMTK is a parameter server framework for training machine learning models on large amounts of data across a cluster of machines. It takes care of data storage and operations as well as inter-process and inter-thread communication. DMTK is written in C++, comes with a client API and SDK, and uses ZeroMQ and/or MPI for communication. To emphasize its capabilities, Microsoft said that “using the toolkit one can train a topic model with one million topics and a 20-million word vocabulary, or a word-embedding model with 1000 dimensions and a 20-million word vocabulary, on a web document collection with 200 billion tokens utilizing a cluster of just 24 machines. That workload would previously have required thousands of machines.”
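To make the term "parameter server" concrete, here is a minimal single-process sketch in Python. It is not DMTK code and does not use the DMTK API: the class `ToyParameterServer`, the functions `worker_step` and `train`, and the tiny linear model are all hypothetical names chosen for illustration. The idea it shows is only the general pattern: workers pull the shared parameters, compute an update on their own shard of the data, and push that update back.

```python
# Illustrative only: a toy, single-process stand-in for the parameter server
# pattern. None of these names come from the DMTK API.
from typing import Dict, List, Tuple

Shard = List[Tuple[float, float]]


class ToyParameterServer:
    """Holds the shared model parameters and applies updates from workers."""

    def __init__(self, num_params: int) -> None:
        self.params: List[float] = [0.0] * num_params

    def pull(self) -> List[float]:
        # Workers fetch a copy of the current parameters.
        return list(self.params)

    def push(self, delta: Dict[int, float]) -> None:
        # Workers send sparse updates keyed by parameter index.
        for index, value in delta.items():
            self.params[index] += value


def worker_step(server: ToyParameterServer, shard: Shard, lr: float) -> None:
    """One gradient step on a data shard for the model y = w * x + b."""
    w, b = server.pull()
    grad_w = 0.0
    grad_b = 0.0
    for x, y in shard:
        error = (w * x + b) - y
        grad_w += error * x
        grad_b += error
    n = len(shard)
    server.push({0: -lr * grad_w / n, 1: -lr * grad_b / n})


def train(shards: List[Shard], rounds: int = 2000, lr: float = 0.05) -> List[float]:
    server = ToyParameterServer(num_params=2)
    for _ in range(rounds):
        # In a real cluster the workers would run in parallel on separate
        # machines; here they simply take turns.
        for shard in shards:
            worker_step(server, shard, lr)
    return server.pull()


# Synthetic data generated from y = 2x + 1, split across two "workers".
data: Shard = [(float(x), 2.0 * x + 1.0) for x in range(-4, 4)]
shards = [data[:4], data[4:]]
w, b = train(shards)
```

A real system of this kind would add network transport (ZeroMQ or MPI in DMTK's case), partitioning of the parameters across several server nodes, and some policy for how stale a worker's copy of the parameters may become. None of that is modeled here.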

When open sourcing the framework, Microsoft made available a number of tools:

  • DMTK – the underlying machine learning framework.
  • LightLDA – an algorithm for training topic models on large-scale data. According to this paper, LightLDA can be used to train “1 million topics and a 1-million-word vocabulary (for a total of 1 trillion parameters), on a document collection with 200 billion tokens” on an 8-machine cluster. Microsoft uses it to train models for Bing.
  • Distributed Word Embedding (DWE) – a parallelization of the Word2Vec algorithm.
  • Distributed Multi-sense Word Embedding (DMWE) – a parallelization of the Skip-Gram Mixture algorithm used for polysemous words.
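The "1 trillion parameters" figure quoted for LightLDA follows from the size of a topic model's topic-word table, which holds one entry per (topic, word) pair. The short Python calculation below reproduces it. The dense-storage estimate at the end is an assumption added for scale only (4 bytes per entry, no sparsity), not a figure from Microsoft or the paper.

```python
# Parameter count for the LightLDA example quoted above.
topics = 1_000_000
vocabulary = 1_000_000
parameters = topics * vocabulary  # one entry per (topic, word) pair

# Assumption for illustration: each entry stored densely as a 4-byte value.
bytes_per_entry = 4
dense_terabytes = parameters * bytes_per_entry / 10**12

print(f"{parameters:,} parameters, about {dense_terabytes:.0f} TB if stored densely")
```

Under that assumption a dense table would not fit in the memory of a single commodity machine, which helps explain why such a model is spread across a cluster.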

While DMTK has so far been used for topic modeling and word embedding, it can also be used for “computer vision, speech recognition and textual understanding,” according to Microsoft.

The source code is available on GitHub; alternatively, Microsoft provides binaries for Windows and Linux.
