
DMTK, a Machine Learning Toolkit from Microsoft

by Abel Avram on Nov 13, 2015

At about the same time that Google announced it was open sourcing TensorFlow, Microsoft pushed DMTK, a Distributed Machine Learning Toolkit, to GitHub. While Google has released a single-machine version of TensorFlow, DMTK runs on a cluster of machines.

DMTK is a parameter server framework for training machine learning models on large amounts of data across a cluster of machines. It takes care of data storage and operations as well as inter-process and inter-thread communication. DMTK is written in C++, comes with a client API and SDK, and uses ZeroMQ and/or MPI for communication. To emphasize its capabilities, Microsoft said that “using the toolkit one can train a topic model with one million topics and a 20-million word vocabulary, or a word-embedding model with 1000 dimensions and a 20-million word vocabulary, on a web document collection with 200 billion tokens utilizing a cluster of just 24 machines. That workload would previously have required thousands of machines.”
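To make the parameter server idea concrete, here is a minimal single-process sketch in Python. It is not DMTK's API: the `ParameterServer` class, its `pull` and `push` methods, and the `worker` and `train` functions are hypothetical names chosen for illustration, and threads stand in for the machines of a real cluster. Each worker repeatedly pulls the shared parameters, computes a gradient on its own data shard, and pushes an update back.

```python
import threading


class ParameterServer:
    """Toy in-process stand-in for a parameter server (hypothetical, not DMTK's API)."""

    def __init__(self, size):
        self._params = [0.0] * size
        self._lock = threading.Lock()

    def pull(self):
        # Workers fetch a snapshot of the current parameters.
        with self._lock:
            return list(self._params)

    def push(self, delta):
        # Workers send updates; the server adds them into the shared state.
        with self._lock:
            for i, d in enumerate(delta):
                self._params[i] += d


def worker(server, shard, lr=0.05, steps=200):
    # Fit y = w * x by gradient descent, using only this worker's shard.
    for _ in range(steps):
        (w,) = server.pull()
        grad = sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
        server.push([-lr * grad])


def train(shards):
    # One thread per shard; all threads share a single server object.
    server = ParameterServer(size=1)
    threads = [threading.Thread(target=worker, args=(server, s)) for s in shards]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return server.pull()
```

Calling `train` with two shards drawn from the line y = 3x, for example `[[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (0.5, 1.5)]]`, should return a single weight close to 3. A real system would replace the lock-protected list with parameters partitioned across server nodes and the method calls with network messages, which is the role the article assigns to ZeroMQ or MPI in DMTK.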

When open sourcing the framework, Microsoft made a number of tools available:

  • DMTK – the underlying machine learning framework.
  • LightLDA – an algorithm for training topic models on large-scale data. According to this paper, LightLDA can be used to train “1 million topics and a 1-million-word vocabulary (for a total of 1 trillion parameters), on a document collection with 200 billion tokens” on an 8-machine cluster. Microsoft uses it to train models for Bing.
  • Distributed Word Embedding (DWE) – a parallelization of the Word2Vec algorithm.
  • Distributed Multi-sense Word Embedding (DMWE) – a parallelization of the Skip-Gram Mixture algorithm used for polysemous words.
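The "1 trillion parameters" figure quoted for LightLDA is the product of the topic count and the vocabulary size. The short sketch below reproduces that arithmetic and applies the same calculation to the figures in Microsoft's announcement. The function name `dense_table_size` is made up for illustration, and treating each model as one dense table is an assumption; the actual implementations may store these tables sparsely.

```python
def dense_table_size(rows, columns):
    """Number of entries in a dense rows-by-columns parameter table."""
    return rows * columns


# LightLDA paper figures: 1 million topics, 1-million-word vocabulary.
lightlda_params = dense_table_size(1_000_000, 1_000_000)

# Announcement figures: 1 million topics, 20-million-word vocabulary.
topic_model_params = dense_table_size(1_000_000, 20_000_000)

# Announcement figures: 1000 dimensions, 20-million-word vocabulary.
embedding_params = dense_table_size(1_000, 20_000_000)

print(f"LightLDA paper:     {lightlda_params:.0e} parameters")
print(f"Topic model (DMTK): {topic_model_params:.0e} parameters")
print(f"Word embedding:     {embedding_params:.0e} parameters")
```

Under the dense-table assumption, the topic model in the announcement is 20 times larger than the one in the paper, while the word-embedding model is about three orders of magnitude smaller than either.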

While DMTK has so far been used for topic modeling and word embedding, it can also be used for “computer vision, speech recognition and textual understanding,” according to Microsoft.

The source code is available on GitHub. Alternatively, Microsoft provides binaries for Windows and Linux.
