BT

AFK-MC² Algorithm Speeds up k-Means Clustering Algorithm Seeding

| by Alexandre Rodrigues Follow 0 Followers on Dec 23, 2016. Estimated reading time: 2 minutes |

Fast and Probably Good Seedings for k-Means” by Olivier Bachem et al. was presented on 2016’s Neural Information Processing Systems (NIPS) conference and describes AFK-MC², an alternative method to generate initial seedings for the k-Means clustering algorithm that is several orders of magnitude faster than the state of art method k-Means++.

k-Means clustering can be used to group data points, or observations, for which you don't know the labels but you have some knowledge on how many groups you have in total (K). The objective is to build K clusters that group the data together using some metric of similarity (e.g euclidean distance). The algorithm, often called Lloyd’s algorithm, consists of finding cluster centers that minimize the distance of between each data point in the group.

As all non-convex optimization algorithms, Lloyd’s algorithm may converge to a local minimum. In order to improve solution quality, the algorithm is bootstrapped with initial cluster centers points called seeds. Random seeds are fast to generate but may converge into far from optimal solutions.

k-Means++ improves the seeds by doing an adaptative sampling on the data points. It starts by selecting a random one as initial seed. Then it computes the distance of all data points to the the nearest seed (initial seed in the first iteration). The next seed is selected randomly from the data points with a probability proportional to the square of the distance calculated.

The drawback of k-Means++ is that it does not scale easily to massive data sets since both its seeding step and every iteration of Lloyd’s algorithm require the computation of all pairwise distances between cluster centers and data points.

The proposed AFK-MC² method is presented as a simple, yet fast, alternative for k-Means seeding that produces probably good clustering without assumptions on the data.

The method consists in an approximation of k-Means++ using Markov chains, where the data points are states. The first chain state is a random sampled data point. A randomized process determines if the chain transitions to other random data points. The decision of state transition is dependent of the initial distances of all data points, that are computed once, as part of a pre-processing step. Unlike k-Means++ it only requires a full pass through all the dataset.

The stationary distribution of AFK-MC² is the same as k-Means++ given enough steps in the chain are simulated. The results showed speedups of 200 to 1000 times faster when compared with k-Means++ with 0 to 1% relative error, for large datasets.

A Cython based implementation of the algorithm is available on Github with examples on how to use it together with scikit-learn.

Rate this Article

Adoption Stage
Style

Hello stranger!

You need to Register an InfoQ account or or login to post comments. But there's so much more behind being registered.

Get the most out of the InfoQ experience.

Tell us what you think

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread
Community comments

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Allowed html: a,b,br,blockquote,i,li,pre,u,ul,p

Email me replies to any of my messages in this thread

Discuss

Educational Content

Login to InfoQ to interact with what matters most to you.


Recover your password...

Follow

Follow your favorite topics and editors

Quick overview of most important highlights in the industry and on the site.

Like

More signal, less noise

Build your own feed by choosing topics you want to read about and editors you want to hear from.

Notifications

Stay up-to-date

Set up your notifications and don't miss out on content that matters to you

BT