Show Buttons
Share On Facebook
Share On Twitter
Share On Google Plus
Share On Linkdin
Share On Pinterest
Share On Reddit
Share On Stumbleupon
Contact us
Hide Buttons

K-Means Algorithm

The algo­rithm of kMeans is an unsu­per­vised learn­ing algo­rithm for clus­ter­ing a set of items into groups.

Given a set of multi-dimensional items and a num­ber of clus­ters, k, we are tasked of cat­e­go­riz­ing the items into groups of sim­i­lar­ity. The cen­ters of these groups are called means.

Overview

The algo­rithm in pseudocode:

Input:
Data Set
Number of Clusters (k)
Number of Max Iterations (maxIterations)
---

Initialize k means at random points

For i = 0 to maxIterations:
    Iterate through items
        Assign an item to its closest mean
        Update that mean

Return means

Read­ing Data

The data is given as a text file (‘data.txt’). Each line in the file rep­re­sents an item. Each line/item con­tains numer­i­cal val­ues split by com­mas. The num­ber of val­ues in a line is the num­ber of features.

First we need to read the data from the file. We will save the data into a list, where each ele­ment is another list con­tain­ing the item val­ues for the fea­tures. A func­tion that does this is the following:

Ini­tial­iz­ing Means

For bet­ter accu­racy, we need to ini­tial­ize the means at ran­dom points. One way would be to pick ran­dom items from the data set. A bet­ter way is to ini­tial­ize the means with ran­dom val­ues inside the bound­aries of the data set. For each fea­ture, we will find its max and min val­ues in the data set and we will pick a ran­dom num­ber between those to ini­tial­ize our mean feature.

First we need to find the min and max val­ues of all columns in the data set.

Next we ini­tial­ize the means with val­ues between the min and max of each feature.

Updat­ing Means

We update each mean by find­ing the aver­age value for each fea­ture. We can do this by adding all the val­ues together and then divid­ing by the num­ber of items in the clus­ter. As we are con­tin­u­ously updat­ing the means though, adding all the item val­ues together again and again is not effi­cient. So we will update the means via another way.

Let n be the num­ber of items in a clus­ter, m the value of a fea­ture in the clus­ter and x the item value for that fea­ture. To find the new mean value for the fea­ture in the clus­ter, we do the fol­low­ing calculation:

newM = (m*(n-1)+x)/n

We do that for each fea­ture and we have our new mean:

Euclid­ean Distance

To find the sim­i­lar­ity between a clus­ter and an item, we will cal­cu­late their Euclid­ean Distance:

NOTE: Depend­ing on the data, you can chose another sim­i­lar­ity function.

Clas­si­fy­ing an Item

Next we need to write a func­tion to clas­sify an item to a clus­ter. We will find the euclid­ean dis­tance between the item and each mean, and we will clas­sify the item to the minimum-distance cluster.

Cal­cu­late Means

Now we move on to cal­cu­lat­ing the means of the data set.

We will loop through all the items, we will clas­sify them to their near­est clus­ter and we will update the mean of that clus­ter. We will repeat the process a fixed num­ber of times. If between two loops no item changes clas­si­fi­ca­tion we stop the process, as the algo­rithm con­verged to the opti­mal solution.

The below func­tion returns the means and the clus­ters of the data set.

Find­ing Clusters

This will receive as input the clus­ters each item belongs to (com­puted by the above func­tion, Cal­cu­late­Means) and it will return the clus­ters themselves.

Com­plete Code

This arti­cle is con­tribute by Anto­nis Maronikolakis

You may also like...