381. K-Means Algorithm

k-Means is an unsupervised learning algorithm that clusters a set of items into groups.

Given a set of multi-dimensional items and a number of clusters, k, we are tasked with grouping the items by similarity. The centers of these groups are called means.

Overview

The algorithm in pseudocode:

Input:
Data Set
Number of Clusters (k)
Number of Max Iterations (maxIterations)
---

Initialize k means at random points

For i = 0 to maxIterations:
    Iterate through items
        Assign an item to its closest mean
        Update that mean

Return means

Reading Data

The data is given as a text file ('data.txt'). Each line in the file represents an item and contains numerical values separated by commas. The number of values in a line is the number of features.

First we need to read the data from the file. We will save the data into a list, where each element is another list containing the item values for the features. A function that does this is the following:
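A minimal sketch, assuming the comma-separated layout described above (the name ReadData is illustrative):

def ReadData(fileName):
    # Read the file and split it into lines
    with open(fileName, 'r') as f:
        lines = f.read().splitlines()

    items = []
    for line in lines:
        if not line.strip():
            continue
        # Each line holds one item: its feature values, separated by commas
        items.append([float(value) for value in line.split(',')])

    return items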


Initializing Means

For better accuracy, we need to initialize the means at suitable random points. One way would be to pick random items from the data set. A better way is to initialize the means with random values inside the boundaries of the data set: for each feature, we find its min and max values in the data set and pick a random number between them to initialize that feature of the mean.

First we need to find the min and max values of all columns in the data set.
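One way to do this is a single pass over the items, keeping per-column minima and maxima (the name FindColMinMax is illustrative):

def FindColMinMax(items):
    # Track the smallest and largest value seen for every feature (column)
    n = len(items[0])
    minima = [float('inf')] * n
    maxima = [float('-inf')] * n

    for item in items:
        for f in range(n):
            if item[f] < minima[f]:
                minima[f] = item[f]
            if item[f] > maxima[f]:
                maxima[f] = item[f]

    return minima, maxima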


Next we initialize the means with values between the min and max of each feature.
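A sketch of this initialization, using Python's random.uniform (the name InitializeMeans and its argument order are assumptions):

import random

def InitializeMeans(items, k, cMin, cMax):
    # One mean per cluster; every feature starts at a random value
    # drawn from that column's [min, max] range
    f = len(items[0])
    means = [[0.0] * f for _ in range(k)]

    for mean in means:
        for i in range(f):
            mean[i] = random.uniform(cMin[i], cMax[i])

    return means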


Updating Means

We update each mean by finding the average value of each feature over the items in its cluster. We could do this by adding all the values together and dividing by the number of items in the cluster, but since we are continuously updating the means, re-adding all the item values again and again is not efficient. So we will update the means incrementally instead.

Let n be the number of items in the cluster after the new item has been added, m the current mean value of a feature in the cluster, and x the new item's value for that feature. The new mean value for that feature is:

newM = (m*(n-1)+x)/n

We do that for each feature and we have our new mean:
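A sketch of this incremental update (the name UpdateMean is illustrative; n is the cluster size including the newly added item):

def UpdateMean(n, mean, item):
    # Apply newM = (m*(n-1) + x)/n to every feature of the mean
    for i in range(len(mean)):
        mean[i] = (mean[i] * (n - 1) + item[i]) / float(n)
    return mean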


Euclidean Distance

To measure the similarity between an item and a cluster's mean, we will calculate their Euclidean distance.
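A straightforward implementation (the name EuclideanDistance is illustrative):

import math

def EuclideanDistance(x, y):
    # Square root of the sum of squared per-feature differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))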


NOTE: Depending on the data, you can choose another similarity function.

Classifying an Item

Next we need to write a function to classify an item to a cluster. We will find the Euclidean distance between the item and each mean, and classify the item to the cluster whose mean is closest.
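A sketch of this step, reusing the EuclideanDistance function above (the name Classify is illustrative):

def Classify(means, item):
    # Return the index of the mean with the smallest distance to the item
    minimum = float('inf')
    index = -1

    for i, mean in enumerate(means):
        distance = EuclideanDistance(item, mean)
        if distance < minimum:
            minimum = distance
            index = i

    return index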


Calculate Means

Now we move on to calculating the means of the data set.

We will loop through all the items, classify each one to its nearest cluster, and update that cluster's mean. We will repeat the process a fixed number of times. If between two passes no item changes classification, we stop early, as the algorithm has converged.

The below function returns the means and the clusters of the data set.
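A sketch of CalculateMeans, combining the helpers above; it returns the means together with the per-item cluster assignments (the maxIterations default and the bookkeeping lists are assumptions of this sketch):

def CalculateMeans(k, items, maxIterations=100000):
    # Find the min/max of every column and initialize the means inside them
    cMin, cMax = FindColMinMax(items)
    means = InitializeMeans(items, k, cMin, cMax)

    # clusterSizes[c]: how many assignments cluster c has received so far
    # belongsTo[i]: the cluster item i was last assigned to (-1 = not yet)
    clusterSizes = [0] * k
    belongsTo = [-1] * len(items)

    for _ in range(maxIterations):
        noChange = True

        for i, item in enumerate(items):
            # Assign the item to its closest mean and update that mean
            index = Classify(means, item)
            clusterSizes[index] += 1
            means[index] = UpdateMean(clusterSizes[index], means[index], item)

            if index != belongsTo[i]:
                noChange = False
            belongsTo[i] = index

        # Stop early once no item changes cluster between two passes
        if noChange:
            break

    return means, belongsTo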


Finding Clusters

This function receives as input the cluster index of each item (computed by the above function, CalculateMeans) and returns the clusters themselves, i.e. the items grouped by cluster.
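A sketch of this grouping step (the name FindClusters is illustrative; passing items and k explicitly is an assumption of this sketch):

def FindClusters(belongsTo, items, k):
    # Group the items by the cluster index each one was assigned to
    clusters = [[] for _ in range(k)]
    for i, item in enumerate(items):
        clusters[belongsTo[i]].append(item)
    return clusters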


Complete Code
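A short driver that ties the above sketches together might look like this (k = 3 and the printed output are arbitrary choices; the complete program would also include all of the functions defined earlier):

if __name__ == '__main__':
    items = ReadData('data.txt')

    k = 3
    means, belongsTo = CalculateMeans(k, items)
    clusters = FindClusters(belongsTo, items, k)

    print('Means:', means)
    for i, cluster in enumerate(clusters):
        print('Cluster', i, ':', cluster)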