k-Means is an unsupervised learning algorithm for clustering a set of items into groups.
Given a set of multi-dimensional items and a number of clusters, k, we are tasked with grouping the items by similarity. The centers of these groups are called means.
The algorithm in pseudocode:
Input:
    Data Set
    Number of Clusters (k)
    Number of Max Iterations (maxIterations)
---
Initialize k means at random points
For i = 0 to maxIterations:
    Iterate through items:
        Assign an item to its closest mean
        Update that mean
Return means
The data is given as a text file (‘data.txt’). Each line of the file represents one item, as numerical feature values separated by commas. The number of values in a line is the number of features.
First we need to read the data from the file. We will save the data into a list, where each element is another list containing the item values for the features. A function that does this is the following:
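A minimal sketch of such a reader (the name `ReadData` is illustrative; the original article's exact code is not shown here):

```python
def ReadData(fileName):
    # Each line of the file is one item: comma-separated
    # numerical feature values.
    with open(fileName, 'r') as f:
        lines = f.read().splitlines()

    items = []
    for line in lines:
        if not line:
            continue  # skip blank lines
        items.append([float(value) for value in line.split(',')])
    return items
```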
For better accuracy, we need to initialize the means at random points. One way would be to pick random items from the data set. A better way is to initialize the means with random values inside the boundaries of the data set. For each feature, we will find its max and min values in the data set and we will pick a random number between those to initialize our mean feature.
First we need to find the min and max values of all columns in the data set.
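This step could look as follows (a sketch; the function name `FindColMinMax` and the return shape are assumptions):

```python
def FindColMinMax(items):
    # For every feature (column), record the smallest and
    # largest value seen across all items.
    f = len(items[0])
    minima = [float('inf')] * f
    maxima = [float('-inf')] * f

    for item in items:
        for i in range(f):
            if item[i] < minima[i]:
                minima[i] = item[i]
            if item[i] > maxima[i]:
                maxima[i] = item[i]

    return minima, maxima
```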
Next we initialize the means with values between the min and max of each feature.
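A sketch of this initialization, assuming the per-feature minima and maxima are passed in as `cMin` and `cMax` (names are illustrative):

```python
import random

def InitializeMeans(items, k, cMin, cMax):
    # Create k means; each feature gets a uniform random value
    # between that feature's min and max in the data set.
    f = len(items[0])
    means = []
    for _ in range(k):
        means.append([random.uniform(cMin[i], cMax[i]) for i in range(f)])
    return means
```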
We update each mean by finding the average value for each feature. We can do this by adding all the values together and then dividing by the number of items in the cluster. As we are continuously updating the means, however, re-summing every item value each time a mean changes is inefficient, so we will update the means incrementally instead.
Let n be the number of items in a cluster, m the current mean value of a feature in the cluster and x the item's value for that feature. To find the new mean value for the feature in the cluster, we do the following calculation:
newM = (m*(n-1)+x)/n
We do that for each feature and we have our new mean:
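A sketch of this incremental update, applied to every feature (the name `UpdateMean` and the rounding to three decimals are assumptions):

```python
def UpdateMean(n, mean, item):
    # Fold one new item into a cluster mean, one feature at a
    # time, using newM = (m*(n-1) + x) / n.
    for i in range(len(mean)):
        mean[i] = round((mean[i] * (n - 1) + item[i]) / float(n), 3)
    return mean
```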
To find the similarity between an item and a mean, we will calculate their Euclidean distance:
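A straightforward implementation (the function name is illustrative):

```python
import math

def EuclideanDistance(x, y):
    # Square root of the sum of squared per-feature differences.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```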
NOTE: Depending on the data, you can choose a different similarity function.
Classifying an Item
Next we need to write a function to classify an item to a cluster. We will find the Euclidean distance between the item and each mean, and we will classify the item to the minimum-distance cluster.
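A sketch of the classification step (the name `Classify` is an assumption; the distance is computed inline to keep the block self-contained):

```python
import math

def Classify(means, item):
    # Return the index of the mean closest to the item
    # under Euclidean distance.
    minimum = float('inf')
    index = -1
    for i, mean in enumerate(means):
        distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(item, mean)))
        if distance < minimum:
            minimum = distance
            index = i
    return index
```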
Now we move on to calculating the means of the data set.
We will loop through all the items, classify each one to its nearest cluster, and update that cluster's mean. We will repeat the process a fixed number of times. If no item changes classification between two passes, we stop early, as the algorithm has converged.
The function below returns the means of the data set, together with the cluster each item was assigned to.
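The article names this function CalculateMeans; the sketch below combines the earlier steps under that name, with details such as the `maxIterations` default and the inline min/max computation being assumptions:

```python
import math
import random

def CalculateMeans(k, items, maxIterations=100000):
    f = len(items[0])

    # Feature-wise min and max, used to bound the random means.
    cMin = [min(item[i] for item in items) for i in range(f)]
    cMax = [max(item[i] for item in items) for i in range(f)]

    # Initialize k means at random points inside the data boundaries.
    means = [[random.uniform(cMin[i], cMax[i]) for i in range(f)]
             for _ in range(k)]

    clusterSizes = [0] * k          # running item count per cluster
    belongsTo = [-1] * len(items)   # cluster index of each item

    for _ in range(maxIterations):
        noChange = True
        for n, item in enumerate(items):
            # Classify the item to its closest mean.
            index = min(range(k), key=lambda j: math.sqrt(
                sum((a - b) ** 2 for a, b in zip(item, means[j]))))

            # Fold the item into that mean: newM = (m*(n-1)+x)/n.
            clusterSizes[index] += 1
            size = clusterSizes[index]
            means[index] = [(m * (size - 1) + x) / size
                            for m, x in zip(means[index], item)]

            if index != belongsTo[n]:
                noChange = False
            belongsTo[n] = index

        # Converged: no item changed cluster in this pass.
        if noChange:
            break

    return means, belongsTo
```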
This function receives as input the cluster each item belongs to (computed by CalculateMeans above) and returns the clusters themselves, as lists of items.
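A sketch of this grouping step (the name `FindClusters` and its exact signature are assumptions):

```python
def FindClusters(belongsTo, items, k):
    # Bucket each item into the list for its assigned cluster index.
    clusters = [[] for _ in range(k)]
    for item, index in zip(items, belongsTo):
        clusters[index].append(item)
    return clusters
```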