K-means is a partitioning-based clustering algorithm.
Let $N$ and $K$ be two numbers, with $K \le N$. $N$ will be the number of data points in input, and $K$ will be the number of clusters in output. The goal is to find the partition into $K$ clusters that optimizes a certain criterion.
Clustering is an NP-hard problem: if we try to brute force all the possible partitions of $N$ elements into $K$ non-empty clusters to see which one is the best, the number of candidates is the Stirling number of the second kind $S(N, K)$, which grows roughly as $K^N / K!$.
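To get a feel for how fast this count grows, here is a small sketch computing the Stirling number of the second kind via the standard inclusion-exclusion formula (the function name `stirling2` is just an illustrative choice):

```python
from math import comb, factorial

def stirling2(n, k):
    """Number of ways to partition n elements into k non-empty subsets,
    using S(n, k) = (1/k!) * sum_j (-1)^j * C(k, j) * (k - j)^n."""
    return sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1)) // factorial(k)
```

Even for tiny inputs the count explodes: `stirling2(10, 3)` is already 9330, which is why brute-force search is hopeless.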
The idea behind partitioning-based algorithms is to approximate the optimal solution efficiently, using heuristic techniques.
Let’s first define some terminology:
- $X = \{\boldsymbol{x}_1, \dots, \boldsymbol{x}_N\}$ is the set of input data points;
- $C = \{C_1, \dots, C_K\}$ is the set of output clusters;
- $C_k$ is the generic $k$-th cluster;
- $\boldsymbol{\theta}_k$ is the representative of cluster $C_k$.
The objective function, i.e. the function that has to be minimized, is:

$$
L(A, \boldsymbol{\Theta}) = \sum_{n=1}^N\sum_{k=1}^K\alpha_{n,k}\delta(\boldsymbol{x}_n, \boldsymbol{\theta}_k)
$$

where:

- $A$ is an $N \times K$ matrix such that $\alpha_{n,k} = 1$ iff $\boldsymbol{x}_n$ is assigned to cluster $C_k$, and $0$ otherwise. Since this is hard clustering, each element has to be assigned to one and only one cluster, so each row contains a single $1$ in position $k$ and $0$ in all other positions;
- $\boldsymbol{\Theta} = \{\boldsymbol{\theta}_1, \dots, \boldsymbol{\theta}_K\}$ is the set of cluster representatives;
- $\delta(\boldsymbol{x}_n, \boldsymbol{\theta}_k)$ is a distance function between $\boldsymbol{x}_n$ and $\boldsymbol{\theta}_k$.

The minimizers $A^*$ and $\boldsymbol{\Theta}^*$ are the ones that minimize the function, so they can be defined as:

$$
A^*, \boldsymbol{\Theta}^* = \operatorname{argmin}_{A, \boldsymbol{\Theta}}L(A, \boldsymbol{\Theta})
$$

Since the matrix $A$ is discrete, the function is non-convex and has multiple local minima, preventing us from using iterative methods like gradient descent to find the global minimum.

# Lloyd-Forgy Algorithm

In order to compute an approximate solution efficiently, the Lloyd-Forgy algorithm uses a 2-step iterative procedure. Since it's an approximation, the computed minimum could be a local minimum or a saddle point.

- **Assignment step:** we minimize $L$ w.r.t. $A$ while keeping $\boldsymbol{\Theta}$ fixed. Intuitively, given a set of fixed representatives, $L$ is minimized when the distance between each point and its representative is minimal, which means that each data point should be assigned to the closest representative according to $\delta$. Since the representatives are fixed in this step, this is easy to do.
- **Update step:** we instead minimize $L$ w.r.t. $\boldsymbol{\Theta}$ while keeping $A$ fixed. This time we can take the gradient of $L$ w.r.t. $\boldsymbol{\Theta}$, set it to $0$ and solve for $\boldsymbol{\Theta}$.
## K-means

K-means is a special case of the Lloyd-Forgy algorithm in which each cluster representative is the cluster's center of mass (i.e. its centroid). The centroid of a cluster is the mean of the instances assigned to that cluster. The idea is to minimize the within-cluster Sum of Square Distances (SSD). We can define the $k$-th centroid as:

$$
\boldsymbol{\theta}_k = \frac{\sum_{n=1}^N \alpha_{n,k}\boldsymbol{x}_n}{\sum_{n=1}^N \alpha_{n,k}} = \boldsymbol{\mu}_k = \frac{1}{|C_k|}\sum_{n \in C_k}\boldsymbol{x}_n
$$

where $|C_k| = \sum_{n=1}^N \alpha_{n,k}$. The new objective function, being the Sum of Square Distances defined with the Euclidean distance, can be written as:

$$
L(A, \boldsymbol{\Theta}) = \sum_{n=1}^N\sum_{k=1}^K\alpha_{n,k}\|\boldsymbol{x}_n - \boldsymbol{\theta}_k\|_2^2
$$

Since the squared Euclidean norm is the dot product of the difference vector with itself, we can also write:

$$
\|\boldsymbol{x}_n - \boldsymbol{\theta}_k\|_2^2 = (\boldsymbol{x}_n - \boldsymbol{\theta}_k)^\top(\boldsymbol{x}_n - \boldsymbol{\theta}_k)
$$

and so:

$$
L(A, \boldsymbol{\Theta}) = \sum_{n=1}^N\sum_{k=1}^K\alpha_{n,k}(\boldsymbol{x}_n - \boldsymbol{\theta}_k)^\top(\boldsymbol{x}_n - \boldsymbol{\theta}_k)
$$

In the assignment step, we assign each data point to the closest centroid. In the update step, we compute the gradient of the loss, set it to $0$ and solve for $\boldsymbol{\Theta}$. After some calculations that can be found in the Lecture 5 slides (slides 50 to 57), we find that the set of representatives minimizing the loss is exactly the set of centroids of the clusters, so new centroids are computed in the update step. Each time we do the assignment, we modify the center of mass of the clusters, so we have to update the centroids again. We repeat this until a stopping criterion is met, where the criterion can differ from case to case. Note that before the first iteration, the algorithm selects $K$ data points as the initial cluster representatives.
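The centroid formula and the SSD objective above translate directly into NumPy; a minimal sketch (names are illustrative):

```python
import numpy as np

def centroids(X, assign, K):
    """Centroid of cluster k = mean of the points assigned to it
    (assumes every cluster is non-empty)."""
    return np.array([X[assign == k].mean(axis=0) for k in range(K)])

def ssd_objective(X, assign, theta):
    """Within-cluster Sum of Square Distances:
    L = sum_n ||x_n - theta_{assign[n]}||^2."""
    diff = X - theta[assign]  # each point minus its own centroid
    return float((diff ** 2).sum())
```

For instance, four points on a line split into two pairs give centroids at the pair midpoints and an SSD equal to the sum of squared offsets from those midpoints.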
This step is crucial for the algorithm's performance, so it will be detailed later. The most naive choice would of course be to select the points randomly.

### Stopping Criterion

For the stopping criterion, there are many options to choose from:

- **Fixed number of iterations**: after a certain number $R$ of iterations, we stop the algorithm no matter what;
- **Cluster assignments stop changing beyond a certain threshold**: we stop when only a small percentage of cluster assignments change between iterations, meaning the assignment is converging;
- **Centroids don't change beyond a certain threshold**: we stop when only a small percentage of centroids change between iterations.

We are sure that the algorithm will converge at some point because at each iteration we are either improving the objective function or not changing anything, but we are never getting worse. Furthermore, we can see the Lloyd-Forgy algorithm as an instance of the more general Expectation-Maximization algorithm, and we know that these types of algorithms converge (even though not necessarily to a global optimum). We can see the assignment step as the expectation step, and the update step as the maximization step (even if it actually minimizes).

### Complexity Analysis

We can analyze the overall complexity of the algorithm by analyzing the complexity of each single step:

- Computing the distance between two $d$-dimensional data points takes $O(d)$;
- Assigning (and re-assigning) clusters takes $O(KNd)$, because the distance computation is repeated $O(KN)$ times: in order to assign a data point to the nearest centroid, we have to compute the distance between each data point ($N$ of them) and each cluster representative ($K$ of them);
- The update step, in which the centroids are computed, takes $O(Nd)$, since it consists of $O(N)$ additions (each data point is added to one and only one cluster per iteration), each taking $O(d)$.
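The centroid-based stopping criterion could be sketched as follows; this is the strict variant where every centroid must have moved less than a tolerance, and `tol` is an assumed parameter, not a value from the lecture:

```python
import numpy as np

def centroids_converged(old_theta, new_theta, tol=1e-4):
    """Stopping criterion sketch: stop when every centroid has moved
    (in Euclidean distance) less than `tol` since the last iteration."""
    shifts = np.linalg.norm(new_theta - old_theta, axis=1)
    return bool((shifts < tol).all())
```

A percentage-based variant would instead check what fraction of the entries in `shifts` exceed `tol`.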
Assuming $R$ is the number of iterations of the two steps, the overall complexity is $O(RKNd)$.

### Initial representative choice

As said before, choosing the representatives for the first iteration is a crucial step in order to achieve a good result. The Lloyd-Forgy algorithm randomly chooses $K$ data points as the initial centroids; this can result in a bad final clustering, because the algorithm can converge to sub-optimal clusterings. Both the clustering quality and the convergence rate are affected by the selection of the initial centroids.

![[Untitled-2023-02-04-1558 (1).webp]]

From this image we can see that if we are unlucky and choose B and E as representatives, we will get a bad clustering; on the other hand, if we choose D and F we will get a much better final clustering. As this example shows, most of the time we get a good clustering if we choose representatives that are well separated from each other, meaning they are far apart according to the chosen distance measure. To reduce the chance of being unlucky we can try different initial random centroids and pick the best result, but this of course isn't efficient.

### K-means++

K-means++ is a variant of K-means in which the initial centroids are chosen in a smarter way, iteratively. Initially we have just one centroid, a data point chosen at random; then, for each data point $x$, we:

- Compute the distance $D(x)$ from the nearest already-chosen centroid;
- Choose one data point at random as a new centroid, with probability proportional to $D(x)^2$ (the farther a data point is from its nearest centroid, the more likely it is to be chosen as a new centroid).

We repeat these two steps until we have $K$ centroids, after which the usual K-means algorithm starts.
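The seeding steps above can be sketched in NumPy as follows; the first centroid here is drawn uniformly at random from the data points, and the names are illustrative:

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """K-means++ seeding sketch: each new centroid is drawn with probability
    proportional to D(x)^2, the squared distance to the nearest chosen centroid."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]  # first centroid: a random data point
    while len(centroids) < K:
        # D(x)^2 for every point: squared distance to its nearest centroid so far.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()  # farther points are more likely to be picked
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```

Note that an already-chosen centroid has $D(x)^2 = 0$ and can never be picked again, so the $K$ seeds are always distinct points.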
Note that even if we pick some outlier as a centroid (outliers are far from every other point, so they have a high chance of being selected), this is just the starting phase: the K-means iterations will eventually even out the centroids. This variant of the algorithm has the advantage of putting an upper bound on the clustering, meaning that there is a limit on how much worse the clustering can be with respect to the optimal partitioning: the expected cost is within an $O(\log K)$ factor of the optimum. This is not true for the vanilla implementation of K-means, which has no limit on how bad the final result can be w.r.t. the optimal solution.

### Choosing the number of clusters $K$

Sometimes $K$ is implicitly given by the dataset: for example, if we are clustering images of digits from the MNIST dataset, we are sure that the best $K$ would be $10$. Most of the time, however, it is very difficult to know $K$ in advance, and we need to trade off between having too few and too many clusters, or in other words between total benefit and total cost.

- _**Benefit**_: we define the benefit $b_i$ of a data point $x_i$ as the similarity (a value in $[0,1]$) to its assigned centroid. The total benefit is $B = \sum_i b_i$. Note that there is always a clustering with total benefit $B = N$: when we have a cluster for each data point, each point is its own centroid, so each $b_i = 1$ and $B = N$;
- _**Cost**_: we can assign a cost $p$ to each cluster, so the total cost is defined as $P = Kp$.

We can then define the value of a certain clustering as $V = B - P$. The goal is to maximize $V$ over all possible $K$.

**The "Elbow" Method:** usually the best approach is just to try multiple values of $K$ and look at how the SSD changes by plotting it on a graph.
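The elbow curve can be produced by running K-means for several values of $K$ and recording the final SSD; a minimal sketch assuming a simple NumPy K-means with random data-point initialization (names are illustrative):

```python
import numpy as np

def elbow_curve(X, k_values, n_iter=50, seed=0):
    """Run K-means for each K in k_values and record the final SSD,
    to be plotted against K when applying the elbow method."""
    rng = np.random.default_rng(seed)
    ssds = []
    for K in k_values:
        theta = X[rng.choice(len(X), size=K, replace=False)]  # random init
        for _ in range(n_iter):
            # Assignment step, then centroid update (keep old centroid if empty).
            assign = np.linalg.norm(X[:, None] - theta[None], axis=2).argmin(axis=1)
            theta = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                              else theta[k] for k in range(K)])
        ssds.append(float(((X - theta[assign]) ** 2).sum()))
    return ssds
```

The resulting list is non-increasing in practice (more clusters never need to cost more SSD), and the "elbow" is the $K$ where the decrease flattens out.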
![[Screenshot 2023-03-20 at 7.30.53 PM.webp]]

Since the curve is decreasing, we can find a value of $K$ after which the gain in benefit becomes very small with respect to the cost we have to pay. In the image above this value is $K=6$, so that is a good choice of $K$.

---
#ai/machine-learning