Given a set of data points $X = \{x_1, x_2, \ldots, x_m\}$ where $x_i \in \mathbb{R}^d$ we run K-means on $X$ and obtain the clusters $c_1, c_2, \ldots, c_k$.
Now suppose we create a new dataset $Y = \{y_1, y_2, \ldots, y_m\}$, where $y_i = Ax_i + b$ and $y_i \in \mathbb{R}^d$, and run K-means on $Y$ to obtain clusters $g_1, g_2, \ldots, g_k$.
Under what conditions of $A$ and $b$ are we guaranteed to get the same clusters?
Let's assume that K-means uses the Euclidean distance and that both runs start from matching initial conditions: if the initial centers for $X$ are $c^0_1, \ldots, c^0_k$, then the initial centers for $Y$ are $g^0_1, \ldots, g^0_k$, where $g^0_i = Ac^0_i + b$.
So far I've conjectured that $A$ has to be full rank and that $b$ can be any vector, but I haven't been able to prove it.
Asked By : Ana Echavarria
Answered By : Yuval Filmus
The answer depends on your K-means algorithm, but what follows should work for standard algorithms.
You will get the same result if your transformation $T$ satisfies two conditions:
- It preserves distances: $d(z,w) = d(T(z),T(w))$, where $d$ is your metric, say $d(z,w) = \|z-w\|$.
- It preserves averages: if $\sum_i p_i z_i$ is a convex combination (i.e., $p_i \ge 0$ and $\sum_i p_i = 1$), then $T(\sum_i p_i z_i) = \sum_i p_i T(z_i)$.
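For the affine map in the question, $T(z) = Az + b$, the second condition holds automatically, and the first condition pins down $A$. A quick sketch (assuming the Euclidean metric, not part of the original answer):

Averages: for any convex combination, with $\sum_i p_i = 1$,
$$T\Big(\sum_i p_i z_i\Big) = A\sum_i p_i z_i + b = \sum_i p_i (Az_i + b) = \sum_i p_i T(z_i),$$
where $\sum_i p_i = 1$ lets $b$ be absorbed into the sum.

Distances: $\|T(z) - T(w)\| = \|A(z - w)\|$, which equals $\|z - w\|$ for all $z, w$ exactly when $A^\top A = I$, i.e. when $A$ is orthogonal. So an orthogonal $A$ (with $b$ arbitrary) satisfies both conditions, which is a stronger requirement than full rank.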
You can verify this by stepping through the algorithm and showing that, at every iteration, it makes the same assignment and update choices on both datasets.
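As an illustration (my own sketch, not part of the original answer), here is a minimal Lloyd's-algorithm implementation in Python/NumPy that runs K-means on $X$ and on $Y$ with matching initial centers, using an orthogonal $A$ (a rotation) and an arbitrary $b$; the `kmeans` helper and the synthetic data are assumptions for the demo:

```python
import numpy as np

def kmeans(points, centers, n_iter=50):
    """Plain Lloyd's algorithm with fixed initial centers.

    Returns the final cluster label of each point."""
    centers = centers.copy()
    for _ in range(n_iter):
        # Assign each point to its nearest center (Euclidean distance).
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(len(centers)):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return labels

# Two well-separated Gaussian blobs in R^2.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)),
               rng.normal(5.0, 1.0, (50, 2))])

# A rigid transformation: orthogonal A (rotation by theta) plus translation b.
theta = 0.7
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
b = np.array([3.0, -2.0])
Y = X @ A.T + b                      # y_i = A x_i + b, row by row

init = X[[0, 50]]                    # initial centers for X
labels_X = kmeans(X, init)
labels_Y = kmeans(Y, init @ A.T + b) # transformed initial centers for Y
print(np.array_equal(labels_X, labels_Y))  # expected: True, same clusters
```

Replacing `A` with a full-rank but non-orthogonal matrix (e.g. an anisotropic scaling) can break the equality, since distances are then distorted differently along different directions.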
Question Source : http://cs.stackexchange.com/questions/64843