
[Solved]: Minimize sum of squared error

Problem Detail: 

I have an array of real numbers that I want to partition into k sets. For each set, I calculate the sum of squared errors from the set's mean, and then add up these sums over all the sets. I want to minimize this total. For example:

1 3 5 9

If k=2, I would partition them into (1 3 5) and (9). The sum of squared error of (1 3 5) is (1-3)^2+(3-3)^2+(5-3)^2=8 and the sum of squared error of (9) is 0, so the total sum of squared error is 8. I think 8 is the minimum in this case.
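The worked example above can be checked with a short Python snippet (the helper name `sse` is ours, for illustration):

```python
def sse(cluster):
    """Sum of squared deviations of the cluster's points from their mean."""
    mu = sum(cluster) / len(cluster)
    return sum((x - mu) ** 2 for x in cluster)

# Partition (1 3 5) | (9) from the example:
total = sse([1, 3, 5]) + sse([9])
print(total)  # prints 8.0
```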

I want to use something like the traditional Prim's method to solve this problem, i.e., connect n-k edges and then stop. But the problem is that whenever I add one more number to a subset, the mean changes, so it seems that I cannot use Prim's method. Can anyone give me some insight on this? Thanks!

Asked By : user196736

Answered By : Nick

You are looking for an optimal 1-dimensional k-means algorithm. The k-means objective function for partitioning the data $x_1, \ldots, x_n$ into $k$ sets $S = \{S_1, \ldots, S_k\}$ is

$$ \sum\limits_{i=1}^k \sum\limits_{x \in S_i}\lVert x - \mu_i \rVert^2 $$

where $\mu_i$ is the mean of $S_i$ [1].

You can apply a dynamic programming algorithm to the problem [2].

  1. Let the data $x_1, \ldots, x_n$ be sorted in non-decreasing order.
  2. Fill an $(n + 1) \times (k + 1)$ array, $D$, where $D[i,m]$ holds the minimum sum of squared errors for the first $i$ entries using $m$ clusters. It can be calculated as

    $$ D[i,m] := \min\limits_{1 \le j \le i} \left\{ D[j-1,m-1] + d(x_j, \ldots, x_i) \right\} $$

    where $d(x_j, \ldots, x_i)$ is the sum of squared errors of $x_j, \ldots, x_i$ from their mean, for $1 \le i \le n$ and $1 \le m \le k$. The base cases are $D[0,m] = 0$ for all $m$, and $D[i,0] = \infty$ for $i \ge 1$ (every point must belong to some cluster).

    Thus, the entry $D[n,k]$ contains the optimal sum of squared errors for the original problem, now all that is left to do is to extract the answer.

    Note that $d(x_j, \ldots, x_i)$ can be computed incrementally as points are appended, which speeds up the algorithm. Writing $\mu_i$ for the mean of $x_j, \ldots, x_i$,

    $$\begin{align*} d(x_j, \ldots, x_i) &= d(x_j, \ldots, x_{i-1}) + \frac{i-j}{i-j+1} (x_i - \mu_{i-1})^2 \\ \mu_i &= \frac{x_i + (i - j) \mu_{i-1}}{i-j+1}\end{align*} $$

    Thus, $D$ can be computed in $O(n^2k)$ time, since there are $nk$ entries, and each takes $O(n)$ time to compute.

  3. Now, we fill an $n \times k$ array, $B$, whose entries $B[i,m]$ contain the index of the smallest (first) element in the cluster that contains entry $i$, which can be calculated as

    $$ B[i,m] := \arg\min\limits_{1 \le j \le i} \left\{ D[j-1,m-1] + d(x_j, \ldots, x_i) \right\} $$

    for $1 \le i \le n$ and $1 \le m \le k$. This takes $O(n^2k)$ time.

    Thus, we can backtrack from $B[n,k]$ to recover the clustering. Setting $i_k = n$, the clusters are given, for $m = k, k-1, \ldots, 1$, by

    $$ S_m = \{x_{B[i_m,m]}, \ldots, x_{i_m}\}, \qquad i_{m-1} := B[i_m, m] - 1 $$

    The backtracking takes $O(k)$ time.

The algorithm runs in $O(n^2k)$ time, and requires $O(nk)$ space.
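Putting the steps together, here is a minimal Python sketch of the dynamic program (the function name `ckmeans_1d` is ours). Instead of the running-mean update above, it uses prefix sums of $x$ and $x^2$, an equivalent trick that also yields each cluster cost in $O(1)$:

```python
def ckmeans_1d(xs, k):
    """Optimal 1-D k-means by dynamic programming; O(n^2 k) time, O(nk) space."""
    xs = sorted(xs)
    n = len(xs)

    # Prefix sums of x and x^2, so any cluster cost is O(1):
    # d(x_j..x_i) = sum(x^2) - (sum x)^2 / count
    ps = [0.0] * (n + 1)
    pss = [0.0] * (n + 1)
    for i, x in enumerate(xs, 1):
        ps[i] = ps[i - 1] + x
        pss[i] = pss[i - 1] + x * x

    def cost(j, i):  # cost of cluster xs[j-1..i-1] (1-based, inclusive)
        s = ps[i] - ps[j - 1]
        return (pss[i] - pss[j - 1]) - s * s / (i - j + 1)

    INF = float("inf")
    # D[i][m]: min cost of clustering the first i points into m clusters.
    # Base cases: D[0][m] = 0; D[i][0] = infinity for i >= 1.
    D = [[INF] * (k + 1) for _ in range(n + 1)]
    B = [[0] * (k + 1) for _ in range(n + 1)]  # first index of the last cluster
    for m in range(k + 1):
        D[0][m] = 0.0
    for i in range(1, n + 1):
        for m in range(1, min(i, k) + 1):
            for j in range(1, i + 1):
                c = D[j - 1][m - 1] + cost(j, i)
                if c < D[i][m]:
                    D[i][m] = c
                    B[i][m] = j

    # Backtrack from B[n][k] to extract the clusters.
    clusters = []
    i, m = n, k
    while i > 0 and m > 0:
        j = B[i][m]
        clusters.append(xs[j - 1:i])
        i, m = j - 1, m - 1
    return D[n][k], clusters[::-1]
```

On the question's example, `ckmeans_1d([1, 3, 5, 9], 2)` returns the optimal total cost `8.0` with clusters `[[1, 3, 5], [9]]`.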

[1] http://en.wikipedia.org/wiki/K-means_clustering

[2] http://journal.r-project.org/archive/2011-2/RJournal_2011-2_Wang+Song.pdf


Question Source : http://cs.stackexchange.com/questions/33696
