World's most popular travel blog for travel bloggers.

# Least squares fit of a 1D lattice of points to a 2D dataset

, ,
Problem Detail:

Given a set of data points (shown in red), it is possible to fit a line $y = mx + c$ through the points using linear least squares regression.

I would like to modify this to fit a 1D lattice (grid) along the line such that $d$ is the offset of the the grid along the line and $i$ is the interval between neighboring grid points along the line.

An additional constraint is that more or less the same number of data points should be in the neighbourhood for each grid point in order to prevent picking an arbitrarily small $i$ and overfitting the data.

I believe that calculating the mean square error on a solution can be simple. The squared distances of all of the data points in the neighbourhood of the grid point would be:

$$mse = \sum_{j = 0}^{g} \sum_{k = 0}^{h} distance^2(\vec{latticePoint_j}, \vec{dataPoint_{j,k}})$$

(Where $g$ is the number of clusters and $h$ is the number of points inside the grid point's neighborhood.)

However, I'm not sure how to go about solving for all four variables $m$, $c$, $d$ and $i$ since I'm left with a discrete function. My hope is that since $i$ is fixed and points are distributed reasonably evenly among the clusters this could be done without invoking an overly complicated clustering algorithm.

The number of clusters are not known ahead of time and may contain outliers. Any help is much appreciated.

###### Asked By : Rehno Lindeque

Define an objective function $\Psi(m,c,d,i)$ that specifies how "bad" a particular choice of values for $m,c,d,i$ are. Then, use any standard black-box optimization algorithm (e.g., gradient descent) to find $m,c,d,i$ that minimize the objective function. You might choose the initial values of $m,c$ based on a least-squares fit, and choose the initial values $d,i$ based on a best guess (maybe by projecting the points to the line $y=mx+c$ and then taking a FFT and looking for peaks, to try to find the periodicity), and then run gradient descent from there.

So, it all reduces to choosing an objective function $\Psi(m,c,d,i)$ that captures what you want. We can't give that for you, since you haven't fully specified the problem. For instance, you say you want "more or less the same number of data points" in each cluster, but you haven't specified how to trade off the errors: of course, we can't achieve this exactly, so we need some way to measure and compare different candidate solutions.

For instance, given a set $P$ of points, you might use an objective function like

$$\Psi(m,c,d,i) = \sum_{p \in P} d_{m,c,d,i}(p)^2$$

where $d_{m,c,d,i}(p)$ is the distance from $p$ to the nearest grid point on the line $y=mx+c$, i.e.,

$$d_{m,c,d,i}(p) = \min\{\sqrt{(p_x-x)^2 +(p_y-(mx+c))^2} : x=d + ij, j \in \mathbb{Z}\}$$

where $p=(p_x,p_y)$ in $(x,y)$-coordinates.

This doesn't do anything special about outliers; they incur a very large penalty. That might be undesirable -- it might lead to poor results in the presence of outliers. There are various methods for dealing with outliers. One simple one is to fit a line using a robust least-squares fit, find outliers and remove them as part of a pre-processing step, and then run the procedure on the result.

Also, you mention fear of overfitting the data set. Another way to avoid overfitting is to add a regularization term that penalizes values of $d,i$ that are considered a priori unlikely. For instance, you might add a penalty term that is proportional to $1/i$ or to $-i$. The penalty term gets added to the objective function. Then, when you minimize the objective function, this will exert some pressure to avoid choosing small values of $i$.