In astronomy, the term "clustering" typically refers to the clustering of galaxies, i.e. to their distribution not being uniform random but coupled to the underlying dark matter.
In data science the term is more general, referring to the task of breaking up a data set into groups of samples that are more similar to each other than to samples from other groups. Replace "group" with "cluster" and you have an operational definition of what clustering attempts to do.
To make the most of the terminology clash, we'll use an example of DES red-sequence galaxies in a field with several galaxy clusters. That is to say, by virtue of the selection these galaxies are *strongly* clustered, but the field is complex, with overlapping clusters of various shapes and magnitudes. As we'll see, that's not a trivial problem for data clustering.
```python
# load the DES red-sequence galaxy catalog
import numpy as np
from sklearn.cluster import KMeans, MeanShift
data = np.load('clusters.npy')
print(len(data), data.dtype.names)

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt

# project RA/Dec onto the tangent plane at the field center
ra0, dec0 = data['RA'].mean(), data['DEC'].mean()
X = np.dstack(((ra0 - data['RA'])*np.cos(data['DEC']*np.pi/180), data['DEC'] - dec0))[0]

fig = plt.figure()
ax = fig.add_subplot(111, aspect='equal')
ax.scatter(X[:,0], X[:,1], alpha=0.1)
ax.set_xlabel(r'$\Delta$RA [deg]')
ax.set_ylabel(r'$\Delta$Dec [deg]')
```
1107 ('MEM_MATCH_ID', 'Z', 'RA', 'DEC')

## K-means algorithm
The go-to clustering algorithm. It uses $K$ clusters and optimizes their means (duh!). In detail, it starts with $K$ cluster center positions $\mu_k$ randomly placed over the data volume, and then it repeats two steps until convergence:
* Cluster assignment: for each sample $i$, add it to the cluster set $C_k$ with the nearest mean, i.e. $\text{argmin}_k \lVert x_i - \mu_k\rVert_2^2$.
* Recompute mean: for every cluster $k$, update $\mu_k\leftarrow \frac{1}{|C_k|}\sum_{i\in C_k} x_i$.
This sequence constitutes an EM procedure (we've come across it before in the Gaussian Mixture Model) that minimizes the total dispersion $\sum_k \sum_{i\in C_k} \lVert x_i - \mu_k\rVert_2^2$. Even on closer inspection, it shares a lot of similarity with the GMM. In fact, it can be considered a very limited form of the GMM, in which the cluster assignment is hard (first step) and only the means are optimized (second step). In the full GMM, one has fractional assignments $q_{ik}$ of sample $i$ to cluster $k$ and updates amplitudes, means, and covariances of each Gaussian component.
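To make the two steps concrete, here's a minimal NumPy sketch of the K-means loop. It's illustrative only: the function name, the random initialization, and the convergence test are my choices; below we'll use scikit-learn's `KMeans`.
```python
# minimal K-means sketch (illustrative): hard assignment + mean update, until the means stop moving
def kmeans_sketch(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # random initial centers drawn from the data
    for _ in range(n_iter):
        # cluster assignment: index of the nearest mean for every sample
        d2 = ((X[:, None, :] - mu[None, :, :])**2).sum(axis=-1)
        labels = d2.argmin(axis=1)
        # recompute means: average of the samples assigned to each cluster (empty clusters stay put)
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k) else mu[k] for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return labels, mu
```
The hard `argmin` assignment and the unweighted mean update are exactly the two restrictions relative to the full GMM.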
From this insight, we can already see what character the $K$-means clustering will have.
* It doesn't like overlap (hard assignment)
* It likes clusters with equal magnitude (no amplitude adjustment) and equal size (no covariance adjustment)
* It prefers clusters to be round because the distance metric is spherically symmetric and there's no cluster covariance that could change that.
However, because it does so little, it's pretty fast and generally gives a decent first shot at many clustering problems. Let's see what it can do.
**Note**: We assumed that we know the cluster number $K$. There's no inherent way for the algorithm to decide for itself what that number should be. Below we'll look into ways of avoiding that decision, but for now we have to choose it as users.
```python
for K in range(3,30,5):
    clf = KMeans(n_clusters=K)
    labels = clf.fit_predict(X)
    centers = clf.cluster_centers_
    fig = plt.figure()
    ax = fig.add_subplot(111, aspect='equal')
    ax.scatter(X[:,0], X[:,1], c=labels, alpha=0.1, cmap='Paired')
    ax.scatter(centers[:,0], centers[:,1], c=np.arange(1,K+1), s=1000, marker='+', cmap='Paired')
    ax.text(0.05, 0.95, 'K=%d' % K, va='top', transform=ax.transAxes)
    fig.tight_layout()
```






As expected, we see plausible results, but in detail there are flaws: at larger $K$, the algorithm prefers to break up bigger clusters into smaller ones before it splits off smaller but really fragmented ones (no amplitude variation), and it breaks bigger clusters along their semi-major axes (sphericity kicks in).
But the fundamental question remains: can we find the optimal cluster number $K$?
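One common heuristic, which is not pursued in this notebook but worth a quick look, is to scan $K$ and score each solution, e.g. with the mean silhouette coefficient (higher is better):
```python
# optional sketch: scan K and compare solutions with the silhouette score
from sklearn.metrics import silhouette_score
for K in range(3, 30, 5):
    labels = KMeans(n_clusters=K).fit_predict(X)
    print(K, silhouette_score(X, labels))
```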
Alternatively, can we avoid the decision altogether? Enter the ...
## Mean-Shift algorithm
Where K-means is related to the GMM, Mean-Shift uses a KDE internally to construct a smoothed density distribution $\hat{f}_h$ of the samples. It then throws down test particles at many locations $y_j$ and advances their positions along the local gradient of the smoothed density
$$
y_j \leftarrow \frac{\sum_i x_i g\left(\frac{\lVert x_i - y_j\rVert}{h}\right)}{\sum_i g\left(\frac{\lVert x_i - y_j\rVert}{h}\right)},
$$
where $g$ is a suitable kernel function and $h$ is its bandwidth. With a little algebra, we find that the mean shift
$$
m_j = y_j^{\mathrm{it}+1} - y_j^\mathrm{it} \propto \frac{\nabla \hat{f}_h}{\hat{f}_h},
$$
i.e. the test particles walk uphill with a step size that is set by the inverse local density $1/\hat{f}_h$. In other words, they will quickly move out of low-density areas and converge (provably, in fact) at the modes of the smoothed density distribution.
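For completeness, here's the algebra in sketch form (schematic: kernel-dependent constants are absorbed into the proportionality, and strictly speaking the density in the denominator is the KDE built with the kernel profile $g$; see Comaniciu & Meer for the careful version):
$$
m_j = \frac{\sum_i x_i\, g_i}{\sum_i g_i} - y_j = \frac{\sum_i (x_i - y_j)\, g_i}{\sum_i g_i} \propto \frac{\nabla \hat{f}_h(y_j)}{\hat{f}_h(y_j)},
\quad\text{with}\quad g_i \equiv g\!\left(\frac{\lVert x_i - y_j\rVert}{h}\right).
$$
The numerator sums kernel-weighted displacement vectors, which is proportional to the gradient of the smoothed density at $y_j$; the denominator is proportional to the smoothed density itself.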
This is where the next step kicks in: if test particles overlap within some fraction of the kernel bandwidth, they get merged. (Before you ask: they keep their original "mass"; this is only about their location.) Because the number of modes in the density distribution is finite, so will be the number of surviving test particles, and that will be the cluster number $K$.
Here's the relevant figure from the original paper by [Comaniciu & Meer (2002)](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.76.8968&rep=rep1&type=pdf):

The tracks of the particles are shown as black lines, the final positions as red dots. You can see that the particles first walk up the sides towards the ridge lines and then along the ridge lines to the peaks.
> The kicker is: the test particles are the same samples that define the density via KDE!
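To make this concrete, here's a minimal sketch of the update with a Gaussian kernel: the test particles start at the samples themselves, and particles that end up within half a bandwidth of each other are merged into one mode. The kernel choice, the merge radius, and the fixed iteration count are my assumptions for illustration, not prescriptions from the paper.
```python
# minimal Mean-Shift sketch: Gaussian kernel, test particles start at the samples,
# particles within 0.5*h of each other after the iterations are merged into one mode
def mean_shift_sketch(X, h, n_iter=50):
    Y = X.copy()                                   # test particles = samples
    for _ in range(n_iter):
        d = np.linalg.norm(Y[:, None, :] - X[None, :, :], axis=-1)  # particle-sample distances
        w = np.exp(-0.5 * (d / h)**2)              # Gaussian kernel weights
        Y = (w @ X) / w.sum(axis=1, keepdims=True) # mean-shift update
    # merge particles that converged to (nearly) the same mode
    modes = []
    for y in Y:
        if not any(np.linalg.norm(y - m) < 0.5 * h for m in modes):
            modes.append(y)
    return np.array(modes)
```
The number of surviving modes is the cluster number $K$; labels then follow by assigning each sample to its nearest mode.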
In summary, for Mean-Shift we need to decide on the bandwidth, but it will tell us how many clusters it finds. This is great news if we can guess the spatial correlation scale of the data better than the cluster number. Let's try...
```python
for b in [0.3, 0.05, 0.01, None]:
    ms = MeanShift(bandwidth=b)
    labels = ms.fit_predict(X)
    centers = ms.cluster_centers_
    fig = plt.figure()
    ax = fig.add_subplot(111, aspect='equal')
    ax.scatter(X[:,0], X[:,1], c=labels, alpha=0.1, cmap='Paired')
    ax.scatter(centers[:,0], centers[:,1], c=np.arange(1,len(centers)+1), s=1000, marker='+', cmap='Paired')
    ax.text(0.05, 0.95, 'bandwidth = %r' % b, transform=ax.transAxes, va='top')
    fig.tight_layout()
```




I'm not a fan of the last one with `b=None`, which uses some auto-tuning of the bandwidth, but if we pick a scale that makes sense physically (here something like `b=0.05` degrees), we get quite plausible results. Note that this algorithm doesn't break up larger clusters into smaller ones just to keep some approximate amplitude balance.
We also had a discussion during the seminar about whether one could speed up the convergence of the test particles towards the peaks. The ideas ranged from Nesterov acceleration, to gain momentum along the ridge lines, to some ensemble triangulation of the likely peak position from the paths that different particles take. All I'm saying: there seems to be room for improvement.
## Minimum spanning tree
This brings us to another class of clustering algorithms: graph-based ones. They are often associated with hierarchical clustering, but I don't think this tells us what's really going on under the hood (in fact the mean-shift algorithm can also be considered hierarchical...).
When we construct a graph of a data set, we need to decide when to connect two nodes with an edge. A natural idea is to define a distance metric and then connect nodes that have small distances. King among those graphs is the [Minimum spanning tree](https://en.wikipedia.org/wiki/Minimum_spanning_tree). It's the tree that stems from a fully connected graph, pruned such that the sum of the edge lengths is minimal and there are no cycles.
Let's construct one and then see how we can do clustering with it.
```python
from scipy.sparse.csgraph import minimum_spanning_tree
from sklearn.neighbors import kneighbors_graph

# sparse distance graph of the 100 nearest neighbors, then its minimum spanning tree
G = kneighbors_graph(X, n_neighbors=100, mode='distance')
T = minimum_spanning_tree(G).toarray()
one, two = np.where(T > 0)   # node indices connected by each MST edge

fig = plt.figure()
ax = fig.add_subplot(111, aspect='equal')
ax.scatter(X[:,0], X[:,1], alpha=0.1)
# draw the edges as headless arrows from node `one` to node `two`
ax.quiver(X[one,0], X[one,1], X[two,0]-X[one,0], X[two,1]-X[one,1], angles='xy', scale_units='xy', scale=1, headwidth=0, headaxislength=0, headlength=0, minlength=0)
fig.tight_layout()
```

It's quite clear that within the clusters the distances between nodes are smaller than between clusters. Since the tree has to span the whole data set, it needs to include those long edges, but what if we pruned them so that the graph breaks up into pieces?
```python
# remove the 90 .. 99 percentile longest edges and plot the pruned tree
perc = [90, 95, 98, 99]
cuts = np.percentile(T[T > 0], perc)
for i in range(len(perc))[::-1]:
    # keep only edges shorter than the current percentile cut; plot as in the full-MST figure above
    one, two = np.where((T > 0) & (T < cuts[i]))
    fig = plt.figure()
    ax = fig.add_subplot(111, aspect='equal')
    ax.scatter(X[:,0], X[:,1], alpha=0.1)
    ax.quiver(X[one,0], X[one,1], X[two,0]-X[one,0], X[two,1]-X[one,1], angles='xy', scale_units='xy', scale=1, headwidth=0, headaxislength=0, headlength=0, minlength=0)
    fig.tight_layout()
```
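Once the long edges are gone, the clusters are simply the connected components of what remains. Here's a minimal sketch of turning the pruned tree into cluster labels with `scipy.sparse.csgraph.connected_components`; the 95th-percentile cut is the same kind of judgment call as the bandwidth above.
```python
# label clusters as the connected components of the pruned MST
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

cut = np.percentile(T[T > 0], 95)    # judgment call: drop the 5% longest edges
T_cut = np.where(T < cut, T, 0)      # prune long edges; non-edges stay zero
n_clusters, labels = connected_components(csr_matrix(T_cut), directed=False)
print(n_clusters, 'clusters')
```
Samples in the same component share a label; isolated samples end up as their own tiny "clusters", which one may want to flag as noise.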