When working with classifiers, we have many choices. The main dichotomy is: **generative vs discriminative**. I will show that some canonical classifiers are not as different as they often seem.
Throughout this note, I will assume that we have observed samples $\lbrace x_i \rightarrow y_i\rbrace$, with the output variable being canonical, $y_i\in\lbrace 0,\dots,K\rbrace$.
## Generative classifiers
The methods in this group generate a model for the probability density distribution in each class, and use these models to determine the probability of a given sample to belong to any particular class. As such, generative classifiers are described by the same posterior class assignment we've seen before in Gaussian mixture models,
$$
p(k\,|\,x_i) = \frac{p(x_i\,|\,k)\, p(k)}{\sum_k p(x_i\,|\,k)\, p(k)}
$$
where $k$ is the class label, and $p(k)$ is the prior density of the class. As so often, we will now assume that the class likelihoods are Gaussian:
$$
p(x\,|\,k) = \frac{1}{\sqrt{|2\pi\Sigma_k|}} \exp\left(-\frac{1}{2}(x-\mu_k)^\top \Sigma_k^{-1} (x-\mu_k)\right)
$$
For classifications, it's most convenient to look at the two-class case and inspect the shape of the discriminating boundary, where $p(k\,|\,x) = p(l\,|\,x)$. With the Gaussians it's easier to take the log of the class posteriors, which gives
$$
\begin{split}
&-\frac{1}{2}\log(|2\pi\Sigma_k|) - \frac{1}{2}(x-\mu_k)^\top \Sigma_k^{-1} (x-\mu_k) + \log(p(k)) =\\
&-\frac{1}{2}\log(|2\pi\Sigma_l|) - \frac{1}{2}(x-\mu_l)^\top \Sigma_l^{-1} (x-\mu_l) + \log(p(l)).
\end{split}
$$
The boundary is quadratic in $x$ because of the second term on either side. This is the reason for the name **Quadratic Discriminant Analysis** (QDA), which really isn't anything other than a multivariate Gaussian fit to each of two classes with a class labeling decision based on which posterior is larger: $p(k\,|\,x)$ or $p(l\,|\,x)$.
But since that's not simple enough, we can make even more assumptions. If we say that $p(x\,|\,k) = \prod_d p_d(x_d\,|\,k)$, which means that all dimensions of a multi-variate feature vector are independent, and that each $p_d(x_d\,|\,k)$ is a univariate Gaussian, we end up with the **Naive Bayes classifier**. In the formulation above, that results in diagonal covariance matrices $\Sigma_k$, which is only a good idea if the sample distributions in each class are approximately axis-aligned. It does, on the other hand, reduce the complexity of estimating the best-fitting Gaussian, which is especially important in high-dimensional spaces.
We can also assume, mostly for historic reasons in reference to Ronald Fisher, that each class has the same covariance (diagonal or not). In this case, the discriminating boundary is
$$
\mu_k^\top\Sigma^{-1}x - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log(p(k)) = \mu_l^\top\Sigma^{-1}x - \frac{1}{2}\mu_l^\top\Sigma^{-1}\mu_l + \log(p(l)),
$$
which is evidently linear in $x$, which is why this classifier is known as **Linear Discriminant Analysis** (LDA). We can rewrite the condition as $\beta^\top x +\beta_0 = 0$, which will be a common form for the next group of classifiers.
Two notes before we close this section:
* Instead of simplifying from the single multi-variate Gaussian, one can fit more complex forms such as a Gaussian Mixture to each (or some) classes. The boundaries will then become more complex, too.
* Generative classifiers naturally allow for *soft classification*, where one doesn't want to make the strict/hard classification according to the discriminating boundary. This may be a better choice if one wants to compute expectation values of class statistics that are sensitive to truncation at the boundary.
## Discriminative classifiers
This group of classifiers tries to determine an optimal boundary between samples from different classes. The easiest ones are those for which the boundary is linear. Data that permit a linear boundary to discriminate all samples are called *linearly separable*.
### Logistic regression
If we require that the boundary between different class posteriors (their logs, actually) is linear in $x$, we get
$$
\log \frac{p(k\,|\,x)}{p(l\,|\,x)} = \beta_{kl}^\top x + \beta_{kl,0},
$$
which is the familiar form of a hyperplane we already saw with LDA. But instead of demanding that the class likelihoods are Gaussian (with the same covariance), we can simply ask: what form of the class posterior do we get from the linear discriminant assumption. In the two-class case we have
$$
\begin{split}
&p(0\,|\,x) = \frac{\exp(\beta^\top x + \beta_0)}{1+\exp(\beta^\top x + \beta_0)}\\
&p(1\,|\,x) = \frac{1}{1+\exp(\beta^\top x + \beta_0)},
\end{split}
$$
which is proper because $\forall k: 0\leq p(k \,| \,x)\leq 1$ and $\sum_kp(k \,| \,x)=1$. The form of $p(1\,|\,x)$ is that of the [logistic function](https://en.wikipedia.org/wiki/Logistic_function), which gives this method its name: **Logistic regression** or **logit**.
To determine the parameters $\beta$ and $\beta_0$, we first realize that the probability distribution of the two-class classifier is the [Bernoulli](https://en.wikipedia.org/wiki/Bernoulli_distribution), with $y_i\in\lbrace 0,1\rbrace$ being the outcomes of the binary trial:
$$
\mathcal{L}(\hat\beta) = \prod_i p(1\,|\, x_i)^{y_i} (1-p(1\,|\, x_i))^{1-y_i}.
$$
The log-likelihood is then
$$
\log\mathcal{L}(\beta, \beta_0)=\sum_i y_i (\beta^\top x_i +\beta_0)- \log\left( 1+\exp(\beta^\top x_i +\beta_0)\right),
$$
which one can find the maximum of by various gradient techniques, the traditional one is called [iteratively reweighted least squares (IRLS)](https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares).
At this point, it's useful to understand the relation of regression and classification we've been exploiting here. The logit classifier is equivalent to a regression problem for
$$
y = \begin{cases}1 & \beta^\top x_i +\beta_0 + \epsilon > 0\\0&\text{else}\end{cases}
$$
where the error term $\epsilon$ follows the logistic distribution (if the standard normal distribution is used instead, it is the [probit model](https://en.wikipedia.org/wiki/Probit_model)). This is manifestly discriminative even though we have probabilistic forms of each class.
So, one might wonder how logit classifier differs from LDA. It's the Gaussian assumption in LDA, which is useful to learn the bulk properties of the classes by inferring the analytic PDF from all samples, but it's also more sensitive to outliers since they are expected to follow the Gaussian form, too.
### Support Vector classifier
And now we will make the linear boundary assumption fully discriminative. We will ignore the class distributions entirely and focus only on optimal separability by requiring that the hyperplane between two classes has a maximally wide margin as show in the image below, where the margin is labeled as **w**.
![](https://upload.wikimedia.org/wikipedia/commons/2/2a/Svm_max_sep_hyperplane_with_margin.png)
We therefore demand of the hyperplane to satisfy
$$
\max_{\beta, \beta_0:\lVert\beta\rVert=1}(w)\ \ \text{subject to}\ \forall i: y_i (\beta^\top x_i+\beta_0)\geq w.
$$
To simplify the formulation, I have changed the definition of the response variable to $y_i\in\lbrace-1,1\rbrace$. We can get rid of the constraint $\lVert\beta\rVert=1$ by redefining $\beta_0$ and requiring that $y_i(\beta^\top x_i+\beta_0)\geq w\lVert\beta\rVert$. As this constraint can be fulfilled by *any* rescaling of $\beta$ and $\beta_0$, we can set $\lVert\beta\rVert=w^{-1}$ and rewrite the constrained optimization problem above as
$$
\min_{\beta,\beta_0}\lVert\beta\rVert\ \ \text{subject to}\ \forall i: y_i (\beta^\top x_i+\beta_0)\geq 1.
$$
And to make it even simpler we will instead minimize the most convenient cost function of them all, $\frac{1}{2}\lVert\beta\rVert^2$, because all norms are equivalent and some are easier to compute.
Before we go into solving this problem, let's make it more powerful by adding robustness to outliers. Imagine the two classes can be linearly separated as shown in the image above apart from a few points that will fall on the wrong side of the discriminating. We'll parameterize the fractional misclassification $\xi_i$ as the amount (in units of $\beta$) by which a sample is on the wrong side of the boundary. This will relax the constraint to $y_i (\beta^\top x_i+\beta_0)\geq 1-\xi_i$, but now we have one more constraint, namely $\forall i: \xi_i\geq0$, and another term in the const function: $\gamma\sum_i \xi_i$, which weighs the total misclassification error with a factor $\gamma$. As so often for constrained optimization problems, we can solve it by writing down the Lagrange function
$$
L_p =\frac{1}{2}\lVert \beta\rVert^2 + \gamma \sum_i \xi_i + \sum_i \alpha_i \left[y_i (\beta^\top x_i+\beta_0) - (1-\xi_i)\right] - \sum_i\mu_i\xi_i
$$
where the first terms are the extended cost function, the third is the classification constraint (with a set of Lagrange multipliers $\alpha_i$), and the last ensures that the $\xi_i$ are non-negative. A not particularly illuminating derivation will result in the following solution for $\beta$:
$$
\hat\beta=\sum_i \hat\alpha_i y_i x_i,
$$
with the interesting property that the coefficients $\hat\alpha_i \neq 0 \Leftrightarrow y_i (\beta^\top x_i+\beta_0)\geq 1-\xi_i$. This means that the solution is obtained only from those samples $x_i$ that satisfy the constraint, and they are thus call *support vectors*. The samples on the margin ($\xi_i=0$) then determine $\beta_0$.
We've now seen three classifiers with linear boundaries: LDA, logit, and SV, which each optimize different objective functions. However, they all require linear separability in the feature space (or something close to that). Sometimes that's simply out of the question. The canonical way out is the [kernel trick](https://en.wikipedia.org/wiki/Kernel_trick), which replaces ordinary dot products by some other kernel function. For classifiers, this means one can pull of the linear separability in some higher-dimensional space. The combination of the support vector classifier with the kernel trick is called **Support Vector Machine**.
# Data Science in Astronomy: Classification Theory