Bayes' Theorem:
$$
P(A|B) = \frac{P(B\mid A)P(A)}{P(B)}
$$
Decision rule with only the prior information
Decide $\omega_1$ if $P(\omega_1)\gt P(\omega_2)$, otherwise decide $\omega_2$
Likelihood
$P(x\mid \omega_j)$ is called the likelihood of $\omega_j$ with respect to $x$; the category $\omega_j$ for which $P(x\mid \omega_j)$ is large is more likely to be the true category
Maximum likelihood decision
Assign input pattern $x$ to class $\omega_1$ if $P(x \mid \omega_1) > P(x \mid \omega_2),$ otherwise $\omega_2$
Posterior
$$
P(\omega_i\mid x)=\frac{P(x\mid \omega_i)P(\omega_i)}{P(x)}\\
P(x)=\sum_{i=1}^{k}{P(x\mid \omega_i)P(\omega_i)}
$$
Posterior = (Likelihood × Prior) / Evidence
Optimal Bayes Decision Rule
Decision given the posterior probabilities, Optimal Bayes Decision rule
Assign input pattern $x$ to class $\omega_1$ if $P(\omega_1 \mid x) > P(\omega_2 \mid x),$ otherwise $\omega_2$
The Bayes decision rule minimizes the probability of error, which is where the term "Optimal" comes from.
$$
P(error\mid x) = \min{\left\{P(\omega_1\mid x),P(\omega_2\mid x)\right\}}
$$
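A minimal numeric sketch of this rule; the likelihood values and priors below are made-up illustration numbers, not values from these notes.

```python
import numpy as np

priors = np.array([0.6, 0.4])          # P(w1), P(w2)  (assumed)
likelihoods = np.array([0.3, 0.5])     # P(x|w1), P(x|w2) at an observed x (assumed)

evidence = np.sum(likelihoods * priors)            # P(x)
posteriors = likelihoods * priors / evidence       # P(w_i | x)

decision = np.argmax(posteriors) + 1               # decide the class with the larger posterior
p_error = np.min(posteriors)                       # P(error | x) = min posterior

print(f"posteriors = {posteriors}, decide w{decision}, P(error|x) = {p_error:.3f}")
```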
Special cases:
Bayes Risk
Likelihood Ratio
if
$$
\frac{P(\boldsymbol{x}\mid\omega_1)}{P(\boldsymbol{x}\mid\omega_2)}\gt \frac{\lambda_{12}-\lambda_{22}}{\lambda_{21}-\lambda_{11}} \times \frac{P(\omega_2)}{P(\omega_1)}
$$ decide $\omega_1$.
"If the likelihood ratio exceeds a threshold value that is independent of the input pattern x, we can take optimal actions"
Minimum Error Rate Classification
$$
R(\alpha_i\mid \boldsymbol{x}) = 1-P(\omega_i\mid \boldsymbol{x})
$$
Minimizing the risk → Maximizing the posterior $P(\omega_i\mid \boldsymbol{x})$
Network Representation of a Classifier
Bayes Classifier
$$
g_i(\boldsymbol{x}) = P(\boldsymbol{x}\mid \omega_i)P(\omega_i)\\
g_i(\boldsymbol{x}) = \ln{P(\boldsymbol{x}\mid \omega_i)}+\ln{P(\omega_i)}
$$
Binary classification → Multi‐class classification
Normal Distribution
$\mathcal{N}(\boldsymbol{\mu}, \Sigma)$ (with dimension $d$)
$$
P(\boldsymbol{x})= \frac{1}{\left(2\pi\right)^{\frac{d}{2}}\left|\Sigma\right|^{\frac{1}{2}}}{\exp{\left[-\frac{1}{2}\left(\boldsymbol{x}-\boldsymbol{\mu}\right)^T\Sigma^{-1}\left(\boldsymbol{x}-\boldsymbol{\mu}\right)\right]}}
$$
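A short NumPy sketch evaluating this density; $\boldsymbol{\mu}$ and $\Sigma$ are arbitrary example values.

```python
import numpy as np

def gaussian_pdf(x, mu, Sigma):
    # d-dimensional normal density N(x; mu, Sigma), as in the formula above
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
print(gaussian_pdf(np.array([0.5, 0.5]), mu, Sigma))
```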
Maximum‐Likelihood Estimation
Gaussian with unknown $\mu$ and $\Sigma$
$$
\boldsymbol{\hat{\mu}}=\frac{1}{n}\sum_{k=1}^{n}\boldsymbol{x}_k\\
\boldsymbol{\hat{\Sigma}}=\frac{1}{n}\sum_{k=1}^{n}{\left(\boldsymbol{x}_k-\boldsymbol{\hat{\mu}}\right)\left(\boldsymbol{x}_k-\boldsymbol{\hat{\mu}}\right)^T}
$$
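A sketch of these ML estimates on synthetic data; the generating parameters are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([1.0, -2.0], [[1.0, 0.5], [0.5, 2.0]], size=500)  # n x d samples

mu_hat = X.mean(axis=0)                       # sample mean
centered = X - mu_hat
Sigma_hat = centered.T @ centered / len(X)    # biased (1/n) covariance, as in the formula

print(mu_hat, Sigma_hat, sep="\n")
```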
Bayesian Estimation
$$
P(\omega_i\mid \mathbf{x}, \mathcal{D})=\frac{p(\mathbf{x}\mid \omega_i, \mathcal{D}_i)P(\omega_i)}{\sum_{j=1}^c{p(\mathbf{x}\mid \omega_j, \mathcal{D}_j)P(\omega_j)}}
$$
where
$$
p(x\mid D)=\int{p(x,\theta\mid D)d\theta}=\int{p(x\mid\theta, D)p(\theta\mid D)d\theta}\\
=\int{p(x\mid \theta)p(\theta\mid D)d\theta}
$$
Computing $p(x\mid D)$ is the central problem of Bayesian learning.
Linear Discriminant Analysis
Covariance matrices of all classes are identical but can be arbitrary
$$
g_i(x)=\mu_i^T\Sigma^{-1}x-\frac{1}{2}\mu_i^T\Sigma^{-1}\mu_i+\ln{P(\omega_i)}
$$
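A sketch of classifying with this discriminant function; the class means, shared covariance, and priors are illustrative assumptions.

```python
import numpy as np

mus = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]   # class means
Sigma = np.array([[1.0, 0.2], [0.2, 1.5]])           # shared covariance
priors = [0.5, 0.5]

def g(x, mu, Sigma, prior):
    w = np.linalg.solve(Sigma, mu)                   # Sigma^{-1} mu
    return w @ x - 0.5 * mu @ w + np.log(prior)      # mu^T Sigma^{-1} x - 0.5 mu^T Sigma^{-1} mu + ln P(w_i)

x = np.array([1.0, 0.5])
scores = [g(x, mu, Sigma, p) for mu, p in zip(mus, priors)]
print("decide class", int(np.argmax(scores)) + 1, scores)
```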
Error Probabilities and Integrals
$$
P(error)=\int_{R_2}{p(x\mid\omega_1)P(\omega_1)dx}+\int_{R_1}{p(x\mid\omega_2)P(\omega_2)dx}
$$
Naive Bayes Classifier
$$
P(\omega\mid x_1,\dots,x_p)\propto P(x_1,\dots, x_p\mid \omega)P(\omega)\\
P(x_1,\dots,x_p\mid \omega)=P(x_1\mid \omega)\cdots P(x_p\mid \omega)
$$
Smoothing
$$
P(x_i\mid \omega_k)=\frac{\left|x_{ik}\right|+1}{N_{\omega k}+K}
$$
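A sketch of a multinomial naive Bayes classifier with add-one (Laplace) smoothing along the lines of the formulas above; the tiny count matrix, labels, and test document are made up for illustration.

```python
import numpy as np

X = np.array([[2, 0, 1],      # document-term counts, rows = documents, columns = K = 3 words
              [0, 3, 0],
              [1, 0, 2],
              [0, 2, 1]])
y = np.array([0, 1, 0, 1])    # class label of each document

classes = np.unique(y)
log_prior = np.log(np.array([np.mean(y == c) for c in classes]))
K = X.shape[1]
# P(word i | class k) with add-one smoothing: (count of word i in class k + 1) / (total words in class k + K)
counts = np.array([X[y == c].sum(axis=0) for c in classes])
log_lik = np.log((counts + 1) / (counts.sum(axis=1, keepdims=True) + K))

x_new = np.array([1, 0, 1])                       # a new document's word counts
scores = log_prior + x_new @ log_lik.T            # log P(w) + sum_i n_i * log P(x_i | w)
print("predicted class:", classes[np.argmax(scores)])
```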
The Bagging Algorithm
Random Forest
Boosting
AdaBoost
weighting coefficients give greater weight to the more accurate classifiers when computing the overall output
In each round, the algorithm tries to obtain a different base learner, so that model diversity is achieved.
$$
\frac{\sum_{n=1}^N{w_n^{(m+1)}I(y_m(x_n)\neq t_n)}}{\sum_{n=1}^N{w_n^{(m+1)}}}=\frac{1}{2}
$$
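A sketch of one AdaBoost reweighting step that numerically checks the identity above; the update $w_n^{(m+1)}=w_n^{(m)}\exp\{\alpha_m I(y_m(x_n)\neq t_n)\}$ with $\alpha_m=\ln\frac{1-\epsilon_m}{\epsilon_m}$ is the assumed variant, and the labels and weak-learner outputs are hand-written stand-ins rather than a real base learner.

```python
import numpy as np

t   = np.array([ 1, -1,  1,  1, -1,  1, -1, -1,  1,  1])   # targets t_n
y_m = np.array([ 1, -1, -1,  1, -1,  1,  1, -1,  1, -1])   # weak learner y_m, 3 mistakes out of 10
w = np.full(len(t), 1.0 / len(t))                          # current weights w^(m)

miss = (y_m != t).astype(float)
eps = np.sum(w * miss) / np.sum(w)                         # weighted error of y_m (= 0.3 here)
alpha = np.log((1 - eps) / eps)                            # alpha_m
w_new = w * np.exp(alpha * miss)                           # w^(m+1)

print(np.sum(w_new * miss) / np.sum(w_new))                # 0.5, as the identity states
```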
Optimization of Exponential Loss
$$
E=\sum^N_{n=1}\exp{\left\{-t_nf_m(x_n)\right\}},\quad f_m(x)=\frac{1}{2}\sum_{l=1}^m{\alpha_ly_l(x)},\quad t_n\in\{-1,1\}
$$
K-means
Given k, the k‐means algorithm is implemented in four steps (a minimal sketch follows):
1. Partition the objects into $k$ nonempty subsets (or pick $k$ initial centroids).
2. Compute the seed points as the centroids (means) of the current clusters.
3. Assign each object to the cluster with the nearest seed point.
4. Go back to step 2; stop when the assignments no longer change.
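A minimal k-means sketch following these steps; the 2-D blob data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in ([0, 0], [3, 3], [0, 3])])
k = 3

centroids = X[rng.choice(len(X), k, replace=False)]          # 1. pick k initial centroids
for _ in range(100):
    dists = np.linalg.norm(X[:, None] - centroids[None], axis=2)
    labels = dists.argmin(axis=1)                            # 2. assign each point to the nearest centroid
    new_centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
                              for j in range(k)])            # 3. recompute centroids as cluster means
    if np.allclose(new_centroids, centroids):                # 4. stop when nothing changes
        break
    centroids = new_centroids

print(centroids)
```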
K-Medoids
Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used: the most centrally located object in the cluster.
Partitioning Around Medoids (PAM)
Calculate the pair‐wise distance matrix W first.
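A simplified alternating k-medoids sketch: assign points to the nearest medoid, then re-pick each cluster's medoid as the point minimizing the total within-cluster distance. This is a Voronoi-style simplification, not the full PAM swap search; the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, size=(40, 2)) for c in ([0, 0], [4, 4])])
W = np.linalg.norm(X[:, None] - X[None], axis=2)     # pair-wise distance matrix W

k = 2
medoids = rng.choice(len(X), k, replace=False)
for _ in range(50):
    labels = W[:, medoids].argmin(axis=1)            # assign each point to its nearest medoid
    new_medoids = np.array([
        np.flatnonzero(labels == j)[W[np.ix_(labels == j, labels == j)].sum(axis=1).argmin()]
        for j in range(k)
    ])                                               # most centrally located object per cluster
    if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
        break
    medoids = new_medoids

print(X[medoids])
```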
Gaussian Mixture Model
$$
p(x;\Theta)=\sum_{k=1}^K{\pi_kp_k(x;\theta_k)}\\
\Theta=\left\{\pi_1, \dots, \pi_K, \theta_1, \dots, \theta_K\right\},
\sum_{k}{\pi_k}=1\\
p_k(x;\theta_k)=\mathcal{N}(x;\mu_k,\Sigma_k)
$$
Each point $x^{(i)}$ is associated with a hidden variable $z^{(i)}$ that denotes which Gaussian $x^{(i)}$ belongs to.
EM algorithm
$$
L(\theta)=\sum_{i=1}^m{\log{\sum_{z^{(i)}}{p(x^{(i)},z^{(i)};\theta)}}}\\
=\sum_{i=1}^m{\log{\sum_{z^{(i)}}{Q^i(z^{(i)})\frac{p(x^{(i)},z^{(i)};\theta)}{Q^i(z^{(i)})}}}}\\
\geqslant\sum_{i=1}^m{\sum_{z^{(i)}}{Q^i(z^{(i)})\log{\frac{p(x^{(i)},z^{(i)};\theta)}{Q^i(z^{(i)})}}}}=l(\theta)
$$
Note that $f(x)=\log(x)$ is a concave function and by Jensen's inequality, we have:
$$
f\left(E_{z^{(i)}\sim Q}\left[\frac{p(x^{(i)},z^{(i)};\theta)}{Q^i(z^{(i)})}\right]\right)\geqslant E_{z^{(i)}\sim Q}\left[f\left(\frac{p(x^{(i)},z^{(i)};\theta)}{Q^i(z^{(i)})}\right)\right]
$$
where the $z^{(i)}\sim Q$ subscripts above indicate that the expectations are with respect to $z^{(i)}$ drawn from $Q$.
We make the inequality above hold with equality at our particular value of $\theta$ (E-step):
$$
Q^i(z^{(i)})=p(z^{(i)}\mid x^{(i)};\theta)
$$
With this choice, the fraction $\frac{p(x^{(i)},z^{(i)};\theta)}{Q^i(z^{(i)})}=p(x^{(i)};\theta)$ is a constant that does not depend on $z^{(i)}$, so the inequality holds with equality.
$$
L(\theta_{n+1})\geqslant l_n(\theta_{n+1})\geqslant l_n(\theta_{n})=L(\theta_n)
$$
where $l_n(\theta_{n+1})$ means the lower‐bound function built with $Q_n$ (the E‐step at $\theta_n$) and evaluated at $\theta_{n+1}$; $l_n(\theta_{n+1})\geqslant l_n(\theta_{n})$ because $\theta_{n+1}=\arg\max_{\theta}l_n(\theta)$.
In a GMM, the M‐step gives $\pi_k=P(z^{(i)}=k)=\frac{\sum_i{Q_k^{(i)}}}{n}$, where the responsibilities $Q_k^{(i)}=P(z^{(i)}=k\mid x^{(i)};\theta)$ are obtained by Bayes' theorem.
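A compact EM sketch for a two-component Gaussian mixture implementing the E-step (responsibilities via Bayes' theorem) and M-step (updates of $\pi_k$, $\mu_k$, $\Sigma_k$) described above; the synthetic data and the random initialization are illustrative assumptions.

```python
import numpy as np

def gauss(X, mu, Sigma):
    # multivariate normal density N(x; mu, Sigma), evaluated row-wise on X
    d = X.shape[1]
    diff = X - mu
    expo = -0.5 * np.einsum('ni,ij,nj->n', diff, np.linalg.inv(Sigma), diff)
    return np.exp(expo) / np.sqrt((2 * np.pi) ** d * np.linalg.det(Sigma))

rng = np.random.default_rng(0)
X = np.vstack([rng.multivariate_normal([0, 0], np.eye(2), 200),
               rng.multivariate_normal([4, 4], [[1.0, 0.5], [0.5, 1.0]], 300)])
K = 2
n, d = X.shape

pi = np.full(K, 1.0 / K)                       # mixing weights pi_k
mu = X[rng.choice(n, K, replace=False)]        # initial means
Sigma = np.array([np.eye(d)] * K)              # initial covariances

for _ in range(100):
    # E-step: responsibilities Q_ik = P(z_i = k | x_i) by Bayes' theorem
    joint = np.stack([pi[k] * gauss(X, mu[k], Sigma[k]) for k in range(K)], axis=1)
    Q = joint / joint.sum(axis=1, keepdims=True)
    # M-step: re-estimate pi_k, mu_k, Sigma_k from the responsibilities
    Nk = Q.sum(axis=0)
    pi = Nk / n
    mu = (Q.T @ X) / Nk[:, None]
    Sigma = np.array([((X - mu[k]).T * Q[:, k]) @ (X - mu[k]) / Nk[k] for k in range(K)])

print(pi, mu, sep="\n")
```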
Spectral Clustering
Divide vertices into two disjoint groups (A,B). Minimize weight of between‐group connections.
$$
cut(A, B)= \sum_{i\in A, j\in B}{w_{ij}}\\
assoc(A,V)=\sum_{i\in A, j\in V}{w_{ij}}\\
\min Ncut(A, B)=\frac{cut(A,B)}{assoc(A,V)}+\frac{cut(A,B)}{assoc(B,V)}
$$
Objective Function of Ncut
$$
D=[d_{ij}], d_{ii}=\sum_{j}{w_{ij}}\\
\min_{y}{\frac{y^T(D-W)y}{y^TDy}}, \quad y_i\in\{2,-2b\}, \quad y^TD1=0
$$
where
$$
k=\frac{\sum_{x_i>0}d_i}{\sum_id_i}, b=\frac{k}{1-k}
$$
Rayleigh quotient
$$
\min_{y}{\frac{y^TLy}{y^TDy}}, y\in R^n, y^TD1=0\\
\Rightarrow (D-W)y=\lambda Dy
$$
Generalized Eigen‐problem
In this problem (with $D$ and $W$), the all‐ones vector $\mathbf{1}$ is the eigenvector corresponding to eigenvalue 0.
The relaxed Ncut solution is the eigenvector corresponding to the second smallest eigenvalue.
Taking the eigenvectors of the $k$ smallest eigenvalues (from small to large), each original data point gets a $k$‐dimensional representation (its entries across these eigenvectors).
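A sketch of this pipeline on a toy similarity graph: build $W$ and $D$, solve $(D-W)y=\lambda Dy$, and split on the sign of the eigenvector with the second smallest eigenvalue. The Gaussian-kernel graph and two synthetic blobs are assumptions made for the example.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, size=(30, 2)), rng.normal(3, 0.3, size=(30, 2))])
sq_dists = ((X[:, None] - X[None]) ** 2).sum(axis=2)
W = np.exp(-sq_dists / 2.0)                       # Gaussian similarity graph
D = np.diag(W.sum(axis=1))                        # degree matrix, d_ii = sum_j w_ij
L = D - W

eigvals, eigvecs = eigh(L, D)                     # generalized eigenproblem L y = lambda D y
y = eigvecs[:, 1]                                 # eigenvector of the second smallest eigenvalue
labels = (y > 0).astype(int)                      # split vertices by the sign of y
print(labels)
```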
$$
\mathcal{F}(x\in R^p)=z\in R^d
$$
If $\mathcal{F}$ is linear, this is linear dimensionality reduction:
$$
A^Tx=z
$$
Selection: choose a best subset of size d from the available p features
Extraction: given p features (set X), extract d new features (set Z) by linear or non‐linear combination of all the p features
PCA
$$
\max_{a}{a^TSa}, s.t. a^Ta=1 \\
\Rightarrow Sa=\lambda a
$$ where
$$
S=\frac{1}{n}{\sum_{i=1}^n{(x_i-\bar{x})(x_i-\bar{x})^T}}
$$ is the covariance matrix.
The $k$th largest eigenvalue of $S$ is the variance of the $k$th PC.
Reconstruction
$$
\hat{X}=A(A^TX)
$$
Optimality Property of PCA
$$
\min_{A\in R^{p\times d}}{\left\|X-AA^TX\right\|^2_F}, s.t. A^TA=I_d
$$
where $X$ is the centered data matrix.
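A sketch of PCA by eigendecomposition of $S$ together with the reconstruction $A(A^TX)$; the synthetic data and the choice $d=2$ are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal([0, 0, 0], [[3, 1, 0], [1, 2, 0], [0, 0, 0.1]], size=200).T  # p x n
Xc = X - X.mean(axis=1, keepdims=True)            # center the data

S = Xc @ Xc.T / Xc.shape[1]                       # covariance matrix (1/n convention)
eigvals, eigvecs = np.linalg.eigh(S)              # eigenvalues in ascending order
A = eigvecs[:, ::-1][:, :2]                       # top d = 2 principal directions

Z = A.T @ Xc                                      # projected coordinates (d x n)
X_hat = A @ Z                                     # reconstruction A(A^T X)
print("variances of the PCs:", eigvals[::-1][:2],
      "reconstruction error:", np.linalg.norm(Xc - X_hat) ** 2)
```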
LDA
$$
J(a)=\frac{(\tilde\mu_1-\tilde\mu_2)^2}{\tilde{s}^2_1+\tilde{s}^2_2}=\frac{a^TS_Ba}{a^TS_Wa}
$$
$$
S_Ba=\lambda S_Wa\\
a^\star=S_W^{-1}(\mu_1-\mu_2)
$$
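A sketch computing the two-class Fisher direction $a^\star=S_W^{-1}(\mu_1-\mu_2)$; the two Gaussian classes are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.multivariate_normal([0, 0], [[1.0, 0.5], [0.5, 1.0]], 100)
X2 = rng.multivariate_normal([2, 1], [[1.0, 0.5], [0.5, 1.0]], 100)

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)   # within-class scatter

a = np.linalg.solve(S_W, mu1 - mu2)     # optimal projection direction a*
a /= np.linalg.norm(a)
print(a)
```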
LPP
Locality Preserving Projections
Graph $W$, with $D$ ($d_{ii}=\sum_j{w_{ij}}$) and $L=D-W$ defined as in spectral clustering.
Unsupervised, but it is very easy to have supervised (semi‐supervised) extensions.
$$
\min{\sum_{ij}{w_{ij}(z_i-z_j)^2}}=\min{\frac{a^TXLX^Ta}{a^TXDX^Ta}}\\
\Rightarrow XLX^Ta = \lambda XDX^Ta\\
\Rightarrow XWX^Ta = \lambda XDX^Ta
$$ When using $L$, take the smallest eigenvalues; when using $W$, take the largest eigenvalues.
LPP is equivalent to LDA when a specifically designed supervised graph is used
LPP is similar to PCA when an inner product graph is used. ($W=X^TX$)
Graph‐based Dimensionality Reduction: $XWX^Ta = \lambda XDX^Ta$
Regularization: in the maximization form, add $\gamma I$ to the matrix in the denominator (the right‐hand side of the generalized eigenproblem), as in LDA and LPP.
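A sketch of LPP under assumed choices (a heat-kernel nearest-neighbour graph, $d=2$, and a small $\gamma I$ regularizer on the denominator matrix): build $W$, $D$, $L$, then solve $XLX^Ta=\lambda XDX^Ta$ for the smallest eigenvalues.

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 100))                       # p x n data matrix, p = 5, n = 100

sq = ((X.T[:, None] - X.T[None]) ** 2).sum(axis=2)  # pairwise squared distances
W = np.exp(-sq / sq.mean())                         # heat-kernel weights
idx = np.argsort(sq, axis=1)[:, 6:]                 # keep only ~5 nearest neighbours per point
np.put_along_axis(W, idx, 0.0, axis=1)
W = np.maximum(W, W.T)                              # symmetrize the graph
D = np.diag(W.sum(axis=1))
L = D - W

gamma = 1e-6
eigvals, eigvecs = eigh(X @ L @ X.T, X @ D @ X.T + gamma * np.eye(X.shape[0]))
A = eigvecs[:, :2]                                  # d = 2 directions (smallest eigenvalues)
Z = A.T @ X                                         # embedded coordinates
print(Z.shape)
```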
Laplacian Eigenmap
$$
\min{y^TLy}\\ s.t.\ y^TDy=1
$$
Document‐Term Matrix
Query
$$
cos(q,d)=\frac{q^Td}{\left\|q\right\|\left\|d\right\|}
$$
Polysemy (one word with multiple meanings): the true similarity is lower than the cosine score (sim < cos)
Synonymy (separate words with the same meaning): the true similarity is higher than the cosine score (sim > cos)
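A sketch of cosine-similarity retrieval over a tiny made-up document-term count matrix; the vocabulary, documents, and query are illustrative assumptions.

```python
import numpy as np

# rows = documents, columns = terms
X = np.array([[3, 0, 1, 0],
              [0, 2, 0, 1],
              [1, 1, 2, 0]], dtype=float)
q = np.array([1, 0, 1, 0], dtype=float)      # query as a term vector

scores = X @ q / (np.linalg.norm(X, axis=1) * np.linalg.norm(q))   # cos(q, d) for each document
ranking = np.argsort(-scores)
print("ranked documents:", ranking, "scores:", scores[ranking])
```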
Language Model Paradigm
$$
P(R_d=1\mid q)\propto P(q\mid R_d=1)P(R_d=1)
$$
Naive Approach
$$
\hat{P}(w\mid d)=\frac{n(d,w)}{\sum_{w'}{n(d, w')}}
$$
Probabilistic Latent Semantic Analysis
$$
\hat{P}(w\mid d)=\sum_z{P(w\mid z)P(z\mid d)}
$$
Latent Dirichlet allocation
‐ Adds a Dirichlet prior on the per‐document topic distribution, addressing an often‐criticized shortcoming of PLSA, namely that it is not a proper generative model for new documents, while at the same time avoiding the overfitting problem.
$$
UV=X
$$