
Bayes' Theorem: $$ P(A|B) = \frac{P(B\mid A)P(A)}{P(B)} $$
Decision rule with only the prior information: decide $\omega_1$ if $P(\omega_1)\gt P(\omega_2)$; otherwise decide $\omega_2$.
Likelihood: $P(x\mid \omega_j)$ is called the likelihood of $\omega_j$ with respect to $x$; the category $\omega_j$ for which $P(x\mid \omega_j)$ is larger is the more likely true category.
Maximum likelihood decision: assign input pattern $x$ to class $\omega_1$ if $P(x \mid \omega_1) > P(x \mid \omega_2)$; otherwise $\omega_2$.
Posterior $$ P(\omega_i\mid x)=\frac{P(x\mid \omega_i)P(\omega_i)}{P(x)},\quad P(x)=\sum_{j=1}^{k}{P(x\mid \omega_j)P(\omega_j)} $$ Posterior = (Likelihood × Prior) / Evidence
Optimal Bayes Decision Rule: given the posterior probabilities, assign input pattern $x$ to class $\omega_1$ if $P(\omega_1 \mid x) > P(\omega_2 \mid x)$; otherwise $\omega_2$. The Bayes decision rule minimizes the probability of error, which is where the term "optimal" comes from. $$ P(error\mid x) = \min{\left\{P(\omega_1\mid x),P(\omega_2\mid x)\right\}} $$
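A minimal numpy sketch of the two-class Bayes decision rule above, assuming 1-D Gaussian class-conditional densities; the priors, means, and standard deviations here are illustrative values, not from the notes:

```python
import numpy as np
from scipy.stats import norm

# Illustrative priors and 1-D Gaussian class-conditional densities (assumed values)
priors = np.array([0.6, 0.4])                          # P(w1), P(w2)
means, stds = np.array([0.0, 2.0]), np.array([1.0, 1.0])

def bayes_decide(x):
    likelihoods = norm.pdf(x, loc=means, scale=stds)   # P(x | w_i)
    posteriors = likelihoods * priors                  # numerator of Bayes' theorem
    posteriors /= posteriors.sum()                     # divide by the evidence P(x)
    return int(np.argmax(posteriors)) + 1, posteriors  # decide w1 or w2, return posteriors

print(bayes_decide(0.8))
```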
Special cases:
Bayes Risk
Likelihood Ratio if $$ \frac{P(\boldsymbol{x}\mid\omega_1)}{P(\boldsymbol{x}\mid\omega_2)}\gt \frac{\lambda_{12}-\lambda_{22}}{\lambda_{21}-\lambda_{11}} \times \frac{P(\omega_2)}{P(\omega_1)} $$ decide $\omega_1$. "If the likelihood ratio exceeds a threshold value that is independent of the input pattern x, we can take optimal actions"
Minimum Error Rate Classification $$ R(\alpha_i\mid \boldsymbol{x}) = 1-P(\omega_i\mid \boldsymbol{x}) $$ Minimizing the risk → Maximizing the posterior $P(\omega_i\mid \boldsymbol{x})$
Network Representation of a Classifier

Bayes Classifier $$ g_i(\boldsymbol{x}) = P(\boldsymbol{x}\mid \omega_i)P(\omega_i) \quad\text{or equivalently}\quad g_i(\boldsymbol{x}) = \ln{P(\boldsymbol{x}\mid \omega_i)}+\ln{P(\omega_i)} $$
Binary classification → Multi‐class classification
Normal Distribution $\mathcal{N}(\boldsymbol{\mu}, \Sigma)$ (with dimension $d$) $$ P(\boldsymbol{x})= \frac{1}{\left(2\pi\right)^{\frac{d}{2}}\left|\Sigma\right|^{\frac{1}{2}}}{\exp{\left[-\frac{1}{2}\left(\boldsymbol{x}-\boldsymbol{\mu}\right)^T\Sigma^{-1}\left(\boldsymbol{x}-\boldsymbol{\mu}\right)\right]}} $$
Maximum‐Likelihood Estimation
Gaussian with unknown $\mu$ and $\Sigma$ $$ \boldsymbol{\hat{\mu}}=\frac{1}{n}\sum_{k=1}^{n}\boldsymbol{x}_k,\quad \boldsymbol{\hat{\Sigma}}=\frac{1}{n}\sum_{k=1}^{n}{\left(\boldsymbol{x}_k-\boldsymbol{\hat{\mu}}\right)\left(\boldsymbol{x}_k-\boldsymbol{\hat{\mu}}\right)^T} $$
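A minimal numpy sketch of these ML estimates on assumed toy data; note the $1/n$ factor, unlike numpy's default unbiased `np.cov`:

```python
import numpy as np

# Assumed toy data: n samples of dimension d (illustrative values only)
X = np.random.default_rng(0).normal(size=(100, 3))    # shape (n, d)

mu_hat = X.mean(axis=0)                               # sample mean
centered = X - mu_hat
Sigma_hat = centered.T @ centered / X.shape[0]        # ML estimate (divides by n, not n-1)

print(mu_hat, Sigma_hat, sep="\n")
```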
Bayesian Estimation $$ P(\omega_i\mid \mathbf{x}, \mathcal{D})=\frac{p(\mathbf{x}\mid \omega_i, \mathcal{D}_i)P(\omega_i)}{\sum_{j=1}^c{p(\mathbf{x}\mid \omega_j, \mathcal{D}_j)P(\omega_j)}} $$ where $$ p(x\mid \mathcal{D})=\int{p(x,\theta\mid \mathcal{D})\,d\theta}=\int{p(x\mid\theta, \mathcal{D})\,p(\theta\mid \mathcal{D})\,d\theta}=\int{p(x\mid \theta)\,p(\theta\mid \mathcal{D})\,d\theta} $$ $p(x\mid \mathcal{D})$ is the central problem of Bayesian learning.
Linear Discriminant Analysis Covariance matrices of all classes are identical but can be arbitrary $$ g_i(x)=\mu_i^T\Sigma^{-1}x-\frac{1}{2}\mu_i^T\Sigma^{-1}\mu_i+\ln{P(\omega_i)} $$
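A short numpy sketch of this linear discriminant, assuming the per-class means, the shared covariance, and the priors have already been estimated (the values below are placeholders for illustration):

```python
import numpy as np

# Assumed shared covariance, class means, and priors (placeholder values)
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
priors = [0.5, 0.5]
Sigma_inv = np.linalg.inv(Sigma)

def g(i, x):
    # g_i(x) = mu_i^T Sigma^{-1} x - 0.5 mu_i^T Sigma^{-1} mu_i + ln P(w_i)
    mu = means[i]
    return mu @ Sigma_inv @ x - 0.5 * mu @ Sigma_inv @ mu + np.log(priors[i])

x = np.array([1.0, 0.5])
print(np.argmax([g(i, x) for i in range(2)]))  # index of the predicted class
```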
Error Probabilities and Integrals $$ P(error)=\int_{R_2}{p(x\mid\omega_1)P(\omega_1)dx}+\int_{R_1}{p(x\mid\omega_2)P(\omega_2)dx} $$
Naive Bayes Classifier $$ P(\omega\mid x_1,\dots,x_p)\propto P(x_1,\dots, x_p\mid \omega)P(\omega),\quad P(x_1,\dots,x_p\mid \omega)=P(x_1\mid \omega)\cdots P(x_p\mid \omega) $$
Smoothing $$ P(x_i\mid \omega_k)=\frac{\left|x_{ik}\right|+1}{N_{\omega k}+K} $$
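A rough sketch of a categorical naive Bayes classifier with add-one (Laplace) smoothing as in the formula above; the toy feature values and labels are made up for illustration:

```python
import numpy as np
from collections import defaultdict

# Assumed toy categorical data: rows are samples, columns are features
X = [["sunny", "hot"], ["rainy", "mild"], ["sunny", "mild"], ["rainy", "hot"]]
y = ["no", "yes", "yes", "no"]

classes = sorted(set(y))
n_features = len(X[0])
values = [sorted({row[j] for row in X}) for j in range(n_features)]  # K possible values per feature

prior = {c: y.count(c) / len(y) for c in classes}
counts = defaultdict(int)                       # (class, feature index, value) -> count
for row, c in zip(X, y):
    for j, v in enumerate(row):
        counts[(c, j, v)] += 1

def posterior_scores(x):
    scores = {}
    for c in classes:
        log_p = np.log(prior[c])
        for j, v in enumerate(x):
            # Laplace smoothing: (count + 1) / (N_class + K)
            log_p += np.log((counts[(c, j, v)] + 1) / (y.count(c) + len(values[j])))
        scores[c] = log_p
    return scores

scores = posterior_scores(["sunny", "mild"])
print(max(scores, key=scores.get))
```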

Backpropagation (output-layer weight update)
$$
\frac{\partial J}{\partial w_{kj}}=\frac{\partial J}{\partial net_k}\cdot \frac{\partial net_k}{\partial w_{kj}}=-\delta_k\frac{\partial net_k}{\partial w_{kj}}=-\delta_k y_j \\
\Delta w_{kj}=\eta\delta_ky_j
$$
where $\delta_k\equiv-\frac{\partial J}{\partial net_k}$ and $\frac{\partial net_k}{\partial w_{kj}}=y_j$, the output of hidden unit $j$.
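A minimal numpy sketch of this delta-rule update for a single linear output unit with squared error; the learning rate and toy data are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
y_hidden = rng.normal(size=(50, 4))           # hidden-layer outputs y_j (assumed toy values)
targets = rng.normal(size=50)                 # desired outputs t_k
w = np.zeros(4)                               # output weights w_kj
eta = 0.1                                     # learning rate (assumed)

for _ in range(100):
    net = y_hidden @ w                        # net_k = sum_j w_kj * y_j
    delta = targets - net                     # delta_k = -dJ/dnet_k for J = 0.5*(t - net)^2
    w += eta * (delta @ y_hidden) / len(targets)   # batch-averaged update: dw_kj = eta * delta_k * y_j

print(w)
```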
The Bagging Algorithm
Random Forest
Boosting
AdaBoost
Weighting coefficients give greater weight to the more accurate classifiers when computing the overall output. In each round, the algorithm tries to obtain a different base learner, so that model diversity is achieved: under the updated weights, the previous learner performs no better than chance, $$ \frac{\sum_{n=1}^N{w_n^{(m+1)}I(y^{(m)}(x_n)\neq t_n)}}{\sum_{n=1}^N{w_n^{(m+1)}}}=\frac{1}{2} $$
Optimization of Exponential Loss $$ E=\sum^N_{n=1}\exp{\left\{-t_nf_m(x_n)\right\}},\quad f_m(x)=\frac{1}{2}\sum_{l=1}^m{\alpha_l y_l(x)},\quad t_n\in\{-1,+1\} $$
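A compact sketch of AdaBoost with decision stumps as base learners, using the $\alpha=\frac{1}{2}\ln\frac{1-\epsilon}{\epsilon}$ parameterization that matches the exponential-loss view above; the dataset and the brute-force stump search are assumed for illustration, not taken from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
t = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)           # assumed toy labels in {-1, +1}

def fit_stump(X, t, w):
    """Pick the (feature, threshold, sign) stump with the smallest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] > thr, 1, -1)
                err = np.sum(w * (pred != t))
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

w = np.full(len(t), 1 / len(t))                      # initial weights
alphas, stumps = [], []
for m in range(10):
    err, j, thr, sign = fit_stump(X, t, w)
    eps = max(err / w.sum(), 1e-10)                  # weighted error of this round's learner
    alpha = 0.5 * np.log((1 - eps) / eps)            # coefficient: more accurate -> larger weight
    pred = sign * np.where(X[:, j] > thr, 1, -1)
    w *= np.exp(-alpha * t * pred)                   # up-weight misclassified points
    w /= w.sum()
    alphas.append(alpha); stumps.append((j, thr, sign))

# Combined classifier: sign of the weighted vote
F = sum(a * s * np.where(X[:, j] > thr, 1, -1) for a, (j, thr, s) in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(F) == t))
```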
K-means Given $k$, the k‐means algorithm is implemented in four steps:
1. Choose $k$ initial cluster centroids (e.g., $k$ randomly selected objects).
2. Assign each object to the cluster whose centroid is nearest.
3. Recompute each centroid as the mean of the objects currently assigned to it.
4. Repeat steps 2-3 until the assignments (centroids) no longer change.
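A minimal numpy sketch of these four steps on assumed random data; initialization and the stopping rule are simplified, and empty clusters are not handled:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                               # assumed toy data
k = 3

centroids = X[rng.choice(len(X), k, replace=False)]         # step 1: initial centroids
for _ in range(100):
    # step 2: assign each point to the nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # step 3: recompute centroids as cluster means
    new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # step 4: stop when the centroids no longer change
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids

print(centroids)
```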
K-Medoids Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used: the most centrally located object in the cluster.
Partitioning Around Medoids (PAM) Calculate the pair‐wise distance matrix W first.
Gaussian Mixture Model $$ p(x;\Theta)=\sum_{k=1}^K{\pi_kp_k(x;\theta_k)},\quad \Theta=\left\{\pi_1, \dots, \pi_K, \theta_1, \dots, \theta_K\right\},\quad \sum_k{\pi_k}=1,\quad p_k(x;\theta_k)=\mathcal{N}(x;\mu_k,\Sigma_k) $$ Each point $x^{(i)}$ is associated with a hidden variable $z^{(i)}$ that denotes which Gaussian $x^{(i)}$ belongs to.
EM algorithm $$ L(\theta)=\sum_{i=1}^m{\log{\sum_{z^{(i)}}{p(x^{(i)},z^{(i)};\theta)}}}=\sum_{i=1}^m{\log{\sum_{z^{(i)}}{Q_i(z^{(i)})\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}}}}\geqslant\sum_{i=1}^m{\sum_{z^{(i)}}{Q_i(z^{(i)})\log{\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}}}}=l(\theta) $$ Note that $f(x)=\log(x)$ is a concave function, so by Jensen's inequality we have: $$ f\left(E_{z^{(i)}\sim Q}\left[\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}\right]\right)\geqslant E_{z^{(i)}\sim Q}\left[f\left(\frac{p(x^{(i)},z^{(i)};\theta)}{Q_i(z^{(i)})}\right)\right] $$ where the $z^{(i)}\sim Q$ subscripts indicate that the expectations are with respect to $z^{(i)}$ drawn from $Q$. We make the inequality above hold with equality at our particular value of $\theta$ (E-step): $$ Q_i(z^{(i)})=p(z^{(i)}\mid x^{(i)};\theta) $$ so that the fraction becomes a constant that does not depend on $z^{(i)}$. Then $$ L(\theta_{n+1})\geqslant l_n(\theta_{n+1})\geqslant l_n(\theta_{n})=L(\theta_n) $$ where $l_n(\theta_{n+1})$ means the lower-bound function built with $Q_n$ (computed from $\theta_n$) and evaluated at $\theta_{n+1}$; $l_n(\theta_{n+1})\geqslant l_n(\theta_{n})$ because $\theta_{n+1}=\arg\max_{\theta}l_n(\theta)$. In a GMM, the M-step gives $\pi_k=\frac{1}{m}\sum_{i=1}^m{Q_i(z^{(i)}=k)}$, and $Q$ is obtained by Bayes' theorem.
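A compact numpy sketch of EM for a 1-D GMM on assumed toy data; initialization and convergence checks are simplified for illustration:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
# Assumed toy 1-D data drawn from two Gaussians
x = np.concatenate([rng.normal(-2, 1.0, 150), rng.normal(3, 1.5, 100)])

K = 2
pi = np.full(K, 1 / K)                      # mixing weights pi_k
mu = rng.choice(x, K, replace=False)        # initial means
sigma = np.full(K, x.std())                 # initial standard deviations

for _ in range(50):
    # E-step: Q_i(z=k) = P(z=k | x_i), obtained by Bayes' theorem
    resp = pi * norm.pdf(x[:, None], mu, sigma)        # shape (m, K)
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibilities
    Nk = resp.sum(axis=0)
    pi = Nk / len(x)                                   # pi_k = (1/m) sum_i Q_i(z=k)
    mu = (resp * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)

print(pi, mu, sigma)
```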
Spectral Clustering
Divide vertices into two disjoint groups (A,B). Minimize weight of between‐group connections. $$ cut(A, B)= \sum_{i\in A, j\in B}{w_{ij}},\quad assoc(A,V)=\sum_{i\in A, j\in V}{w_{ij}},\quad \min Ncut(A, B)=\frac{cut(A,B)}{assoc(A,V)}+\frac{cut(A,B)}{assoc(B,V)} $$
Objective Function of Ncut $$ D=[d_{ij}],\quad d_{ii}=\sum_{j}{w_{ij}},\quad \min_{y}{\frac{y^T(D-W)y}{y^TDy}},\quad y\in\{2,-2b\}^n,\quad y^TD\mathbf{1}=0 $$ where $$ k=\frac{\sum_{x_i>0}d_i}{\sum_id_i},\quad b=\frac{k}{1-k} $$
Rayleigh quotient $$ \min_{y}{\frac{y^TLy}{y^TDy}},\quad y\in \mathbb{R}^n,\quad y^TD\mathbf{1}=0 \Rightarrow (D-W)y=\lambda Dy $$
Generalized Eigen‐problem In this problem (with $D$ and $W$), the vector $\mathbf{1}$ is the eigenvector corresponding to the eigenvalue 0. The eigenvector corresponding to the 2nd smallest eigenvalue is the relaxed Ncut solution. The first $k$ eigenvectors (ordered from the smallest eigenvalue up) give a $k$-dimensional representation of the original data points.
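A short numpy/scipy sketch of the relaxed Ncut computation: build an affinity matrix, solve the generalized eigen-problem $(D-W)y=\lambda Dy$, and split on the sign of the eigenvector of the second-smallest eigenvalue. The Gaussian affinity, its width, and the toy data are assumed for illustration:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
# Assumed toy data: two well-separated blobs
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])

# Gaussian affinity graph W and degree matrix D
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
W = np.exp(-sq_dists / (2 * 1.0 ** 2))
D = np.diag(W.sum(axis=1))
L = D - W

# Generalized eigen-problem (D - W) y = lambda D y, eigenvalues in ascending order
eigvals, eigvecs = eigh(L, D)
y = eigvecs[:, 1]                 # eigenvector of the 2nd smallest eigenvalue
labels = (y > 0).astype(int)      # split on sign -> two groups A, B
print(labels)
```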
PCA
$$
\max_{a}{a^TSa}, s.t. a^Ta=1 \\
\Rightarrow Sa=\lambda a
$$ where
$$
S=\frac{1}{n}{\sum_{i=1}^n{(x_i-\bar{x})(x_i-\bar{x})^T}}
$$ is the covariance matrix.
The $k$th largest eigenvalue of $S$ is the variance of the $k$th PC.
Reconstruction $$ \bar{X}=A(A^TX) $$
Optimality Property of PCA $$ \min_{A\in \mathbb{R}^{p\times d}}{\left\|X-AA^TX\right\|^2_F},\ s.t.\ A^TA=I_d $$ where $X$ is the centered data matrix.
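A minimal numpy sketch of PCA via the eigendecomposition of the covariance matrix $S$, including the reconstruction $\bar{X}=A(A^TX)$; the data and the choice $d=2$ are assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))                 # assumed toy data, rows are samples
Xc = X - X.mean(axis=0)                       # center the data

S = Xc.T @ Xc / len(Xc)                       # covariance matrix
eigvals, eigvecs = np.linalg.eigh(S)          # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]             # sort descending
A = eigvecs[:, order[:2]]                     # top d=2 principal directions (p x d)

Z = Xc @ A                                    # projections A^T x for each sample
X_rec = Z @ A.T + X.mean(axis=0)              # reconstruction A(A^T x), un-centered
print("variance of PCs:", eigvals[order[:2]]) # k-th largest eigenvalue = variance of k-th PC
```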
LDA
$$ J(a)=\frac{(\tilde\mu_1-\tilde\mu_2)^2}{\tilde{s}^2_1+\tilde{s}^2_2}=\frac{a^TS_Ba}{a^TS_Wa} $$
$$ S_Ba=\lambda S_Wa,\quad a^\star=S_W^{-1}(\mu_1-\mu_2) $$
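A short numpy sketch of the two-class Fisher direction $a^\star=S_W^{-1}(\mu_1-\mu_2)$ on assumed toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
# Assumed two-class toy data
X1 = rng.normal([0, 0], 1.0, size=(100, 2))
X2 = rng.normal([3, 1], 1.0, size=(100, 2))

mu1, mu2 = X1.mean(axis=0), X2.mean(axis=0)
# Within-class scatter S_W = sum of the per-class scatter matrices
Sw = (X1 - mu1).T @ (X1 - mu1) + (X2 - mu2).T @ (X2 - mu2)

a = np.linalg.solve(Sw, mu1 - mu2)            # a* = S_W^{-1} (mu1 - mu2)
a /= np.linalg.norm(a)

z1, z2 = X1 @ a, X2 @ a                       # 1-D projections of each class
print("projected class means:", z1.mean(), z2.mean())
```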
LPP
Locality Preserving Projections
Graph $W$; $L$ and $D$ are defined as in spectral clustering.
LPP is unsupervised, but it is very easy to construct supervised (semi‐supervised) extensions. $$ \min{\sum_{ij}{w_{ij}(z_i-z_j)^2}}=\min{\frac{a^TXLX^Ta}{a^TXDX^Ta}} \Rightarrow XLX^Ta = \lambda XDX^Ta \Rightarrow XWX^Ta = \lambda XDX^Ta $$ When using $L$, take the eigenvectors of the smallest eigenvalues; when using $W$, take those of the largest (see the sketch after the regularization note below).
LPP is equivalent to LDA when a specifically designed supervised graph is used
LPP is similar to PCA when an inner product graph is used. ($W=X^TX$)
Graph based Dimensionality Reduction $XWX^Ta = \lambda XDX^Ta$
Regularization: for the maximization formulations only, add $\gamma I$ to the $S$ matrix in the denominator (the right-hand side of the generalized eigen-problem) (LDA, LPP).
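A sketch of LPP under these definitions: build a heat-kernel affinity graph, form $XLX^T$ and $XDX^T$, and take the generalized eigenvectors with the smallest eigenvalues (a small $\gamma I$ regularizer is added as noted above). The data, the k-NN rule, and the kernel width are assumed for illustration:

```python
import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                 # assumed toy data, rows are samples
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)

# Heat-kernel affinity on a k-nearest-neighbour graph (assumed k=7, sigma=1)
W = np.exp(-sq / 2.0)
knn = np.argsort(sq, axis=1)[:, 1:8]
mask = np.zeros_like(W, dtype=bool)
np.put_along_axis(mask, knn, True, axis=1)
W *= (mask | mask.T)                          # symmetrized neighbourhood graph
D = np.diag(W.sum(axis=1))
L = D - W

Xt = X.T                                      # columns are samples, as in X L X^T
A_mat = Xt @ L @ Xt.T
B_mat = Xt @ D @ Xt.T + 1e-6 * np.eye(X.shape[1])   # gamma * I regularization
eigvals, eigvecs = eigh(A_mat, B_mat)         # generalized eigen-problem
a = eigvecs[:, :2]                            # directions with the smallest eigenvalues
Z = X @ a                                     # 2-D embedding
print(Z.shape)
```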
Laplacian Eigenmap $$ \min{y^TLy},\quad s.t.\ y^TDy=1 $$
Document‐Term Matrix
Query $$ \cos(q,d)=\frac{q^Td}{\left\|q\right\|\left\|d\right\|} $$
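A tiny numpy sketch of ranking documents by cosine similarity against a query vector in the same term space; the vocabulary and counts are made up for illustration:

```python
import numpy as np

# Assumed toy document-term matrix (rows = documents, columns = term counts)
D = np.array([[2, 0, 1, 0],
              [0, 3, 0, 1],
              [1, 1, 0, 0]], dtype=float)
q = np.array([1, 0, 1, 0], dtype=float)       # query in the same term space

scores = (D @ q) / (np.linalg.norm(D, axis=1) * np.linalg.norm(q))
print(np.argsort(scores)[::-1])               # documents ranked by cos(q, d)
```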
Polysemy: words with multiple meanings; the true similarity is lower than the cosine score (sim < cos).
Synonymy: separate words that have the same meaning; the true similarity is higher than the cosine score (sim > cos).
Language Model Paradigm $$ P(R_d=1\mid q)\propto P(q\mid R_d=1)P(R_d=1) $$
Naive Approach $$ \hat{P}(w\mid d)=\frac{n(d,w)}{\sum_{w'}{n(d, w')}} $$
Probabilistic Latent Semantic Analysis $$ \hat{P}(w\mid d)=\sum_z{P(w\mid z)P(z\mid d)} $$
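A compact sketch of PLSA fitted by EM on an assumed toy count matrix: the E-step computes $P(z\mid d,w)$, and the M-step re-estimates $P(w\mid z)$ and $P(z\mid d)$ from the expected counts; the counts, sizes, and number of topics are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_dw = rng.integers(0, 5, size=(20, 30)).astype(float)    # assumed toy counts n(d, w)
n_docs, n_words, n_topics = *n_dw.shape, 3

# Random initialization of P(w|z) and P(z|d)
p_wz = rng.random((n_topics, n_words)); p_wz /= p_wz.sum(axis=1, keepdims=True)
p_zd = rng.random((n_docs, n_topics));  p_zd /= p_zd.sum(axis=1, keepdims=True)

for _ in range(100):
    # E-step: P(z | d, w) proportional to P(w | z) P(z | d)
    joint = p_zd[:, :, None] * p_wz[None, :, :]            # shape (d, z, w)
    p_z_dw = joint / joint.sum(axis=1, keepdims=True)
    # M-step: re-estimate from expected counts n(d, w) P(z | d, w)
    ndw_z = n_dw[:, None, :] * p_z_dw                       # shape (d, z, w)
    p_wz = ndw_z.sum(axis=0); p_wz /= p_wz.sum(axis=1, keepdims=True)
    p_zd = ndw_z.sum(axis=2); p_zd /= p_zd.sum(axis=1, keepdims=True)

print("reconstructed P(w|d) matrix shape:", (p_zd @ p_wz).shape)
```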
Latent Dirichlet Allocation adds a Dirichlet prior on the per-document topic distribution, trying to address an often-criticized shortcoming of PLSA, namely that it is not a proper generative model for new documents, and at the same time to avoid the overfitting problem.