Drawing Lines
Author
Raghav Saboo
Last Updated
8 anni fa
License
Creative Commons CC BY 4.0
Abstract
Notes on Machine Learning.
Notes on Machine Learning.
\documentclass[a4,12pt,oneside,openany]{memoir}
\usepackage[utf8x]{inputenc}
\usepackage[english]{babel}
\setlength{\parskip}{1em}
\usepackage{url}
\usepackage{amssymb,amsmath,amsthm,amsfonts}
\usepackage{mathtools}
\newtheorem{theorem}{Theorem}
\title{Drawing Lines}
\author{Raghav Saboo}
\begin{document}
\maketitle
%% TODO
%%% Pages MLBO Book Section Name
%%% 53 - 60 Deterministic Linear Regression [DONE]
%%% 64 - 72 Bias, MSE, MVU etc. [DONE]
%%% 72 - 77 Regularization
%%% 77 - 84 ML Method
%%% 84 - 89 Bayesian Method
%%% 105 - 109 MSE Linear Estimation
%%% 140 - 148 MSE Linear Models
%%% 162 - 186 Stoch. Grad. Desc.: LMS Algorithm
%%% 233 - 239 LS Family
%%% 327 - 337 Param. Learning: Convex Analytic
%%% 345 - 347 Lin. Reg. Param. Estimation
%%% 509 - 528 RKHS Gen. Linear Models
%%% 585 - 589 Bayesian Approach
%%% 631 - 637 PDFs with Exp. Form
%%% 639 - 651 Bayesian Learning
%%%%%%%% NOT COVERED BY THESE NOTES %%%%%%%%%%
%% Approximate Inference (Variational Techniques) Bayesian Methods
%% Probabilistic Graphical Methods
\chapter*{Oh the Places You'll Go}
Much of machine learning can be boiled down to a function estimation task, and in order to achieve this there are two types of methods adopted: parametric and non-parametric. Parametric methods are where the functional dependence is defined through a fixed number of parameters. On the other hand, non-parametric methods are where the number of parameters that define the function vary with the size of the data set. In both cases the values of the parameters are unknown. These set of notes we will only focus on parametric methods.
Whilst considering parametric methods, we will come across two ways of estimating the unknown values of the parameters. The first set of methods will assume that the unknown parameters have specific ``true'' values which can be obtained through deterministic steps. The second will take a probabilistic approach, wherein the unknown parameters will be treated as random variables. Thus probability distributions will be used to describe the input and output variables, and thus removing the need to find specific values for the unknown parameters.
As we take a detailed tour of the different approaches to parameter estimation we will look singularly at linear functions. The fact that this is not limiting our ability to generalize to non-linear models will become apparent as we progress further through the material. Ultimately the idea is to create a clear map of the vast landscape of parameter estimation which can otherwise be so hard to comprehend as a student.
\newpage
\begin{KeepFromToc}
\tableofcontents
\end{KeepFromToc}
\chapter{Basic Concepts}
\section{General Setup}
\begin{itemize}
\item We will start by \textit{adopting} a form, such as a linear quadtratic function, with the goal of estimating the associated unknown coefficients so that the function matches the spacial arrangement of the data as closely as possible.
\item Given a set of data points $(y_n, \boldsymbol{x}_n),$ $y_n \in \mathbb{R},$ $\boldsymbol{x}_n \in \mathbb{R}^l,$ $n = 1, 2, ..., N,$ and a parametric set of functions,
\begin{equation}
\mathcal{F} := \{ f_{\boldsymbol{\theta}}(\cdot) : \boldsymbol{\theta} \in \mathcal{A} \subseteq \mathbb{R}^K \}
\end{equation}
find a function in $\mathcal{F}$, which will be denoted as $f(\cdot) := f_{\boldsymbol{\theta_*}}(\cdot)$, such that given a value of $\boldsymbol{x} \in \mathbb{R}^l$, $f(\boldsymbol{x})$ best approximates the corresponding value $y \in \mathbb{R}$. The value $\boldsymbol{\theta_*}$ is the value that results from the estimation procedure.
\item The choice of $\mathcal{F}$ has to be based on as much a priori information as possible concerning the physical mechanism that underlies the generation of the data. The most common approach is to iterate over different families of functions and evaluate the performance of each according to a chosen criterion.
\item Having adopted a parametric family of functions, $\mathcal{F}$, one has to get an estimate for the unknown set of parameters. A \textit{non-negative loss} function is usually adopted, which quantifies the deviation/error between the measured value of $y$ and the predicted one using the corresponding measurements $\boldsymbol{x}$, as in $f_{\boldsymbol{\theta}}(x)$.
\begin{equation}
\mathcal{L}(\cdot,\cdot) : \mathbb{R} \times \mathbb{R} \longmapsto [0, \infty),
\end{equation}
and compute $\boldsymbol{\theta}_*$ so as to minimize the total loss,
\begin{equation}
f(\cdot) := f_{\boldsymbol{\theta}_{*}}(\cdot) : \boldsymbol{\theta}_{*} = \arg \min_{\boldsymbol{\theta}\in\mathcal{A}} J(\boldsymbol{\theta})
\end{equation}
assuming that a minimum exists.
\item There may be more than one optimal values $\boldsymbol{{\theta}_{*}}$, depending on the shape of $J(\boldsymbol{\theta})$.
\item A common combination used to introduce estimation techniques is the linear class of functions with a least squares (LS) loss function,
\begin{equation}
\mathcal{L}(y,f_{\theta}(\boldsymbol{x})) = (y - f_{\theta}(\boldsymbol{x}))^2,
\end{equation}
and there are good very good reasons for this.
\begin{enumerate}
\item Linearity with the LS loss function turns out to simplify the algebra and hence allows one to understand the various "secrets" that underlie the area of parameter estimation.
\item Understanding linearity is very important. Treating nonlinear tasks, most often, turns out to finally resort to a linear problem.
A good example of this is the transformation
\begin{equation}
\mathbb{R} \ni x \longmapsto \boldsymbol{\phi}(x) := \begin{bmatrix} x \\ x^2 \end{bmatrix} \in \mathbb{R}
\end{equation}
which is the same as
\begin{equation}
y = \theta_0 + \theta_1\phi_1(x) + \theta_2\phi_2(x)
\end{equation}
which is a linear model with respect to the components $\phi_k(x), k = {1, 2}$, of the two-dimension image, $\boldsymbol{\phi}(x)$, of $x$. This simple trick is at the heart of a number of nonlinear methods that will be treated later on in these notes.
\end{enumerate}
\end{itemize}
\section{Introduction to Linear Regression}
\begin{itemize}
\item Regression involves the task of modeling the relationship of a dependent random variable, $y$, which is considered to be the response of a system to changes in independent variables, $x_1, x_2, ..., x_l$.
\item The independent variables will be represented as the components of an equivalent random vector $\boldsymbol{x}$.
\item The relationship is modeled via an additive disturbance or noise term, $\eta$.
\begin{equation}
y = \theta_0 + \theta_{1}x_1 + ... + \theta_{l}x_{l} + \eta = \boldsymbol{\theta}^{T}\boldsymbol{x} + \eta
\end{equation}
\item The noise variable, $\eta$, is an unobserved random variable, and the goal of the regression task is to estimate the parameter vector, $\boldsymbol{\theta}$, given a set of data, $(y_n, \boldsymbol{x}_n)$ where $n = 1, 2, ..., N$. This is also known as the training data set.
\item We can state that the prediction model for a $\boldsymbol{\hat{\theta}}$ determined using training data set, is:
\begin{equation}
\hat{y} = \boldsymbol{\hat{\theta}}^{T}x
\end{equation}
where $\boldsymbol{\hat{\theta}}$ can be set to $\boldsymbol{\theta_*}$ obtained from minimizing the LS loss function
\begin{equation}
J(\boldsymbol{\theta}) = \sum_{n=1}^{N}(y_{n} - \boldsymbol{\theta}^{T}\boldsymbol{x}_{n})^2
\end{equation}
\item Taking the derivative with respect to $\boldsymbol{\theta}$ and equating to the zero vector $\boldsymbol{0}$ results in
\begin{equation}
\left(\sum_{n=1}^{N}\boldsymbol{x}_{n}\boldsymbol{x}_{n}^T\right)\boldsymbol{\hat{\theta}} = \sum_{n=1}^{N}\boldsymbol{x}_{n}y_{n}
\end{equation}
which is equivalent to
\begin{equation}
X^{T}X\boldsymbol{\hat{\theta}} = X^{T}\boldsymbol{y}
\end{equation}
and the LS estimate is
\begin{equation}
\boldsymbol{\hat{\theta}} = (X^{T}X)^{-1}X^{T}\boldsymbol{y}
\end{equation}
assuming that $(X^{T}X)^{-1}$ exists.
This solution is unique, provided that the $X^TX$ is invertible. The uniqueness is due to the strictly convex parabolic shape the LS loss function.
\end{itemize}
\section{Biased versus Unbiased Estimation}
\begin{itemize}
\item In supervised learning, we are given a set of training points, $(y_n, \boldsymbol{x}_n)$, $n = 1, 2, ..., N$, and we return an estimate of the unknown paramter vector, say $\boldsymbol{\hat{\theta}}$.
\item However, the training points are random variables (in the deterministic world), and thus if we are given another set of N observations of the same random variables, these are going to be different, and obviously the resulting estimate will also be different. In other words, by changing our training data different estimates result.
\item An estimate, such as $\boldsymbol{\hat{\theta}}$, has a specific value, which is the result of a function acting on a set of observations, on which our chosen estimate depends. In general, we could generalize and write that
\begin{equation}
\boldsymbol{\hat{\theta}} = f(\boldsymbol{y}, X).
\end{equation}
\item Once we allow the set of observations to change randomly, and the estimate becomes itself a random variable, we write the previous equation in terms of the corresponding random variables,
\begin{equation}
\hat{\Theta} = f(\boldsymbol{y}, X),
\end{equation}
and we refer to this functional dependence as the estimator of the unknown vector $\boldsymbol{\theta}$.
\item In order to simplify the analysis and focus on the insight behind the methods, we will assume that our parameter space is that of real numbers, $\mathbb{R}$. We will also assume that the model (i.e., the set of functions $\mathcal{F}$), which we have adopted for modeling our data, is the correct one and the value of the associated true parameter is equal to $\theta_0$ (unknown to us).
\item Let $\hat{\Theta}$ denote the random variable of the associated estimator; then the squared error loss function to quantify deviations, a reasonable criterion to measure the performance of an estimate is the \textit{mean-square error} (MSE),
\begin{equation}\label{MSE_basic}
MSE = \mathbb{E}\left[\left(\hat{\theta}-\theta_0\right)^2\right]
\end{equation}
where the mean $\mathbb{E}$ is taken over all possible training data sets of size N. If the MSE is small, then we expect that, on average, the resulting estimates will be close to the true value. Additionally, if we insert the mean value $\mathbb{E}[\hat{\theta}]$ of $\hat{\theta}$ into \eqref{MSE_basic} to get
\begin{equation} \label{MSE_second}
\begin{split}
MSE & = \mathbb{E}\left[\left(\left(\hat{\theta}-\mathbb{E}[\hat{\theta}]\right)+\left(\mathbb{E}[\hat{\theta}]-\theta_0\right)\right)^2\right] \\
& = \mathbb{E}\left[\left(\hat{\theta}-\mathbb{E}[\hat{\theta}]\right)^2\right]+\left(\mathbb{E}[\hat{\theta}]-\theta_0\right)^2
\end{split}
\end{equation}
where first term is $Variance$ around the mean value, and the second term is $Bias^2$: the deviation of the mean value of the estimator from the true one.
\item One may think that choosing an estimator that is $unbiased$, as is $\mathbb{E}[\hat{\theta}] = \theta_0$, such that the second term in \eqref{MSE_second} becomes zero, is a reasonable choice. Assume that we have L different training sets, each comprising N points. Let us denote each data set by $\mathcal{D}_i, i = 1, 2, ..., L.$ For each one, an estimate $\hat{\theta}_i$, $i = 1, 2, ..., L$ will result. Then, form the new estimator by taking the average value,
\begin{equation}
\hat{\theta}^{(L)} := \frac{1}{L}\sum_{i = 1}^{L}\hat{\theta}_i
\end{equation}
This is also an unbiased estimator, because
\begin{equation}
\mathbb{E}[\hat{\theta}^{(L)}] = \frac{1}{L}\sum_{i = 1}^{L}\mathbb{E}[\hat{\theta}_i] = \theta_0.
\end{equation}
\item Moreover, assuming that the involved estimators are mutually uncorrelated,
\begin{equation}
\mathbb{E}\left[(\hat{\theta}_i-\theta_0)(\hat{\theta}_j-\theta_0)\right] = 0,
\end{equation}
and of the same variance, $\sigma^2$, then the variance of the new estimator is now much smaller.
\begin{equation}
\sigma^{2}_{\hat{\theta}^{(L)}} = \mathbb{E}\left[\left(\hat{\theta}^{(L)}-\theta_0\right)^2\right] = \frac{\sigma^2}{L}
\end{equation}
\item Hence, by averaging a large number of such unbiased estimators, we expect to get an estimate close to the true value. However, in practice, data is not always abundant. As a matter of fact, very often the opposite is true and one has to be very careful about how to exploit it. In such cases, where one cannot afford to obtain and average a large number of estimators, an unbiased estimator may not necessarily be the best choice.
\item Also going back to \eqref{MSE_second}, there is no reason to suggest that by making the second term equal to zero, the MSE becomes minimum.
\item Instead of computing the MSE for a given estimator, let us replace $\hat{\theta}$ with $\theta$ in \eqref{MSE_second} and compute an estimator that will minimize the MSE with respect to $\theta$, directly.
\item In this case, focusing on unbiased estimators, $\mathbb{E}[\theta]=\theta_0$, introduces a constraint to the task of minimizing the MSE, and it is well-known that an unconstrained minimization problem always results in loss function values that are less than or equal to any value generated by a constrained counterpart,
\begin{equation}\label{MSE_min}
\min_{\theta} MSE(\theta)\leq\min_{\theta: \mathbb{E}[\theta]=\theta_0}MSE(\theta)
\end{equation}
where the dependence of MSE on the estimator $\theta$ in \eqref{MSE_min} is explicitly denoted.
\item Let us denote by $\hat{\theta}_{MVU}$ a solution of the task $\min_{\theta : \mathbb{E}[\theta]=\theta_0} MSE(\theta)$, i.e. the unbiased estimator.
\item Motivated by $\eqref{MSE_min}$, our next goal is to search for a biased estimator, which results, hopefully, in a smaller MSE. Let us denote this estimator as $\hat{\theta}_b$.
\item For the sake of illustration, and in order to order to limit our search for $\hat{\theta}_b$, we consider here only $\hat{\theta}_b$s that are scalar multiples of $\hat{\theta}_{MVU}$, so that
\begin{equation}\label{MSE_MVU}
\hat{\theta}_b = (1+\alpha)\hat{\theta}_{MVU},
\end{equation}
where $\alpha \in \mathbb{R}$ is a free parameter. Notice that $\mathbb{E}[\hat{\theta}_b]=(1+\alpha)\theta_0$. By substituting \eqref{MSE_MVU} into \eqref{MSE_basic} and after some simple algebra we obtain
\begin{equation}\label{MSE_alpha}
MSE(\hat{\theta}_b)=(1+\alpha)^2MSE(\hat{\theta}_{MVU})+\alpha^2\theta_0^2.
\end{equation}
\item In order to get $MSE(\hat{\theta}_b) < MSE(\hat{\theta}_{MVU}), \alpha$ must be in the range
\begin{equation}
-\frac{MSE(\hat{\theta}_{MVU})}{MSE(\hat{\theta}_{MVU})+\theta_\sigma^2} < \alpha < 0.
\end{equation}
\item It is easy to verify that the previous range implies that $|1+\alpha| < 1$. Hence, $|\hat{\theta}_b|=|(1+\alpha)\hat{\theta}_{MVU}|<|\hat{\theta}_{MVU}$. We can go a step further and try to compute the optimum value of $\alpha$, which corresponds to the minimum MSE. By taking the derivative of $MSE(\hat{\theta}_b)$ in \eqref{MSE_alpha} with respect to $\alpha$, it turns out that this occurs for
\begin{equation}
\alpha_* = - \frac{MSE(\hat{\theta}_{MVU})}{MSE(\hat{\theta}_{MVU})+\theta_0^2} = - \frac{1}{1+\frac{\theta_0^2}{MSE(\hat{\theta}_{MVU})}}
\end{equation}
\item Therefore, we have found a way to obtain the optimum estimator, among those in the set ${\hat{\theta}_b = (1 + \alpha)\hat{\theta}_{MVU} : \alpha \in \mathbb{R}}$, which results in minimum MSE. This is true, but it is not realizable as the optimal value of $\alpha$ is given in terms of the unknown $\theta_0$. However it does show one important result: If we want to do better than the MVU, then, one way is to \textbf{shrink} the norm of the MVU estimator.
\item Shrinking the norm is a way of introducing bias into an estimator.
\item What we have said so far is readily generalized to parameter vectors. An unbiased parameter vector satisfies
$$
\mathbb{E}[\Theta]=\boldsymbol{\theta}_0,
$$
and the MSE around the true value, $\boldsymbol{\theta}_0$, is defined as
$$
MSE = \mathbb{E}\left[(\Theta - \boldsymbol{\theta}_0)^T(\Theta - \boldsymbol{\theta}_0)\right]
$$
\item Looking carefully at the previous definition reveals that the MSE for a paramter vector is the sum of the MSEs of the components, $\theta_i, i = 1, 2, ..., l$, around the corresponding true values $\theta_{0i}$.
\end{itemize}
\subsection{Cram\'{e}r-Rao Lower Bound}
\begin{itemize}
\item The Cram\'{e}r-Rao lower bound is an elegant theorem and one of the most well-known techniques used in statistics. It provides a lower bound on the variance of any unbiased estimator.
\item This is very important because:
\begin{enumerate}
\item It offers the means to assert whether an unbiased estimator has minimum variance, which, of course, in this case coincides with the corresponding MSE in \eqref{MSE_basic}.
\item And if the above is not the case, it can be used to indicate how far away the performance of an unbiased estimator is from the optimal one.
\item Finally it provides the designer with a tool to known the best possible performance that can be achieved by an unbiased estimator.
\end{enumerate}
\item We are looking for a bound of the variance of an unbiased estimator, whose randomness is due to the randomness of the training data.
\begin{theorem}[Cram\'{e}r-Rao Bound]
Let $\boldsymbol{x}$ denote a random vector and let $\mathcal{X} = {\boldsymbol{x_1, x_2, ..., x_N}}$ denote the set of N observations, corresponding to a random vector. The corresponding joint pdf is parameterized in terms of the parameter vector $\boldsymbol{\theta} \in \mathbb{R}^l$. The log-likelihood is then defined as,
$$
L(\boldsymbol{\theta}) := \log p(\mathcal{X}; \boldsymbol{\theta}).
$$
Define the Fisher's Information Matrix as
\begin{equation}
J = \begin{bmatrix}
\mathbb{E}\left[\frac{\partial^2 \log p}{\partial\theta_1^2}\right] & \mathbb{E}\left[\frac{\partial^2\log p}{\partial\theta_1\partial\theta_2}\right] & \cdots & \mathbb{E}\left[\frac{\partial^2\log p}{\partial\theta_1\partial\theta_l}\right] \\
\vdots & \vdots & \cdots & \vdots \\
\mathbb{E}\left[\frac{\partial^2\log p}{\partial\theta_l\partial\theta_1}\right] & \mathbb{E}\left[\frac{\partial^2\log p}{\partial\theta_l\partial\theta_2}\right] & \cdots & \mathbb{E}\left[\frac{\partial^2\log p}{\partial\theta_l^2}\right]
\end{bmatrix}
\end{equation}
Let $I := J^{-1}$ and let $I(i,i)$ denote the $i$th diagonal element of I. If $\hat{\theta}_i$ is any unbiased estimator of the $i$th component, $\theta_i$, of $\boldsymbol{\theta}$, then the corresponding variance of the estimator,
\begin{equation}
\sigma_{\hat{\theta}_i}^2 \geq I(i,i).
\end{equation}
This is known as the Cram\'{e}r-Rao lower bound, and if an estimator achieves this bound it is said to be efficient and it is unique.
\end{theorem}
\item Moreover, the necessary and sufficient condition for obtaining an unbiased estimator that achieves the bound is the existence of a function $g(\cdot)$ such that for all possible values of $\theta_i$,
\begin{equation}
\frac{\partial \log p(\mathcal{X}; \theta)}{\partial\theta_i} = I(\theta)(g(\mathcal{X}-\theta)).
\end{equation}
The MVU estimate is then given by
\begin{equation}
\hat{\theta}_i = g(\mathcal{X}) := g(\boldsymbol{x_1, x_2, ..., x_N})
\end{equation}
and the variance of the respective estimator is equal to $\frac{1}{I(\theta_i)}$
\end{itemize}
\subsubsection{Applied to Linear Regression}
\begin{itemize}
\item Lets consider a simple linear regression model as seen below
$$
y_n = \theta x + \eta_n
$$
\item To simplify the discussion, we assume that our N observations are the result of different realizations of the noise variable ONLY (i.e. fixed input $x$).
\item Further assume that $\eta_n$ are samples of a Gaussian white noise with zero mean and variance equal to $\sigma_\eta^2$
\item The joint pdf of the output observations is given by
$$
p(y;\theta)=\prod_{n=1}^N\frac{1}{\sqrt[]{2\pi\sigma_\eta^2}}\exp{-\frac{(y_n-\theta)^2}{2\sigma_\eta^2}}
$$
\item We can derive the corresponding Cram\'{e}r-Rao bound as follows:
$$
\frac{\partial\log p(y;\theta)}{\partial\theta} = \frac{1}{\sigma_\eta^2}\sigma_{n=1}{N}(y_n-\theta)=\frac{N}{\sigma_\eta^2}(\bar{y}-\theta)
$$
where
$$
\bar{y}:=\frac{1}{N}\sum_{n=1}^{N}y_n
$$
and the second derivative as required by the Cram\'{e}r-Rao theorem
$$
\frac{\partial^2\log p(y;\theta)}{\partial \theta^2}=-\frac{N}{\sigma_\eta^2}.
$$
\item Hence
$$
I(\theta)=\frac{N}{\sigma_\eta^2}
$$
and an efficient estimator would be
$$
\sigma_{\hat{\theta}}^2 \geq \frac{\sigma_\eta^2}{N}
$$
\item We can easily verify that the corresponding estimator, $\bar{y}$ is indeed an unbiased one
$$
\mathbb{E}[\bar{y}]=\frac{1}{N}\sum_{n=1}^N\mathbb{E}[y_n]=\frac{1}{N}\sum_{n=1}^N\mathbb{E}[\theta+\eta_n]=\theta
$$
\item For this particular task and having assumed that the noise is Gaussian, the LS estimator is equal to the MVU estimator and it attains the Cram\'{e}r-Rao bound.
\item However, if the input is not fixed, but also varies from experiment to experiment and the training data become $(y_n, x_n)$, then the LS estimator attains the Cram\'{e}r-Rao bound only asymptotically, for large values of N.
\item Also if the assumptions for the noise being Gaussian \textbf{and} white are not valid, then the LS estimator is not efficient anymore.
\end{itemize}
\subsection{Sufficient Statistic}
\begin{itemize}
\item If an efficient estimator does not exist, this does not necessarily mean that the MVU estimator cannot be determined.
\item The notion of \textbf{sufficient statistic} and the Rao-Blackwell theorem help us here. Given a random vector, $\textbf{x}$ which depends on parameter $\theta$, sufficient statistic for the unknown parameter is a function
$$
T(\mathcal{X}) := T(\boldsymbol{x_1, x_2, ..., x_N})
$$
of the respective observations, which contains \textbf{all} information about $\theta$.
\item From a mathematical point of view, a statistic $T(\mathcal{X}$ is said to be sufficient for the paramter $\theta$ if the conditional joint pdf
$$
p(\mathcal{X}|T(\mathcal{X};\theta),
$$
\textit{does not} depend on $\theta$ and thus $T(\mathcal{X})$ must provide \textit{all} information about $\theta$ which is contained in the set $\mathcal{X}$.
\item Once $T(\mathcal{X})$ is known, $\mathcal{X}$ is no longer needed: hence the name "sufficient statistic" i.e. no more information can be extracted from $\mathcal{X}$.
\item In the case of parameter vectors $\boldsymbol{\theta}$, the sufficient statistic may be a \textit{set} of functions, called a \textit{jointly sufficient statistic}.
\begin{theorem} [Factorization Theorem]
A statistic $T(\mathcal{X})$ is sufficient if and only if the respective joint pdf can be factorized as
$$
p(\mathcal{X};\boldsymbol{\theta}) = h(\mathcal{X})g(T(\mathcal{X}),\boldsymbol{\theta}).
$$
One part of the factorization only depends on the statistic and parameters, and a second part that is independent of the parameters.
\end{theorem}
\end{itemize}
\subsubsection{Applied to Linear Regression}
\begin{itemize}
\item Let x be a Gaussian, $\mathcal{N}(\mu, \sigma^2)$, random variable and let the set of observations be $\mathcal{X}={x_1, x_2, ..., x_N}$. Assume $\mu$ to be the unknown parameter.
\item The joint pdf is given by
$$
p(\mathcal{X};\mu) = \frac{1}{(2\pi\sigma^2)^\frac{N}{2}}\exp{-\frac{\sum_{n=1}^{N}(x_n-\mu)^2}{2\sigma^2}}
$$
\item Given the identity
$$
\sum_{n=1}^{N}(x_n-\mu)^2 = \sum_{n=1}^{N}(x_n-S_\mu)^2+N(S_\mu-\mu)^2
$$
the joint pdf becomes
$$
p(\mathcal{X};\mu) = \frac{1}{(2\pi\sigma^2)^\frac{N}{2}}\exp{-\frac{\sum_{n=1}^{N}(x_n-S_\mu)^2}{2\sigma^2}}\exp{-\frac{\sum_{n=1}^{N}N(S_\mu-\mu)^2}{2\sigma^2}}
$$
\item Similarly if the unknown parameter is the variance $\sigma^2$ then the sufficient statistic is
$$
\bar{S_\sigma^2}:=\frac{1}{N}\sum_{n=1}^N(x_n-\mu)^2
$$
\item If both $\mu$ and $\sigma^2$ are unkown then the sufficient statistic is the set $(S_\mu, S_\sigma^2)$ where
$$
S_\sigma^2:=\frac{1}{N}\sum_{n=1}^N(x_n-S_\mu)^2
$$
\end{itemize}
\section{Regularization}
\begin{itemize}
\item One approach to improve the performance of an estimator is to shrink the norm of the MVU estimator. \textit{Regularization} is a mathematical tool to impose a priori information on the structure of the solution, which comes as the outcome of an optimization task.
\item We can reformulate the LS minimization task as
\begin{equation} \text{minimize:} \hspace{10pt}
J(\boldsymbol{\theta})=\sum_{n=1}^N(y_n-\boldsymbol{\theta}^T\boldsymbol{x}_n)^2
\end{equation}
\begin{equation} \text{subject to:} \hspace{10pt}
||\boldsymbol{\theta}||^2 \leq \rho
\end{equation}
where $||\cdot||$ is the Euclidean norm of a vector.
\item Here we do not allow the LS criterion to be completely ``free'' to reach a solution, but we limit the space in which to search for it. The optimal value of $\rho$ cannot be analytically obtained and thus we have to experiment in order to select an estimator that has good performance.
\item Thus the optimization task for a LS loss function can be written as
\begin{equation}
\text{minimize:} \hspace{10pt}
L(\boldsymbol{\theta},\lambda)=\sum_{n=1}^N(y_n-\boldsymbol{\theta}^T\boldsymbol{x}_n) + \lambda||\boldsymbol{\theta}||^2
\hspace{10pt}
\end{equation}
which is often referred to as Ridge Regression.
\item The specific choices of $\lambda \geq 0$ and $\rho$ are equivalent tasks.
\item Taking the gradient of $L$ in the equation above with respect to $\boldsymbol{\theta}$ results in the following solution:
\begin{equation}
\left(\sum_{n=1}^{N} \boldsymbol{x}_n \boldsymbol{x}_n^T + \lambda I \right)\boldsymbol{\hat{\theta}}=\sum_{n=1}^{N}y_{n}x_{n}
\end{equation}
\item The presence of $\lambda$ biases the new solution away from that which would have been obtained from the unregularized LS formulation. Thus Ridge Regression aims to reduce the norm of the estimated vector \textit{at the same time} as trying to keep the sum of squared errors small.
\item This is achieved by modifying the vector components, $\theta_i$, so as to reduce the contribution in the misfit measuring term from less informative directions in the input space.
\item That is reducing the norm can be considered as an attempt to ``simplify'' the structure of the estimator, because smaller number of components of the regressor now have an important say.
\item Other regularizers can be used in place of the Euclidean norm, such as the $\ell_p$ norms with $p \geq 1$.
\item In practice the bias paramter, $\theta_0$ is left out from the norm in the regularization term; penalization of the intercept would make the procedure dependent on the origin chosen for $y$.
\item Reducing the norm can be considered as an attempt to ``simplify'' the structure of an estimator, because a smaller number of components of the regressor now have an important say. This viewpoint becomes more clear if one considers nonlinear models.
\end{itemize}
\subsubsection{Inverse problems: Ill-conditioning and overfitting}
\begin{itemize}
\item Most tasks in machine learning belong to the so-called \textit{inverse problems}. This encompasses all the problems where one has to infer/predict/estimate the values of a model based on a set of available input/output observations-training data.
\item In a less mathematical terminology, inverse problems unravel unkown causes from known effects; in other words, to reverse the cause-effect relations.
\item Inverse problems are typically ill-posed, as opposed to the well-posed ones. Well-posed problems are characterized by (a) the existence of a solution, (b) the uniqueness of the solutions and (c) the stability of the solution.
\item In machine learning problems, the obtained solution may be very sensitive to changes of the training set. \textit{Ill conditioning} is another term used to describe this sensitivity.
\item The reson for ill-conditioning is that the model used to describe the data can be complex; i.e. the number of the unknown free parameters is large with respect to the number of data points. This is also known as \textit{overfitting}.
\item \textit{Overfitting} means that the estimated parameters of the unknown model learn too much about the idiosyncrasies of the specific training data set, and the model performs badly when it deals with another set of data, other than that used for the training.
\item Regularization is an elegant and efficient tool to cope with the complexity of a model; that is, to make it less complex, and more smooth.
\item When dealing with more complex, compared to linear, models, one can use constraints on the smoothness of the involved nonlinear function; for example, by involving derivatives of the model function in the regularization term.
\item Examples:
\begin{enumerate}
\item In the LS linear regression task, if the number, N, of the training points is less than the dimension of the regressors $\textbf{x}_n$, then the $\ell \times \ell$ matrix, $\bar{\sum} = \sum_{n}x_{n}x^T_n$, is not invertible. Indeed, each term in the summation is the outer product of a vector with itself and hence it is a matrix of rank one. Thus, as we know from linear algebra, we need at least $\ell$ linearly independent terms of such matrices to guarantee that the sum is of full rank, hence invertible. In ridge regression however, this can be bypassed, because of the presence of $\lambda I$ guarantees that the matrix in invertible.
\item Another example where regularization can help to obtain a solution, and, more important, a unique solution to an otherwise unsolvable problem, is when the model's order is large compared to the number of data, albeit we know that it is sparse. That is, only a very small percentage of the model's paramters are nonzero - here LS linear regression approach has no solution. However, regularizing the LS loss function using the $\ell_1$ norm of the parameters' vector can lead to a unique solution; the $\ell_1$ norm of a vector comprises the sum of absolute values of its components.
\end{enumerate}
\item Regularization is also closely related tot he task of using priors in Bayesian learning.
\end{itemize}
\chapter{Deterministic Methods}
\section{Mean Squared Error}
\subsection{The Normal Equations}
\subsection{The Loss Function Surface}
\subsection{MSE Linear Estimator}
\subsection{Geometric Viewpoint: Orthogonality Condition}
\subsection{Gauss Markov Theorem}
\subsection{Stochastic Gradient Descent}
\section{Least Squares}
\subsection{Geometric Perspective}
\subsection{Statistical Properties}
\subsection{Orthogonalizing the Column Space of X: the SVD Method}
\subsection{Recursive Least Squares}
\subsubsection{Comparison with Stochastic Gradient Descent}
\section{Convex Analysis}
%pgs 327 onwards
\subsection{Applied to Linear Regression}
\section{Reproducing Kernel Hilbert Spaces}
\chapter{Bayesian (Probabilistic) Methods}
\section{Exponential Family of Probability Distributions}
\subsection{The Maximum Entropy}
\section{Maximum Likelihood}
\section{Maximum a Posteriori (MAP) Probability Estimation}
% pgs 84 - 89
\section{The Curse of Dimensionality}
% pgs 89 - 91
\section{The EM Algorithm}
\subsection{The Evidence Function}
\subsection{Laplacian Approximation of the Evidence Function}
\subsection{Latent Variables and the EM Algorithm}
\subsection{Lower Bound Maximization View}
\subsection{Linear Regression using EM Algorithm}
\subsection{Mixing Linear Regression Models}
\end{document}