Tom Mitchell defines it as: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E."

**Example: playing checkers**
E = the experience of playing many games of checkers
T = the task of playing checkers.
P = the probability that the program will win the next game.
In general, any machine learning problem can be assigned to one of two broad classifications:
Supervised learning and Unsupervised learning.

In supervised machine learning a computer program is given a training dataset with input and output. The goal is to find a good model (a hypothesis or a function) that maps the input to the output and it, the model, generalizes well to new input data.

Supervised learning problems are categorized into (i) **regression** problems, where the hypothesis funciton tries to map input variables to a continuous (or continuous-like) output range; and (ii) **classification** problems, where the hypothesis tries to map the input variables in a discrete output.

Unsupervised learning allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don't necessarily know the effect of the variables.

We can derive this structure by *clustering* the data based on relationships among the variables in the data.With unsupervised learning there is no feedback based on the prediction results.

In supervised learning problem, we are given a training dataset, and we wish to learn a function (a hypothesis) that maps the input $X$ to the output $Y$. The linear regression model has the form $\theta_0 + \theta_0 x_1 + \theta_2 x_2$.... The model form has has two types of variables (i) the inputs $X$ that are given only during the training stage; and (ii) the parameters $\theta$ that we need to learn their values by running a learning algorithm like *Gradient descent* over the training dataset.

A *hypothesis* represents a signle instance of the model---the model for specific set of paramters.
\begin{equation*}
h_{\theta}(x) = \theta_0 + \theta_1 x_1
\end{equation*}

A cost function measures the overall difference (or error) between the predictions of a hypothesis and the given output in the training dataset.

A commonly used cost function for linear regression is the *The Squared error function* $J(\theta)$. It is the mean of the squared differences between the true outputs $Y$ and the predications of the hypothesis $h(x)$ when it is run over the training dataset for a fixed set of $\theta$ values.

\begin{equation*} J(\theta)=\frac{1}{2m}\sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})^2 \end{equation*}

where $m$ is the length of the training set, and $i$ is an the training set index

The goal of supervised learning is to learn a model that represents the data well. To do so, we need to find a set of parameters that minimize the difference between the predictions and the given output values of the training dataset. Moreover, the model should generalize---make good predictions for the non-training-dataset inputs---well also.

How does Gradient Descent work? The derivative of a function is either a posative value, a negative value, or zero. A convex (a bow shaped) function has posative derivative on the right, zero in the middel and negative on the left. Gradient Descent exploits this fact and updates the values of the $\theta$ based on the derivative value of the cost function until it reaches the global minimum.

The Squared error cost function of a linear hypothesis is convex.

The Gradient Descent algorithm

*repeat until convergence:*{

\begin{equation*} \theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta) \end{equation*} } where j=0,1,... represents the feature (paramater) index number

- The hypothesis $ h_{\theta}(x) = \theta_0 + \theta_1 x_1 $
- The partial derivative $ \frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m}\sum_{i=1}^{m} (h_{\theta}(x^{(i)}) - y^{(i)})x_j $
- The partial derivatives of the cost function of the hypothesis
- $\theta_0 = \theta_0 - \frac{\alpha}{m}\sum_{i=1}^{m} h_{\theta}(x^{(i)})-y^{(i)} $
- $\theta_1 = \theta_1 - \frac{\alpha}{m}\sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)})x_1 $

In [1]:

```
import numpy as np
def gradientDescent(X,y,theta,iters, alpha):
# X,y are column vectors (numpy arrays)
m = len(y)
for iter in range(iters):
theta = theta - (alpha/m) * np.transpose(X).dot( (X.dot(theta) -y) );
return theta
```

Multivariate linear regression refers to a linear regression model with multiple input features. The general form of its hypothesis is
\begin{equation*}
h_{\theta}(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 ... \theta_n x_n
\end{equation*}
where $n$ is the number of features. Multivariate linear regeression models things that are affected multiple features, like houses prices which depend on many things like the size, the age, the location, etc.

The hypothesis function can be written in concisely in matrix form as:
\begin{equation*}
h_{\theta}(x)=
\begin{bmatrix}
\theta_0 & \theta_1 & \dots & \theta_n
\end{bmatrix}
\begin{bmatrix}
x_0\\
x_1\\
\vdots\\
x_n\\
\end{bmatrix}
=\theta^Tx
\end{equation*}

\begin{equation} repeat:{ \theta_j = \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} (h_{\theta}(x^{(i)})-y^{(i)}).x_j^{(i)} } \;\;\;\;\;\; \text{for} \;\; j=0\dots n \end{equation}

Gradient descent converges faster when the input variables have similar ranges. If the ranges are significantly different gradient descent will oscillate inefficiently down to the optimum because the algorithm descends quicker on the small ranges than on the large ones.

Ideally, the range of a feature should be between $\;\; -1 \leq x_j \leq 1 $. This can be done in two stages (i) **feature scaling** which involves dividing a feature values by the range ($max(x_j- min(x_j $) of this feature. This will upper bound the difference between the values of the feature by one, (ii) **mean normalization** which is achieved by subtracting the average value of a feature $\mu_j$ from its individual values. This will make the mean of the new values zero, see the formula.
\begin{equation}
x^{(i)}_j = \frac{x^{(i)}_j - \mu_j}{max(x_j)- min(x_j)}
\end{equation}

If a straight line does not fit the data well, we can engineer new features to carvy the hypothesis line. For example, if we have the linear hypothesis $ h_{\theta}(x) = \theta_0 + \theta_1 x_1$ and we wish to have, for example, a quadratic hypothesis function then we can set the second feature to be $x^2_1$.

If we have the following polynomial hypothersis $ h_{\theta}(x) = \theta_0 + \theta_1 x_1 + \theta_2 x^2_1 $ , then by assigning $x_2 = x^2_1$ we can apply the linear regression machinery on the this hypothesis $ h_{\theta}(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 $ to optimize the prediction (or classification).

Note: The gradient descent optimizes for $\theta$ and $x$ values are given at this stage.