Logistic Regression
#Computers
A model that predicts the probability of a data point belonging to a certain class. This probability may then be used to classify (e.g. $\displaystyle P(y=1|x)>0.5\rightarrow y=1$) if desired
Topics
$\displaystyle L(\beta_{0},\beta_{1})=-\log(L(p|Y))=-\sum_{i}\left( y_{i}\ln\left( p_{i} \right)+(1-y_{i})\ln\left( 1-p_{i} \right) \right)$
- The negative log of the likelihood of the dataset; the log turns the product of per-sample likelihoods into a sum
$\displaystyle p_{i}=\frac{1}{1+e^{-(\beta_{0}+\beta_{1}x_{i})}}$
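The two formulas above can be sketched in code; `sigmoid` and `neg_log_likelihood` are illustrative names, and the single-feature $(\beta_0, \beta_1)$ form matches the notes:

```python
import math

def sigmoid(z):
    # p_i = 1 / (1 + e^{-z}), where z = b0 + b1 * x_i
    return 1.0 / (1.0 + math.exp(-z))

def neg_log_likelihood(b0, b1, xs, ys):
    # L(b0, b1) = -sum_i [ y_i ln(p_i) + (1 - y_i) ln(1 - p_i) ]
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total
```

With $\beta_0=\beta_1=0$ every $p_i=0.5$, so each sample contributes $\ln 2$ to the loss.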
Training
Use gradient descent to minimize the negative log likelihood of the dataset (equivalent to finding the Maximum Likelihood Estimate) by updating $\displaystyle \theta$, the weights of the model
$\displaystyle J(\theta)=-\sum_{n}\left\{ y_{n}\log h_{\theta}(x_{n})+(1-y_{n})\log[1-h_{\theta}(x_{n})] \right\}$
- The negative log likelihood of the dataset
- $\displaystyle h_{\theta}(x_{n})$ is the same as $\displaystyle p_{i}$
$\displaystyle \frac{\partial J(\theta) }{\partial \theta}=\sum_{n}\left\{ h_{\theta}(x_{n})-y_{n} \right\}x_{n}$
- $\displaystyle \left\{ h_{\theta}(x_{n})-y_{n} \right\}$ is the error of the $\displaystyle n$th training sample
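A minimal gradient descent loop using this gradient, assuming a single feature with a bias term (so $x_n$ is augmented with a constant 1, making its bias component of the gradient just the error); `train`, `lr`, and `epochs` are illustrative names:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(xs, ys, lr=0.1, epochs=1000):
    # Gradient descent on theta = (b0, b1) using
    # dJ/dtheta = sum_n (h_theta(x_n) - y_n) * x_n
    b0, b1 = 0.0, 0.0
    for _ in range(epochs):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(b0 + b1 * x) - y  # h_theta(x_n) - y_n
            g0 += err        # bias component (x_n's constant feature is 1)
            g1 += err * x    # slope component
        b0 -= lr * g0
        b1 -= lr * g1
    return b0, b1
```

On linearly separable toy data (negatives left of zero, positives right), the learned slope comes out positive and the fitted probabilities fall on the correct sides of 0.5.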
Testing
$\displaystyle h_{\theta}(x)=P(y=1|x)=\sigma(a(x))$
- $\displaystyle \sigma(x)$ is the sigmoid function
- $\displaystyle a(x)=\theta^{\top}x$ is the activation, a linear function of the input
- If $\displaystyle h_{\theta}(x)>0.5$
- Classify $\displaystyle y=1$
- Else
- Classify $\displaystyle y=0$
Each prediction takes $\displaystyle O(d)$ time
- $\displaystyle d$ is the dimensionality of the data
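The test-time rule above is one dot product plus a sigmoid, hence the $O(d)$ cost; `predict` is an illustrative name, and `x` is assumed to already include a leading 1 for the bias weight:

```python
import math

def predict(theta, x):
    # One pass over the d features: activation a(x) = theta . x, then sigmoid.
    a = sum(t * xi for t, xi in zip(theta, x))
    p = 1.0 / (1.0 + math.exp(-a))  # h_theta(x) = P(y=1|x)
    return 1 if p > 0.5 else 0      # threshold at 0.5
```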