Topics
$\hat{Y} = \hat{\beta}_1 X + \hat{\beta}_0 + \varepsilon$
- The hats denote estimators: values we calculate from the data in order to predict $Y$
- $\hat{\beta}_1$ and $\hat{\beta}_0$ are the regression coefficients
- $\varepsilon$ is random noise
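As a quick sanity check of the model above, we can simulate data from it; the true coefficients and noise level below are made-up values for illustration.

```python
import numpy as np

# Simulate Y = beta_1 * X + beta_0 + eps (coefficients chosen arbitrarily)
rng = np.random.default_rng(0)
beta_0, beta_1 = 2.0, 3.0            # "true" intercept and slope (assumed)
X = rng.uniform(0, 10, size=100)     # 100 sample points
eps = rng.normal(0, 1.0, size=100)   # random noise with std 1
Y = beta_1 * X + beta_0 + eps        # observed responses
```

Fitting a line to `(X, Y)` should then recover something close to the chosen coefficients.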
$L(\beta_0, \beta_1) = \frac{1}{n}\sum_{i=1}^{n}\left[y_i - (\beta_1 x_i + \beta_0)\right]^2 = J(\theta) = \lVert y - X\theta \rVert_2^2$ (the matrix form drops the constant $\frac{1}{n}$, which does not change the minimizer)
$\nabla J(\theta) = 2(X^\top X\theta - X^\top y)$
- Gradient of the loss function
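The gradient above can be used directly in a gradient-descent sketch; the data, step size, and iteration count below are made-up choices for illustration.

```python
import numpy as np

# Minimize J(theta) = ||y - X theta||^2 with the gradient from the notes:
# grad J = 2 * (X^T X theta - X^T y)
rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=50)
y = 3.0 * x + 2.0 + rng.normal(0, 0.1, size=50)   # made-up data
X = np.column_stack([x, np.ones_like(x)])          # design matrix [x, 1]

theta = np.zeros(2)
lr = 1e-3                                          # step size, tuned by hand
for _ in range(10000):
    grad = 2 * (X.T @ X @ theta - X.T @ y)
    theta -= lr * grad
```

After enough iterations `theta` approaches the least-squares solution, so it should sit near the slope 3 and intercept 2 used to generate the data.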
$\hat{\beta}_1 = w = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}$
- The regression slope that minimizes $L$
- Same as $\frac{\mathrm{cov}(X, Y)}{\mathrm{var}(X)}$
$\hat{\beta}_0 = b = \bar{y} - \hat{\beta}_1 \bar{x}$
- The regression intercept that minimizes $L$
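The two closed-form coefficients are easy to compute directly; the toy data below (roughly $y = 2x$) is made up for illustration.

```python
import numpy as np

# Closed-form slope and intercept:
# beta_1 = cov(x, y) / var(x),  beta_0 = ybar - beta_1 * xbar
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.1, 5.9, 8.2, 9.9])   # made-up points near y = 2x

xbar, ybar = x.mean(), y.mean()
beta_1 = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
beta_0 = ybar - beta_1 * xbar
```

`np.polyfit(x, y, 1)` returns the same slope/intercept pair, which makes a handy cross-check.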
$\hat{\beta} = \theta = \arg\min_{\beta} \mathrm{MSE}(\beta) = (X^\top X)^{-1} X^\top Y$
- $X^\top X$ can fail to be invertible for either (or both) of two reasons:
- $N < D$: there are fewer data points than features
- The columns of $X$ are not linearly independent (some features are collinear), in which case the least-squares solution is not unique
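The normal equations can be solved numerically as below; `np.linalg.pinv` (the Moore-Penrose pseudoinverse) is a standard fallback that still returns a least-squares solution when $X^\top X$ is singular. The data here is made up for illustration.

```python
import numpy as np

# Solve theta = (X^T X)^{-1} X^T y via the normal equations
rng = np.random.default_rng(2)
x = rng.uniform(0, 5, size=30)
y = 3.0 * x + 2.0 + rng.normal(0, 0.1, size=30)   # made-up data
X = np.column_stack([x, np.ones_like(x)])

theta = np.linalg.solve(X.T @ X, X.T @ y)   # fine when X^T X is invertible
theta_pinv = np.linalg.pinv(X) @ y          # also works when it is singular
```

On this well-conditioned data both routes agree; with collinear columns only the pseudoinverse route would succeed.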
$\mathrm{SE}(\hat{\beta}_0) = \sigma \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_i (x_i - \bar{x})^2}}$
- $\sigma$ is the standard deviation of the noise $\varepsilon$
- $\bar{x}$ is the average value of our sample
$\mathrm{SE}(\hat{\beta}_1) = \frac{\sigma}{\sqrt{\sum_i (x_i - \bar{x})^2}}$
$\sigma \approx \sqrt{\frac{\sum_i \left(\hat{f}(x_i) - y_i\right)^2}{n - 2}}$
- Used as a plug-in estimate when the variance of the noise $\varepsilon$ is unknown
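Putting the standard-error formulas together: the sketch below fits a line, estimates $\sigma$ from the residuals with the $n-2$ denominator, and plugs it into both SE formulas. The data and noise level are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.uniform(0, 10, size=n)
y = 1.5 * x - 0.5 + rng.normal(0, 2.0, size=n)   # made-up data, noise std 2

# Fit by the closed-form formulas
xbar = x.mean()
Sxx = np.sum((x - xbar) ** 2)
b1 = np.sum((x - xbar) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * xbar

# Residual estimate of sigma, dividing by n - 2 (two fitted parameters)
resid = y - (b1 * x + b0)
sigma = np.sqrt(np.sum(resid ** 2) / (n - 2))

# Plug-in standard errors for slope and intercept
se_b1 = sigma / np.sqrt(Sxx)
se_b0 = sigma * np.sqrt(1 / n + xbar ** 2 / Sxx)
```

With the true noise std set to 2, the recovered `sigma` should land close to 2, and the fitted slope close to 1.5.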