Misc

  • , circular

Validation

    • $K$ = number of chunks the data is split into
    • For each chunk $k$, a model is fitted to the training set of all chunks that aren’t chunk $k$, which serves as the validation set
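The chunking scheme above can be sketched as a minimal $K$-fold split. This is an illustrative sketch, not a specific library's API; the contiguous-chunk fold assignment is an assumed choice (shuffling first is also common):

```python
def kfold_splits(data, k):
    """Yield (train, validation) pairs: each chunk in turn is the
    validation set, and the model is fitted to all remaining chunks."""
    n = len(data)
    # boundaries of k contiguous chunks of (nearly) equal size
    bounds = [round(i * n / k) for i in range(k + 1)]
    for i in range(k):
        val = data[bounds[i]:bounds[i + 1]]
        train = data[:bounds[i]] + data[bounds[i + 1]:]
        yield train, val

for train, val in kfold_splits(list(range(6)), 3):
    print(train, val)  # each index appears in exactly one validation set
```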

Decision Trees

K-Nearest Neighbors

    • $\text{nn}(k)$ gives the index in the dataset of the $k$th nearest neighbor of $x$, the data point we’re interested in classifying
    • $\text{nn}$ is an array of indices, one for each data sample point
    • We use the square of the L2 norm, $\lVert x - x_{n}\rVert^{2}$, to calculate distance here, but other measures could work
    • $\text{knn}(K)$ gives the set of indices of the $K$ nearest neighbors of $x$
    • $\text{votes}(c)$ gives the number of “votes” for a particular class/label $c$
    • We iterate over each of the $K$-nearest neighbors and increment $\text{votes}(c)$ by one every time the neighbor’s label equals the class of interest
    • The prediction $\hat{y}$ is the class that maximizes the number of votes: $\hat{y}=\arg\max_{c}\text{votes}(c)$

Steps To Classify

  1. Initialize
    1. Place the data point in a feature space
  2. Calculate Distance
    1. Find the Euclidean distance from the data point to all other labeled data points
    2. May have to normalize the data point’s values along different coordinates, e.g. to Z-scores for each coordinate
  3. Sort Distances
    1. Sort distances of the data point to the other labeled data points in increasing order
    2. For classification, the most common class among the $k$-nearest neighbors determines the data point’s class
    3. For regression, use the average of the $k$-nearest neighbors’ labels
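The steps above can be sketched as a small classifier. The function name and the toy data are illustrative; squared L2 distance is used, matching the notes:

```python
from collections import Counter

def knn_classify(X, y, x_new, k):
    """Classify x_new by majority vote among its k nearest neighbors."""
    # Steps 1–2: squared Euclidean distance from x_new to every labeled point
    d2 = [sum((a - b) ** 2 for a, b in zip(x, x_new)) for x in X]
    # Step 3: sort indices by increasing distance and keep the k nearest
    nn = sorted(range(len(X)), key=lambda i: d2[i])[:k]
    # Majority vote over the neighbors' labels
    votes = Counter(y[i] for i in nn)
    return votes.most_common(1)[0][0]

X = [(0, 0), (0, 1), (5, 5), (6, 5)]
y = ["a", "a", "b", "b"]
print(knn_classify(X, y, (1, 0), k=3))  # → "a"
```

For regression, the final vote would instead be replaced by the mean of the neighbors’ labels.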

Perceptron

  • $$\displaystyle \text{Margin}(\mathcal{D},w)=\begin{cases}
\min_{(x_{n},y_{n})\in \mathcal{D}} \frac{1}{\left\lVert w\right\rVert}y_{n}w^{T}x_{n} & \text{for separating hyperplane }w \\
-\infty & \text{else}
\end{cases}$$
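A direct numeric translation of this definition (function name and toy dataset are illustrative):

```python
import math

def margin(D, w):
    """Margin(D, w): the minimum of y_n * w^T x_n / ||w|| over the dataset
    when w separates the data, and -infinity otherwise."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    dists = [y * sum(wi * xi for wi, xi in zip(w, x)) / norm for x, y in D]
    # w is a separating hyperplane iff every signed distance is positive
    return min(dists) if all(d > 0 for d in dists) else -math.inf

D = [([2.0, 1.0], +1), ([-1.0, -2.0], -1)]
print(margin(D, [1.0, 1.0]))   # separating: the smallest signed distance
print(margin(D, [-1.0, -1.0]))  # not separating: -inf
```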

Mistake Bound Theorem

  • For a dataset $\mathcal{D}$ with $\lVert x_{n}\rVert \leq R$ and labels $y_{n}\in\{-1,+1\}$
  • Suppose there exists a unit vector $w^{*}$ ($\lVert w^{*}\rVert = 1$) and a $\gamma > 0$ such that $y_{n}(w^{*})^{T}x_{n}\geq\gamma$ for all $n$
    • $\gamma$ can be thought of as a lower bound achieved by a candidate $w^{*}$ for the margin of the dataset
  • Then the PerceptronTrainingAlgorithm will make at most $R^{2}/\gamma^{2}$ mistakes on the training sequence
  • Proof Preliminaries: each mistake triggers the update $w \leftarrow w + y_{n}x_{n}$
  • Proof (1/3): After $M$ mistakes, $w^{T}w^{*}\geq M\gamma$, by induction from the update rule and using $y_{n}(w^{*})^{T}x_{n}\geq\gamma$
  • Proof (2/3): After $M$ mistakes, $\lVert w\rVert^{2}\leq MR^{2}$
  • Proof (3/3): Combining via Cauchy–Schwarz, $M\gamma\leq w^{T}w^{*}\leq\lVert w\rVert\lVert w^{*}\rVert\leq\sqrt{M}R$, so $M\leq R^{2}/\gamma^{2}$
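A minimal sketch of the perceptron training loop the theorem refers to, counting mistakes (function name, epoch cap, and toy data are illustrative assumptions; no bias term is used):

```python
def perceptron_train(D, epochs=100):
    """On each mistake (y * w^T x <= 0), update w <- w + y * x.
    Returns the learned weights and the number of mistakes made."""
    d = len(D[0][0])
    w = [0.0] * d
    mistakes = 0
    for _ in range(epochs):
        clean_pass = True
        for x, y in D:
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
                clean_pass = False
        if clean_pass:  # converged: no mistakes in a full pass
            break
    return w, mistakes

D = [([2.0, 1.0], 1), ([1.0, 3.0], 1), ([-1.0, -1.0], -1), ([-2.0, -3.0], -1)]
w, m = perceptron_train(D)
print(w, m)  # m should respect the R^2 / gamma^2 bound for this data
```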

Logistic Regression

Convexity

  • is convex
  • is convex
  • is convex
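The specific expressions in the bullets above were lost in export. As one example relevant to logistic regression, convexity of the logistic loss $\log(1+e^{-z})$ can be spot-checked numerically via its second derivative (the assumed loss form and helper names are illustrative):

```python
import math

def logistic_loss(z):
    # log(1 + e^(-z)): per-example logistic loss as a function of the
    # margin z = y * w^T x (assumed form; not taken from the notes)
    return math.log(1.0 + math.exp(-z))

def second_derivative(f, z, h=1e-3):
    # central finite-difference estimate of f''(z)
    return (f(z + h) - 2.0 * f(z) + f(z - h)) / (h * h)

# convexity requires f''(z) >= 0 everywhere; spot-check a grid of points
print(all(second_derivative(logistic_loss, z) >= 0.0 for z in range(-10, 11)))
```

Analytically $\frac{d^{2}}{dz^{2}}\log(1+e^{-z})=\sigma(z)(1-\sigma(z))>0$, where $\sigma$ is the sigmoid, so the numeric check agrees.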

Linear Regression

    • $L(w,b)=\sum_{n}\left(y_{n}-(wx_{n}+b)\right)^{2}$ is the loss function for simple linear regression
    • Minimizing this by varying $w$ and/or $b$ gives the linear model that best fits the data
    • $y$ is the target vector
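Setting the partial derivatives of the sum-of-squares loss to zero gives the familiar closed form, sketched below (function name and toy data are illustrative):

```python
def fit_simple_linear(x, y):
    """Minimize L(w, b) = sum_n (y_n - (w*x_n + b))^2 in closed form:
    w = cov(x, y) / var(x), b = mean(y) - w * mean(x)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    w = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    b = my - w * mx
    return w, b

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]  # exactly y = 2x + 1
print(fit_simple_linear(x, y))  # → (2.0, 1.0)
```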

Regularized

Run Times

| Algorithm | Training Time Complexity | Test Time Complexity |
| --- | --- | --- |
| Decision Tree | $O(nd\log n)$ to $O(n^{2}d)$ (depends on splitting strategy) | $O(\log n)$ (balanced) or $O(n)$ (unbalanced) |
| K-Nearest Neighbors (KNN) | $O(1)$ (no training, just storing data) | $O(nd)$ (linear search) or $O(\log n)$ (KD-tree, low dimensions) |
| Perceptron | $O(nd)$ per epoch | $O(d)$ |
| Logistic Regression | $O(nd)$ per iteration (gradient-based) | $O(d)$ |
| Linear Regression (Analytical - Normal Equation) | $O(nd^{2}+d^{3})$ (matrix inversion) | $O(d)$ |
| Linear Regression (Gradient Descent) | $O(ndt)$ (depends on iterations $t$) | $O(d)$ |
| Full Batch Gradient Descent | $O(nd)$ per iteration | — |
| Stochastic Gradient Descent (SGD) | $O(d)$ per iteration | — |
| Mini-Batch Gradient Descent | $O(bd)$ per iteration (where $b$ is batch size) | — |
  • $n$ = number of training samples
  • $d$ = number of features (dimensions)
  • $t$ = number of iterations in gradient-based methods
  • $b$ = batch size for mini-batch gradient descent

Algorithms