Notes in Week 5 - Multiclass Classification and Softmax Regression
To subscribe, use this key: juliet-lithium-fruit-wisconsin-seventeen-black
Every note below has status Published and was last updated 11/26/2024.

Multiclass classification extends binary classification, where the output \( y \) can be any of {{c1::multiple (> 2) values}}.
In logistic regression, there are {{c1::two possible outputs}}, while in softmax regression, there can be {{c2::N possible outputs}}.
{{c1::Softmax regression}} calculates each class probability as \( P(y = j | x) = \frac{e^{z_j}}{\sum_{k=1}^{N} e^{z_k}} \), where \( z_j = w_j \cdot x + b_j \) is the linear score for class \( j \).
The softmax function outputs probabilities that sum to {{c1::1}}, ensuring that the classes are mutually exclusive.
In a neural network, the softmax output layer couples neurons, as they share a {{c1::normalization constraint}}.
The softmax layer produces a probability distribution, making it suitable for {{c1::multiclass classification}} tasks.
Sigmoid output is generally used for {{c1::binary classification}}, while softmax is used for {{c2::multiclass classification}}.
Softmax output calculates the probability for each class by applying an exponential to each \( z_j \) and then {{c1::normalizing}} by the sum of exponentials.
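
A minimal numpy sketch of the softmax computation these notes describe; the function name, the max-subtraction trick, and the example scores are illustrative assumptions, not part of the deck.

```python
import numpy as np

def softmax(z):
    """Turn a vector of class scores z into probabilities that sum to 1."""
    # Subtracting the max is a standard numerical-stability trick; it cancels
    # out in the normalization, so the probabilities are unchanged.
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

# Hypothetical scores z_j for three classes.
z = np.array([2.0, 1.0, 0.1])
probs = softmax(z)
print(probs)        # ~[0.659, 0.242, 0.099]
print(probs.sum())  # 1.0 -- mutually exclusive classes
```
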
The loss function for softmax output is the {{c1::cross-entropy loss}}, which penalizes large errors in probability predictions.
Cross-entropy loss for a class \( y \) is defined as \( -\log(a_y) \), where \( a_y \) is the {{c1::predicted probability}} for the true class.
One-hot encoding represents a categorical variable as a vector with {{c1::1}} in the position of the category and {{c2::0}} elsewhere.
In one-hot encoding, the length of the vector corresponds to the {{c1::number of classes}}, ensuring each class has a unique position.
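
A small sketch of one-hot encoding as described above; `one_hot` and the example values are made up for illustration.

```python
import numpy as np

def one_hot(label, num_classes):
    """Vector of length num_classes: 1 at the label's position, 0 elsewhere."""
    vec = np.zeros(num_classes)
    vec[label] = 1.0
    return vec

print(one_hot(2, 5))  # [0. 0. 1. 0. 0.]
```
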
Cross-entropy loss is minimized when the predicted probability for the true class is close to {{c1::1}}.
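
A sketch of the cross-entropy loss \( -\log(a_y) \) with two worked values, showing that the loss is small when the true class gets probability near 1 and large when it does not; the probabilities below are made up.

```python
import numpy as np

def cross_entropy(probs, true_class):
    """Cross-entropy loss -log(a_y), where a_y is the predicted probability of the true class."""
    return -np.log(probs[true_class])

probs = np.array([0.1, 0.8, 0.1])   # hypothetical softmax output
print(cross_entropy(probs, 1))      # -log(0.8) ~ 0.22: confident and correct, small loss
print(cross_entropy(probs, 0))      # -log(0.1) ~ 2.30: true class got low probability, large loss
```
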
Batch gradient descent updates model parameters using all training samples, which is computationally {{c1::expensive}}.
Stochastic gradient descent (SGD) uses a {{c1::single sample}} per iteration, introducing noise into the gradient estimates.
Mini-batch gradient descent strikes a balance between batch gradient descent and SGD by using a {{c1::subset of the data}} in each update.
Using mini-batches allows for {{c1::vectorized computations}} and faster convergence in gradient descent.
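
A sketch of mini-batch gradient descent for softmax regression, tying the last few notes together; the hyperparameters (`lr`, `batch_size`, `epochs`) and the training-loop structure are assumptions for illustration.

```python
import numpy as np

def minibatch_gd(X, y, num_classes, lr=0.1, batch_size=32, epochs=10):
    """Mini-batch gradient descent for softmax regression (illustrative sketch).

    X: (n_samples, n_features) inputs; y: integer class labels in [0, num_classes).
    """
    n, d = X.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    for _ in range(epochs):
        order = np.random.permutation(n)               # shuffle each epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]    # a subset of the data
            Xb, yb = X[batch], y[batch]
            Z = Xb @ W + b                             # class scores, vectorized over the batch
            Z -= Z.max(axis=1, keepdims=True)
            P = np.exp(Z)
            P /= P.sum(axis=1, keepdims=True)          # softmax probabilities
            P[np.arange(len(yb)), yb] -= 1.0           # gradient of cross-entropy w.r.t. scores
            W -= lr * (Xb.T @ P) / len(yb)             # average gradient over the mini-batch
            b -= lr * P.mean(axis=0)
    return W, b
```
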
Momentum accelerates gradient descent by building velocity in directions of {{c1::consistent gradient signs}}.
In momentum-based gradient descent, the parameter update combines the current gradient with a {{c1::fraction of the previous update}}.
The momentum term is often represented as \( \beta \), typically set near {{c1::0.9}} for smoothing gradients.
Momentum helps to {{c1::reduce oscillations}} in directions where gradients vary in sign.
The momentum-adjusted update formula is {{c1::\( w_{t+1} = w_t - \alpha m_t \)}}, where \( m_t \) is the accumulated momentum.
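
A sketch of one momentum update matching \( w_{t+1} = w_t - \alpha m_t \); the accumulation rule \( m_t = \beta m_{t-1} + (1 - \beta) g_t \) is one common convention and is an assumption here, since the deck does not spell it out.

```python
import numpy as np

def momentum_step(w, grad, m, lr=0.01, beta=0.9):
    """One momentum update: accumulate velocity m, then step w against it."""
    m = beta * m + (1 - beta) * grad   # assumed accumulation rule
    w = w - lr * m                     # w_{t+1} = w_t - alpha * m_t
    return w, m

w = np.array([1.0, -2.0])
m = np.zeros_like(w)                   # no velocity before the first step
grad = np.array([0.5, -0.3])           # hypothetical gradient
w, m = momentum_step(w, grad, m)
```
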
Adaptive learning rates adjust step sizes for each parameter, reducing the chance of {{c1::overshooting minima}}.
An adaptive learning rate is computed by dividing the gradient by the square root of a running average of {{c1::squared gradients}}.
In adaptive methods, each parameter update is scaled by {{c1::\( \frac{\alpha}{\sqrt{s_t} + \epsilon} \)}}, where \( s_t \) is the accumulated average of squared gradients.
In adaptive methods, the effective learning rate shrinks for parameters with frequently large gradients, helping to {{c1::stabilize convergence}}.
The parameter \( \epsilon \) prevents division by zero and is typically set to a {{c1::small constant}} like \( 10^{-8} \).
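
A sketch of one adaptive-learning-rate update in the RMSProp style; the deck does not name a specific optimizer, so the accumulation rule for \( s_t \) is an assumption, while the scaling matches \( \frac{\alpha}{\sqrt{s_t} + \epsilon} \) from the notes.

```python
import numpy as np

def adaptive_step(w, grad, s, lr=0.01, beta=0.9, eps=1e-8):
    """One adaptive update: shrink the step where squared gradients are large."""
    s = beta * s + (1 - beta) * grad ** 2      # running average of squared gradients
    w = w - lr * grad / (np.sqrt(s) + eps)     # scale by alpha / (sqrt(s_t) + eps)
    return w, s

w = np.array([1.0, -2.0])
s = np.zeros_like(w)
grad = np.array([0.5, -0.3])                   # hypothetical gradient
w, s = adaptive_step(w, grad, s)
```
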
A machine learning dataset is often split into {{c1::training}}, {{c2::validation}}, and {{c3::test}} sets.
The validation set helps in {{c1::model selection}}, as it estimates the model’s generalization performance.
Model performance is often evaluated by measuring the {{c1::error on the test set}}, which is not used in training or validation.
Generalization error reflects the model’s ability to perform well on {{c1::unseen data}}.
For effective model selection, one should avoid using the test set for {{c1::hyperparameter tuning}}.
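
A sketch of a train/validation/test split; the 60/20/20 fractions and the function name are illustrative, as the deck does not prescribe specific ratios.

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the data, then carve off validation and test sets (the rest is training)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]           # used for fitting only
    return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx]), (X[test_idx], y[test_idx])
```
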
If a model has high bias, it is likely {{c1::underfitting}} the training data and requires more capacity or features.
A model with high variance, or {{c1::overfitting}}, fits the training data too closely and often performs poorly on new data.
Common strategies for high variance include adding {{c1::regularization}} or using simpler models.
To reduce high bias, one might increase the model’s complexity by {{c1::adding layers or features}}.
To improve generalization, one may increase the {{c1::regularization parameter}} to prevent the model from fitting noise in the training data.
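
A sketch of increasing the regularization parameter by adding an L2 penalty to the cross-entropy loss; the penalty form \( \lambda \lVert W \rVert^2 \) and the name `lam` are assumptions, as the deck only mentions a generic regularization parameter.

```python
import numpy as np

def regularized_loss(probs, y, W, lam=0.01):
    """Mean cross-entropy plus an L2 penalty lam * ||W||^2 on the weights."""
    data_loss = -np.mean(np.log(probs[np.arange(len(y)), y]))  # -log a_y averaged over samples
    penalty = lam * np.sum(W ** 2)                             # larger lam -> stronger shrinkage
    return data_loss + penalty
```
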