Notes in Week 5 - Multiclass Classification and Softmax Regression
To subscribe, use this key: juliet-lithium-fruit-wisconsin-seventeen-black
Every note below has status Published and was last updated 11/26/2024.

Multiclass classification extends binary classification, where the output \( y \) can be any of {{c1::multiple (> 2) values}}.
In logistic regression, there are {{c1::two possible outputs}}, while in softmax regression, there can be {{c2::N possible outputs}}.
{{c1::Softmax regression}} calculates each class probability as \( P(y = j | x) = \frac{e^{z_j}}{\sum_{k=1}^{N} e^{z_k}} \), where \( z_j = w_j \cdot x + b_j \) is the linear score for class \( j \).
The softmax function outputs probabilities that sum to {{c1::1}}, ensuring that the classes are mutually exclusive.
In a neural network, the softmax output layer couples neurons, as they share a {{c1::normalization constraint}}.
The softmax layer produces a probability distribution, making it suitable for {{c1::multiclass classification}} tasks.
Sigmoid output is generally used for {{c1::binary classification}}, while softmax is used for {{c2::multiclass classification}}.
Softmax output calculates the probability for each class by applying an exponential to each \( z_j \) and then {{c1::normalizing}} by the sum of exponentials.
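
A minimal numpy sketch of the softmax computation these notes describe; the function name, the max-subtraction trick, and the example scores are illustrative assumptions, not part of the deck.

```python
import numpy as np

def softmax(z):
    """Turn a vector of class scores z into probabilities that sum to 1."""
    # Subtracting the max is a standard numerical-stability trick; it cancels
    # out in the normalization, so the probabilities are unchanged.
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

# Hypothetical scores z_j for three classes.
z = np.array([2.0, 1.0, 0.1])
probs = softmax(z)
print(probs)        # ~[0.659, 0.242, 0.099]
print(probs.sum())  # 1.0 -- mutually exclusive classes
```
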
The loss function for softmax output is the {{c1::cross-entropy loss}}, which penalizes large errors in probability predictions.
Cross-entropy loss for a class \( y \) is defined as \( -\log(a_y) \), where \( a_y \) is the {{c1::predicted probability}} for the true class.
One-hot encoding represents a categorical variable as a vector with {{c1::1}} in the position of the category and {{c2::0}} elsewhere.
In one-hot encoding, the length of the vector corresponds to the {{c1::number of classes}}, ensuring each class has a unique position.
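
A small sketch of one-hot encoding as described above; `one_hot` and the example values are made up for illustration.

```python
import numpy as np

def one_hot(label, num_classes):
    """Vector of length num_classes: 1 at the label's position, 0 elsewhere."""
    vec = np.zeros(num_classes)
    vec[label] = 1.0
    return vec

print(one_hot(2, 5))  # [0. 0. 1. 0. 0.]
```
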
Cross-entropy loss is minimized when the predicted probability for the true class is close to {{c1::1}}.
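
A sketch of the cross-entropy loss \( -\log(a_y) \) with two worked values, showing that the loss is small when the true class gets probability near 1 and large when it does not; the probabilities below are made up.

```python
import numpy as np

def cross_entropy(probs, true_class):
    """Cross-entropy loss -log(a_y), where a_y is the predicted probability of the true class."""
    return -np.log(probs[true_class])

probs = np.array([0.1, 0.8, 0.1])   # hypothetical softmax output
print(cross_entropy(probs, 1))      # -log(0.8) ~ 0.22: confident and correct, small loss
print(cross_entropy(probs, 0))      # -log(0.1) ~ 2.30: true class got low probability, large loss
```
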
Batch gradient descent updates model parameters using all training samples, which is computationally {{c1::expensive}}.
Stochastic gradient descent (SGD) uses a {{c1::single sample}} per iteration, introducing noise into the gradient estimates.
Mini-batch gradient descent strikes a balance between batch gradient descent and SGD by using a {{c1::subset of the data}} in each update.
Using mini-batches allows for {{c1::vectorized computations}} and faster convergence in gradient descent.
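
A sketch of mini-batch gradient descent for softmax regression, tying the last few notes together; the hyperparameters (`lr`, `batch_size`, `epochs`) and the training-loop structure are assumptions for illustration.

```python
import numpy as np

def minibatch_gd(X, y, num_classes, lr=0.1, batch_size=32, epochs=10):
    """Mini-batch gradient descent for softmax regression (illustrative sketch).

    X: (n_samples, n_features) inputs; y: integer class labels in [0, num_classes).
    """
    n, d = X.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    for _ in range(epochs):
        order = np.random.permutation(n)               # shuffle each epoch
        for start in range(0, n, batch_size):
            batch = order[start:start + batch_size]    # a subset of the data
            Xb, yb = X[batch], y[batch]
            Z = Xb @ W + b                             # class scores, vectorized over the batch
            Z -= Z.max(axis=1, keepdims=True)
            P = np.exp(Z)
            P /= P.sum(axis=1, keepdims=True)          # softmax probabilities
            P[np.arange(len(yb)), yb] -= 1.0           # gradient of cross-entropy w.r.t. scores
            W -= lr * (Xb.T @ P) / len(yb)             # average gradient over the mini-batch
            b -= lr * P.mean(axis=0)
    return W, b
```
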
Momentum accelerates gradient descent by building velocity in directions of {{c1::consistent gradient signs}}.
In momentum-based gradient descent, the parameter update combines the current gradient with a {{c1::fraction of the previous update}}.
The momentum term is often represented as \( \beta \), typically set near {{c1::0.9}} for smoothing gradients.
Momentum helps to {{c1::reduce oscillations}} in directions where gradients vary in sign.
The momentum-adjusted update formula is {{c1::\( w_{t+1} = w_t - \alpha m_t \)}}, where \( m_t \) is the accumulated momentum.
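
A sketch of one momentum update matching \( w_{t+1} = w_t - \alpha m_t \); the accumulation rule \( m_t = \beta m_{t-1} + (1 - \beta) g_t \) is one common convention and is an assumption here, since the deck does not spell it out.

```python
import numpy as np

def momentum_step(w, grad, m, lr=0.01, beta=0.9):
    """One momentum update: accumulate velocity m, then step w against it."""
    m = beta * m + (1 - beta) * grad   # assumed accumulation rule
    w = w - lr * m                     # w_{t+1} = w_t - alpha * m_t
    return w, m

w = np.array([1.0, -2.0])
m = np.zeros_like(w)                   # no velocity before the first step
grad = np.array([0.5, -0.3])           # hypothetical gradient
w, m = momentum_step(w, grad, m)
```
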
Adaptive learning rates adjust step sizes for each parameter, reducing the chance of {{c1::overshooting minima}}.
An adaptive learning rate is computed by dividing the gradient by the square root of a running average of {{c1::squared gradients}}.
In adaptive methods, each parameter update is scaled by {{c1::\( \frac{\alpha}{\sqrt{s_t} + \epsilon} \)}}, where \( s_t \) is the accumulated average of squared gradients.
In adaptive methods, the effective learning rate shrinks for parameters with frequently large gradients, helping to {{c1::stabilize convergence}}.
The parameter \( \epsilon \) prevents division by zero and is typically set to a {{c1::small constant}} like \( 10^{-8} \).
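
A sketch of one adaptive-learning-rate update in the RMSProp style; the deck does not name a specific optimizer, so the accumulation rule for \( s_t \) is an assumption, while the scaling matches \( \frac{\alpha}{\sqrt{s_t} + \epsilon} \) from the notes.

```python
import numpy as np

def adaptive_step(w, grad, s, lr=0.01, beta=0.9, eps=1e-8):
    """One adaptive update: shrink the step where squared gradients are large."""
    s = beta * s + (1 - beta) * grad ** 2      # running average of squared gradients
    w = w - lr * grad / (np.sqrt(s) + eps)     # scale by alpha / (sqrt(s_t) + eps)
    return w, s

w = np.array([1.0, -2.0])
s = np.zeros_like(w)
grad = np.array([0.5, -0.3])                   # hypothetical gradient
w, s = adaptive_step(w, grad, s)
```
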
A machine learning dataset is often split into {{c1::training}}, {{c2::validation}}, and {{c3::test}} sets.
The validation set helps in {{c1::model selection}}, as it estimates the model’s generalization performance.
Model performance is often evaluated by measuring the {{c1::error on the test set}}, which is not used in training or validation.
Generalization error reflects the model’s ability to perform well on {{c1::unseen data}}.
For effective model selection, one should avoid using the test set for {{c1::hyperparameter tuning}}.
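
A sketch of a train/validation/test split; the 60/20/20 fractions and the function name are illustrative, as the deck does not prescribe specific ratios.

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the data, then carve off validation and test sets (the rest is training)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_frac)
    n_val = int(len(X) * val_frac)
    test_idx = idx[:n_test]
    val_idx = idx[n_test:n_test + n_val]
    train_idx = idx[n_test + n_val:]           # used for fitting only
    return (X[train_idx], y[train_idx]), (X[val_idx], y[val_idx]), (X[test_idx], y[test_idx])
```
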
If a model has high bias, it is likely {{c1::underfitting}} the training data and requires more capacity or features.
A model with high variance, or {{c1::overfitting}}, fits the training data too closely and often performs poorly on new data.
Common strategies for high variance include adding {{c1::regularization}} or using simpler models.
To reduce high bias, one might increase the model’s complexity by {{c1::adding layers or features}}.
To improve generalization, one may increase the {{c1::regularization parameter}} to prevent the model from fitting noise in the training data.
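
A sketch of increasing the regularization parameter by adding an L2 penalty to the cross-entropy loss; the penalty form \( \lambda \lVert W \rVert^2 \) and the name `lam` are assumptions, as the deck only mentions a generic regularization parameter.

```python
import numpy as np

def regularized_loss(probs, y, W, lam=0.01):
    """Mean cross-entropy plus an L2 penalty lam * ||W||^2 on the weights."""
    data_loss = -np.mean(np.log(probs[np.arange(len(y)), y]))  # -log a_y averaged over samples
    penalty = lam * np.sum(W ** 2)                             # larger lam -> stronger shrinkage
    return data_loss + penalty
```
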