Notes in week 4
To subscribe, use this key: early-sweet-hot-freddie-kansas-alpha
Status | Last Update | Fields
Published | 04/07/2024 | Language Representation: For processing in networks, we need to represent words by vectors, the simplest of which is one-hot encoding. What was One-Hot …
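As a quick illustration of the one-hot idea in this card (my own sketch, not part of the deck), each word maps to a vector with a single 1 at its vocabulary index:

# Minimal sketch: one-hot encoding over a toy vocabulary (illustrative only).
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # p-dimensional vector with a single 1 at the word's vocabulary index
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("cat"))  # [0, 1, 0, 0, 0]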
Published | 04/07/2024 | Embedding: Alternative: embed p-dimensional one-hot encoded vectors x into a D-dimensional space (v = Ex). Learn the matrix E from a large text corpus through …
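A minimal sketch of the v = Ex lookup (toy sizes, not from the deck): because x is one-hot, the matrix product simply selects one column of the embedding matrix E.

import numpy as np

p, D = 5, 3                     # toy vocabulary size and embedding dimension (assumptions)
rng = np.random.default_rng(0)
E = rng.normal(size=(D, p))     # embedding matrix; in practice learned from a large corpus

x = np.zeros(p)
x[1] = 1.0                      # one-hot vector for token index 1
v = E @ x                       # v = Ex, a D-dimensional embedding
assert np.allclose(v, E[:, 1])  # the product just selects one column of E
print(v)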
Published | 04/07/2024 | Embedding (II): We learn the semantic structure. What does this mean? {{c1::Learning semantic structure means related words get similar representat…
Published | 04/07/2024 | Tokenization: Words do not always capture semantics optimally... 'time', 'times', 'timer', 'timely', 'timing', … Create a vocabulary at the level needed for the task: ● c…
Published | 04/07/2024 | Byte-pair encoding: This method is similar to data compression and hierarchical clustering. Which two steps does this method perform? Hint: Image. …
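A rough sketch of the two steps the card hints at, under the usual byte-pair-encoding formulation: count adjacent symbol pairs, merge the most frequent pair, and repeat. The corpus and loop count are illustrative only.

from collections import Counter

def most_frequent_pair(corpus):
    # Step 1: count how often each adjacent symbol pair occurs.
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(corpus, pair):
    # Step 2: merge every occurrence of that pair into a single new symbol.
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: words split into characters, with counts.
corpus = {tuple("timer"): 3, tuple("times"): 2, tuple("timely"): 1}
for _ in range(4):                       # repeat the two steps a few times
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
print(corpus)                            # the subword unit 'time' emerges from repeated merges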
Published | 04/07/2024 | Sequence models: The most general way to obtain a sequence model is to use the joint distribution over full token sequences, p(X1, X2, ..., Xn). What is this g…
Published | 04/07/2024 | Sequence models II: The simplest model is to assume complete independence between tokens. This so-called 'bag of words' approach does not use order…
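A small sketch of the bag-of-words idea (whitespace tokenization assumed, not the lecture's code): only counts survive, so any reordering of the text gives the same vector.

from collections import Counter

def bag_of_words(text, vocab):
    # Count tokens and drop all order information.
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

vocab = ["the", "cat", "sat", "on", "mat"]
print(bag_of_words("The cat sat on the mat", vocab))  # [2, 1, 1, 1, 1]
# "mat the on sat cat the" yields the same vector, since order is ignored.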
Published | 04/07/2024 | Markov models: An intermediate solution is to assume that the probability of observing a token is independent of the history, given the M previous tokens. What can a Marko…
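A minimal count-based sketch for the M = 1 (bigram) case, i.e. estimating p(x_n | x_{n-1}) from a toy corpus (illustrative only):

from collections import Counter, defaultdict

def fit_bigram(tokens):
    # Count transitions and normalize them into conditional probabilities.
    counts = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        counts[prev][cur] += 1
    return {prev: {w: c / sum(nxt.values()) for w, c in nxt.items()}
            for prev, nxt in counts.items()}

tokens = "the cat sat on the mat".split()
model = fit_bigram(tokens)
print(model["the"])  # {'cat': 0.5, 'mat': 0.5}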
Published | 04/07/2024 | Neural network models: With increasing Markov order M, what actually increases? {{c1::Higher Markov order M increases the size of probability tables. Appr…
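To make the cloze concrete: with vocabulary size V (the value below is an assumption), a full conditional table has one row per M-token history, so its size grows like V to the power M and quickly explodes.

# Sketch: number of entries in a full table for p(x_n | x_{n-1}, ..., x_{n-M}).
V = 50_000                   # assumed vocabulary size, for illustration
for M in (1, 2, 3):
    print(M, V ** (M + 1))   # V**M histories, each with V entries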
Published | 04/07/2024 | Hidden Markov Model: HMMs are quite wasteful. Generally, not all sequences are possible, but p(x_n | x_{n-1}, ..., x_{n-M}) is specified for all x_{n-1}, ..., x_{n-…
Published | 04/07/2024 | Recurrent neural networks: Similarly, we can approximate y_n = p(x_n | x_{n-1}, ..., x_{n-M}) using a recurrent neural network (RNN, 1990). At each time step, the same n…
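A compact PyTorch sketch (toy sizes, not the lecture's model) of the "same network applied at each time step" idea, carrying a hidden state forward:

import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=8, hidden_size=16)   # the same cell is reused at every step
readout = nn.Linear(16, 100)                      # 100 = toy vocabulary size (assumption)

x = torch.randn(5, 3, 8)                          # (time steps, batch, input dim)
h = torch.zeros(3, 16)                            # initial hidden state
for t in range(x.shape[0]):
    h = cell(x[t], h)                             # hidden state carries the history forward
    logits = readout(h)                           # per-step prediction, e.g. of the next token
print(logits.shape)                               # torch.Size([3, 100])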
Published | 04/07/2024 | RNN Issues: There are some issues: Backpropagation: {{c1::backpropagation through time suffers from vanishing/exploding gradients. Vanishing gradient ari…
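A toy numeric illustration of the vanishing/exploding point (numbers are arbitrary): backpropagation through time picks up one multiplicative factor per step, so a factor below or above 1 shrinks or grows the gradient exponentially with sequence length.

# Sketch: a per-step gradient factor compounds over T time steps.
for factor in (0.9, 1.1):
    for T in (10, 100, 1000):
        print(factor, T, factor ** T)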
Published | 04/07/2024 | Attention - Transformers: What is the 'big thing' that Transformers do, the 'attention' part? {{c1::Learns pairwise relations between words, i.e. which wo…
Published | 04/08/2024 | Attention implementation: How do we implement 'attention' in code? class MultiheadDotProductAttention(nn.Module): """Multihead dot pro…
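The card's code is cut off; below is my own minimal multi-head scaled dot-product self-attention sketch in PyTorch. The class name, sizes, and layout are illustrative and not the deck's MultiheadDotProductAttention.

import math
import torch
import torch.nn as nn

class MultiheadDotProductAttentionSketch(nn.Module):
    """Minimal multi-head scaled dot-product self-attention (illustrative sketch)."""

    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)   # projects to queries, keys, values
        self.out = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):                                # x: (batch, tokens, embed_dim)
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split the embedding into heads: (batch, heads, tokens, head_dim).
        q, k, v = [t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v)]
        # Pairwise token scores, scaled by sqrt(head_dim), softmax over the key positions.
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.head_dim), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.out(out)

x = torch.randn(2, 128, 48)                              # toy batch: 128 tokens, 48 channels
print(MultiheadDotProductAttentionSketch(embed_dim=48, num_heads=8)(x).shape)  # (2, 128, 48)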
Published | 04/08/2024 | Naïve transformer: We use tokens with positional embeddings. Attention with a normalization layer and skip connections. Training takes place on 'shifted'…
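A tiny sketch of what the truncated 'shifted' training is usually taken to mean (my reading, not confirmed by the card): inputs are the tokens up to position N-1 and the targets are the same sequence shifted by one, so each position predicts the next token.

import torch

tokens = torch.tensor([5, 9, 2, 7, 3])
inputs, targets = tokens[:-1], tokens[1:]
print(inputs.tolist(), targets.tolist())   # [5, 9, 2, 7] [9, 2, 7, 3]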
Published | 04/09/2024 | Increasing attention flexibility / performance: The naïve transformer calculates one attention matrix between all tokens, which makes it difficult to cap…
Published | 04/08/2024 | Improved transformer block: We combine communication between tokens (the attention) with communication between channels using a position-wise MLP (Multilay…
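A sketch of such a block in PyTorch (the pre-norm ordering and all sizes are my assumptions): attention mixes information across tokens, and a position-wise MLP mixes across channels, each wrapped in a layer norm and a skip connection.

import torch
import torch.nn as nn

class TransformerBlockSketch(nn.Module):
    """Sketch: token mixing (attention) then channel mixing (position-wise MLP)."""

    def __init__(self, dim, heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                       # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]           # communication between tokens
        x = x + self.mlp(self.norm2(x))         # communication between channels, per position
        return x

x = torch.randn(2, 10, 32)
print(TransformerBlockSketch(32, 4)(x).shape)   # torch.Size([2, 10, 32])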
Published | 04/09/2024 | Learning relevant context from data: The original 'attention' mechanism was introduced in 2015 in the context of translation. What made this model special? …
Published | 04/08/2024 | Transformer architecture for sequence learning: The transformer architecture was introduced by the landmark paper: {{c1::"Attention is all you need…
Published | 04/09/2024 | Transformer prerequisites: To stack layers, the input and output dimensions of a layer should match: X' = TransformerLayer(X). Within each layer, what type of exc…
Published | 04/08/2024 | Attention weights: Within each transformer layer, sequence context gets assigned a weight ● every position is weighted by all other positions. What does th…
Published | 04/08/2024 | Matrix question: What is the size (shape) of the attention weights? {{c1::The attention weights are a square N x N matrix, as each token gets compared to…
Published | 04/08/2024 | Self-Attention: For each item in the input sequence we compute three things. What are those? {{c1::For each item in the input sequence we compute a…
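A shape-level sketch of the three quantities and of the resulting N x N attention weights, using the 128-token, 6-dimensional setup from the 'Question about windows' card below (keeping the projection size at 6 is my assumption):

import torch
import torch.nn as nn

N, d = 128, 6                           # context window and embedding size from the card
X = torch.randn(N, d)                   # token embeddings
W_q, W_k, W_v = (nn.Linear(d, d, bias=False) for _ in range(3))

Q, K, V = W_q(X), W_k(X), W_v(X)        # query, key, value: each (128, 6)
A = torch.softmax(Q @ K.T / d ** 0.5, dim=-1)   # attention weights: (128, 128), rows sum to 1
out = A @ V                             # contextualised representations: (128, 6)
print(Q.shape, A.shape, out.shape)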
Published | 04/08/2024 | Parameterizing attention weights: We have the Query, Key and Value. What do they represent? {{c1::Query: What I am looking for. Key: This is what I am look…
Published | 04/08/2024 | Question about windows: For a context window of 128 tokens, each embedded into 6 dimensions, give the dimensions of the following matrices: ● W(q), …
Published | 04/08/2024 | Positional embedding: How does the 'vanilla' attention mechanism 'look' at sentence order with respect to input permutations? {{c1::The vanilla attention…
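A quick check of the point behind this cloze (my own sketch): without positional information, self-attention treats the input as a set, so permuting the tokens just permutes the outputs in the same way.

import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True).eval()
x = torch.randn(1, 5, 8)
perm = torch.randperm(5)
with torch.no_grad():
    y = attn(x, x, x)[0]
    y_perm = attn(x[:, perm], x[:, perm], x[:, perm])[0]
print(torch.allclose(y[:, perm], y_perm, atol=1e-5))  # True: permutation-equivariant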
Published | 04/08/2024 | Positional embedding - Options: Multiple options exist for adding position information. Name one: {{c1::One option is to treat each position as a categor…
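A sketch of the 'position as a categorical variable' option, assuming it refers to a learned per-position embedding table (toy sizes): position indices are looked up like tokens and added to the token embeddings.

import torch
import torch.nn as nn

vocab_size, max_len, dim = 100, 128, 32        # toy values (assumptions)
tok_emb = nn.Embedding(vocab_size, dim)
pos_emb = nn.Embedding(max_len, dim)           # one learned vector per position index

tokens = torch.randint(0, vocab_size, (2, 10)) # (batch, sequence length)
positions = torch.arange(tokens.shape[1])
x = tok_emb(tokens) + pos_emb(positions)       # broadcasts to (2, 10, 32)
print(x.shape)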