Notes for week 4

Status Last Update Fields
Published 04/07/2024 Language Representation (one-hot sketch below the table): For processing in networks, we need to represent words as vectors; the simplest choice is one-hot encoding. What is one-hot …
Published 04/07/2024 Embedding (embedding sketch below the table): Alternative: embed the p-dimensional one-hot encoded vectors x in a D-dimensional space (v = Ex). Learn the matrix E from a large text corpus through …
Published 04/07/2024 Embedding (II): We learn the semantic structure. What does this mean? {{c1::Learning semantic structure means related words get similar representat…
Published 04/07/2024 Tokenization: Words do not always capture semantics optimally: 'time', 'times', 'timer', 'timely', 'timing', … Create a vocabulary at the level needed for the task: ● c…
Published 04/07/2024 Byte-pair encoding (merge sketch below the table): This method is similar to data compression and hierarchical clustering. Which two steps does this method perform? Hint: image. …
Published 04/07/2024 Sequence models: The most general way to build a sequence model is to use the joint distribution over full token sequences, p(x_1, x_2, ..., x_n). What is this g…
Published 04/07/2024 Sequence models II (bag-of-words sketch below the table): The simplest model is to assume complete independence between tokens. This so-called 'bag of words' approach does not use order…
Published 04/07/2024 Markov models (bigram sketch below the table): An intermediate solution is to assume that the probability of observing a token is independent of the history, given the M previous tokens. What can a Marko…
Published 04/07/2024 Neural network models (table-size example below the table): With increasing Markov order M, what actually increases? {{c1::Higher Markov order M increases the size of the probability tables. Appr…
Published 04/07/2024 Hidden Markov Model: HMMs are quite wasteful. Generally, not all sequences are possible, but p(x_n | x_{n-1}, ..., x_{n-M}) is specified for all x_{n-1}, ..., x_{n-…
Published 04/07/2024 Recurrent neural networks (RNN sketch below the table): Similarly, we can approximate y_n = p(x_n | x_{n-1}, ..., x_{n-M}) using a recurrent neural network (RNN, 1990). At each time step, the same n…
Published 04/07/2024 RNN Issues: There are some issues. Backpropagation: {{c1::backpropagation through time suffers from vanishing/exploding gradients. Vanishing gradient ari…
Published 04/07/2024 Attention - Transformers: What is the 'big thing' that Transformers do, the 'attention' part? {{c1::It learns pairwise relations between words, i.e. which wo…
Published 04/08/2024 Attention implementation (hedged sketch below the table): How do we implement 'attention' in code? class MultiheadDotProductAttention(nn.Module): """Multihead dot pro…
Published 04/08/2024 Naïve transformer: We use tokens with positional embeddings, and attention with a normalization layer and skip connections. Training takes place on 'shifted'…
Published 04/09/2024 Increasing attention flexibility / performance: The naïve transformer calculates one attention matrix between all tokens, which makes it difficult to cap…
Published 04/08/2024 Improved transformer block (block sketch below the table): We combine communication between tokens (the attention) with communication between channels using a position-wise MLP (Multilay…
Published 04/09/2024 Learning relevant context from data: The original 'attention' mechanism was introduced in 2015 in a translation context. What made this model special? …
Published 04/08/2024 Transformer architecture for sequence learning: The transformer architecture was introduced by the landmark paper {{c1::"Attention is all you need…
Published 04/09/2024 Transformer prerequisites: To stack layers, the input and output dimensions of a layer should match: X' = TransformerLayer(X). Within each layer, what type of exc…
Published 04/08/2024 Attention weights: Within each transformer layer, sequence context gets assigned a weight: ● every position is weighted by all other positions. What does th…
Published 04/08/2024 Matrix question: What is the size (shape) of the attention weights? {{c1::The attention weights are a square N x N matrix, as each token gets compared to…
Published 04/08/2024 Self-Attention: For each item in the input sequence we compute three things. What are they? {{c1::For each item in the input sequence we compute a…
Published 04/08/2024 Parameterizing attention weights: We have the Query, Key and Value. What do they represent? {{c1::Query: What I am looking for. Key: This is what I am look…
Published 04/08/2024 Question about windows (the attention sketch below the table uses these sizes): For a context window of 128 tokens, each embedded into 6 dimensions, give the dimensions of the following matrices: ● W(q), …
Published 04/08/2024 Positional embedding: How does the 'vanilla' attention mechanism 'look' at sentence order with respect to input permutations? {{c1::The vanilla attention…
Published 04/08/2024 Positional embedding - Options (sketch below the table): Multiple options exist for adding position information. Name one: {{c1::One is to treat each position as a categor…
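
A minimal sketch of one-hot encoding for the 'Language Representation' note above; the toy vocabulary and words are invented for illustration.

    import numpy as np

    # Toy vocabulary (invented for illustration); p = 5 words.
    vocab = ["the", "cat", "sat", "on", "mat"]
    index = {word: i for i, word in enumerate(vocab)}

    def one_hot(word):
        """Return a p-dimensional vector with a single 1 at the word's index."""
        vec = np.zeros(len(vocab))
        vec[index[word]] = 1.0
        return vec

    print(one_hot("cat"))   # [0. 1. 0. 0. 0.]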
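
A sketch of the embedding step v = Ex from the 'Embedding' note: multiplying a one-hot vector by a D x p matrix E simply selects one column of E, so the embedding is a table lookup. E is random here; in the note it is learned from a large text corpus.

    import numpy as np

    p, D = 5, 3                      # vocabulary size and embedding dimension (illustrative)
    E = np.random.randn(D, p)        # embedding matrix; learned from data in practice

    x = np.zeros(p)
    x[1] = 1.0                       # one-hot vector for the token with index 1

    v = E @ x                        # v = Ex, a D-dimensional embedding
    assert np.allclose(v, E[:, 1])   # identical to selecting column 1 of E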
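
A hedged sketch of the two repeated steps behind byte-pair encoding for the 'Byte-pair encoding' note: count adjacent symbol pairs, then merge the most frequent pair into a new symbol. The toy word counts and the number of merge steps are invented.

    from collections import Counter

    # Toy corpus: each word is a tuple of symbols (characters to start with), with a count.
    corpus = Counter({("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
                      ("n", "e", "w", "e", "s", "t"): 6})

    def most_frequent_pair(corpus):
        """Step 1: count all adjacent symbol pairs and return the most frequent one."""
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        return pairs.most_common(1)[0][0]

    def merge(corpus, pair):
        """Step 2: replace every occurrence of the pair with a single merged symbol."""
        merged = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] += freq
        return merged

    for _ in range(3):                       # a few merge steps; real vocabularies use thousands
        pair = most_frequent_pair(corpus)
        corpus = merge(corpus, pair)
        print("merged", pair)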
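
For the 'Sequence models II' note, a sketch of the bag-of-words independence assumption: p(x_1, ..., x_n) is approximated as a product of per-token probabilities estimated from counts, so token order does not change the probability. The toy sentence is invented.

    import math
    from collections import Counter

    tokens = "the cat sat on the mat".split()          # toy data (invented)
    counts = Counter(tokens)
    total = sum(counts.values())

    def log_prob(sentence):
        """log p(x_1, ..., x_n) under full independence: a sum of per-token log-probabilities."""
        return sum(math.log(counts[w] / total) for w in sentence.split())

    # A permutation of the same tokens gets (numerically) the same probability:
    print(math.isclose(log_prob("the cat sat on the mat"),
                       log_prob("mat the on sat cat the")))    # True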
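
For the 'Markov models' note, a sketch of the order M = 1 case: p(x_n | x_{n-1}) estimated from bigram counts on a toy corpus (invented).

    from collections import Counter

    tokens = "the cat sat on the mat and the cat slept".split()   # toy corpus (invented)

    bigrams = Counter(zip(tokens, tokens[1:]))    # counts of (x_{n-1}, x_n) pairs
    contexts = Counter(tokens[:-1])               # counts of the context x_{n-1}

    def p_next(prev, nxt):
        """Estimate p(x_n = nxt | x_{n-1} = prev) from counts."""
        return bigrams[(prev, nxt)] / contexts[prev]

    print(p_next("the", "cat"))    # 2/3: 'the' is followed by 'cat' twice and by 'mat' once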
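
A worked size check for the 'Neural network models' note: a probability table for p(x_n | x_{n-1}, ..., x_{n-M}) needs one row per possible context, so it grows as V^M with the Markov order M. The vocabulary size V is an arbitrary illustration.

    V = 10_000                     # vocabulary size (arbitrary illustration)
    for M in (1, 2, 3):            # Markov order
        contexts = V ** M          # one row per context x_{n-1}, ..., x_{n-M}
        print(f"M={M}: {contexts * V:.3e} table entries")
    # M=1: 1.000e+08, M=2: 1.000e+12, M=3: 1.000e+16 -- exponential growth in M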
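
For the 'Recurrent neural networks' note, a minimal NumPy sketch of the recurrence: the same weight matrices are applied at every time step and the hidden state carries the history. The sizes, the tanh/softmax choices, and the random inputs are illustrative assumptions, not the course's exact model.

    import numpy as np

    rng = np.random.default_rng(0)
    D, H, V = 4, 8, 10                       # embedding, hidden and vocabulary sizes (illustrative)
    Wxh = rng.normal(size=(H, D))            # input-to-hidden weights
    Whh = rng.normal(size=(H, H))            # hidden-to-hidden (recurrent) weights
    Why = rng.normal(size=(V, H))            # hidden-to-output weights

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    h = np.zeros(H)
    for x in rng.normal(size=(5, D)):        # a sequence of 5 token embeddings
        h = np.tanh(Wxh @ x + Whh @ h)       # the same network is reused at every step
        y = softmax(Why @ h)                 # y_n approximates p(x_n | history)
    print(y.shape, round(y.sum(), 6))        # (10,) 1.0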
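
The 'Attention implementation' card is cut off, so instead of reconstructing the course's MultiheadDotProductAttention class, here is a hedged NumPy sketch of a single head of scaled dot-product self-attention with learned W(q), W(k), W(v) projections; a multi-head version would run several such heads in parallel and concatenate their outputs. The sizes match the 'Question about windows' note (N = 128 tokens, D = 6 dimensions); the choice of 6-dimensional queries/keys/values is an assumption.

    import numpy as np

    def softmax(z, axis=-1):
        e = np.exp(z - z.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def self_attention(X, Wq, Wk, Wv):
        """X: (N, D) token embeddings. Returns the attended values and the (N, N) attention weights."""
        Q, K, V = X @ Wq, X @ Wk, X @ Wv           # queries, keys, values for every position
        scores = Q @ K.T / np.sqrt(K.shape[-1])    # scaled dot products between all positions
        A = softmax(scores, axis=-1)               # each row sums to 1: how much position i attends to j
        return A @ V, A

    rng = np.random.default_rng(0)
    N, D, d = 128, 6, 6                            # context window and embedding size from the notes
    X = rng.normal(size=(N, D))
    Wq, Wk, Wv = (rng.normal(size=(D, d)) for _ in range(3))
    out, A = self_attention(X, Wq, Wk, Wv)
    print(A.shape, out.shape)                      # (128, 128) (128, 6): the weights form a square N x N matrix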
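
For the 'Improved transformer block' note, a NumPy sketch of one block that combines communication between tokens (self-attention) with communication between channels (a position-wise MLP), each wrapped in a skip connection with layer normalization. The pre-norm placement and the ReLU MLP are common choices assumed here, not necessarily the exact layout on the course slides.

    import numpy as np

    def layer_norm(x, eps=1e-5):
        return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

    def softmax(z):
        e = np.exp(z - z.max(-1, keepdims=True))
        return e / e.sum(-1, keepdims=True)

    def transformer_block(X, Wq, Wk, Wv, W1, W2):
        # 1) communication between tokens: self-attention with normalization and a skip connection
        Xn = layer_norm(X)
        Q, K, V = Xn @ Wq, Xn @ Wk, Xn @ Wv
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        X = X + A @ V
        # 2) communication between channels: position-wise MLP with normalization and a skip connection
        H = np.maximum(layer_norm(X) @ W1, 0.0)          # ReLU MLP applied independently at every position
        return X + H @ W2

    rng = np.random.default_rng(0)
    N, D = 8, 6
    shapes = [(D, D)] * 3 + [(D, 4 * D), (4 * D, D)]
    params = [0.1 * rng.normal(size=s) for s in shapes]
    Y = transformer_block(rng.normal(size=(N, D)), *params)
    print(Y.shape)    # (8, 6): input and output dimensions match, so blocks can be stacked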
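
For the two 'Positional embedding' notes, a sketch of the option named in the card: treat each position as a categorical variable with its own embedding vector and add it to the token embedding, so that attention (which is otherwise permutation-invariant) can see where each token sits. The position table is random here; in practice it is learned (or replaced by a fixed sinusoidal encoding).

    import numpy as np

    rng = np.random.default_rng(0)
    N, D = 128, 6                          # context window and embedding size from the notes
    token_emb = rng.normal(size=(N, D))    # stand-in for the embedded tokens E x_n
    pos_emb = rng.normal(size=(N, D))      # one vector per position (learned in practice)

    X = token_emb + pos_emb                # the input to the first transformer layer
    print(X.shape)                         # (128, 6)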