Notes in week 4
To subscribe, use this key: early-sweet-hot-freddie-kansas-alpha
Status | Last Update | Fields
Published | 04/07/2024 | Language Representation: For processing in networks, we need to represent words by vectors, the simplest of which is one-hot encoding. What was One-Hot …
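As a quick illustration of the one-hot idea in this card (my own sketch, not part of the deck), each word maps to a vector with a single 1 at its vocabulary index:

# Minimal sketch: one-hot encoding over a toy vocabulary (illustrative only).
vocab = ["the", "cat", "sat", "on", "mat"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # p-dimensional vector with a single 1 at the word's vocabulary index
    vec = [0] * len(vocab)
    vec[word_to_index[word]] = 1
    return vec

print(one_hot("cat"))  # [0, 1, 0, 0, 0]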
Published | 04/07/2024 | Embedding: Alternative: embed p-dimensional one-hot encoded vectors x into a D-dimensional space (v = Ex). Learn the matrix E from a large text corpus through …
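A minimal sketch of the v = Ex lookup (toy sizes, not from the deck): because x is one-hot, the matrix product simply selects one column of the embedding matrix E.

import numpy as np

p, D = 5, 3                     # toy vocabulary size and embedding dimension (assumptions)
rng = np.random.default_rng(0)
E = rng.normal(size=(D, p))     # embedding matrix; in practice learned from a large corpus

x = np.zeros(p)
x[1] = 1.0                      # one-hot vector for token index 1
v = E @ x                       # v = Ex, a D-dimensional embedding
assert np.allclose(v, E[:, 1])  # the product just selects one column of E
print(v)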
Published | 04/07/2024 | Embedding (II): We learn the semantic structure. What does this mean? {{c1::Learning semantic structure means related words get similar representat…
Published | 04/07/2024 | Tokenization: Words do not always capture semantics optimally... 'time', 'times', 'timer', 'timely', 'timing', … Create a vocabulary at the level needed for the task: ● c…
Published | 04/07/2024 | Byte-pair encoding: This method is similar to data compression and hierarchical clustering. Which two steps does this method perform? Hint: Image. …
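A rough sketch of the two steps the card hints at, under the usual byte-pair-encoding formulation: count adjacent symbol pairs, merge the most frequent pair, and repeat. The corpus and loop count are illustrative only.

from collections import Counter

def most_frequent_pair(corpus):
    # Step 1: count how often each adjacent symbol pair occurs.
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(corpus, pair):
    # Step 2: merge every occurrence of that pair into a single new symbol.
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Toy corpus: words split into characters, with counts.
corpus = {tuple("timer"): 3, tuple("times"): 2, tuple("timely"): 1}
for _ in range(4):                       # repeat the two steps a few times
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
print(corpus)                            # the subword unit 'time' emerges from repeated merges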
Published | 04/07/2024 | Sequence models: The most general way to obtain a sequence model is to use the joint distribution over full token sequences, p(X1, X2, ..., Xn). What is this g…
Published | 04/07/2024 | Sequence models II: The simplest model is to assume complete independence between tokens. This so-called 'bag of words' approach does not use order…
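A small sketch of the bag-of-words idea (whitespace tokenization assumed, not the lecture's code): only counts survive, so any reordering of the text gives the same vector.

from collections import Counter

def bag_of_words(text, vocab):
    # Count tokens and drop all order information.
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

vocab = ["the", "cat", "sat", "on", "mat"]
print(bag_of_words("The cat sat on the mat", vocab))  # [2, 1, 1, 1, 1]
# "mat the on sat cat the" yields the same vector, since order is ignored.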
Published | 04/07/2024 | Markov models: An intermediate solution is to assume that the probability of observing a token is independent of the history, given the M previous tokens. What can a Marko…
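A minimal count-based sketch for the M = 1 (bigram) case, i.e. estimating p(x_n | x_{n-1}) from a toy corpus (illustrative only):

from collections import Counter, defaultdict

def fit_bigram(tokens):
    # Count transitions and normalize them into conditional probabilities.
    counts = defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        counts[prev][cur] += 1
    return {prev: {w: c / sum(nxt.values()) for w, c in nxt.items()}
            for prev, nxt in counts.items()}

tokens = "the cat sat on the mat".split()
model = fit_bigram(tokens)
print(model["the"])  # {'cat': 0.5, 'mat': 0.5}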
Published | 04/07/2024 | Neural network models: With increasing Markov order M, what actually increases? {{c1::Higher Markov order M increases the size of probability tables. Appr…
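To make the cloze concrete: with vocabulary size V (the value below is an assumption), a full conditional table has one row per M-token history, so its size grows like V to the power M and quickly explodes.

# Sketch: number of entries in a full table for p(x_n | x_{n-1}, ..., x_{n-M}).
V = 50_000                   # assumed vocabulary size, for illustration
for M in (1, 2, 3):
    print(M, V ** (M + 1))   # V**M histories, each with V entries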
Published | 04/07/2024 | Hidden Markov Model: HMMs are quite wasteful. Generally, not all sequences are possible, but p(x_n | x_{n-1}, ..., x_{n-M}) is specified for all x_{n-1}, ..., x_{n-…
Published | 04/07/2024 | Recurrent neural networks: Similarly, we can approximate y_n = p(x_n | x_{n-1}, ..., x_{n-M}) using a recurrent neural network (RNN, 1990). At each time step, the same n…
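A compact PyTorch sketch (toy sizes, not the lecture's model) of the "same network applied at each time step" idea, carrying a hidden state forward:

import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=8, hidden_size=16)   # the same cell is reused at every step
readout = nn.Linear(16, 100)                      # 100 = toy vocabulary size (assumption)

x = torch.randn(5, 3, 8)                          # (time steps, batch, input dim)
h = torch.zeros(3, 16)                            # initial hidden state
for t in range(x.shape[0]):
    h = cell(x[t], h)                             # hidden state carries the history forward
    logits = readout(h)                           # per-step prediction, e.g. of the next token
print(logits.shape)                               # torch.Size([3, 100])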
Published | 04/07/2024 | RNN Issues: There are some issues: Backpropagation: {{c1::backpropagation through time suffers from vanishing/exploding gradients. Vanishing gradient ari…
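A toy numeric illustration of the vanishing/exploding point (numbers are arbitrary): backpropagation through time picks up one multiplicative factor per step, so a factor below or above 1 shrinks or grows the gradient exponentially with sequence length.

# Sketch: a per-step gradient factor compounds over T time steps.
for factor in (0.9, 1.1):
    for T in (10, 100, 1000):
        print(factor, T, factor ** T)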
Published | 04/07/2024 | Attention - Transformers: What is the 'big thing' that Transformers do, the 'attention' part? {{c1::Learns pairwise relations between words, i.e. which wo…
Published | 04/08/2024 | Attention implementation: How do we implement 'attention' in code? class MultiheadDotProductAttention(nn.Module): """Multihead dot pro…
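The card's code is cut off; below is my own minimal multi-head scaled dot-product self-attention sketch in PyTorch. The class name, sizes, and layout are illustrative and not the deck's MultiheadDotProductAttention.

import math
import torch
import torch.nn as nn

class MultiheadDotProductAttentionSketch(nn.Module):
    """Minimal multi-head scaled dot-product self-attention (illustrative sketch)."""

    def __init__(self, embed_dim, num_heads):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.qkv = nn.Linear(embed_dim, 3 * embed_dim)   # projects to queries, keys, values
        self.out = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):                                # x: (batch, tokens, embed_dim)
        B, N, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split the embedding into heads: (batch, heads, tokens, head_dim).
        q, k, v = [t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v)]
        # Pairwise token scores, scaled by sqrt(head_dim), softmax over the key positions.
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.head_dim), dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, D)
        return self.out(out)

x = torch.randn(2, 128, 48)                              # toy batch: 128 tokens, 48 channels
print(MultiheadDotProductAttentionSketch(embed_dim=48, num_heads=8)(x).shape)  # (2, 128, 48)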
Published | 04/08/2024 | Naïve transformer: We use tokens with positional embeddings. Attention with a normalization layer and skip connections. Training takes place on 'shifted'…
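A tiny sketch of what the truncated 'shifted' training is usually taken to mean (my reading, not confirmed by the card): inputs are the tokens up to position N-1 and the targets are the same sequence shifted by one, so each position predicts the next token.

import torch

tokens = torch.tensor([5, 9, 2, 7, 3])
inputs, targets = tokens[:-1], tokens[1:]
print(inputs.tolist(), targets.tolist())   # [5, 9, 2, 7] [9, 2, 7, 3]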
Published | 04/09/2024 | Increasing attention flexibility / performance: The naïve transformer calculates one attention matrix between all tokens, which makes it difficult to cap…
Published | 04/08/2024 | Improved transformer block: We combine communication between tokens (the attention) with communication between channels using a position-wise MLP (Multilay…
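A sketch of such a block in PyTorch (the pre-norm ordering and all sizes are my assumptions): attention mixes information across tokens, and a position-wise MLP mixes across channels, each wrapped in a layer norm and a skip connection.

import torch
import torch.nn as nn

class TransformerBlockSketch(nn.Module):
    """Sketch: token mixing (attention) then channel mixing (position-wise MLP)."""

    def __init__(self, dim, heads):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                       # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]           # communication between tokens
        x = x + self.mlp(self.norm2(x))         # communication between channels, per position
        return x

x = torch.randn(2, 10, 32)
print(TransformerBlockSketch(32, 4)(x).shape)   # torch.Size([2, 10, 32])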
Published | 04/09/2024 | Learning relevant context from data: The original 'attention' mechanism was introduced in 2015 in the context of translation. What made this model special? …
Published | 04/08/2024 | Transformer architecture for sequence learning: The transformer architecture was introduced by the landmark paper: {{c1::"Attention is all you need…
Published | 04/09/2024 | Transformer prerequisites: To stack layers, the input and output dimensions of a layer should match: X' = TransformerLayer(X). Within each layer, what type of exc…
Published | 04/08/2024 | Attention weights: Within each transformer layer, sequence context gets assigned a weight ● every position is weighted by all other positions. What does th…
Published | 04/08/2024 | Matrix question: What is the size (shape) of the attention weights? {{c1::The attention weights are a square N x N matrix, as each token gets compared to…
Published | 04/08/2024 | Self-Attention: For each item in the input sequence we compute three things. What are those? {{c1::For each item in the input sequence we compute a…
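A shape-level sketch of the three quantities and of the resulting N x N attention weights, using the 128-token, 6-dimensional setup from the 'Question about windows' card below (keeping the projection size at 6 is my assumption):

import torch
import torch.nn as nn

N, d = 128, 6                           # context window and embedding size from the card
X = torch.randn(N, d)                   # token embeddings
W_q, W_k, W_v = (nn.Linear(d, d, bias=False) for _ in range(3))

Q, K, V = W_q(X), W_k(X), W_v(X)        # query, key, value: each (128, 6)
A = torch.softmax(Q @ K.T / d ** 0.5, dim=-1)   # attention weights: (128, 128), rows sum to 1
out = A @ V                             # contextualised representations: (128, 6)
print(Q.shape, A.shape, out.shape)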
Published | 04/08/2024 | Parameterizing attention weights: We have the Query, Key and Value. What do they represent? {{c1::Query: What I am looking for. Key: This is what I am look…
Published | 04/08/2024 | Question about windows: For a context window of 128 tokens, each embedded into 6 dimensions, give the dimensions of the following matrices: ● W(q), …
Published | 04/08/2024 | Positional embedding: How does the 'vanilla' attention mechanism 'look' at sentence order with respect to input permutations? {{c1::The vanilla attention…
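A quick check of the point behind this cloze (my own sketch): without positional information, self-attention treats the input as a set, so permuting the tokens just permutes the outputs in the same way.

import torch
import torch.nn as nn

torch.manual_seed(0)
attn = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True).eval()
x = torch.randn(1, 5, 8)
perm = torch.randperm(5)
with torch.no_grad():
    y = attn(x, x, x)[0]
    y_perm = attn(x[:, perm], x[:, perm], x[:, perm])[0]
print(torch.allclose(y[:, perm], y_perm, atol=1e-5))  # True: permutation-equivariant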
Published | 04/08/2024 | Positional embedding - Options: Multiple options exist for adding position information. Name one: {{c1::One option is to treat each position as a categor…
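A sketch of the 'position as a categorical variable' option, assuming it refers to a learned per-position embedding table (toy sizes): position indices are looked up like tokens and added to the token embeddings.

import torch
import torch.nn as nn

vocab_size, max_len, dim = 100, 128, 32        # toy values (assumptions)
tok_emb = nn.Embedding(vocab_size, dim)
pos_emb = nn.Embedding(max_len, dim)           # one learned vector per position index

tokens = torch.randint(0, vocab_size, (2, 10)) # (batch, sequence length)
positions = torch.arange(tokens.shape[1])
x = tok_emb(tokens) + pos_emb(positions)       # broadcasts to (2, 10, 32)
print(x.shape)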