nlp

Some notes on natural language processing, focused on modern improvements based on deep learning.

nlp basics

  • basics come from the book “Speech and Language Processing” (Jurafsky & Martin)
  • language models - assign probabilities to sequences of words
    • ex. n-gram model - assigns probs to short sequences of words, known as n-grams (see the bigram sketch at the end of this bullet)
      • for full sentence, use markov assumption
    • eval: perplexity (PP) - inverse probability of the test set, normalized by the number of words (want to minimize it)
      • $PP(W_{test}) = P(w_1, …, w_N)^{-1/N}$
      • can think of this as the weighted average branching factor of a language
      • should only be compared across models w/ same vocab
    • vocabulary
      • sometimes closed, otherwise have unknown words, which get their own symbol
      • can fix the training vocab, or just choose the top words and treat the rest as unknown
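
A minimal sketch of the above, assuming a toy corpus, a bigram model with add-one smoothing, and perplexity computed in log space (corpus and names here are made up for illustration):

```python
import math
from collections import Counter

# toy corpus; in practice this would be a large tokenized dataset
corpus = [["<s>", "the", "cat", "sat", "</s>"],
          ["<s>", "the", "dog", "sat", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((sent[i], sent[i + 1]) for sent in corpus for i in range(len(sent) - 1))
V = len(unigrams)  # vocab size

def bigram_prob(prev, w):
    # markov assumption: P(w | history) ~ P(w | prev), with add-one smoothing
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

def perplexity(sent):
    # PP = P(w_1..w_N)^(-1/N), computed in log space for numerical stability
    log_p = sum(math.log(bigram_prob(sent[i], sent[i + 1])) for i in range(len(sent) - 1))
    N = len(sent) - 1  # number of predicted tokens
    return math.exp(-log_p / N)

print(perplexity(["<s>", "the", "cat", "sat", "</s>"]))
```
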
  • topic models (e.g. LDA) - apply unsupervised learning to large sets of text to learn sets of associated words (see the sketch below)
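
A small sketch of topic modeling with sklearn’s LatentDirichletAllocation; the toy documents and the choice of 2 topics are just for illustration:

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "dogs and cats are pets",
        "stocks fell as markets dropped", "investors sold shares today"]  # toy docs

counts = CountVectorizer(stop_words="english").fit(docs)
X = counts.transform(docs)                       # word counts per document

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# each topic is a distribution over words; print the top words per topic
vocab = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = topic.argsort()[-4:][::-1]
    print(f"topic {k}:", [vocab[i] for i in top])
```
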
  • embeddings - vectors for representing words
    • ex. count-based vectors (often tf-idf weighted) - built from counts of nearby words (big + sparse)
      • pointwise mutual info - instead of raw counts, ask whether 2 words co-occur more than we would expect by chance (see the PPMI sketch at the end of this bullet)
    • ex. word2vec - short, dense vectors
      • intuition: train classifier on binary prediction: is word $w$ likely to show up near this word? (algorithm also called skip-gram)
        • the weights are the embeddings
      • also GloVe, which is based on ratios of word co-occurrence probs
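
A rough sketch of the sparse, count-based side (positive PMI over a tiny made-up co-occurrence matrix); dense embeddings like word2vec / GloVe would instead be learned from data:

```python
import numpy as np

# toy word-by-word co-occurrence counts (rows = target words, cols = context words);
# in practice these come from counting nearby words over a large corpus
words = ["cat", "dog", "stock"]
C = np.array([[0., 8., 1.],
              [8., 0., 1.],
              [1., 1., 0.]]) + 1e-12          # tiny constant to avoid log(0)

P = C / C.sum()                     # joint probs p(w, c)
pw = P.sum(axis=1, keepdims=True)   # marginal p(w)
pc = P.sum(axis=0, keepdims=True)   # marginal p(c)

pmi = np.log2(P / (pw * pc))        # how much more than chance do w and c co-occur?
ppmi = np.maximum(pmi, 0)           # positive PMI: clip negative values to 0

print(ppmi.round(2))                # each row is a sparse-style embedding for a word
```
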
  • some tasks
    • tokenization
    • pos tagging
    • named entity recognition
      • nested entity recognition - not just proper names, but also nested mentions (e.g. an entity like “Jacob’s brother”)
    • sentiment classification
    • language modeling (i.e. text generation)
    • machine translation
    • hardest: coreference resolution
    • question answering
    • natural language inference - does one sentence entail another?
  • most popular datasets
    • (by far) WSJ
    • then twitter
    • then Wikipedia
  • eli5 has nice text highlighting for interp

dl for nlp

  • some recent topics based on this blog
  • rnns
    • when training rnn, accumulate gradients over sequence and then update all at once
    • stacked rnns feed the outputs of one rnn into another rnn
    • bidirectional rnn - one rnn runs left to right and another right to left (can concatenate, add, etc.; see the sketch after this bullet)
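
A small PyTorch sketch of a stacked + bidirectional rnn (here an LSTM); the sizes are made up:

```python
import torch
import torch.nn as nn

# hypothetical sizes, just for illustration
batch, seq_len, emb_dim, hidden = 4, 10, 32, 64

x = torch.randn(batch, seq_len, emb_dim)          # a batch of embedded sequences

# 2 stacked layers (outputs of layer 1 feed into layer 2), run in both directions
rnn = nn.LSTM(input_size=emb_dim, hidden_size=hidden,
              num_layers=2, bidirectional=True, batch_first=True)

out, (h_n, c_n) = rnn(x)
print(out.shape)   # (batch, seq_len, 2 * hidden): forward and backward states concatenated
print(h_n.shape)   # (num_layers * 2, batch, hidden)
```
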
  • standard seq2seq
    • encoder reads input and outputs context vector (the hidden state)
    • decoder (rnn) takes this context vector and generates a sequence (see the sketch after this bullet)
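
A minimal seq2seq sketch in PyTorch, assuming made-up vocab sizes and teacher forcing; the encoder’s final hidden state serves as the context vector:

```python
import torch
import torch.nn as nn

src_vocab, tgt_vocab, emb, hidden = 100, 120, 32, 64   # made-up sizes

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(src_vocab, emb)
        self.rnn = nn.GRU(emb, hidden, batch_first=True)

    def forward(self, src):
        _, h = self.rnn(self.embed(src))
        return h                       # final hidden state = context vector

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(tgt_vocab, emb)
        self.rnn = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, tgt, context):
        # condition generation on the encoder's context vector
        o, _ = self.rnn(self.embed(tgt), context)
        return self.out(o)             # logits over the target vocab at each step

src = torch.randint(0, src_vocab, (2, 7))   # (batch, src_len)
tgt = torch.randint(0, tgt_vocab, (2, 5))   # (batch, tgt_len), teacher forcing
logits = Decoder()(tgt, Encoder()(src))
print(logits.shape)                         # (2, 5, tgt_vocab)
```
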
  • misc papers

attention / transformers

  • self-attention layer implementation and mathematics

  • **self-attention** - layer that lets each word learn its relation to the other words in the sequence (see the scaled dot-product sketch at the end of this bullet)
    • for each word, want score telling how much importance to place on each other word (queries $\cdot$ keys)
    • we get an encoding for each word
      • the encoding of each word returns a weighted sum of the values of the words (the current word gets the highest weight)
      • softmax this and use it to do a weighted sum of the values
    • (optional) implementation details
      • multi-headed attention - just like having many filters, get many encodings for each word
        • each one can take input as the embedding from the previous attention layer
      • position vector - add this into the embedding of each word (so words know how far apart they are) - usually use sin/cos rather than actual position number
      • padding mask - mask out the padded positions (zeros added to the end of the sequence) so attention ignores them
      • look-ahead mask - might want to mask to only use previous words (e.g. if our final task is decoding)
      • residual + normalize - after self-attention layer, often have residual connection to previous input, which gets added then normalized
    • decoder - each word only allowed to attend to previous positions
    • 3 components
      • queries
      • keys
      • values
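
A sketch of single-head scaled dot-product self-attention in numpy, with an optional look-ahead (causal) mask; the random weights and sizes are placeholders:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv, mask=None):
    """Single-head self-attention over a sequence X of shape (seq_len, d_model)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv            # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])     # importance of each word for each other word
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # e.g. look-ahead mask for decoding
    weights = softmax(scores)                   # one distribution per word
    return weights @ V                          # each output is a weighted sum of the values

seq_len, d_model, d_k = 5, 16, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))

causal = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # attend only to previous positions
print(self_attention(X, Wq, Wk, Wv, mask=causal).shape)    # (seq_len, d_k)
```
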
  • attention
    • encoder reads input and outputs a context vector after each word
    • decoder at each step uses a different weighted combination of these context vectors
      • specifically, at each step, decoder concatenates its hidden state w/ the attention vector (the weighted combination of the context vectors)
      • this is fed to a feedforward net to output a word (see the decoder-step sketch at the end of this bullet)
    • at a high level we have $Q, K, V$ and compute $softmax(QK^T / \sqrt{d_k})V$
      • instead could simplify it and do $softmax(XX^T)V$ - the attention weights are then just a kernel (similarity) between the inputs
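
A sketch of one decoder step with attention over the encoder’s per-word context vectors; the dot-product scoring and the made-up output layer are just illustrative choices:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# encoder produced one context (hidden-state) vector per source word
enc_states = np.random.randn(7, 64)     # (src_len, hidden)
dec_hidden = np.random.randn(64)        # decoder hidden state at the current step

scores = enc_states @ dec_hidden        # dot-product scoring (other score functions possible)
weights = softmax(scores)               # how much to attend to each source word at this step
attn_vector = weights @ enc_states      # weighted combination of the context vectors

# concatenate with the decoder hidden state and feed to a feedforward layer -> output word
combined = np.concatenate([dec_hidden, attn_vector])   # (2 * hidden,)
W_out = np.random.randn(1000, 2 * 64)                  # hypothetical vocab of 1000 words
word_logits = W_out @ combined
print(word_logits.argmax())             # index of the predicted word
```
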
  • transformer
    • uses many self-attention layers
    • many stacked layers in encoder + decoder (not rnn: self-attention + feed forward)
    • details
      • initial encoding: each word -> vector
      • each layer takes a list of vectors of fixed size (a hyperparameter, e.g. the length of the longest sentence) and outputs a list of the same size (one output vector per word)
        • can train by masking a word and predicting the word at the masked position from its encoding (see the encoder sketch after this bullet)
    • multi-headed attention has several sets of these queries/keys/values - each head gives an encoding, then just concat them
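
A sketch of a small transformer encoder using PyTorch’s built-in layers plus sin/cos positional encodings; all sizes are made up:

```python
import math
import torch
import torch.nn as nn

d_model, nhead, seq_len, vocab = 64, 4, 10, 1000   # made-up sizes

def positional_encoding(seq_len, d_model):
    # sin/cos position vectors that get added to the word embeddings
    pos = torch.arange(seq_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

embed = nn.Embedding(vocab, d_model)
layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)   # stack of self-attention + feedforward

tokens = torch.randint(0, vocab, (2, seq_len))         # (batch, seq_len)
x = embed(tokens) + positional_encoding(seq_len, d_model)
out = encoder(x)
print(out.shape)                                       # (2, seq_len, d_model): one vector per word
```
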
  • recent papers
  • these ideas are starting to be applied to vision cnns