CS224N-Lec15

Natural Language Generation

NLG

  • subcomponent of
    • Machine Translation
    • summarization
    • dialogue
    • Freeform question answering (i.e. the answer is not only extracted from the context)
    • Image Captioning

Recap

  • Language modeling? the task of predicting the next word: \[P(y_t|y_1, \dots, y_{t-1})\]

  • Language model

  • RNN-LM

  • Conditional Language Modeling \[P(y_t|y_1, \dots, y_{t-1}, x)\]

    • what is x? the conditioning input (e.g. a source sentence or context).
    • Examples:
      • Machine Translation (x = source sentence, y = target sentence)
      • Summarization (x = input text, y = summary)
      • Dialogue (x = dialogue history, y = next utterance)
  • training an RNN-LM? \[J = \dfrac{1}{T}\sum\limits_{t=1}^T J_t, \quad J_t = -\log P(y_t|y_1, \dots, y_{t-1}, x)\]

    • "Teacher Forcing": always use the gold to feed into the decoder
  • decoding algorithms

    • Greedy decoding: take the argmax of \(P_t\) at each step
    • Beam search: aims to find a high-probability sequence (see the beam-search sketch after the summary list)
      • keep track of the k most probable partial sequences (hypotheses)
      • k is the beam size (e.g. 2)
      • when a stopping criterion is reached, output the highest-scoring complete hypothesis
      • what's the effect of changing k?
        • k=1: greedy decoding
        • larger k: considers more hypotheses, but is more computationally expensive
          • for NMT, increasing k too much decreases BLEU; the main reason is that it produces overly short translations
          • for chit-chat dialogue, larger k produces overly generic responses
    • sampling-based decoding (see the decoding sketch after the summary list)
      • pure sampling: randomly sample from \(P_t\) at each step, instead of taking the argmax as in greedy decoding
      • top-n sampling: truncate \(P_t\) to the n most probable words and randomly sample from that; n is another hyperparameter
        • increasing n: more diverse and risky output
        • decreasing n: more generic and safe output
    • Softmax temperature -- not actually a decoding algorithm, but a technique applied at test time in conjunction with a decoding algorithm
      • apply a temperature hyperparameter \(\tau\) to the softmax: \(P_t(w) = \dfrac{\exp(s_w/\tau)}{\sum_{w' \in V}\exp(s_{w'}/\tau)}\)
      • larger \(\tau\): \(P_t\) becomes more uniform, so output is more diverse (probability is spread across the vocabulary); smaller \(\tau\): \(P_t\) becomes more peaked, so output is less diverse
  • Decoding algorithms: summary

    • Greedy
    • Beam search
    • Sampling methods
    • Softmax temperature
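
To make the teacher-forcing objective above concrete, here is a minimal PyTorch sketch; the toy decoder, the sizes, and the zero initial hidden state standing in for an encoded source are all assumptions for illustration. At each step the decoder is fed the gold prefix, and the loss is the average per-step cross-entropy \(J_t = -\log P(y_t|y_1, \dots, y_{t-1}, x)\).

```python
import torch
import torch.nn as nn

# Hypothetical conditional RNN-LM decoder: it consumes the gold prefix
# y_1..y_{t-1} (teacher forcing) plus an initial state from the source x,
# and predicts y_t at every position.
class ToyDecoder(nn.Module):
    def __init__(self, vocab_size, hidden):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, y_prefix, h0):
        emb = self.embed(y_prefix)           # (B, T, H)
        states, _ = self.rnn(emb, h0)        # (B, T, H)
        return self.out(states)              # next-word logits, (B, T, V)

B, T, V, H = 2, 5, 100, 64
decoder = ToyDecoder(V, H)
gold = torch.randint(0, V, (B, T + 1))       # gold sequence y_0 (<s>) .. y_T
h0 = torch.zeros(1, B, H)                    # stand-in for the encoded source x

# Teacher forcing: feed gold y_0..y_{T-1}, predict y_1..y_T.
logits = decoder(gold[:, :-1], h0)
# J = (1/T) * sum_t J_t, where J_t is the cross-entropy of the gold next word.
loss = nn.functional.cross_entropy(logits.reshape(-1, V), gold[:, 1:].reshape(-1))
loss.backward()
```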
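
A minimal beam-search sketch, assuming a hypothetical `step_logits(prefix)` callable that returns next-token logits for a given prefix of token ids (in a real NMT system this would be the decoder conditioned on x). It keeps the k most probable partial hypotheses at each step and length-normalises final scores so longer outputs are not unfairly penalised.

```python
import torch

def beam_search(step_logits, bos_id, eos_id, k=2, max_len=20):
    """Minimal beam search over a hypothetical next-token scoring function."""
    # Each hypothesis is (token_ids, cumulative log-probability).
    beams = [([bos_id], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            log_probs = torch.log_softmax(step_logits(tokens), dim=-1)
            top_lp, top_ids = log_probs.topk(k)
            for lp, idx in zip(top_lp.tolist(), top_ids.tolist()):
                candidates.append((tokens + [idx], score + lp))
        # Keep only the k most probable partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:k]:
            if tokens[-1] == eos_id:
                finished.append((tokens, score))   # hypothesis is complete
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)
    # Length-normalise so longer hypotheses are not unfairly penalised.
    return max(finished, key=lambda c: c[1] / len(c[0]))[0]

# Toy usage (assumptions: vocab of 50, ids 0/1 used as <s>/</s>).
vocab = 50
fake_model = lambda prefix: torch.randn(vocab)   # toy stand-in for a trained model
print(beam_search(fake_model, bos_id=0, eos_id=1, k=2))
```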
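
A sketch of greedy and sampling-based decoding with softmax temperature, reusing the same hypothetical `step_logits` interface; the parameter names `mode`, `n`, and `temperature` are illustrative, not from the lecture.

```python
import torch

def decode(step_logits, bos_id, eos_id, max_len=20, mode="greedy", n=10, temperature=1.0):
    """Greedy, pure-sampling, and top-n sampling decoding with temperature."""
    tokens = [bos_id]
    for _ in range(max_len):
        logits = step_logits(tokens) / temperature       # larger tau -> flatter P_t
        probs = torch.softmax(logits, dim=-1)
        if mode == "greedy":
            next_id = int(probs.argmax())                # argmax at each step
        elif mode == "pure":
            next_id = int(torch.multinomial(probs, 1))   # sample from the full P_t
        elif mode == "top-n":
            top_p, top_ids = probs.topk(n)               # truncate to the n most probable words
            next_id = int(top_ids[torch.multinomial(top_p / top_p.sum(), 1)])
        else:
            raise ValueError(mode)
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

vocab = 50
fake_model = lambda prefix: torch.randn(vocab)           # toy stand-in for a trained LM
print(decode(fake_model, bos_id=0, eos_id=1, mode="top-n", n=5, temperature=0.7))
```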

Section 2: NLG tasks and neural approaches

Summarization

  • definition: x -> y, where y is shorter and contains the main information of x
  • examples:
    • Gigaword: first one or two sentences of a news article -> headline (i.e. sentence compression)
    • LCSTS (Chinese microblogging): paragraph -> sentence summary
    • ...
  • Sentence simplification:
    • a different but related task
    • rewrite the source text in a simpler (sometimes shorter) way
    • examples:
      • Simple Wikipedia
      • Newsela: news rewriting for children
  • summarization: 2 main strategies
    • extractive: select (parts of) sentences from the original text, like a highlighter
    • abstractive: generate new text, like writing with a pen
  • summarization evaluation: ROUGE (a ROUGE-n recall sketch follows this list)
    • like BLEU, based on n-gram overlap

    • but no brevity penalty

    • ROUGE based on recall while BLEU based on precision

    • BLEU is a single number combining the precisions for n=1,2,3,4 n-grams

    • ROUGE scores are reported separately: ROUGE-1 (unigram overlap), ROUGE-2 (bigram overlap), ROUGE-L (longest common subsequence overlap)

  • Neural summarization:
    • seq2seq + attention NMT
    • Reinforcement learning
  • neural: copy mechanism (a sketch of the mixture follows this list)
    • mix a probability of generating a word from the vocabulary with a probability of copying a word from the source
    • P_gen: is the generate-vs-copy switch hard (0/1) or soft?
    • Problem:
      • they copy too much: what should be an abstractive system collapses into a mostly extractive one
      • bad at overall content selection, especially if the input is long
      • no overall strategy for selecting content
  • better content selection
    • 2 stages: content selection & surface realization
    • standard seq2seq+attention mixes the two stages: content selection is done only word-by-word, via attention
    • so there is no global content selection strategy
    • One solution: bottom-up summarization
  • Bottom-up summarization
    • content selection stage: a neural sequence-tagging model marks each source word as include or don't-include
    • bottom-up attention stage: run seq2seq+attention with the don't-include words masked out, so attention can only select from the included words
  • Neural summarization via RL
    • main idea: use RL to directly optimize ROUGE-L (which is non-differentiable, so it cannot be optimized by standard maximum-likelihood training)
    • better in practice (on both ROUGE and human judgment): combine the ML and RL objectives
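
A minimal sketch of ROUGE-n as recall over n-gram overlap, referenced from the evaluation bullets above; ROUGE-L, which uses the longest common subsequence, is omitted, and the toy sentences are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(reference, candidate, n):
    """Recall-oriented overlap: |matching n-grams| / |reference n-grams|."""
    ref, cand = ngrams(reference, n), ngrams(candidate, n)
    overlap = sum((ref & cand).values())
    return overlap / max(sum(ref.values()), 1)

reference = "the cat sat on the mat".split()
summary = "the cat sat".split()
print(rouge_n_recall(reference, summary, 1))   # ROUGE-1 recall = 3/6
print(rouge_n_recall(reference, summary, 2))   # ROUGE-2 recall = 2/5
```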
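
And a sketch of the copy-mechanism mixture referenced above, in the pointer-generator style: the final distribution interpolates the decoder's generation distribution with a copy distribution built from the attention weights over the source, using a soft \(p_{gen}\) switch. The tensor sizes and random scores are toy assumptions.

```python
import torch

# Toy sizes (assumptions): source length 4, vocabulary of 10.
vocab_size, src_len = 10, 4
src_ids = torch.tensor([3, 7, 7, 2])                         # source token ids
p_vocab = torch.softmax(torch.randn(vocab_size), dim=-1)     # decoder's generation distribution
attention = torch.softmax(torch.randn(src_len), dim=-1)      # attention over source positions
p_gen = torch.sigmoid(torch.randn(1))                        # soft switch in (0, 1), not hard 0/1

# Copy distribution: each source word receives its attention mass.
p_copy = torch.zeros(vocab_size).scatter_add(0, src_ids, attention)

# Final distribution mixes generating from the vocabulary with copying:
#   P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum_{i: x_i = w} a_i
p_final = p_gen * p_vocab + (1 - p_gen) * p_copy
assert torch.allclose(p_final.sum(), torch.tensor(1.0))
```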

Dialogue