Physics of Transformers

July 27, 2024


Objective

We want to design numerous, controlled experiments and probe into the network, to statistically uncover the mechanisms behind how diverse AI tasks are accomplished via pretraining / finetuning. This is more like experimental physics than neuroscience.

What We Previously Knew about Transformers

Induction Heads

  • Seeing ...[A][B]... earlier in the context, some transformer heads are in charge of making the induction: when [A] reappears, predict [B] (see the sketch after this list)
  • "In-context Learning and Induction Heads" from Anthropic in 2022
  • Can fulfill tasks such as sequence copying, translation, and pattern matching
  • More complex induction heads may come from deeper layers of the transformer
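As a minimal illustration (my own sketch, not code from the paper or the Anthropic post), the induction rule on raw tokens amounts to: find the most recent earlier occurrence of the current token [A] and predict the token [B] that followed it.

```python
def induction_predict(tokens):
    """Toy induction rule: having seen ...[A][B]... earlier, when [A]
    reappears as the current token, predict [B] (the token that followed
    the most recent earlier occurrence of [A])."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan earlier positions backwards
        if tokens[i] == current:
            return tokens[i + 1]               # copy the successor of the match
    return None                                # no earlier occurrence -> no guess

# Example: after "D A B C A", the induction prediction is "B"
print(induction_predict(["D", "A", "B", "C", "A"]))  # -> B
```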

Inhibition Heads & Name Mover Heads

  • "Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small"
  • "When Mary and John went to the store, John gave a drink to ___." The inhibition head is in charge of not using "John" again, since "John" has already appeared in the sentence.
  • Name Mover Head: moves (copies) a name from earlier in the sentence to the later position where it is needed
  • Backup Name Mover Head: if the Name Mover Head fails, the Backup Name Mover Head will move the name to the end of the sentence
  • The authors describe this as "to the best of our knowledge, ...the most detailed attempt at reverse-engineering a natural end-to-end behavior in a transformer-based language model."

Conclusion

It's very hard to reverse-engineer the behavior of a transformer-based language model.

Context Free Grammar

Demo

Rules: non-terminal symbols (lowercase) expand into sequences of symbols; terminal symbols (E, F, G) are uppercase
root -> a b c
root -> a d
a -> b E
a -> F c
b -> d d
b -> d G
c -> G G
d -> G F
d -> F G

Example:
root -> a b c
-> b E d d G G
-> d G E G F F G G G
-> G F G E G F F G G G
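As a quick sanity check of the demo grammar (my own sketch, with rule probabilities assumed uniform), strings like the one above can be sampled by recursively expanding non-terminals until only terminals remain:

```python
import random

# The toy CFG from the demo: lowercase symbols are non-terminals,
# uppercase symbols (E, F, G) are terminals.
RULES = {
    "root": [["a", "b", "c"], ["a", "d"]],
    "a":    [["b", "E"], ["F", "c"]],
    "b":    [["d", "d"], ["d", "G"]],
    "c":    [["G", "G"]],
    "d":    [["G", "F"], ["F", "G"]],
}

def sample(symbol="root"):
    """Expand `symbol` into a terminal string, choosing rules uniformly."""
    if symbol not in RULES:              # terminal: emit as-is
        return [symbol]
    out = []
    for s in random.choice(RULES[symbol]):
        out.extend(sample(s))
    return out

print(" ".join(sample()))  # e.g. "G F G E G F F G G G" (the derivation above)
```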

Property #1: Long-range Planning

  • Example: "...Y(...)[[...]]X".
  • X depends on Y, which could appear hundreds of tokens earlier
  • Note: this is not captured by the "induction head" mechanism

Property #2: Locally Ambiguous

  • GG can come from many different NTs
  • Example: "c -> GG", "bd -> dGGF", "bcd -> dGGGGF"
  • Bracket matching is far less locally ambiguous: e.g. ([])
  • English grammar is far less locally ambiguous: e.g. ADJP <=> RB JJ | JJ PP

Transformers Learning CFGs

Synthetic CFGs

  • The global parse tree is still unique - dynamic programming can recover the globally unique parse
  • We ask the transformer to learn the CFG without being given the rules, which is harder than the most naive version of dynamic programming
  • About 10^80 strings satisfy cfg3f among 10^140 possible strings of length 300 or more

Can We Use GPT2-Small to Learn CFGs?

  • 12 layers, 12 heads, hidden dimension 768 = 12 * 64
  • GPT-2 pretrained on cfg3f data (an effectively infinite amount of freshly sampled online data)
  • Generation accuracy: let the model generate from the "start" token (using temperature 1)
  • Completion accuracy: given the first 50 tokens of a CFG string as a prefix, let the model generate a completion (using temperature 1)
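A sketch of how these two metrics could be computed, assuming placeholder callables `model_generate` (returns the sampled continuation of a prefix at temperature 1) and `cfg_accepts` (ground-truth CFG membership check); this is not the paper's code.

```python
def generation_accuracy(model_generate, cfg_accepts, n_samples=1000):
    """Fraction of strings sampled from the start token that the CFG accepts."""
    ok = sum(cfg_accepts(model_generate(prefix=[])) for _ in range(n_samples))
    return ok / n_samples

def completion_accuracy(model_generate, cfg_accepts, corpus, cut=50):
    """Fraction of completions of a 50-token CFG prefix that still parse."""
    ok = 0
    for sentence in corpus:
        prefix = list(sentence[:cut])
        ok += cfg_accepts(prefix + model_generate(prefix=prefix))
    return ok / len(corpus)
```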

Results 1.0

  • GPT with relative or rotary positional embedding works much better (99 vs 64) than GPT with absolute positional embedding
  • First takeaway: we should abandon absolute positional embedding
  • GPT with position-based attention (Att_i->j depends only on i, j) and GPT with uniform attention (the h-th head looks back at the last 2^h - 1 tokens with uniform weights) did not perform quite as well (94 vs 99) as the state-of-the-art GPT; a sketch of uniform attention follows this list
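My reading of the uniform-attention baseline, as a sketch only (assuming the window includes the current token; the paper's exact construction may differ): head h at position i averages the last 2^h - 1 hidden states with equal weights, so the attention pattern depends only on positions, never on content.

```python
import numpy as np

def uniform_attention(hidden, h):
    """Content-independent attention: at each position i, head h averages
    the hidden states of the previous 2**h - 1 positions (uniform weights).
    `hidden` is a (seq_len, dim) array.  Illustrative sketch, not the
    paper's implementation."""
    seq_len, _ = hidden.shape
    window = 2 ** h - 1
    out = np.zeros_like(hidden)
    for i in range(seq_len):
        lo = max(0, i - window + 1)
        out[i] = hidden[lo : i + 1].mean(axis=0)
    return out

# Example: head h=3 averages the last 7 positions at every step
print(uniform_attention(np.random.randn(10, 4), h=3).shape)  # (10, 4)
```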

Results 1.2 - Generation Diversity

  • A concern: GPT might memorize the same sentences and repeat them
  • Entropy estimation (sketched below): average the negative sum of log conditional probabilities over sampled sentences. All models are on the same level.
  • Output generated by GPTs is as diverse as the ground truth
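A sketch of this entropy estimate, using placeholder callables `sample_sentence` and `log_prob` rather than the paper's API: sample sentences from the model at temperature 1 and average the negative sum of log conditional token probabilities.

```python
def estimate_entropy(sample_sentence, log_prob, n_samples=10_000):
    """Monte-Carlo entropy estimate: H ~ -(1/N) * sum_s log p(s), where
    log p(s) is the sum of the model's log conditional probabilities over
    the tokens of a sampled sentence s."""
    total = 0.0
    for _ in range(n_samples):
        s = sample_sentence()   # sample a full sentence at temperature 1
        total += log_prob(s)    # sum of log p(x_t | x_<t) over tokens of s
    return -total / n_samples
```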

Results 1.3 - Generation Distribution

  • Does the model generate strings with the correct probabilities compared to the ground-truth CFG distribution?
  • Vanilla GPT does poorly; GPT with relative attention and GPT with rotary embedding perform better.

How Transformers Learn CFGs

Evidence that Transformer Learns to do Dynamic Programming

  • Result 2: hidden states encode NT ancestor/boundary info.
  • Result 3: attention favors "adjacent" NT-end pairs i, j
  • DP = storage + recurrent formula
  • DP(i, j, a) = whether x_i, x_{i+1}, ..., x_j can be generated from symbol a following the CFG rules.
  • Thus, learning NT ancestors/boundaries -> learning a "backbone" of necessary DP(i, j, a) entries on the correct CFG parse (a sketch of this DP follows this list)
  • ... we (intentionally) did not probe those DP(i, j, a) not in the "backbone"
  • ... even when a human writes this DP, which non-backbone entries get computed depends on the pruning and the order of the loops
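For concreteness, a memoized top-down version of this DP for the toy grammar from the demo (my sketch, not the paper's implementation): DP(i, j, a) asks whether the substring x_i..x_j can be derived from symbol a, and the recursion tries every rule a -> s_1 ... s_k and every way of splitting the substring among the rule's symbols.

```python
from functools import lru_cache

# Same toy grammar as in the sampling sketch above.
RULES = {
    "root": [["a", "b", "c"], ["a", "d"]],
    "a":    [["b", "E"], ["F", "c"]],
    "b":    [["d", "d"], ["d", "G"]],
    "c":    [["G", "G"]],
    "d":    [["G", "F"], ["F", "G"]],
}

def make_dp(x):
    """Return DP(i, j, a): can symbol a derive the substring x[i..j] (inclusive)?"""
    @lru_cache(maxsize=None)
    def dp(i, j, a):
        if a not in RULES:                      # terminal: must match exactly
            return i == j and x[i] == a
        return any(splits(i, j, tuple(rhs)) for rhs in RULES[a])

    @lru_cache(maxsize=None)
    def splits(i, j, rhs):
        # Can the symbols in rhs jointly derive x[i..j]?
        if len(rhs) == 1:
            return dp(i, j, rhs[0])
        # The first symbol covers x[i..m]; the remaining symbols cover x[m+1..j].
        return any(dp(i, m, rhs[0]) and splits(m + 1, j, rhs[1:])
                   for m in range(i, j))

    return dp

x = "GFGEGFFGGG"                  # the example string derived in the demo
dp = make_dp(x)
print(dp(0, len(x) - 1, "root"))  # True: the string parses from "root"
```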

Robustly Pretrained GPT

  • Pretraining
    • Vanilla pretrained GPT trained on the correct CFG data
    • Robustly pretrained GPT trained on perturbed CFG data (with some fraction of the data perturbed; one possible perturbation is sketched after this list)
  • Results
    • Vanilla pretrained GPT is not very robust against corrupted prefixes, while robustly pretrained GPT is robust against corrupted prefixes
    • Robustly pretrained GPT learns to mode-switch between the correct and incorrect data at temperature 1.0
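One possible way to produce such perturbed pretraining data, as an illustrative sketch only (the paper's exact corruption scheme may differ): corrupt a fraction of the training strings by randomly flipping some of their terminal tokens.

```python
import random

TERMINALS = ["E", "F", "G"]   # terminals of the toy grammar above

def perturb_dataset(strings, corrupt_fraction=0.5, flip_prob=0.15):
    """Corrupt `corrupt_fraction` of the training strings by independently
    replacing each terminal with a random terminal with probability
    `flip_prob`.  Illustrative only; not the paper's perturbation."""
    out = []
    for s in strings:
        if random.random() < corrupt_fraction:
            s = [random.choice(TERMINALS) if random.random() < flip_prob else t
                 for t in s]
        out.append(list(s))
    return out
```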

Conclusion

🌟 Transformer learns a "backbone" of necessary DP(i, j, a) shared by all DP implementations

🌟 We should abandon absolute positional embedding

🌟 Masked language modeling (BERT) is incapable of learning the deep structure of the language, while language modeling (GPT) is capable of doing so

🌟 It's important to also include corrupted data (e.g., sentences with grammar mistakes) in the pretraining. At test time, a low temperature needs to be chosen; otherwise, the model will mode-switch between generating correct and incorrect data


References:

  1. https://arxiv.org/abs/2305.13673
  2. https://www.youtube.com/watch?v=kf_eGgVtOcs