Lecture 19
- neural networks are for learning better representations

Review: Transformer (LLM) General Structure
- an LLM has many encoders
- each encoder develops a better representation than the one before
- it does so by having an ensemble of NNs evaluate the information, then combining the info from all of them (a minimal sketch follows below)
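One concrete reading of that "ensemble of NNs" is the multi-head attention inside each encoder: several heads evaluate the same word vectors in parallel, and their outputs are concatenated and mixed back together. A minimal numpy sketch, with made-up dimensions and random (untrained) weights standing in for learned ones:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, n_heads=4, d_head=8, seed=0):
    """X: (n_words, d_model). Each head attends over all the words,
    then the heads' outputs are concatenated and projected back to d_model."""
    rng = np.random.default_rng(seed)
    n_words, d_model = X.shape
    head_outputs = []
    for _ in range(n_heads):
        # each head has its own query/key/value maps (random here, learned in a real model)
        Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        scores = Q @ K.T / np.sqrt(d_head)   # how relevant word j is to word i
        weights = softmax(scores, axis=-1)   # attention weights sum to 1 per word
        head_outputs.append(weights @ V)     # this head's view of the sentence
    Wo = rng.normal(size=(n_heads * d_head, d_model))
    return np.concatenate(head_outputs, axis=-1) @ Wo   # combine all the heads

X = np.random.default_rng(1).normal(size=(5, 16))   # 5 words, 16-dim word vectors
print(multi_head_attention(X).shape)                 # (5, 16): same shape, new representation
```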
Languages have two important kinds of structure:
- hierarchical/tree structure relating words
- word order, i.e., which word comes first
- transformers separate these two kinds of structure
- GPT LLMs predict each word from all the previous words (sketched after this list)
in order to do so, the model needs to know:
- hierarchical relation between each word and each other word
- position of each word
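A minimal sketch of "predict each word from all previous words": a causal mask keeps every word from attending to anything to its right. The attention scores here are random stand-ins, just to show the masking step:

```python
import numpy as np

n_words = 5
rng = np.random.default_rng(0)
scores = rng.normal(size=(n_words, n_words))   # toy attention scores between words

# word i may only look at words 0..i (itself and everything before it)
causal_mask = np.tril(np.ones((n_words, n_words), dtype=bool))
scores = np.where(causal_mask, scores, -np.inf)   # block attention to later words

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))   # upper triangle is all 0: no word sees its own future
```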
Sentence Input to the LLM
- run the one-hot encoding of each word through the embedding NN to get embedding vectors
- calculate a positional encoding for each word
- word vector = embedding + positional encoding (see the sketch after this list)
- for the LLM to "understand" the sentence, it passes through a sequence of encoders so that better representations are gradually found
- the representations capture the relation between each pair of words, regardless of their positions
- this is called syntactic distance
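A minimal sketch of the input step above, with a made-up toy vocabulary and dimensions; the positional encodings are placeholders here (the sinusoidal version appears at the end of these notes):

```python
import numpy as np

vocab = ["the", "cat", "sat"]                  # toy vocabulary
d_model = 8
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), d_model))     # the embedding NN, reduced to a lookup matrix

def one_hot(word):
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

sentence = ["the", "cat", "sat"]
pos_enc = rng.normal(size=(len(sentence), d_model))   # placeholder positional encodings

# word vector = embedding + positional encoding
word_vectors = np.stack([one_hot(w) @ E for w in sentence]) + pos_enc
print(word_vectors.shape)   # (3, 8): one vector per word, ready for the encoder stack
```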
Syntactic Distance
- tree siblings are closer than tree cousins, which are closer than second cousins, etc.
To find cognitive/syntactic distance:
- take any two words
- find the lowest node in the tree that dominates them both
- count the number of nodes on the route between the two words (including the lowest node dominating both of them); a sketch follows below
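A minimal sketch of that procedure on a toy tree written as nested tuples (the sentence and labels are made up); the counting convention follows the steps above:

```python
# toy tree as nested tuples: (label, children...); leaves are plain strings
tree = ("S",
        ("NP", "the", "cat"),
        ("VP", "chased", ("NP", "a", "dog")))

def path_to(node, word, path=()):
    """Return the sequence of internal nodes from the root down to `word`."""
    if isinstance(node, str):
        return path if node == word else None
    for child in node[1:]:
        found = path_to(child, word, path + (node,))
        if found is not None:
            return found
    return None

def syntactic_distance(tree, w1, w2):
    p1, p2 = path_to(tree, w1), path_to(tree, w2)
    # lowest common ancestor = last node shared by both root-to-word paths
    lca_depth = 0
    for a, b in zip(p1, p2):
        if a is not b:
            break
        lca_depth += 1
    # nodes on the route between the two words, counting the lowest dominating node once
    return (len(p1) - lca_depth) + (len(p2) - lca_depth) + 1

print(syntactic_distance(tree, "the", "cat"))   # 1: siblings under the same NP
print(syntactic_distance(tree, "cat", "dog"))   # 4: much farther apart in the tree
```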
What GPT accomplishes (approx)
- now, imagine that each word has a vector associated with it
- between any two vectors, we can measure the IP (inner product)
- the IP measures how close the two vectors are
- if we can move the vectors so that the IPs increasingly reflect the hierarchical distances, the vectors represent both the words AND the syntax tree through their interdistances (see the sketch below)
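A minimal sketch of what that would look like: compare inner products between toy word vectors with the tree distances from the previous sketch. The vectors here are random stand-ins, so they will not yet show the pattern; training would have to move them until small tree distance goes with large inner product:

```python
import numpy as np

words = ["the", "cat", "chased", "a", "dog"]
rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=8) for w in words}   # stand-ins for learned word vectors

# syntactic distances for the toy tree [S [NP the cat] [VP chased [NP a dog]]],
# as computed by the sketch in the previous section
tree_dist = {("the", "cat"): 1, ("a", "dog"): 1, ("chased", "dog"): 2,
             ("cat", "chased"): 3, ("cat", "dog"): 4}

# the (approximate) training goal: move the vectors so that pairs with a small
# tree distance end up with a large inner product; random vectors won't show it yet
for (w1, w2), d in tree_dist.items():
    ip = float(vectors[w1] @ vectors[w2])
    print(f"{w1}-{w2}: tree distance {d}, inner product {ip:+.2f}")
```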
Positional Encodings
- each word's position is represented by a wave
- the number of maxima in the wave represents the word order
- introduced by Vaswani et al. (2017), the paper at the start of LLMs
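A minimal sketch of the sinusoidal encoding from Vaswani et al. (2017): each position gets a vector of sine and cosine values at different frequencies, so different positions come out as different wave patterns (the sizes below are made up for illustration):

```python
import numpy as np

def positional_encoding(n_positions, d_model):
    """Sinusoidal positional encodings (Vaswani et al. 2017):
    PE[pos, 2i]   = sin(pos / 10000**(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000**(2i / d_model))"""
    pos = np.arange(n_positions)[:, None]        # (n_positions, 1)
    two_i = np.arange(0, d_model, 2)[None, :]    # the even dimension indices 2i
    angles = pos / (10000 ** (two_i / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                 # sines in the even dimensions
    pe[:, 1::2] = np.cos(angles)                 # cosines in the odd dimensions
    return pe

pe = positional_encoding(n_positions=10, d_model=16)
print(pe.shape)                  # (10, 16): one wave-based vector per word position
print(np.round(pe[1, :4], 2))    # position 1 and position 5 have clearly different
print(np.round(pe[5, :4], 2))    # patterns, which is what encodes the word order
```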
