Mikhail Breslav

Positional Embeddings are Strange

Recently I’ve been reviewing the “basics” of large language models and decided to finally peek into the details of positional embeddings, which I had ignored in the past. In this post I want to share what I’ve learned from reviewing this topic.

Positional Embedding Motivation

In the foundational Attention Is All You Need paper, positional embeddings are introduced as a way to add ordering information to token embeddings so that the transformer model has some way of understanding the order of the tokens. To state the somewhat obvious, we want language models to understand word order (and by extension token order) because word order impacts the semantics of what is being said.

A question arises:

Why does the transformer model need some additional mechanism to understand the order of words? Aren’t we already feeding the words into the model in an ordered way?

The reason is that the transformer architecture is built on self-attention, which produces the same set of output vectors regardless of input order. Conceptually, if the model produces the same set of vectors for every word ordering, how can it differentiate between the different meanings (or lack of meaning) represented by different permutations of a sentence? In short, the self-attention mechanism is why transformers need some way of encoding the order of input tokens.
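
To make the permutation point concrete, here is a minimal NumPy sketch (single head, no masking; the weights, shapes, and values are illustrative choices of mine, not anything from the paper) showing that self-attention without positional information simply permutes its outputs when you permute its inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # embedding dimension (illustrative)
X = rng.normal(size=(5, d))              # 5 token embeddings, no positional info
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

def self_attention(X):
    # Single-head scaled dot-product attention, no mask.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ V

perm = rng.permutation(len(X))
out_original = self_attention(X)
out_permuted = self_attention(X[perm])

# Permuting the input just permutes the rows of the output:
# the set of output vectors is identical, so order is invisible to the model.
print(np.allclose(out_permuted, out_original[perm]))       # True
```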

Sinusoidal Positional Embeddings

In the foundational paper mentioned above, the authors encode the absolute position of a token by constructing a \(d\)-dimensional vector composed of sine and cosine waves of varying frequencies. The sine or cosine wave used for dimension \(i\) has an angular frequency that depends on \(i\). The token’s absolute position in the input sequence is then plugged in as the argument of that sinusoid, yielding a concrete value for the dimension.
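
For reference, here is a small NumPy sketch of that construction (the function name and dimensions are my own illustrative choices), following the \(\sin(pos / 10000^{2i/d})\) and \(\cos(pos / 10000^{2i/d})\) formulation from the paper:

```python
import numpy as np

def sinusoidal_positional_embeddings(seq_len, d):
    """Row p is the positional embedding added to the token at position p."""
    positions = np.arange(seq_len)[:, None]               # (seq_len, 1)
    freqs = 1.0 / 10000 ** (np.arange(0, d, 2) / d)       # one frequency per dim pair
    angles = positions * freqs                            # (seq_len, d/2)
    pe = np.zeros((seq_len, d))
    pe[:, 0::2] = np.sin(angles)                          # even dims: sin(pos * freq)
    pe[:, 1::2] = np.cos(angles)                          # odd dims:  cos(pos * freq)
    return pe

pe = sinusoidal_positional_embeddings(seq_len=128, d=16)
```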

To lessen my confusion I took to Google and found several helpful blogs and videos which I will link below. There were a few concepts that I found interesting and helpful in gaining at least some intuition for these sinusoidal embeddings and I wanted to put them in my own words here.

Rotary Position Embedding (RoPE)

Since the Attention Is All You Need paper is 8 years old as of this writing, I also wanted to get a sense of what the state of the art looks like for encoding position. This led me to a popular paper, RoFormer: Enhanced Transformer with Rotary Position Embedding, which introduced Rotary Position Embedding (RoPE). RoPE has been used in models like Llama 3 from Meta.

RoPE’s approach to positional embeddings is derived from the objective of finding a function \(f\) that produces a dot product with a specific property. Specifically, if \(x\) is the word embedding of a token at position \(n\) and \(y\) is the word embedding of a token at position \(m\), then we would like the dot product of \(f(x, n)\) and \(f(y, m)\) to depend only on \(x\), \(y\), and their relative position \(m-n\).
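
One way to see what this property buys us is a quick numeric check in the simplest, 2-dimensional case, where \(f\) rotates a vector by an angle proportional to its position (the vectors and frequency below are arbitrary illustrative values):

```python
import numpy as np

def rotate(v, angle):
    # 2x2 rotation matrix applied to a 2-D vector.
    R = np.array([[np.cos(angle), -np.sin(angle)],
                  [np.sin(angle),  np.cos(angle)]])
    return R @ v

theta = 0.3                                    # arbitrary frequency
x = np.array([1.0, 2.0])                       # token embedding at position n
y = np.array([-0.5, 0.7])                      # token embedding at position m

# Two (n, m) pairs with the same relative offset m - n = 4.
d1 = rotate(x, 3 * theta) @ rotate(y, 7 * theta)
d2 = rotate(x, 10 * theta) @ rotate(y, 14 * theta)
print(np.isclose(d1, d2))                      # True: only the offset m - n matters
```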

The paper shows that when \(f\) rotates a vector by an angle proportional to its position, the dot product satisfies the desired objective. To apply \(f\) to a \(d\)-dimensional word embedding you would, in theory, construct a \(d \times d\) block-diagonal rotation matrix in which the rotation angle changes every two dimensions. In practice, applying \(f\) is efficient because the matrix is very sparse, so a full matrix multiply is not needed.
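
Here is a sketch of that pair-wise application (NumPy; the helper name is mine, while the \(10000^{-2i/d}\) frequency schedule follows the paper). Each \((2i, 2i+1)\) pair of dimensions is rotated with a couple of elementwise multiplies rather than a full matrix product:

```python
import numpy as np

def apply_rope(x, position):
    """Rotate each (2i, 2i+1) pair of x by position * freq_i,
    without building the d x d block-diagonal matrix."""
    d = x.shape[-1]
    freqs = 1.0 / 10000 ** (np.arange(0, d, 2) / d)       # one angle per 2-D pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin                # 2x2 rotation, pair by pair
    out[1::2] = x_even * sin + x_odd * cos
    return out

q = np.arange(8, dtype=float)           # a toy query/key vector
q_rotated = apply_rope(q, position=5)
```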

As before, there were several references that helped me better understand what RoPE is doing, and here are my main takeaways.

Conclusion

My main takeaway is that researchers have identified interesting mathematical tricks that fulfill the goal of allowing LLMs to understand the position of tokens (with a particular emphasis on relative position). As with much of ML, the success of an approach is primarily driven by how well it works in practice. Questions like “why does this work so well?” and “does this really make sense?” often require additional research. As an example, there is this paper that re-examines RoPE.

This post captures my partial, non-comprehensive understanding of this space, and there are still many aspects I don’t fully understand. In the interest of time, I’m moving on to reviewing the core attention mechanism of the transformer, but I think it’s fair to say that positional embeddings are kind of strange.

If you made it this far, thanks, and check out The Doors’ song People Are Strange, which partly inspired the title of this post.

Lingering Questions

Some questions that came to mind during my research:

References

These are the references I found to be helpful: