/blue or, “LLM embeddings and outlier dimensions”
Is 1536 a large number?
/xantham it depends.
When looking at a 1536-dimension vector, sometimes 1536 is a large number. Other times it is not a large number.
In particular: when it explodes into 2**1536 options, it is a very large number, in ways that most people don’t truly appreciate. When it is just a linear 1536, it is larger than numbers like 3, but in mundane ways that often don’t matter.
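For a sense of scale, a throwaway check (plain arithmetic, nothing embedding-specific):

```python
# 2**1536 is the number of sign patterns a 1536-dimensional vector can take;
# it has 463 decimal digits.
print(len(str(2 ** 1536)))  # 463
```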
When looking at the 1536-dimension embeddings generated by OpenAI’s “text-embedding-ada-002”, one thing immediately jumps out: the dimension with the highest magnitude is dimension 194. For almost every term. The mean value over the roughly 7500 terms I queried was -0.67. For a unit-norm vector with 1536 dimensions, where an even spread would put each coordinate near 1/sqrt(1536) ≈ 0.026, this is very large.
/yellow the dimension with the second-largest mean magnitude was 954, averaging around +0.21; the third was 1120, at about -0.16. Most dimensions average between -0.01 and +0.01, which is what one would expect.
/green everything is 0-indexed; the dimensions are numbered 0 to 1535
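For concreteness, a minimal sketch of the kind of measurement described above (not my exact script). It assumes the embeddings have already been fetched and stacked into a NumPy matrix, one row per input string; the function name and the `top_k` parameter are illustrative, not from any library.

```python
import numpy as np

def outlier_dimensions(embeddings: np.ndarray, top_k: int = 3):
    """embeddings: shape (n_inputs, 1536), each row already unit-norm."""
    means = embeddings.mean(axis=0)        # per-dimension mean value
    order = np.argsort(-np.abs(means))     # rank dimensions by |mean|, descending
    return [(int(d), float(means[d])) for d in order[:top_k]]

# Over ~7500 Wikipedia titles, this is the measurement that puts dimension 194
# first (mean near -0.67), then 954 (~+0.21) and 1120 (~-0.16).
```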
/red it is tempting to claim this is “an obvious bug” and “low-hanging fruit to fix”. There are two problems with this approach.
First, we should check whether the behavior comes from similarities in the input dataset. Which, initially, was “the titles of the 5000 most-important Wikipedia articles”.
But inputs wildly different from “Cotton” or “Abraham Lincoln” give similar results. The first sentence from Harry Potter gives similar results. A code snippet gives similar results. A simple sentence in Chinese gives similar results. A string consisting of five cow emoji gives similar results.
I have not been able to find an input that produces a dimension-194 value outside of (-1, -0.6).
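The spot check itself looks roughly like this. It assumes the pre-1.0 `openai` Python client (with `OPENAI_API_KEY` set in the environment); the probe strings stand in for the inputs described above.

```python
import openai  # pre-1.0 client; reads OPENAI_API_KEY from the environment

probes = [
    "Cotton",
    "Abraham Lincoln",
    "Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say "
    "that they were perfectly normal, thank you very much.",
    "def add(a, b):\n    return a + b",
    "这是一个简单的句子。",  # "This is a simple sentence."
    "🐮🐮🐮🐮🐮",
]

resp = openai.Embedding.create(model="text-embedding-ada-002", input=probes)
for text, item in zip(probes, resp["data"]):
    # per the observation above, every value should land in roughly (-1, -0.6)
    print(f"{item['embedding'][194]:+.3f}  {text[:40]!r}")
```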
/red the second reason not to be concerned is that this is self-correcting. If you were to remove dimension 194 and re-normalize, you would get half a bit of extra precision on the other dimensions. Which isn’t nothing. But it is one of those minor optimizations that can be delayed almost indefinitely. And it might not even be an optimization; having a few dimensions act as a “slush factor” might be optimal.
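The “half a bit” comes from straightforward arithmetic; a back-of-the-envelope check:

```python
import math

m = 0.67                              # typical |value| of dimension 194
residual_norm = math.sqrt(1 - m**2)   # norm left in the other 1535 dimensions
scale = 1 / residual_norm             # gain from re-normalizing without dim 194
print(f"scale ~ {scale:.3f}, extra precision ~ {math.log2(scale):.2f} bits")
# scale ~ 1.347, extra precision ~ 0.43 bits -- roughly half a bit
```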
/mogue all this means in practice is that, once I get a “sufficiently representative” dataset, I will have to mean+stddev normalize the values. The reasons why a norm-1 vector is necessary don’t apply to this analysis.
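A sketch of that normalization, assuming the same row-per-input NumPy matrix as above (the function name is mine):

```python
import numpy as np

def zscore_normalize(embeddings: np.ndarray) -> np.ndarray:
    """Per-dimension mean+stddev normalization; embeddings: (n_inputs, 1536)."""
    mu = embeddings.mean(axis=0)
    sigma = embeddings.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against any constant dimension
    return (embeddings - mu) / sigma
```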