a course in LLM embeddings (1/17)

Is 1536 a large number?

When looking at a 1536-dimensional vector, sometimes 1536 is a large number. Other times it is not.

In particular: when it explodes into 2**1536 options, it is a very large number, in ways that most people don’t truly appreciate. When it is just a linear 1536, it is larger than numbers like 3, but in mundane ways that often don’t matter.
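To make the first of those senses concrete: even just the sign pattern of a 1536-dimensional vector (is each coordinate positive or negative?) already has 2**1536 possibilities. A quick back-of-the-envelope in Python:

```python
# 2**1536 written out in decimal is a 463-digit number.
print(len(str(2**1536)))  # 463

# For comparison, the number of atoms in the observable universe
# is usually estimated at around 10**80, an 81-digit number.
```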


When looking at the 1536-dimensional embeddings generated by OpenAI’s “text-embedding-ada-002”, one thing immediately jumps out: for almost every term, the dimension with the highest magnitude is dimension 194. The mean value at that dimension, over the roughly 7500 terms I queried, was -0.67. For a vector with 1536 dimensions and norm 1, this is very large: if the unit norm were spread evenly, a typical dimension would have magnitude 1/sqrt(1536) ≈ 0.026, and a single value of -0.67 accounts for about 45% of the squared norm by itself.
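Here is a minimal sketch of the query behind those numbers, assuming the v1-style openai Python SDK and an OPENAI_API_KEY in the environment. (Whether the peak shows up at index 193 or 194 depends on whether you count dimensions from 0 or 1.)

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

resp = client.embeddings.create(
    model="text-embedding-ada-002",
    input="Abraham Lincoln",
)
vec = resp.data[0].embedding  # 1536 floats, normalized to unit length

# Which dimension has the largest magnitude, and what is its value?
peak = max(range(len(vec)), key=lambda i: abs(vec[i]))
print(peak, vec[peak])  # expect dimension 194 and a value near -0.67
```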

First, we must look for similarities in the input dataset, which initially was “the titles of the 5000 most-important Wikipedia articles”.

But terms wildly different from “Cotton” or “Abraham Lincoln” give similar results. The first sentence from Harry Potter gives similar results. A code snippet gives similar results. A simple sentence in Chinese gives similar results. A string consisting of five cow emoji gives similar results.
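In code, that experiment looks something like the sketch below. The Harry Potter sentence and cow emoji match the inputs described above; the code snippet and Chinese sentence are stand-ins of the same flavor.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

inputs = [
    "Cotton",
    "Abraham Lincoln",
    # The first sentence from Harry Potter:
    "Mr. and Mrs. Dursley, of number four, Privet Drive, were proud to say "
    "that they were perfectly normal, thank you very much.",
    "def double(x):\n    return x * 2",  # a stand-in code snippet
    "今天天气很好。",  # a stand-in Chinese sentence: "The weather is very nice today."
    "🐄🐄🐄🐄🐄",  # five cow emoji
]

resp = client.embeddings.create(model="text-embedding-ada-002", input=inputs)
for text, item in zip(inputs, resp.data):
    # Print the value at dimension 194 for each input.
    print(f"{item.embedding[194]:+.3f}  {text[:48]!r}")
```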

I have not been able to find an input whose embedding has a value outside of (-1, -0.6) in dimension 194.