Turing Award winner Geoffrey Hinton famously said "To deal with a 14-dimensional space, visualize a 3-D space and say 'fourteen' to yourself very loudly." While funny, this advice is unfortunately not practically useful. What is practical, however, is using an embedding space visualizer to orient yourself in vector space.
Embeddings encode semantic information about objects into Euclidean space. When you encode two documents of text, say document $d_1$ and document $d_2$, with an embedding service like OpenAI, you can use vector operations to answer questions about the documents, such as how semantically similar they are.
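For example, the cosine similarity between two embedding vectors is a standard proxy for semantic similarity. Here is a minimal sketch in NumPy, assuming `emb_d1` and `emb_d2` hold the embedding vectors for $d_1$ and $d_2$ (the names are placeholders, not part of any API; 1536 is the dimensionality of `text-embedding-ada-002` vectors):

```python
import numpy as np

# Placeholder vectors standing in for the embeddings of d1 and d2.
emb_d1 = np.random.rand(1536)
emb_d2 = np.random.rand(1536)

# Cosine similarity: values near 1.0 mean the documents point in the same
# semantic direction; values near 0 mean they are unrelated.
similarity = np.dot(emb_d1, emb_d2) / (
    np.linalg.norm(emb_d1) * np.linalg.norm(emb_d2)
)
print(similarity)
```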
While vector operations are fun, they can be hard to interpret. Embedding visualization lets you see the relationships between your vectors through a human-interpretable lens. Since embeddings are vectors that represent your text, visualizing them also sheds light on the hidden relationships between documents in your dataset.
First, obtain embeddings from OpenAI. You can use the example in this notebook to get started.
```python
import openai

# Assumes your API key is set via the OPENAI_API_KEY environment variable
# (or assign it directly with openai.api_key = "...").
embedding = openai.Embedding.create(
    input="Your text goes here", model="text-embedding-ada-002"
)["data"][0]["embedding"]
```
In this blog post, we will demonstrate the visualization with random embeddings so you can copy-paste the code and get running quickly; substituting embeddings obtained with your OpenAI API key works just as well. Once you have embeddings, send them over to Atlas. Atlas is a machine learning tool for visualizing large unstructured datasets of embeddings, text, and images.
```python
import nomic
from nomic import atlas
import numpy as np

# Authenticate first, e.g. with `nomic login` on the command line.

num_embeddings = 1000

# Random 256-dimensional embeddings drawn from a zero-mean Gaussian,
# standing in for real document embeddings.
embeddings = np.random.normal(size=(num_embeddings, 256))

project = atlas.map_embeddings(embeddings=embeddings)
```
Atlas produces a two-dimensional map of the embeddings. In the example above, we used random 256-dimensional embeddings sampled from a Gaussian distribution with mean zero.
If you used OpenAI embeddings, your embedding visualization will look like this. Notice that it is made up of clusters. Each cluster represents a subset of embeddings that are grouped together for some semantic reason. You can hover over the points to see the underlying text and try to figure out why OpenAI's embedding model views the embeddings in a cluster as similar. Atlas will also automatically find clusters of similar embeddings and attempt to summarize them for you.
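To make hovering useful, you can attach the original text to each point when building the map. A minimal sketch, assuming a `data` parameter on `map_embeddings` that takes one metadata record per embedding and a hypothetical `documents` list aligned with your embedding matrix (check the Nomic client docs for the exact signature):

```python
from nomic import atlas
import numpy as np

# Hypothetical corpus; substitute your own documents and real embeddings.
documents = ["First document text", "Second document text"]
embeddings = np.random.normal(size=(len(documents), 256))

# Each metadata record is attached to the point at the same index,
# so hovering over a point shows the document it came from.
project = atlas.map_embeddings(
    embeddings=embeddings,
    data=[{"text": doc} for doc in documents],
)
```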