How to Visualize Embeddings from a Vector Database (Pinecone, Weaviate)

published February 24, 2023

Turing Award winner Geoffrey Hinton famously said "To deal with a 14-dimensional space, visualize a 3-D space and say 'fourteen' to yourself very loudly." While funny, this advice is unfortunately not practically useful. What is practical, however, is using an embedding space visualizer to orient yourself in vector space.

You've likely stumbled on this post because you have embeddings sitting in a Pinecone Vector Database Index. While useful for vector search, this index can often feel like a black box: the patterns and trends present across your embeddings stay locked out of reach.

Why should I visualize my embeddings?

Embeddings encode semantic information about objects into Euclidean space. When you encode two documents of text, say document $d_1$ and document $d_2$, with an embedding service like Cohere or OpenAI, you can use vector operations to answer questions about the documents.
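As a rough illustration of one such vector operation, the sketch below computes the cosine similarity between two embeddings. The helper name `cosine_similarity` and the toy three-dimensional vectors are assumptions made for this example; real embeddings from an embedding service would have hundreds or thousands of dimensions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy embeddings standing in for documents d_1, d_2 and a third one.
d1 = np.array([1.0, 2.0, 3.0])
d2 = np.array([2.0, 4.0, 6.0])   # points in the same direction as d1
d3 = np.array([-3.0, 0.0, 1.0])  # orthogonal to d1

sim_12 = cosine_similarity(d1, d2)  # close to 1: very similar
sim_13 = cosine_similarity(d1, d3)  # close to 0: unrelated
print(sim_12, sim_13)
```

A score near 1 suggests the two documents are semantically close under the embedding model, while a score near 0 suggests they are unrelated.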

While vector operations are fun, they can be hard to interpret. Embedding visualization lets you see the relationships between your vectors through a human-interpretable lens. Since embeddings are vectors that represent your text, they also shed light on the hidden relationships between documents in your text dataset.

When you put these embeddings in a vector database with no visualization layer, you can only observe local information about your data (e.g. what is similar to what). By visualizing your embeddings with a tool like Atlas, you can gain a global view of your data, which helps you make more informed decisions.
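To make "local information" concrete, here is a minimal sketch of the kind of question a vector search answers: which stored vectors are closest to a query. It uses a brute-force Euclidean search in NumPy rather than a vector database, and the function name `nearest_neighbors` and the random data are assumptions for illustration only.

```python
import numpy as np

def nearest_neighbors(query, vectors, k=3):
    """Return the indices of the k vectors closest to query (Euclidean distance)."""
    distances = np.linalg.norm(vectors - query, axis=1)
    return np.argsort(distances)[:k].tolist()

# Hypothetical collection of 100 eight-dimensional embeddings.
rng = np.random.default_rng(0)
vectors = rng.random((100, 8))

# Query with a vector that is already in the collection:
# its own index comes back first because its distance to itself is zero.
neighbors = nearest_neighbors(vectors[0], vectors, k=3)
print(neighbors)
```

Each query like this reveals only a small neighborhood of the data; it says nothing about how the neighborhoods relate to one another, which is the gap a global map fills.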

Visualizing Embeddings from a Pinecone Vector Database Index

First, find your Pinecone and Atlas API keys. You can find these in your Pinecone console and on your Atlas dashboard.

Below we will create an example Pinecone Vector Database Index and fill it with 1000 random embeddings. If you have an existing Pinecone Index, you can skip this step and just import the Index as usual.

import pinecone
import numpy as np
from nomic import atlas
import nomic
pinecone.init(api_key='YOUR PINECONE API KEY', environment='us-east1-gcp')
nomic.login('YOUR NOMIC API KEY')

#create and insert embeddings into your pinecone index
pinecone.create_index("quickstart", dimension=128, metric="euclidean", pod_type="p1")

index = pinecone.Index("quickstart")

num_embeddings = 1000
embeddings_for_pinecone = np.random.rand(num_embeddings, 128)
index.upsert([(str(i), embeddings_for_pinecone[i].tolist()) for i in range(num_embeddings)])

Next, you'll need the IDs of all of your embeddings to fetch them from your Pinecone Index. In our previous example, we just used the integers 0-999 (as strings) as our IDs. Then, extract the embeddings into a numpy array. Once you have the embeddings, send them over to Atlas. Atlas is a machine learning tool used to visualize large unstructured datasets of embeddings, text and images.

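# fetch the stored vectors back from the index by ID
vectors = index.fetch(ids=[str(i) for i in range(num_embeddings)])

ids = []
embeddings = []
# each fetched entry maps an ID to a record whose 'values' field holds the embedding
for id, vector in vectors['vectors'].items():
    ids.append(id)
    embeddings.append(vector['values'])

embeddings = np.array(embeddings)

# build the Atlas map, attaching each embedding's ID as metadata
atlas.map_embeddings(embeddings=embeddings, data=[{'id': id} for id in ids], id_field='id')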

Atlas produces an interactive two-dimensional map of the embeddings. In the example above, we used random 128-dimensional embeddings, with each coordinate sampled uniformly from [0, 1) by np.random.rand.

Random Embeddings Map

If your Pinecone index contained embeddings of text generated through an API like Cohere or OpenAI, your embedding visualization would look like the map below. Notice that it is made up of clusters. Each cluster represents a subset of embeddings that is grouped together for some semantic reason. You can hover over the points to see the text and try to figure out why the embedding model views the embeddings in a cluster as similar. This may help you debug problems in your vector search results.

Text Embedding Map

Have fun building! You can find a full code example at this link.