Embeddings
Our embeddings API lets you easily send arbitrary data to Pinecone to be vectorized, including:
- Text snippets
- Text files (.docx, .txt, .pdf, .csv)
- Multiple files via URL
- Crawl URLs and embed the text of a webpage (todo)
Config
How text is embedded is configured in the configs/embeddings.yml file, where you can set:
- The embedding model you want to use (text-embedding-3-large or text-embedding-3-small).
- The embedding dimension (optional).
Leave the embedding dimension blank and it will be added for you based on which embedding model you choose.
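As a rough illustration of that defaulting rule, the sketch below resolves a dimension from a parsed config. The EmbeddingConfig shape and the resolveDimension helper are hypothetical (the real keys in configs/embeddings.yml may differ); the two default sizes are the native output sizes of the OpenAI text-embedding-3 models.

```typescript
type EmbeddingModel = "text-embedding-3-large" | "text-embedding-3-small";

// Native output sizes of the two models; used when no dimension is configured.
const DEFAULT_DIMENSIONS: Record<EmbeddingModel, number> = {
  "text-embedding-3-large": 3072,
  "text-embedding-3-small": 1536,
};

// Hypothetical shape of the parsed config; field names are assumptions.
interface EmbeddingConfig {
  model: EmbeddingModel;
  dimension?: number;
}

function resolveDimension(config: EmbeddingConfig): number {
  const nativeSize = DEFAULT_DIMENSIONS[config.model];
  // Blank dimension: fall back to the model's native size.
  if (config.dimension === undefined) {
    return nativeSize;
  }
  // An explicit dimension can shorten the vector but not exceed the native size.
  if (!Number.isInteger(config.dimension) || config.dimension < 1 || config.dimension > nativeSize) {
    throw new RangeError(
      `dimension must be an integer between 1 and ${nativeSize} for ${config.model}`
    );
  }
  return config.dimension;
}
```

Whatever value is resolved has to match the dimension of the Pinecone index, so it is worth settling before the first upload.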
Which model should I use?
There is a trade-off between the large and small models. text-embedding-3-large will likely provide more accurate retrieval results, but it will fill up your Pinecone index twice as fast, because its vectors are twice the size by default (3072 dimensions versus 1536).
We suggest starting with the text-embedding-3-small model and changing to large later if you’re not getting good results.
File uploads
When a file is uploaded via the embeddings API, it is also stored in your S3 storage in case it needs to be reindexed later.
URL Crawling
URLs are crawled in a very simple way: the content of the page is fetched with a basic GET request and reduced to its textual content, which is then embedded.
Our crawler doesn’t execute any JavaScript, so it won’t work for pages that render their content client-side. If you need that, you could integrate JSDOM or Puppeteer/Playwright and use it as a full webpage crawler.
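The fetch-and-strip approach can be sketched roughly as follows. This is an illustration, not the project's actual crawler: htmlToText and crawlUrl are hypothetical names, the tag stripping is a simple regex pass rather than a real HTML parser, and the global fetch assumes Node 18+ or a browser.

```typescript
// Reduce an HTML document to its visible text with a naive regex pass.
function htmlToText(html: string): string {
  return html
    // Drop blocks whose contents are never visible text.
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, " ")
    // Drop HTML comments.
    .replace(/<!--[\s\S]*?-->/g, " ")
    // Replace every remaining tag with a space so adjacent words stay separated.
    .replace(/<[^>]+>/g, " ")
    // Decode a handful of common entities (&amp; last, to avoid double decoding).
    .replace(/&nbsp;/g, " ")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, "&")
    // Collapse runs of whitespace.
    .replace(/\s+/g, " ")
    .trim();
}

// Fetch a page with a plain GET request and return its text content.
// No JavaScript on the page is executed.
async function crawlUrl(url: string): Promise<string> {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`GET ${url} failed with status ${response.status}`);
  }
  return htmlToText(await response.text());
}
```

A page that builds its content in the browser would come back nearly empty from crawlUrl, which is the limitation described above.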
Context IDs
Every time you add something to the embeddings database, we return a contextId property. You need this ID to use those embeddings in subsequent queries.
If you upload multiple documents at once with the /embeddings/urls endpoint, they will all be given the same contextId so that they can be searched together.
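A minimal sketch of a multi-URL upload, under stated assumptions: the endpoint is taken to accept a POST with a JSON body of the form { urls: [...] } and to respond with JSON containing contextId. Only the /embeddings/urls path and the contextId property come from this document; the base URL, the body shape, and the helper names are hypothetical.

```typescript
interface JsonPostRequest {
  method: string;
  headers: Record<string, string>;
  body: string;
}

// Build the request for a batch of URLs (assumed body shape: { urls: string[] }).
function buildEmbedUrlsRequest(urls: string[]): JsonPostRequest {
  if (urls.length === 0) {
    throw new Error("Provide at least one URL to embed");
  }
  return {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ urls }),
  };
}

// Pull the shared contextId out of the parsed response body.
function readContextId(payload: unknown): string {
  if (typeof payload === "object" && payload !== null) {
    const contextId = (payload as { contextId?: unknown }).contextId;
    if (typeof contextId === "string" && contextId.length > 0) {
      return contextId;
    }
  }
  throw new Error("Response did not include a contextId");
}

// Upload all URLs in one call; every document shares the returned contextId.
async function embedUrls(baseUrl: string, urls: string[]): Promise<string> {
  const response = await fetch(`${baseUrl}/embeddings/urls`, buildEmbedUrlsRequest(urls));
  if (!response.ok) {
    throw new Error(`Upload failed with status ${response.status}`);
  }
  return readContextId(await response.json());
}
```

Store the returned ID alongside whatever the documents belong to (a user, a project), since it is the handle for searching them later.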
If you upload documents from the Example App Knowledge Base page, you can also specify a contextId with each upload.
How do I fetch the embeddings?
There is currently no dedicated endpoint for this. If you call the Chat endpoint and provide a contextIds property, the matching embeddings are automatically searched and added to the context of the query.
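For illustration, a Chat request body carrying contextIds might be assembled like this. Only the contextIds property is named in this document; the message field and the buildChatRequest helper are assumptions, and the Chat endpoint's path is deliberately left out because it isn't specified here.

```typescript
// Hypothetical request shape: only `contextIds` is documented.
interface ChatRequest {
  message: string;      // assumed field name for the user's query
  contextIds: string[]; // IDs returned by earlier embedding uploads
}

function buildChatRequest(message: string, contextIds: string[]): ChatRequest {
  // Trim, drop blanks, and de-duplicate so each context is listed once.
  const cleaned = contextIds
    .map((id) => id.trim())
    .filter((id) => id.length > 0);
  return { message, contextIds: Array.from(new Set(cleaned)) };
}
```

The result would be serialized with JSON.stringify and sent as the body of the Chat request.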