Embeddings
Turn text into vectors with local embedding models (LM Studio). Built for customer-side Retrieval-Augmented Generation: you keep the documents and the vector store on your own infrastructure, CanaryLLM only does the transient embedding call. Nothing is stored on our side.
Endpoints
| Endpoint | Mode | Use Case |
|---|---|---|
| POST /api/llm/embeddings | Queued | Batch ingestion. Returns a queue id; poll /api/llm/queue/result for vectors. |
| POST /v1/embeddings | Synchronous | OpenAI-compatible. Drop-in for OpenAI SDKs, LangChain, LlamaIndex. Vectors returned directly. |
Native request (queued)
curl -X POST https://canaryllm.canarycoders.es/api/llm/embeddings \
  -H "Authorization: Bearer $CANARY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "lmstudio",
    "model": "nomic-embed-text-v1.5",
    "input": ["first chunk of text", "second chunk of text"],
    "tag": "kb:contracts"
  }'
The response contains a queueId. Poll POST /api/llm/queue/result with it to get { embeddings: number[][], dimensions, usage }.
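For clients that do not use the TypeScript SDK, the submit-then-poll flow can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the documented contract: it assumes the submit response is JSON with a `queueId` field, that the result endpoint accepts `{"queueId": ...}` as its body, and that a job is finished once the result contains an `embeddings` key. The `post` callable is a hypothetical stand-in for your HTTP client, so the loop can be exercised without a network.

```python
import time
from typing import Any, Callable

# Hypothetical transport: takes (path, json_body), returns the decoded JSON response.
Post = Callable[[str, dict[str, Any]], dict[str, Any]]


def embed_queued(
    post: Post,
    texts: list[str],
    model: str = "nomic-embed-text-v1.5",
    provider: str = "lmstudio",
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
) -> dict[str, Any]:
    """Submit a queued embedding request, then poll until vectors arrive."""
    submitted = post(
        "/api/llm/embeddings",
        {"provider": provider, "model": model, "input": texts},
    )
    queue_id = submitted["queueId"]

    deadline = time.monotonic() + timeout_s
    while True:
        # Assumption: the poll body carries the queue id under "queueId".
        result = post("/api/llm/queue/result", {"queueId": queue_id})
        # Assumption: a finished job is recognisable by its "embeddings" key.
        if "embeddings" in result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"queue item {queue_id} not ready after {timeout_s}s")
        time.sleep(interval_s)
```

In practice `post` would wrap whichever HTTP library you already use, adding the base URL and the Authorization header shown in the curl example. Keeping the transport separate also makes the timeout and interval easy to tune per ingestion job.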
OpenAI-compatible request (synchronous)
Pass the model as provider/model (for example lmstudio/nomic-embed-text-v1.5) and point any OpenAI embeddings client at the /v1 base URL.
import os
from openai import OpenAI
client = OpenAI(
    base_url="https://canaryllm.canarycoders.es/v1",
    # Read the key from the environment; a "$CANARY_API_KEY" string literal is not expanded in Python.
    api_key=os.environ["CANARY_API_KEY"],
)
resp = client.embeddings.create(
    model="lmstudio/nomic-embed-text-v1.5",
    input=["first chunk of text", "second chunk of text"],
)
vectors = [d.embedding for d in resp.data]
TypeScript SDK
The official @canarycoders/canaryllm SDK submits the queued request and polls for you, returning the typed vectors in one await.
import { CanaryLLM } from "@canarycoders/canaryllm";
const client = new CanaryLLM({ apiKey: process.env.CANARYLLM_API_KEY });
const { embeddings, dimensions } = await client.embeddings.create({
  provider: "lmstudio",
  model: "nomic-embed-text-v1.5",
  input: ["first chunk of text", "second chunk of text"],
});
RAG toolkit
For full Retrieval-Augmented Generation over your own documents, the open-source @canarycoders/canaryllm-rag toolkit adds chunking, text extraction, and a pluggable vector store (pgvector adapter included) on top of the SDK. Your documents and vectors stay in your store; only chunk text is embedded transiently.
import { readFile } from "node:fs/promises";
import { Pool } from "pg";
import { CanaryLLM } from "@canarycoders/canaryllm";
import { canaryEmbedder, ingestDocuments, retrieve, buildRagMessages } from "@canarycoders/canaryllm-rag";
import { PgVectorStore } from "@canarycoders/canaryllm-rag/store/pgvector";
const client = new CanaryLLM({ apiKey: process.env.CANARYLLM_API_KEY });
const embedder = canaryEmbedder(client, { model: "nomic-embed-text-v1.5" });
// dimensions must match the embedding model's output size (768 for nomic-embed-text-v1.5)
const store = new PgVectorStore(new Pool({ connectionString: process.env.DATABASE_URL }), { dimensions: 768 });
await store.migrate();
// ingest → embed → store (your data, your store)
const text = await readFile("handbook.md", "utf8"); // example document; the path is illustrative
await ingestDocuments([{ id: "handbook.md", text }], { embedder, store });
// retrieve → grounded answer via the gateway
const hits = await retrieve("How many vacation days do I get?", { embedder, store, topK: 5 });
const messages = buildRagMessages("How many vacation days do I get?", hits);
const answer = await client.chat.complete({ provider: "lmstudio", model: "qwen3-32b", messages });
Parameters
| Field | Type | Description |
|---|---|---|
| provider | string | Native endpoint only. lmstudio (default), openai, gemini, or vertex. (OpenAI-compat encodes this in model.) |
| model | string | Embedding model id, e.g. nomic-embed-text-v1.5, bge-m3. OpenAI-compat: provider/model. |
| input | string \| string[] | One string, or an array of up to 2048 strings, to embed. |
| dimensions | integer | Optional. Output dimensionality for models that support truncation (Matryoshka). |
| encodingFormat / encoding_format | string | float (default) or base64 (little-endian float32). |
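When base64 is requested, each vector has to be unpacked on the client. The sketch below shows one way to do that in Python, assuming each embedding arrives as a single base64 string in place of the number array (an assumption about the response shape; only the little-endian float32 packing is stated above). The function name `decode_embedding` is illustrative.

```python
import base64
import struct


def decode_embedding(b64: str) -> list[float]:
    """Decode a base64 string of packed little-endian float32 values."""
    raw = base64.b64decode(b64)
    if len(raw) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    count = len(raw) // 4
    # "<" forces little-endian regardless of the host byte order.
    return list(struct.unpack(f"<{count}f", raw))
```

The length of the decoded list should equal the dimensions value returned with the response, which makes a cheap sanity check before writing vectors to your store.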
Providers & residency
Local LM Studio is the default. It runs on EU premises and nothing leaves your infrastructure. External providers are opt-in when you want a specific model: pick one per request with provider and model.
| Provider | Where it runs | Residency |
|---|---|---|
| lmstudio (default) | Local, your premises | EU-only. No transfer, no sub-processor. |
| vertex | Google Vertex AI | EU-pinned (europe-west1). |
| openai | OpenAI | US, under the Data Privacy Framework. |
| gemini | Google Gemini API | Global, not EU-pinned. |
Embeddings derived from personal data are themselves personal data, and sending a corpus to a US provider is a heavier transfer than a single chat prompt because you usually embed a whole document set, not one message. For residency-sensitive data, prefer local LM Studio or EU-pinned Vertex.
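One way to enforce that preference in application code is a small guard that runs before every embedding call. The sketch below is illustrative only: the provider names and their residency come from the table above, while the function name `check_provider` and the `residency_sensitive` flag are hypothetical and not part of any CanaryLLM SDK.

```python
# Residency labels copied from the provider table; the policy itself is illustrative.
EU_RESIDENT_PROVIDERS = frozenset({"lmstudio", "vertex"})
KNOWN_PROVIDERS = frozenset({"lmstudio", "vertex", "openai", "gemini"})


def check_provider(provider: str, residency_sensitive: bool) -> str:
    """Return the provider if it is allowed for this data, else raise."""
    if provider not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    if residency_sensitive and provider not in EU_RESIDENT_PROVIDERS:
        raise PermissionError(
            f"{provider!r} is not EU-resident; use lmstudio or vertex for this corpus"
        )
    return provider
```

Calling `check_provider(provider, residency_sensitive=True)` at the ingestion entry point fails fast, before any chunk text is sent to an external provider.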
Privacy & data residency
With the default provider, the embeddings path runs on local inference (LM Studio) on EU premises, with no third-country transfer and no sub-processor. Input text is processed in memory for the duration of the request and never written to disk or database. Embeddings are returned to you and not retained: your application is the sole store of record for the vectors and the documents they came from.
Embeddings derived from personal data are themselves personal data. Keeping them on your own infrastructure (e.g. pgvector, sqlite-vec) keeps you in control of access, retention, and erasure.
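To make the erasure point concrete, here is a minimal sketch of a customer-side store that keys every vector by its source document, so that deleting a document also deletes its embeddings. It uses plain `sqlite3` from the Python standard library with vectors packed as float32 blobs; it is not sqlite-vec or pgvector, and the table layout and function names are assumptions chosen for illustration.

```python
import sqlite3
import struct


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a store with one row per embedded chunk."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        " doc_id TEXT NOT NULL,"
        " chunk_no INTEGER NOT NULL,"
        " vector BLOB NOT NULL,"
        " PRIMARY KEY (doc_id, chunk_no))"
    )
    return db


def put_document(db: sqlite3.Connection, doc_id: str, vectors: list[list[float]]) -> None:
    """Replace all stored vectors for one document."""
    rows = [
        (doc_id, i, struct.pack(f"<{len(v)}f", *v))  # little-endian float32
        for i, v in enumerate(vectors)
    ]
    with db:  # commits on success, rolls back on error
        db.execute("DELETE FROM chunks WHERE doc_id = ?", (doc_id,))
        db.executemany("INSERT INTO chunks VALUES (?, ?, ?)", rows)


def erase_document(db: sqlite3.Connection, doc_id: str) -> int:
    """Delete every vector derived from a document; returns rows removed."""
    with db:
        cur = db.execute("DELETE FROM chunks WHERE doc_id = ?", (doc_id,))
    return cur.rowcount
```

Because the vectors exist only in this store, a single `erase_document` call is sufficient to honour an erasure request for that document's embeddings.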