Embeddings

Turn text into vectors with local embedding models (LM Studio). Built for customer-side Retrieval-Augmented Generation: you keep the documents and the vector store on your own infrastructure, and CanaryLLM only performs the transient embedding call. Nothing is stored on our side.

Endpoints

| Endpoint | Mode | Use Case |
| --- | --- | --- |
| POST /api/llm/embeddings | Queued | Batch ingestion. Returns a queue id; poll /api/llm/queue/result for vectors. |
| POST /v1/embeddings | Synchronous | OpenAI-compatible. Drop-in for OpenAI SDKs, LangChain, LlamaIndex. Vectors returned directly. |

Native request (queued)

bash
curl -X POST https://canaryllm.canarycoders.es/api/llm/embeddings \
  -H "Authorization: Bearer $CANARY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "lmstudio",
    "model": "nomic-embed-text-v1.5",
    "input": ["first chunk of text", "second chunk of text"],
    "tag": "kb:contracts"
  }'

The response contains a queueId. Poll POST /api/llm/queue/result with it to get { embeddings: number[][], dimensions, usage }.
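
The polling step can be sketched as a small loop. This is a minimal sketch, not the documented client: the page names the endpoint and the final result shape only, so the `status` and `error` keys checked below are assumed field names, and the HTTP call itself is injected as a hypothetical `fetch_result` callable so the loop does not depend on the exact request body.

```python
import time
from typing import Any, Callable, Dict


def poll_embeddings(
    queue_id: str,
    fetch_result: Callable[[str], Dict[str, Any]],
    interval_s: float = 0.5,
    timeout_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the queued job yields embeddings or the timeout expires.

    fetch_result(queue_id) is supplied by the caller: it should POST to
    /api/llm/queue/result and return the parsed JSON body.
    The "status" and "error" keys are ASSUMED names, not documented ones.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch_result(queue_id)
        # The documented result shape carries an "embeddings" array.
        if "embeddings" in result:
            return result
        if result.get("status") == "failed":
            raise RuntimeError(f"embedding job {queue_id} failed: {result.get('error')}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"embedding job {queue_id} not ready after {timeout_s}s")
        sleep(interval_s)
```

Injecting `fetch_result` and `sleep` keeps the loop testable without network access; in real use `fetch_result` would wrap whatever HTTP client you already have.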

OpenAI-compatible request (synchronous)

Pass the model in provider/model format (for example lmstudio/nomic-embed-text-v1.5) and point any OpenAI embeddings client at the /v1 base URL.

python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://canaryllm.canarycoders.es/v1",
    # API key read from the CANARY_API_KEY environment variable
    api_key=os.environ["CANARY_API_KEY"],
)

resp = client.embeddings.create(
    model="lmstudio/nomic-embed-text-v1.5",
    input=["first chunk of text", "second chunk of text"],
)
vectors = [d.embedding for d in resp.data]
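
For a corpus larger than one request allows (the Parameters table documents a cap of 2048 strings per input array), the chunks can be split into batches first. A minimal sketch: `embed_batch` is a hypothetical callable wrapping whichever client call you use, assumed to return one vector per input string in the same order.

```python
from typing import Callable, Iterator, List, Sequence

MAX_INPUTS_PER_REQUEST = 2048  # cap taken from the Parameters table


def batched(items: Sequence[str], size: int = MAX_INPUTS_PER_REQUEST) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_all(
    chunks: Sequence[str],
    embed_batch: Callable[[Sequence[str]], List[List[float]]],
    size: int = MAX_INPUTS_PER_REQUEST,
) -> List[List[float]]:
    """Embed every chunk, one request per batch, preserving input order."""
    vectors: List[List[float]] = []
    for batch in batched(chunks, size):
        result = embed_batch(batch)
        # Guard against a response that does not line up with the inputs.
        if len(result) != len(batch):
            raise ValueError(f"expected {len(batch)} vectors, got {len(result)}")
        vectors.extend(result)
    return vectors
```

With the OpenAI client, `embed_batch` could be a small function that calls `client.embeddings.create` and returns `[d.embedding for d in resp.data]`.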

TypeScript SDK

The official @canarycoders/canaryllm SDK submits the queued request and polls for you, returning the typed vectors in one await.

typescript
import { CanaryLLM } from "@canarycoders/canaryllm";

const client = new CanaryLLM({ apiKey: process.env.CANARYLLM_API_KEY });

const { embeddings, dimensions } = await client.embeddings.create({
  provider: "lmstudio",
  model: "nomic-embed-text-v1.5",
  input: ["first chunk of text", "second chunk of text"],
});

RAG toolkit

For full Retrieval-Augmented Generation over your own documents, the open-source @canarycoders/canaryllm-rag toolkit adds chunking, text extraction, and a pluggable vector store (pgvector adapter included) on top of the SDK. Your documents and vectors stay in your store; only chunk text is embedded transiently.

typescript
import { readFile } from "node:fs/promises";
import { Pool } from "pg";
import { CanaryLLM } from "@canarycoders/canaryllm";
import { canaryEmbedder, ingestDocuments, retrieve, buildRagMessages } from "@canarycoders/canaryllm-rag";
import { PgVectorStore } from "@canarycoders/canaryllm-rag/store/pgvector";

const client = new CanaryLLM({ apiKey: process.env.CANARYLLM_API_KEY });
const embedder = canaryEmbedder(client, { model: "nomic-embed-text-v1.5" });

const store = new PgVectorStore(new Pool({ connectionString: process.env.DATABASE_URL }), { dimensions: 768 });
await store.migrate();

// load the document text (handbook.md is a placeholder path)
const text = await readFile("handbook.md", "utf8");

// ingest → embed → store (your data, your store)
await ingestDocuments([{ id: "handbook.md", text }], { embedder, store });

// retrieve → grounded answer via the gateway
const hits = await retrieve("How many vacation days do I get?", { embedder, store, topK: 5 });
const messages = buildRagMessages("How many vacation days do I get?", hits);
const answer = await client.chat.complete({ provider: "lmstudio", model: "qwen3-32b", messages });
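
A retrieval step of this kind typically ranks stored chunk vectors by their similarity to the query vector. The sketch below illustrates that idea with cosine similarity over an in-memory list. It is not the toolkit's implementation (a pgvector store would normally do the ranking in SQL), and `top_k_by_cosine` is a hypothetical name used only for illustration.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_by_cosine(
    query: Sequence[float],
    stored: Sequence[Tuple[str, Sequence[float]]],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the k (chunk_id, score) pairs most similar to the query, best first."""
    scored = [(chunk_id, cosine_similarity(query, vec)) for chunk_id, vec in stored]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A linear scan like this is fine for a few thousand chunks; larger stores usually rely on the vector database's own index.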

Parameters

| Field | Type | Description |
| --- | --- | --- |
| provider | string | Native endpoint only. lmstudio (default), openai, gemini, or vertex. (OpenAI-compat encodes this in model.) |
| model | string | Embedding model id, e.g. nomic-embed-text-v1.5, bge-m3. OpenAI-compat: provider/model. |
| input | string \| string[] | One string or an array of up to 2048 strings to embed. |
| dimensions | integer | Optional. Output dimensionality for models that support truncation (Matryoshka). |
| encodingFormat / encoding_format | string | float (default) or base64 (little-endian float32). |
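
When the encoding format is base64, each vector is a base64 string of little-endian float32 values, as the table states. A minimal decoding sketch using only the standard library follows. Where that string sits in the response body is not specified on this page, so the hypothetical `decode_embedding` helper takes the string itself.

```python
import base64
import struct
from typing import List


def decode_embedding(b64: str) -> List[float]:
    """Decode a base64 string of little-endian float32 values into Python floats."""
    raw = base64.b64decode(b64)
    if len(raw) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    count = len(raw) // 4
    # "<" forces little-endian byte order; "f" is a 4-byte IEEE 754 float.
    return list(struct.unpack(f"<{count}f", raw))
```

The base64 form is more compact on the wire than a JSON array of decimal floats, which matters mostly for large batches.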

Providers & residency

Local LM Studio is the default. It runs on EU premises and nothing leaves your infrastructure. External providers are opt-in when you want a specific model: pick one per request with provider and model.

| Provider | Where it runs | Residency |
| --- | --- | --- |
| lmstudio (default) | Local, your premises | EU-only. No transfer, no sub-processor. |
| vertex | Google Vertex AI | EU-pinned (europe-west1). |
| openai | OpenAI | US, under the Data Privacy Framework. |
| gemini | Google Gemini API | Global, not EU-pinned. |

Embeddings derived from personal data are themselves personal data, so sending a corpus to a US provider is a heavier transfer than a single chat prompt: you usually embed a whole document set, not one message. For residency-sensitive data, prefer local LM Studio or EU-pinned Vertex.
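
One way to enforce that preference in application code is a guard that refuses non-EU providers before any request is built. This is a sketch only: the mapping copies the residency table above, and `require_eu_provider` is a hypothetical helper, not part of any CanaryLLM SDK.

```python
# Residency per provider, copied from the table above.
EU_RESIDENT = {
    "lmstudio": True,   # local, EU-only
    "vertex": True,     # EU-pinned (europe-west1)
    "openai": False,    # US, under the Data Privacy Framework
    "gemini": False,    # global, not EU-pinned
}


def require_eu_provider(provider: str = "lmstudio") -> str:
    """Return the provider name if it keeps data in the EU, otherwise raise."""
    try:
        is_eu = EU_RESIDENT[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None
    if not is_eu:
        raise PermissionError(f"provider {provider!r} is not EU-resident")
    return provider
```

Calling the guard where the `provider` field is set means a misconfigured job fails before any text is sent.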

Privacy & data residency

By default, the embeddings path runs on local inference (LM Studio) on EU premises, with no third-country transfer and no sub-processor. Input text is processed in memory for the duration of the request and never written to disk or database. Embeddings are returned to you and not retained: your application is the sole store of record for the vectors and the documents they came from.

Embeddings derived from personal data are themselves personal data. Keeping them on your own infrastructure (e.g. pgvector, sqlite-vec) keeps you in control of access, retention, and erasure.