Embeddings

Turn text into vectors with local embedding models (LM Studio). Built for customer-side Retrieval-Augmented Generation: you keep the documents and the vector store on your own infrastructure, and CanaryLLM only performs the transient embedding call. Nothing is stored on our side.

Endpoints

| Endpoint | Mode | Use Case |
| --- | --- | --- |
| `POST /api/llm/embeddings` | Queued | Batch ingestion. Returns a queue id; poll `/api/llm/queue/result` for vectors. |
| `POST /v1/embeddings` | Synchronous | OpenAI-compatible. Drop-in for OpenAI SDKs, LangChain, LlamaIndex. Vectors returned directly. |

Native request (queued)

```bash
curl -X POST https://canaryllm.canarycoders.es/api/llm/embeddings \
  -H "Authorization: Bearer $CANARY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "provider": "lmstudio",
    "model": "nomic-embed-text-v1.5",
    "input": ["first chunk of text", "second chunk of text"],
    "tag": "kb:contracts"
  }'
```

The response contains a `queueId`. Poll `POST /api/llm/queue/result` with it to get `{ embeddings: number[][], dimensions, usage }`.
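The polling step can be sketched as a small loop. This is a minimal illustration, not the gateway's reference client: the request body for the result endpoint is not described here, so the HTTP call is left to a caller-supplied `fetch_result` function (hypothetical), and the loop assumes a finished job is recognisable by the presence of the `embeddings` key.

```python
import time
from typing import Callable, Optional


def poll_for_embeddings(
    fetch_result: Callable[[str], Optional[dict]],
    queue_id: str,
    interval_s: float = 1.0,
    timeout_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_result(queue_id) until it returns a payload with embeddings.

    fetch_result is supplied by the caller and should perform the
    POST /api/llm/queue/result request, returning the parsed JSON body
    (or None while nothing is available).
    """
    deadline = time.monotonic() + timeout_s
    while True:
        result = fetch_result(queue_id)
        # Assumption: a finished job carries the "embeddings" key.
        if result is not None and "embeddings" in result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"queue item {queue_id} not ready after {timeout_s}s")
        sleep(interval_s)
```

Injecting `fetch_result` and `sleep` keeps the loop independent of any HTTP library and makes it easy to exercise with a fake in tests.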

OpenAI-compatible request (synchronous)

Point any OpenAI embeddings client at `/v1` and pass the model in `provider/model` format, for example `lmstudio/nomic-embed-text-v1.5`.

```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://canaryllm.canarycoders.es/v1",
    # Read the key from the environment; a literal "$CANARY_API_KEY" string
    # is not expanded by Python.
    api_key=os.environ["CANARY_API_KEY"],
)

resp = client.embeddings.create(
    model="lmstudio/nomic-embed-text-v1.5",
    input=["first chunk of text", "second chunk of text"],
)
vectors = [d.embedding for d in resp.data]
```
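Because `input` accepts at most 2048 strings per request (see Parameters), a larger corpus has to be split across several calls. The sketch below shows one way to do that; `batched` and `embed_all` are illustrative helper names, and `client` is assumed to be an OpenAI client configured as in the example above.

```python
from typing import Iterator, List, Sequence

MAX_INPUTS_PER_REQUEST = 2048  # limit stated in the Parameters table


def batched(chunks: Sequence[str], size: int = MAX_INPUTS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` chunks, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(chunks), size):
        yield list(chunks[start:start + size])


def embed_all(client, model: str, chunks: Sequence[str]) -> List[List[float]]:
    """Embed every chunk, one request per batch, and return vectors in input order."""
    vectors: List[List[float]] = []
    for batch in batched(chunks):
        resp = client.embeddings.create(model=model, input=batch)
        vectors.extend(d.embedding for d in resp.data)
    return vectors
```

For example, `embed_all(client, "lmstudio/nomic-embed-text-v1.5", chunks)` would issue three requests for a list of 5000 chunks.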

TypeScript SDK

The official @canarycoders/canaryllm SDK submits the queued request and polls for you, returning the typed vectors in one await.

```typescript
import { CanaryLLM } from "@canarycoders/canaryllm";

const client = new CanaryLLM({ apiKey: process.env.CANARYLLM_API_KEY });

const { embeddings, dimensions } = await client.embeddings.create({
  provider: "lmstudio",
  model: "nomic-embed-text-v1.5",
  input: ["first chunk of text", "second chunk of text"],
});
```

RAG toolkit

For full Retrieval-Augmented Generation over your own documents, the open-source @canarycoders/canaryllm-rag toolkit adds chunking, text extraction, and a pluggable vector store (pgvector adapter included) on top of the SDK. Your documents and vectors stay in your store; only chunk text is embedded transiently.

```typescript
import { readFile } from "node:fs/promises";
import { Pool } from "pg";
import { CanaryLLM } from "@canarycoders/canaryllm";
import { canaryEmbedder, ingestDocuments, retrieve, buildRagMessages } from "@canarycoders/canaryllm-rag";
import { PgVectorStore } from "@canarycoders/canaryllm-rag/store/pgvector";

const client = new CanaryLLM({ apiKey: process.env.CANARYLLM_API_KEY });
const embedder = canaryEmbedder(client, { model: "nomic-embed-text-v1.5" });

const store = new PgVectorStore(new Pool({ connectionString: process.env.DATABASE_URL }), { dimensions: 768 });
await store.migrate();

// Assumption: the document is a local file; load the text however suits your app.
const text = await readFile("handbook.md", "utf8");

// ingest → embed → store (your data, your store)
await ingestDocuments([{ id: "handbook.md", text }], { embedder, store });

// retrieve → grounded answer via the gateway
const hits = await retrieve("How many vacation days do I get?", { embedder, store, topK: 5 });
const messages = buildRagMessages("How many vacation days do I get?", hits);
const answer = await client.chat.complete({ provider: "lmstudio", model: "qwen3-32b", messages });
```
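To make the retrieval step less opaque, here is a conceptual sketch of what a top-K vector search does: score every stored vector against the query vector and keep the best matches. This is not the toolkit's implementation (a pgvector-backed store would run the search inside Postgres); the function names and the in-memory `rows` list are purely illustrative.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: Sequence[float],
    rows: Sequence[Tuple[str, Sequence[float]]],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the k (chunk_id, score) pairs most similar to the query vector."""
    scored = [(chunk_id, cosine_similarity(query, vec)) for chunk_id, vec in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A brute-force scan like this is fine for a few thousand chunks; beyond that, an indexed store is usually the better choice.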

Parameters

| Field | Type | Description |
| --- | --- | --- |
| `provider` | string | Native endpoint only. `lmstudio` (default), `openai`, `gemini`, or `vertex`. (OpenAI-compat encodes this in `model`.) |
| `model` | string | Embedding model id, e.g. `nomic-embed-text-v1.5`, `bge-m3`. OpenAI-compat: `provider/model`. |
| `input` | string \| string[] | A single string, or an array of up to 2048 strings, to embed. |
| `dimensions` | integer | Optional. Output dimensionality for models that support truncation (Matryoshka). |
| `encodingFormat` / `encoding_format` | string | `float` (default) or `base64` (little-endian float32). |
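When `base64` encoding is requested, each embedding arrives as a base64 string of packed little-endian float32 values rather than a JSON array. A minimal decoding sketch using only the standard library follows; `decode_base64_embedding` is an illustrative name, not part of any SDK.

```python
import base64
import struct
from typing import List


def decode_base64_embedding(payload: str) -> List[float]:
    """Decode a base64 string of little-endian float32 values into floats."""
    raw = base64.b64decode(payload)
    if len(raw) % 4 != 0:
        raise ValueError("payload length is not a multiple of 4 bytes")
    count = len(raw) // 4
    # "<" selects little-endian byte order, "f" is a 4-byte float.
    return list(struct.unpack(f"<{count}f", raw))
```

The base64 form is more compact on the wire than a JSON array of floats, which can matter when embedding large batches.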

Providers & residency

Local LM Studio is the default. It runs on EU premises and nothing leaves your infrastructure. External providers are opt-in: when you need a specific model, choose one per request with `provider` and `model`.

| Provider | Where it runs | Residency |
| --- | --- | --- |
| `lmstudio` (default) | Local, your premises | EU-only. No transfer, no sub-processor. |
| `vertex` | Google Vertex AI | EU-pinned (`europe-west1`). |
| `openai` | OpenAI | US, under the Data Privacy Framework. |
| `gemini` | Google Gemini API | Global, not EU-pinned. |

Embeddings derived from personal data are themselves personal data, and sending a corpus to a US provider is a heavier transfer than a single chat prompt because you usually embed a whole document set, not one message. For residency-sensitive data, prefer local LM Studio or EU-pinned Vertex.
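One way to enforce this preference in application code is a small guard that runs before any embedding request is built. The sketch below is an assumption about how you might structure it on the client side; the residency labels are copied from the provider table, while `PROVIDER_RESIDENCY` and `check_provider` are hypothetical names.

```python
from typing import Dict

# Residency labels taken from the provider table above.
PROVIDER_RESIDENCY: Dict[str, str] = {
    "lmstudio": "eu",
    "vertex": "eu",
    "openai": "us",
    "gemini": "global",
}


def check_provider(provider: str, require_eu: bool) -> str:
    """Return the provider if it satisfies the residency requirement, else raise."""
    if provider not in PROVIDER_RESIDENCY:
        raise ValueError(f"unknown provider: {provider}")
    if require_eu and PROVIDER_RESIDENCY[provider] != "eu":
        raise PermissionError(f"{provider} is not EU-resident")
    return provider
```

Calling `check_provider("openai", require_eu=True)` raises before any text is sent, which keeps the residency decision explicit and in one place.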

Privacy & data residency

By default, the embeddings path runs on local inference (LM Studio) on EU premises, with no third-country transfer and no sub-processor. Input text is processed in memory for the duration of the request and never written to disk or database. Embeddings are returned to you and not retained: your application is the sole store of record for the vectors and the documents they came from.

Embeddings derived from personal data are themselves personal data. Keeping them on your own infrastructure (e.g. pgvector, sqlite-vec) keeps you in control of access, retention, and erasure.
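As an illustration of what control over erasure can look like, the sketch below keeps vectors in a plain SQLite table keyed by document id, so that removing a document also removes every vector derived from it. It uses only the standard `sqlite3` module rather than sqlite-vec or pgvector, and the table layout and function names are assumptions made for this example.

```python
import sqlite3
import struct
from typing import Sequence


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a SQLite database with one row per embedded chunk."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chunks ("
        " doc_id TEXT NOT NULL,"
        " chunk_no INTEGER NOT NULL,"
        " vector BLOB NOT NULL,"
        " PRIMARY KEY (doc_id, chunk_no))"
    )
    return conn


def add_vector(conn: sqlite3.Connection, doc_id: str, chunk_no: int, vector: Sequence[float]) -> None:
    """Store one vector as packed little-endian float32 bytes."""
    blob = struct.pack(f"<{len(vector)}f", *vector)
    conn.execute("INSERT OR REPLACE INTO chunks VALUES (?, ?, ?)", (doc_id, chunk_no, blob))
    conn.commit()


def erase_document(conn: sqlite3.Connection, doc_id: str) -> int:
    """Delete every vector derived from one document; return the number removed."""
    cur = conn.execute("DELETE FROM chunks WHERE doc_id = ?", (doc_id,))
    conn.commit()
    return cur.rowcount
```

Keying vectors by source document is the design choice that matters here: an erasure request then maps to a single delete, whichever store you use.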