Image, Video & Audio

Generate images, videos, and audio using dedicated endpoints.

Image Generation

Available via POST /api/llm/generate-image.

Request

json

{
  "provider": "openai",
  "prompt": "A futuristic city skyline at sunset, digital art",
  "model": "gpt-image-1",
  "n": 1,
  "size": "1024x1024",
  "quality": "hd",
  "tag": "marketing"
}

Ollama (Local) image request

json

{
  "provider": "ollama",
  "prompt": "A watercolor painting of a forest",
  "model": "flux2-klein",
  "n": 1,
  "size": "1024x1024",
  "tag": "local-gen"
}

Gemini / Vertex image request

json

{
  "provider": "gemini",
  "prompt": "A watercolor painting of a forest",
  "model": "imagen-4.0-generate-001",
  "n": 2,
  "aspectRatio": "16:9",
  "tag": "marketing"
}

Result (via /queue/result)

json

{
  "success": true,
  "data": {
    "queueId": "img_abc123",
    "status": "completed",
    "result": {
      "images": [
        { "data": "<base64_encoded_image>", "mimeType": "image/png" }
      ],
      "model": "gpt-image-1",
      "provider": "openai"
    }
  }
}
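Because images come back base64-encoded, a client has to decode them before saving or displaying them. The following is a minimal sketch that assumes the response shape shown above; the helper name `decode_images` is hypothetical and not part of the API.

```python
import base64


def decode_images(response: dict) -> list[tuple[str, bytes]]:
    """Return (mimeType, raw bytes) pairs from a completed image result."""
    data = response.get("data", {})
    if not response.get("success") or data.get("status") != "completed":
        raise ValueError("result is not completed yet")
    return [
        (image["mimeType"], base64.b64decode(image["data"]))
        for image in data["result"]["images"]
    ]
```

Each returned pair can be written to disk with `open(path, "wb")`, choosing the file extension from the MIME type.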

Parameters

Parameter	Type	Description
`provider`	string	openai, gemini, vertex, xai, ollama
`prompt`	string	Image description
`model`	string	Model ID (optional, uses default)
`n`	number	Number of images (1-10)
`size`	string	e.g. "1024x1024" (OpenAI, Ollama)
`aspectRatio`	string	e.g. "16:9" (Gemini/Vertex)
`quality`	string	"standard", "hd", "ultra"
`tag`	string	Optional label for usage tracking (max 100 chars)
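The limits in the table can be checked on the client before a request is queued. This sketch builds a request body under those documented limits; the function name and the snake_case keyword arguments are hypothetical, and whether the server applies the same checks is not stated here.

```python
ALLOWED_PROVIDERS = {"openai", "gemini", "vertex", "xai", "ollama"}


def build_image_request(provider, prompt, *, model=None, n=1, size=None,
                        aspect_ratio=None, quality=None, tag=None):
    """Build a JSON-serializable body for POST /api/llm/generate-image."""
    if provider not in ALLOWED_PROVIDERS:
        raise ValueError(f"unsupported provider: {provider!r}")
    if not 1 <= n <= 10:
        raise ValueError("n must be between 1 and 10")
    if tag is not None and len(tag) > 100:
        raise ValueError("tag must be at most 100 characters")
    body = {"provider": provider, "prompt": prompt, "n": n}
    optional = {"model": model, "size": size, "aspectRatio": aspect_ratio,
                "quality": quality, "tag": tag}
    # Leave out unset fields so any server-side defaults can apply.
    body.update({key: value for key, value in optional.items() if value is not None})
    return body
```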

Video Generation

Available via POST /api/llm/generate-video.

Request

json

{
  "provider": "gemini",
  "prompt": "A timelapse of clouds moving over a mountain range",
  "model": "veo-3.1-generate-preview",
  "aspectRatio": "16:9",
  "durationSeconds": 8,
  "numberOfVideos": 1,
  "tag": "content-creation"
}

xAI Image-to-Video

Pass an imageUrl to generate a video from a source image (xAI only).

json

{
  "provider": "xai",
  "prompt": "Slow camera zoom out with cinematic lighting",
  "model": "grok-imagine-video",
  "imageUrl": "https://example.com/photo.jpg"
}

Result (via /queue/result)

Video generation is asynchronous. Results are available via /queue/result.

json

{
  "success": true,
  "data": {
    "queueId": "vid_abc123",
    "status": "completed",
    "result": {
      "videos": [
        { "data": "<base64_encoded_video>", "mimeType": "video/mp4" }
      ],
      "model": "veo-3.1-generate-preview",
      "provider": "gemini"
    }
  }
}

Parameters

Parameter	Type	Description
`provider`	string	gemini, vertex, xai
`prompt`	string	Video description
`model`	string	Model ID (default: veo-3.1-generate-preview)
`aspectRatio`	string	e.g. "16:9", "9:16", "1:1"
`durationSeconds`	number	Duration in seconds (1-60)
`numberOfVideos`	number	Number of videos (1-4)
`tag`	string	Optional label for usage tracking (max 100 chars)
`resolution`	string	e.g. "1080p" (xAI)
`imageUrl`	string	Source image URL for image-to-video (xAI)
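Several video parameters have ranges or are documented for one provider only. As a rough illustration, the sketch below collects problems instead of raising on the first one, so a caller can report them together. The function name `check_video_request` is hypothetical, and treating xAI-only fields as errors for other providers is an assumption about how strict a client wants to be.

```python
def check_video_request(body: dict) -> list[str]:
    """Return human-readable problems; an empty list means the body passed."""
    problems = []
    provider = body.get("provider")
    if provider not in {"gemini", "vertex", "xai"}:
        problems.append(f"unsupported provider: {provider!r}")
    duration = body.get("durationSeconds")
    if duration is not None and not 1 <= duration <= 60:
        problems.append("durationSeconds must be between 1 and 60")
    count = body.get("numberOfVideos")
    if count is not None and not 1 <= count <= 4:
        problems.append("numberOfVideos must be between 1 and 4")
    # imageUrl and resolution are documented for xAI only.
    for field in ("imageUrl", "resolution"):
        if field in body and provider != "xai":
            problems.append(f"{field} is only documented for xai")
    return problems
```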

Text-to-Speech (TTS)

Available via POST /api/llm/generate-audio.

ElevenLabs request

json

{
  "provider": "elevenlabs",
  "text": "Hello, welcome to the interview.",
  "model": "eleven_flash_v2_5",
  "voiceId": "JBFqnCBsd6RMkjVDRZzb",
  "outputFormat": "mp3_44100_128",
  "tag": "interview"
}

MLX Audio (Local) request

json

{
  "provider": "mlxaudio",
  "text": "Hallo, welkom bij het interview.",
  "model": "kokoro",
  "voiceId": "af_heart",
  "tag": "interview"
}

Result (via /queue/result)

json

{
  "success": true,
  "data": {
    "queueId": "llm_abc123",
    "status": "completed",
    "result": {
      "audio": "<base64_encoded_audio>",
      "mimeType": "audio/mpeg",
      "model": "eleven_flash_v2_5",
      "characterCount": 32
    }
  }
}
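The audio arrives as a base64 string alongside its MIME type. A minimal sketch of turning that into bytes plus a file extension follows; the helper name and the small extension table are assumptions for illustration, and unknown MIME types fall back to `.bin`.

```python
import base64

# Hypothetical mapping; extend it for the output formats you request.
EXTENSIONS = {"audio/mpeg": ".mp3", "audio/wav": ".wav"}


def extract_audio(response: dict) -> tuple[bytes, str]:
    """Return (raw audio bytes, suggested file extension) from a completed result."""
    data = response["data"]
    if data["status"] != "completed":
        raise ValueError(f"job {data['queueId']} is {data['status']}, not completed")
    result = data["result"]
    extension = EXTENSIONS.get(result["mimeType"], ".bin")
    return base64.b64decode(result["audio"]), extension
```

The same helper should work for any result on this page that carries `audio` and `mimeType` fields in this shape.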

Parameters

Parameter	Type	Description
`provider`	string	"elevenlabs" or "mlxaudio"
`text`	string	Text to convert to speech
`model`	string	TTS model ID
`voiceId`	string	Voice ID (ElevenLabs ID or MLX Audio voice name)
`outputFormat`	string	mp3_44100_128, pcm_16000, etc. (ElevenLabs only)
`tag`	string	Optional label for usage tracking

Sound Effects

Available via POST /api/llm/generate-sound-effect. Powered by ElevenLabs text-to-sound-effects.

Request

json

{
  "text": "thunder rolling in the distance, rain on a tin roof",
  "model": "eleven_text_to_sound_v2",
  "durationSeconds": 10,
  "promptInfluence": 0.5,
  "tag": "ambient"
}

Result (via /queue/result)

json

{
  "success": true,
  "data": {
    "queueId": "llm_abc123",
    "status": "completed",
    "result": {
      "audio": "<base64_encoded_audio>",
      "mimeType": "audio/mpeg",
      "durationSeconds": 10,
      "model": "eleven_text_to_sound_v2"
    }
  }
}

Parameters

Parameter	Type	Description
`text`	string	Sound effect description (required)
`model`	string	Model ID (default: eleven_text_to_sound_v2)
`durationSeconds`	number	Duration 0.5-30 seconds
`promptInfluence`	number	How closely to follow the prompt (0-1, default 0.3)
`loop`	boolean	Generate a loopable sound effect
`tag`	string	Optional label for usage tracking

Music Generation

Available via POST /api/llm/generate-music. Powered by ElevenLabs text-to-music.

Request

json

{
  "prompt": "upbeat jazz jingle, 10 seconds, bright piano and saxophone",
  "model": "music_v1",
  "durationMs": 10000,
  "forceInstrumental": true,
  "tag": "jingle"
}

Result (via /queue/result)

json

{
  "success": true,
  "data": {
    "queueId": "llm_abc123",
    "status": "completed",
    "result": {
      "audio": "<base64_encoded_audio>",
      "mimeType": "audio/mpeg",
      "durationMs": 10000,
      "model": "music_v1"
    }
  }
}

Parameters

Parameter	Type	Description
`prompt`	string	Music description (required)
`model`	string	Model ID (default: music_v1)
`durationMs`	number	Duration in milliseconds (3000-600000)
`forceInstrumental`	boolean	Force instrumental only (no vocals)
`tag`	string	Optional label for usage tracking

Dialogue (Multi-Speaker)

Available via POST /api/llm/generate-dialogue. Generates multi-speaker dialogue audio in which each turn is spoken by the voice given in its voiceId. Only the eleven_v3 model is supported, and a request may use at most 10 unique voice IDs.

Example Request

json

// POST /api/llm/generate-dialogue
{
  "inputs": [
    { "text": "Hello, how are you?", "voiceId": "JBFqnCBsd6RMkjVDRZzb" },
    { "text": "I'm doing great, thanks!", "voiceId": "Aw4FAjKCGjjNkVhN1Xmq" }
  ],
  "outputFormat": "mp3_44100_128",
  "languageCode": "en"
}

Parameters

Parameter	Type	Description
`inputs`	array	Array of {text, voiceId} objects (required, 1-100 items)
`model`	string	Only eleven_v3 (default)
`outputFormat`	string	Audio format (default: mp3_44100_128)
`languageCode`	string	Language code (e.g. "en", "nl")
`voiceSettings`	object	Voice settings (stability, similarityBoost, style, useSpeakerBoost)
`seed`	integer	Reproducibility seed (0-4294967295)
`applyTextNormalization`	string	"auto", "on", or "off"
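The two documented limits for dialogue requests (1-100 inputs, at most 10 unique voice IDs) are easy to exceed when turns are assembled programmatically. The sketch below checks them before sending; `check_dialogue_inputs` is a hypothetical helper, and requiring non-empty `text` and `voiceId` on every turn is an assumption.

```python
def check_dialogue_inputs(inputs: list[dict]) -> None:
    """Raise ValueError if the inputs array breaks a documented limit."""
    if not 1 <= len(inputs) <= 100:
        raise ValueError("inputs must contain between 1 and 100 items")
    for index, turn in enumerate(inputs):
        # Assumption: both fields must be present and non-empty on every turn.
        if not turn.get("text") or not turn.get("voiceId"):
            raise ValueError(f"inputs[{index}] needs both text and voiceId")
    unique_voices = {turn["voiceId"] for turn in inputs}
    if len(unique_voices) > 10:
        raise ValueError(
            f"{len(unique_voices)} unique voice IDs used; the maximum is 10"
        )
```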

Voice Previews

Browse and preview all available TTS voices, including ElevenLabs and MLX Audio voices.

Speech-to-Text (STT)

Available via POST /api/llm/transcribe.

ElevenLabs request (with diarization)

json

{
  "provider": "elevenlabs",
  "audio": "<base64_encoded_audio>",
  "mimeType": "audio/mpeg",
  "model": "scribe_v2",
  "language": "nl",
  "tag": "interview",
  "diarize": true,
  "numSpeakers": 2,
  "timestampsGranularity": "word",
  "tagAudioEvents": true
}

MLX Audio (Local) request

json

{
  "provider": "mlxaudio",
  "audio": "<base64_encoded_audio>",
  "mimeType": "audio/wav",
  "model": "whisper-large-v3",
  "language": "nl",
  "tag": "transcription"
}

Result (via /queue/result)

json

{
  "success": true,
  "data": {
    "queueId": "llm_abc123",
    "status": "completed",
    "result": {
      "text": "Hello, welcome to the interview.",
      "language": "nl",
      "model": "scribe_v2",
      "words": [
        { "text": "Hello,", "start": 0.08, "end": 0.54, "type": "word", "speakerId": "speaker_0" },
        { "text": "welcome", "start": 0.56, "end": 0.92, "type": "word", "speakerId": "speaker_0" },
        { "text": "to", "start": 0.94, "end": 1.02, "type": "word", "speakerId": "speaker_0" }
      ]
    }
  }
}
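With diarization enabled, each word carries a `speakerId`, so consecutive words can be merged into speaker turns. The sketch below assumes the `words` shape shown above. Skipping entries whose `type` is not `"word"` is an assumption, since only word entries appear in the sample.

```python
def speaker_turns(words: list[dict]) -> list[dict]:
    """Merge consecutive words from the same speaker into one turn."""
    turns = []
    for word in words:
        if word.get("type") != "word":
            continue  # assumption: other entry types carry no spoken text
        speaker = word.get("speakerId")
        if turns and turns[-1]["speakerId"] == speaker:
            # Same speaker is still talking: extend the current turn.
            turns[-1]["text"] += " " + word["text"]
            turns[-1]["end"] = word["end"]
        else:
            turns.append({
                "speakerId": speaker,
                "text": word["text"],
                "start": word["start"],
                "end": word["end"],
            })
    return turns
```

Applied to the three sample words above, this yields a single turn for `speaker_0` running from 0.08 to 1.02 seconds.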

Parameters

Parameter	Type	Description
`provider`	string	"elevenlabs" or "mlxaudio"
`audio`	string	Base64-encoded audio data
`mimeType`	string	audio/mpeg, audio/wav, etc.
`model`	string	STT model ID
`language`	string	ISO language code (optional)
`tag`	string	Optional label for usage tracking
`diarize`	boolean	Enable speaker diarization (ElevenLabs only)
`numSpeakers`	number	Expected number of speakers, 1-32 (ElevenLabs only)
`timestampsGranularity`	string	"word" or "character" (ElevenLabs, default: "word")
`tagAudioEvents`	boolean	Tag audio events like laughter, applause (ElevenLabs only)

Image Recognition (Vision)

Image recognition endpoints live under /api/vision/. The local detection endpoints (object, zero-shot, faces, training) require VISION_SERVICE_ENABLED=true. Web detection does not use the local sidecar: it runs on Google Cloud Vision (the gcvision provider, billed per analysis) by default, or on TinEye when engine is set to tineye.

Endpoint	Model	Use Case
POST /api/vision/detect	yolo11n/s/m	Object detection (COCO classes)
POST /api/vision/detect/zero-shot	florence-2-base	Zero-shot detection (text prompt)
POST /api/vision/detect/faces	haarcascade, mediapipe, yolo-face	Face detection + optional blur
POST /api/vision/detect/web	gcvision (default), tineye	Reverse image search: checks whether the image appears elsewhere on the web. Select the engine with the `engine` field
POST /api/vision/train	-	Start custom model training (smart defaults)
POST /api/vision/auto-label	florence-2-base	Auto-label images using Florence-2
POST /api/vision/auto-train	florence-2-base + YOLO	Auto-label + train in one step
GET /api/vision/train/:jobId	-	Training job status + metrics
GET /api/vision/models	-	List available models
DELETE /api/vision/models/:modelId	-	Delete custom model

Object Detection (YOLO)

json

// POST /api/vision/detect
{
  "image": "<base64_encoded_image>",
  "model": "yolo11n",
  "confidence": 0.25
}

Zero-Shot Detection (Florence-2)

json

// POST /api/vision/detect/zero-shot
{
  "image": "<base64_encoded_image>",
  "prompt": "coca-cola bottle, pepsi can"
}

Face Detection + Blurring

json

// POST /api/vision/detect/faces
{
  "image": "<base64_encoded_image>",
  "model": "mediapipe",
  "confidence": 0.5,
  "blur": true,
  "blurStrength": 51
}

Web Detection (reverse image search)

Tells you whether an image already exists elsewhere online (e.g. a product photo scraped from a stock site rather than shot on location). Two engines are available, selected with the engine field:

gcvision (default) — Google Cloud Vision: semantic web entities, best-guess labels and visually-similar images.
tineye — TinEye: exact and edited-copy matches with per-page provenance (host domain + crawl date). Best for copyright / where-is-my-image tracking. Processed in Canada.

json

// POST /api/vision/detect/web
{
  "image": "<base64_encoded_image>",
  "maxResults": 10,
  "engine": "tineye"   // optional, defaults to "gcvision"
}

json

// gcvision result (via /queue/result)
{
  "likelyFromWeb": true,
  "fullMatchingImages": [{ "url": "https://stock.example/a.jpg" }],
  "partialMatchingImages": [{ "url": "https://blog.example/b.jpg", "score": 0.7 }],
  "pagesWithMatchingImages": [{ "url": "https://stock.example/page", "pageTitle": "Storefront" }],
  "visuallySimilarImages": [{ "url": "https://sim.example/c.jpg" }],
  "webEntities": [{ "description": "storefront", "score": 0.9 }],
  "bestGuessLabels": ["storefront"],
  "model": "web-detection",
  "provider": "gcvision",
  "inferenceTimeMs": 412
}

json

// tineye result (via /queue/result)
{
  "likelyFromWeb": true,
  "totalResults": 2,
  "totalBacklinks": 3,
  "fullMatchingImages": [{ "url": "https://wp.example/a.jpg", "score": 0.97 }],
  "pagesWithMatchingImages": [
    { "url": "https://wp.example/post", "domain": "wp.example", "crawlDate": "2024-09-11", "score": 0.97 }
  ],
  "partialMatchingImages": [],
  "visuallySimilarImages": [],
  "webEntities": [],
  "bestGuessLabels": [],
  "model": "tineye-search",
  "provider": "tineye",
  "inferenceTimeMs": 1840
}
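Both engines return the same top-level list fields, so one helper can gather matching URLs regardless of engine. This is a sketch under that assumption; `matching_urls` is a hypothetical name, and it deliberately ignores `visuallySimilarImages`, which are look-alikes rather than matches.

```python
MATCH_FIELDS = ("fullMatchingImages", "partialMatchingImages", "pagesWithMatchingImages")


def matching_urls(result: dict) -> list[str]:
    """Collect unique URLs of matches, preserving first-seen order."""
    urls = []
    for field in MATCH_FIELDS:
        for item in result.get(field, []):
            url = item.get("url")
            if url and url not in urls:
                urls.append(url)
    return urls
```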

Detection Result (via /queue/result)

json

{
  "detections": [
    {
      "label": "coca-cola-bottle",
      "confidence": 0.94,
      "bbox": { "x": 120, "y": 45, "width": 80, "height": 200 },
      "count": 3
    }
  ],
  "model": "yolo11n",
  "inferenceTimeMs": 32,
  "imageSize": { "width": 1920, "height": 1080 }
}
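Downstream drawing or cropping code often wants corner coordinates rather than `x`/`y`/`width`/`height`. The sketch below converts the boxes and applies a client-side confidence cut. It assumes `x` and `y` are the top-left corner in pixels, which the result shown here does not state explicitly; both function names are hypothetical.

```python
def bbox_corners(bbox: dict) -> tuple:
    """Convert {x, y, width, height} to (left, top, right, bottom)."""
    # Assumption: x and y mark the top-left corner of the box.
    return (
        bbox["x"],
        bbox["y"],
        bbox["x"] + bbox["width"],
        bbox["y"] + bbox["height"],
    )


def confident_detections(result: dict, min_confidence: float = 0.25) -> list[dict]:
    """Keep detections at or above min_confidence, with corner coordinates."""
    return [
        {
            "label": detection["label"],
            "confidence": detection["confidence"],
            "corners": bbox_corners(detection["bbox"]),
        }
        for detection in result["detections"]
        if detection["confidence"] >= min_confidence
    ]
```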

Custom Model Training

Training applies defaults chosen from your dataset size: hyperparameters, augmentation level, and epoch count are set automatically. You can override epochs if needed.

json

// POST /api/vision/train
{
  "datasetB64": "<base64_zip_with_images_and_labels>",
  "baseModel": "yolo11n",
  "classes": ["coca-cola", "pepsi", "fanta"],
  "imageSize": 640
}
// epochs are auto-calculated based on dataset size

Training Response

json

{
  "jobId": "a1b2c3d4",
  "status": "pending",
  "datasetStats": { "train": 80, "val": 20, "total": 100 },
  "trainingProfile": "medium-dataset",
  "augmentation": "high",
  "effectiveEpochs": 300,
  "warnings": []
}

Auto-Label (Florence-2)

Send raw images and class names. Florence-2 generates bounding box labels automatically and packages them as a YOLO-format dataset zip.

json

// POST /api/vision/auto-label
{
  "images": ["<base64_image_1>", "<base64_image_2>", ...],
  "classes": ["coca-cola", "pepsi"],
  "confidence": 0.3
}

Auto-Label Response

json

{
  "datasetB64": "<base64_zip>",
  "format": "yolo-zip",
  "stats": {
    "totalImages": 50,
    "trainImages": 40,
    "valImages": 10,
    "imagesWithDetections": 45,
    "imagesWithoutDetections": 5,
    "detectionsPerClass": { "coca-cola": 42, "pepsi": 38 },
    "totalDetections": 80
  },
  "processingTimeMs": 12340
}
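Before passing `datasetB64` on to training, it can be useful to look inside the zip and spot-check the generated labels. The sketch below lists the archive members in memory. The helper name is hypothetical, and the internal folder layout of the zip is not documented here, so no particular paths are assumed.

```python
import base64
import io
import zipfile


def list_dataset_files(dataset_b64: str) -> list[str]:
    """Return the sorted member names of a base64-encoded dataset zip."""
    raw = base64.b64decode(dataset_b64)
    with zipfile.ZipFile(io.BytesIO(raw)) as archive:
        return sorted(archive.namelist())
```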

Auto-Train (Label + Train in one step)

Combines auto-labeling and training into a single request. Requires a minimum of 10 images.

json

// POST /api/vision/auto-train
{
  "images": ["<base64_image_1>", ...],
  "classes": ["coca-cola"],
  "baseModel": "yolo11n",
  "confidence": 0.3
}

Detection Parameters

Parameter	Type	Description
`image`	string	Base64-encoded image
`model`	string	Object detection: yolo11n/s/m or custom ID. Face detection: haarcascade, mediapipe, yolo-face (default: haarcascade)
`confidence`	number	Minimum confidence threshold (0-1, default 0.25 for objects, 0.5 for faces)
`prompt`	string	Text description for zero-shot detection
`blur`	boolean	Enable face blurring (face detection only)
`blurStrength`	number	Gaussian blur kernel size (odd number, default 51)
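Since `blurStrength` is a kernel size that must be odd, an even value is a likely client mistake. The sketch below builds a face detection body using the defaults from the table and rejects even kernel sizes up front. The function name and keyword arguments are hypothetical, and how the server reacts to an even value is not documented here.

```python
FACE_MODELS = {"haarcascade", "mediapipe", "yolo-face"}


def build_face_request(image_b64, *, model="haarcascade", confidence=0.5,
                       blur=False, blur_strength=51):
    """Build a body for POST /api/vision/detect/faces."""
    if model not in FACE_MODELS:
        raise ValueError(f"unknown face model: {model!r}")
    if not 0 <= confidence <= 1:
        raise ValueError("confidence must be between 0 and 1")
    body = {"image": image_b64, "model": model, "confidence": confidence}
    if blur:
        # The kernel size is documented as an odd number.
        if blur_strength < 1 or blur_strength % 2 == 0:
            raise ValueError("blurStrength must be a positive odd number")
        body["blur"] = True
        body["blurStrength"] = blur_strength
    return body
```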

Choosing the Right Approach

Scenario	Recommendation
< 10 images	Use zero-shot detection (Florence-2) - no training needed
10-50 images	Auto-label + aggressive training (auto-applied)
50-200 images	Standard training with high augmentation
200+ images	Optimal training with standard augmentation

Training Best Practices

Dataset Requirements

Size	Auto Epochs	Augmentation	Expected Quality
10-50	500	Aggressive	Basic - good for prototyping
50-200	300	High	Good - suitable for most use cases
200-1000	150	Standard	Production ready
1000+	100	Standard	Optimal
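To estimate which profile a dataset will get before uploading it, the table can be expressed as a lookup. This is only a sketch of the table, not the service's actual selection logic. The ranges share their boundary values (50, 200, 1000), so placing a boundary value in the larger bucket is an assumption.

```python
# (minimum image count, auto epochs, augmentation), largest bucket first.
PROFILES = [
    (1000, 100, "Standard"),
    (200, 150, "Standard"),
    (50, 300, "High"),
    (10, 500, "Aggressive"),
]


def training_defaults(image_count: int):
    """Return (epochs, augmentation) per the table, or None below 10 images."""
    for minimum, epochs, augmentation in PROFILES:
        if image_count >= minimum:
            return epochs, augmentation
    return None  # too few images to train; zero-shot detection fits better
```

For a 100-image dataset this returns `(300, "High")`, which matches the `effectiveEpochs` and `augmentation` values in the sample training response.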

Image Diversity Checklist

Different angles and perspectives
Varying lighting conditions (daylight, indoor, shadow)
Different backgrounds and contexts
Multiple distances (close-up, medium, far)
Partial occlusion (objects partially hidden)
Different orientations and rotations

Confidence Thresholds

Use Case	Threshold
Default / general use	0.25
High recall (catch everything)	0.15
High precision (minimize false positives)	0.50
Production / critical	0.40+

The training response includes a recommended_confidence based on the model's mAP metrics.

Quick Start: Train Your First Model

1. Collect 50+ images of your target object

More diverse images produce a better model. Mix angles, lighting, and backgrounds.

2. Auto-label with Florence-2

Send images to POST /api/vision/auto-label, or use POST /api/vision/auto-train to label and train in one request.

3. Train and detect

Smart defaults handle hyperparameters. Poll GET /api/vision/train/:jobId until the job completes, then use your custom model ID in detection requests.
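The polling step can be sketched as a small loop. To stay independent of any HTTP client, the sketch takes a caller-supplied `fetch_status` function (hypothetical) that performs GET /api/vision/train/:jobId and returns the parsed JSON. The terminal status names `"completed"` and `"failed"` are assumptions; only `"pending"` appears in the training response on this page.

```python
import time

TERMINAL_STATUSES = {"completed", "failed"}  # assumed status names


def wait_for_job(fetch_status, job_id, *, interval=5.0, timeout=3600.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status(job_id) until the job reaches a terminal status."""
    deadline = clock() + timeout
    while True:
        job = fetch_status(job_id)
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout} seconds")
        sleep(interval)
```

Passing `sleep` and `clock` as parameters keeps the loop testable without real waiting. In production the defaults are enough.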