Image, Video & Audio
Generate images, videos, and audio using dedicated endpoints.
Image Generation
Available via POST /api/llm/generate-image.
Request
{
"provider": "openai",
"prompt": "A futuristic city skyline at sunset, digital art",
"model": "gpt-image-1",
"n": 1,
"size": "1024x1024",
"quality": "hd",
"tag": "marketing"
}
Ollama (Local) image request
{
"provider": "ollama",
"prompt": "A watercolor painting of a forest",
"model": "flux2-klein",
"n": 1,
"size": "1024x1024",
"tag": "local-gen"
}
Gemini / Vertex image request
{
"provider": "gemini",
"prompt": "A watercolor painting of a forest",
"model": "imagen-4.0-generate-001",
"n": 2,
"aspectRatio": "16:9",
"tag": "marketing"
}
Result (via /queue/result)
{
"success": true,
"data": {
"queueId": "img_abc123",
"status": "completed",
"result": {
"images": [
{ "data": "<base64_encoded_image>", "mimeType": "image/png" }
],
"model": "gpt-image-1",
"provider": "openai"
}
}
}
Parameters
| Parameter | Type | Description |
|---|---|---|
| provider | string | openai, gemini, vertex, xai, ollama |
| prompt | string | Image description |
| model | string | Model ID (optional, uses default) |
| n | number | Number of images (1-10) |
| size | string | e.g. "1024x1024" (OpenAI) |
| aspectRatio | string | e.g. "16:9" (Gemini/Vertex) |
| quality | string | "standard", "hd", "ultra" |
| tag | string | Optional label for usage tracking (max 100 chars) |
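The limits in the table above can be checked on the client before a request is queued. The following is a minimal sketch, not the service's own validation: the helper name `build_image_request` is hypothetical, and it assumes `prompt` is required and that omitted optional fields fall back to server defaults.

```python
import json

# Providers listed in the parameter table above.
IMAGE_PROVIDERS = {"openai", "gemini", "vertex", "xai", "ollama"}


def build_image_request(provider, prompt, *, model=None, n=1, size=None,
                        aspect_ratio=None, quality=None, tag=None):
    """Return a JSON-serializable body for POST /api/llm/generate-image.

    Hypothetical helper: it only enforces the limits documented in the table.
    """
    if provider not in IMAGE_PROVIDERS:
        raise ValueError(f"unsupported provider: {provider!r}")
    if not prompt:
        # Assumption: the endpoint rejects an empty prompt.
        raise ValueError("prompt must not be empty")
    if not 1 <= n <= 10:
        raise ValueError("n must be between 1 and 10")
    if tag is not None and len(tag) > 100:
        raise ValueError("tag must be at most 100 characters")

    body = {"provider": provider, "prompt": prompt, "n": n}
    optional = {"model": model, "size": size, "aspectRatio": aspect_ratio,
                "quality": quality, "tag": tag}
    # Leave out unset fields so the server can apply its defaults.
    body.update({key: value for key, value in optional.items() if value is not None})
    return body


if __name__ == "__main__":
    request = build_image_request("gemini", "A watercolor painting of a forest",
                                  n=2, aspect_ratio="16:9", tag="marketing")
    print(json.dumps(request, indent=2))
```

The returned dictionary can be serialized with `json.dumps` and sent with whichever HTTP client the project already uses.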
Video Generation
Available via POST /api/llm/generate-video.
Request
{
"provider": "gemini",
"prompt": "A timelapse of clouds moving over a mountain range",
"model": "veo-3.1-generate-preview",
"aspectRatio": "16:9",
"durationSeconds": 8,
"numberOfVideos": 1,
"tag": "content-creation"
}
xAI Image-to-Video
Pass an imageUrl to generate a video from a source image (xAI only).
{
"provider": "xai",
"prompt": "Slow camera zoom out with cinematic lighting",
"model": "grok-imagine-video",
"imageUrl": "https://example.com/photo.jpg"
}
Result (via /queue/result)
Video generation is asynchronous. Results are available via /queue/result.
{
"success": true,
"data": {
"queueId": "vid_abc123",
"status": "completed",
"result": {
"videos": [
{ "data": "<base64_encoded_video>", "mimeType": "video/mp4" }
],
"model": "veo-3.1-generate-preview",
"provider": "gemini"
}
}
}
Parameters
| Parameter | Type | Description |
|---|---|---|
| provider | string | gemini, vertex, xai |
| prompt | string | Video description |
| model | string | Model ID (default: veo-3.1-generate-preview) |
| aspectRatio | string | e.g. "16:9", "9:16", "1:1" |
| durationSeconds | number | Duration in seconds (1-60) |
| numberOfVideos | number | Number of videos (1-4) |
| tag | string | Optional label for usage tracking (max 100 chars) |
| resolution | string | e.g. "1080p" (xAI) |
| imageUrl | string | Source image URL for image-to-video (xAI) |
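Because some video fields are provider-specific, it can help to lint a request body before submitting it. The sketch below is illustrative only: `check_video_request` is a hypothetical helper, and treating `resolution` and `imageUrl` as xAI-only is an assumption read from the "(xAI)" notes in the table.

```python
VIDEO_PROVIDERS = {"gemini", "vertex", "xai"}
# Assumption: fields marked "(xAI)" in the table are not accepted by other providers.
XAI_ONLY_FIELDS = ("resolution", "imageUrl")


def check_video_request(body):
    """Return a list of problems found in a generate-video body (empty if none)."""
    problems = []
    provider = body.get("provider")
    if provider not in VIDEO_PROVIDERS:
        problems.append("provider must be one of gemini, vertex, xai")

    duration = body.get("durationSeconds")
    if duration is not None and not 1 <= duration <= 60:
        problems.append("durationSeconds must be between 1 and 60")

    count = body.get("numberOfVideos")
    if count is not None and not 1 <= count <= 4:
        problems.append("numberOfVideos must be between 1 and 4")

    tag = body.get("tag")
    if tag is not None and len(tag) > 100:
        problems.append("tag must be at most 100 characters")

    for field in XAI_ONLY_FIELDS:
        if field in body and provider != "xai":
            problems.append(f"{field} is documented for xAI only")
    return problems


if __name__ == "__main__":
    request = {
        "provider": "gemini",
        "prompt": "A timelapse of clouds moving over a mountain range",
        "durationSeconds": 8,
        "imageUrl": "https://example.com/photo.jpg",
    }
    for problem in check_video_request(request):
        print("problem:", problem)
```

Returning a list rather than raising on the first error lets a caller report every problem at once.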
Text-to-Speech (TTS)
Available via POST /api/llm/generate-audio.
ElevenLabs request
{
"provider": "elevenlabs",
"text": "Hello, welcome to the interview.",
"model": "eleven_flash_v2_5",
"voiceId": "JBFqnCBsd6RMkjVDRZzb",
"outputFormat": "mp3_44100_128",
"tag": "interview"
}
MLX Audio (Local) request
{
"provider": "mlxaudio",
"text": "Hallo, welkom bij het interview.",
"model": "kokoro",
"voiceId": "af_heart",
"tag": "interview"
}
Result (via /queue/result)
{
"success": true,
"data": {
"queueId": "llm_abc123",
"status": "completed",
"result": {
"audio": "<base64_encoded_audio>",
"mimeType": "audio/mpeg",
"model": "eleven_flash_v2_5",
"characterCount": 32
}
}
}
Parameters
| Parameter | Type | Description |
|---|---|---|
| provider | string | "elevenlabs" or "mlxaudio" |
| text | string | Text to convert to speech |
| model | string | TTS model ID |
| voiceId | string | Voice ID (ElevenLabs ID or MLX Audio voice name) |
| outputFormat | string | mp3_44100_128, pcm_16000, etc. (ElevenLabs only) |
| tag | string | Optional label for usage tracking |
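The result carries the audio as a base64 string, so a client has to decode it before playing or saving it. Here is a minimal sketch of that step, assuming the response shape shown in the result example above. The helper name `extract_audio` and the MIME-type-to-extension mapping are illustrative and not part of the API.

```python
import base64

# Assumed mapping; only audio/mpeg appears in the result example above.
EXTENSIONS = {"audio/mpeg": ".mp3", "audio/wav": ".wav"}


def extract_audio(response):
    """Return (audio_bytes, file_extension) from a completed /queue/result response."""
    data = response.get("data") or {}
    if not response.get("success") or data.get("status") != "completed":
        raise RuntimeError(f"job not completed: status={data.get('status')!r}")
    result = data["result"]
    audio_bytes = base64.b64decode(result["audio"])
    # Fall back to a generic extension for MIME types not in the assumed mapping.
    extension = EXTENSIONS.get(result.get("mimeType"), ".bin")
    return audio_bytes, extension


if __name__ == "__main__":
    sample = {
        "success": True,
        "data": {
            "queueId": "llm_abc123",
            "status": "completed",
            "result": {
                # Placeholder bytes standing in for real MP3 data.
                "audio": base64.b64encode(b"fake-mp3-bytes").decode("ascii"),
                "mimeType": "audio/mpeg",
            },
        },
    }
    audio_bytes, extension = extract_audio(sample)
    with open("speech" + extension, "wb") as handle:
        handle.write(audio_bytes)
```

The same decoding should apply to the sound effect and music results, which use the same `audio` and `mimeType` fields.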
Sound Effects
Available via POST /api/llm/generate-sound-effect. Powered by ElevenLabs text-to-sound-effects.
Request
{
"text": "thunder rolling in the distance, rain on a tin roof",
"model": "eleven_text_to_sound_v2",
"durationSeconds": 10,
"promptInfluence": 0.5,
"tag": "ambient"
}
Result (via /queue/result)
{
"success": true,
"data": {
"queueId": "llm_abc123",
"status": "completed",
"result": {
"audio": "<base64_encoded_audio>",
"mimeType": "audio/mpeg",
"durationSeconds": 10,
"model": "eleven_text_to_sound_v2"
}
}
}
Parameters
| Parameter | Type | Description |
|---|---|---|
| text | string | Sound effect description (required) |
| model | string | Model ID (default: eleven_text_to_sound_v2) |
| durationSeconds | number | Duration 0.5-30 seconds |
| promptInfluence | number | How closely to follow the prompt (0-1, default 0.3) |
| loop | boolean | Generate a loopable sound effect |
| tag | string | Optional label for usage tracking |
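As a rough illustration of the ranges above, the sketch below builds a sound effect body and rejects out-of-range values. `build_sound_effect_request` is a hypothetical helper. It leaves out unset fields on the assumption that the server then applies the documented defaults.

```python
def build_sound_effect_request(text, *, duration_seconds=None, prompt_influence=None,
                               loop=None, model=None, tag=None):
    """Return a body for POST /api/llm/generate-sound-effect (hypothetical helper)."""
    if not text:
        raise ValueError("text is required")
    if duration_seconds is not None and not 0.5 <= duration_seconds <= 30:
        raise ValueError("durationSeconds must be between 0.5 and 30")
    if prompt_influence is not None and not 0 <= prompt_influence <= 1:
        raise ValueError("promptInfluence must be between 0 and 1")

    body = {"text": text}
    optional = {"model": model, "durationSeconds": duration_seconds,
                "promptInfluence": prompt_influence, "loop": loop, "tag": tag}
    # Unset fields are omitted so documented defaults (e.g. promptInfluence 0.3) apply.
    body.update({key: value for key, value in optional.items() if value is not None})
    return body


if __name__ == "__main__":
    print(build_sound_effect_request("thunder rolling in the distance",
                                     duration_seconds=10, loop=True))
```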
Music Generation
Available via POST /api/llm/generate-music. Powered by ElevenLabs text-to-music.
Request
{
"prompt": "upbeat jazz jingle, 10 seconds, bright piano and saxophone",
"model": "music_v1",
"durationMs": 10000,
"forceInstrumental": true,
"tag": "jingle"
}
Result (via /queue/result)
{
"success": true,
"data": {
"queueId": "llm_abc123",
"status": "completed",
"result": {
"audio": "<base64_encoded_audio>",
"mimeType": "audio/mpeg",
"durationMs": 10000,
"model": "music_v1"
}
}
}
Parameters
| Parameter | Type | Description |
|---|---|---|
| prompt | string | Music description (required) |
| model | string | Model ID (default: music_v1) |
| durationMs | number | Duration in milliseconds (3000-600000) |
| forceInstrumental | boolean | Force instrumental only (no vocals) |
| tag | string | Optional label for usage tracking |
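Unlike the sound effect endpoint, music duration is given in milliseconds, which is easy to get wrong. The sketch below converts seconds to `durationMs` and checks the documented range. The helper `build_music_request` is hypothetical, and always sending `durationMs` is a choice of this sketch rather than a documented requirement.

```python
def build_music_request(prompt, seconds, *, force_instrumental=False, tag=None):
    """Return a body for POST /api/llm/generate-music (hypothetical helper)."""
    if not prompt:
        raise ValueError("prompt is required")
    # The endpoint expects milliseconds; the documented range is 3000-600000.
    duration_ms = round(seconds * 1000)
    if not 3000 <= duration_ms <= 600000:
        raise ValueError("duration must be between 3 and 600 seconds")

    body = {"prompt": prompt, "durationMs": duration_ms,
            "forceInstrumental": force_instrumental}
    if tag is not None:
        body["tag"] = tag
    return body


if __name__ == "__main__":
    print(build_music_request("upbeat jazz jingle, bright piano and saxophone",
                              10, force_instrumental=True, tag="jingle"))
```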
Dialogue (Multi-Speaker)
Available via POST /api/llm/generate-dialogue. Generate multi-speaker dialogue audio where each turn uses a different voice. Only the eleven_v3 model is supported. Maximum 10 unique voice IDs per request.
Example Request
POST /api/llm/generate-dialogue
{
"inputs": [
{ "text": "Hello, how are you?", "voiceId": "JBFqnCBsd6RMkjVDRZzb" },
{ "text": "I'm doing great, thanks!", "voiceId": "Aw4FAjKCGjjNkVhN1Xmq" }
],
"outputFormat": "mp3_44100_128",
"languageCode": "en"
}
Parameters
| Parameter | Type | Description |
|---|---|---|
| inputs | array | Array of {text, voiceId} objects (required, 1-100 items) |
| model | string | Only eleven_v3 (default) |
| outputFormat | string | Audio format (default: mp3_44100_128) |
| languageCode | string | Language code (e.g. "en", "nl") |
| voiceSettings | object | Voice settings (stability, similarityBoost, style, useSpeakerBoost) |
| seed | integer | Reproducibility seed (0-4294967295) |
| applyTextNormalization | string | "auto", "on", or "off" |
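The dialogue endpoint has two separate limits: the number of turns (1-100) and the number of distinct voices (at most 10). The sketch below shows one way to check both before sending. `build_dialogue_request` is a hypothetical helper that takes turns as `(text, voice_id)` pairs.

```python
def build_dialogue_request(turns, *, output_format=None, language_code=None, seed=None):
    """Return a body for POST /api/llm/generate-dialogue (hypothetical helper).

    turns is an iterable of (text, voice_id) pairs, in speaking order.
    """
    inputs = [{"text": text, "voiceId": voice_id} for text, voice_id in turns]
    if not 1 <= len(inputs) <= 100:
        raise ValueError("inputs must contain between 1 and 100 items")

    # The limit applies to distinct voices, not to the number of turns.
    unique_voices = {item["voiceId"] for item in inputs}
    if len(unique_voices) > 10:
        raise ValueError("at most 10 unique voice IDs are allowed per request")
    if seed is not None and not 0 <= seed <= 4294967295:
        raise ValueError("seed must be between 0 and 4294967295")

    body = {"inputs": inputs}
    optional = {"outputFormat": output_format, "languageCode": language_code, "seed": seed}
    body.update({key: value for key, value in optional.items() if value is not None})
    return body


if __name__ == "__main__":
    request = build_dialogue_request(
        [("Hello, how are you?", "JBFqnCBsd6RMkjVDRZzb"),
         ("I'm doing great, thanks!", "Aw4FAjKCGjjNkVhN1Xmq")],
        language_code="en",
    )
    print(request)
```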
Voice Previews
Browse and preview all available TTS voices — including ElevenLabs and MLX Audio — in the .
Speech-to-Text (STT)
Available via POST /api/llm/transcribe.
ElevenLabs request (with diarization)
{
"provider": "elevenlabs",
"audio": "<base64_encoded_audio>",
"mimeType": "audio/mpeg",
"model": "scribe_v2",
"language": "nl",
"tag": "interview",
"diarize": true,
"numSpeakers": 2,
"timestampsGranularity": "word",
"tagAudioEvents": true
}
MLX Audio (Local) request
{
"provider": "mlxaudio",
"audio": "<base64_encoded_audio>",
"mimeType": "audio/wav",
"model": "whisper-large-v3",
"language": "nl",
"tag": "transcription"
}
Result (via /queue/result)
{
"success": true,
"data": {
"queueId": "llm_abc123",
"status": "completed",
"result": {
"text": "Hello, welcome to the interview.",
"language": "nl",
"model": "scribe_v2",
"words": [
{ "text": "Hello,", "start": 0.08, "end": 0.54, "type": "word", "speakerId": "speaker_0" },
{ "text": "welcome", "start": 0.56, "end": 0.92, "type": "word", "speakerId": "speaker_0" },
{ "text": "to", "start": 0.94, "end": 1.02, "type": "word", "speakerId": "speaker_0" }
]
}
}
}
Parameters
| Parameter | Type | Description |
|---|---|---|
| provider | string | "elevenlabs" or "mlxaudio" |
| audio | string | Base64-encoded audio data |
| mimeType | string | audio/mpeg, audio/wav, etc. |
| model | string | STT model ID |
| language | string | ISO language code (optional) |
| tag | string | Optional label for usage tracking |
| diarize | boolean | Enable speaker diarization (ElevenLabs only) |
| numSpeakers | number | Expected number of speakers, 1-32 (ElevenLabs only) |
| timestampsGranularity | string | "word" or "character" (ElevenLabs, default: "word") |
| tagAudioEvents | boolean | Tag audio events like laughter, applause (ElevenLabs only) |
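With diarization enabled, each entry in `words` carries a `speakerId`, so consecutive words can be merged into speaker turns. The following sketch assumes the `words` shape from the result example above. The helper name `speaker_turns` and the decision to skip entries whose `type` is not `"word"` are assumptions of this sketch.

```python
def speaker_turns(words):
    """Merge consecutive words from the same speaker into turns."""
    turns = []
    for word in words:
        # Assumption: entries of other types (e.g. tagged audio events) are skipped.
        if word.get("type", "word") != "word":
            continue
        speaker = word.get("speakerId")
        if turns and turns[-1]["speakerId"] == speaker:
            # Same speaker as the previous word: extend the current turn.
            turns[-1]["text"] += " " + word["text"]
            turns[-1]["end"] = word["end"]
        else:
            turns.append({"speakerId": speaker, "text": word["text"],
                          "start": word["start"], "end": word["end"]})
    return turns


if __name__ == "__main__":
    words = [
        {"text": "Hello,", "start": 0.08, "end": 0.54, "type": "word", "speakerId": "speaker_0"},
        {"text": "welcome", "start": 0.56, "end": 0.92, "type": "word", "speakerId": "speaker_0"},
        {"text": "Thanks.", "start": 1.50, "end": 1.90, "type": "word", "speakerId": "speaker_1"},
    ]
    for turn in speaker_turns(words):
        print(f"[{turn['start']:.2f}-{turn['end']:.2f}] {turn['speakerId']}: {turn['text']}")
```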
Image Recognition (Vision)
Image recognition endpoints live under /api/vision/. The local detection endpoints (object, zero-shot, faces, training) require VISION_SERVICE_ENABLED=true. Web detection runs on Google Cloud Vision by default (the gcvision provider, billed per analysis), or on TinEye when selected, and does not use the local sidecar.
| Endpoint | Model | Use Case |
|---|---|---|
| POST /api/vision/detect | yolo11n/s/m | Object detection (COCO classes) |
| POST /api/vision/detect/zero-shot | florence-2-base | Zero-shot detection (text prompt) |
| POST /api/vision/detect/faces | haarcascade, mediapipe, yolo-face | Face detection + optional blur |
| POST /api/vision/detect/web | gcvision (default) · tineye | Reverse image search: checks whether the image appears elsewhere on the web. Select the engine with the engine field |
| POST /api/vision/train | - | Start custom model training (smart defaults) |
| POST /api/vision/auto-label | florence-2-base | Auto-label images using Florence-2 |
| POST /api/vision/auto-train | florence-2-base + YOLO | Auto-label + train in one step |
| GET /api/vision/train/:jobId | - | Training job status + metrics |
| GET /api/vision/models | - | List available models |
| DELETE /api/vision/models/:modelId | - | Delete custom model |
Object Detection (YOLO)
// POST /api/vision/detect
{
"image": "<base64_encoded_image>",
"model": "yolo11n",
"confidence": 0.25
}
Zero-Shot Detection (Florence-2)
// POST /api/vision/detect/zero-shot
{
"image": "<base64_encoded_image>",
"prompt": "coca-cola bottle, pepsi can"
}
Face Detection + Blurring
// POST /api/vision/detect/faces
{
"image": "<base64_encoded_image>",
"model": "mediapipe",
"confidence": 0.5,
"blur": true,
"blurStrength": 51
}
Web Detection (reverse image search)
Tells you if an image already exists elsewhere online (e.g. a product photo scraped from a stock site rather than shot on location). Two engines via the engine field:
- gcvision (default): Google Cloud Vision. Returns semantic web entities, best-guess labels, and visually similar images.
- tineye: TinEye. Returns exact and edited-copy matches with per-page provenance (host domain and crawl date). Best for copyright and where-is-my-image tracking. Processed in Canada.
// POST /api/vision/detect/web
{
"image": "<base64_encoded_image>",
"maxResults": 10,
"engine": "tineye" // optional, defaults to "gcvision"
}
// gcvision result (via /queue/result)
{
"likelyFromWeb": true,
"fullMatchingImages": [{ "url": "https://stock.example/a.jpg" }],
"partialMatchingImages": [{ "url": "https://blog.example/b.jpg", "score": 0.7 }],
"pagesWithMatchingImages": [{ "url": "https://stock.example/page", "pageTitle": "Storefront" }],
"visuallySimilarImages": [{ "url": "https://sim.example/c.jpg" }],
"webEntities": [{ "description": "storefront", "score": 0.9 }],
"bestGuessLabels": ["storefront"],
"model": "web-detection",
"provider": "gcvision",
"inferenceTimeMs": 412
}
// tineye result (via /queue/result)
{
"likelyFromWeb": true,
"totalResults": 2,
"totalBacklinks": 3,
"fullMatchingImages": [{ "url": "https://wp.example/a.jpg", "score": 0.97 }],
"pagesWithMatchingImages": [
{ "url": "https://wp.example/post", "domain": "wp.example", "crawlDate": "2024-09-11", "score": 0.97 }
],
"partialMatchingImages": [],
"visuallySimilarImages": [],
"webEntities": [],
"bestGuessLabels": [],
"model": "tineye-search",
"provider": "tineye",
"inferenceTimeMs": 1840
}
Detection Result (via /queue/result)
{
"detections": [
{
"label": "coca-cola-bottle",
"confidence": 0.94,
"bbox": { "x": 120, "y": 45, "width": 80, "height": 200 },
"count": 3
}
],
"model": "yolo11n",
"inferenceTimeMs": 32,
"imageSize": { "width": 1920, "height": 1080 }
}
Custom Model Training
Training uses smart defaults based on your dataset size. Hyperparameters, augmentation, and epochs are automatically optimized. You can override epochs if needed.
// POST /api/vision/train
{
"datasetB64": "<base64_zip_with_images_and_labels>",
"baseModel": "yolo11n",
"classes": ["coca-cola", "pepsi", "fanta"],
"imageSize": 640
}
// epochs are auto-calculated based on dataset size
Training Response
{
"jobId": "a1b2c3d4",
"status": "pending",
"datasetStats": { "train": 80, "val": 20, "total": 100 },
"trainingProfile": "medium-dataset",
"augmentation": "high",
"effectiveEpochs": 300,
"warnings": []
}
Auto-Label (Florence-2)
Send raw images + class names. Florence-2 generates bounding box labels automatically and packages them as a YOLO-format dataset zip.
// POST /api/vision/auto-label
{
"images": ["<base64_image_1>", "<base64_image_2>", ...],
"classes": ["coca-cola", "pepsi"],
"confidence": 0.3
}
Auto-Label Response
{
"datasetB64": "<base64_zip>",
"format": "yolo-zip",
"stats": {
"totalImages": 50,
"trainImages": 40,
"valImages": 10,
"imagesWithDetections": 45,
"imagesWithoutDetections": 5,
"detectionsPerClass": { "coca-cola": 42, "pepsi": 38 },
"totalDetections": 80
},
"processingTimeMs": 12340
}
Auto-Train (Label + Train in one step)
Combines auto-labeling and training into a single request. Requires a minimum of 10 images.
// POST /api/vision/auto-train
{
"images": ["<base64_image_1>", ...],
"classes": ["coca-cola"],
"baseModel": "yolo11n",
"confidence": 0.3
}
Detection Parameters
| Parameter | Type | Description |
|---|---|---|
| image | string | Base64-encoded image |
| model | string | Object detection: yolo11n/s/m or custom ID. Face detection: haarcascade, mediapipe, yolo-face (default: haarcascade) |
| confidence | number | Minimum confidence threshold (0-1, default 0.25 for objects, 0.5 for faces) |
| prompt | string | Text description for zero-shot detection |
| blur | boolean | Enable face blurring (face detection only) |
| blurStrength | number | Gaussian blur kernel size (odd number, default 51) |
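To illustrate how these parameters fit together, the sketch below encodes raw image bytes and assembles a face detection body. It assumes the `image` field takes plain base64 with no `data:` URL prefix, which the examples suggest but do not state. `build_face_request` is a hypothetical helper.

```python
import base64


def build_face_request(image_bytes, *, model="haarcascade", confidence=0.5,
                       blur=False, blur_strength=51):
    """Return a body for POST /api/vision/detect/faces (hypothetical helper)."""
    if not 0 <= confidence <= 1:
        raise ValueError("confidence must be between 0 and 1")
    if blur and (blur_strength < 1 or blur_strength % 2 == 0):
        # The table documents blurStrength as an odd kernel size.
        raise ValueError("blurStrength must be a positive odd number")

    body = {
        # Assumption: plain base64 text, without a data: URL prefix.
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "model": model,
        "confidence": confidence,
    }
    if blur:
        body["blur"] = True
        body["blurStrength"] = blur_strength
    return body


if __name__ == "__main__":
    # Placeholder bytes; in practice, read them from an image file opened in "rb" mode.
    request = build_face_request(b"\x89PNG placeholder", model="mediapipe", blur=True)
    print(sorted(request))
```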
Choosing the Right Approach
| Scenario | Recommendation |
|---|---|
| < 10 images | Use zero-shot detection (Florence-2) - no training needed |
| 10-50 images | Auto-label + aggressive training (auto-applied) |
| 50-200 images | Standard training with high augmentation |
| 200+ images | Optimal training with standard augmentation |
Training Best Practices
Dataset Requirements
| Size | Auto Epochs | Augmentation | Expected Quality |
|---|---|---|---|
| 10-50 | 500 | Aggressive | Basic - good for prototyping |
| 50-200 | 300 | High | Good - suitable for most use cases |
| 200-1000 | 150 | Standard | Production ready |
| 1000+ | 100 | Standard | Optimal |
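The table can be read as a simple lookup from dataset size to training profile. The sketch below mirrors it on the client for planning purposes and is not the service's actual selection logic. Because the ranges in the table share their endpoints (50, 200, 1000), it assumes a boundary value belongs to the larger tier. `expected_profile` is a hypothetical name.

```python
def expected_profile(total_images):
    """Return the epochs and augmentation the table suggests, or None below 10 images."""
    if total_images < 10:
        # Below the documented minimum; zero-shot detection is recommended instead.
        return None
    # Assumption: a boundary value (50, 200, 1000) falls into the larger tier.
    if total_images < 50:
        return {"epochs": 500, "augmentation": "aggressive"}
    if total_images < 200:
        return {"epochs": 300, "augmentation": "high"}
    if total_images < 1000:
        return {"epochs": 150, "augmentation": "standard"}
    return {"epochs": 100, "augmentation": "standard"}


if __name__ == "__main__":
    for size in (8, 30, 100, 500, 5000):
        print(size, expected_profile(size))
```

For 100 images this gives 300 epochs with high augmentation, which agrees with the `effectiveEpochs` and `augmentation` values in the training response example.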
Image Diversity Checklist
- Different angles and perspectives
- Varying lighting conditions (daylight, indoor, shadow)
- Different backgrounds and contexts
- Multiple distances (close-up, medium, far)
- Partial occlusion (objects partially hidden)
- Different orientations and rotations
Confidence Thresholds
| Use Case | Threshold |
|---|---|
| Default / general use | 0.25 |
| High recall (catch everything) | 0.15 |
| High precision (minimize false positives) | 0.50 |
| Production / critical | 0.40+ |
The training response includes a recommended_confidence based on the model's mAP metrics.
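The thresholds above can also be applied on the client to a detection result that was requested with a lower `confidence`. A minimal sketch follows, assuming the detection result shape shown earlier. The preset names in `THRESHOLDS` are made up for illustration, and "0.40+" is taken as a minimum of 0.40.

```python
# Preset names are hypothetical; values come from the threshold table above.
THRESHOLDS = {
    "default": 0.25,
    "high_recall": 0.15,
    "high_precision": 0.50,
    "production": 0.40,
}


def filter_detections(result, use_case="default"):
    """Keep only detections at or above the threshold for the given use case."""
    threshold = THRESHOLDS[use_case]
    return [d for d in result.get("detections", []) if d["confidence"] >= threshold]


if __name__ == "__main__":
    result = {
        "detections": [
            {"label": "coca-cola-bottle", "confidence": 0.94},
            {"label": "pepsi-can", "confidence": 0.31},
            {"label": "fanta-can", "confidence": 0.18},
        ],
        "model": "yolo11n",
    }
    for use_case in THRESHOLDS:
        kept = [d["label"] for d in filter_detections(result, use_case)]
        print(use_case, kept)
```

Client-side filtering can only tighten a result. Detections below the `confidence` sent in the request are never returned, so request with the lowest threshold you may need.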
Quick Start: Train Your First Model
1. Collect 50+ images of your target object. More diverse images = better model. Mix angles, lighting, backgrounds.
2. Auto-label with Florence-2. Send images to POST /api/vision/auto-label, or use POST /api/vision/auto-train to label and train in one step.
3. Train and detect. Smart defaults handle hyperparameters. Poll GET /api/vision/train/:jobId until complete, then use your custom model ID in detection requests.
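The polling step can be sketched as a small loop that is independent of any HTTP client. In the sketch below, `fetch_status` is a caller-supplied function that performs GET /api/vision/train/:jobId and returns the parsed JSON. Only the "pending" status appears in the examples above, so "running" and the meaning of every other status as "finished" are assumptions.

```python
import time

# "pending" appears in the training response example; "running" is an assumed value.
IN_PROGRESS = {"pending", "running"}


def wait_for_training(fetch_status, job_id, *, interval=5.0, timeout=3600.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll a training job until it leaves the in-progress states.

    fetch_status(job_id) must return the job as a dict with a "status" key.
    """
    deadline = clock() + timeout
    while True:
        job = fetch_status(job_id)
        if job.get("status") not in IN_PROGRESS:
            return job
        if clock() >= deadline:
            raise TimeoutError(f"training job {job_id} still in progress after {timeout}s")
        sleep(interval)


if __name__ == "__main__":
    # Fake status source standing in for the real HTTP call.
    responses = iter([{"status": "pending"}, {"status": "running"}, {"status": "completed"}])
    job = wait_for_training(lambda job_id: next(responses), "a1b2c3d4", interval=0.01)
    print(job["status"])
```

Injecting `sleep` and `clock` keeps the loop testable without real waiting. A production client would probably also add retry handling around `fetch_status`.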