Image, Video & Audio

Generate images, videos, and audio using dedicated endpoints.

Image Generation

Available via POST /api/llm/generate-image.

Request

json
{
  "provider": "openai",
  "prompt": "A futuristic city skyline at sunset, digital art",
  "model": "gpt-image-1",
  "n": 1,
  "size": "1024x1024",
  "quality": "hd",
  "tag": "marketing"
}

Ollama (Local) image request

json
{
  "provider": "ollama",
  "prompt": "A watercolor painting of a forest",
  "model": "flux2-klein",
  "n": 1,
  "size": "1024x1024",
  "tag": "local-gen"
}

Gemini / Vertex image request

json
{
  "provider": "gemini",
  "prompt": "A watercolor painting of a forest",
  "model": "imagen-4.0-generate-001",
  "n": 2,
  "aspectRatio": "16:9",
  "tag": "marketing"
}

Result (via /queue/result)

json
{
  "success": true,
  "data": {
    "queueId": "img_abc123",
    "status": "completed",
    "result": {
      "images": [
        { "data": "<base64_encoded_image>", "mimeType": "image/png" }
      ],
      "model": "gpt-image-1",
      "provider": "openai"
    }
  }
}
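
The images arrive as base64 strings, so a client has to decode them before writing files. A minimal Python sketch, assuming the result envelope shown above; the `decode_images` helper, the extension table, and the file-naming scheme are illustrative choices, not part of the API:

```python
import base64

# Hypothetical MIME-to-extension table; extend it for other formats you receive.
EXTENSIONS = {"image/png": ".png", "image/jpeg": ".jpg", "image/webp": ".webp"}


def decode_images(payload: dict) -> list[tuple[str, bytes]]:
    """Return (suggested_filename, raw_bytes) pairs from a completed image result."""
    data = payload["data"]
    if data["status"] != "completed":
        raise ValueError(f"job {data['queueId']} is not completed: {data['status']}")
    files = []
    for index, image in enumerate(data["result"]["images"]):
        extension = EXTENSIONS.get(image["mimeType"], ".bin")
        raw = base64.b64decode(image["data"])
        files.append((f"{data['queueId']}_{index}{extension}", raw))
    return files
```

Each pair can then be written to disk with `pathlib.Path(name).write_bytes(raw)`.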

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| provider | string | openai, gemini, vertex, xai, ollama |
| prompt | string | Image description |
| model | string | Model ID (optional, uses default) |
| n | number | Number of images (1-10) |
| size | string | e.g. "1024x1024" (OpenAI) |
| aspectRatio | string | e.g. "16:9" (Gemini/Vertex) |
| quality | string | "standard", "hd", "ultra" |
| tag | string | Optional label for usage tracking (max 100 chars) |
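
The documented limits can be checked on the client before a request is sent. The sketch below is one possible way to do that in Python; `build_image_request` is a hypothetical helper, and the server may enforce additional rules that this table does not list:

```python
from typing import Optional

PROVIDERS = {"openai", "gemini", "vertex", "xai", "ollama"}
QUALITIES = {"standard", "hd", "ultra"}


def build_image_request(
    provider: str,
    prompt: str,
    *,
    model: Optional[str] = None,
    n: int = 1,
    size: Optional[str] = None,
    aspect_ratio: Optional[str] = None,
    quality: Optional[str] = None,
    tag: Optional[str] = None,
) -> dict:
    """Build a JSON body for POST /api/llm/generate-image, checking documented limits."""
    if provider not in PROVIDERS:
        raise ValueError(f"unknown provider: {provider!r}")
    if not 1 <= n <= 10:
        raise ValueError("n must be between 1 and 10")
    if quality is not None and quality not in QUALITIES:
        raise ValueError(f"unknown quality: {quality!r}")
    if tag is not None and len(tag) > 100:
        raise ValueError("tag must be at most 100 characters")
    body = {"provider": provider, "prompt": prompt, "n": n}
    optional = {
        "model": model,
        "size": size,
        "aspectRatio": aspect_ratio,
        "quality": quality,
        "tag": tag,
    }
    # Only send optional fields that were actually set.
    body.update({key: value for key, value in optional.items() if value is not None})
    return body
```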

Video Generation

Available via POST /api/llm/generate-video.

Request

json
{
  "provider": "gemini",
  "prompt": "A timelapse of clouds moving over a mountain range",
  "model": "veo-3.1-generate-preview",
  "aspectRatio": "16:9",
  "durationSeconds": 8,
  "numberOfVideos": 1,
  "tag": "content-creation"
}

xAI Image-to-Video

Pass an imageUrl to generate a video from a source image (xAI only).

json
{
  "provider": "xai",
  "prompt": "Slow camera zoom out with cinematic lighting",
  "model": "grok-imagine-video",
  "imageUrl": "https://example.com/photo.jpg"
}

Result (via /queue/result)

Video generation is asynchronous. Results are available via /queue/result.

json
{
  "success": true,
  "data": {
    "queueId": "vid_abc123",
    "status": "completed",
    "result": {
      "videos": [
        { "data": "<base64_encoded_video>", "mimeType": "video/mp4" }
      ],
      "model": "veo-3.1-generate-preview",
      "provider": "gemini"
    }
  }
}
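
Because the job is asynchronous, a client typically polls until `status` becomes `"completed"`. This section does not show the request shape of /queue/result, so the sketch below takes the HTTP call as a caller-supplied `fetch` function. The `"failed"` status, the polling interval, and the timeout are assumptions for illustration:

```python
import time
from typing import Callable


def wait_for_result(
    fetch: Callable[[str], dict],
    queue_id: str,
    *,
    interval: float = 2.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job is completed and return its `result` object.

    `fetch` queries /queue/result for one queue ID and returns the parsed JSON body.
    """
    deadline = time.monotonic() + timeout
    while True:
        data = fetch(queue_id).get("data", {})
        status = data.get("status")
        if status == "completed":
            return data["result"]
        if status == "failed":  # assumed failure status, not confirmed by this page
            raise RuntimeError(f"job {queue_id} failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {queue_id} still {status!r} after {timeout}s")
        sleep(interval)
```

Passing `sleep` as a parameter keeps the loop testable without real delays.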

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| provider | string | gemini, vertex, xai |
| prompt | string | Video description |
| model | string | Model ID (default: veo-3.1-generate-preview) |
| aspectRatio | string | e.g. "16:9", "9:16", "1:1" |
| durationSeconds | number | Duration in seconds (1-60) |
| numberOfVideos | number | Number of videos (1-4) |
| tag | string | Optional label for usage tracking (max 100 chars) |
| resolution | string | e.g. "1080p" (xAI) |
| imageUrl | string | Source image URL for image-to-video (xAI) |

Text-to-Speech (TTS)

Available via POST /api/llm/generate-audio.

ElevenLabs request

json
{
  "provider": "elevenlabs",
  "text": "Hello, welcome to the interview.",
  "model": "eleven_flash_v2_5",
  "voiceId": "JBFqnCBsd6RMkjVDRZzb",
  "outputFormat": "mp3_44100_128",
  "tag": "interview"
}

MLX Audio (Local) request

json
{
  "provider": "mlxaudio",
  "text": "Hallo, welkom bij het interview.",
  "model": "kokoro",
  "voiceId": "af_heart",
  "tag": "interview"
}

Result (via /queue/result)

json
{
  "success": true,
  "data": {
    "queueId": "llm_abc123",
    "status": "completed",
    "result": {
      "audio": "<base64_encoded_audio>",
      "mimeType": "audio/mpeg",
      "model": "eleven_flash_v2_5",
      "characterCount": 32
    }
  }
}

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| provider | string | "elevenlabs" or "mlxaudio" |
| text | string | Text to convert to speech |
| model | string | TTS model ID |
| voiceId | string | Voice ID (ElevenLabs ID or MLX Audio voice name) |
| outputFormat | string | mp3_44100_128, pcm_16000, etc. (ElevenLabs only) |
| tag | string | Optional label for usage tracking |
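
The two `outputFormat` examples appear to follow a `codec_sampleRate[_bitrate]` pattern. The sketch below parses a value under that reading; the pattern, the units (Hz and kbps), and the `parse_output_format` helper are inferences from the examples rather than documented behavior:

```python
from typing import NamedTuple, Optional


class OutputFormat(NamedTuple):
    codec: str
    sample_rate_hz: int
    bitrate_kbps: Optional[int]


def parse_output_format(value: str) -> OutputFormat:
    """Split e.g. "mp3_44100_128" into codec, sample rate, and optional bitrate."""
    parts = value.split("_")
    if len(parts) not in (2, 3):
        raise ValueError(f"unexpected outputFormat: {value!r}")
    codec, sample_rate = parts[0], int(parts[1])
    bitrate = int(parts[2]) if len(parts) == 3 else None
    return OutputFormat(codec, sample_rate, bitrate)
```

This can help when picking a file extension or configuring a decoder for raw PCM output.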

Sound Effects

Available via POST /api/llm/generate-sound-effect. Powered by ElevenLabs text-to-sound-effects.

Request

json
{
  "text": "thunder rolling in the distance, rain on a tin roof",
  "model": "eleven_text_to_sound_v2",
  "durationSeconds": 10,
  "promptInfluence": 0.5,
  "tag": "ambient"
}

Result (via /queue/result)

json
{
  "success": true,
  "data": {
    "queueId": "llm_abc123",
    "status": "completed",
    "result": {
      "audio": "<base64_encoded_audio>",
      "mimeType": "audio/mpeg",
      "durationSeconds": 10,
      "model": "eleven_text_to_sound_v2"
    }
  }
}

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| text | string | Sound effect description (required) |
| model | string | Model ID (default: eleven_text_to_sound_v2) |
| durationSeconds | number | Duration 0.5-30 seconds |
| promptInfluence | number | How closely to follow the prompt (0-1, default 0.3) |
| loop | boolean | Generate a loopable sound effect |
| tag | string | Optional label for usage tracking |

Music Generation

Available via POST /api/llm/generate-music. Powered by ElevenLabs text-to-music.

Request

json
{
  "prompt": "upbeat jazz jingle, 10 seconds, bright piano and saxophone",
  "model": "music_v1",
  "durationMs": 10000,
  "forceInstrumental": true,
  "tag": "jingle"
}

Result (via /queue/result)

json
{
  "success": true,
  "data": {
    "queueId": "llm_abc123",
    "status": "completed",
    "result": {
      "audio": "<base64_encoded_audio>",
      "mimeType": "audio/mpeg",
      "durationMs": 10000,
      "model": "music_v1"
    }
  }
}

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| prompt | string | Music description (required) |
| model | string | Model ID (default: music_v1) |
| durationMs | number | Duration in milliseconds (3000-600000) |
| forceInstrumental | boolean | Force instrumental only (no vocals) |
| tag | string | Optional label for usage tracking |
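
Note that music uses `durationMs` (milliseconds) while sound effects use `durationSeconds`. A small sketch that converts from seconds and checks the documented 3000-600000 ms range; `build_music_request` is a hypothetical client-side helper:

```python
from typing import Optional


def build_music_request(
    prompt: str,
    seconds: float,
    *,
    instrumental: bool = False,
    tag: Optional[str] = None,
) -> dict:
    """Build a JSON body for POST /api/llm/generate-music from a duration in seconds."""
    duration_ms = round(seconds * 1000)
    if not 3000 <= duration_ms <= 600000:
        raise ValueError("duration must be between 3 and 600 seconds")
    body = {
        "prompt": prompt,
        "model": "music_v1",
        "durationMs": duration_ms,
        "forceInstrumental": instrumental,
    }
    if tag is not None:
        body["tag"] = tag
    return body
```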

Dialogue (Multi-Speaker)

Available via POST /api/llm/generate-dialogue. Generate multi-speaker dialogue audio where each turn uses a different voice. Only the eleven_v3 model is supported. Maximum 10 unique voice IDs per request.

Example Request

POST /api/llm/generate-dialogue
{
  "inputs": [
    { "text": "Hello, how are you?", "voiceId": "JBFqnCBsd6RMkjVDRZzb" },
    { "text": "I'm doing great, thanks!", "voiceId": "Aw4FAjKCGjjNkVhN1Xmq" }
  ],
  "outputFormat": "mp3_44100_128",
  "languageCode": "en"
}

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| inputs | array | Array of {text, voiceId} objects (required, 1-100 items) |
| model | string | Only eleven_v3 (default) |
| outputFormat | string | Audio format (default: mp3_44100_128) |
| languageCode | string | Language code (e.g. "en", "nl") |
| voiceSettings | object | Voice settings (stability, similarityBoost, style, useSpeakerBoost) |
| seed | integer | Reproducibility seed (0-4294967295) |
| applyTextNormalization | string | "auto", "on", or "off" |
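
The dialogue limits (1-100 inputs, at most 10 unique voice IDs, seed range) are easy to violate when turns are generated programmatically. A minimal sketch of a client-side check, assuming turns are given as `(text, voice_id)` pairs; `build_dialogue_request` is a hypothetical helper:

```python
from typing import Optional

MAX_INPUTS = 100
MAX_UNIQUE_VOICES = 10
MAX_SEED = 4294967295


def build_dialogue_request(
    turns: list[tuple[str, str]],
    *,
    output_format: str = "mp3_44100_128",
    language_code: Optional[str] = None,
    seed: Optional[int] = None,
) -> dict:
    """Build a JSON body for POST /api/llm/generate-dialogue."""
    if not 1 <= len(turns) <= MAX_INPUTS:
        raise ValueError("inputs must contain 1-100 items")
    unique_voices = {voice_id for _text, voice_id in turns}
    if len(unique_voices) > MAX_UNIQUE_VOICES:
        raise ValueError("at most 10 unique voice IDs per request")
    if seed is not None and not 0 <= seed <= MAX_SEED:
        raise ValueError("seed must be in 0-4294967295")
    body = {
        "inputs": [{"text": text, "voiceId": voice_id} for text, voice_id in turns],
        "model": "eleven_v3",
        "outputFormat": output_format,
    }
    if language_code is not None:
        body["languageCode"] = language_code
    if seed is not None:
        body["seed"] = seed
    return body
```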

Voice Previews

Browse and preview all available TTS voices, including ElevenLabs and MLX Audio voices.

Speech-to-Text (STT)

Available via POST /api/llm/transcribe.

ElevenLabs request (with diarization)

json
{
  "provider": "elevenlabs",
  "audio": "<base64_encoded_audio>",
  "mimeType": "audio/mpeg",
  "model": "scribe_v2",
  "language": "nl",
  "tag": "interview",
  "diarize": true,
  "numSpeakers": 2,
  "timestampsGranularity": "word",
  "tagAudioEvents": true
}

MLX Audio (Local) request

json
{
  "provider": "mlxaudio",
  "audio": "<base64_encoded_audio>",
  "mimeType": "audio/wav",
  "model": "whisper-large-v3",
  "language": "nl",
  "tag": "transcription"
}

Result (via /queue/result)

json
{
  "success": true,
  "data": {
    "queueId": "llm_abc123",
    "status": "completed",
    "result": {
      "text": "Hello, welcome to the interview.",
      "language": "nl",
      "model": "scribe_v2",
      "words": [
        { "text": "Hello,", "start": 0.08, "end": 0.54, "type": "word", "speakerId": "speaker_0" },
        { "text": "welcome", "start": 0.56, "end": 0.92, "type": "word", "speakerId": "speaker_0" },
        { "text": "to", "start": 0.94, "end": 1.02, "type": "word", "speakerId": "speaker_0" }
      ]
    }
  }
}
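
With diarization enabled, consecutive words from the same speaker can be merged into readable turns. The sketch below assumes the `words` shape shown above; skipping entries whose `type` is not `"word"` (for example audio events) is an assumption about how those entries are marked:

```python
def speaker_turns(words: list[dict]) -> list[dict]:
    """Merge consecutive words with the same speakerId into one turn each."""
    turns: list[dict] = []
    for word in words:
        if word.get("type", "word") != "word":
            continue  # assumption: non-word entries are not part of the spoken text
        speaker = word.get("speakerId")
        if turns and turns[-1]["speakerId"] == speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["text"] += " " + word["text"]
            turns[-1]["end"] = word["end"]
        else:
            turns.append({
                "speakerId": speaker,
                "start": word["start"],
                "end": word["end"],
                "text": word["text"],
            })
    return turns
```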

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| provider | string | "elevenlabs" or "mlxaudio" |
| audio | string | Base64-encoded audio data |
| mimeType | string | audio/mpeg, audio/wav, etc. |
| model | string | STT model ID |
| language | string | ISO language code (optional) |
| tag | string | Optional label for usage tracking |
| diarize | boolean | Enable speaker diarization (ElevenLabs only) |
| numSpeakers | number | Expected number of speakers, 1-32 (ElevenLabs only) |
| timestampsGranularity | string | "word" or "character" (ElevenLabs, default: "word") |
| tagAudioEvents | boolean | Tag audio events like laughter, applause (ElevenLabs only) |

Image Recognition (Vision)

Image recognition endpoints live under /api/vision/. The local detection endpoints (object, zero-shot, faces, training) require VISION_SERVICE_ENABLED=true. Web detection runs on Google Cloud Vision by default (the gcvision provider, billed per analysis) or on TinEye, and does not use the local sidecar.

| Endpoint | Model | Use Case |
| --- | --- | --- |
| POST /api/vision/detect | yolo11n/s/m | Object detection (COCO classes) |
| POST /api/vision/detect/zero-shot | florence-2-base | Zero-shot detection (text prompt) |
| POST /api/vision/detect/faces | haarcascade, mediapipe, yolo-face | Face detection + optional blur |
| POST /api/vision/detect/web | gcvision (default) · tineye | Reverse image search: does the image appear elsewhere on the web? Pick the engine with the engine field |
| POST /api/vision/train | - | Start custom model training (smart defaults) |
| POST /api/vision/auto-label | florence-2-base | Auto-label images using Florence-2 |
| POST /api/vision/auto-train | florence-2-base + YOLO | Auto-label + train in one step |
| GET /api/vision/train/:jobId | - | Training job status + metrics |
| GET /api/vision/models | - | List available models |
| DELETE /api/vision/models/:modelId | - | Delete custom model |

Object Detection (YOLO)

json
// POST /api/vision/detect
{
  "image": "<base64_encoded_image>",
  "model": "yolo11n",
  "confidence": 0.25
}

Zero-Shot Detection (Florence-2)

json
// POST /api/vision/detect/zero-shot
{
  "image": "<base64_encoded_image>",
  "prompt": "coca-cola bottle, pepsi can"
}

Face Detection + Blurring

json
// POST /api/vision/detect/faces
{
  "image": "<base64_encoded_image>",
  "model": "mediapipe",
  "confidence": 0.5,
  "blur": true,
  "blurStrength": 51
}

Web Detection (reverse image search)

Web detection tells you whether an image already exists elsewhere online (for example, a product photo scraped from a stock site rather than shot on location). Select one of two engines with the engine field:

  • gcvision (default): Google Cloud Vision. Returns semantic web entities, best-guess labels, and visually similar images.
  • tineye: TinEye. Returns exact and edited-copy matches with per-page provenance (host domain + crawl date). Best for copyright / where-is-my-image tracking. Processed in Canada.

json
// POST /api/vision/detect/web
{
  "image": "<base64_encoded_image>",
  "maxResults": 10,
  "engine": "tineye"   // optional, defaults to "gcvision"
}

json
// gcvision result (via /queue/result)
{
  "likelyFromWeb": true,
  "fullMatchingImages": [{ "url": "https://stock.example/a.jpg" }],
  "partialMatchingImages": [{ "url": "https://blog.example/b.jpg", "score": 0.7 }],
  "pagesWithMatchingImages": [{ "url": "https://stock.example/page", "pageTitle": "Storefront" }],
  "visuallySimilarImages": [{ "url": "https://sim.example/c.jpg" }],
  "webEntities": [{ "description": "storefront", "score": 0.9 }],
  "bestGuessLabels": ["storefront"],
  "model": "web-detection",
  "provider": "gcvision",
  "inferenceTimeMs": 412
}

json
// tineye result (via /queue/result)
{
  "likelyFromWeb": true,
  "totalResults": 2,
  "totalBacklinks": 3,
  "fullMatchingImages": [{ "url": "https://wp.example/a.jpg", "score": 0.97 }],
  "pagesWithMatchingImages": [
    { "url": "https://wp.example/post", "domain": "wp.example", "crawlDate": "2024-09-11", "score": 0.97 }
  ],
  "partialMatchingImages": [],
  "visuallySimilarImages": [],
  "webEntities": [],
  "bestGuessLabels": [],
  "model": "tineye-search",
  "provider": "tineye",
  "inferenceTimeMs": 1840
}
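
Both engines return the same top-level keys, so one routine can summarize where an image was found. The sketch below collects the distinct domains from a result of either engine; `matched_domains` is a hypothetical helper, and which match lists to include is a judgment call:

```python
from urllib.parse import urlparse

# Lists that indicate the image itself (or an edited copy) was found online.
MATCH_KEYS = ("fullMatchingImages", "partialMatchingImages", "pagesWithMatchingImages")


def matched_domains(result: dict) -> list[str]:
    """Return distinct domains from a gcvision or tineye result, in first-seen order."""
    domains: list[str] = []
    for key in MATCH_KEYS:
        for entry in result.get(key, []):
            # tineye page entries carry an explicit domain; otherwise derive it from the URL.
            domain = entry.get("domain") or urlparse(entry["url"]).hostname
            if domain and domain not in domains:
                domains.append(domain)
    return domains
```

The `visuallySimilarImages` list is left out on purpose, since similar images are not evidence that this exact image was published elsewhere.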

Detection Result (via /queue/result)

json
{
  "detections": [
    {
      "label": "coca-cola-bottle",
      "confidence": 0.94,
      "bbox": { "x": 120, "y": 45, "width": 80, "height": 200 },
      "count": 3
    }
  ],
  "model": "yolo11n",
  "inferenceTimeMs": 32,
  "imageSize": { "width": 1920, "height": 1080 }
}
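
Drawing or cropping libraries usually expect corner coordinates rather than a width and height. A minimal sketch that converts the `bbox` objects and filters by confidence, assuming `x` and `y` are the top-left corner in pixels (this page does not say whether they are top-left or center):

```python
Box = tuple[int, int, int, int]


def to_corners(bbox: dict) -> Box:
    """Convert {x, y, width, height} to (left, top, right, bottom)."""
    left, top = bbox["x"], bbox["y"]  # assumed to be the top-left corner
    return (left, top, left + bbox["width"], top + bbox["height"])


def confident_boxes(result: dict, min_confidence: float) -> list[tuple[str, Box]]:
    """Return (label, corners) for detections at or above the threshold."""
    return [
        (detection["label"], to_corners(detection["bbox"]))
        for detection in result["detections"]
        if detection["confidence"] >= min_confidence
    ]
```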

Custom Model Training

Training uses smart defaults based on your dataset size. Hyperparameters, augmentation, and epochs are automatically optimized. You can override epochs if needed.

json
// POST /api/vision/train
{
  "datasetB64": "<base64_zip_with_images_and_labels>",
  "baseModel": "yolo11n",
  "classes": ["coca-cola", "pepsi", "fanta"],
  "imageSize": 640
}
// epochs are auto-calculated based on dataset size

Training Response

json
{
  "jobId": "a1b2c3d4",
  "status": "pending",
  "datasetStats": { "train": 80, "val": 20, "total": 100 },
  "trainingProfile": "medium-dataset",
  "augmentation": "high",
  "effectiveEpochs": 300,
  "warnings": []
}

Auto-Label (Florence-2)

Send raw images and class names. Florence-2 generates bounding box labels automatically and packages them as a YOLO-format dataset zip.

json
// POST /api/vision/auto-label
{
  "images": ["<base64_image_1>", "<base64_image_2>", ...],
  "classes": ["coca-cola", "pepsi"],
  "confidence": 0.3
}

Auto-Label Response

json
{
  "datasetB64": "<base64_zip>",
  "format": "yolo-zip",
  "stats": {
    "totalImages": 50,
    "trainImages": 40,
    "valImages": 10,
    "imagesWithDetections": 45,
    "imagesWithoutDetections": 5,
    "detectionsPerClass": { "coca-cola": 42, "pepsi": 38 },
    "totalDetections": 80
  },
  "processingTimeMs": 12340
}

Auto-Train (Label + Train in one step)

Combines auto-labeling and training into a single request. Requires a minimum of 10 images.

json
// POST /api/vision/auto-train
{
  "images": ["<base64_image_1>", ...],
  "classes": ["coca-cola"],
  "baseModel": "yolo11n",
  "confidence": 0.3
}

Detection Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| image | string | Base64-encoded image |
| model | string | Object detection: yolo11n/s/m or custom ID. Face detection: haarcascade, mediapipe, yolo-face (default: haarcascade) |
| confidence | number | Minimum confidence threshold (0-1, default 0.25 for objects, 0.5 for faces) |
| prompt | string | Text description for zero-shot detection |
| blur | boolean | Enable face blurring (face detection only) |
| blurStrength | number | Gaussian blur kernel size (odd number, default 51) |
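
Since `blurStrength` must be odd, a client can normalize the value before sending it. The sketch below builds a face-detection body and bumps an even kernel size to the next odd number; that rounding rule and the `build_face_request` helper are client-side choices, not documented server behavior:

```python
def build_face_request(
    image_b64: str,
    *,
    model: str = "haarcascade",
    confidence: float = 0.5,
    blur: bool = False,
    blur_strength: int = 51,
) -> dict:
    """Build a JSON body for POST /api/vision/detect/faces."""
    if not 0 <= confidence <= 1:
        raise ValueError("confidence must be between 0 and 1")
    body = {"image": image_b64, "model": model, "confidence": confidence, "blur": blur}
    if blur:
        if blur_strength < 1:
            raise ValueError("blurStrength must be a positive odd number")
        # Setting the lowest bit turns an even value into the next odd one.
        body["blurStrength"] = blur_strength | 1
    return body
```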

Choosing the Right Approach

| Scenario | Recommendation |
| --- | --- |
| < 10 images | Use zero-shot detection (Florence-2) - no training needed |
| 10-50 images | Auto-label + aggressive training (auto-applied) |
| 50-200 images | Standard training with high augmentation |
| 200+ images | Optimal training with standard augmentation |

Training Best Practices

Dataset Requirements

| Size | Auto Epochs | Augmentation | Expected Quality |
| --- | --- | --- | --- |
| 10-50 | 500 | Aggressive | Basic - good for prototyping |
| 50-200 | 300 | High | Good - suitable for most use cases |
| 200-1000 | 150 | Standard | Production ready |
| 1000+ | 100 | Standard | Optimal |
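
To estimate what the service will pick before uploading a dataset, the table can be expressed as a lookup. This is only a sketch of the table, not the service's actual logic: the ranges overlap at 50, 200, and 1000, so the code assumes each lower bound is inclusive, and `training_profile` is a hypothetical name:

```python
from typing import NamedTuple


class Profile(NamedTuple):
    epochs: int
    augmentation: str


def training_profile(image_count: int) -> Profile:
    """Map a dataset size to the auto epochs and augmentation level from the table."""
    if image_count < 10:
        raise ValueError("fewer than 10 images: use zero-shot detection instead")
    if image_count < 50:
        return Profile(500, "Aggressive")
    if image_count < 200:
        return Profile(300, "High")
    if image_count < 1000:
        return Profile(150, "Standard")
    return Profile(100, "Standard")
```

For the 100-image dataset in the training response example, this gives 300 epochs with high augmentation, matching the `effectiveEpochs` and `augmentation` values shown there.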

Image Diversity Checklist

  • Different angles and perspectives
  • Varying lighting conditions (daylight, indoor, shadow)
  • Different backgrounds and contexts
  • Multiple distances (close-up, medium, far)
  • Partial occlusion (objects partially hidden)
  • Different orientations and rotations

Confidence Thresholds

| Use Case | Threshold |
| --- | --- |
| Default / general use | 0.25 |
| High recall (catch everything) | 0.15 |
| High precision (minimize false positives) | 0.50 |
| Production / critical | 0.40+ |

The training response includes a recommended_confidence based on the model's mAP metrics.

Quick Start: Train Your First Model

1. Collect 50+ images of your target object

   More diverse images produce a better model. Mix angles, lighting, and backgrounds.

2. Auto-label with Florence-2

   Send images to POST /api/vision/auto-label, or use POST /api/vision/auto-train to label and train in one step.

3. Train and detect

   Smart defaults handle hyperparameters. Poll GET /api/vision/train/:jobId until the job completes, then use your custom model ID in detection requests.