What it does
Transcribes audio or video to text using ElevenLabs Scribe. Works on files already in the conversation (including output from `elevenlabs_text_to_speech`) or any HTTPS-accessible media URL, including cloud storage, YouTube, TikTok, and podcast hosts.
Key features
- Single `audio_source` param accepts `r2://` conversation attachments or HTTPS URLs, auto-detected by prefix
- Scribe v2 (default) for best-in-class accuracy
- Word-level timestamps returned by default
- Speaker diarization (who spoke when) when `diarize` is on
- Audio event tagging: surfaces `(laughter)`, `(footsteps)`, etc. inline in the transcript
- Auto language detection, or pin a specific ISO-639 code
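The prefix-based detection mentioned above can be pictured with a small sketch. The tool performs this step itself; the helper name `classify_audio_source` and the returned labels are hypothetical and only illustrate how a caller might pre-check a value before passing it in.

```python
def classify_audio_source(audio_source: str) -> str:
    """Return "r2" for conversation attachments or "https" for remote media.

    Hypothetical helper: the real tool does its own detection internally.
    """
    if audio_source.startswith("r2://"):
        return "r2"
    if audio_source.startswith("https://"):
        return "https"
    raise ValueError(f"unsupported audio_source: {audio_source!r}")


# Placeholder values, not real files.
print(classify_audio_source("r2://example-bucket/audio/clip.mp3"))
print(classify_audio_source("https://example.com/podcast/episode.mp3"))
```

Rejecting anything else early (for example a plain `http://` URL) gives a clearer error than letting the call fail later.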
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `audio_source` | string | Yes | Either (a) an `r2://bucket/key` path of an audio file already attached to the thread, or (b) an HTTPS URL to an audio/video file. Supports cloud storage URLs (S3, R2, GCS), YouTube, TikTok, and other HTTPS sources up to 2 GB. |
| `model_id` | enum | No | `scribe_v2` (default, latest) or `scribe_v1` |
| `language_code` | string | No | ISO-639-1 or ISO-639-3 code (e.g. `eng`, `spa`). If omitted, the language is auto-detected. |
| `diarize` | boolean | No | Annotate which speaker is talking (returns `speaker_id` per word). Default: `true` |
| `tag_audio_events` | boolean | No | Tag audio events like `(laughter)`, `(footsteps)` inline. Default: `true` |
| `num_speakers` | integer | No | Expected maximum number of speakers (1–32). Helps diarization when known. |
| `timestamps_granularity` | enum | No | `none`, `word` (default), or `character` |
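The constraints in the table can be checked client-side before a call is made. The following is a minimal sketch, not the tool's own validation: `build_transcription_args` is a hypothetical helper, and the defaults and ranges are copied from the table above.

```python
from typing import Any, Dict, Optional

MODEL_IDS = {"scribe_v2", "scribe_v1"}
GRANULARITIES = {"none", "word", "character"}


def build_transcription_args(
    audio_source: str,
    model_id: str = "scribe_v2",
    language_code: Optional[str] = None,
    diarize: bool = True,
    tag_audio_events: bool = True,
    num_speakers: Optional[int] = None,
    timestamps_granularity: str = "word",
) -> Dict[str, Any]:
    """Hypothetical helper that mirrors the parameter table."""
    if not audio_source.startswith(("r2://", "https://")):
        raise ValueError("audio_source must start with r2:// or https://")
    if model_id not in MODEL_IDS:
        raise ValueError(f"model_id must be one of {sorted(MODEL_IDS)}")
    if timestamps_granularity not in GRANULARITIES:
        raise ValueError(f"timestamps_granularity must be one of {sorted(GRANULARITIES)}")
    if num_speakers is not None and not 1 <= num_speakers <= 32:
        raise ValueError("num_speakers must be between 1 and 32")

    args: Dict[str, Any] = {
        "audio_source": audio_source,
        "model_id": model_id,
        "diarize": diarize,
        "tag_audio_events": tag_audio_events,
        "timestamps_granularity": timestamps_granularity,
    }
    # Optional fields are omitted when unset so the tool's own behaviour applies
    # (for example, automatic language detection).
    if language_code is not None:
        args["language_code"] = language_code
    if num_speakers is not None:
        args["num_speakers"] = num_speakers
    return args


print(build_transcription_args("https://example.com/talk.mp4", num_speakers=2))
```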
Common use cases
Transcribe a file already attached to the conversation
Pass the `r2_path` straight through as `audio_source`.
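A minimal sketch of this case, assuming an earlier tool result exposes the attachment path in a field called `r2_path`. The shape of `previous_result` and the path itself are illustrative placeholders; only the `audio_source` parameter name comes from the table above.

```python
import json

# Hypothetical result from an earlier tool call (e.g. text-to-speech output).
previous_result = {"r2_path": "r2://example-bucket/tts/output.mp3"}

# The path is forwarded unchanged; every other parameter keeps its default.
args = {"audio_source": previous_result["r2_path"]}
print(json.dumps(args, indent=2))
```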
Transcribe a public podcast or recording URL
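For a remote recording, the arguments might look like the sketch below. The URL is a placeholder, and pinning `language_code` is optional: leaving it out lets the language be auto-detected.

```python
import json
from urllib.parse import urlparse

url = "https://example.com/podcasts/episode-12.mp3"  # placeholder URL

# Only HTTPS sources are described as supported, so check the scheme up front.
if urlparse(url).scheme != "https":
    raise ValueError("audio_source URLs must use HTTPS")

args = {"audio_source": url, "language_code": "eng"}
print(json.dumps(args, indent=2))
```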
Transcribe a meeting with multiple speakers
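For a multi-speaker recording, one plausible set of arguments keeps diarization on and hints at the expected speaker count. The attachment path and the count of four are illustrative assumptions.

```python
import json

args = {
    "audio_source": "r2://example-bucket/meetings/standup.m4a",  # placeholder path
    "diarize": True,                   # already the default, shown for clarity
    "num_speakers": 4,                 # expected maximum, must be within 1-32
    "timestamps_granularity": "word",  # per-word timing so each word carries a speaker_id
}
print(json.dumps(args, indent=2))
```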
Response
Returns:
- `text`: the full transcript
- `language_code` / `language_probability`: detected language and confidence
- `speaker_count`: number of distinct speakers identified (when `diarize` is on)
- `word_count`: total words in the transcript
- `words`: per-word objects with text, start/end timestamps, and `speaker_id`
- `source`: a label describing which input path was used
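One common way to consume the per-word output is to merge consecutive words from the same speaker into turns. The sketch below assumes the word objects use the keys `text`, `start`, `end`, and `speaker_id`; the exact key names for the timestamps and the sample values are assumptions, not confirmed output.

```python
from itertools import groupby

# Illustrative response fragment, not real tool output.
response = {
    "text": "Hi there. Hello!",
    "words": [
        {"text": "Hi", "start": 0.0, "end": 0.2, "speaker_id": "speaker_0"},
        {"text": "there.", "start": 0.25, "end": 0.6, "speaker_id": "speaker_0"},
        {"text": "Hello!", "start": 0.9, "end": 1.3, "speaker_id": "speaker_1"},
    ],
}


def speaker_turns(words):
    """Merge runs of consecutive words that share a speaker_id into one turn."""
    turns = []
    for speaker, group in groupby(words, key=lambda w: w["speaker_id"]):
        chunk = list(group)
        turns.append({
            "speaker_id": speaker,
            "start": chunk[0]["start"],
            "end": chunk[-1]["end"],
            "text": " ".join(w["text"] for w in chunk),
        })
    return turns


for turn in speaker_turns(response["words"]):
    print(f'{turn["speaker_id"]} [{turn["start"]:.2f}-{turn["end"]:.2f}]: {turn["text"]}')
```

Because `groupby` only merges adjacent items, a speaker who talks, is interrupted, and talks again produces two separate turns, which is usually what a meeting transcript needs.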
