What it does

Transcribes audio or video to text using ElevenLabs Scribe. Works on files already in the conversation (including output from elevenlabs_text_to_speech) or any HTTPS-accessible media URL — including cloud storage, YouTube, TikTok, and podcast hosts.

Key features

  • Single audio_source param accepts r2:// conversation attachments or HTTPS URLs — auto-detected by prefix
  • Scribe v2 (default) for best-in-class accuracy
  • Word-level timestamps returned by default
  • Speaker diarization (who spoke when) when diarize is on
  • Audio event tagging — surfaces (laughter), (footsteps), etc. inline in the transcript
  • Auto language detection, or pin a specific ISO-639 code
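The prefix-based auto-detection in the first bullet can be sketched as follows. This is an illustrative helper, not the tool's internal code; the function name and return labels are assumptions.

```python
def detect_source_kind(audio_source: str) -> str:
    """Classify an audio_source by prefix, mirroring the documented
    auto-detection: r2:// paths are conversation attachments, https://
    URLs are fetched over the network."""
    if audio_source.startswith("r2://"):
        return "conversation_attachment"
    if audio_source.startswith("https://"):
        return "https_url"
    raise ValueError(f"Unsupported audio_source prefix: {audio_source!r}")
```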

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| audio_source | string | Yes | Either (a) an r2://bucket/key path of an audio file already attached to the thread, or (b) an HTTPS URL to an audio/video file. Supports cloud storage URLs (S3, R2, GCS), YouTube, TikTok, and other HTTPS sources up to 2 GB. |
| model_id | enum | No | scribe_v2 (default, latest) or scribe_v1 |
| language_code | string | No | ISO-639-1 or ISO-639-3 code (e.g. eng, spa). If omitted, the language is auto-detected. |
| diarize | boolean | No | Annotate which speaker is talking (returns speaker_id per word). Default: true |
| tag_audio_events | boolean | No | Tag audio events like (laughter), (footsteps) inline. Default: true |
| num_speakers | integer | No | Expected maximum number of speakers (1–32). Helps diarization when known. |
| timestamps_granularity | enum | No | none, word (default), or character |

Common use cases

Transcribe a file already attached to the conversation

audio_source: "r2://aster-agents/org_xxx/threads/yyy/recording.mp3"
Use this when a previous tool call (TTS, a document extraction, or a user upload) produced an audio file — pass its r2_path straight through.

Transcribe a public podcast or recording URL

audio_source: "https://example.com/episode-42.mp3"
language_code: "eng"

Transcribe a meeting with multiple speakers

audio_source: "r2://aster-agents/org_xxx/threads/yyy/meeting.mp4"
diarize: true
num_speakers: 4

Response

Returns:
  • text — the full transcript
  • language_code / language_probability — detected language and confidence
  • speaker_count — number of distinct speakers identified (when diarize is on)
  • word_count — total words in the transcript
  • words — per-word objects with text, start/end timestamps, and speaker_id
  • source — a label describing which input path was used
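One common post-processing step is collapsing the per-word words array into speaker turns for a readable transcript. A minimal sketch, assuming each word object carries text, start, end, and speaker_id as described above (the helper itself is not part of the tool):

```python
def to_speaker_turns(words: list[dict]) -> list[dict]:
    """Merge consecutive words with the same speaker_id into one turn,
    keeping the turn's start and end timestamps."""
    turns: list[dict] = []
    for w in words:
        if turns and turns[-1]["speaker_id"] == w.get("speaker_id"):
            # Same speaker as the previous word: extend the current turn.
            turns[-1]["text"] += " " + w["text"]
            turns[-1]["end"] = w["end"]
        else:
            # Speaker changed (or first word): open a new turn.
            turns.append({
                "speaker_id": w.get("speaker_id"),
                "text": w["text"],
                "start": w["start"],
                "end": w["end"],
            })
    return turns
```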

Setup

No per-user setup. ElevenLabs is configured at the platform level — just enable the tool on your agent in Control Hub > Edit Agent under the Audio section.