What it does

Transcribes audio or video to text using ElevenLabs Scribe. Works on files already in the conversation (including output from elevenlabs_text_to_speech) or any HTTPS-accessible media URL — including cloud storage, YouTube, TikTok, and podcast hosts.

Key features

  • Single audio_source param accepts r2:// conversation attachments or HTTPS URLs — auto-detected by prefix
  • Scribe v2 (default) for best-in-class accuracy
  • Word-level timestamps returned by default
  • Speaker diarization (who spoke when) when diarize is on
  • Audio event tagging — surfaces (laughter), (footsteps), etc. inline in the transcript
  • Auto language detection, or pin a specific ISO-639 code
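The prefix-based auto-detection in the first bullet can be sketched as follows. This is an illustrative helper, not the tool's internal code; the function name and return labels are assumptions.

```python
def detect_source_kind(audio_source: str) -> str:
    """Classify an audio_source by prefix, mirroring the documented
    auto-detection: r2:// paths are conversation attachments, https://
    URLs are fetched over the network."""
    if audio_source.startswith("r2://"):
        return "conversation_attachment"
    if audio_source.startswith("https://"):
        return "https_url"
    raise ValueError(f"Unsupported audio_source prefix: {audio_source!r}")
```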

Parameters

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| audio_source | string | Yes | Either (a) an r2://bucket/key path of an audio file already attached to the thread, or (b) an HTTPS URL to an audio/video file. Supports cloud storage URLs (S3, R2, GCS), YouTube, TikTok, and other HTTPS sources up to 2 GB. |
| model_id | enum | No | scribe_v2 (default, latest) or scribe_v1 |
| language_code | string | No | ISO-639-1 or ISO-639-3 code (e.g. eng, spa). If omitted, the language is auto-detected. |
| diarize | boolean | No | Annotate which speaker is talking (returns speaker_id per word). Default: true |
| tag_audio_events | boolean | No | Tag audio events like (laughter), (footsteps) inline. Default: true |
| num_speakers | integer | No | Expected maximum number of speakers (1–32). Helps diarization when known. |
| timestamps_granularity | enum | No | none, word (default), or character |

Common use cases

Transcribe a file already attached to the conversation

audio_source: "r2://aster-agents/org_xxx/threads/yyy/recording.mp3"
Use this when a previous tool call (TTS, a document extraction, or a user upload) produced an audio file — pass its r2_path straight through.

Transcribe a public podcast or recording URL

audio_source: "https://example.com/episode-42.mp3"
language_code: "eng"

Transcribe a meeting with multiple speakers

audio_source: "r2://aster-agents/org_xxx/threads/yyy/meeting.mp4"
diarize: true
num_speakers: 4

Response

Returns:
  • text — the full transcript
  • language_code / language_probability — detected language and confidence
  • speaker_count — number of distinct speakers identified (when diarize is on)
  • word_count — total words in the transcript
  • words — per-word objects with text, start/end timestamps, and speaker_id
  • source — a label describing which input path was used
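One common post-processing step is collapsing the per-word words array into speaker turns for a readable transcript. A minimal sketch, assuming each word object carries text, start, end, and speaker_id as described above (the helper itself is not part of the tool):

```python
def to_speaker_turns(words: list[dict]) -> list[dict]:
    """Merge consecutive words with the same speaker_id into one turn,
    keeping the turn's start and end timestamps."""
    turns: list[dict] = []
    for w in words:
        if turns and turns[-1]["speaker_id"] == w.get("speaker_id"):
            # Same speaker as the previous word: extend the current turn.
            turns[-1]["text"] += " " + w["text"]
            turns[-1]["end"] = w["end"]
        else:
            # Speaker changed (or first word): open a new turn.
            turns.append({
                "speaker_id": w.get("speaker_id"),
                "text": w["text"],
                "start": w["start"],
                "end": w["end"],
            })
    return turns
```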

Setup

No per-user setup. ElevenLabs is configured at the platform level — just enable the tool on your agent in Control Hub > Edit Agent under the Audio section.