What it does

The Scrape URL tool extracts text content from web pages and PDF documents. Use it to gather information from websites, analyze web content, or process documents that your agents need to work with.

Key features

  • Extract content from any web page or PDF URL
  • Choose between full content or tables-only extraction
  • Get either clean text or raw HTML
  • Handle JavaScript-heavy pages with Playwright
  • Limit tokens automatically to prevent oversized responses

Parameters

Parameter    Type     Required  Description
url          string   Yes       The URL of the web page or PDF to scrape
tables_only  boolean  No        Extract only tables from the page (default: false)
raw_html     boolean  No        Return raw HTML instead of parsed text (default: false)
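
As a rough illustration of the parameters above, the sketch below assembles them into a dictionary with the documented defaults. The helper name build_scrape_params is hypothetical and not part of the tool, and the URL check is an assumption about what a sensible input looks like, not documented behavior.

```python
from urllib.parse import urlparse


def build_scrape_params(url: str, tables_only: bool = False, raw_html: bool = False) -> dict:
    """Assemble the three documented parameters, applying the documented defaults.

    Hypothetical helper for illustration; the tool itself does not expose it.
    """
    parsed = urlparse(url)
    # Assumption: only absolute http(s) URLs are meaningful to a scraper.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"expected an absolute http(s) URL, got {url!r}")
    # Both optional parameters are documented as booleans.
    for name, value in (("tables_only", tables_only), ("raw_html", raw_html)):
        if not isinstance(value, bool):
            raise TypeError(f"{name} must be a boolean")
    return {"url": url, "tables_only": tables_only, "raw_html": raw_html}
```

Calling build_scrape_params("https://example.com/article") yields the same three values shown in the first use case below, with both flags left at false.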

Common use cases

Extract article content

url: "https://example.com/article"
tables_only: false
raw_html: false
Well suited to getting clean text from news articles, blog posts, or documentation.

Get data from tables

url: "https://example.com/data-page"
tables_only: true
raw_html: false
Extract structured data from HTML tables for analysis.
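
To give a feel for what tables-only extraction produces, here is a minimal sketch that pulls rows out of HTML tables using only the Python standard library. It is not the tool's implementation: TableExtractor and extract_tables are hypothetical names, and the sketch assumes well-formed, non-nested tables with explicit closing tags.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.tables = []   # one entry per <table>
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None
            self._cell = None


def extract_tables(html: str) -> list:
    """Return all tables found in the HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.tables
```

Text outside table cells is ignored, which mirrors the idea of skipping the surrounding page content when only the data matters.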

Process PDF documents

url: "https://example.com/document.pdf"
tables_only: false
raw_html: false
Extract text content from PDF files for document analysis.
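
Before pointing the tool at a document, it can help to confirm that the bytes behind a URL really are a PDF. The sketch below is one way to check; looks_like_pdf is a hypothetical helper, and downloading the bytes is left to the reader. It relies on the fact that PDF files carry a "%PDF-" header near the start of the file.

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes appear to start a PDF file.

    The header is normally at offset 0, but some readers tolerate leading
    junk, so this sketch scans the first 1024 bytes.
    """
    return b"%PDF-" in data[:1024]
```

An HTML error page served at a .pdf URL fails this check, which is a common reason for extraction to fail.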

Get raw HTML for parsing

url: "https://example.com/page"
tables_only: false
raw_html: true
Useful when you need the full HTML structure for custom processing.
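
As one example of custom processing on raw HTML output, the sketch below collects the absolute link targets from a page using only the Python standard library. LinkCollector and collect_links are hypothetical names, and the sketch assumes the raw HTML is already available as a string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Gather the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links such as "/docs" against the page URL.
            self.links.append(urljoin(self.base_url, href))


def collect_links(raw_html: str, base_url: str) -> list:
    """Return all link targets in document order."""
    collector = LinkCollector(base_url)
    collector.feed(raw_html)
    collector.close()
    return collector.links
```

The same pattern extends to other structure that parsed text discards, such as attributes, metadata tags, or element nesting.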

Limitations

  • Content is limited to 30,000 tokens by default
  • PDF extraction returns text only; images and complex formatting are not preserved
  • Some dynamic content requiring user interaction may not be captured
  • Large documents may be truncated
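
The sketch below illustrates the effect of a token limit and one way to work around it by splitting text into smaller pieces. It is a rough model only: it assumes one whitespace-separated word counts as one token, which is not how the tool's tokenizer necessarily counts, and both function names are hypothetical.

```python
def truncate_to_token_limit(text: str, max_tokens: int = 30_000) -> tuple:
    """Return (text, truncated_flag), cutting the text at max_tokens words.

    Assumption: whitespace-separated words stand in for tokens.
    """
    words = text.split()
    if len(words) <= max_tokens:
        return text, False
    return " ".join(words[:max_tokens]), True


def split_into_chunks(text: str, max_tokens: int) -> list:
    """Split text into consecutive chunks of at most max_tokens words each."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    words = text.split()
    return [
        " ".join(words[i:i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]
```

Splitting a long document yourself and processing each chunk separately avoids silently losing whatever falls past the limit.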

Troubleshooting

“Failed to load page”
  • Check that the URL is accessible and valid
  • Verify the website doesn’t block automated access
  • Try the URL in a browser to confirm it works

“Content truncated”
  • The page content exceeded the token limit
  • Consider using tables_only: true for data extraction
  • Break large documents into smaller sections

“PDF extraction failed”
  • Ensure the URL points to a valid PDF file
  • Some password-protected PDFs cannot be processed
  • Try downloading and hosting the PDF elsewhere

Related tools

  • Ask Web - Ask questions about web content using an LLM
  • Call API - Make API calls to web services