What it does
The Scrape URL tool extracts text content from web pages and PDF documents. Perfect for gathering information from websites, analyzing web content, or processing documents that your agents need to work with.
Key features
- Extract content from any web page or PDF URL
- Choose between full content or tables-only extraction
- Get either clean text or raw HTML
- Handles JavaScript-heavy pages with Playwright
- Automatic token limiting to prevent oversized responses
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| `url` | string | Yes | The URL of the webpage or PDF to scrape |
| `tables_only` | boolean | No | Extract only tables from the page (default: false) |
| `raw_html` | boolean | No | Return raw HTML instead of parsed text (default: false) |
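As a minimal sketch of how the three parameters fit together, the helper below assembles them into one dictionary and applies the defaults from the table. The function name `build_scrape_args` and its validation rules are assumptions made for illustration; this page does not describe how the tool checks its input.

```python
from urllib.parse import urlparse


def build_scrape_args(url: str, tables_only: bool = False, raw_html: bool = False) -> dict:
    """Assemble the three documented parameters into one dictionary.

    Hypothetical helper: only the parameter names, types, and defaults
    come from the Parameters table.
    """
    parsed = urlparse(url)
    # Assumption: only absolute http(s) URLs are useful to a scraper.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"expected an absolute http(s) URL, got {url!r}")
    # Both flags are listed as booleans, so reject look-alikes such as "true".
    for name, value in (("tables_only", tables_only), ("raw_html", raw_html)):
        if not isinstance(value, bool):
            raise TypeError(f"{name} must be a boolean, got {type(value).__name__}")
    return {"url": url, "tables_only": tables_only, "raw_html": raw_html}


args = build_scrape_args("https://example.com/report")
print(args)
```

Calling the helper with only a URL leaves both optional flags at `False`, which matches the defaults listed in the table.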
Common use cases
- Extract article content
- Get data from tables
- Process PDF documents
- Get raw HTML for parsing
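The four use cases above differ only in which optional flags are set. The sketch below pairs each one with a plausible set of arguments; the URLs are placeholders, and how the arguments are actually passed to the tool depends on your agent setup, which this page does not specify.

```python
import json

# Placeholder URLs. Only the parameter names and defaults come from the
# Parameters table; omitted flags fall back to their default of false.
USE_CASES = {
    "Extract article content": {"url": "https://example.com/blog/post"},
    "Get data from tables": {"url": "https://example.com/stats", "tables_only": True},
    "Process PDF documents": {"url": "https://example.com/files/report.pdf"},
    "Get raw HTML for parsing": {"url": "https://example.com/page", "raw_html": True},
}

for name, arguments in USE_CASES.items():
    # json.dumps renders Python's True as JSON's true.
    print(f"{name}: {json.dumps(arguments)}")
```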
Limitations
- Content is limited to 30,000 tokens by default
- PDF extraction doesn’t handle images or complex formatting
- Some dynamic content requiring user interaction may not be captured
- Large documents may be truncated
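To get a feel for the 30,000-token limit, the sketch below estimates the size of a piece of text and groups its paragraphs into chunks that stay within a budget. The ratio of four characters per token is a rough assumption, since the tool's tokenizer is not documented here, and the function names are hypothetical.

```python
TOKEN_LIMIT = 30_000   # default limit stated in the Limitations list
CHARS_PER_TOKEN = 4    # rough assumption; the tool's real tokenizer is not documented


def estimate_tokens(text: str) -> int:
    """Crude estimate: character count divided by CHARS_PER_TOKEN, rounded up."""
    return -(-len(text) // CHARS_PER_TOKEN)


def split_by_budget(text: str, limit: int = TOKEN_LIMIT) -> list[str]:
    """Group paragraphs into chunks whose estimated size stays within limit.

    A single paragraph larger than the limit is kept whole in its own chunk.
    """
    chunks: list[str] = []
    current: list[str] = []
    for paragraph in text.split("\n\n"):
        candidate = "\n\n".join(current + [paragraph])
        if current and estimate_tokens(candidate) > limit:
            # Adding this paragraph would exceed the budget: close the chunk.
            chunks.append("\n\n".join(current))
            current = [paragraph]
        else:
            current.append(paragraph)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


document = "\n\n".join(["word " * 20] * 10)  # ten paragraphs of 100 characters each
print(estimate_tokens(document))
print(len(split_by_budget(document, limit=60)))
```

Splitting on blank lines keeps paragraphs intact, at the cost of uneven chunk sizes. Treat the estimate as a guide only, because real token counts vary with the tokenizer and the language of the text.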
Troubleshooting
“Failed to load page”
- Check that the URL is accessible and valid
- Verify the website doesn’t block automated access
- Try the URL in a browser to confirm it works
Truncated content
- The page content exceeded the token limit
- Consider using `tables_only: true` for data extraction
- Break large documents into smaller sections
PDF processing problems
- Ensure the URL points to a valid PDF file
- Some password-protected PDFs cannot be processed
- Try downloading and hosting the PDF elsewhere
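One way to check that a URL really points to a PDF before scraping it is to download the first bytes and look for the `%PDF-` header that PDF files start with. This is a sketch of an independent pre-flight check, not part of the Scrape URL tool; the names `looks_like_pdf` and `fetch_head` and the `User-Agent` value are assumptions made for illustration.

```python
import urllib.request

PDF_MAGIC = b"%PDF-"  # PDF files begin with this header


def looks_like_pdf(first_bytes: bytes) -> bool:
    """Strict check: the data must start with the PDF header."""
    return first_bytes.startswith(PDF_MAGIC)


def fetch_head(url: str, size: int = 1024, timeout: float = 10.0) -> bytes:
    """Download at most `size` bytes from url.

    Raises urllib.error.URLError when the server cannot be reached or
    answers with an HTTP error status.
    """
    request = urllib.request.Request(url, headers={"User-Agent": "pdf-preflight/0.1"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read(size)


# Offline demonstration with literal bytes.
print(looks_like_pdf(b"%PDF-1.7\n%..."))
print(looks_like_pdf(b"<!DOCTYPE html><html>"))

# Against a real server (placeholder URL):
# print(looks_like_pdf(fetch_head("https://example.com/files/report.pdf")))
```

If the check fails on a URL ending in `.pdf`, the server is likely returning an HTML login or error page instead of the file, which can also explain a failed extraction.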