Overview
Your AI agents can already search and read the full contents of every file in a knowledge base. Extraction schemas add a structured index on top of that: a consistent set of fields extracted from each document that enables quick filtering and cross-document analysis. Think of it this way: without an extraction schema, answering “which contracts are worth over $1M?” requires the AI to open and read every contract. With an extraction schema that includes total_value as a number field, that question becomes a simple filter.
Extraction schemas are most valuable for:
- Numeric fields you want to compare, aggregate, or filter (amounts, counts, percentages, dates)
- Categorical fields you want to group or filter by (document type, status, region, sector)
- Key identifiers that help locate the right documents quickly (deal names, parties, project codes)
Choosing What to Extract
Before writing your schema, ask yourself: “What questions do I want to answer across all my documents?”
| Question Type | Extract These Fields |
|---|---|
| “Which deals are over $10M?” | deal_value (number) |
| “Show me all contracts expiring this quarter” | expiration_date (string in YYYY-MM-DD) |
| “Break down documents by type” | document_type (enum) |
| “Find all projects in the healthcare sector” | sector (enum) |
| “What’s our average contract value by region?” | contract_value (number), region (enum) |
| “List all documents involving Acme Corp” | parties (array of strings) |
Prioritize fields that are:
- Frequently used for filtering or sorting
- Needed for aggregations (sums, averages, counts)
- Used to categorize or group documents
- Key identifiers for finding specific documents
Skip fields that are:
- Only useful when reading the full document (the AI can already do that)
- Highly variable or unstructured across documents
- Rarely needed for cross-document questions
Schema Basics
Extraction schemas use JSON Schema format. At minimum, your schema needs three things:
Field Types & Structures
The reference below covers every supported field type and the structures you can compose from them.
Supported Types
Use these types for your schema fields:
| Type | Description | Example Value |
|---|---|---|
| string | Text values | "Quarterly Report" |
| number | Decimal numbers | 123.45 |
| integer | Whole numbers | 42 |
| boolean | True/false values | true |
| array | Lists of items | ["item1", "item2"] |
| object | Nested structures | {"name": "John"} |
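As a rough sketch of how the types in the table fit together, the schema below declares one field of each type. All field names here are hypothetical and chosen only for illustration.

```json
{
  "type": "object",
  "properties": {
    "title": { "type": "string" },
    "total_value": { "type": "number" },
    "page_count": { "type": "integer" },
    "is_signed": { "type": "boolean" },
    "parties": { "type": "array", "items": { "type": "string" } },
    "author": {
      "type": "object",
      "properties": { "name": { "type": "string" } }
    }
  }
}
```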
Making fields optional
Use a type array with null to make fields optional. When a field uses ["type", "null"], the AI will return null if the information isn’t found in the document.
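A minimal sketch of an optional field, assuming a hypothetical renewal_date that not every contract states. The type array pairs the real type ("string") with "null".

```json
{
  "type": "object",
  "properties": {
    "contract_title": { "type": "string" },
    "renewal_date": {
      "type": ["string", "null"],
      "description": "Renewal date in YYYY-MM-DD. Return null if not explicitly stated."
    }
  },
  "required": ["contract_title"]
}
```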
Using enums
Enums constrain a field to specific allowed values. Use enums whenever you have a known set of categories.
Enums work with strings, numbers, booleans, or null values. Include null in the enum array if the field should be optional.
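For illustration, here is one required enum and one optional enum that lists null among its allowed values. The field names and category values are hypothetical; substitute the categories your documents actually use.

```json
{
  "type": "object",
  "properties": {
    "document_type": {
      "type": "string",
      "enum": ["contract", "invoice", "report", "other"]
    },
    "region": {
      "type": ["string", "null"],
      "enum": ["north_america", "europe", "asia_pacific", null]
    }
  },
  "required": ["document_type"]
}
```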
Arrays
Use arrays to extract lists of items.
Arrays of objects: extract structured lists with typed objects.
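Both array forms can be sketched in one schema: parties is a plain list of strings, and line_items is a list of typed objects. The field names are hypothetical.

```json
{
  "type": "object",
  "properties": {
    "parties": {
      "type": "array",
      "items": { "type": "string" }
    },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "integer" },
          "unit_price": { "type": "number" }
        },
        "required": ["description"]
      }
    }
  }
}
```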
Nested objects
Group related fields using nested objects:
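A small sketch of one level of nesting, assuming a hypothetical vendor object that groups two related fields:

```json
{
  "type": "object",
  "properties": {
    "vendor": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "country": { "type": "string" }
      },
      "required": ["name"]
    }
  }
}
```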
Best Practices
1. Require fields whenever possible
Mark fields as required unless they genuinely may not exist in documents. This helps the AI understand which information is essential:
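One way this might look, with hypothetical deal fields: the two fields every document should contain are listed in required, while closing_date, which may be absent, is left out of required and allows null.

```json
{
  "type": "object",
  "properties": {
    "deal_name": { "type": "string" },
    "deal_value": { "type": "number" },
    "closing_date": { "type": ["string", "null"] }
  },
  "required": ["deal_name", "deal_value"]
}
```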
2. Use enums for categories
Whenever you have a known set of possible values, use enums instead of free-form strings:
3. Keep nesting shallow
Prefer one or two levels of nesting to group related fields logically:
4. Use descriptions to instruct the AI
The description field is your primary way to guide the extraction AI. Think of descriptions as instructions telling the AI exactly how to interpret each field when analyzing the document. Use descriptions to specify:
- Date and number formats: "YYYY-MM-DD", "in USD", "as a decimal percentage (e.g., 0.15 for 15%)"
- What to include or exclude: "excluding taxes", "only from the executive summary section"
- How to handle missing data: "Return null if not explicitly stated"
- Where to look: "from the signature block", "as stated in the header"
- Level of detail: "one-sentence summary", "verbatim quote", "3-5 bullet points"
5. Test across multiple documents
Before finalizing your schema, test it against several representative documents, not just one. A schema that works perfectly for one document may fail on others.
Testing across examples helps you identify:
- Fields that don’t exist in all documents (should be optional with null)
- Categories that vary more than expected (broaden your enums)
- Format inconsistencies (clarify your descriptions)
- Edge cases you didn’t anticipate
Example Schemas
These are complete, copy-ready schemas for common document types.
Schema Depth: Index vs. Full Extraction
When designing your schema, consider how your agent will actually use the data.
- Summary-level (index and filter)
- Deep (work directly from extracted data)
Summary-level schemas are best when agents need to find and filter documents, then read the full content for detail. The agent loads extracted_data.json to find the right files, then reads their .md content for line items, notes, and other detail. This is simpler to maintain and works well when documents vary in structure.
How to decide
| Factor | Summary schema | Deep schema |
|---|---|---|
| Agent reads 1-5 files per task | Good fit | Overkill |
| Agent processes 10+ files per task | Bottleneck — agent reads every file | Good fit |
| Document structure varies a lot | Easier to maintain | Harder to design |
| Speed/timeout matters | Agent still needs file reads | Single JSON load |
| Schema changes frequently | Low re-extraction cost | Must re-extract all files |
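To make the contrast concrete, here is a sketch of a deep schema for a hypothetical invoice. A summary-level version would keep only the first three fields and leave line items to be read from the file content. The deep version adds line_items so an agent can work from the extracted data alone.

```json
{
  "type": "object",
  "properties": {
    "invoice_number": { "type": "string" },
    "vendor_name": { "type": "string" },
    "total_amount": { "type": "number", "description": "Invoice total in USD" },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "amount": { "type": "number" }
        }
      }
    }
  },
  "required": ["invoice_number", "vendor_name", "total_amount"]
}
```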
Setting Up Extraction
Configure structured data extraction
Under Structured Data Extraction:
- Select an Extraction Model (e.g., GPT-4o)
- Paste your JSON schema into the Extraction Schema field
The schema only applies to newly uploaded files. To re-extract data from existing files, use the retry extraction option in the file details.
Viewing Extracted Data
After processing completes:
You can also access extracted data programmatically through agents using the search_knowledge_base tool.
Troubleshooting
"Invalid JSON" error
Your schema isn’t valid JSON. Check for:
- Missing commas between properties
- Missing closing braces }
- Trailing commas (not allowed in JSON)
"Must have a type field" error
Every schema needs a root type. Usually this should be "object".
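A minimal sketch of a schema with a root type, using a single hypothetical field:

```json
{
  "type": "object",
  "properties": {
    "document_title": { "type": "string" }
  },
  "required": ["document_title"]
}
```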
Empty or missing extracted fields
- Ensure the field is marked as required if it should always be extracted
- Check that your field names clearly describe what to extract
- Verify the document actually contains the information
Extraction taking too long
Large schemas with many fields take longer to process. Consider:
- Breaking into multiple smaller knowledge bases
- Removing non-essential fields
- Using a faster extraction model
