Extraction Schemas

Overview

Your AI agents can already search and read the full contents of every file in a knowledge base. Extraction schemas add a structured index on top of that: a consistent set of fields extracted from each document that enables quick filtering and cross-document analysis.

Think of it this way: without an extraction schema, answering “which contracts are worth over $1M?” requires the AI to open and read every contract. With an extraction schema that includes total_value as a number field, that question becomes a simple filter.

Extraction schemas are most valuable for:

Numeric and date fields you want to compare, aggregate, or filter (amounts, counts, percentages, dates)
Categorical fields you want to group or filter by (document type, status, region, sector)
Key identifiers that help locate the right documents quickly (deal names, parties, project codes)

String fields like summaries can be useful too, but the real power is in structured data that saves you from scanning every file when asking questions across your entire knowledge base.
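To make the "simple filter" idea concrete, here is a minimal sketch in Python. It assumes the extracted data is available as a list of dictionaries, one per file; the file names, the `file` key, and the `contracts_over` helper are hypothetical and only illustrate the shape of the query.

```python
# Hypothetical extraction results: one dict per contract, shaped by a schema
# that includes a numeric total_value field (null when not stated).
extracted = [
    {"file": "acme_msa.pdf", "total_value": 2_500_000},
    {"file": "globex_nda.pdf", "total_value": None},
    {"file": "initech_sow.pdf", "total_value": 750_000},
]


def contracts_over(records, threshold):
    """Return the file names whose total_value exceeds the threshold."""
    return [
        r["file"]
        for r in records
        if r["total_value"] is not None and r["total_value"] > threshold
    ]


large_contracts = contracts_over(extracted, 1_000_000)
```

No document has to be reopened to answer the question; the comparison runs over the extracted field alone.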

Choosing What to Extract

Before writing your schema, ask yourself: “What questions do I want to answer across all my documents?”

Question Type	Extract These Fields
“Which deals are over $10M?”	`deal_value` (number)
“Show me all contracts expiring this quarter”	`expiration_date` (string in YYYY-MM-DD)
“Break down documents by type”	`document_type` (enum)
“Find all projects in the healthcare sector”	`sector` (enum)
“What’s our average contract value by region?”	`contract_value` (number), `region` (enum)
“List all documents involving Acme Corp”	`parties` (array of strings)
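As an illustration of why the table above pairs a number field with an enum field, the following sketch answers "average contract value by region" over extracted records. The sample records and the `average_by` helper are assumptions made for this example, not part of the product.

```python
from collections import defaultdict

# Hypothetical extracted records with a number field and an enum field.
records = [
    {"contract_value": 100000.0, "region": "EMEA"},
    {"contract_value": 300000.0, "region": "EMEA"},
    {"contract_value": 50000.0, "region": "APAC"},
]


def average_by(rows, group_field, value_field):
    """Average value_field for each distinct value of group_field."""
    totals = defaultdict(lambda: [0.0, 0])  # group -> [running sum, count]
    for row in rows:
        bucket = totals[row[group_field]]
        bucket[0] += row[value_field]
        bucket[1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


averages = average_by(records, "region", "contract_value")
```

Grouping works cleanly only because `region` is constrained to a fixed set of values; free-form region strings would split one group into several.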

Prioritize fields that are:

Frequently used for filtering or sorting
Needed for aggregations (sums, averages, counts)
Used to categorize or group documents
Key identifiers for finding specific documents

Skip fields that are:

Only useful when reading the full document (the AI can already do that)
Highly variable or unstructured across documents
Rarely needed for cross-document questions

Schema Basics

Extraction schemas use JSON Schema format. At minimum, your schema needs:

A type field (usually "object")
A properties field defining what to extract
A required array listing mandatory fields

{
  "type": "object",
  "required": ["title", "date", "summary"],
  "properties": {
    "title": { "type": "string", "description": "Document title as shown on the first page" },
    "date": { "type": "string", "description": "Publication or effective date in YYYY-MM-DD format" },
    "summary": { "type": "string", "description": "One paragraph summary of the document's main purpose" }
  }
}
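Before pasting a schema into the product, it can help to check these three basics locally. The sketch below is an assumption about what such a check might look like; `check_schema_basics` is a hypothetical helper and does not reproduce the validation the platform performs.

```python
def check_schema_basics(schema):
    """Return a list of problems with the root of an extraction schema."""
    problems = []
    if "type" not in schema:
        problems.append("missing root 'type'")

    properties = schema.get("properties")
    if not isinstance(properties, dict) or not properties:
        problems.append("missing or empty 'properties'")
        properties = {}

    required = schema.get("required")
    if not isinstance(required, list):
        problems.append("missing 'required' array")
        required = []

    # Every name listed in 'required' should be defined in 'properties'.
    for name in required:
        if name not in properties:
            problems.append(f"required field '{name}' is not defined in 'properties'")
    return problems


schema = {
    "type": "object",
    "required": ["title", "date"],
    "properties": {
        "title": {"type": "string"},
        "date": {"type": "string"},
    },
}
problems = check_schema_basics(schema)
```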

Supported Types

Use these types for your schema fields:

Type	Description	Example Value
`string`	Text values	`"Quarterly Report"`
`number`	Decimal numbers	`123.45`
`integer`	Whole numbers	`42`
`boolean`	True/false values	`true`
`array`	Lists of items	`["item1", "item2"]`
`object`	Nested structures	`{"name": "John"}`
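When you consume extracted values in code, each schema type corresponds to a native type. The sketch below shows one possible mapping in Python; `matches_type` is a hypothetical helper, and it deliberately simplifies JSON Schema (for example, it treats `1.0` as a number but not as an integer).

```python
def matches_type(value, type_name):
    """Check a Python value against one of the schema type names."""
    if type_name == "string":
        return isinstance(value, str)
    if type_name == "boolean":
        return isinstance(value, bool)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if type_name == "integer":
        return isinstance(value, int) and not isinstance(value, bool)
    if type_name == "number":
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if type_name == "array":
        return isinstance(value, list)
    if type_name == "object":
        return isinstance(value, dict)
    if type_name == "null":
        return value is None
    raise ValueError(f"unsupported type name: {type_name}")
```

The boolean check matters in practice: without it, `True` would pass as an integer and distort counts and sums.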

Making Fields Optional

Use a type array with null to make fields optional:

{
  "type": "object",
  "required": ["company_name"],
  "properties": {
    "company_name": { "type": "string", "description": "Legal company name" },
    "website": { "type": ["string", "null"], "description": "Company website URL, if mentioned" },
    "employee_count": { "type": ["integer", "null"], "description": "Number of employees, if stated" }
  }
}

When a field uses ["type", "null"], the AI will return null if the information isn’t found in the document.
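Code that reads optional fields needs to expect `null` (which becomes `None` in Python). A minimal sketch, assuming extracted records arrive as dictionaries; the sample companies and the `known_values` helper are hypothetical.

```python
# Hypothetical records produced by the schema above.
companies = [
    {"company_name": "Acme Corp", "website": "https://acme.example", "employee_count": 120},
    {"company_name": "Globex", "website": None, "employee_count": None},
    {"company_name": "Initech", "website": None, "employee_count": 80},
]


def known_values(records, field):
    """Collect the non-null values of an optional field."""
    return [r[field] for r in records if r.get(field) is not None]


counts = known_values(companies, "employee_count")
# Average only over documents that actually stated a headcount.
average_headcount = sum(counts) / len(counts) if counts else None
```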

Using Enums

Enums constrain a field to specific allowed values. Use enums whenever you have a known set of categories:

{
  "type": "object",
  "required": ["document_type", "status"],
  "properties": {
    "document_type": {
      "type": "string",
      "enum": ["CONTRACT", "INVOICE", "PROPOSAL", "REPORT"],
      "description": "Classification of the document"
    },
    "status": {
      "type": "string",
      "enum": ["DRAFT", "PENDING", "APPROVED", "REJECTED"],
      "description": "Current approval status"
    },
    "priority": {
      "type": ["string", "null"],
      "enum": ["LOW", "MEDIUM", "HIGH", null],
      "description": "Priority level, if indicated"
    }
  }
}

Enums work with strings, numbers, booleans, or null values. If the field should be optional, include null in both the type array and the enum array, as in the priority field above.
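A nullable enum is easy to get half right, with `null` in the type but not in the enum list, or the reverse. The following sketch flags such mismatches; `enum_null_problems` is a hypothetical helper written for this example.

```python
def enum_null_problems(properties):
    """Return names of enum fields where 'type' and 'enum' disagree about null."""
    problems = []
    for name, spec in properties.items():
        if "enum" not in spec:
            continue
        declared = spec.get("type")
        types = declared if isinstance(declared, list) else [declared]
        type_allows_null = "null" in types
        enum_allows_null = None in spec["enum"]
        if type_allows_null != enum_allows_null:
            problems.append(name)
    return problems


properties = {
    "status": {"type": "string", "enum": ["DRAFT", "PENDING", "APPROVED", "REJECTED"]},
    "priority": {"type": ["string", "null"], "enum": ["LOW", "MEDIUM", "HIGH", None]},
    # Mismatch: the type allows null but the enum does not list it.
    "urgency": {"type": ["string", "null"], "enum": ["LOW", "HIGH"]},
}
mismatched = enum_null_problems(properties)
```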

Arrays

Use arrays to extract lists of items:

{
  "type": "object",
  "required": ["authors", "keywords"],
  "properties": {
    "authors": {
      "type": "array",
      "items": { "type": "string" },
      "description": "Full names of all authors"
    },
    "keywords": {
      "type": "array",
      "items": { "type": "string" },
      "description": "Key topics or tags relevant to this document"
    }
  }
}
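Array fields make membership questions such as "which documents involve this person?" straightforward. A minimal sketch, assuming extracted records as dictionaries; the sample data, the `file` key, and the `with_item` helper are hypothetical.

```python
# Hypothetical records produced by the schema above.
papers = [
    {"file": "a.pdf", "authors": ["Ada Lovelace", "Alan Turing"], "keywords": ["computation"]},
    {"file": "b.pdf", "authors": ["Grace Hopper"], "keywords": ["compilers"]},
    {"file": "c.pdf", "authors": [], "keywords": ["computation", "history"]},
]


def with_item(records, field, needle):
    """Files whose array field contains an item matching needle (case-insensitive)."""
    needle = needle.casefold()
    return [
        r["file"]
        for r in records
        if any(needle in item.casefold() for item in r[field])
    ]


turing_papers = with_item(papers, "authors", "turing")
computation_papers = with_item(papers, "keywords", "computation")
```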

Arrays of Objects

Extract structured lists with typed objects:

{
  "type": "object",
  "required": ["line_items"],
  "properties": {
    "line_items": {
      "type": "array",
      "description": "Individual items or charges listed in the document",
      "items": {
        "type": "object",
        "required": ["description", "amount"],
        "properties": {
          "description": { "type": "string", "description": "Item or service description" },
          "quantity": { "type": ["integer", "null"], "description": "Number of units, if specified" },
          "amount": { "type": "number", "description": "Line item total in USD" }
        }
      }
    }
  }
}
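One practical use of an array of objects is reconciling line items against a stated total. The sketch below sums the `amount` of each line item; the sample invoice and the `line_item_total` helper are assumptions for illustration.

```python
import math

# Hypothetical record produced by the schema above.
invoice = {
    "line_items": [
        {"description": "Consulting", "quantity": 10, "amount": 1500.0},
        {"description": "Travel", "quantity": None, "amount": 249.5},
    ]
}


def line_item_total(record):
    """Sum the amount of every line item, using fsum to limit float error."""
    return math.fsum(item["amount"] for item in record["line_items"])


total = line_item_total(invoice)
```

Only `description` and `amount` are required on each item, so the code reads `amount` directly and never depends on `quantity`.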

Nested Objects

Group related fields using nested objects:

{
  "type": "object",
  "required": ["vendor", "total_amount"],
  "properties": {
    "vendor": {
      "type": "object",
      "description": "Information about the vendor or supplier",
      "required": ["name"],
      "properties": {
        "name": { "type": "string", "description": "Vendor's legal business name" },
        "address": { "type": ["string", "null"], "description": "Full mailing address" },
        "contact_email": { "type": ["string", "null"], "description": "Primary contact email" }
      }
    },
    "total_amount": { "type": "number", "description": "Total amount due in USD" }
  }
}
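Nested objects are convenient for grouping but sometimes need flattening, for example before loading records into a spreadsheet. A minimal sketch with a hypothetical `flatten` helper that joins nested keys with dots:

```python
def flatten(record, prefix=""):
    """Flatten nested dictionaries into a single level with dotted keys."""
    flat = {}
    for key, value in record.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{path}."))
        else:
            flat[path] = value
    return flat


# Hypothetical record produced by the schema above.
record = {
    "vendor": {"name": "Acme Corp", "address": None, "contact_email": "ap@acme.example"},
    "total_amount": 1200.0,
}
row = flatten(record)
```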

Best Practices

1. Require Fields Whenever Possible

Mark fields as required unless they genuinely may not exist in documents. This helps the AI understand which information is essential:

{
  "required": ["title", "date", "amount"],
  "properties": { ... }
}

2. Use Enums for Categories

Whenever you have a known set of possible values, use enums instead of free-form strings:

// Good - constrained values
"status": { "type": "string", "enum": ["ACTIVE", "INACTIVE", "PENDING"] }

// Avoid - unconstrained
"status": { "type": "string" }

3. Keep Nesting Shallow

Prefer one or two levels of nesting to group related fields logically:

{
  "deal_name": { "type": "string" },
  "deal_type": { "type": "string" },
  "buyer": {
    "type": "object",
    "properties": {
      "name": { "type": "string" },
      "contact_email": { "type": "string" }
    }
  }
}
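To keep an eye on nesting as a schema grows, you could measure its depth. The sketch below counts levels of object nesting, with the root object as level 1; this counting convention and the `nesting_depth` helper are assumptions made for this example.

```python
def nesting_depth(schema):
    """Levels of object nesting; a flat object schema has depth 1."""
    declared = schema.get("type")
    types = declared if isinstance(declared, list) else [declared]
    if "array" in types:
        # Arrays do not add a level themselves; look at their items.
        return nesting_depth(schema.get("items", {}))
    if "object" not in types:
        return 0
    children = schema.get("properties", {}).values()
    return 1 + max((nesting_depth(child) for child in children), default=0)


schema = {
    "type": "object",
    "properties": {
        "deal_name": {"type": "string"},
        "buyer": {
            "type": "object",
            "properties": {"name": {"type": "string"}},
        },
    },
}
depth = nesting_depth(schema)
```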

4. Use Descriptions to Instruct the AI

The description field is your primary way to guide the extraction AI. Think of descriptions as instructions telling the AI exactly how to interpret each field when analyzing the document:

{
  "type": "object",
  "required": ["effective_date", "total_value", "parties"],
  "properties": {
    "effective_date": {
      "type": "string",
      "description": "Contract start date in YYYY-MM-DD format"
    },
    "total_value": {
      "type": "number",
      "description": "Total contract value in USD, excluding taxes and fees"
    },
    "renewal_terms": {
      "type": ["string", "null"],
      "description": "Summary of auto-renewal clauses. Return null if no auto-renewal provisions exist."
    },
    "parties": {
      "type": "array",
      "items": { "type": "string" },
      "description": "Legal names of all signing parties, extracted exactly as written in the signature block"
    }
  }
}

Use descriptions to specify:

Date and number formats: "YYYY-MM-DD", "in USD", "as a decimal percentage (e.g., 0.15 for 15%)"
What to include or exclude: "excluding taxes", "only from the executive summary section"
How to handle missing data: "Return null if not explicitly stated"
Where to look: "from the signature block", "as stated in the header"
Level of detail: "one-sentence summary", "verbatim quote", "3-5 bullet points"

Well-written descriptions dramatically improve extraction accuracy.
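Since descriptions carry so much of the instruction, it can be worth listing fields that lack one. The following sketch walks a schema, including nested objects and arrays of objects; `fields_missing_description` is a hypothetical helper, not a product feature.

```python
def fields_missing_description(schema, prefix=""):
    """Return dotted paths of fields with no (or an empty) description."""
    missing = []
    for name, spec in schema.get("properties", {}).items():
        path = f"{prefix}{name}"
        if not spec.get("description", "").strip():
            missing.append(path)
        # For arrays, descend into the item schema; otherwise into the field itself.
        nested = spec["items"] if isinstance(spec.get("items"), dict) else spec
        missing.extend(fields_missing_description(nested, prefix=f"{path}."))
    return missing


schema = {
    "type": "object",
    "properties": {
        "title": {"type": "string", "description": "Document title"},
        "vendor": {"type": "object", "properties": {"name": {"type": "string"}}},
        "line_items": {
            "type": "array",
            "description": "Charges listed in the document",
            "items": {
                "type": "object",
                "properties": {
                    "amount": {"type": "number", "description": "Total in USD"},
                    "quantity": {"type": "integer"},
                },
            },
        },
    },
}
undocumented = fields_missing_description(schema)
```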

5. Test Across Multiple Documents

Before finalizing your schema, test it against several representative documents—not just one. A schema that works perfectly for one document may fail on others. Testing across examples helps you identify:

Fields that don’t exist in all documents (should be optional with null)
Categories that vary more than expected (broaden your enums)
Format inconsistencies (clarify your descriptions)
Edge cases you didn’t anticipate

The goal is a schema that works generically across all documents you’ll upload to the knowledge base, not one optimized for a single example.
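One way to review a test run is to compute how often each field comes back null across the sample documents. The sketch below assumes you have collected the extracted records as dictionaries; the sample data and the `null_rates` helper are hypothetical.

```python
def null_rates(records, fields):
    """Fraction of records in which each field is null or absent."""
    if not records:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for r in records if r.get(field) is None) / len(records)
        for field in fields
    }


# Hypothetical extraction results from four test contracts.
results = [
    {"effective_date": "2024-01-01", "expiration_date": None},
    {"effective_date": "2024-03-15", "expiration_date": "2025-03-14"},
    {"effective_date": "2023-11-30", "expiration_date": None},
    {"effective_date": "2024-06-01", "expiration_date": None},
]
rates = null_rates(results, ["effective_date", "expiration_date"])
```

A field that is null in most test documents is a candidate for being optional, or for a clearer description of where to find it.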

Example Schemas

Invoice Extraction

{
  "type": "object",
  "required": ["invoice_number", "vendor_name", "total_amount", "line_items"],
  "properties": {
    "invoice_number": { "type": "string" },
    "invoice_date": { "type": ["string", "null"] },
    "due_date": { "type": ["string", "null"] },
    "vendor_name": { "type": "string" },
    "vendor_address": { "type": ["string", "null"] },
    "total_amount": { "type": "number" },
    "currency": {
      "type": "string",
      "enum": ["USD", "EUR", "GBP", "CAD"]
    },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["description", "amount"],
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": ["number", "null"] },
          "unit_price": { "type": ["number", "null"] },
          "amount": { "type": "number" }
        }
      }
    }
  }
}

Research Paper Extraction

{
  "type": "object",
  "required": ["title", "authors", "abstract", "methodology"],
  "properties": {
    "title": { "type": "string" },
    "authors": {
      "type": "array",
      "items": { "type": "string" }
    },
    "publication_date": { "type": ["string", "null"] },
    "journal": { "type": ["string", "null"] },
    "abstract": { "type": "string" },
    "methodology": { "type": "string" },
    "key_findings": {
      "type": "array",
      "items": { "type": "string" }
    },
    "limitations": { "type": ["string", "null"] },
    "sample_size": { "type": ["integer", "null"] }
  }
}

Contract Extraction

{
  "type": "object",
  "required": ["contract_type", "parties", "effective_date", "key_terms"],
  "properties": {
    "contract_type": {
      "type": "string",
      "enum": ["NDA", "MSA", "SOW", "EMPLOYMENT", "LEASE", "OTHER"]
    },
    "parties": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["name", "role"],
        "properties": {
          "name": { "type": "string" },
          "role": {
            "type": "string",
            "enum": ["PARTY_A", "PARTY_B", "GUARANTOR", "WITNESS"]
          }
        }
      }
    },
    "effective_date": { "type": "string" },
    "expiration_date": { "type": ["string", "null"] },
    "auto_renewal": { "type": ["boolean", "null"] },
    "total_value": { "type": ["number", "null"] },
    "key_terms": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["term", "description"],
        "properties": {
          "term": { "type": "string" },
          "description": { "type": "string" }
        }
      }
    }
  }
}

Setting Up Extraction

1. Go to Control Hub → Knowledge Bases
2. Create a new knowledge base or edit an existing one
3. Under Structured Data Extraction:
- Select an Extraction Model (e.g., GPT-4o)
- Paste your JSON schema into the Extraction Schema field
4. Save and upload your documents

The schema only applies to newly uploaded files. To re-extract data from existing files, use the retry extraction option in the file details.

Viewing Extracted Data

After processing completes:

Click on a file in your knowledge base
View the Extracted Data section
The structured JSON output matches your schema definition

You can also access extracted data programmatically through agents using the search_knowledge_base tool.

Troubleshooting

“Invalid JSON” Error

Your schema isn’t valid JSON. Check for:

Missing commas between properties
Missing closing braces }
Trailing commas (not allowed in JSON)
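To locate the problem before pasting the schema, you can parse it locally. The sketch below uses Python's standard `json` module, which reports the line and column of the first error; `find_json_error` is a hypothetical helper name.

```python
import json


def find_json_error(text):
    """Return a description of the first JSON syntax error, or None if valid."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


valid = find_json_error('{"type": "object"}')
# A trailing comma after the last property is rejected by the parser.
trailing_comma = find_json_error('{"type": "object",}')
```

The exact wording of the message depends on the Python version, so rely on the line and column rather than on the text.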

“Must have a ‘type’ field” Error

Every schema needs a root type. Usually this should be "object":

{
  "type": "object",
  ...
}

Empty or Missing Extracted Fields

Ensure the field is marked as required if it should always be extracted
Check that your field names and descriptions clearly state what to extract
Verify the document actually contains the information

Extraction Taking Too Long

Large schemas with many fields take longer to process. Consider:

Breaking into multiple smaller knowledge bases
Removing non-essential fields
Using a faster extraction model
