Schemas define exactly what data to extract from your documents and in what format. A well-defined schema is the key to accurate, consistent extractions.
Automat schemas follow JSON Schema conventions with some AI-specific extensions for better extraction guidance.
Every field should include:
For text values:
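A minimal sketch of a text field, assuming the standard JSON Schema keywords `type` and `description` apply. It is written as a Python dict that maps one-to-one onto the JSON form, and the field name `vendor_name` is hypothetical.

```python
# Hypothetical text field; "string" is the standard JSON Schema type for text.
vendor_name = {
    "type": "string",
    "description": "Legal name of the vendor as printed in the invoice header",
}
```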
For numeric values (integers and decimals):
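As a sketch under the assumption that standard JSON Schema types are used, `number` would cover decimals and `integer` whole numbers. Both field names below are hypothetical.

```python
# Hypothetical decimal field: amounts may carry a fractional part.
total_amount = {
    "type": "number",
    "description": "Final amount due including tax, as a decimal without currency symbol",
}

# Hypothetical integer field: counts are always whole numbers.
line_item_count = {
    "type": "integer",
    "description": "Number of rows in the line items table",
}
```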
For true/false values:
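A true/false field could look like the sketch below, assuming the standard JSON Schema `boolean` type. The field name `is_paid` is hypothetical.

```python
# Hypothetical true/false field; the description states what counts as "true".
is_paid = {
    "type": "boolean",
    "description": "True if the invoice is marked as paid (for example a 'PAID' stamp)",
}
```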
For a predefined set of values:
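One way to express a fixed set of allowed values is the standard JSON Schema `enum` keyword. The sketch below assumes Automat accepts it, and the field name and currency codes are illustrative only.

```python
# Hypothetical enum field: the extracted value must be one of the listed codes.
currency = {
    "type": "string",
    "enum": ["USD", "EUR", "GBP"],
    "description": "ISO 4217 currency code of the invoice total",
}
```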
For lists of values:
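A list field can be sketched with the standard JSON Schema `array` type, where `items` describes each element. The field name `reference_numbers` is hypothetical.

```python
# Hypothetical list field: zero or more strings, each described by "items".
reference_numbers = {
    "type": "array",
    "description": "All purchase order numbers mentioned on the invoice",
    "items": {
        "type": "string",
        "description": "A single purchase order number",
    },
}
```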
For nested structures:
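Nested data can be sketched with the standard JSON Schema `object` type, whose `properties` hold the child fields. The `vendor` structure and its field names are hypothetical.

```python
# Hypothetical nested field: related vendor details grouped under one object.
vendor = {
    "type": "object",
    "description": "Company that issued the invoice",
    "properties": {
        "name": {
            "type": "string",
            "description": "Legal name of the vendor",
        },
        "address": {
            "type": "string",
            "description": "Full postal address of the vendor on one line",
        },
    },
}
```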
Here’s a comprehensive invoice extraction schema:
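The sketch below is an illustrative invoice schema, not Automat's reference example. It assumes only standard JSON Schema keywords (`type`, `description`, `enum`, `items`, `properties`, `required`), and every field name is hypothetical. It is built as a Python dict and printed as JSON with the standard `json` module.

```python
import json

# Illustrative invoice schema; all field names are hypothetical.
invoice_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {
            "type": "string",
            "description": "Unique invoice identifier, usually labeled 'Invoice #' or 'Invoice No.'",
        },
        "invoice_date": {
            "type": "string",
            "description": "Invoice issue date, typically at the top of the document, in YYYY-MM-DD format",
        },
        "vendor": {
            "type": "object",
            "description": "Company that issued the invoice",
            "properties": {
                "name": {"type": "string", "description": "Legal name of the vendor"},
                "address": {"type": "string", "description": "Full postal address of the vendor"},
            },
        },
        "currency": {
            "type": "string",
            "enum": ["USD", "EUR", "GBP"],
            "description": "ISO 4217 currency code of the invoice total",
        },
        "line_items": {
            "type": "array",
            "description": "One entry per billed row in the items table",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string", "description": "Text describing the billed product or service"},
                    "quantity": {"type": "number", "description": "Number of units billed"},
                    "unit_price": {"type": "number", "description": "Price per unit before tax"},
                    "amount": {"type": "number", "description": "Row total: quantity times unit price"},
                },
            },
        },
        "subtotal": {"type": "number", "description": "Sum of all line items before tax"},
        "tax_amount": {"type": "number", "description": "Total tax charged on the invoice"},
        "total_amount": {"type": "number", "description": "Final amount due including tax"},
        "is_paid": {"type": "boolean", "description": "True if the invoice is marked as paid"},
    },
    # Assumption: mandatory fields are listed in a JSON Schema style "required" array.
    "required": ["invoice_number", "invoice_date", "total_amount"],
}

# Serialize to the JSON form of the schema.
print(json.dumps(invoice_schema, indent=2))
```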
Descriptions are crucial for accurate extraction. They guide the AI on:
❌ "Date of the document"
✅ "Invoice issue date, typically at the top of the document, in YYYY-MM-DD format"
When a document has multiple similar values:
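One way to handle this is to let each description say how its value differs from the look-alikes. The sketch below uses three hypothetical amount fields that often appear side by side on an invoice; the label wording in the descriptions is an assumption about typical documents.

```python
# Hypothetical fields: each description names the label and position that
# distinguish this amount from the other amounts on the same page.
amount_fields = {
    "subtotal": {
        "type": "number",
        "description": "Sum of line items before tax, labeled 'Subtotal'; not the final total",
    },
    "tax_amount": {
        "type": "number",
        "description": "Tax charged, labeled 'Tax', 'VAT' or 'GST'; appears between subtotal and total",
    },
    "total_amount": {
        "type": "number",
        "description": "Final amount due including tax, labeled 'Total' or 'Amount Due'; usually the last amount",
    },
}
```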
By default, all fields are optional. Use the required property for mandatory fields:
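In standard JSON Schema, `required` is an array on the object that lists mandatory property names. The sketch below assumes Automat follows that convention, which the text above does not confirm; the field names are hypothetical.

```python
# Assumption: "required" is a JSON Schema style array on the enclosing object.
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Unique invoice identifier"},
        "total_amount": {"type": "number", "description": "Final amount due including tax"},
        "due_date": {"type": "string", "description": "Payment due date in YYYY-MM-DD format"},
    },
    "required": ["invoice_number", "total_amount"],
}

# Any property not listed under "required" stays optional.
optional_fields = [name for name in schema["properties"] if name not in schema["required"]]
print(optional_fields)
```

Here `due_date` is the only optional field, since it is the only property missing from the `required` list.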
Begin with 5-10 core fields, validate accuracy, then expand.
Use snake_case and descriptive names: total_amount, not amt.
Use nested objects for related data (vendor info, customer info, etc.).
Test with documents missing optional fields to ensure graceful handling.
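A check for graceful handling of missing fields can be sketched as follows. The helper `missing_required`, the schema, and the extraction result are all hypothetical; the sketch assumes extraction results arrive as plain dicts and that mandatory fields are listed in a `required` array.

```python
def missing_required(schema: dict, result: dict) -> list:
    """Return required field names that are absent or null in an extraction result."""
    return [name for name in schema.get("required", []) if result.get(name) is None]


# Hypothetical schema with one optional field (due_date).
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string", "description": "Unique invoice identifier"},
        "total_amount": {"type": "number", "description": "Final amount due including tax"},
        "due_date": {"type": "string", "description": "Payment due date in YYYY-MM-DD format"},
    },
    "required": ["invoice_number", "total_amount"],
}

# Hypothetical extraction result for a document that prints no due date.
result = {"invoice_number": "INV-001", "total_amount": 120.5}

# All required fields are present, so nothing is reported as missing.
problems = missing_required(schema, result)

# Read optional fields with .get() so a missing key yields None, not a KeyError.
due_date = result.get("due_date")
```

Downstream code that reads optional fields with `.get()` keeps working when a document omits them, while `missing_required` still flags results that lack a mandatory field.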