One POST. Send text, an image, or a PDF; declare the fields you want; get back structured JSON, ready-to-embed HTML, or a full semantic rendering of the source document — plus a 0–1 confidence per field and the exact source phrase each value came from. Built so your workflow can auto-approve the easy cases and escalate the uncertain ones — without re-parsing the LLM's prose.
Strings up to ~200K chars. Parsed emails, scraped pages, copied invoices.
JPEG / PNG / WebP / TIFF. The vision LLM reads the image directly, with no OCR-first pass, so table layout and form structure carry through.
Text-layer extraction runs first; scanned PDFs fall back to per-page rasterisation. Up to 10 pages per call; use page_range to slice larger documents.
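For documents longer than the 10-page limit, the slicing can be sketched as below. This is a minimal illustration, not the documented client: the helper name `page_ranges` and the "first-last" string form for page_range are assumptions, since the exact page_range syntax is not shown here.

```python
from typing import Iterator, Tuple

MAX_PAGES_PER_CALL = 10  # the per-call limit stated above


def page_ranges(total_pages: int, chunk: int = MAX_PAGES_PER_CALL) -> Iterator[Tuple[int, int]]:
    """Yield inclusive, 1-based (first, last) windows that cover the document."""
    if total_pages < 0 or chunk < 1:
        raise ValueError("total_pages must be >= 0 and chunk must be >= 1")
    first = 1
    while first <= total_pages:
        last = min(first + chunk - 1, total_pages)
        yield first, last
        first = last + 1


# One extraction call per window. The "first-last" string is an assumed format.
windows = [f"{first}-{last}" for first, last in page_ranges(23)]
print(windows)  # ['1-10', '11-20', '21-23']
```

Each window would go out as its own call, and the caller merges the per-window results afterwards.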
Pick the shape that fits your downstream system. JSON for automation, HTML for human-readable surfaces, document HTML for re-flowing scanned content. Caller picks per-call via output_format.
Structured fields keyed by your schema, plus a 0–1 confidence per field and the source phrase each value was read from. The shape your workflow code lives on.
Same JSON plus a deterministic HTML render of the extracted fields (<dl> for scalars, <table> for line items). Drop it straight into a ticket comment, email body, or audit note.
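To make the idea of a deterministic render concrete, here is a rough sketch of how scalars could map to a <dl> and line items to a <table>. It is not Velgent's renderer: the function name `render_fields`, the assumption that line items arrive as a list of objects, and the exact markup are all illustrative.

```python
from html import escape
from typing import Any, Dict, List


def render_fields(extracted: Dict[str, Any]) -> str:
    """Render scalars as one <dl> and each list-valued field as a <table>."""
    scalars = {k: v for k, v in extracted.items() if not isinstance(v, list)}
    tables = {k: v for k, v in extracted.items() if isinstance(v, list)}
    parts: List[str] = []
    if scalars:
        rows = "".join(
            f"<dt>{escape(k)}</dt><dd>{escape(str(v))}</dd>" for k, v in scalars.items()
        )
        parts.append(f"<dl>{rows}</dl>")
    for name, items in tables.items():
        # Column order follows first appearance, so the same input always
        # produces the same markup.
        cols: List[str] = []
        for item in items:
            for key in item:
                if key not in cols:
                    cols.append(key)
        head = "".join(f"<th>{escape(c)}</th>" for c in cols)
        body = "".join(
            "<tr>" + "".join(f"<td>{escape(str(item.get(c, '')))}</td>" for c in cols) + "</tr>"
            for item in items
        )
        parts.append(
            f"<table><caption>{escape(name)}</caption>"
            f"<thead><tr>{head}</tr></thead><tbody>{body}</tbody></table>"
        )
    return "".join(parts)


print(render_fields({"invoice_number": "INV-2026-001", "currency": "USD"}))
```

Escaping every key and value matters here because the output is meant to be dropped into tickets and emails, where extracted text must not be interpreted as markup.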
Whole-document semantic HTML — image or PDF only, no schema. Velgent re-flows the source into headings, paragraphs, lists, and tables. Useful for re-using scanned content in modern surfaces.
A 0.0–1.0 score on every leaf field. Multi-signal: LLM self-rating + anchor verification + source-quality + schema validation. Signals combine with min semantics: the weakest signal sets the score, so a suspected hallucination outweighs LLM optimism.
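The min-semantics rule is easy to picture in code. The sketch below assumes each signal has already been normalised to 0–1; the signal names mirror the list above, while the function name `field_confidence` and the example values are made up for illustration.

```python
from typing import Mapping


def field_confidence(signals: Mapping[str, float]) -> float:
    """Combine per-field signals with min semantics: the weakest one wins."""
    if not signals:
        raise ValueError("at least one signal is required")
    for name, value in signals.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"signal {name!r} is outside 0-1: {value}")
    return min(signals.values())


# Illustrative values: the LLM is optimistic, but the anchor check failed.
signals = {
    "llm_self_rating": 0.97,
    "anchor_verification": 0.20,
    "source_quality": 0.90,
    "schema_validation": 1.00,
}
print(field_confidence(signals))  # 0.2
```

An average over the same four values would come out above 0.75, which is why a minimum is the safer choice when the goal is to catch a single failed check.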
Every value carries the source phrase it was read from. Auto-verified server-side; mismatches drop confidence sharply. Audit-ready by default — no opt-in flag.
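A simplified picture of anchor verification: check that the quoted phrase really occurs in the source, and cap the confidence when it does not. The whitespace and case normalisation, the function name `verify_anchor`, and the 0.2 cap are assumptions for this sketch; the actual server-side check and penalty are not specified here.

```python
import re


def _normalise(text: str) -> str:
    """Collapse whitespace and case so layout differences don't cause misses."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def verify_anchor(source_text: str, anchor_text: str, confidence: float,
                  penalty_cap: float = 0.2) -> float:
    """Return the confidence unchanged if the anchor is found, else cap it.

    penalty_cap is a hypothetical value chosen for illustration.
    """
    if anchor_text and _normalise(anchor_text) in _normalise(source_text):
        return confidence
    return min(confidence, penalty_cap)


source = "Invoice INV-2026-001 dated 2026-05-28, total $1,250.00 USD"
print(verify_anchor(source, "total $1,250.00 USD", 0.95))  # 0.95
print(verify_anchor(source, "total $9,999.00 USD", 0.95))  # 0.2
```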
Declare the fields inline per request, or publish a named template once in admin and reference it by slug. Templates are versioned; pin a specific version for replay or canary.
No persistent storage of the file. Bytes held in memory only; cleared after the response. PII redaction before the LLM on text inputs. BYOK + tenant-residency-aware LLM routing inherited from the rest of the engine.
POST /api/v1/extract
{
  "text": "Invoice INV-2026-001 dated 2026-05-28, total $1,250.00 USD",
  "extraction_schema": {
    "fields": {
      "invoice_number": { "type": "string", "pattern": "^INV-\\d+" },
      "total": { "type": "number", "min": 0 },
      "currency": { "type": "enum", "values": ["USD","EUR","GBP"] },
      "invoice_date": { "type": "date", "format": "iso8601_date" }
    }
  }
}

Response
{
  "extracted": {
    "invoice_number": "INV-2026-001",
    "total": 1250.00,
    "currency": "USD",
    "invoice_date": "2026-05-28"
  },
  "confidence": {
    "invoice_number": 0.98,
    "total": 0.95,
    "currency": 0.99,
    "invoice_date": 0.97
  },
  "anchors": {
    "total": { "text": "total $1,250.00 USD", "page": null }
  }
  // + components, metadata, schema_drift, warnings
}

Full schema reference, error codes, and worked examples for image and PDF inputs: docs.velgent.com/operations/data-extractor.
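For callers working from Python, the request above could be assembled roughly like this. It is a sketch under assumptions: `BASE_URL` is a placeholder host, authentication is left out because it is not described in this section, and the helper names are hypothetical. Only the path, the body shape, and the `confidence` map come from the example above.

```python
import json
import urllib.request
from typing import Any, Dict, List

BASE_URL = "https://api.example.com"  # placeholder host, not from the docs


def build_extract_request(text: str, fields: Dict[str, Any]) -> urllib.request.Request:
    """Build (but do not send) a POST /api/v1/extract request for a text input."""
    body = json.dumps({"text": text, "extraction_schema": {"fields": fields}})
    return urllib.request.Request(
        BASE_URL + "/api/v1/extract",
        data=body.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},  # auth headers omitted
    )


def low_confidence_fields(response: Dict[str, Any], threshold: float = 0.9) -> List[str]:
    """List the fields whose confidence falls below the threshold."""
    return sorted(name for name, score in response["confidence"].items() if score < threshold)


request = build_extract_request(
    "Invoice INV-2026-001 dated 2026-05-28, total $1,250.00 USD",
    {"total": {"type": "number", "min": 0}},
)
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; the response body can then be parsed with `json.loads` and handed to `low_confidence_fields` to decide which values need a second look.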
Passport scans, invoices with banking details, KYC PDFs — the documents you'd send to Data Extractor are exactly the ones you don't want sitting in a vendor's database. Here's what actually happens to them.
Uploaded bytes live in process memory only for the lifetime of the request, then are cleared in a finally block before the response returns. We don't write the file to disk, we don't put it in a queue, we don't keep it around for re-extraction. Need to re-extract? Send the file again.
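The clear-in-finally pattern described above looks roughly like the sketch below. It illustrates the pattern only and is not the engine's code: `handle_upload` and the `extract` callback are hypothetical names, and a real service would also have to account for copies made by the web framework or the runtime.

```python
from typing import Callable, Dict


def handle_upload(buf: bytearray, extract: Callable[[bytearray], Dict]) -> Dict:
    """Run extraction on an in-memory buffer and wipe it on every exit path."""
    try:
        return extract(buf)
    finally:
        # Overwrite in place with zeros. This runs whether extract() returned
        # or raised, and before the caller receives the result.
        buf[:] = bytes(len(buf))


upload = bytearray(b"%PDF-1.7 passport scan bytes")
result = handle_upload(upload, lambda data: {"size": len(data)})
print(result, upload == bytearray(len(upload)))  # {'size': 28} True
```

A mutable `bytearray` is used because Python's immutable `bytes` objects cannot be overwritten in place.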
Text and PDF text-layer inputs run through Presidio on every call. When PII appears — passport number, driver's license, date of birth, SSN, custom regex types — a pii_detected security event lands in your audit feed with per-category counts. Default action is detect-and-log so extraction still works on PII fields the schema asked for; REDACT or SUPPRESS is one flag away.
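The detect-and-log default can be pictured as follows. Note the substitution: this sketch uses two plain regular expressions as a stand-in for Presidio, and the event shape (`type`, `counts`) and category names are assumptions. The point is only that the event carries per-category counts, not the matched values.

```python
import re
from typing import Dict, Optional

# Stand-in recognisers for illustration; the real pipeline uses Presidio.
PATTERNS = {
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "DATE_OF_BIRTH": re.compile(r"\bDOB[:\s]+\d{4}-\d{2}-\d{2}\b"),
}


def pii_event(text: str) -> Optional[Dict]:
    """Return a pii_detected event with per-category counts, or None if clean."""
    counts = {category: len(pattern.findall(text)) for category, pattern in PATTERNS.items()}
    found = {category: n for category, n in counts.items() if n > 0}
    if not found:
        return None
    # Only counts are logged; the matched values never enter the event.
    return {"type": "pii_detected", "counts": found}


sample = "SSN 123-45-6789, DOB: 1990-01-02, alternate SSN 987-65-4321"
print(pii_event(sample))
```

Because the input text is returned untouched, extraction can still read a PII field the schema asked for, which matches the detect-and-log behaviour described above.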
Each call logs mime type, page count, file size, SHA-256 hash, and a 2KB redacted preview of the extracted text. Never raw file bytes. URLs you supply get audited as host-only — full paths and query strings are never persisted, so tokens you accidentally leak in a URL don't reach our database.
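A sketch of what such an audit row could contain, using only the standard library. The field names and the `audit_row` helper are hypothetical; the documented audit-row schema lives in the engine spec. The sketch assumes the text passed in has already been redacted.

```python
import hashlib
from typing import Dict, Optional
from urllib.parse import urlsplit

PREVIEW_BYTES = 2048  # the 2KB preview size stated above


def audit_row(file_bytes: bytes, mime_type: str, page_count: int,
              redacted_text: str, source_url: Optional[str] = None) -> Dict:
    """Build an audit record that never contains raw bytes or full URLs."""
    # Truncate on the encoded form, then drop any partial trailing character.
    preview = redacted_text.encode("utf-8")[:PREVIEW_BYTES].decode("utf-8", errors="ignore")
    return {
        "mime_type": mime_type,
        "page_count": page_count,
        "file_size": len(file_bytes),
        "sha256": hashlib.sha256(file_bytes).hexdigest(),
        "preview": preview,
        # Host only: path and query string (where tokens hide) are dropped.
        "source_host": urlsplit(source_url).hostname if source_url else None,
    }


row = audit_row(b"%PDF-1.7 ...", "application/pdf", 3, "Invoice INV-2026-001",
                "https://files.example.com/inv.pdf?token=abc123")
print(row["source_host"], row["file_size"])  # files.example.com 12
```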
Your LLM provider, model, and credentials are scoped to your tenant — a request from another tenant can never reach them. The router respects your residency policy too: an AU-locked tenant won't route to a US-only provider, BYOK or otherwise. All traffic is TLS in transit; admin traffic is HMAC-signed end-to-end.
The complete security contract — eight V0 rules covering memory hygiene, the audit-row schema, PII action semantics, and tenant-isolation invariants — lives in the engine spec at docs.velgent.com/operations/data-extractor#security. Every pii_detected and schema_drift event surfaces in your admin console's /security-events feed in near-real-time.
One product extracts. One product decides. The natural downstream of Data Extractor is the Policy Engine in validate mode: encode the routing logic ("auto-approve if every critical field has confidence > 0.9 AND total matches PO; else route to human review") and let your workflow do the rest.
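The quoted rule translates almost directly into code. This is a plain-Python sketch of the decision, not the Policy Engine's rule syntax: the `route` function, the outcome labels, and the choice of critical fields are illustrative, while the `extracted` and `confidence` keys follow the response example earlier on this page.

```python
from decimal import Decimal
from typing import Any, Dict, Iterable


def route(extraction: Dict[str, Any], critical_fields: Iterable[str],
          po_total: Decimal, threshold: float = 0.9) -> str:
    """Auto-approve only if every critical field is confident and total matches the PO."""
    confidence = extraction["confidence"]
    # A missing confidence counts as 0.0, so unknown fields always escalate.
    confident = all(confidence.get(field, 0.0) > threshold for field in critical_fields)
    total = extraction["extracted"].get("total")
    # Compare as decimals to avoid float rounding surprises on money values.
    matches_po = total is not None and Decimal(str(total)) == po_total
    return "auto_approve" if confident and matches_po else "human_review"


extraction = {
    "extracted": {"invoice_number": "INV-2026-001", "total": 1250.00, "currency": "USD"},
    "confidence": {"invoice_number": 0.98, "total": 0.95, "currency": 0.99},
}
print(route(extraction, ["invoice_number", "total"], Decimal("1250.00")))  # auto_approve
print(route(extraction, ["invoice_number", "total"], Decimal("1300.00")))  # human_review
```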
See the Policy Engine