Available now

DATA EXTRACTOR.
Documents in.
JSON or HTML out.

One POST. Send text, an image, or a PDF; declare the fields you want; get back structured JSON, ready-to-embed HTML, or a full semantic rendering of the source document — plus a 0–1 confidence per field and the exact source phrase each value came from. Built so your workflow can auto-approve the easy cases and escalate the uncertain ones — without re-parsing the LLM's prose.

Read the docs

Three inputs, one output

Text

Strings up to ~200K chars. Parsed emails, scraped pages, copied invoices.

Image

JPEG / PNG / WebP / TIFF. The vision LLM reads the image directly, with no OCR pass first, so table layout and form structure carry through.

PDF

Text-layer extraction first; scanned PDFs fall back to per-page rasterisation. Up to 10 pages per call; use page_range to slice larger documents.
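For a document over the 10-page limit, the caller has to spread the work across several calls. A minimal sketch of that split, assuming 1-based, inclusive page numbers. The helper name `page_ranges` is hypothetical, and how each pair is encoded in the `page_range` parameter is not shown here; check the docs for the wire format.

```python
from typing import Iterator, Tuple

MAX_PAGES_PER_CALL = 10  # per-call limit stated above


def page_ranges(total_pages: int, size: int = MAX_PAGES_PER_CALL) -> Iterator[Tuple[int, int]]:
    """Yield (first_page, last_page) pairs covering the whole document.

    Pages are 1-based and inclusive here; that convention is an assumption.
    """
    if total_pages < 0 or size < 1:
        raise ValueError("total_pages must be >= 0 and size >= 1")
    for first in range(1, total_pages + 1, size):
        yield first, min(first + size - 1, total_pages)


# A 23-page PDF needs three calls.
chunks = list(page_ranges(23))
print(chunks)  # [(1, 10), (11, 20), (21, 23)]
```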

Three output modes, one endpoint

Pick the shape that fits your downstream system. JSON for automation, HTML for human-readable surfaces, document HTML for re-flowing scanned content. The caller picks per call via output_format.

JSON (default)

Structured fields keyed by your schema, plus a 0–1 confidence per field and the source phrase each value was read from. The shape your workflow code lives on.

HTML fields

Same JSON plus a deterministic HTML render of the extracted fields (<dl> for scalars, <table> for line items). Drop it straight into a ticket comment, email body, or audit note.

HTML document

Whole-document semantic HTML — image or PDF only, no schema. Velgent re-flows the source into headings, paragraphs, lists, and tables. Useful for re-using scanned content in modern surfaces.

What's in the response

Per-field confidence

A 0.0–1.0 score on every leaf field. Multi-signal: LLM self-rating + anchor verification + source-quality + schema validation. Signals combine with min semantics: the weakest signal wins, so a failed anchor or schema check overrides an optimistic LLM self-rating.
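To make the min rule concrete, here is a small sketch of how four signals might collapse into one field score. The signal names and values are hypothetical, and the engine's real scoring may weight or pre-process signals differently. The sketch shows only the "weakest signal wins" idea.

```python
from typing import Mapping


def combine_confidence(signals: Mapping[str, float]) -> float:
    """Combine per-signal scores with min semantics: the weakest signal wins."""
    if not signals:
        raise ValueError("at least one signal is required")
    for name, value in signals.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} is outside 0.0-1.0: {value}")
    return min(signals.values())


# Hypothetical signal names and values, for illustration only.
score = combine_confidence({
    "llm_self_rating": 0.97,      # the model is optimistic
    "anchor_verification": 0.20,  # but the anchor did not match the source
    "source_quality": 0.90,
    "schema_validation": 1.00,
})
print(score)  # 0.2
```

An average of the same four values would come out near 0.77 and could slip past a loose threshold. Taking the minimum keeps the failed check visible.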

Anchors (mandatory)

Every value carries the source phrase it was read from. Auto-verified server-side; mismatches drop confidence sharply. Audit-ready by default — no opt-in flag.
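The same check is easy to repeat on the client side if you keep the source text. A minimal sketch, assuming a whitespace- and case-insensitive substring match. The function names and the 0.2 cap are illustrative assumptions, not the server's actual verification logic.

```python
def _normalise(s: str) -> str:
    """Collapse runs of whitespace and ignore case."""
    return " ".join(s.split()).casefold()


def verify_anchor(source_text: str, anchor_text: str) -> bool:
    """True when the anchor phrase really appears in the source."""
    anchor = _normalise(anchor_text)
    return bool(anchor) and anchor in _normalise(source_text)


def apply_anchor_check(confidence: float, anchor_ok: bool, cap: float = 0.2) -> float:
    """Cap the score when the anchor fails. The 0.2 cap is an illustrative value."""
    return confidence if anchor_ok else min(confidence, cap)


source = "Invoice INV-2026-001 dated 2026-05-28, total $1,250.00 USD"
good = verify_anchor(source, "total $1,250.00 USD")  # phrase is in the source
bad = verify_anchor(source, "total $1,520.00 USD")   # digits transposed
print(apply_anchor_check(0.95, good), apply_anchor_check(0.95, bad))  # 0.95 0.2
```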

Schema or template

Declare the fields inline per request, or publish a named template once in admin and reference it by slug. Templates are versioned; pin a specific version for replay or canary.

Secure by default

No persistent storage of the file. Bytes held in memory only; cleared when the request finishes. PII detection before the LLM on text inputs, with redaction available as an option. BYOK + tenant-residency-aware LLM routing inherited from the rest of the engine.

The wire shape, in one example

Request

POST /api/v1/extract
{
  "text": "Invoice INV-2026-001 dated 2026-05-28, total $1,250.00 USD",
  "extraction_schema": {
    "fields": {
      "invoice_number": { "type": "string", "pattern": "^INV-\\d+" },
      "total":          { "type": "number", "min": 0 },
      "currency":       { "type": "enum", "values": ["USD","EUR","GBP"] },
      "invoice_date":   { "type": "date", "format": "iso8601_date" }
    }
  }
}

Response

{
  "extracted": {
    "invoice_number": "INV-2026-001",
    "total":          1250.00,
    "currency":       "USD",
    "invoice_date":   "2026-05-28"
  },
  "confidence": {
    "invoice_number": 0.98,
    "total":          0.95,
    "currency":       0.99,
    "invoice_date":   0.97
  },
  "anchors": {
    "total": { "text": "total $1,250.00 USD", "page": null }
  }
  // + components, metadata, schema_drift, warnings
}

Full schema reference, error codes, and worked examples for image and PDF inputs: docs.velgent.com/operations/data-extractor.
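For callers working in Python, the request above can be assembled with the standard library alone. This is a sketch under stated assumptions: the base URL is a placeholder, `build_extract_request` is a hypothetical helper, and authentication is not covered on this page, so any required headers are passed in through `extra_headers`.

```python
import json
import urllib.request
from typing import Dict, Optional


def build_extract_request(
    base_url: str,
    text: str,
    fields: Dict[str, dict],
    extra_headers: Optional[Dict[str, str]] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST /api/v1/extract request for a text input."""
    body = {"text": text, "extraction_schema": {"fields": fields}}
    headers = {"Content-Type": "application/json", **(extra_headers or {})}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/extract",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


req = build_extract_request(
    "https://api.example.com/",  # placeholder host, not the real one
    "Invoice INV-2026-001 dated 2026-05-28, total $1,250.00 USD",
    {
        "invoice_number": {"type": "string", "pattern": "^INV-\\d+"},
        "total": {"type": "number", "min": 0},
    },
)
# To send it:
#   with urllib.request.urlopen(req) as resp:
#       result = json.load(resp)
```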

Security & confidentiality

Passport scans, invoices with banking details, KYC PDFs — the documents you'd send to Data Extractor are exactly the ones you don't want sitting in a vendor's database. Here's what actually happens to them.

No persistent storage of your files

Uploaded bytes live in process memory only for the lifetime of the request, then are cleared in a finally block before the response returns. We don't write the file to disk, we don't put it in a queue, we don't keep it around for re-extraction. Need to re-extract? Send the file again.
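The pattern described here looks roughly like the sketch below. It is not the engine's code: `handle_upload` and the `extract` callback are hypothetical, and in Python only the mutable working copy can be overwritten in place (the immutable `bytes` object handed in by the caller is left to the garbage collector).

```python
from typing import Callable


def handle_upload(payload: bytes, extract: Callable[[bytearray], dict]) -> dict:
    """Run extraction on a working copy and wipe that copy before returning."""
    buf = bytearray(payload)  # mutable working copy of the upload
    try:
        return extract(buf)
    finally:
        # Runs on success and on error, before the caller sees the result.
        buf[:] = bytes(len(buf))  # overwrite every byte with zero


seen = []


def fake_extract(data: bytearray) -> dict:
    """Stand-in extractor that keeps a reference so the wipe can be observed."""
    seen.append(data)
    return {"length": len(data)}


result = handle_upload(b"secret", fake_extract)
print(result, seen[0])  # the buffer is all zeros by the time we look at it
```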

PII detection on every text input

Text and PDF text-layer inputs run through Presidio on every call. When PII appears — passport number, driver's license, date of birth, SSN, custom regex types — a pii_detected security event lands in your audit feed with per-category counts. Default action is detect-and-log so extraction still works on PII fields the schema asked for; REDACT or SUPPRESS is one flag away.
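To show what "per-category counts" means in practice, here is a regex-based stand-in. It does not use Presidio, the two patterns are deliberately simplistic, and the event field names (`event`, `action`, `counts`) are assumptions for illustration rather than the real audit schema.

```python
import re
from collections import Counter
from typing import Dict, Optional, Pattern

# Illustrative patterns only; real detection relies on Presidio's recognisers.
PATTERNS: Dict[str, Pattern[str]] = {
    "US_SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "DATE_OF_BIRTH": re.compile(r"\bDOB[:\s]+\d{4}-\d{2}-\d{2}\b", re.IGNORECASE),
}


def pii_event(text: str) -> Optional[dict]:
    """Return a detect-and-log style event, or None when nothing matched.

    Only counts are recorded; the matched values never enter the event.
    """
    counts: Counter = Counter()
    for category, pattern in PATTERNS.items():
        hits = len(pattern.findall(text))
        if hits:
            counts[category] = hits
    if not counts:
        return None
    return {"event": "pii_detected", "action": "detect_and_log", "counts": dict(counts)}


event = pii_event("DOB: 1990-01-02, SSN 123-45-6789 and 987-65-4321")
print(event["counts"])  # {'US_SSN': 2, 'DATE_OF_BIRTH': 1}
```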

Audit captures provenance, not content

Each call logs mime type, page count, file size, SHA-256 hash, and a 2KB redacted preview of the extracted text. Never raw file bytes. URLs you supply get audited as host-only — full paths and query strings are never persisted, so tokens you accidentally leak in a URL don't reach our database.
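A provenance-only audit row can be sketched in a few lines. The field names and the `audit_row` helper are hypothetical, and "2KB" is read here as 2048 characters, which is an assumption. The sketch shows the principle: hash and size instead of bytes, host instead of full URL.

```python
import hashlib
from typing import Optional
from urllib.parse import urlsplit

PREVIEW_LIMIT = 2048  # "2KB" read as 2048 characters (an assumption)


def audit_row(
    file_bytes: bytes,
    mime_type: str,
    page_count: int,
    redacted_text: str,
    source_url: Optional[str] = None,
) -> dict:
    """Build an audit record that describes the file without containing it."""
    return {
        "mime_type": mime_type,
        "page_count": page_count,
        "file_size": len(file_bytes),
        "sha256": hashlib.sha256(file_bytes).hexdigest(),
        "preview": redacted_text[:PREVIEW_LIMIT],
        # Host only: path and query string (where tokens hide) are dropped.
        "source_host": urlsplit(source_url).hostname if source_url else None,
    }


row = audit_row(
    b"abc",
    "application/pdf",
    1,
    "x" * 5000,
    "https://files.example.com/inbox/scan.pdf?token=SECRET",
)
print(row["source_host"], row["file_size"], len(row["preview"]))  # files.example.com 3 2048
```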

Tenant isolation + residency-aware routing

Your LLM provider, model, and credentials are scoped to your tenant — a request from another tenant can never reach them. The router respects your residency policy too: an AU-locked tenant won't route to a US-only provider, BYOK or otherwise. All traffic is TLS in transit; admin traffic is HMAC-signed end-to-end.

The complete security contract — eight V0 rules covering memory hygiene, the audit-row schema, PII action semantics, and tenant-isolation invariants — lives in the engine spec at docs.velgent.com/operations/data-extractor#security. Every pii_detected and schema_drift event surfaces in your admin console's /security-events feed in near-real-time.

Composes with the Policy Engine

One product extracts. One product decides. The natural downstream of Data Extractor is the Policy Engine in validate mode: encode the routing logic ("auto-approve if every critical field has confidence > 0.9 AND total matches PO; else route to human review") and let your workflow do the rest.
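Written out as plain Python, the quoted rule might look like the sketch below. This is not Policy Engine syntax; the `route` function, the outcome labels, and the half-cent tolerance on the total are assumptions made for illustration. The input is the `extracted` and `confidence` shape from the response example.

```python
import math
from typing import Iterable, Mapping


def route(
    extracted: Mapping[str, object],
    confidence: Mapping[str, float],
    critical_fields: Iterable[str],
    po_total: float,
    threshold: float = 0.9,
) -> str:
    """Auto-approve only if every critical field clears the threshold and the total matches the PO."""
    confident = all(confidence.get(field, 0.0) > threshold for field in critical_fields)
    total = extracted.get("total")
    total_matches = isinstance(total, (int, float)) and math.isclose(
        total, po_total, abs_tol=0.005  # tolerance of half a cent (an assumption)
    )
    return "auto_approve" if confident and total_matches else "human_review"


extracted = {"invoice_number": "INV-2026-001", "total": 1250.00,
             "currency": "USD", "invoice_date": "2026-05-28"}
confidence = {"invoice_number": 0.98, "total": 0.95,
              "currency": 0.99, "invoice_date": 0.97}
critical = ["invoice_number", "total", "currency"]

print(route(extracted, confidence, critical, po_total=1250.00))  # auto_approve
print(route(extracted, confidence, critical, po_total=1300.00))  # human_review
```

A field missing from `confidence` counts as 0.0, so an incomplete response falls through to human review rather than slipping past the check.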

See the Policy Engine

See Velgent in action.