MOSAIC
The first classifier that learns from your folder structure and classifies long documents where current systems fail. No manual labeling. No predefined schemas.
The problem nobody solved
There is a class of documents that no current system classifies well: those whose type cannot be determined from the first page. The discriminative signal is spread throughout the document — in clause 7, in article 18, in the combination of obligations that only appears on page 12.
A reader that only sees the beginning fails. A reader that sees the first eight pages also fails. Current systems reach between 40% and 73% accuracy on this type of document. MOSAIC reaches 92.2%.
How it works
MOSAIC applies a proprietary progressive hypothesis reduction approach. Instead of classifying the document against all possible types at once, the system progressively eliminates candidates using specialized observers of increasing cost — until the decision becomes trivial.
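As a rough illustration of the idea, and not MOSAIC's proprietary implementation, the elimination loop can be sketched as a cascade of observers ordered from cheapest to most expensive. Everything named here (`reduce_hypotheses`, the two toy observers, the three document types) is a hypothetical stand-in chosen for the example.

```python
from typing import Callable, List, Set, Tuple

# An observer inspects the text and returns the candidates it cannot rule out.
Observer = Callable[[str, Set[str]], Set[str]]


def reduce_hypotheses(
    text: str,
    candidates: Set[str],
    observers: List[Tuple[str, Observer]],
) -> Tuple[Set[str], List[Tuple[str, List[str]]]]:
    """Run observers in order (cheapest first) until one candidate is left."""
    remaining = set(candidates)
    trace = []
    for name, observer in observers:
        if len(remaining) <= 1:
            break  # decision is already trivial, skip costlier observers
        kept = observer(text, remaining)
        # Assumed safeguard: ignore an observer that would eliminate everything.
        if kept:
            remaining = set(kept)
        trace.append((name, sorted(remaining)))
    return remaining, trace


# Hypothetical observers, for illustration only.
def keyword_observer(text: str, candidates: Set[str]) -> Set[str]:
    if "loan" in text.lower():
        return candidates & {"loan_contract", "mortgage_deed"}
    return set(candidates)


def clause_observer(text: str, candidates: Set[str]) -> Set[str]:
    if "mortgage" in text.lower():
        return candidates & {"mortgage_deed"}
    return candidates - {"mortgage_deed"}


TYPES = {"loan_contract", "mortgage_deed", "appraisal_report"}
OBSERVERS = [("keyword", keyword_observer), ("clause", clause_observer)]

remaining, trace = reduce_hypotheses(
    "This loan is secured by a mortgage on the property.", TYPES, OBSERVERS
)
print(remaining, trace)
```

If more than one candidate survives every observer, the set that comes back is the honest answer: the document is still ambiguous.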
Why it's different
Learns from your folders
No need to label documents one by one, or use vendor schemas. MOSAIC distills the tacit knowledge that already exists in your directory structure — the classification an expert perfected over years of operational practice.
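A minimal sketch of what "learning from folders" could look like at the data-collection step, assuming each top-level folder name is a document type. The function name `labels_from_folders` is hypothetical and is not part of any MOSAIC API.

```python
from pathlib import Path
from typing import Dict, List


def labels_from_folders(root: Path) -> Dict[str, List[Path]]:
    """Map each top-level folder name to the files found anywhere beneath it."""
    dataset: Dict[str, List[Path]] = {}
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        files = sorted(f for f in folder.rglob("*") if f.is_file())
        if files:  # empty folders contribute no training examples
            dataset[folder.name] = files
    return dataset
```

Files sitting directly in the root are skipped in this sketch because they carry no folder label.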
Epistemic honesty
When evidence is insufficient, MOSAIC declares the document ambiguous instead of forcing a low-confidence classification. In regulated environments, a human review signal is worth more than an incorrect classification presented with confidence.
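One common way to express this kind of abstention is a confidence floor plus a margin over the runner-up. The sketch below assumes per-type scores in the range 0 to 1. The thresholds and the `decide` function are illustrative assumptions, not MOSAIC's actual rule.

```python
from typing import Dict, Optional, Tuple


def decide(
    scores: Dict[str, float],
    min_confidence: float = 0.8,  # assumed threshold
    min_margin: float = 0.2,      # assumed gap required over the runner-up
) -> Tuple[Optional[str], str]:
    """Return (label, "classified") or (None, "ambiguous")."""
    if not scores:
        return None, "ambiguous"
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_label, best = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if best < min_confidence or best - runner_up < min_margin:
        return None, "ambiguous"  # send to human review instead of guessing
    return best_label, "classified"
```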
Auditable decisions
Each classification cites concrete evidence: which step resolved it, which document fragment was decisive, and at what confidence level. Not an anonymous probability distribution — a traceable justification for audit.
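To make the three cited elements concrete, a decision record could carry the resolving step, the decisive fragment, and the confidence, serialized one line per document for an audit log. The field names below are assumptions made for this sketch, not MOSAIC's output format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Decision:
    document: str      # path or identifier of the classified document
    label: str         # assigned document type
    resolved_by: str   # which step settled the classification
    evidence: str      # decisive text fragment
    confidence: float  # confidence level of the decision


def to_audit_line(decision: Decision) -> str:
    """Serialize a decision as one JSON line with a stable key order."""
    return json.dumps(asdict(decision), ensure_ascii=False, sort_keys=True)
```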
New types in days
Adding a document type does not require retraining the complete pipeline. Just create a folder with representative examples. Competitors require weeks or months of retraining.
Detects filing errors
MOSAIC automatically detects misfiled documents and marks them for correction, improving document order as a byproduct of deployment.
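Conceptually, a misfiled document is one where a confident prediction disagrees with the folder it sits in. A minimal sketch under that assumption, with a hypothetical `Filed` record and an illustrative confidence threshold:

```python
from typing import Iterable, List, NamedTuple, Optional


class Filed(NamedTuple):
    path: str
    folder_label: str               # type implied by the folder
    predicted_label: Optional[str]  # None means the classifier abstained
    confidence: float


def flag_misfiled(docs: Iterable[Filed], min_confidence: float = 0.9) -> List[Filed]:
    """Keep documents whose confident prediction contradicts their folder."""
    return [
        d
        for d in docs
        if d.predicted_label is not None
        and d.predicted_label != d.folder_label
        and d.confidence >= min_confidence
    ]
```

Ambiguous documents and low-confidence disagreements are left alone in this sketch, so only strong contradictions are marked for correction.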
Cost proportional to difficulty
Easy documents are resolved for free in less than 1ms. Only genuinely difficult ones escalate to costly steps. Average production cost per document: $0.000056.
Classifier comparison
MOSAIC is a document classifier, not an OCR system. Evaluated on the real AEAT-Legal corpus: 2,037 Spanish tax documents across 14 types, 96.6% of which are difficult documents.
| Classifier | Global Accuracy | Hard-Doc Accuracy | Latency | Auditable | Admits Uncertainty | Time to Add a New Type |
|---|---|---|---|---|---|---|
| Pure Regex | ~65% | ~35% | <10ms | YES | NO | Days |
| Short-window classifier | 90.0% | 91.3% | 93ms | NO | NO | Months |
| Long-window classifier | ~87% | ~87% | 350ms | NO | NO | Months |
| Concatenated ensemble | ~87% | ~85% | 380ms | NO | NO | Months |
| Direct LLM (no reduction) | ~83% | 58.0% | >3s | PARTIAL | NO | Days |
| MOSAIC | 91.25% | 92.2% | 1.1ms | YES | YES | Days |
“The only classifier that simultaneously resolves distributed-signal documents with high precision, justifies each decision by citing evidence from the text, admits uncertainty when it cannot determine the type, and is robust to real-world filing noise.”
Use cases
Tax and legal firms
Automatic classification of files, contracts and notarial deeds where the type emerges from clauses distributed throughout the document, not from the header.
Credit documentation processing
Policies, loan contracts, appraisal reports. Documents where the financial product type appears in the combination of clauses, not at the beginning.
Procedural document management
Any organization that already files documents in folders by type can deploy MOSAIC without prior labeling, using only its existing folder structure.
Ready to classify documents with real precision?
Deploy MOSAIC on your existing folder structure and get results from day one — no labeling, no complex integrations, and at a near-zero cost per document.