DocAI Processor

Intelligent Document Processing & Extraction

Document AI, LayoutLMv3, OCR, Computer Vision, Python

A document classification and data extraction system built for an insurance company's claims processing workflow. The firm received an average of 2,800 claims per day, each containing 5-8 documents such as adjuster reports, invoices, and policy copies. The operations team classified these documents manually and keyed the relevant fields into the system one at a time, and the error rate exceeded 12%.

We designed a three-stage pipeline: First, document images undergo preprocessing and correction (deskew, noise reduction). Second, a fine-tuned LayoutLMv3 model classifies the document and extracts key fields (date, amount, policy number, etc.). Third, a business rules engine validates the extracted data and pushes it to the ERP system. Documents with low confidence scores are routed to human review.
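The routing decision at the end of this pipeline can be sketched as below. This is a minimal illustration rather than the production code: the names `ExtractedField`, `DocumentResult`, and `route_document`, and the 0.85 threshold, are assumptions made for the example. The model call itself is omitted, and only the confidence-based routing is shown.

```python
from dataclasses import dataclass

# Assumed cutoff for illustration; the real threshold is not stated here.
REVIEW_THRESHOLD = 0.85


@dataclass
class ExtractedField:
    name: str          # e.g. "policy_number"
    value: str
    confidence: float  # model score in [0, 1]


@dataclass
class DocumentResult:
    doc_type: str
    doc_confidence: float
    fields: list[ExtractedField]


def route_document(result: DocumentResult, threshold: float = REVIEW_THRESHOLD) -> str:
    """Return "auto" if every score clears the threshold, else "human_review"."""
    scores = [result.doc_confidence] + [f.confidence for f in result.fields]
    return "auto" if min(scores) >= threshold else "human_review"


invoice = DocumentResult(
    doc_type="invoice",
    doc_confidence=0.97,
    fields=[
        ExtractedField("amount", "1250.00", 0.93),
        ExtractedField("policy_number", "PN-0012345", 0.62),  # low score
    ],
)
print(route_document(invoice))  # human_review
```

Routing on the lowest score means a single uncertain field sends the whole document to review, which is a deliberately conservative choice for this sketch.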

System Architecture

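[Architecture diagram: four stages (Ingestion, Processing, Intelligence, Output) connecting Upload Service, Job Queue, OCR Engine, NLP Pipeline, Object Storage, Classifier, Validation, Export API, and Dashboard. Edge labels: Jobs, Process, Text, Store, Entities, Typed, Verified, Status, Files.]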

Highlights

  • LayoutLMv3-based document classification and field extraction
  • Image preprocessing pipeline (deskew, noise reduction, binarization)
  • Business rules engine for automated data validation
  • Human-in-the-loop for low-confidence document review
  • Bidirectional ERP system integration
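
As an illustration of the business rules validation listed above, the sketch below checks three extracted fields before they would be pushed downstream. The specific rules, the `validate_claim_fields` name, and the `PN-` policy number format are hypothetical and chosen only for the example; the actual rule set is not described here.

```python
import re
from datetime import date, datetime
from decimal import Decimal, InvalidOperation

# Hypothetical policy number format, assumed only for this example.
POLICY_NUMBER_RE = re.compile(r"PN-\d{7}")


def validate_claim_fields(fields: dict[str, str], today: date) -> list[str]:
    """Return a list of rule violations; an empty list means the fields pass."""
    errors = []

    # Rule 1: the date must parse as YYYY-MM-DD and must not lie in the future.
    try:
        doc_date = datetime.strptime(fields.get("date", ""), "%Y-%m-%d").date()
        if doc_date > today:
            errors.append("date is in the future")
    except ValueError:
        errors.append("date is missing or not YYYY-MM-DD")

    # Rule 2: the amount must be a positive decimal number.
    try:
        if Decimal(fields.get("amount", "")) <= 0:
            errors.append("amount must be positive")
    except InvalidOperation:
        errors.append("amount is missing or not numeric")

    # Rule 3: the policy number must match the assumed format exactly.
    if not POLICY_NUMBER_RE.fullmatch(fields.get("policy_number", "")):
        errors.append("policy number has an unexpected format")

    return errors


extracted = {"date": "2024-03-15", "amount": "-40.00", "policy_number": "PN-0012345"}
print(validate_claim_fields(extracted, today=date(2024, 6, 1)))
# ['amount must be positive']
```

Returning every violation, instead of stopping at the first, gives a human reviewer the full picture of a failing document in one pass.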

Results

Classification accuracy at 94.8% (99.2% with human review)
File processing time reduced from 22 minutes to 90 seconds
Manual data entry errors dropped from 12% to 1.4%
Freed up 6 FTEs in the operations team