We developed a document classification and data extraction system for an insurance company's claims processing workflow. The firm received an average of 2,800 claims per day, each containing 5-8 different documents such as adjuster reports, invoices, and policy copies. The operations team classified these documents manually and keyed the relevant fields into the system one at a time, with an error rate above 12%.
We designed a three-stage pipeline. First, document images are preprocessed and corrected (deskewing, noise reduction). Second, a fine-tuned LayoutLMv3 model classifies each document and extracts key fields (date, amount, policy number, and so on). Third, a business rules engine validates the extracted data and pushes it to the ERP system. Documents with low confidence scores are routed to human review.
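The hand-off between the second and third stages can be illustrated with a minimal Python sketch. It is not the project's actual code: the `ExtractionResult` structure, the 0.85 confidence threshold, the policy number pattern, and the choice to send rule violations to human review are all assumptions made for illustration. The model call itself is left out; the sketch starts from an already extracted result.

```python
import re
from dataclasses import dataclass
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Assumed values for illustration; the real threshold and formats are not given.
CONFIDENCE_THRESHOLD = 0.85
POLICY_NUMBER_PATTERN = re.compile(r"^[A-Z]{2}\d{8}$")


@dataclass
class ExtractionResult:
    """Hypothetical output of the classification and extraction stage."""
    doc_type: str
    doc_confidence: float
    fields: dict[str, str]
    field_confidence: dict[str, float]


def validate_fields(fields: dict[str, str]) -> list[str]:
    """Apply simple business rules; an empty list means every rule passed."""
    errors = []
    try:
        datetime.strptime(fields.get("date", ""), "%Y-%m-%d")
    except ValueError:
        errors.append("date: missing or not in YYYY-MM-DD format")
    try:
        if Decimal(fields.get("amount", "")) <= 0:
            errors.append("amount: must be positive")
    except InvalidOperation:
        errors.append("amount: missing or not a number")
    if not POLICY_NUMBER_PATTERN.match(fields.get("policy_number", "")):
        errors.append("policy_number: unexpected format")
    return errors


def route(result: ExtractionResult) -> tuple[str, list[str]]:
    """Return ('erp', []) for automatic push, or ('human_review', reasons)."""
    reasons = []
    if result.doc_confidence < CONFIDENCE_THRESHOLD:
        reasons.append(
            f"low classification confidence ({result.doc_confidence:.2f})"
        )
    for name, conf in result.field_confidence.items():
        if conf < CONFIDENCE_THRESHOLD:
            reasons.append(f"low confidence for field '{name}' ({conf:.2f})")
    # Assumption: rule violations also go to a human instead of the ERP system.
    reasons.extend(validate_fields(result.fields))
    return ("human_review" if reasons else "erp", reasons)


if __name__ == "__main__":
    sample = ExtractionResult(
        doc_type="invoice",
        doc_confidence=0.97,
        fields={
            "date": "2024-03-15",
            "amount": "1250.00",
            "policy_number": "AB12345678",
        },
        field_confidence={"date": 0.95, "amount": 0.92, "policy_number": 0.99},
    )
    print(route(sample))
```

Returning the list of reasons alongside the destination lets a reviewer see why a document was held back, which may help when tuning the threshold against the observed review volume.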