
Yes, a practical and cost-effective approach is to use PaddleOCR for text extraction and then send only the cleaned raw text to a small LLM such as GPT-4o-mini, purely to structure the data into JSON. PaddleOCR is generally accurate on printed hotel bills and invoices, and limiting GPT to semantic parsing cuts both cost and latency. You can reduce LLM usage further by running basic regex or rule-based extraction first and falling back to GPT only for fields the rules cannot resolve. This kind of hybrid pipeline is common in production systems.
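A minimal sketch of the rules-first, LLM-fallback step might look like the following. It assumes the OCR text has already been produced (for example by PaddleOCR) and is passed in as a plain string. The field names, the regex patterns, and the `llm_fallback` callable are all hypothetical, chosen for illustration; in practice `llm_fallback` would wrap your GPT-4o-mini call and return a dict of the requested fields.

```python
import re
from typing import Callable, Dict, List, Optional

# Hypothetical patterns for illustration; tune them to your own bill layouts.
PATTERNS = {
    "invoice_number": re.compile(
        r"Invoice\s*(?:Number|No\.?|#)\s*[:\-]?\s*([A-Z0-9\-]*\d[A-Z0-9\-]*)", re.I
    ),
    "date": re.compile(r"Date\s*[:\-]?\s*(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})", re.I),
    "total": re.compile(
        r"\b(?:Grand\s+)?Total\s*[:\-]?\s*(?:[A-Z]{3}|[$€£₹])?\s*([\d,]+\.\d{2})", re.I
    ),
}

Fields = Dict[str, Optional[str]]
Fallback = Callable[[str, List[str]], Dict[str, str]]


def extract_with_rules(text: str) -> Fields:
    """Try each regex once; a field stays None when its pattern does not match."""
    fields: Fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


def extract_fields(text: str, llm_fallback: Optional[Fallback] = None) -> Fields:
    """Run the rules first and call the LLM only for fields they left empty."""
    fields = extract_with_rules(text)
    missing = [name for name, value in fields.items() if value is None]
    if missing and llm_fallback is not None:
        # Assumption: the fallback takes the OCR text plus the missing field
        # names and returns a dict containing whichever fields it could find.
        llm_fields = llm_fallback(text, missing)
        for name in missing:
            fields[name] = llm_fields.get(name)
    return fields
```

With this shape the LLM is never called when every pattern matches, and when it is called it only needs to be asked about the missing fields, which keeps the prompt and the response short.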
