Reports

I know it has been a while, but my solution has been more of the following. I am also assuming that you are doing this for RAG purpose, NOT for the sake of getting the right parsing. Instead of getting the exact parsing of the PDF, I get a rough parsing where I get all the texts from the table, and I just replace the table with some metadata like {table id: 'werwrwe', summary: "contains the values of collection date, barcode etc", page: 12}. Then during the retrieval, if the similarity search chooses this chunk, then using the table id, I can retrieve the page of the document as a picture. Basically I see table_id and the page, then I can go to the original document then get the raw table as the picture. This has been much better RAG process for me rather than trying to get the parsing absolutely correct.

79369762