
Re: How to extract TABULAR data from a PDF document?



On 22/04/2025 09:51, jeremy ardley wrote:
Some LLMs can also accept PDF as input, but you'd need to snip out the pages you are interested in. I consider that slightly more risky, as what you see rendered or printed and what some programs see internally in the PDF can differ

It would be great if a data extractor warned users when the text stored in a document (either real text or an embedded OCR layer for scans) does not match the text recognized from the rendered document. Besides serving as a routine sanity check, this would catch cases where the document author intentionally plays tricks with fonts to confuse indexers or humans who copy text into their notes.
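A minimal sketch of such a check, assuming the text layer and the OCR result for a page have already been obtained as plain strings by whatever tools are at hand. The function names and the 0.9 threshold are made up for illustration and not taken from any existing extractor.

```python
import difflib
import re


def normalize(text):
    """Collapse whitespace and case so layout differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def text_layer_mismatch(layer_text, ocr_text, threshold=0.9):
    """Compare the embedded text of a page with OCR of its rendering.

    Returns (suspicious, ratio). The threshold is an arbitrary
    illustrative value, not a tuned one.
    """
    ratio = difflib.SequenceMatcher(
        None, normalize(layer_text), normalize(ocr_text), autojunk=False
    ).ratio()
    return ratio < threshold, ratio


# The text layer says one thing, the rendered page shows another.
suspicious, ratio = text_layer_mismatch(
    "Total: 100 EUR", "Invoice paid in full, thank you"
)
if suspicious:
    print(f"warning: text layer differs from rendering (ratio {ratio:.2f})")
```

Real OCR output is noisy, so a practical tool would likely need a more tolerant comparison than a single similarity ratio.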

By chance I have noticed
find_tables(clip=None, strategy=None, vertical_strategy=None,
horizontal_strategy=None, vertical_lines=None, horizontal_lines=None,
snap_tolerance=None, snap_x_tolerance=None, snap_y_tolerance=None,
join_tolerance=None, join_x_tolerance=None, join_y_tolerance=None,
edge_min_length=3, min_words_vertical=3, min_words_horizontal=1,
intersection_tolerance=None, intersection_x_tolerance=None,
intersection_y_tolerance=None, text_tolerance=None,
text_x_tolerance=None, text_y_tolerance=None, add_lines=None)

<https://pymupdf.readthedocs.io/en/latest/page.html#Page.find_tables>

I have not tried it, since I am currently not interested in table extraction and the version packaged for bookworm does not have this feature. I was surprised to see functions that manipulate simple PDF objects mixed with one that is quite sensitive to heuristics and implementation details.
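For reference, an untested sketch based on the linked documentation, assuming a PyMuPDF release new enough to provide Page.find_tables and to be importable as "pymupdf". The helper names rows_to_csv and tables_as_csv are hypothetical, and the default detection strategy is used without any tuning.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize extracted table rows to CSV; empty cells (None) become ''."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buf.getvalue()


def tables_as_csv(pdf_path, page_number=0):
    """Return one CSV string per table detected on the given page."""
    # Imported lazily so rows_to_csv works without PyMuPDF installed.
    # Assumption: older releases expose the module as "fitz" instead.
    import pymupdf

    with pymupdf.open(pdf_path) as doc:
        finder = doc[page_number].find_tables()  # default strategy
        return [rows_to_csv(table.extract()) for table in finder.tables]
```

Since detection is heuristic, the result should be compared against the rendered page before trusting it.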


