Re: How to extract TABULAR data from a PDF document?

To: debian-user@lists.debian.org
Subject: Re: How to extract TABULAR data from a PDF document?
From: Max Nikulin <manikulin@gmail.com>
Date: Wed, 23 Apr 2025 09:37:57 +0700
Message-id: <vu9jq7$grp$1@ciao.gmane.io>
In-reply-to: <6546606a-a237-4808-8cd5-ee95a7e4f7c8@gmail.com>
References: <c0bc5b53-ae27-0ddc-1261-a893474783d7@access.net> <14701c9a-79ef-4dec-be96-8a7db6d4d795@gmail.com> <aAcCVrxLrpY5gKND@axis.corp> <6546606a-a237-4808-8cd5-ee95a7e4f7c8@gmail.com>

On 22/04/2025 09:51, jeremy ardley wrote:

Some LLM can also accept pdf for input but you'd need to snip out the pages you are interested in. I consider that slightly more risky as what you see rendered or printed and what some programs see internal to the pdf varies

It would be great if a data extractor warned users when the text stored in a document (either real text or an embedded OCR layer for scans) does not match the text recognized from the rendered document. Besides serving as a routine sanity check, this would help when a document author intentionally plays tricks with fonts to confuse indexers or humans who copy text into their notes.
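A minimal sketch of such a check, assuming the embedded text and the OCR output for a page have already been obtained as plain strings by whatever tools are at hand. The function names and the 0.8 threshold are hypothetical choices for illustration, not part of any existing extractor:

```python
import difflib
import re


def normalize(text):
    # Collapse runs of whitespace and ignore case, so that layout
    # differences alone do not trigger a warning.
    return re.sub(r"\s+", " ", text).strip().lower()


def text_layer_similarity(embedded_text, ocr_text):
    # Compare word sequences; ratio() is 1.0 for identical sequences
    # and approaches 0.0 as fewer words match.
    a = normalize(embedded_text).split()
    b = normalize(ocr_text).split()
    return difflib.SequenceMatcher(None, a, b, autojunk=False).ratio()


def check_page(embedded_text, ocr_text, threshold=0.8):
    # Returns (ok, ratio). ok is False when the two texts differ enough
    # that the user should be warned. The threshold is an assumption
    # and would need tuning against real OCR noise.
    ratio = text_layer_similarity(embedded_text, ocr_text)
    return ratio >= threshold, ratio
```

A word-level comparison like this tolerates reflowed lines, but ordinary OCR mistakes also lower the ratio, so a low score is a hint to look at the page, not proof of tampering.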


By chance I noticed

find_tables(clip=None, strategy=None, vertical_strategy=None,
horizontal_strategy=None, vertical_lines=None, horizontal_lines=None,
snap_tolerance=None, snap_x_tolerance=None, snap_y_tolerance=None,
join_tolerance=None, join_x_tolerance=None, join_y_tolerance=None,
edge_min_length=3, min_words_vertical=3, min_words_horizontal=1,
intersection_tolerance=None, intersection_x_tolerance=None,
intersection_y_tolerance=None, text_tolerance=None,
text_x_tolerance=None, text_y_tolerance=None, add_lines=None)


<https://pymupdf.readthedocs.io/en/latest/page.html#Page.find_tables>

I have not tried it, since currently I am not interested in table extraction and the version packaged for bookworm does not have this feature. I was surprised to see functions that manipulate simple PDF objects mixed with one that is quite sensitive to heuristics and implementation details.
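For reference, an untested sketch of how the method might be called, based only on the linked documentation. It assumes a PyMuPDF release recent enough to provide Page.find_tables(), and that the returned finder exposes a "tables" list whose items have an extract() method, as the documentation describes. The helper names tables_on_page and tables_in_pdf are hypothetical:

```python
def tables_on_page(page):
    # Return every table detected on the page as a list of rows,
    # each row being a list of cell values.
    if not hasattr(page, "find_tables"):
        # Older releases, such as the one in bookworm, lack the method.
        raise RuntimeError("this PyMuPDF version lacks Page.find_tables()")
    finder = page.find_tables()  # default strategies; results are heuristic
    return [table.extract() for table in finder.tables]


def tables_in_pdf(path):
    # Imported lazily so that the helper above works without PyMuPDF.
    try:
        import pymupdf
    except ImportError:
        import fitz as pymupdf  # module name used by older releases
    result = {}
    with pymupdf.open(path) as doc:
        for page in doc:
            tables = tables_on_page(page)
            if tables:
                result[page.number] = tables  # page.number is 0-based
    return result
```

Since detection depends on heuristics, the tolerance and strategy parameters from the signature above would likely need adjusting per document.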
