Re: How to extract TABULAR data from a PDF document?

To: debian-user@lists.debian.org
Subject: Re: How to extract TABULAR data from a PDF document?
From: "Andrew M.A. Cater" <amacater@einval.com>
Date: Thu, 24 Apr 2025 08:06:35 +0000
Message-id: <aAnxC_zL-MyMqxh7@einval.com>
In-reply-to: <aAney335kR3pOgLj@tuxteam.de>

On Thu, Apr 24, 2025 at 08:48:43AM +0200, tomas@tuxteam.de wrote:
> On Thu, Apr 24, 2025 at 11:32:23AM +0800, jeremy ardley wrote:
> > 
> > On 24/4/25 10:31, Max Nikulin wrote:
> > > 
> > > By the way, PDF files may be tagged for screen readers. Is there a
> > > dedicated structure to explicitly mark tables? It would be the best
> > > source for data extraction.
> > 
> > 
> > ISO 14289 is an accessibility standard for PDF. It allows for the creation
> > of a "Tagged PDF" where semantic information, including table structures
> > (<Table>, <TR>, <TH>, <TD>), can be embedded in a separate logical structure
> > tree
> > 

Disclaimer: I deal with some accessibility documentation in my day job.
The problem is that very few authors know this - and very few tools support
tagging. Adobe Acrobat is about the best, but only in the paid ($$) versions.
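
A rough way to see whether a PDF even claims to be tagged is to look for
the /StructTreeRoot and /MarkInfo ... /Marked true entries that a tagged
file carries in its catalog. The following is only a minimal Python sketch
under loud assumptions: the function name tagging_hints is made up for
illustration, it scans raw bytes rather than parsing the file, and it will
miss the entries if the catalog sits inside a compressed object stream. A
positive result says nothing about whether the tagging is any good.

```python
import re


def tagging_hints(pdf_bytes):
    """Return crude hints about tagging found in raw PDF bytes.

    Assumption: the catalog is stored uncompressed, so the key names
    appear literally in the file. This is a heuristic, not a parser.
    """
    return {
        # A tagged PDF has a logical structure tree rooted here.
        "has_struct_tree": b"/StructTreeRoot" in pdf_bytes,
        # The MarkInfo dictionary sets Marked to true for Tagged PDF.
        "marked_true": re.search(rb"/Marked\s+true", pdf_bytes) is not None,
    }
```

To try it on a real file, read it in binary mode, e.g.
tagging_hints(open("some.pdf", "rb").read()), where some.pdf is a
placeholder name.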

Informal advice is always "Write it in Word, then let Word convert it to
PDF". That works if the author is disciplined and knows how to tag, how to
order headings and so on - but it can still produce tagged PDFs that
are nominally accessible to screen readers but practically unusable.

The result is that PDFs may well be completely fine as a secure archival
format, non-modifiable, readable everywhere - and useless to a segment
of the population which is blind or visually impaired.

Deque University - deque.com - has a whole series of accessibility
courses and a couple of *long* ones on how to write a PDF :(

This also goes for HTML, which has to be well written, with images
tagged with alt-text and so on. There is an ARIA standard which
helps make the web more accessible, but that's an adjunct, to
be used over and above well-written HTML and CSS.
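
As a small illustration of the alt-text point, here is a minimal Python
sketch that lists <img> tags with no alt attribute at all, using only the
standard library's html.parser. The names MissingAltFinder and
find_images_without_alt are made up for this example. It deliberately
does not flag alt="", since an empty alt is the usual way to mark a purely
decorative image, and it cannot judge whether the alt-text that is
present is actually useful.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Collect the positions of <img> tags that lack an alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []  # (line, column) of each offending <img>

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling us,
        # and routes self-closing <img ... /> tags through here as well.
        if tag == "img" and "alt" not in dict(attrs):
            self.missing.append(self.getpos())


def find_images_without_alt(html_text):
    """Return (line, column) pairs for every <img> without an alt attribute."""
    finder = MissingAltFinder()
    finder.feed(html_text)
    finder.close()
    return finder.missing
```

Calling find_images_without_alt() on the text of a page gives a list of
positions to go and look at by hand.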

All best, as ever,

Andy
(amacater@debian.org)

> > You can download it for free at https://pdfa.org/resource/iso-14289-pdfua/
> 
> Oh, thanks for this one :)
> 
> Cheers
> -- 
> tomás
