Re: How to extract TABULAR data from a PDF document?
On Thu, Apr 24, 2025 at 08:48:43AM +0200, tomas@tuxteam.de wrote:
> On Thu, Apr 24, 2025 at 11:32:23AM +0800, jeremy ardley wrote:
> >
> > On 24/4/25 10:31, Max Nikulin wrote:
> > >
> > > By the way, PDF files may be tagged for screen readers. Is there a
> > > dedicated structure to explicitly mark tables? It would be the best
> > > source for data extraction.
> >
> >
> > ISO 14289 is an accessibility standard for PDF. It allows for the creation
> > of a "Tagged PDF" where semantic information, including table structures
> > (<Table>, <TR>, <TH>, <TD>), can be embedded in a separate logical structure
> > tree
> >
Disclaimer: I deal with some accessibility documentation in my day job.
The problem is that very few authors know this - and very few tools support
tagging. Adobe Acrobat is about the best, but only in the paid ($$) versions.
Informal advice is always "Write it in Word, then let Word convert it to
PDF". That works if the author is disciplined and knows about tagging,
heading order and so on - but it can still produce tagged PDFs that
are nominally accessible to screen readers but practically unusable.
The result is that PDFs may well be completely fine as a secure archival
format, non-modifiable, readable everywhere - and useless to a segment
of the population which is blind or visually impaired.
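For what it's worth, when a PDF *is* properly tagged, getting a table out of
the structure tree is conceptually simple. A rough sketch in Python, using a
made-up in-memory model (the Elem class and the sample data are hypothetical,
not any real PDF library's API), just to show how the Table/TR/TH/TD tags
quoted above map onto rows:

```python
from dataclasses import dataclass, field


@dataclass
class Elem:
    """Hypothetical stand-in for one structure element (not a real PDF API)."""
    tag: str                                  # e.g. "Table", "TR", "TH", "TD"
    kids: list = field(default_factory=list)  # child structure elements
    text: str = ""                            # text content, for cells only


def table_rows(elem):
    """Collect cell text, row by row, from every TR at or below elem."""
    rows = []
    if elem.tag == "TR":
        rows.append([k.text for k in elem.kids if k.tag in ("TH", "TD")])
    else:
        # Descend through wrappers such as Table, THead, TBody, TFoot.
        for kid in elem.kids:
            rows.extend(table_rows(kid))
    return rows


# Illustrative data only.
table = Elem("Table", [
    Elem("TR", [Elem("TH", text="Name"), Elem("TH", text="Count")]),
    Elem("TR", [Elem("TD", text="apples"), Elem("TD", text="3")]),
])
rows = table_rows(table)
```

With a real file you would first have to read the structure tree with a PDF
library, and in my experience most PDFs in the wild have no usable tags at all.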
Deque University - deque.com - has a whole series of accessibility
courses and a couple of *long* ones on how to write a PDF :(
This also goes for HTML, which has to be well written, with images
tagged with alt-text and so on. There is an ARIA standard which
helps make the web more accessible, but that's an adjunct, to
be used over and above well-written HTML and CSS.
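As a small illustration of the alt-text point, here is a minimal sketch using
only Python's standard html.parser module. The class and function names are
my own, and it only flags <img> tags with no alt attribute at all; it says
nothing about whether the alt text that is present is any good.

```python
from html.parser import HTMLParser


class MissingAltFinder(HTMLParser):
    """Record the src of every <img> that has no alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # alt="" is deliberately accepted: it marks a decorative image.
        if tag == "img" and "alt" not in attributes:
            self.missing.append(attributes.get("src", "(no src)"))


def images_without_alt(html):
    finder = MissingAltFinder()
    finder.feed(html)
    finder.close()
    return finder.missing


sample = '<p><img src="a.png" alt="A chart"><img src="b.png"></p>'
missing = images_without_alt(sample)
```

A check like this catches only the most basic omission; it is no substitute
for testing the page with an actual screen reader.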
All best, as ever,
Andy
(amacater@debian.org)
> > You can download it for free at https://pdfa.org/resource/iso-14289-pdfua/
>
> Oh, thanks for this one :)
>
> Cheers
> --
> tomás