Re: [OT?] Attempting to extract tabular data from PDF -- appropriate forum?
On 7/19/25 11:43 AM, Daniele Forsi wrote:
Hello Richard,
the PDF format is not suitable for structured data,
You err slightly. {smile ;}
The _creators_ of PDF had no explicit interest in "structured data".
Their creation was a tool to create machine readable data which, once
printed, would match what came from the existing printing industry.
what do you want to do with it?
To quote myself ;}
> Table A4.14 ... has information meeting my immediate personal need.
> My goal is to document a typical generic weekly grocery list for
> the "standard" 2000 calorie/day diet as a spreadsheet and/or
> database.
[SNIP]
If you want something that you can modify, use "pdftotext" which is
available in Debian in the "poppler-utils" package
This will work for you:
pdftotext -layout -f 116 -l 116 /tmp/TFP2021.pdf
Thank you.
I had tried "poppler-utils" on another edition of "TFP2021.pdf".
The result was a *MESS* unsuitable as input to a scriptable editor
such as Kate.
[An aside. Back in the 70's, when working for DEC as an Engineering
Tech, I was surrounded by TECO fanatics. It caused me to appreciate
powerful text editors. That prompted my interest in Kate.]
Using Pluma I started a trial edit of /tmp/TFP2021.txt created by
pdftotext. I think creating a Kate macro is feasible. I'm a Kate newbie
and it will at least be an educational experience.
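As an alternative to an editor macro, the column splitting could be scripted. What follows is only a minimal sketch in Python, assuming that "pdftotext -layout" separates table columns by runs of two or more spaces while cells contain only single spaces. That assumption will not hold for every table. The names split_columns and to_csv are hypothetical, and the sample rows are made up rather than taken from TFP2021.pdf.

```python
import csv
import io
import re

# Assumption: in "pdftotext -layout" output, table columns are separated
# by two or more spaces, while single spaces occur only inside a cell.
COLUMN_GAP = re.compile(r" {2,}")


def split_columns(line):
    """Split one layout-preserved line into a list of cell strings."""
    return COLUMN_GAP.split(line.strip())


def to_csv(text):
    """Convert layout-preserved text to CSV, skipping blank lines."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in text.splitlines():
        if line.strip():
            writer.writerow(split_columns(line))
    return out.getvalue()


if __name__ == "__main__":
    # Made-up sample rows for illustration only.
    sample = "Food group      Unit     Amount\nWhole grains    lb       2.5\n"
    print(to_csv(sample))
```

The resulting CSV can be opened directly in a spreadsheet. Rows whose cells are separated by only one space would still need hand editing.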
Now let's talk radio!
I wanted to convert the band plan
https://www.iaru-r1.org/wp-content/uploads/2021/03/UHF-Bandplan.pdf
I tried different ways:
first I did a copy and paste in Libreoffice Writer, I got all the
contents, but the columns were gone as expected
then I did a copy and paste in Libreoffice Calc, but there isn't an
easy way to get the columns
finally I ran: pdftotext -layout -f 1 -l 2 UHF-Bandplan.pdf
and also in this case pdftotext is doing a better job than a simple
copy and paste, but the result can't be easily read by software, so I
wonder if a machine-readable list of frequencies is already available
somewhere
I believe you are overly pessimistic.
When I ran: pdftotext -layout -f 1 -l 2 UHF-Bandplan.pdf the result was
similar enough to TFP2021.txt that I believe Kate may be suitable.
I'll use editing TFP2021.txt as a learning experience and
UHF-Bandplan.txt as a feasibility experiment with Kate.
I should have preliminary results in a week or so. To match some goals of the
project that prompted my investigation of Kate, the output will be HTML.
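For the HTML step, a minimal sketch in Python of what the output stage might look like, assuming the table has already been reduced to rows of cell strings. The function name rows_to_html is hypothetical and the markup is deliberately bare.

```python
from html import escape


def rows_to_html(rows):
    """Render a list of rows (each a list of strings) as a plain HTML table."""
    lines = ["<table>"]
    for row in rows:
        # escape() protects characters such as "<" and "&" inside cells.
        cells = "".join(f"<td>{escape(cell)}</td>" for cell in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up rows for illustration only.
    print(rows_to_html([["Food group", "Amount"], ["Whole grains", "2.5"]]))
```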
Later.