
Re: [OT?] Attempting to extract tabular data from PDF -- appropriate forum?



On 7/19/25 11:43 AM, Daniele Forsi wrote:
Hello Richard,

the PDF format is not suitable for structured data,

You err slightly. {smile ;}
The _creators_ of PDF had no explicit interest in "structured data".
Their creation was a tool to create machine-readable data which, once printed, would match what came from the existing printing industry.

what do you want to do with it?

To quote myself ;}
  >
  > Table A4.14 ... has information meeting my immediate personal need.
  > My goal is to document a typical generic weekly grocery list for
  > the "standard" 2000 calorie/day diet as a spreadsheet and/or database.
  >


[SNIP]
If you want something that you can modify, use "pdftotext" which is
available in Debian in the "poppler-utils" package
This will work for you:
pdftotext -layout -f 116 -l 116 /tmp/TFP2021.pdf

Thank you.
I had tried "poppler-utils" on another edition of "TFP2021.pdf".
The result was a *MESS* unsuitable as input to a scriptable editor
such as Kate.
[An aside. Back in the 70's, when working for DEC as an Engineering Tech, I was surrounded by TECO fanatics. It caused me to appreciate powerful text editors. That prompted my interest in Kate.]

Using Pluma, I started a trial edit of /tmp/TFP2021.txt created by pdftotext. I think creating a Kate macro is feasible. I'm a Kate newbie, and it will at least be an educational experience.



Now let's talk radio!
I wanted to convert the band plan
https://www.iaru-r1.org/wp-content/uploads/2021/03/UHF-Bandplan.pdf

I tried different ways:
first I did a copy and paste into LibreOffice Writer; I got all the
contents, but the columns were gone, as expected.
Then I did a copy and paste into LibreOffice Calc, but there isn't an
easy way to get the columns.
Finally I ran: pdftotext -layout -f 1 -l 2 UHF-Bandplan.pdf
and in this case too pdftotext does a better job than a simple
copy and paste, but the output can't easily be read by software, so I
wonder if a machine-readable list of frequencies is already available
somewhere.


I believe you are overly pessimistic.
When I ran: pdftotext -layout -f 1 -l 2 UHF-Bandplan.pdf the result was similar enough to TFP2021.txt that I believe Kate may be suitable.
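
For readers who would rather script the cleanup than build an editor macro, here is a minimal Python sketch of one possible approach. It assumes that "pdftotext -layout" pads table columns with runs of two or more spaces, which is typical but not guaranteed for every PDF. The function names and the sample text are made up for illustration; the sample is not taken from TFP2021.pdf or UHF-Bandplan.pdf.

```python
import csv
import io
import re

# Assumption: two or more consecutive spaces mark a gap between columns
# in text produced by `pdftotext -layout`.
COLUMN_GAP = re.compile(r" {2,}")


def split_layout_line(line):
    """Split one line of layout-preserved text into a list of cells."""
    stripped = line.strip()
    if not stripped:
        return []
    return COLUMN_GAP.split(stripped)


def layout_text_to_rows(text, min_cells=2):
    """Keep only lines that look like table rows (at least min_cells cells)."""
    rows = []
    for line in text.splitlines():
        cells = split_layout_line(line)
        if len(cells) >= min_cells:
            rows.append(cells)
    return rows


def rows_to_csv(rows):
    """Serialize the rows as CSV text that a spreadsheet can import."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Invented sample, shaped like layout-preserved output (not real data).
SAMPLE = """\
Food group          Unit        Amount

Vegetables          lb          5.25
Whole grains        lb          1.50
"""

if __name__ == "__main__":
    print(rows_to_csv(layout_text_to_rows(SAMPLE)), end="")
```

Cells that contain a double space, or columns separated by only one space, will be split incorrectly, so the result still needs a manual check against the PDF.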

I'll use editing TFP2021.txt as a learning experience and UHF-Bandplan.txt as a feasibility experiment for the task with Kate.

I should have preliminary results in a week or so. To match some goals of the project that prompted my investigation of Kate, the output will be HTML.
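
As an illustration of the HTML output step, the following is a minimal Python sketch that turns already-separated rows into an HTML table. It is only one possible way to do it, not the method planned above. The function name and the example rows are hypothetical, and the first row is assumed to be the header.

```python
from html import escape


def rows_to_html_table(rows, header=True):
    """Render a list of rows (lists of strings) as a plain HTML table.

    Assumption: when header is True, the first row holds the column titles.
    """
    lines = ["<table>"]
    for index, row in enumerate(rows):
        tag = "th" if header and index == 0 else "td"
        # escape() protects characters such as &, < and > in cell text.
        cells = "".join(f"<{tag}>{escape(cell)}</{tag}>" for cell in row)
        lines.append(f"  <tr>{cells}</tr>")
    lines.append("</table>")
    return "\n".join(lines)


# Placeholder rows for demonstration only (not real table contents).
EXAMPLE_ROWS = [["Item", "Notes"], ["A & B", "x < y"]]

if __name__ == "__main__":
    print(rows_to_html_table(EXAMPLE_ROWS))
```

Escaping every cell matters here because text extracted from a PDF can contain characters that would otherwise be read as markup.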

Later.



