Re: Extracting tabular text from a pdf

To: debian-user@lists.debian.org
Subject: Re: Extracting tabular text from a pdf
From: Richard Owlett <rowlett@cloud85.net>
Date: Wed, 27 Nov 2019 13:39:40 -0600
Message-id: <cecadfcc-251e-fefe-5cfe-933ba63807c3@cloud85.net>
In-reply-to: <201911271258.45581.rhkramer@gmail.com>
References: <1969072e-0dd6-a925-5989-8e00333bc917@cloud85.net> <201911271258.45581.rhkramer@gmail.com>

On 11/27/2019 11:58 AM, rhkramer@gmail.com wrote:

On Wednesday, November 27, 2019 12:07:16 PM Richard Owlett wrote:

I'm trying to create some spreadsheets containing nutritional data.

Sample documents I'm using for input include:

https://choosemyplate-prod.azureedge.net/sites/default/files/2WeekMenusGroceryList.pdf

https://choosemyplate-prod.azureedge.net/sites/default/files/2WeekMenusAndFoodGroupContent.pdf

http://www.nhlbi.nih.gov/files/docs/public/heart/new_dash.pdf


Using "pdftotext -layout  input.pdf output.txt" is tantalizingly close
to what I want.

When PRINTED, that output visually preserves the column relationships.
I wish to copy a column of data to a spreadsheet, but I can only select
horizontally, NOT vertically.
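
Since "pdftotext -layout" pads its output with spaces so that columns line
up, one possible way to get a vertical selection is to slice every line at
the same character offsets. The following is only a minimal sketch under
that assumption: the function name extract_column, the offsets, and the
sample rows are all made up for illustration and do not come from the PDFs
above. Real output would need its offsets found by inspecting the text file.

```python
def extract_column(lines, start, end):
    """Return the text between character offsets start and end on each line."""
    cells = []
    for line in lines:
        # Pad short lines so the slice still covers the column range.
        padded = line.rstrip("\n").ljust(end)
        cells.append(padded[start:end].strip())
    return cells


# Hypothetical fixed-width rows, built with format specs so the
# column positions are explicit: name in 0-16, calories in 16-24.
sample = [
    f"{'Food':<16}{'Calories':>8}{'Sodium':>9}",
    f"{'Oatmeal':<16}{150:>8}{5:>9}",
    f"{'Banana':<16}{105:>8}{1:>9}",
]

calories = extract_column(sample, 16, 24)
```

Each element of the returned list is one cell, so the list can be pasted or
written out one value per line as a spreadsheet column.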


I didn't try your pdftotext command, so I don't know how tantalizingly close
that got you.


I *APPRECIATE* that you _READ_ my post <GRIN> *ROFL*


I opened one of the .pdfs in Okular, then switched to selection mode and
selected a column of data using the mouse.  I then copied it to a text editor.

It looks very columnar to me,


*chuckle chuckle*

with the only (minor) problems being an extra
line containing some unprintable characters (□ -- copied and pasted here, but
they show up differently in nedit)  (and a line end character).

I'm sure that could easily be imported into a spreadsheet just specifying
those unprintable characters as the record separator (using that text file).
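
The record-separator idea could be sketched roughly as below. This is an
illustration only: the actual unprintable character Okular produced is not
identified in this thread, so the separator is passed in as a parameter, and
the "\x1f" used in the example is purely a stand-in. The function name
records_to_csv is likewise hypothetical.

```python
import csv
import io


def records_to_csv(text, separator):
    """Split text on separator and return CSV text with one cell per row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for record in text.split(separator):
        record = record.strip()
        if record:  # skip the empty pieces left around separators
            writer.writerow([record])
    return out.getvalue()


# "\x1f" is an assumed placeholder for the unprintable character.
pasted = "150\x1f\n105\x1f\n"
csv_text = records_to_csv(pasted, "\x1f")
```

The resulting text has one value per line, which a spreadsheet should import
as a single column.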


I have NO problem deleting specific character sequences.


(I used Okular 0.14.3 as distributed in Wheezy.


IOW "I should try with Wheezy"
I have multiple machines dedicated to *EXPERIMENTATION* *ROFL*


I suspect it may not be practical in the general case.
Are there any examples that might come close?


Thank you for answering the question I *ACTUALLY* asked. *ROFL*
