
Re: Extracting tabular text from a pdf



On Wed 27 Nov 2019 at 12:58:45 (-0500), rhkramer@gmail.com wrote:
> On Wednesday, November 27, 2019 12:07:16 PM Richard Owlett wrote:
> > I'm trying to create some spreadsheets containing nutritional data.
> > 
> > Sample documents I'm using for input include:
> > > https://choosemyplate-prod.azureedge.net/sites/default/files/2WeekMenusGroceryList.pdf
> > > 
> > > https://choosemyplate-prod.azureedge.net/sites/default/files/2WeekMenusAndFoodGroupContent.pdf
> > > 
> > > http://www.nhlbi.nih.gov/files/docs/public/heart/new_dash.pdf
> > 
> > Using "pdftotext -layout  input.pdf output.txt" is tantalizingly close
> > to what I want.
> > 
> > When PRINTED that visually preserves relationships.
> > I wish to copy a column of data to a spreadsheet, but can only select
> > horizontally, NOT vertically.

Perhaps look for an editor that can select rectangular blocks. For
example, emacs has rectangular variants of commands.
https://www.gnu.org/software/emacs/manual/html_node/emacs/Rectangles.html
Back in the last millennium, I was using TDE (Thomson-Davis Editor)
to do much the same in DOS.
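
If an editor with rectangle commands isn't to hand, the same vertical
selection can be approximated with a few lines of script run over the
"pdftotext -layout" output. What follows is only a rough sketch, not
something I've run against the PDFs above: extract_column is a made-up
helper name, the character offsets (16 and 24) are assumptions you would
read off your own output.txt, and the sample rows are invented
placeholder values.

```python
def extract_column(lines, start, end):
    """Return the text found between character offsets start (inclusive)
    and end (exclusive) on every line, i.e. a rectangular block."""
    column = []
    for line in lines:
        # Expand tabs first so offsets match what is seen on screen.
        text = line.rstrip("\n").expandtabs()
        # Slicing past the end of a short line just yields an empty cell.
        column.append(text[start:end].strip())
    return column


# Invented placeholder rows, laid out in fixed-width columns the way
# "pdftotext -layout" tends to emit them.
sample = [
    f"{'Food':<16}{'Calories':>8}{'Sodium':>9}",
    f"{'Oatmeal':<16}{150:>8}{115:>9}",
    f"{'Banana':<16}{105:>8}{1:>9}",
]

for cell in extract_column(sample, 16, 24):
    print(cell)
```

For a real file, something like extract_column(open("output.txt"), 16, 24)
should do, and the printed lines can be pasted straight down a
spreadsheet column. Tables whose columns shift from page to page would
need the offsets adjusted per page.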

> I didn't try your pdftotext command, so I don't know how tantalizingly close 
> that got you.
> 
> I opened one of the .pdfs in Okular, then switched to selection mode and 
> selected a column of data using the mouse.  I then copied it to a text editor.  
> It looks very columnar to me, with the only (minor) problems being an extra 
> line containing some unprintable characters (□ -- copied and pasted here, but 
> they show up differently in nedit)  (and a line end character).

Yes, copying directly from PDFs in xpdf also selects rectangles.
OTOH evince (by default: I'm not overly familiar with its
capabilities) appears to select by lines, even though a rectangle
is displayed while dragging the mouse.

> I'm sure that could easily be imported into a spreadsheet just specifying 
> those unprintable characters as the record separator (using that text file).

Or first tidy it up in an editor.
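
The tidying could also be scripted. A minimal sketch, assuming the stray
characters are ones Python classes as non-printable (I don't know what
Okular actually emitted, so that is a guess), and with tidy_column being
a made-up name:

```python
def tidy_column(text):
    """Drop non-printable characters and any lines left empty."""
    cleaned = []
    for line in text.splitlines():
        # Keep only characters Python considers printable.
        kept = "".join(ch for ch in line if ch.isprintable()).strip()
        if kept:
            cleaned.append(kept)
    return cleaned


# Invented example: two values separated by a line of control characters.
pasted = "12\n\x01\x02\n34\n"
print("\n".join(tidy_column(pasted)))
```

If the junk turns out to be ordinary printable symbols instead, the
filter would need to name them explicitly.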

> (I used Okular 0.14.3 as distributed in Wheezy.)
> > 
> > I suspect it may not be practical in the general case.
> > Are there any examples that might come close?

You should be able to coerce any decent editor into earning you a cigar.

Cheers,
David.

