
Extracting tabular text from a pdf



I'm trying to create some spreadsheets containing nutritional data.

Sample documents I'm using for input include:
https://choosemyplate-prod.azureedge.net/sites/default/files/2WeekMenusGroceryList.pdf

https://choosemyplate-prod.azureedge.net/sites/default/files/2WeekMenusAndFoodGroupContent.pdf

http://www.nhlbi.nih.gov/files/docs/public/heart/new_dash.pdf

Using "pdftotext -layout input.pdf output.txt" is tantalizingly close to what I want.

When printed, the output visually preserves the relationships between columns.
I wish to copy a single column of data into a spreadsheet, but in the text file I can only select horizontally (by line), NOT vertically (by column).

I suspect it may not be practical in the general case.
Are there any examples that might come close?
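
One approach that might come close, offered as a minimal sketch rather than a general solution: post-process the "pdftotext -layout" output and treat every run of two or more spaces as a column boundary, then write the cells out as CSV so a spreadsheet can open them as columns. The two-space threshold and the function names (split_layout_line, layout_text_to_rows, write_csv) are assumptions made for illustration; they are not part of pdftotext.

```python
import csv
import re
import sys

# Assumption: two or more consecutive spaces mark a column gap, while a
# single space is ordinary text inside a cell (e.g. "Whole milk").
COLUMN_GAP = re.compile(r" {2,}")


def split_layout_line(line):
    """Split one line of layout-preserved text into a list of cells."""
    stripped = line.strip()
    if not stripped:
        return []
    return [cell.strip() for cell in COLUMN_GAP.split(stripped)]


def layout_text_to_rows(text):
    """Return a list of rows (each a list of cells), skipping blank lines."""
    rows = []
    for line in text.splitlines():
        cells = split_layout_line(line)
        if cells:
            rows.append(cells)
    return rows


def write_csv(rows, out):
    """Write the rows to an open text stream in CSV format."""
    csv.writer(out).writerows(rows)


if __name__ == "__main__" and len(sys.argv) == 2:
    # Hypothetical usage: python layout_to_csv.py output.txt > output.csv
    with open(sys.argv[1], encoding="utf-8") as f:
        write_csv(layout_text_to_rows(f.read()), sys.stdout)
```

This will likely misalign rows where a cell is empty or where a cell's text wraps onto a second line, because the gap heuristic cannot tell which column is missing. Splitting at fixed character positions taken from the header line may hold up better for such tables.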



