Re: Problems with a PDF
The explanation is that it's a two-column document: in geometrical order,
i.e. by position on the page, numbers 4 and 5 really are on top of 1 and 2 ff.
pdftotext -layout file.pdf
keeps the document layout intact, so you can see the first column on the
left side and the second column on the right side. The pages are meant to
be read top-down on the left, then top-down on the right. That is indeed
confusing, but at least the text order doesn't get messed up in the text
representation when the -layout option is present.
In order to fix this, i.e. create single-column text, you would need to
copy and paste the document column by column, page by page, which is
probably not supported by Evince. tesseract can do column-wise OCR on a
scanned document, but converting the file to a picture and then running
OCR on it will probably introduce even more errors.
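
As a possible alternative to manual copy and paste, the text produced by
"pdftotext -layout" could be post-processed. What follows is only a minimal
sketch, not something I have run against this particular file. It assumes
that every page has exactly two columns separated by a vertical strip of
blanks, and that pages are separated by form feed characters, which is what
pdftotext emits by default. The function names find_gutter, reflow_page and
reflow_document are made up for this illustration.

```python
def find_gutter(lines, min_width=2):
    """Return a column index inside the widest all-blank vertical strip.

    The strip must not touch the left margin. Returns None if no strip of
    at least min_width characters exists.
    """
    width = max((len(line) for line in lines), default=0)
    best_start, best_len = None, 0
    run_start = None
    for col in range(width + 1):
        # A column is blank if every line is either too short or has a space there.
        is_blank = col < width and all(
            col >= len(line) or line[col] == " " for line in lines
        )
        if is_blank:
            if run_start is None:
                run_start = col
        elif run_start is not None:
            run_len = col - run_start
            if run_start > 0 and run_len >= min_width and run_len > best_len:
                best_start, best_len = run_start, run_len
            run_start = None
    if best_start is None:
        return None
    return best_start + best_len // 2


def reflow_page(page_text):
    """Return the left column of one page followed by its right column."""
    lines = [line.rstrip() for line in page_text.splitlines()]
    boundary = find_gutter(lines)
    if boundary is None:
        # No gutter found: assume a single-column page and leave it alone.
        return page_text
    left = [line[:boundary].rstrip() for line in lines]
    right = [line[boundary:].strip() for line in lines]
    # Empty lines are dropped to keep the sketch short.
    return "\n".join([l for l in left if l] + [r for r in right if r])


def reflow_document(text):
    """Apply reflow_page to every form-feed-separated page."""
    return "\f".join(reflow_page(page) for page in text.split("\f"))
```

To try it, run "pdftotext -layout file.pdf" (which writes file.txt), read
that file into a string and pass it to reflow_document. A heading or
footnote that spans the full page width will hide the gutter, so such pages
come back unchanged and still need manual attention.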
On Tue, Oct 13, 2015 at 12:55:12AM +0200, MENGUAL Jean-Philippe wrote:
> I'm trying to read a European law, and for the 1st time I cannot. It's a
> pdf file. It's here:
> 1. I tried pdftotext
> 2. I opened with Evince then Atril, ctrl-a, ctrl-c, ctrl-v in gedit/pluma.
> Without understanding the language, you will easily see that the numbers are
> disordered. Instead of (1) (2) (3) (4), the doc starts with (4), (5), then
> (1), (2). Confusing.
> An explanation? An idea to fix? What should I do (including report a bug
> Jean-Philippe MENGUAL