
Re: How to search and mine text on a two-column pdf file?



Thank you Samuel,

What are the bash commands to make pdfcrop and pdftk work? Thanks.

Regards


Henry

Samuel Thibault <sthibault@debian.org> wrote on Thu, Oct 27, 2022 at 12:27 PM:
Henry Chang, on Thu 27 Oct 2022 12:08:20 -0400, wrote:
> I found that the original 11470644.pdf is formatted in two columns. The text
> on each line of the first column gets mixed up with the text on the line of
> the second column at the same position.

Perhaps you can use pdfcrop and pdftk to split pages into the left and
the right parts, and join them together again in a single pdf file that
you can feed to tesseract.

Samuel
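
For illustration, the split-and-rejoin idea above could be sketched in
shell roughly as follows. This is only a sketch under assumptions not
stated in the thread: every page has the same size, the gap between the
columns sits at the horizontal middle of the page, and pdfinfo (from
poppler-utils) is installed to read the page size. The function names
and the left.pdf/right.pdf file names are made up for the example.

```shell
# Sketch: crop each page to its left and right half, then interleave the
# halves so the text reads in column order. Sizes are in PDF points.
# Assumes uniform page size and a column gap at the middle of the page.

# Print the bounding box "left bottom right top" of the left half.
left_bbox() {
  awk -v w="$1" -v h="$2" 'BEGIN { printf "0 0 %g %g\n", w / 2, h }'
}

# Print the bounding box "left bottom right top" of the right half.
right_bbox() {
  awk -v w="$1" -v h="$2" 'BEGIN { printf "%g 0 %g %g\n", w / 2, w, h }'
}

# split_columns INPUT.pdf OUTPUT.pdf (hypothetical helper name)
split_columns() {
  src=$1
  out=$2
  # pdfinfo prints a line such as "Page size:  612 x 792 pts (letter)".
  size=$(pdfinfo "$src" | awk '/^Page size:/ { print $3, $5 }')
  width=${size% *}
  height=${size#* }
  # --bbox overrides the box that pdfcrop would otherwise detect per page.
  pdfcrop --bbox "$(left_bbox "$width" "$height")" "$src" left.pdf
  pdfcrop --bbox "$(right_bbox "$width" "$height")" "$src" right.pdf
  # shuffle takes one page from A, then one from B, and so on, giving
  # page 1 left, page 1 right, page 2 left, page 2 right, ...
  pdftk A=left.pdf B=right.pdf shuffle A B output "$out"
}
```

With these functions loaded, something like
"split_columns 11470644.pdf 11470644-columns.pdf" would produce a file
with one column per page. If the columns are not centred, the bounding
box numbers would need adjusting by hand, and the shuffle operation
needs a reasonably recent pdftk.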


--
Muchiu (Henry) Chang, PhD. Cantab
Patent Mapping Intelligence Researcher &
Monte Carlo Modeling Simulation Expert
https://www.linkedin.com/in/mcc212/
tel. +1-416-828-5676
