
Re: PDF rendering/extraction involving indic scripts

Quoting Ritesh Raj Sarraf (2017-01-09 08:51:04)
> On Mon, 2017-01-09 at 01:05 +0100, Jonas Smedegaard wrote:
>>> I don't recollect finding any such list, when I was running into  
>>> problems with PDF. I remember talking to Vasudev and he suggested 
>>> me  your name, hoping you may have more insight into Fonts and PDF 
>>> in  general.
>> Which problems did you run into, Ritesh, more concretely?
> Mostly with Indic text extraction from the PDF files. What is rendered 
> in the PDF doesn't get exported as text.

Ah, extraction _from_ PDF.  Yes, that is a pain, because it is 
technically *not* possible to do reliably!

PDF is compiled output from drawing instructions, *not* a source format: 
It was invented as the digital equivalent of paper - just as you can 
scan a piece of paper but not be certain if you semantically got a 
circle or the letter "o" or the digit "0", you can parse a PDF document 
but not be certain if e.g. elements close to each other belong together.
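
A rough sketch of that ambiguity, in Python.  The glyph positions are 
made up and `join_glyphs` is a hypothetical helper, not the method of 
any real PDF library: it only shows that word boundaries are guessed 
from spacing and that the guess changes with the threshold.

```python
from typing import List, Tuple

# One glyph as a drawing instruction might place it:
# (x position, width, character).  All numbers are invented.
Glyph = Tuple[float, float, str]


def join_glyphs(glyphs: List[Glyph], gap_threshold: float) -> str:
    """Rebuild a string from positioned glyphs on a single baseline.

    A space is inserted whenever the horizontal gap between two
    neighbouring glyphs exceeds gap_threshold.  This is a heuristic:
    the PDF itself does not say where words begin or end.
    """
    pieces = []
    prev_end = None
    for x, width, ch in sorted(glyphs):
        if prev_end is not None and x - prev_end > gap_threshold:
            pieces.append(" ")
        pieces.append(ch)
        prev_end = x + width
    return "".join(pieces)


# Hypothetical glyph run: gaps are 0.5, 1.5 and 0.5 units.
glyphs = [(0.0, 5.0, "t"), (5.5, 5.0, "o"), (12.0, 5.0, "b"), (17.5, 5.0, "e")]

for threshold in (0.3, 1.0, 2.0):
    print(threshold, repr(join_glyphs(glyphs, threshold)))
```

The same four glyphs come out as one word, two words or four single 
letters depending on the threshold chosen.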

PDF reverse engineering - a.k.a. PDF content extraction - is sometimes 
possible, and more likely when the same tools are used to produce and 
extract.  That trick is (ab)used in particular by the inventor of PDF - 
Adobe - and that has no doubt added to the confusion (if not caused it).

Always call it "PDF files" (not specific brands), and never _depend_ on 
ability to extract content (only proper source is reliable)!

Here are console tools for all known¹ PDF extraction libraries, tested 
on a single² PDF file containing English and Devanagari content:

  * Successfully extracts some Devanagari:
    * pdftotext (lib:poppler pkg:poppler-utils)
    * pdftohtml (lib:poppler pkg:poppler-utils)
    * pdf2htmlex (lib:pdf.js pkg:pdf2htmlex)
    * pdf2txt (lib:pdfminer pkg:python-pdfminer)
  * Extracts complete text streams (maybe decodable separately):
    * pdfextract (lib:origami pkg:origami-pdf)
    * mutool (lib:mupdf pkg:mupdf-tools)
  * Fails to extract complete text - skipping Devanagari:
    * ps2ascii (lib:gs pkg:ghostscript)
    * pstotext (lib:gs pkg:pstotext)
    * podofotxtextract (lib:podofo pkg:libpodofo-utils)
  * Fails to extract any text at all (or I used it wrongly):
    * pdftosrc (lib:poppler pkg:texlive-binaries)
    * getpdftext (lib:cam-pdf pkg:libcam-pdf-perl)
  * Untested (and relevant: uses untested library):
    * pdfsam (lib:itext pkg:pdfsam)
    * pdfbox (lib:pdfbox pkg:libpdfbox-java)
    * pkg:php-tcpdf
    * pkg:libcamlpdf-ocaml
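
A comparison like the one above can be roughly repeated with a small 
Python sketch.  It assumes pdftotext from poppler-utils is installed; 
the helper names are my own, and counting code points in the Devanagari 
Unicode block (U+0900 to U+097F) only tells whether _some_ Devanagari 
came out, not whether it is complete or correctly ordered.

```python
import shutil
import subprocess

# Unicode block "Devanagari": U+0900 .. U+097F
DEVANAGARI = range(0x0900, 0x0980)


def count_devanagari(text: str) -> int:
    """Count code points that fall inside the Devanagari block."""
    return sum(1 for ch in text if ord(ch) in DEVANAGARI)


def extract_with_pdftotext(pdf_path: str) -> str:
    """Run pdftotext and return what it extracted as a string.

    Passing "-" as the output file makes pdftotext write to stdout.
    """
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found (package poppler-utils)")
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def report(pdf_path: str) -> str:
    """Summarise how much Devanagari the extraction produced."""
    text = extract_with_pdftotext(pdf_path)
    return f"{pdf_path}: {count_devanagari(text)} Devanagari code points of {len(text)}"
```

Calling `report("sample.pdf")` (the file name is a placeholder) gives a 
one-line summary; a count of zero for a file that visibly contains 
Devanagari corresponds to the "skipping Devanagari" group.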

NB! The list only includes tools with varying _extraction_ features, 
which are typically limited by a single underlying library.  Popular 
examples already covered are OpenOffice (lib:poppler) and Scribus 

I care about PDF rendering and extraction, but I lack knowledge on Indic 
scripts and am unable to spot crucial flaws like misplaced or garbled 
glyphs, or (for rendering) wrong spacing.

If anyone knows about alternative Free tools (with _different_ 
extraction features!), please let me know!

Please also share more sample texts with me - both source and rendered 
PDFs - for multiple Indic scripts.

 - Jonas

¹ Only code in Debian is truly known; only Free code can become known.

² A sample text for a Free font authored by a friend of mine: 

 * Jonas Smedegaard - idealist & Internet-arkitekt
 * Tlf.: +45 40843136  Website: http://dr.jones.dk/

 [x] quote me freely  [ ] ask before reusing  [ ] keep private
