Re: PDF rendering/extraction involving indic scripts

To: Jonas Smedegaard <jonas@jones.dk>
Cc: Ritesh Raj Sarraf <rrs@researchut.com>, "Abhijit A. M." <abhijit13@disroot.org>, debian-dug-in@lists.debian.org, Siri Reiter <siri@jones.dk>
Subject: Re: PDF rendering/extraction involving indic scripts
From: Mahendra Bhandwalkar <mahendra.bhandwalkar@gmail.com>
Date: Wed, 11 Jan 2017 09:43:31 +0530
Message-id: <CABsoPXBA_x8bfh5+BzkPnq71DuLfoVf2AtO9Nbsk-9fA70E_Gw@mail.gmail.com>
In-reply-to: <148397455120.2347.3555704517203462018@auryn.jones.dk>
References: <90d222f7-22c9-0e31-38d8-e32d5d11b66d@disroot.org> <1483896846.12261.1.camel@researchut.com> <148389854251.2347.10605487073741548101@auryn.jones.dk> <1483899942.12261.3.camel@researchut.com> <148392032510.2347.14272967871766835336@auryn.jones.dk> <1483948264.12261.5.camel@researchut.com> <148397455120.2347.3555704517203462018@auryn.jones.dk>

Hi,
is there any tool on Linux that will read lines from a Marathi-language
PDF file and store them in a MySQL table?
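
To make the question concrete, this is roughly the pipeline I have in
mind. It is only a minimal sketch under assumptions: it relies on
pdftotext from poppler-utils (one of the tools listed below as
extracting some Devanagari), the table name pdf_lines and all function
names are made up for illustration, and sqlite3 stands in for MySQL so
the sketch runs without a server. With a MySQL driver the placeholder
would be "%s" instead of "?", and the table would need a utf8mb4
character set.

```python
import sqlite3
import subprocess
import unicodedata


def extract_text(pdf_path):
    """Run pdftotext (poppler-utils) and return the extracted text.

    "-enc UTF-8" selects the output encoding; "-" writes to stdout.
    Whether the Marathi text comes out intact depends on the PDF itself.
    """
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8")


def clean_lines(text):
    """Normalize each line to Unicode NFC and drop blank lines."""
    lines = []
    for raw in text.splitlines():
        line = unicodedata.normalize("NFC", raw).strip()
        if line:
            lines.append(line)
    return lines


def store_lines(conn, lines, placeholder="?"):
    """Insert numbered lines through any DB-API connection.

    The table name pdf_lines is hypothetical. sqlite3 uses "?" as the
    parameter placeholder; MySQL drivers typically use "%s".
    """
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE IF NOT EXISTS pdf_lines (line_no INTEGER, content TEXT)"
    )
    cur.executemany(
        "INSERT INTO pdf_lines (line_no, content) "
        f"VALUES ({placeholder}, {placeholder})",
        list(enumerate(lines, start=1)),
    )
    conn.commit()
```

Usage would be something like
store_lines(conn, clean_lines(extract_text("file.pdf"))), where
"file.pdf" is a placeholder path. The result can only be as good as
what pdftotext manages to extract.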

On 1/9/17, Jonas Smedegaard <jonas@jones.dk> wrote:
> Quoting Ritesh Raj Sarraf (2017-01-09 08:51:04)
>> On Mon, 2017-01-09 at 01:05 +0100, Jonas Smedegaard wrote:
>>>> I don't recollect finding any such list, when I was running into
>>>> problems with PDF. I remember talking to Vasudev and he suggested
>>>> your name to me, hoping you may have more insight into fonts and
>>>> PDF in general.
>>>
>>> Which problems did you run into, Ritesh, more concretely?
>>
>> Mostly with Indic text extraction from the PDF files. What is rendered
>> in the PDF doesn't get exported as text.
>
> Ah, extraction _from_ PDF.  Yes, that is a pain, because it is
> technically *not* possible to do reliably!
>
> PDF is compiled output from drawing instructions, *not* a source format:
> It was invented as the digital equivalent of paper - just as you can
> scan a piece of paper but not be certain if you semantically got a
> circle or the letter "o" or the digit "0", you can parse a PDF document
> but not be certain if e.g. elements close to each other belong together.
>
> PDF reverse engineering - a.k.a. PDF content extraction - is sometimes
> possible, and more likely when the same tools are used to produce and
> extract.  That trick is (ab)used in particular by the inventor of PDF -
> Adobe - and that has no doubt added to the confusion (if not caused it).
>
> Always call it "PDF files" (not specific brands), and never _depend_ on
> ability to extract content (only proper source is reliable)!
>
> Here are console tools for all known¹ PDF extraction libraries, tested
> on a single² PDF file containing English and Devanagari content:
>
>   * Successfully extracts some Devanagari:
>     * pdftotext (lib:poppler pkg:poppler-utils)
>     * pdftohtml (lib:poppler pkg:poppler-utils)
>     * pdf2htmlex (lib:pdf.js pkg:pdf2htmlex)
>     * pdf2txt (lib:pdfminer pkg:python-pdfminer)
>   * Extracts complete text streams (maybe decodable separately):
>     * pdfextract (lib:origami pkg:origami-pdf)
>     * mutool (lib:mupdf pkg:mupdf-tools)
>   * Fails to extract complete text - skipping Devanagari:
>     * ps2ascii (lib:gs pkg:ghostscript)
>     * pstotext (lib:gs pkg:pstotext)
>     * podofotxtextract (lib:podofo pkg:libpodofo-utils)
>   * Fails to extract any text at all (or I use it wrongly):
>     * pdftosrc (lib:poppler pkg:texlive-binaries)
>     * getpdftext (lib:cam-pdf pkg:libcam-pdf-perl)
>   * Untested (and relevant: uses untested library):
>     * pdfsam (lib:itext pkg:pdfsam)
>     * pdfbox (lib:pdfbox pkg:libpdfbox-java)
>     * pkg:php-tcpdf
>     * pkg:libcamlpdf-ocaml
>
> NB! The list only includes tools with varying _extraction_ features,
> which are typically limited by a single underlying library.  Popular
> examples already covered are OpenOffice (lib:poppler) and Scribus
> (lib:podofo).
>
> I care about PDF rendering and extraction, but I lack knowledge on indic
> scripts and am unable to spot crucial flaws like misplaced or garbled
> glyphs, or (for rendering) wrong spacing.
>
> If anyone knows about alternative Free tools (with _different_
> extraction features!), please let me know!
>
> Please also share more sample texts with me - both source and rendered
> PDFs - for multiple indic scripts.
>
>
>  - Jonas
>
>
> ¹ Only code in Debian is truly known; only Free code can become known.
>
> ² A sample text for a Free font authored by a friend of mine:
> https://github.com/cyrealtype/Sumana/raw/master/Samples/Sumana%20Poster.pdf
>
> --
>  * Jonas Smedegaard - idealist & Internet-arkitekt
>  * Tlf.: +45 40843136  Website: http://dr.jones.dk/
>
>  [x] quote me freely  [ ] ask before reusing  [ ] keep private
>
>
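
Regarding the grouping of tools quoted above: a rough way to tell
whether a tool's output kept or skipped the Devanagari text is to
count how many characters fall in the Unicode Devanagari block
(U+0900 to U+097F). This is only a sketch, and the function name is
made up for illustration. It says nothing about misplaced or garbled
glyphs, only whether Devanagari code points are present at all.

```python
def devanagari_ratio(text):
    """Return the fraction of non-whitespace characters that lie in
    the Unicode Devanagari block (U+0900 to U+097F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return hits / len(chars)
```

A ratio near zero on a file known to contain Marathi would point at
the "skipping Devanagari" group; a high ratio still needs a human
reader to confirm that the text is correct.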

