
Re: PDF rendering/extraction involving indic scripts



Hi,
Is there any tool on Linux that can read lines from a Marathi-language
PDF file and store them in a MySQL table?

On 1/9/17, Jonas Smedegaard <jonas@jones.dk> wrote:
> Quoting Ritesh Raj Sarraf (2017-01-09 08:51:04)
>> On Mon, 2017-01-09 at 01:05 +0100, Jonas Smedegaard wrote:
>>>> I don't recollect finding any such list when I was running into
>>>> problems with PDF. I remember talking to Vasudev and he suggested
>>>> your name to me, hoping you may have more insight into fonts and
>>>> PDF in general.
>>>
>>> Which problems did you run into, Ritesh, more concretely?
>>
>> Mostly with Indic text extraction from the PDF files. What is rendered
>> in the PDF doesn't get exported as text.
>
> Ah, extraction _from_ PDF.  Yes, that is a pain, because it is
> technically *not* possible to do reliably!
>
> PDF is compiled output from drawing instructions, *not* a source format:
> It was invented as the digital equivalent of paper - just as you can
> scan a piece of paper but not be certain if you semantically got a
> circle or the letter "o" or the digit "0", you can parse a PDF document
> but not be certain if e.g. elements close to each other belong together.
>
> PDF reverse engineering - a.k.a. PDF content extraction - is sometimes
> possible, and more likely when the same tools are used to produce and
> extract.  That trick is (ab)used in particular by the inventor of PDF -
> Adobe - and that has no doubt added to the confusion (if not caused it).
>
> Always call it "PDF files" (not specific brands), and never _depend_ on
> ability to extract content (only proper source is reliable)!
>
> Here are console tools for all known¹ PDF extraction libraries, tested
> on a single² PDF file containing English and Devanagari content:
>
>   * Successfully extracts some Devanagari:
>     * pdftotext (lib:poppler pkg:poppler-utils)
>     * pdftohtml (lib:poppler pkg:poppler-utils)
>     * pdf2htmlex (lib:pdf.js pkg:pdf2htmlex)
>     * pdf2txt (lib:pdfminer pkg:python-pdfminer)
>   * Extracts complete text streams (maybe decodable separately):
>     * pdfextract (lib:origami pkg:origami-pdf)
>     * mutool (lib:mupdf pkg:mupdf-tools)
>   * Fails to extract complete text - skipping Devanagari:
>     * ps2ascii (lib:gs pkg:ghostscript)
>     * pstotext (lib:gs pkg:pstotext)
>     * podofotxtextract (lib:podofo pkg:libpodofo-utils)
>   * Fails to extract any text at all (or I use it wrongly):
>     * pdftosrc (lib:poppler pkg:texlive-binaries)
>     * getpdftext (lib:cam-pdf pkg:libcam-pdf-perl)
>   * Untested (and relevant: uses untested library):
>     * pdfsam (lib:itext pkg:pdfsam)
>     * pdfbox (lib:pdfbox pkg:libpdfbox-java)
>     * pkg:php-tcpdf
>     * pkg:libcamlpdf-ocaml
>
> NB! The list only includes tools with varying _extraction_ features,
> which are typically limited by a single underlying library.  Popular
> examples already covered are OpenOffice (lib:poppler) and Scribus
> (lib:podofo).
>
> I care about PDF rendering and extraction, but I lack knowledge of Indic
> scripts and am unable to spot crucial flaws like misplaced or garbled
> glyphs, or (for rendering) wrong spacing.
>
> If anyone knows about alternative Free tools (with _different_
> extraction features!), please let me know!
>
> Please also share more sample texts with me - both source and rendered
> PDFs - for multiple Indic scripts.
>
>
>  - Jonas
>
>
> ¹ Only code in Debian is truly known; only Free code can become known.
>
> ² A sample text for a Free font authored by a friend of mine:
> https://github.com/cyrealtype/Sumana/raw/master/Samples/Sumana%20Poster.pdf
>
> --
>  * Jonas Smedegaard - idealist & Internet-arkitekt
>  * Tlf.: +45 40843136  Website: http://dr.jones.dk/
>
>  [x] quote me freely  [ ] ask before reusing  [ ] keep private
>
>
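
Regarding the Marathi-to-MySQL question at the top: a minimal sketch,
assuming pdftotext (pkg:poppler-utils, from the list above) manages to
extract usable Devanagari from the file in question. That is not
guaranteed, as Jonas explains, so the output should be checked by
someone who reads Marathi. The function names, the table name
"pdf_lines" and its columns are hypothetical, chosen only for
illustration. The database part works on any Python DB-API connection;
the NFC normalization is a cautious assumption, not a requirement.

```python
import shutil
import subprocess
import unicodedata


def pdf_to_text(pdf_path):
    """Run pdftotext and return the extracted text as a str."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found; install poppler-utils")
    # "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", "-layout", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8")


def text_to_rows(text):
    """Split text into (line_no, content) tuples, skipping blank lines."""
    rows = []
    # splitlines() also splits on the form feed pdftotext puts between pages.
    for line in text.splitlines():
        line = unicodedata.normalize("NFC", line).strip()
        if line:
            rows.append((len(rows) + 1, line))
    return rows


def store_rows(conn, rows, table="pdf_lines", placeholder="%s"):
    """Insert rows through a DB-API connection.

    The table name is hypothetical and must come from trusted code,
    since it is interpolated into the SQL. For MySQL, the database
    should use the utf8mb4 character set to hold Devanagari safely.
    """
    cur = conn.cursor()
    cur.execute(
        f"CREATE TABLE IF NOT EXISTS {table} (line_no INTEGER, content TEXT)"
    )
    cur.executemany(
        f"INSERT INTO {table} (line_no, content) "
        f"VALUES ({placeholder}, {placeholder})",
        rows,
    )
    conn.commit()


# Hypothetical usage with a MySQL driver such as PyMySQL (not stdlib):
#   import pymysql
#   conn = pymysql.connect(host="localhost", user="me", password="secret",
#                          database="mydb", charset="utf8mb4")
#   store_rows(conn, text_to_rows(pdf_to_text("marathi.pdf")))
```

The placeholder defaults to "%s", which is what the common MySQL
drivers expect; other databases may need a different one (SQLite uses
"?"). If pdftotext drops or garbles the Devanagari, one of the other
tools in Jonas's first group may do better on that particular file.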

