Re: PDF rendering/extraction involving indic scripts
Hi,

Is there any tool on Linux that will read lines from a Marathi-language
PDF file and store them in a MySQL table?
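
A minimal sketch of the pipeline being asked about, assuming the text can be extracted at all: it shells out to `pdftotext` from poppler-utils and writes each non-empty line to a table through a Python DB-API connection. The table name `pdf_lines`, its columns, and both helper names are made up for illustration. The `%s` placeholder assumes a MySQL driver such as PyMySQL or mysql-connector-python, neither of which is in the standard library.

```python
import subprocess


def extract_lines(pdf_path):
    """Run pdftotext (poppler-utils) and return the non-empty lines of text."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", pdf_path, "-"],  # "-" sends text to stdout
        check=True,
        capture_output=True,
    )
    text = result.stdout.decode("utf-8", errors="replace")
    return [line.strip() for line in text.splitlines() if line.strip()]


def store_lines(conn, lines, placeholder="%s"):
    """Insert (line_no, content) rows into the hypothetical pdf_lines table.

    conn is any DB-API 2.0 connection. MySQL drivers use "%s" as the
    parameter marker; sqlite3 uses "?".
    """
    sql = "INSERT INTO pdf_lines (line_no, content) VALUES ({0}, {0})".format(placeholder)
    rows = list(enumerate(lines, start=1))
    cur = conn.cursor()
    cur.executemany(sql, rows)
    conn.commit()
    return len(rows)
```

Usage would be along the lines of `store_lines(conn, extract_lines("file.pdf"))`, with `conn` opened by whichever driver is installed and the table created beforehand with a Unicode character set such as utf8mb4. Whether the stored lines are correct Marathi depends entirely on what the extractor manages to recover from that particular PDF.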
On 1/9/17, Jonas Smedegaard <jonas@jones.dk> wrote:
> Quoting Ritesh Raj Sarraf (2017-01-09 08:51:04)
>> On Mon, 2017-01-09 at 01:05 +0100, Jonas Smedegaard wrote:
>>>> I don't recollect finding any such list, when I was running into
>>>> problems with PDF. I remember talking to Vasudev and he suggested
>>>> me your name, hoping you may have more insight into Fonts and PDF
>>>> in general.
>>>
>>> Which problems did you run into, Ritesh, more concretely?
>>
>> Mostly with Indic text extraction from the PDF files. What is rendered
>> in the PDF doesn't get exported as text.
>
> Ah, extraction _from_ PDF. Yes, that is a pain, because it is
> technically *not* possible to do reliably!
>
> PDF is compiled output from drawing instructions, *not* a source format:
> It was invented as the digital equivalent of paper - just as you can
> scan a piece of paper but not be certain if you semantically got a
> circle or the letter "o" or the digit "0", you can parse a PDF document
> but not be certain if e.g. elements close to each other belong together.
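
To illustrate the point about nearby elements: a PDF page can amount to little more than glyphs painted at coordinates, so an extractor has to guess where words and lines begin. The sketch below is a toy heuristic over made-up `(x, y, char)` tuples, not how any of the tools in this thread actually work. The function name and the `gap` and `line_tol` thresholds are arbitrary assumptions.

```python
def group_glyphs(glyphs, gap=6.0, line_tol=2.0):
    """Guess lines and words from positioned glyphs.

    glyphs: iterable of (x, y, char) with y growing upwards, as on a PDF page.
    Glyphs whose y is within line_tol of a line's first glyph share that line;
    a horizontal jump larger than gap between neighbours is taken as a word break.
    """
    lines = []  # each entry: [reference_y, [(x, char), ...]]
    for x, y, ch in sorted(glyphs, key=lambda g: (-g[1], g[0])):
        if lines and abs(lines[-1][0] - y) <= line_tol:
            lines[-1][1].append((x, ch))
        else:
            lines.append([y, [(x, ch)]])

    result = []
    for _, items in lines:
        items.sort()  # left to right within the guessed line
        words, current, last_x = [], "", None
        for x, ch in items:
            if last_x is not None and x - last_x > gap:
                words.append(current)
                current = ""
            current += ch
            last_x = x
        words.append(current)
        result.append(" ".join(words))
    return result
```

The same glyph positions come out as different text depending on the threshold chosen, which is the point: the word boundary is a guess, not something stored in the file. As far as I understand, Indic scripts make this harder still, because shaped or reordered glyphs need not map one-to-one onto Unicode characters.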
>
> PDF reverse engineering - a.k.a. PDF content extraction - is sometimes
> possible, and more likely when the same tools are used to produce and
> extract. That trick is (ab)used in particular by the inventor of PDF -
> Adobe - and that has no doubt added to the confusion (if not caused it).
>
> Always call them "PDF files" (not specific brands), and never _depend_ on
> the ability to extract content (only proper source is reliable)!
>
> Here are console tools for all known¹ PDF extraction libraries, tested
> on a single² PDF file containing English and Devanagari content:
>
>  * Successfully extracts some Devanagari:
>    * pdftotext (lib:poppler pkg:poppler-utils)
>    * pdftohtml (lib:poppler pkg:poppler-utils)
>    * pdf2htmlex (lib:pdf.js pkg:pdf2htmlex)
>    * pdf2txt (lib:pdfminer pkg:python-pdfminer)
>  * Extracts complete text streams (maybe decodable separately):
>    * pdfextract (lib:origami pkg:origami-pdf)
>    * mutool (lib:mupdf pkg:mupdf-tools)
>  * Fails to extract complete text - skipping Devanagari:
>    * ps2ascii (lib:gs pkg:ghostscript)
>    * pstotext (lib:gs pkg:pstotext)
>    * podofotxtextract (lib:podofo pkg:libpodofo-utils)
>  * Fails to extract any text at all (or I used it wrongly):
>    * pdftosrc (lib:poppler pkg:texlive-binaries)
>    * getpdftext (lib:cam-pdf pkg:libcam-pdf-perl)
>  * Untested (and relevant: uses untested library):
>    * pdfsam (lib:itext pkg:pdfsam)
>    * pdfbox (lib:pdfbox pkg:libpdfbox-java)
>    * pkg:php-tcpdf
>    * pkg:libcamlpdf-ocaml
>
> NB! The list only includes tools with varying _extraction_ features,
> which is typically limited by a single underlying library. Popular
> examples already covered are OpenOffice (lib:poppler) and Scribus
> (lib:podofo).
>
> I care about PDF rendering and extraction, but I lack knowledge of Indic
> scripts and am unable to spot crucial flaws like misplaced or garbled
> glyphs, or (for rendering) wrong spacing.
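
A crude automated check can at least separate "skips Devanagari entirely" from "extracts some" without anyone reading the script. The sketch below measures what share of the extracted characters fall in the Unicode Devanagari block (U+0900 to U+097F). Both function names are hypothetical, and the command passed in is assumed to print the extracted text to stdout as UTF-8.

```python
import subprocess


def devanagari_share(text):
    """Return the fraction of non-whitespace characters in U+0900..U+097F."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return hits / len(chars)


def share_for_command(argv):
    """Run an extraction command (given as an argv list) and score its stdout."""
    result = subprocess.run(argv, check=True, capture_output=True)
    return devanagari_share(result.stdout.decode("utf-8", errors="replace"))
```

For example, `share_for_command(["pdftotext", "-enc", "UTF-8", "sample.pdf", "-"])` would score poppler's output for a file named `sample.pdf`. A share near zero on a mostly Devanagari document points to skipped text. A high share says nothing about misplaced or garbled glyphs, so it cannot replace review by someone who reads the script.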
>
> If anyone knows about alternative Free tools (with _different_
> extraction features!), please let me know!
>
> Please also share more sample texts with me - both source and rendered
> PDFs - for multiple Indic scripts.
>
>
> - Jonas
>
>
> ¹ Only code in Debian is truly known; only Free code can become known.
>
> ² A sample text for a Free font authored by a friend of mine:
> https://github.com/cyrealtype/Sumana/raw/master/Samples/Sumana%20Poster.pdf
>
> --
> * Jonas Smedegaard - idealist & Internet-arkitekt
> * Tlf.: +45 40843136 Website: http://dr.jones.dk/
>
> [x] quote me freely [ ] ask before reusing [ ] keep private
>
>