Re: Splitting a PDF of images into individual pages + OCR

To: debian-user-german@lists.debian.org
Subject: Re: Splitting a PDF of images into individual pages + OCR
From: Christoph Conrad <nospam@spamgourmet.com>
Date: Tue, 03 Apr 2007 20:58:31 +0200
Message-id: <87mz1pi8jo.fsf@ID-24456.user.uni-berlin.de>
Reply-to: Christoph Conrad <christoph.conrad@gmx.de>
References: <87ircmjqo9.fsf@ID-24456.user.uni-berlin.de> <20070328055012.GD4953@a-kretschmer.de> <87hcs5hke6.fsf@ID-24456.user.uni-berlin.de> <200703281556.36429.thomas-ml@vollmeronline.de> <87irclf5nr.fsf@ID-24456.user.uni-berlin.de> <87wt0zb5dz.fsf@ID-24456.user.uni-berlin.de> <46121246.2060504@jl42.de>

Hello Jakob,

> Is your script under a free license, so that I am allowed to
> develop it further?

It is now :-) That was the intention all along. See below, v0.2.

I have also added several delete commands. Without them, when several
PDFs are processed in a row and PDF 2 has fewer pages than PDF 1, the
leftover pages of PDF 1 may get appended to the output of PDF 2. The
unpaper split files are now deleted as well; otherwise the second page
of the previous image is appended again whenever a following image
contains only a single page.
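
The leftover-file problem described above could also be sidestepped by
doing all the work in a fresh scratch directory for each PDF. The
following is only a minimal sketch under that assumption, not part of
v0.2; the helper name "in_scratch_dir" is hypothetical, and it assumes
that mktemp is available.

```shell
#!/bin/bash
# Sketch: run a command inside a fresh scratch directory, so that files
# left behind by an earlier run can never be picked up by mistake.
# "in_scratch_dir" is a hypothetical helper, not part of pdfscanned2ascii.
in_scratch_dir() {
    local workdir status
    workdir=$(mktemp -d) || return 1
    # Subshell: the cd does not change the caller's working directory.
    ( cd "$workdir" && "$@" )
    status=$?
    rm -rf "$workdir"
    return "$status"
}

# Example: the first call leaves a file behind in its scratch directory,
# the second call still starts out with an empty directory.
in_scratch_dir sh -c 'touch image-split2.pbm'
leftovers=$(in_scratch_dir sh -c 'ls | wc -l')
echo "files left over from the previous run: $leftovers"
```

Because the command runs in a different directory, the input PDF and
the output file would have to be passed as absolute paths.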

Kind regards,
Christoph

#!/bin/bash

#
# Copyright (c) 2007 Christoph Conrad <mailto:christoph.conrad@gmx.de>
#
# pdfscanned2ascii v0.2
#
# Usage: pdfscanned2ascii <pdf-file>
#
# Split PDF consisting of scanned book images (two pages per image) and
# OCR the content. Output file: "<pdf-file>.txt".
#
# Required software: pdfimages, unpaper, convert, tesseract
#

#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2, or (at your option)
# any later version.

# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.

# You should have received a copy of the GNU General Public License along
# with this program; if not, write to the Free Software Foundation, Inc.,
# 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
#


# trap "" SIGINT

if [ -z "$1" ]; then
    echo "Usage: $0 <pdf-file>" >&2
    exit 1
fi

input="$1"
output="$1.txt"

rm -f images-*.ppm
rm -f images-*.pbm
rm -f image-split*

echo "split pdf to PPM/PBM (color/bw)"
pdfimages "$input" images

rm -f "$output"

# pdfimages zero-pads the image number, so a plain sort keeps page order
find . -maxdepth 1 -name 'images-*' -type f -print | sort | while IFS= read -r file
do
   # split scanned double pages in single bw pages
   unpaper -t pbm --overwrite -l double -op 2 "$file" "image-split%d.pbm"

   # for each split page: pbm->tiff, OCR & collect output
   for page in image-split1 image-split2
   do
      # skip a split page that unpaper did not produce
      [ -f "$page.pbm" ] || continue
      convert "$page.pbm" "$page.tiff"
      # append only if OCR succeeded, so a stale out.txt is never reused
      tesseract "$page.tiff" out && cat out.txt >> "$output"
   done

   rm -f image-split*
done
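
The script header lists pdfimages, unpaper, convert and tesseract as
required software. A check along the following lines could report
missing tools before any work is done. This is only a sketch; the
helper name "missing_tools" is hypothetical and not part of v0.2.

```shell
#!/bin/bash
# Sketch: print the names of those tools that are not found in PATH.
# "missing_tools" is a hypothetical helper, not part of pdfscanned2ascii.
missing_tools() {
    local tool missing=""
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
    done
    # strip the leading space added by the loop
    echo "${missing# }"
}

missing=$(missing_tools pdfimages unpaper convert tesseract)
if [ -n "$missing" ]; then
    echo "missing required software: $missing" >&2
fi
```

Placed at the top of the script, the "if" branch could end with
"exit 1" so that a missing tool stops the run early.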
