Re: Splitting a PDF made of images into single pages + OCR

To: debian-user-german@lists.debian.org
Subject: Re: Splitting a PDF made of images into single pages + OCR
From: Christoph Conrad <nospam@spamgourmet.com>
Date: Fri, 30 Mar 2007 08:33:12 +0200
Message-id: <87wt0zb5dz.fsf@ID-24456.user.uni-berlin.de>
Reply-to: Christoph Conrad <christoph.conrad@gmx.de>
References: <87ircmjqo9.fsf@ID-24456.user.uni-berlin.de> <20070328055012.GD4953@a-kretschmer.de> <87hcs5hke6.fsf@ID-24456.user.uni-berlin.de> <200703281556.36429.thomas-ml@vollmeronline.de> <87irclf5nr.fsf@ID-24456.user.uni-berlin.de>

Hello,

my script is attached below. It is very straightforward and tailored
exactly to my needs, but perhaps it will help some of you. The output
needs post-processing, which is easy to do with regexps in sed or
Emacs.
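
To illustrate the kind of post-processing meant here, a minimal sed sketch might look like the following. The function name clean_ocr and the chosen rules (strip trailing whitespace, rejoin words hyphenated at a line end, drop lines that hold only a page number, squeeze blank lines) are assumptions for illustration, not the clean-up actually applied to the script's output.

```shell
#!/bin/bash
# Hypothetical helper: the rules below are illustrative assumptions about
# what OCR output typically needs, not the author's actual post-processing.
clean_ocr() {
    # :a                     loop label
    # s/[[:space:]]*$//      strip trailing whitespace
    # /-$/ ... N, s/-\n//    if the line ends in "-" and is not the last
    #                        line, append the next line and join the word
    # /^[0-9][0-9]*$/d       drop lines that contain only digits
    sed -e ':a' \
        -e 's/[[:space:]]*$//' \
        -e '/-$/{' -e '$!{' -e 'N' -e 's/-\n//' -e 'ba' -e '}' -e '}' \
        -e '/^[0-9][0-9]*$/d' \
        "$@" | cat -s   # cat -s squeezes runs of blank lines into one
}

# Usage (file names are placeholders):
# clean_ocr book.pdf.txt > book-clean.txt
```

Note that the join rule also merges genuine hyphens that happen to fall at a line end, so the result still needs a manual look.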

Kind regards,
Christoph

#!/bin/bash

#
# Copyright (c) 2007 Christoph Conrad <mailto:christoph.conrad@gmx.de>
#
# pdfscanned2ascii v0.1
#
# Usage: pdfscanned2ascii <pdf-file>
#
# Split PDF consisting of scanned book images (two pages per image) and
# OCR the content. Output file: "<pdf-file>.txt".
#
# Required software: pdfimages, unpaper, convert, tesseract
#

# Abort with a usage message if no PDF file was given.
if [ -z "$1" ]; then
    echo "Usage: $0 <pdf-file>" >&2
    exit 1
fi

input="$1"
output="$1.txt"

echo "split pdf to PPM/PBM (color/bw)"
pdfimages "$input" images

rm -f "$output"

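# Process the extracted images in name order; read -r keeps backslashes
# in file names intact.
find . -name 'images*' -type f -print | sort | while IFS= read -r file
do
   # split scanned double pages into single bw pages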
   unpaper -t pbm --overwrite -l double -op 2 "$file" "image-split%d.pbm"

   # pbm->tiff
   convert image-split1.pbm image-split1.tiff
   convert image-split2.pbm image-split2.tiff

   # OCR & collect output
   tesseract image-split1.tiff out
   cat out.txt >> "$output"
   tesseract image-split2.tiff out
   cat out.txt >> "$output"
done
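
Since the script depends on four external programs, a check before the first step could look roughly like this. It is only a sketch: the function name require_tools is hypothetical, and the tool names are taken from the "Required software" line in the script header.

```shell
#!/bin/bash
# Sketch: verify that every tool named on the command line is installed.
# require_tools is a hypothetical helper, not part of the original script.
require_tools() {
    local missing=0 tool
    for tool in "$@"; do
        # command -v succeeds only if the shell can find the tool
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing required tool: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage, placed before the pdfimages call:
# require_tools pdfimages unpaper convert tesseract || exit 1
```

Reporting all missing tools at once, rather than stopping at the first one, saves repeated runs when several packages still need to be installed.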
