--- Begin Message ---
- To: Debian Bug Tracking System <submit@bugs.debian.org>
- Subject: referencer: PDF-scraping for DOIs sometimes cuts them off in the middle
- From: Zack Weinberg <zackw@panix.com>
- Date: Sat, 12 May 2007 20:00:56 -0700
- Message-id: <20070513030056.10199.23411.reportbug@localhost>
Package: referencer
Version: 1.0.2-1
Severity: normal
I have a number of PDFs in which a DOI appears in the text, but
Referencer cannot scrape it out properly. There is no true metadata in the
PDF, so Referencer falls back to extracting text from the page body. The
complete BT/ET block containing the DOI is at the end of this message, but
the key bit is this:
[(doi:10.1016/)14.5(S)-95.3(0)]TJ
6.3307 0 TD
0.0983 Tc
[(010-0277\(02\)00)-6.3(235-4)]TJ
ET
This causes libpoppler to feed this text to BibData::guessDoi():
doi:10.1016/S 0 0 1 0 - 0 2 7 7 ( 0 2 ) 0 0 2 3 5 - 4\n
"10.1016/S" is what Referencer records as the DOI. The correct DOI is the above
string with all the spaces taken out, i.e. 10.1016/S0010-0277(02)00235-4 .
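As a rough check that the DOI is intact in the content stream itself, the
sketch below concatenates only the literal string operands of the excerpt and
ignores every positioning number and operator. It is not what libpoppler
does; the name tj_strings is hypothetical, and the parser deliberately
handles only single-character backslash escapes (no octal escapes, no nested
unescaped parentheses). The result suggests that the spaces are presumably
added during extraction because of glyph positioning (the TJ adjustments and
the Tc character spacing) rather than being present in the strings.

```python
import re

# The key excerpt from the report, copied verbatim.
EXCERPT = r"""[(doi:10.1016/)14.5(S)-95.3(0)]TJ
6.3307 0 TD
0.0983 Tc
[(010-0277\(02\)00)-6.3(235-4)]TJ
ET"""


def tj_strings(content):
    """Join the literal (...) string operands found in a content-stream excerpt.

    Hypothetical helper for illustration only: positioning numbers and
    operators are skipped, and only simple backslash escapes are decoded.
    """
    pieces = []
    # A literal string: "(", then escaped characters or plain characters, then ")".
    for literal in re.findall(r'\((?:\\.|[^\\()])*\)', content):
        body = literal[1:-1]
        # Turn "\(" and "\)" back into "(" and ")".
        pieces.append(re.sub(r'\\(.)', r'\1', body))
    return ''.join(pieces)


print(tj_strings(EXCERPT))
```

Run against the excerpt, this prints doi:10.1016/S0010-0277(02)00235-4, i.e.
the string operands contain no spaces at all.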
Unfortunately, I don't have any concrete suggestion for how guessDoi() could
do a better job in this case without also screwing up other situations (where
random text appears immediately after the DOI, separated only by a space).
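One possible heuristic, offered only as a sketch and not as a tested fix:
after matching the DOI prefix, absorb a following run of isolated single
characters, each preceded by exactly one space, and only if the run is long
enough that it is unlikely to be ordinary prose. Everything here is an
assumption for illustration: the function name guess_doi, the simplified DOI
pattern, and the minimum run length of 3 are not taken from Referencer's
BibData::guessDoi().

```python
import re

# Simplified DOI pattern (an assumption, not Referencer's actual regex).
DOI_PREFIX = re.compile(r'10\.\d{4,}/\S+')

# A run of at least MIN_RUN single characters, each preceded by one space
# and followed by whitespace or end of text. The threshold is arbitrary.
MIN_RUN = 3
SPACED_RUN = re.compile(r'(?: \S(?=\s|$)){%d,}' % MIN_RUN)


def guess_doi(text):
    """Return a best-guess DOI from extracted text, or None if none is found."""
    m = DOI_PREFIX.search(text)
    if m is None:
        return None
    doi = m.group(0)
    # Look only at the text immediately after the matched prefix.
    run = SPACED_RUN.match(text, m.end())
    if run is not None:
        # Letter-spaced continuation: drop the separating spaces.
        doi += run.group(0).replace(' ', '')
    return doi


spaced = "doi:10.1016/S 0 0 1 0 - 0 2 7 7 ( 0 2 ) 0 0 2 3 5 - 4\n"
print(guess_doi(spaced))
print(guess_doi("doi:10.1000/abc123 Published by Example Press"))
```

On the text from this report the sketch yields
10.1016/S0010-0277(02)00235-4, and on the second input it leaves
10.1000/abc123 alone because "Published" is not a single character. It can
still go wrong: a DOI followed by three or more genuinely separate
one-character tokens would be glued onto the DOI, and a DOI whose first
characters after the slash are themselves letter-spaced would not be matched
by the prefix pattern at all.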
-- System Information:
Debian Release: lenny/sid
APT prefers unstable
APT policy: (500, 'unstable'), (500, 'testing'), (1, 'experimental')
Architecture: i386 (i686)
Kernel: Linux 2.6.18-4-686 (SMP w/2 CPU cores)
Locale: LANG=en_US, LC_CTYPE=en_US (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash
Versions of packages referencer depends on:
ii libart-2.0-2 2.3.19-3 Library of functions for 2D graphi
ii libatk1.0-0 1.18.0-2 The ATK accessibility toolkit
ii libbonobo2-0 2.18.0-2 Bonobo CORBA interfaces library
ii libbonoboui2-0 2.18.0-5 The Bonobo UI library
ii libboost-regex1.33.1 1.33.1-10 regular expression library for C++
ii libc6 2.5-7 GNU C Library: Shared libraries
ii libcairo2 1.4.6-1 The Cairo 2D vector graphics libra
ii libfontconfig1 2.4.2-1.2 generic font configuration library
ii libgcc1 1:4.1.2-6 GCC support library
ii libgconf2-4 2.18.0.1-3 GNOME configuration database syste
ii libgconfmm-2.6-1c2 2.14.2-1 C++ wrappers for GConf (shared lib
ii libglade2-0 1:2.6.0-4 library to load .glade files at ru
ii libglademm-2.4-1c2a 2.6.2-2 C++ wrappers for libglade2 (shared
ii libglib2.0-0 2.12.12-1 The GLib library of C routines
ii libglibmm-2.4-1c2a 2.12.7-1 C++ wrapper for the GLib toolkit (
ii libgnome-keyring0 0.8.1-2 GNOME keyring services library
ii libgnome-vfsmm-2.6-1c2a 2.14.0-1 C++ wrappers for GnomeVFS (shared
ii libgnome2-0 2.18.0-4 The GNOME 2 library - runtime file
ii libgnomecanvas2-0 2.14.0-2 A powerful object-oriented display
ii libgnomecanvasmm-2.6-1c2a 2.14.0-1 C++ wrappers for libgnomecanvas2 (
ii libgnomemm-2.6-1c2 2.14.0-1 C++ wrappers for libgnome (shared
ii libgnomeui-0 2.18.1-2 The GNOME 2 libraries (User Interf
ii libgnomeuimm-2.6-1c2a 2.14.0-1 C++ wrappers for libgnomeui (share
ii libgnomevfs2-0 1:2.18.1-2 GNOME Virtual File System (runtime
ii libgtk2.0-0 2.10.12-1 The GTK+ graphical user interface
ii libgtkmm-2.4-1c2a 1:2.8.8-1 C++ wrappers for GTK+ 2.4 (shared
ii libice6 1:1.0.3-2 X11 Inter-Client Exchange library
ii liborbit2 1:2.14.7-0.1 libraries for ORBit2 - a CORBA ORB
ii libpango1.0-0 1.16.4-1 Layout and rendering of internatio
ii libpoppler0c2 0.4.5-5.1 PDF rendering library
ii libpopt0 1.10-3 lib for parsing cmdline parameters
ii libsigc++-2.0-0c2a 2.0.17-2 type-safe Signal Framework for C++
ii libsm6 1:1.0.2-2 X11 Session Management library
ii libstdc++6 4.1.2-6 The GNU Standard C++ Library v3
ii libx11-6 2:1.0.3-7 X11 client-side library
ii libxcursor1 1:1.1.8-2 X cursor management library
ii libxext6 1:1.0.3-2 X11 miscellaneous extension librar
ii libxfixes3 1:4.0.3-2 X11 miscellaneous 'fixes' extensio
ii libxi6 1:1.0.1-4 X11 Input extension library
ii libxinerama1 1:1.0.2-1 X11 Xinerama extension library
ii libxml2 2.6.28.dfsg-1 GNOME XML library
ii libxrandr2 2:1.2.1-1 X11 RandR extension library
ii libxrender1 1:0.9.2-1 X Rendering Extension client libra
referencer recommends no packages.
-- no debconf information
BT
7.9702 0 0 7.9702 340.5542 597.3164 Tm
[(www.elsev)11.4(ier.com/locate/co)8.9(gnit)]TJ
-32.0589 -63.7337 TD
[(0010-0277)15.5(/03/$)-299.5(-)-300.1(see)-293(front)-300.7(matter)]TJ
/F4 1 Tf
13.9915 0 TD
(\001)Tj
/F1 1 Tf
1.1666 0 TD
[(2003)-297.5(Elsevier)-289.8(Science)-293.2(B.V.)-299.7(All)-299.1(rights)-294.9(reserved.)]TJ
-15.1581 -1.2448 TD
[(doi:10.1016/)14.5(S)-95.3(0)]TJ
6.3307 0 TD
0.0983 Tc
[(010-0277\(02\)00)-6.3(235-4)]TJ
ET
--- End Message ---