Re: Alternative to Debian Repository - extract CSV formatted data from PDF
On Sun 23 Feb 2025 at 22:13:55 (+0700), Max Nikulin wrote:
> On 22/02/2025 05:02, David Wright wrote:
> >
> > With mupdf, I don't even
> > know how to copy, as the mouse just drags the page around.
>
> I have not tried it, but...
> https://manpages.debian.org/bookworm/mupdf/mupdf.1.en.html#Right~2

I'm not sure how I missed that. But pasting the region gives a single
column, which then has to be reassembled. That's not difficult, but
it does mean finding the starts of the total lines as they're unmarked.

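To illustrate the reassembly step, here is a minimal sketch. It rests on an assumption that is not stated anywhere in this thread: that each row starts with a text label and is followed only by numeric cells, so any non-numeric line marks the start of a new row. The function name `rejoin` and the sample values are invented for the example.

```shell
#!/bin/sh
# Sketch only: rebuild rows from a one-cell-per-line paste.
# ASSUMPTION: a line that is not a plain number starts a new row.
rejoin() {
    awk '
        # Non-numeric line: flush the previous row and start a new one.
        !/^-?[0-9][0-9.,]*$/ { if (row != "") print row; row = $0; next }
        # Numeric line: append it to the current row as a new tab-separated cell.
        { row = (row == "" ? $0 : row "\t" $0) }
        END { if (row != "") print row }
    '
}

# Invented sample data, one cell per line as pasted.
printf '%s\n' 'Rent' '500.00' '520.00' 'Total' '1,020.00' | rejoin
```

If the row starts cannot be recognised by a pattern like this, the regular expression is the part that would need adapting.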
> > On Fri 21 Feb 2025 at 09:53:46 (+0700), Max Nikulin wrote:
> > > When a text file has properly aligned columns, instead of
> > > "quoting" some spaces, it may be better to add TAB characters at
> > > certain positions on each line. Perhaps LibreOffice Calc even has a GUI
> > > to select column widths when importing text files.
> >
> > Yes, gnumeric has that too. But I would hate to have a lot of
> > mousework if I were repeating this frequently. And for a
> > postprandial one-off, I just took a no-tools approach
> > (barring an editor, of course).
>
> Maybe I have missed something, but your trick with "=" is not
> necessary. For tab-separated values
>
> sed -e 's/^ \{10\}/.&/' -e 's/^ \+//' -e 's/ \+/\t/g' /tmp/es-7.txt
>
> is not perfect, but should be acceptable.

It was insurance, lest I needed to use comma delimiters. Also,
other people may have different tools, by choice or availability.

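For anyone who wants to see what the quoted sed command does, it can be tried on a couple of lines. The sample input below is invented, not taken from es-7.txt, and the wrapper name `to_tsv` is only for illustration. Note that `\+` and `\t` are GNU sed extensions, so this is a sketch that assumes GNU sed.

```shell
#!/bin/sh
# Sketch only: the quoted sed expressions wrapped in a function (GNU sed assumed).
to_tsv() {
    # 1. A line starting with 10 spaces gets a "." placeholder for its empty first cell.
    # 2. Any remaining leading spaces are removed.
    # 3. Every run of spaces becomes one tab.
    sed -e 's/^ \{10\}/.&/' -e 's/^ \+//' -e 's/ \+/\t/g'
}

# Invented sample: a header line and a line whose first column is empty.
printf '%s\n' \
    '   Item  Qty  Price' \
    '          12   3.50' | to_tsv
```

The second sample line comes out as `.`, `12` and `3.50` separated by tabs. A cell that contains a space is split in two by the third expression, which is one way in which the result is not perfect.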
> I am sure there should be ready-to-use tools that extract tables from
> PDF and from aligned text. Out of curiosity I tried to create a small
> Python script to process the text you attached earlier. It does not try to
> join text for multiline cells. The input file requires a couple of
> corrections to avoid overlapped text and a stray column. The heuristics
> may be improved.

I tend to scrape with bash scripts, using temporary intermediate files
between each step. When the page format changes, as it inevitably does,
the intermediates act as a script trace, making it easier to adapt.

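A minimal sketch of that style, assuming the text has already been extracted from the PDF. The function name `scrape`, the three steps and the file names are all hypothetical; a real script would have whatever steps the page layout demands.

```shell
#!/bin/sh
# Sketch only: each step reads the previous intermediate and writes its own,
# so the files left in the work directory show what every step produced.
scrape() {
    input=$1   # text already extracted from the PDF
    work=$2    # directory that keeps the intermediates
    mkdir -p "$work" || return 1

    # Step 1: drop blank lines.
    grep -v '^[[:space:]]*$' "$input" > "$work/1-nonblank.txt" || return 1
    # Step 2: trim leading and trailing whitespace.
    sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
        "$work/1-nonblank.txt" > "$work/2-trimmed.txt" || return 1
    # Step 3: turn each run of spaces into a single tab.
    tr -s ' ' '\t' < "$work/2-trimmed.txt" > "$work/3-tabbed.tsv" || return 1
}

# Invented sample input, run in a throwaway directory.
demo=$(mktemp -d)
printf '%s\n' '  Item   Qty  ' '' '  Nuts    12  ' > "$demo/in.txt"
scrape "$demo/in.txt" "$demo/work"
cat "$demo/work/3-tabbed.tsv"
```

When the output goes wrong after a format change, comparing `1-nonblank.txt`, `2-trimmed.txt` and `3-tabbed.tsv` shows which step needs adjusting.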
Cheers,
David.