Re: convert html to xml

To: Max Nikulin <manikulin@gmail.com>
Cc: debian-user@lists.debian.org
Subject: Re: convert html to xml
From: "Russell L. Harris" <russell@rlharris.org>
Date: Mon, 1 Sep 2025 18:36:19 +0000
Message-id: <20250901183619.jl5znezkhobtgwme@rlharris.org>
In-reply-to: <108ucd4$4uk$1@ciao.gmane.io>
References: <20250830041717.34ojpsyvr6mhxfia@rlharris.org> <108ucd4$4uk$1@ciao.gmane.io>

On Sat, Aug 30, 2025 at 03:22:58PM +0700, Max Nikulin wrote:

For me it is not uncommon to get PDF files in search results. That is why I suspect that something is wrong with your PDF's. Are they generated to be sent to printer or to be published on a web site? Does "pdftotext FILE.PDF -" is able to extract readable text? Does "pdfinfo FILE.PDF" list author, title, etc.? Are links to these files have descriptive context?



Max,

I am very grateful for your diagnosis.  I was unaware of metadata for
PDF.

With a bit of searching, I located several authoritative articles on
metadata for PDF.

It turns out that the hyperref package for LaTeX provides options for
the metadata fields, along with instructions for setting them.  There
is also a paper by Karl Rupp, "PDF Metadata in LaTeX Documents".
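
For anyone following the thread, here is a minimal sketch of how the
metadata fields might be set with hyperref.  The option names
(pdftitle, pdfauthor, pdfsubject, pdfkeywords) are hyperref's own; the
values and the document class are placeholders I made up for
illustration, not taken from my actual files.

```latex
\documentclass{article}
\usepackage{hyperref}

% Placeholder values only; substitute the real title, author, etc.
\hypersetup{
  pdftitle={Example Title},
  pdfauthor={Example Author},
  pdfsubject={Example subject line},
  pdfkeywords={first keyword, second keyword}
}

\begin{document}
Body text.
\end{document}
```

After compiling with pdflatex, running "pdfinfo" on the resulting PDF,
as Max suggested, should list the Title, Author, Subject, and Keywords
fields.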

RLH
