Re: convert html to xml

To: debian-user@lists.debian.org
Subject: Re: convert html to xml
From: Max Nikulin <manikulin@gmail.com>
Date: Mon, 1 Sep 2025 10:09:12 +0700
Message-id: <10932or$60h$1@ciao.gmane.io>
In-reply-to: <20250831075157.po5zu2waasoqormb@rlharris.org>
References: <20250830041717.34ojpsyvr6mhxfia@rlharris.org> <108ucd4$4uk$1@ciao.gmane.io> <20250831075157.po5zu2waasoqormb@rlharris.org>

On 31/08/2025 14:51, Russell L. Harris wrote:

On Sat, Aug 30, 2025 at 03:22:58PM +0700, Max Nikulin wrote:

Is "pdftotext FILE.PDF -" able to extract readable text?


Yes.


OK, I was afraid that TeX-specific encoding may be an obstacle for crawlers.

Does "pdfinfo FILE.PDF" list author, title, etc.?


$ pdfinfo journal-011.pdf
Creator:          TeX output 2025.08.15:0329
Producer:        dvipdfm (20211117)
CreationDate:    Fri Aug 15 03:29:17 2025 UTC

That was a hint that you need to add metadata: the output above has no Title or Author field. Metadata helps search engines create meaningful entries when they present search results.
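
A minimal sketch of how such metadata might be set, assuming the document is LaTeX and the hyperref package is already loaded (the \href command quoted below suggests so). Every value is a hypothetical placeholder, not taken from the actual journal:

```latex
% Sketch only: all values are placeholders to be replaced with real ones.
% Assumes hyperref is already loaded in the preamble with the driver
% option that matches the DVI-to-PDF converter in use.
\hypersetup{
  pdftitle={Journal No. 11},
  pdfauthor={Author Name},
  pdfsubject={One-line summary of the issue},
  pdfkeywords={keyword one, keyword two}
}
```

After rebuilding the PDF, "pdfinfo FILE.PDF" can be run again to check whether Title, Author, Subject and Keywords lines now appear.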

Do links to these files have descriptive context?


Each article has a summary or abstract, followed by several links:

\href{http://www.gospelbroadcasting.org/journal/pdf/journal-011.pdf}{[~View
or Download PDF (journal-011.pdf)~]}\\~\\

Does it mean that some files are linked from PDF files only? I have no idea if search engines parse links in PDFs. If I were struggling with visibility issues, I would ensure that the files are directly accessible from HTML pages.

The problem is finding a way to import the studies into WordPress.


You asked this earlier.


Please forgive the duplication. I lost track.

From my point of view, in both threads people tried their best to help you, and their suggestions may be used at least as additional keywords for search engine queries. The problem is that you are seeking a recipe specific to WordPress. That is off-topic on a Debian-related mailing list. (I do not believe in magic; I expect that other CMSs and static site generators may be configured to achieve results similar to WordPress.)
