Re: convert html to xml

To: Max Nikulin <manikulin@gmail.com>
Cc: debian-user@lists.debian.org
Subject: Re: convert html to xml
From: "Russell L. Harris" <russell@rlharris.org>
Date: Sun, 31 Aug 2025 07:51:57 +0000
Message-id: <20250831075157.po5zu2waasoqormb@rlharris.org>
In-reply-to: <108ucd4$4uk$1@ciao.gmane.io>
References: <20250830041717.34ojpsyvr6mhxfia@rlharris.org> <108ucd4$4uk$1@ciao.gmane.io>

On Sat, Aug 30, 2025 at 03:22:58PM +0700, Max Nikulin wrote:

For me it is not uncommon to get PDF files in search results. That is why I suspect that something is wrong with your PDFs. Are they generated to be sent to a printer or to be published on a web site?


They are produced with dvipdfm.  I do not see in the dvipdfm man page
a printer/web switch.

Is "pdftotext FILE.PDF -" able to extract readable text?


Yes.

Does "pdfinfo FILE.PDF" list author, title, etc.?


$ pdfinfo journal-011.pdf
Creator:         TeX output 2025.08.15:0329
Producer:        dvipdfm (20211117)
CreationDate:    Fri Aug 15 03:29:17 2025 UTC
Custom Metadata: no
Metadata Stream: no
Tagged:          no
UserProperties:  no
Suspects:        no
Form:            none
JavaScript:      no
Pages:           8
Encrypted:       no
Page size:       612 x 792 pts (letter)
Page rot:        0
File size:       61997 bytes
Optimized:       no
PDF version:     1.5
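
Note that the output above contains no Title or Author line.  As a rough illustration, here is a minimal Python sketch that parses pdfinfo-style "Key: value" text and reports which descriptive fields are absent.  The function names and the list of wanted fields are assumptions chosen for this example; the sample text is an excerpt of the output above.

```python
def parse_pdfinfo(output):
    """Split pdfinfo-style 'Key: value' lines into a dictionary."""
    fields = {}
    for line in output.splitlines():
        # Split on the first colon only, so values such as
        # "TeX output 2025.08.15:0329" keep their own colons.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def missing_fields(fields, wanted=("Title", "Author", "Subject", "Keywords")):
    """Return the wanted field names that are absent or empty."""
    return [name for name in wanted if not fields.get(name)]


# Excerpt of the pdfinfo output for journal-011.pdf shown above.
SAMPLE = """\
Creator:         TeX output 2025.08.15:0329
Producer:        dvipdfm (20211117)
CreationDate:    Fri Aug 15 03:29:17 2025 UTC
Pages:           8
"""

if __name__ == "__main__":
    print(missing_fields(parse_pdfinfo(SAMPLE)))
```

To check a real file, the text could come from subprocess.run(["pdfinfo", "journal-011.pdf"], capture_output=True, text=True).stdout instead of SAMPLE.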

Do links to these files have descriptive context?


Each article has a summary or abstract, followed by several links:

\href{http://www.gospelbroadcasting.org/journal/pdf/journal-011.pdf}{[~View
or Download PDF (journal-011.pdf)~]}\\~\\
\bigskip
\href{http://www.gospelbroadcasting.org/journal/epub/journal-011.epub}{[~View
or Download EPUB (journal-011.epub)~]}\\~\\
\bigskip
\href{http://www.gospelbroadcasting.org/journal/html/journal-011/journal-011.html}{[~View
or Download HTML (journal-011.html)~]}\\~\\

The problem is finding a way to import the studies into WordPress.


You have asked this earlier.


Please forgive the duplication.  I lost track.

It seems that active subscribers on this list do not have this specific
experience. I expect there are plenty of ways to import content into
WordPress. You may ask your question in some LaTeX community. You may
ask in some WordPress community what formats are suitable for import
(perhaps there are not many participants familiar with LaTeX
there).


It appears that an XML file is the easiest way to import lengthy
material into WordPress.  And it turns out that latexml converts LaTeX
into XML.

I was not aware of latexml until a day or two ago.  My plan of attack
was to convert LaTeX to HTML, then convert HTML to XML, and finally,
import XML into WordPress.

My original thought was to create a "quick-and-dirty" WordPress blog
site with WordPress-only files, but with links to
gospelbroadcasting.org.  Each study, whether two pages or twenty,
would be a blog post on the WordPress web site.

People serious about Bible study could go to gospelbroadcasting.org to
download PDF, EPUB, or HTML files.  And with all the spiffy
S.E.O. tools available for WordPress, I figured S.E.O. of the
WordPress site would be easy.

But this recent exchange has me reconsidering.  I am grateful for the
critique.

Do you realize that XML is a rather generic data format? You need some specific format *based* on XML.


All I know is that a number of WordPress support sites recommend the
use of "XML" but do not provide further detail.  WordPress is a
strange environment.
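
For what it is worth, the XML that the WordPress importer reads is WXR (WordPress eXtended RSS): an RSS 2.0 file with extra wp:, content:, and dc: elements.  The XML that latexml writes follows latexml's own schema, so it would still need converting.  Below is a minimal Python sketch of building such a file with the standard library.  The function name, the dictionary keys, and the sample post are assumptions for illustration, and the importer may expect more fields (dates, slugs, author records) than are shown, so anything like this should be tried on a scratch site first.

```python
import xml.etree.ElementTree as ET

# Namespaces used by WordPress export (WXR 1.2) files.
NS = {
    "content": "http://purl.org/rss/1.0/modules/content/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "wp": "http://wordpress.org/export/1.2/",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, tag):
    """Build a namespace-qualified tag name for ElementTree."""
    return "{%s}%s" % (NS[prefix], tag)


def build_wxr(site_title, site_url, posts):
    """Return a WXR-style document with one <item> per post (hypothetical helper)."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, q("wp", "wxr_version")).text = "1.2"
    for post in posts:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = post["title"]
        ET.SubElement(item, q("dc", "creator")).text = post["author"]
        # ElementTree escapes the HTML as entities; real exports wrap it
        # in CDATA instead, which an XML parser reads as the same text.
        ET.SubElement(item, q("content", "encoded")).text = post["html"]
        ET.SubElement(item, q("wp", "post_type")).text = "post"
        ET.SubElement(item, q("wp", "status")).text = "draft"
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    # Sample post: title and author are made up; the link is from the site.
    sample = {
        "title": "Journal 011",
        "author": "rlh",
        "html": '<p>Summary of the study.</p><p><a href="http://www.'
                'gospelbroadcasting.org/journal/pdf/journal-011.pdf">'
                'View or Download PDF</a></p>',
    }
    print(build_wxr("Studies", "http://www.gospelbroadcasting.org", [sample]))
```

Marking each post as a draft keeps imported material out of public view until it has been checked in the WordPress editor.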

RLH
