
Re: convert html to xml



On Sat, Aug 30, 2025 at 03:22:58PM +0700, Max Nikulin wrote:

For me it is not uncommon to get PDF files in search results. That is why I suspect that something is wrong with your PDFs. Are they generated to be sent to a printer or to be published on a web site?

They are produced with dvipdfm.  I do not see in the dvipdfm man page
a printer/web switch.

Is "pdftotext FILE.PDF -" able to extract readable text?

Yes.

Does "pdfinfo FILE.PDF" list author, title, etc.?

$ pdfinfo journal-011.pdf
Creator:          TeX output 2025.08.15:0329
Producer:        dvipdfm (20211117)
CreationDate:    Fri Aug 15 03:29:17 2025 UTC
Custom Metadata: no
Metadata Stream: no
Tagged:          no
UserProperties:  no
Suspects:        no
Form:            none
JavaScript:      no
Pages:           8
Encrypted:       no
Page size:       612 x 792 pts (letter)
Page rot:        0
File size:       61997 bytes
Optimized:       no
PDF version:     1.5
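
The listing above has no Title, Author, Subject, or Keywords line, which may be what the question was probing. As a rough sketch, the same check could be scripted over a whole directory of PDFs. The helper names (parse_pdfinfo, missing_metadata, check_pdf) and the list of wanted fields are my own assumptions; the script only relies on pdfinfo printing "Key: value" lines as shown.

```python
import subprocess


def parse_pdfinfo(text):
    """Turn pdfinfo's 'Key: value' lines into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so timestamps in the value survive.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def missing_metadata(info, wanted=("Title", "Author", "Subject", "Keywords")):
    """Return the wanted fields that are absent or empty."""
    return [key for key in wanted if not info.get(key)]


def check_pdf(path):
    """Run pdfinfo on one file (requires pdfinfo on PATH)."""
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return missing_metadata(parse_pdfinfo(result.stdout))


# A shortened copy of the output quoted above, used as sample input.
sample = """Creator:          TeX output 2025.08.15:0329
Producer:        dvipdfm (20211117)
CreationDate:    Fri Aug 15 03:29:17 2025 UTC
Pages:           8
"""

if __name__ == "__main__":
    print(missing_metadata(parse_pdfinfo(sample)))
```

Calling check_pdf("journal-011.pdf") would run the real tool instead of the sample text.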

Do links to these files have descriptive context?

Each article has a summary or abstract, followed by several links:

\href{http://www.gospelbroadcasting.org/journal/pdf/journal-011.pdf}{[~View
or Download PDF (journal-011.pdf)~]}\\~\\
\bigskip
\href{http://www.gospelbroadcasting.org/journal/epub/journal-011.epub}{[~View
or Download EPUB (journal-011.epub)~]}\\~\\
\bigskip
\href{http://www.gospelbroadcasting.org/journal/html/journal-011/journal-011.html}{[~View
or Download HTML (journal-011.html)~]}\\~\\

The problem is finding a way to import the studies into WordPress.

You have asked this earlier.

Please forgive the duplication.  I lost track.

It seems that active subscribers on this list do not have this
specific experience. I expect there are plenty of ways to import
content into WordPress. You may ask your question in some LaTeX
community. You may ask in some WordPress community what formats are
suitable for import (perhaps there are not many participants familiar
with LaTeX there).

It appears that an xml file is the easiest way to import lengthy
material into WordPress.  And it turns out that latexml converts LaTeX
into XML.

I was not aware of latexml until a day or two ago.  My plan of attack
was to convert LaTeX to HTML, then convert HTML to XML, and finally,
import XML into WordPress.
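
A minimal sketch of the first two steps, assuming the latexml and latexmlpost commands from the LaTeXML distribution are installed. The input name journal-011.tex and the build directory are placeholders of mine, not names from the actual site. Note that the XML written by latexml is LaTeXML's own document format, so it is an intermediate step rather than something WordPress would read directly.

```python
import subprocess
from pathlib import Path


def latexml_commands(tex_file, out_dir="build"):
    """Build the two command lines: LaTeX -> LaTeXML XML -> HTML5."""
    tex = Path(tex_file)
    out = Path(out_dir)
    xml = out / (tex.stem + ".xml")
    html = out / (tex.stem + ".html")
    return [
        ["latexml", f"--dest={xml}", str(tex)],
        ["latexmlpost", f"--dest={html}", "--format=html5", str(xml)],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Only print the commands; call run_all(...) to execute them.
    for argv in latexml_commands("journal-011.tex"):
        print(" ".join(argv))
```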

My original thought was to create a "quick-and-dirty" WordPress blog
site with WordPress-only files, but with links to
gospelbroadcasting.org.  Each study, whether two pages or twenty,
would be a blog post on the WordPress web site.

People serious about Bible study could go to gospelbroadcasting.org to
download PDF, EPUB, or HTML files.  And with all the spiffy
S.E.O. tools available for WordPress, I figured S.E.O. of the
WordPress site would be easy.

But this recent exchange has me reconsidering.  I am grateful for the
critique.

Do you realize that XML is a rather generic data format? You need some specific format *based* on XML.

All I know is that a number of WordPress support sites recommend the
use of "XML" but do not provide further detail.  WordPress is a
strange environment.
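
As far as I can tell, the "XML" those sites mean is WXR (WordPress eXtended RSS), the format written by the WordPress export tool and read by its importer. Below is a rough sketch of building such a file with the Python standard library. The namespace URIs and element names are what I believe WXR 1.2 uses; the function name build_wxr, the post dictionary keys, the site title, and the author are my own assumptions. A real import may need more fields, so comparing against a file exported from a test WordPress site would be the safest check.

```python
import xml.etree.ElementTree as ET

# Namespaces believed to be used by WXR 1.2 (an assumption, verify
# against a real WordPress export).
NS = {
    "content": "http://purl.org/rss/1.0/modules/content/",
    "dc": "http://purl.org/dc/elements/1.1/",
    "wp": "http://wordpress.org/export/1.2/",
}
for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, tag):
    """Return a namespace-qualified tag name for ElementTree."""
    return "{%s}%s" % (NS[prefix], tag)


def build_wxr(posts, site_title="Example site", author="admin"):
    """Build a WXR-style document; each post dict has title, slug, html."""
    rss = ET.Element("rss", {"version": "2.0"})
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, q("wp", "wxr_version")).text = "1.2"
    for post in posts:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = post["title"]
        ET.SubElement(item, q("dc", "creator")).text = author
        # ElementTree escapes the HTML; real exports use CDATA instead,
        # which is equivalent as far as XML parsing goes.
        ET.SubElement(item, q("content", "encoded")).text = post["html"]
        ET.SubElement(item, q("wp", "post_name")).text = post["slug"]
        ET.SubElement(item, q("wp", "post_type")).text = "post"
        ET.SubElement(item, q("wp", "status")).text = "draft"
    return ET.tostring(rss, encoding="unicode")


if __name__ == "__main__":
    pdf = "http://www.gospelbroadcasting.org/journal/pdf/journal-011.pdf"
    print(build_wxr([{
        "title": "Journal 011",
        "slug": "journal-011",
        "html": '<p>Summary here.</p><p><a href="%s">PDF</a></p>' % pdf,
    }]))
```

Posts are marked as drafts here so that nothing is published until it has been reviewed inside WordPress.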

RLH

