Re: convert html to xml
On 31/08/2025 14:51, Russell L. Harris wrote:
On Sat, Aug 30, 2025 at 03:22:58PM +0700, Max Nikulin wrote:
Does "pdftotext FILE.PDF -" is able to extract readable text?
Yes.
OK, I was afraid that TeX-specific encoding may be an obstacle for crawlers.
Does "pdfinfo FILE.PDF" list author, title, etc.?
$ pdfinfo journal-011.pdf
Creator: TeX output 2025.08.15:0329
Producer: dvipdfm (20211117)
CreationDate: Fri Aug 15 03:29:17 2025 UTC
It was a hint that you need to add metadata. Help search engines to
create meaningful entries when they are presenting search results.
Are links to these files have descriptive context?
Each article has a summary or abstract, followed by several links:
\href{http://www.gospelbroadcasting.org/journal/pdf/journal-011.pdf}{[~View
or Download PDF (journal-011.pdf)~]}\\~\\
Does it mean that some files are linked from PDF files only? I have no
idea if search engines parse links in PDFs. Struggling with visibility
issues, I would ensure that they are directly accessible from HTML pages.
The problem is finding a way to import the studies into WordPress.
You have asked it earlier.
Please forgive the duplication. I lost track.
From my point of view, in both thread people tried their best to help
you and suggestions may be used at least as additional keywords for
search engine queries. The problem is that you are seeking a recipe
specific to WordPress. It is an off-topic on a Debian-related mailing
list. (I do not believe in magic, I expect that other CMS and static
site generators may be configured to achieve results similar to WordPress.)
Reply to: