
Re: convert html to xml



On 31/08/2025 14:51, Russell L. Harris wrote:
On Sat, Aug 30, 2025 at 03:22:58PM +0700, Max Nikulin wrote:
Is "pdftotext FILE.PDF -" able to extract readable text?

Yes.

OK. I was afraid that TeX-specific encoding might be an obstacle for crawlers.

Does "pdfinfo FILE.PDF" list author, title, etc.?

$ pdfinfo journal-011.pdf
Creator:          TeX output 2025.08.15:0329
Producer:        dvipdfm (20211117)
CreationDate:    Fri Aug 15 03:29:17 2025 UTC

That was a hint that you need to add metadata. Author and title fields help search engines create meaningful entries when they present search results.
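
To check many files at once, the pdfinfo output can be parsed and the absent fields listed. What follows is only a sketch in Python. The field list in WANTED_FIELDS is my assumption about what is worth filling in, and parse_pdfinfo and missing_fields are names invented for this illustration.

```python
# Fields worth checking; this particular list is an assumption.
WANTED_FIELDS = ("Title", "Author", "Subject", "Keywords")


def parse_pdfinfo(text):
    """Turn pdfinfo output ("Key:   value" lines) into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, since values may contain colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def missing_fields(text, wanted=WANTED_FIELDS):
    """Return the wanted fields that are absent or empty in the output."""
    info = parse_pdfinfo(text)
    return [name for name in wanted if not info.get(name)]


# Output quoted earlier in this thread for journal-011.pdf.
sample = """\
Creator:          TeX output 2025.08.15:0329
Producer:        dvipdfm (20211117)
CreationDate:    Fri Aug 15 03:29:17 2025 UTC
"""
print(missing_fields(sample))
```

For the output quoted above it reports all four fields as missing. To run it against real files, the text could come from subprocess.run(["pdfinfo", path], capture_output=True, text=True).stdout.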

Do links to these files have descriptive context?

Each article has a summary or abstract, followed by several links:

\href{http://www.gospelbroadcasting.org/journal/pdf/journal-011.pdf}{[~View
or Download PDF (journal-011.pdf)~]}\\~\\

Does it mean that some files are linked from PDF files only? I have no idea whether search engines parse links in PDFs. If I were struggling with visibility issues, I would ensure that the files are directly accessible from HTML pages.
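
As an illustration of direct, descriptive links, a plain HTML list can be generated from the same data that feeds the PDF summaries. This is a sketch under assumptions, not a recipe for any particular CMS: link_list is a made-up helper, and the title and summary strings are placeholders rather than the real article data.

```python
from html import escape


def link_list(articles):
    """Build an HTML list whose link text is the article title."""
    items = []
    for url, title, summary in articles:
        items.append(
            '<li><a href="{}">{}</a>: {}</li>'.format(
                escape(url, quote=True), escape(title), escape(summary)
            )
        )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


articles = [
    (
        "http://www.gospelbroadcasting.org/journal/pdf/journal-011.pdf",
        "Placeholder title of the article",     # hypothetical
        "Placeholder one-sentence abstract.",   # hypothetical
    ),
]
print(link_list(articles))
```

The point is that the anchor text carries the title instead of a generic "View or Download PDF", so a crawler sees what the file is about on the linking page itself.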

The problem is finding a way to import the studies into WordPress.

You asked about that earlier.

Please forgive the duplication. I lost track.

From my point of view, in both threads people tried their best to help you, and their suggestions can at least serve as additional keywords for search engine queries. The problem is that you are seeking a recipe specific to WordPress, which is off-topic on a Debian-related mailing list. (I do not believe in magic; I expect that other CMSs and static site generators can be configured to achieve results similar to WordPress.)

