Re: generate a rss.xml from a bunch of HTML files

To: Darac Marjal <mailinglist@darac.org.uk>
Cc: debian-user@lists.debian.org
Subject: Re: generate a rss.xml from a bunch of HTML files
From: Dan Ritter <dsr@randomstring.org>
Date: Mon, 10 May 2021 06:16:55 -0400
Message-id: <20210510101655.cpxwoobtd2cgms7s@randomstring.org>
In-reply-to: <ff8ce332-f473-307a-265c-97190f3156f4@darac.org.uk>
References: <20210509130048.kv6ajqwujdghs5q4@acr13.nuvreauspam> <874kfc2g02.fsf@zoho.eu> <20210509170502.r7h3xlrbh4g4ijrd@randomstring.org> <87mtt3zsc5.fsf@zoho.eu> <20210509193632.hlkuzotkcckdea73@randomstring.org> <6098563E.609@fastmail.fm> <20210509170107.2ec0ef51@hawk.localdomain> <87r1ifxyjj.fsf@zoho.eu> <20210510060624.dukl3kv3qgujqvud@acr13.nuvreauspam> <ff8ce332-f473-307a-265c-97190f3156f4@darac.org.uk>

Darac Marjal wrote: 
> 
> On 10/05/2021 07:06, Andrei POPESCU wrote:
> > On Lu, 10 mai 21, 01:44:32, Emanuel Berg wrote:
> >> Charles Curley wrote:
> >>
> >>> Right. However, as I found out asking elsewhere, you can
> >>> include HTML in Markdown.
> >> Hehehe, let's see, first write HTML, then include it in
> >> Markdown, then have the static site generator generate
> >> HTML... brilliant :)
> > Surely there must be some site generator with RSS support that takes 
> > "plain" HTML as input.
> 
> I would guess that there isn't, purely because the task of figuring out
> what information to extract is relatively awkward. OK, there are some
> easy tasks such as "What is the title of the page?" (<title> tag), "What
> is the publication date of the page?" (mtime of the file), but there are
> trickier questions: "Who was the author of this page?" (well, we could
> hope for a meta tag, and fall back to the user running the tool,
> perhaps) and "What's the copyright of the page?" (I'm fairly certain
> there's no standard tag for that in HTML). Finally, we come to the
> tricky bit: the page summary. Most feeds provide a summary of the page
> content to entice readers to read the whole article; one or two
> paragraphs should be sufficient. But if you've ever used the "Reader
> Mode" of a web browser, or ever pointed a screen reader at a web page,
> you'll know that finding the body of the page isn't a 100% accurate task.
> 
> This is why so many site generators prefer you to provide the pieces and
> they'll build up the final HTML. HTML *is* supposed to be a semantic
> language rather than a presentation language (that is, one could argue
> that the first few <p> tags are the first few paragraphs of the page),
> but if you're asking for a tool that can parse arbitrary HTML
> (including machine-generated HTML), then I don't think it's going to be
> easy.


Again, from the basic Pelican documentation, in the section
right after INSTALL:

---

Pelican interprets the HTML in a very straightforward manner,
reading metadata from meta tags, the title from the title tag,
and the body out from the body tag:

<html>
    <head>
        <title>My super title</title>
        <meta name="tags" content="thats, awesome" />
        <meta name="date" content="2012-07-09 22:28" />
        <meta name="modified" content="2012-07-10 20:14" />
        <meta name="category" content="yeah" />
        <meta name="authors" content="Alexis Métaireau, Conan Doyle" />
        <meta name="summary" content="Short version for index and feeds" />
    </head>
    <body>
        This is the content of my super blog post.
    </body>
</html>

With HTML, there is one simple exception to the standard
metadata: tags can be specified either via the tags metadata, as
is standard in Pelican, or via the keywords metadata, as is
standard in HTML. The two can be used interchangeably.

Note that, aside from the title, none of this content metadata
is mandatory: if the date is not specified and DEFAULT_DATE is
set to 'fs', Pelican will rely on the file’s “mtime” timestamp,
and the category can be determined by the directory in which the
file resides. For example, a file located at
python/foobar/myfoobar.rst will have a category of foobar. If
you would like to organize your files in other ways where the
name of the subfolder would not be a good category name, you can
set the setting USE_FOLDER_AS_CATEGORY to False. When parsing
dates given in the page metadata, Pelican supports the W3C’s
suggested subset of ISO 8601.

So the title is the only required metadata. If that bothers you,
worry not. Instead of manually specifying a title in your
metadata each time, you can use the source content file name as
the title. For example, a Markdown source file named Publishing
via Pelican.md would automatically be assigned a title of
Publishing via Pelican. If you would prefer this behavior, add
the following line to your settings file:
---
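
To make the rules in that excerpt concrete, here is a rough Python
sketch of the same idea: read the title from the title tag, read
metadata from meta tags (treating "keywords" as an alias for "tags"),
fall back to the file name for a missing title and to the containing
directory for a missing category, then emit one RSS item. This is not
Pelican's code. The names MetaExtractor, extract_metadata and rss_item
are made up for illustration, the link is a placeholder, and pubDate
is left out to keep it short.

```python
from html.parser import HTMLParser
from pathlib import PurePosixPath
import xml.etree.ElementTree as ET


class MetaExtractor(HTMLParser):
    """Collect the <title> text and <meta name=... content=...> pairs."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name"):
            name = attrs["name"].lower()
            # HTML's "keywords" is treated as an alias for "tags".
            if name == "keywords":
                name = "tags"
            self.meta[name] = attrs.get("content") or ""

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_metadata(html_text, source_path):
    """Return a metadata dict for one HTML file (hypothetical helper)."""
    parser = MetaExtractor()
    parser.feed(html_text)
    parser.close()
    path = PurePosixPath(source_path)
    meta = dict(parser.meta)
    # No <title>? Use the file name without its extension.
    meta["title"] = parser.title.strip() or path.stem
    # No category meta tag? Use the name of the containing directory.
    meta.setdefault("category", path.parent.name)
    return meta


def rss_item(meta, link):
    """Serialize one <item> element; 'link' is supplied by the caller."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = meta["title"]
    ET.SubElement(item, "link").text = link
    if "summary" in meta:
        ET.SubElement(item, "description").text = meta["summary"]
    return ET.tostring(item, encoding="unicode")
```

Running extract_metadata() over each file and wrapping the resulting
items in a channel element would give a bare-bones rss.xml. Pages
without a summary meta tag simply get no description here, which
sidesteps the "find the body" problem rather than solving it.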
