Bug#1009971: Our current validation script does not work with html5 files
Hi Laura!
On Thu, 21 Apr 2022 15:53:03 +0200
Laura Arjona Reina <larjona@debian.org> wrote:
> Package: www.debian.org
> User: www.debian.org@packages.debian.org
> Usertag: scripts
> Severity: important
>
> Hi all
> I'm starting to work on bug #980921 (Pages in HTML5) and, as it is
> mentioned there, we need to adapt our "validate" script so it correctly
> processes the pages declared as HTML5 (currently, only the homepage in the
> different languages).
>
> The current status is as follows:
>
> Related scripts:
>
> https://salsa.debian.org/webmaster-team/cron/-/blob/master/lessoften is executed
> once a day, calling (via run-parts) the following script:
> https://salsa.debian.org/webmaster-team/cron/-/blob/master/scripts/999Xvalidate
> which gets the list of languages and folders to process and then calls:
>
> https://salsa.debian.org/webmaster-team/cron/-/blob/master/scripts/validate
>
> Which is the actual script doing the HTML validation, using the onsgmls
> command (part of opensp package).
>
> This command validates an SGML file based on a DTD. The issue (as far as I
> know) is that there is no "official" SGML DTD template to use when parsing
> HTML5 files.
>
> I have tried adapting the "validate" script to be able to recognize the
> DOCTYPE header used for html5 files, and then tried to pass a DTD (I tried
> downloading the ones here http://sgmljs.net/docs/w3c-html5-dtd.html and here
> http://sgmljs.net/docs/w3c-html52-dtd.html and also here
> https://jkorpela.fi/html5-dtd.html ) but couldn't make it work, and I was
> also not convinced it is the best approach.
>
> I've looked at what the W3C validator uses, and it is the Nu Html Checker:
>
> https://validator.w3.org/nu/about.html
> https://github.com/validator/validator/releases/latest
>
> But I'm not sure if this is packaged in Debian in any of its flavours.
>
> I have searched https://packages.debian.org/search?keywords=html5 but none of
> the results looks like a commandline tool that we could call instead of
> onsgmls.
>
> So I don't know what to do at this point.
>
> On my local machine, I have downloaded the vnu.jar file from the latest Nu
> checker release and tried to validate files, and it works. But I don't know
> if asking DSA to install openjdk in www-master and include a copy of vnu.jar
> in our cron scripts is good and/or elegant.
>
> Opinions, advice and patches are very welcome.
>
> Meanwhile, I guess we can modify 999Xvalidate to add file exclusions, and
> exclude, for now, /index.*.html and later the few other files we have with
> html5 tags. I don't know how to exclude the index.*.html files in the top
> folder only and not in subfolders, but I guess playing with find -wholename
> and prune will do the trick (if you know, please go ahead).
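
One possible way to skip only the top-level index.*.html files, sketched in
Python rather than with find(1). The function name and the directory layout
are assumptions made up for illustration; this is not what 999Xvalidate does
today:

```python
import fnmatch
import os


def html_files_to_validate(root):
    """Yield .html files under root, skipping index.*.html in root itself."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # walk subfolders in a stable order
        # os.walk reports the starting folder first, unchanged.
        at_top = dirpath == root
        for name in sorted(filenames):
            if not name.endswith(".html"):
                continue
            # Hypothetical rule: only the top folder's index.<lang>.html
            # pages are HTML5 for now, so only those are excluded.
            if at_top and fnmatch.fnmatch(name, "index.*.html"):
                continue
            yield os.path.join(dirpath, name)
```

Files named index.*.html in subfolders are still yielded, so they would keep
going through the existing onsgmls check.
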
>
> Kind regards,
Perhaps my vnu wrapper will prove of use:
* https://github.com/shlomif/python-vnu_validator
* https://pypi.org/project/vnu-validator/
* https://github.com/shlomif/shlomi-fish-homepage/blob/master/Tests/validate-html-using-vnu.py
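
For reference, a minimal sketch of calling vnu.jar directly from Python. This
is not the API of the wrapper linked above; the function names and the jar
path are placeholders, and it assumes a "java" binary on PATH. The
--errors-only and --format options are the checker's own command-line options:

```python
import subprocess


def build_vnu_command(jar_path, html_files):
    """Return the argv list for running the Nu checker on html_files."""
    return ["java", "-jar", jar_path, "--errors-only", "--format", "gnu",
            *html_files]


def validate(jar_path, html_files):
    """Run vnu.jar and return (ok, report).

    The checker writes its messages to stderr and exits non-zero
    when it finds errors.
    """
    proc = subprocess.run(
        build_vnu_command(jar_path, html_files),
        capture_output=True,
        text=True,
        check=False,
    )
    return proc.returncode == 0, proc.stderr
```

Something like validate("/path/to/vnu.jar", ["index.en.html"]) could then be
called from the cron script for the pages declared as HTML5.
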
--
Shlomi Fish https://www.shlomifish.org/
What Makes Software Apps High Quality - https://shlom.in/sw-quality
<rindolf> Underscores are the most nutritious punctuation. But you also need
to eat letters, digits and whitespace for a balanced diet.
— https://is.gd/pHLcFq
Please reply to list if it's a mailing list post - https://shlom.in/reply .