
Bug#1009971: Our current validation script does not work with html5 files



Hi Laura!

On Thu, 21 Apr 2022 15:53:03 +0200
Laura Arjona Reina <larjona@debian.org> wrote:

> Package: www.debian.org
> User: www.debian.org@packages.debian.org
> Usertag: scripts
> Severity: important
> 
> Hi all
> I'm starting to work on bug #980921 (Pages in HTML5) and, as mentioned
> there, we need to adapt our "validate" script so that it correctly
> processes the pages declared as HTML5 (currently, only the homepage in the
> different languages).
> 
> The current status is as follows:
> 
> Related scripts:
> 
> https://salsa.debian.org/webmaster-team/cron/-/blob/master/lessoften is
> executed once a day, calling (via run-parts) the following script:
> https://salsa.debian.org/webmaster-team/cron/-/blob/master/scripts/999Xvalidate
> which gets the list of languages and folders to process and then calls:
> 
> https://salsa.debian.org/webmaster-team/cron/-/blob/master/scripts/validate
> 
> which is the actual script doing the HTML validation, using the onsgmls
> command (part of the opensp package).
> 
> This command validates an SGML file against a DTD. The issue (as far as I
> know) is that there is no "official" SGML DTD to use when parsing
> HTML5 files.
> 
> I have tried adapting the "validate" script to recognize the
> DOCTYPE header used for html5 files, and then tried to pass a DTD (I tried
> downloading the ones here http://sgmljs.net/docs/w3c-html5-dtd.html and here
> http://sgmljs.net/docs/w3c-html52-dtd.html and also here
> https://jkorpela.fi/html5-dtd.html ) but I couldn't make it work, and I am
> also not convinced it is the best approach.
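
For the DOCTYPE recognition step described above, here is a minimal Python sketch. It is not the logic of the existing "validate" script; the names `HTML5_DOCTYPE`, `is_html5` and `file_is_html5` are hypothetical, and it assumes the doctype is the first non-whitespace content of the file.

```python
import re
from pathlib import Path

# The HTML5 doctype is just "<!DOCTYPE html>" (case-insensitive).
# Older doctypes such as HTML 4.01 carry a PUBLIC identifier after
# "html", so they do not match this pattern.
HTML5_DOCTYPE = re.compile(r"\A\s*<!DOCTYPE\s+html\s*>", re.IGNORECASE)


def is_html5(text: str) -> bool:
    """Return True if the text starts with the HTML5 doctype."""
    return HTML5_DOCTYPE.match(text) is not None


def file_is_html5(path: Path) -> bool:
    """Check only the beginning of the file; the doctype must come first."""
    # utf-8-sig drops a leading byte order mark if one is present.
    with path.open(encoding="utf-8-sig", errors="replace") as fh:
        return is_html5(fh.read(1024))
```

A caller could use the result to decide whether a page goes to onsgmls or to a different checker.
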
> 
> I've looked at what the W3C validator uses, and it is the Nu Html Checker:
> 
> https://validator.w3.org/nu/about.html
> https://github.com/validator/validator/releases/latest
> 
> But I'm not sure if this is packaged in Debian in any of its flavours.
> 
> I have searched https://packages.debian.org/search?keywords=html5 but none of
> the results looks like a command-line tool that we could call instead of
> onsgmls.
> 
> So I don't know what to do at this point.
> 
> On my local machine, I have downloaded the vnu.jar file from the latest Nu
> checker release and tried to validate files, and it works. But I don't know
> if asking DSA to install openjdk in www-master and include a copy of vnu.jar
> in our cron scripts is good and/or elegant.
> 
> Opinions, advice and patches are very welcome.
> 
> Meanwhile, I guess we can modify 999Xvalidate to add file exclusions and
> exclude, for now, /index.*.html, and later the few other files we have with
> html5 tags. I don't know how to exclude the index.*.html files in the top
> folder only and not in subfolders, but I guess playing with find -wholename
> and prune will do the trick (if you know, please go ahead).
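
The "top folder only" exclusion can be sketched as follows. This is Python rather than the find invocation mentioned above, and the function names are hypothetical; it only illustrates the rule "skip index.*.html when its parent is the root itself".

```python
from fnmatch import fnmatch
from pathlib import Path


def is_top_level_index(path: Path, root: Path) -> bool:
    """True for index.*.html files that sit directly in root."""
    return path.parent == root and fnmatch(path.name, "index.*.html")


def files_to_validate(root: Path) -> list[Path]:
    """All *.html files under root except the top-level index.*.html pages."""
    return sorted(
        p
        for p in root.rglob("*.html")
        if p.is_file() and not is_top_level_index(p, root)
    )
```

With this rule, `index.en.html` in the root is skipped, while `News/index.en.html` is still returned.
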
> 
> Kind regards,

Perhaps my vnu wrapper will prove of use:

* https://github.com/shlomif/python-vnu_validator

* https://pypi.org/project/vnu-validator/

* https://github.com/shlomif/shlomi-fish-homepage/blob/master/Tests/validate-html-using-vnu.py
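
For reference, calling vnu.jar directly from a script can be sketched like this. It is not the API of the wrapper linked above; the function names are hypothetical, the jar location is whatever the caller passes in, and the `--format json` and `--stdout` options and the JSON field names reflect my understanding of the checker's command-line interface, so please check them against its usage documentation. It assumes a `java` binary on the PATH.

```python
import json
import subprocess
from pathlib import Path


def build_vnu_command(jar: Path, files: list[Path]) -> list[str]:
    """Command line asking the checker for a JSON report on stdout."""
    return ["java", "-jar", str(jar), "--format", "json", "--stdout",
            *map(str, files)]


def parse_vnu_errors(report: str) -> list[str]:
    """Turn the JSON report into one 'url:line: message' string per error."""
    messages = json.loads(report).get("messages", [])
    return [
        f'{m.get("url", "?")}:{m.get("lastLine", "?")}: {m.get("message", "")}'
        for m in messages
        if m.get("type") == "error"
    ]


def validate(jar: Path, files: list[Path]) -> list[str]:
    """Run the checker and return its error messages (empty list if none)."""
    # The checker exits non-zero when it finds errors, so do not use check=True.
    proc = subprocess.run(build_vnu_command(jar, files),
                          capture_output=True, text=True, check=False)
    if not proc.stdout.strip():
        raise RuntimeError(f"no report from the checker: {proc.stderr.strip()}")
    return parse_vnu_errors(proc.stdout)
```

A cron script could call `validate()` on the HTML5 pages only and keep onsgmls for the rest.
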
 

-- 

Shlomi Fish       https://www.shlomifish.org/
What Makes Software Apps High Quality -  https://shlom.in/sw-quality

<rindolf> Underscores are the most nutritious punctuation. But you also need
to eat letters, digits and whitespace for a balanced diet.
    — https://is.gd/pHLcFq

Please reply to list if it's a mailing list post - https://shlom.in/reply .

