
Bug#1009971: Our current validation script does not work with html5 files



Package: www.debian.org
User: www.debian.org@packages.debian.org
Usertag: scripts
Severity: important

Hi all
I'm starting to work on bug #980921 (Pages in HTML5) and, as mentioned
there, we need to adapt our "validate" script so that it correctly processes the
pages declared as HTML5 (currently, only the homepage in the different languages).

The current status is as follows:

Related scripts:

https://salsa.debian.org/webmaster-team/cron/-/blob/master/lessoften is executed
once a day and calls (via run-parts) the following script:
https://salsa.debian.org/webmaster-team/cron/-/blob/master/scripts/999Xvalidate
which gets the list of languages and folders to process and then calls:

https://salsa.debian.org/webmaster-team/cron/-/blob/master/scripts/validate

This is the script that actually does the HTML validation, using the onsgmls command (part of the opensp package).

This command validates an SGML file against a DTD. The issue (as far as I know) is that there is no "official" SGML DTD to use when parsing HTML5 files.

I have tried adapting the "validate" script to recognize the DOCTYPE declaration used for HTML5 files, and then tried to pass it a DTD (I tried downloading the ones at http://sgmljs.net/docs/w3c-html5-dtd.html , http://sgmljs.net/docs/w3c-html52-dtd.html and https://jkorpela.fi/html5-dtd.html ), but I couldn't make it work, and I was also not convinced that it is the best approach.
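
To illustrate only the DOCTYPE detection step (not the DTD part, which did not work), a check along these lines could tell HTML5 pages apart from the rest. This is a minimal sketch: the function name is_html5 and the decision to look only at the first five lines of the file are assumptions, not what the "validate" script currently does.

```shell
# Hypothetical helper: succeed if the file declares the HTML5 doctype
# (<!DOCTYPE html>, case-insensitive) within its first five lines.
# Older doctypes such as 'HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"' do not
# match, because something other than '>' follows the word "html".
is_html5() {
    head -n 5 "$1" | grep -qiE '<!DOCTYPE[[:space:]]+html[[:space:]]*>'
}
```

It could be used as `if is_html5 "$file"; then ...; fi` to route such files to a different checker, or to skip them.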

I looked at what the W3C validator uses, and it uses the Nu Html Checker:

https://validator.w3.org/nu/about.html
https://github.com/validator/validator/releases/latest

But I'm not sure if this is packaged in Debian in any of its flavours.

I have searched https://packages.debian.org/search?keywords=html5 but none of the results looks like a command-line tool that we could call instead of onsgmls.

So I don't know what to do at this point.

On my local machine, I have downloaded the vnu.jar file from the latest Nu checker release and tried to validate files, and it works. But I don't know whether asking DSA to install openjdk on www-master and including a copy of vnu.jar in our cron scripts is good and/or elegant.
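
For reference, a local test of this kind can be wrapped roughly as follows. This is only a sketch: the function name validate_html5 and the JAVA and VNU_JAR variables are hypothetical, and the location of vnu.jar is an assumption. The --errors-only and --format gnu options are documented options of the checker.

```shell
# Hypothetical wrapper around the Nu checker jar.
# JAVA and VNU_JAR are assumed variables; adjust them to the local setup.
# --errors-only hides warnings; --format gnu prints one message per line
# in a file:line.column style.
validate_html5() {
    "${JAVA:-java}" -jar "${VNU_JAR:-vnu.jar}" --errors-only --format gnu "$@"
}
```

For example, `validate_html5 index.en.html` would check a single file, and the exit status is non-zero when the checker reports errors.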

Opinions, advice and patches are very welcome.

Meanwhile, I guess we can modify 999Xvalidate to add file exclusions and exclude, for now, /index.*.html, and later the few other files we have with HTML5 tags. I don't know how to exclude the index.*.html files in the top folder only and not in subfolders, but I guess playing with find -wholename and -prune will do the trick (if you know, please go ahead).
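
One possible way to express that exclusion with find is sketched below. The function name list_html_to_validate is hypothetical, and this is not how 999Xvalidate currently builds its file list. It uses -path (a synonym of -wholename in GNU find) together with -prune. Since `*` in a -path pattern also matches `/`, the extra `! -path './*/*'` test restricts the exclusion to entries directly in the top folder.

```shell
# Hypothetical helper: list the .html files under a web root, skipping
# index.*.html in the top folder only (subfolders are still listed).
list_html_to_validate() {
    root=${1:-.}
    (
        # Run from inside the root so that paths start with "./".
        cd "$root" || exit 1
        find . \
            -path './index.*.html' ! -path './*/*' -prune \
            -o -type f -name '*.html' -print
    )
}
```

With this, ./index.en.html is skipped while ./News/index.en.html is still listed.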

Kind regards,
-- 
Laura Arjona
https://wiki.debian.org/LauraArjona

