
[PATCH] Fix encoding discrepancies and format in users/*/*.wml files



Hi there!

While looking for possible sponsors for DebConf13[1], I wrote a
quick&dirty shell script (attached, but read [2]) to extract some
information, specifically who sent the request and for which
organization.

[1] <http://lists.debian.org/87objissel.fsf@gismo.pca.it>
[2] I know that a better solution would have been to understand&reuse
    the WML infrastructure (these files are already parsed to generate
    the correct index), but I did not have the time for that, sorry.

I thus discovered some discrepancies:

- the line containing the contact information is not "standard",
  i.e. not always "# From: NAME <EMAIL>".  Moreover, some names were not
  completely "standard" either, e.g. lowercase letters or extra
  quotes[3].

- some files contain HTML-encoded accented characters, while others do
  not, which sounded strange given the README[4] that states:

    Each file in these directories will create a link from the /users/ page,
    showing the content of the <pagetitle> tag. BE CAREFUL - the <pagetitle>
    is added verbatim, that means it MUST NOT contain any 8bit characters (in
    the english tree) because these titles are put into the translated pages
    when there is no translation of the file itself and create wrong
    characters.

    AGAIN: DO NOT put any 8BIT CHARACTERS into the <pagetitle>.

  This seemed even stranger to me, since Debian has been UTF-8-aware for
  a while and the migration of the website to UTF-8 has been completed[5].

[3] I know this could sound like nitpicking, but for automatic parsing
    (and consistency) I consider it a bug.
[4] <http://anonscm.debian.org/viewvc/webwml/webwml/english/users/README?revision=1.4&view=markup>
[5] <http://bugs.debian.org/567781>
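
To spot the first kind of discrepancy without reading every file, a
check along the following lines might help.  It is only a minimal
sketch: the function name check_from_lines is hypothetical, and the
regular expression encodes my assumption of what a "standard"
"# From: NAME <EMAIL>" line looks like (exactly one space around the
name, an address containing an "@", nothing after the closing ">").

```sh
#!/bin/sh
# Hypothetical helper: print every .wml file in the given directory that
# has no line of the assumed standard form "# From: NAME <EMAIL>".
check_from_lines() {
    for f in "$1"/*.wml; do
        # Skip the unexpanded pattern when the directory has no .wml files.
        [ -e "$f" ] || continue
        # Assumed standard form; trailing text or whitespace is rejected.
        if ! grep -q '^# From: [^<>]* <[^<> ]*@[^<> ]*>$' "$f"; then
            echo "$f"
        fi
    done
}
```

Calling, for instance, check_from_lines edu would list the files in
edu/ whose contact line is missing or deviates from that form.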

Two examples:

--8<---------------cut here---------------start------------->8---
Index: com/alcove.wml
===================================================================
RCS file: /cvs/webwml/webwml/english/users/com/alcove.wml,v
retrieving revision 1.2
diff -u -r1.2 alcove.wml
--- com/alcove.wml	10 Sep 2007 07:38:07 -0000	1.2
+++ com/alcove.wml	19 Nov 2012 20:37:37 -0000
@@ -1,12 +1,12 @@
 # From: Yann Dirson <ydirson@fr.alcove.com>
 
-<define-tag pagetitle>Alc&ocirc;ve, France</define-tag>
+<define-tag pagetitle>Alcôve, France</define-tag>
 <define-tag webpage>http://www.alcove.com/</define-tag>
 
 #use wml::debian::users
 
 <p>
-  Here at Alc&ocirc;ve, we use Debian for all of our infrastructure and
+  Here at Alcôve, we use Debian for all of our infrastructure and
   development workstations, totalling over 30 machines.  We also
   recommend Debian to our customers for most situations, although we
   also install other distributions if they so desire.
Index: edu/unieconomicspoznan.wml
===================================================================
RCS file: /cvs/webwml/webwml/english/users/edu/unieconomicspoznan.wml,v
retrieving revision 1.2
diff -u -r1.2 unieconomicspoznan.wml
--- edu/unieconomicspoznan.wml	26 May 2011 10:05:50 -0000	1.2
+++ edu/unieconomicspoznan.wml	19 Nov 2012 20:37:37 -0000
@@ -1,4 +1,4 @@
-# Maciej So³tysiak <maciej.soltysiak@ae.poznan.pl>
+# From: Maciej Sołtysiak <maciej.soltysiak@ae.poznan.pl>
 
 <define-tag pagetitle>University of Economics in Poznan, Poland</define-tag>
 <define-tag webpage>http://www.ae.poznan.pl/</define-tag>
--8<---------------cut here---------------end--------------->8---
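
Files mixing the two pagetitle conventions shown in the first example
could be located with something like the sketch below.  The function
name list_suspect_pagetitles is hypothetical, and the sketch assumes
that the <define-tag pagetitle> tag sits on a single line, as in the
files above.  It also flags harmless entities such as &amp;, so the
output still needs a human eye.

```sh
#!/bin/sh
# Hypothetical helper: print the pagetitle lines (prefixed by file name)
# that contain an HTML character entity such as &ocirc; or any byte
# outside printable ASCII.
list_suspect_pagetitles() {
    # /dev/null is passed as an extra file so that grep always prints
    # the file name, even when only one .wml file matches the glob.
    LC_ALL=C grep '<define-tag pagetitle>' "$1"/*.wml /dev/null 2>/dev/null |
        LC_ALL=C grep -e '&[A-Za-z][A-Za-z0-9]*;' -e '[^ -~]'
}
```

The function returns a non-zero status when nothing suspicious is
found, which matters if it is called from a script using "set -e".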

Since I have already corrected all the entries for the DebConf
sponsors table anyway, I was wondering whether we would like to apply
these changes, which also means that the README[4] file needs to be
corrected.  Obviously, any error resulting from these changes would be
mine ;-)

NB, I have not checked languages other than English nor tried to rebuild
    the full website.  But given that the migration to UTF-8 is
    completed[5], I would be surprised if the above changes generated
    any error.

Comments?

Thx, bye,
Gismo / Luca

#!/bin/sh
#
# extract-debian-users.sh, extract information from webwml files used
# to build www.debian.org/users/ available at
#   <http://anonscm.debian.org/viewvc/webwml/webwml/english/users/>
# Copyright (C) 2012 Luca Capello <luca@pca.it>
# Version:
# 2012-11-19: 0.1


set -e

if [ -z "$1" ]; then
    echo "Usage: $0 directory [committer]"
    exit 1
elif [ ! -d "$1" ]; then
    echo "$1 is not a directory"
    exit 2
else
    # remove trailing '/'
    DIRECTORY=$(echo "$1" | sed -e 's/\/$//')
fi

if [ -n "$2" ]; then
    COMMITTER="$2"
else
    COMMITTER="$USER"
fi

# description of the output
cat <<EOF
From <http://www.debian.org/users/$(basename "$DIRECTORY")>
======================================
EOF

DATE=$(date +%Y-%m-%d)
for I in "$DIRECTORY"/*.wml; do
    FROM=$(grep "^# From:" "$I")
    CONTACT=$(echo "$FROM" | sed -e 's/\(.*\)<//' -e 's/>\(.*\)//')
    PERSON=$(echo "$FROM" | sed -e 's/\(.*\)://' -e 's/<\(.*\)//' -e 's/^ //' -e 's/ $//')
    TITLE=$(grep "pagetitle>" "$I" | sed -e 's/\(.*\)pagetitle>//' -e 's/<\/define\(.*\)//')
    WEBSITE=$(grep "webpage>" "$I" | sed -e 's/\(.*\)webpage>//' -e 's/<\/define\(.*\)//')
    LINK=$(echo "$I" | sed -e 's/\(.*\)users\///' -e 's/\.wml//')
    cat <<EOF
:$TITLE
$COMMITTER: $DATE: contact is $CONTACT
        person is $PERSON
        website is $WEBSITE
        source is http://www.debian.org/users/$LINK

EOF
done

