Re: Conversion edge cases
In what seemed like a good idea 12(!) hours ago, I decided to
spend some time clearing up a few edge cases.
= #language tags =
All language tags should now match this regexp:
#language [a-z][a-z][-_]([A-Z][A-Z]|[a-z][a-z])
... apart from the following pages:
DamienDurand          #language en fr
Dirk Linnerkamp       #language de en
JeanChristopheAndré   #language fr,en,vi
JorgeChaves           #language en/es
Kron                  #language en, ru
XIM                   #language en, it
manu                  #language fr,de
These all seem to be homepages, so shouldn't go in a language namespace anyway.
Invalid tags will no doubt creep back in between now and conversion time,
but will hopefully be few enough in number to fix manually.
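A minimal sketch of how tags could be re-checked closer to conversion time,
assuming the page text is already in a string. The pattern is copied from
above; the anchoring and the helper name invalid_language_tags are assumptions
for illustration, not part of mm2mw. Note that, as quoted, the pattern requires
a region suffix, so a bare "#language en" would also be reported.

```perl
use strict;
use warnings;

# Pattern copied from the post; the ^...\s*$ anchoring is an assumption,
# added so that trailing junk such as ", ru" is rejected.
my $valid_tag = qr/^#language [a-z][a-z][-_]([A-Z][A-Z]|[a-z][a-z])\s*$/;

# Hypothetical helper: return every "#language" line in a page that does
# not match the pattern. Handles both "\r\n" and "\n" line endings.
sub invalid_language_tags {
    my ($contents) = @_;
    return grep { !/$valid_tag/ } $contents =~ /^(#language\b.*?)\r?$/mg;
}
```

Running this over every page and printing the page name alongside each
returned line should reproduce a list like the one above.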
= Categories =
We need to reconcile three types of category markup:
* MoinMoin considers a page to be in a category if it links to that category
anywhere in the page
* MediaWiki considers a page to be in the category if there is a link
of the form [[Category:Foo]] anywhere in the page, but not [[:Category:Foo]]
* conventionally, users of the current wiki put category links at the bottom
of the page, after a string matching "\r\n----\r\n"
  * this convention is not always followed
  * categories often come with comments, e.g. explaining why a page has been
    proposed for deletion
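To get a feel for how often the footer convention is broken, something like
the sketch below could list category names mentioned above the last "----"
line. It assumes category names start with "Category" (so translated names
such as KategorieHomepage would be missed), and categories_outside_footer is
a hypothetical helper, not part of mm2mw.

```perl
use strict;
use warnings;

# Hypothetical helper: return the distinct category names that appear in
# the page body, i.e. before the last "\r\n----\r\n" separator.
# If the page has no separator, the whole page counts as body.
sub categories_outside_footer {
    my ($contents) = @_;
    my $pos  = rindex($contents, "\r\n----\r\n");
    my $body = $pos >= 0 ? substr($contents, 0, $pos) : $contents;
    my %seen;
    # Assumption: a category is the word "Category" followed by a capital.
    return grep { !$seen{$_}++ } $body =~ /\b(Category[A-Z]\w+)/g;
}
```

A page that returns a non-empty list is in its categories by MoinMoin's
rules, but would lose them if only the footer were converted.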
Having cleared up a lot of edge cases, I think we can use the Debian convention
to translate to MediaWiki more cleanly.
The code below is adapted from the code I used to search for edge cases.
Given a footer like this:
----
CategoryFoo
CategoryBar: explanation
... it converts it to:
{{Category|CategoryBar|explanation}}
{{Category|CategoryFoo}}
I'm not aware of an MW concept like "category with explanation",
so a template seems like as good an idea as any.
Categories without explanations don't need to use templates,
but keeping them the same seems like a good idea?
The code alphabetises categories mainly because it was easier to write,
but it seems like a good convention to promote?
The code treats "ToDo" as a category, instead of a WikiTag. I'm not sure
what the difference is in MoinMoin, and I guess we should treat them the same
in MediaWiki?
The code has only been tested on a version of the site before fixing all the
edge cases. You're welcome to use this as a springboard to write your own
code, but if you use it directly, please test it thoroughly :)
The code is released under the same license as mm2mw[0].
sub fix_categories_inner {
    my ($original_footer, $footer) = @_;
    # Remove some common boilerplate:
    $footer =~ s/\s*\r\n/\n/g;
    $footer =~ s/##+ Uncomment the next line if you are a wiki translator\n## CategoryWikiTranslator\n//;
    $footer =~ s/##+ Uncomment the next line if you are a wiki translator\n//;
    $footer =~ s/##+ CategorySomething \| CategoryAnother\n//;
    $footer =~ s/##+ Keep only one good category and remove others\n//;
    $footer =~ s/##+ Vous pouvez ajouter d'autres articles utiles ici\.\s*\n##+ *Voir aussi:\s*\n##+ *Si cette article correspond à certaines catégories \(qui existent!\), ajoutez les ici\.\s*\n//;
    # Common regex fragments:
    my $category_barrier = '^(?:^\s*|\n+|\s*\|\s+)\s*';
    my $raw_category = '[A-Z]\w+';
    my $bracketed_category = '\[\[(?:\w|\/)+(?:\s*\|[^\]]+)?\]\]';
    my $next_category = '\s*(?=\n+|\|\s+|\b[A-Z]|\[\[\w|$)';
    # Shift one category off the list at a time
    my %categories;
    my $success = 0;
    while (1) {
        $footer =~ s/^\s*\.\s*//; # e.g. InstallingDebianOn/Apple
        if ($footer =~ /^\s*$/ || $footer =~ /^\s*(?:\s*#.*\n)+\s*$/ ) {
            # * (Empty string)
            # * "You can add other _helpful_ links here."
            $success = 1;
            last;
        } elsif ( $footer =~ s/$category_barrier($raw_category|$bracketed_category)$next_category// ) {
            # * "CategoryFoo CategoryBar"
            # * "CategoryFoo | CategoryBar"
            # * "[[Foo]]"
            # * "[[es/Foo]]"
            my $category = $1;
            $category =~ s/^\[\[//;
            $category =~ s/\|.*//;
            $category =~ s/\]\]$//;
            $categories{$category} = '';
        } elsif ( $footer =~ s/^\s*((?:#.*\n)+)\s*($raw_category)(?=\n|\s+\|\s+)// ) {
            # * "## This page is referenced from ...\nCategoryPermalink"
            my ($comment, $category) = ($1, $2);
            $comment =~ s/^#+ *//gm;
            chomp($comment);
            $categories{$category} = $comment;
        } elsif ( $footer =~ s/$category_barrier($raw_category)\s*[-:]\s*(.*?)\s*\n// ) {
            # * "CategoryFoo: comment"
            # * "CategoryFoo - comment"
            my ($category, $comment) = ($1, $2);
            $categories{$category} = $comment;
        } elsif ( $footer =~ s/$category_barrier($raw_category)\s*\/\*\s*(.*?)\s*\*\/\s*\n// ) {
            # * "CategoryFoo /* comment */"
            my ($category, $comment) = ($1, $2);
            $categories{$category} = $comment;
        } else {
            # Unrecognised - assume this is not a footer after all
            last;
        }
    }
    if ($success) {
        my $ret = '';
        foreach my $category ( sort keys %categories ) {
            if (my $reason = $categories{$category}) {
                $ret .= "\n\n{{Category|$category|$reason}}\r\n";
            } else {
                $ret .= "\n\n{{Category|$category}}\r\n";
            }
        }
        return $ret;
    } else {
        return $original_footer;
    }
}
sub fix_categories {
    my ($contents) = @_;
    $contents =~ s/\r\n\s*#+\s*(?:-\s*)?If this page belongs to an existing *Category, add it below\.\r\n/\r\n----\r\n/i;
    $contents =~ s/(\s*\r\n----\s*\r\n((?:(?!----\s*\r\n).*\r\n)*))$/fix_categories_inner($1, $2)/e;
    return $contents;
}
= Miscellaneous notes from today =
I stumbled over a couple of things while doing the above, which might be useful
to the migration process.
"CategoryXWindowSystem" is not automatically converted to a link, seemingly
because it contains two adjacent capital letters. I haven't checked whether
mm2mw.pl handles this edge case, and wiki users may well consider it a bug
that should be quietly fixed during conversion.
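If it helps, names like this could be flagged before conversion with a check
along these lines. This is only a sketch based on my guess above that two
adjacent capitals are what stops the automatic linking; I haven't checked
MoinMoin's actual WikiName rule, and names_with_adjacent_capitals is a
hypothetical helper.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names containing two adjacent capital
# letters, which (by assumption) MoinMoin does not auto-link.
sub names_with_adjacent_capitals {
    my (@names) = @_;
    return grep { /[A-Z]{2}/ } @names;
}
```

Anything it returns would need an explicit link, or a manual look, during
conversion.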
People sometimes translate categories in different ways. For example:
* several German homepages are in KategorieHomepage
* CategoryFrPortal is in CategoryFrCategory
* es/DebianForNonCoderContributors is in es/Community
I guess clearing that up is a job for another day?
[0] https://salsa.debian.org/guillem/mm2mw