
Re: Conversion edge cases



In what seemed like a good idea 12(!) hours ago, I decided to
spend some time clearing up a few edge cases.

= #language tags =

All language tags should now match this regexp:

        #language [a-z][a-z][-_]([A-Z][A-Z]|[a-z][a-z])

... apart from the following pages:

DamienDurand              #language en fr
Dirk Linnerkamp           #language de en
JeanChristopheAndré       #language fr,en,vi
JorgeChaves               #language en/es
Kron                      #language en, ru
XIM                       #language en, it
manu                      #language fr,de

These all seem to be homepages, so they shouldn't go in a language namespace anyway.

Invalid tags will no doubt creep back in between now and conversion time,
but will hopefully be few enough in number to fix manually.
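
For anyone wanting to re-check tags nearer conversion time, here is a minimal
sketch of such a check.  The sub name invalid_language_tags is hypothetical,
and the ^...$ anchors and the optional trailing whitespace are assumptions;
the pattern itself is the one quoted above, copied as-is.

```perl
use strict;
use warnings;

# Pattern from this message; the anchors and trailing \s* are assumptions.
my $valid_language = qr/^#language [a-z][a-z][-_]([A-Z][A-Z]|[a-z][a-z])\s*$/;

# Hypothetical helper: return every "#language ..." line in a page
# that does not match the pattern above.
sub invalid_language_tags {
  my ($contents) = @_;
  # Capture each "#language" line, without its trailing \r
  return grep { $_ !~ $valid_language }
         ( $contents =~ /^(#language\b.*?)\r?$/mg );
}

print "$_\n" for invalid_language_tags("#language fr,en,vi\r\nSome text\r\n");
```

Run over every page, this should list only the homepages above plus whatever
has crept back in since.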

= Categories =

We need to reconcile three types of category markup:

* MoinMoin considers a page to be in a category if it links to that category
  anywhere in the page
* MediaWiki considers a page to be in the category if there is a link
  of the form [[Category:Foo]] anywhere in the page, but not [[:Category:Foo]]
* conventionally, users of the current wiki put category links at the bottom
  of the page, after a string matching "\r\n----\r\n"
  * this convention is not always followed
  * categories often come with comments, e.g. explaining why a page has been
    proposed for deletion
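
To make the MediaWiki rule concrete, here is a minimal sketch that lists the
categories a page would be in.  mediawiki_categories is a hypothetical name,
and the pattern is a simplification: it accepts an optional "|sortkey" part but
ignores any variations in case or whitespace that MediaWiki may tolerate.

```perl
use strict;
use warnings;

# Hypothetical helper: list the categories named by [[Category:Foo]] links.
# [[:Category:Foo]] is skipped, because "[[" must be followed directly
# by "Category:" for the pattern to match.
sub mediawiki_categories {
  my ($wikitext) = @_;
  my %seen;
  return grep { !$seen{$_}++ }
         ( $wikitext =~ /\[\[Category:([^\]|]+)(?:\|[^\]]*)?\]\]/g );
}

print join( ', ', mediawiki_categories('See [[:Category:Foo]] and [[Category:Bar]]') ), "\n";
```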

Having cleared up a lot of edge cases, I think we can rely on the Debian footer
convention (the third item above) to translate to MediaWiki more cleanly.

The code below is adapted from the code I used to search for edge cases.
Given a footer like this:

	----
	CategoryFoo
	CategoryBar: explanation

... it converts it to:

	{{Category|CategoryBar|explanation}}
	{{Category|CategoryFoo}}

I'm not aware of an MW concept like "category with explanation",
so a template seems like as good an idea as any.

Categories without explanations don't need to use templates,
but keeping them the same seems like a good idea?

The code alphabetises categories mainly because it was easier to write,
but it seems like a good convention to promote?
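
As a rough illustration of the output format and the alphabetical ordering,
the rendering step can be sketched on its own.  render_categories is a
hypothetical helper, not part of the code in this message; it assumes a hash
mapping category names to explanations (empty string for none), and it joins
lines with a plain "\n" for simplicity.

```perl
use strict;
use warnings;

# Hypothetical helper: render category => explanation pairs as
# {{Category}} template calls, sorted by category name.
sub render_categories {
  my (%categories) = @_;
  my @lines;
  foreach my $category ( sort keys %categories ) {
    my $reason = $categories{$category};
    if ( defined $reason && length $reason ) {
      push @lines, sprintf( '{{Category|%s|%s}}', $category, $reason );
    } else {
      push @lines, sprintf( '{{Category|%s}}', $category );
    }
  }
  return join( "\n", @lines );
}

print render_categories( CategoryFoo => '', CategoryBar => 'explanation' ), "\n";
```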

The code treats "ToDo" as a category, instead of a WikiTag.  I'm not sure
what the difference is in MoinMoin, and I guess we should treat them the same
in MediaWiki?

The code has only been tested on a version of the site before fixing all the
edge cases.  You're welcome to use this as a springboard to write your own
code, but if you use it directly, please test it thoroughly :)

The code is released under the same license as mm2mw[0].

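# Convert the text after the "----" separator ($footer) into {{Category}}
# template calls.  If any part of the footer is not recognised, return
# $original_footer unchanged.
sub fix_categories_inner {
  my ($original_footer, $footer) = @_;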
  # Remove some common boilerplate:
  $footer =~ s/\s*\r\n/\n/g;
  $footer =~ s/##+ Uncomment the next line if you are a wiki translator\n## CategoryWikiTranslator\n//;
  $footer =~ s/##+ Uncomment the next line if you are a wiki translator\n//;
  $footer =~ s/##+ CategorySomething \| CategoryAnother\n//;
  $footer =~ s/##+ Keep only one good category and remove others\n//;
  $footer =~ s/##+ Vous pouvez ajouter d'autres articles utiles ici\.\s*\n##+ *Voir aussi:\s*\n##+ *Si cette article correspond à certaines catégories \(qui existent!\), ajoutez les ici\.\s*\n//;
  # Common regex fragments:
  my $category_barrier   = '^(?:^\s*|\n+|\s*\|\s+)\s*';
  my $raw_category       = '[A-Z]\w+';
  my $bracketed_category = '\[\[(?:\w|\/)+(?:\s*\|[^\]]+)?\]\]';
  my $next_category      = '\s*(?=\n+|\|\s+|\b[A-Z]|\[\[\w|$)';
  # Shift one category off the list at a time
  my %categories;
  my $success = 0;
  while (1) {
    $footer =~ s/^\s*\.\s*//; # e.g. InstallingDebianOn/Apple
    if ($footer =~ /^\s*$/ || $footer =~ /^\s*(?:\s*#.*\n)+\s*$/ ) {
      # * (Empty string)
      # * "You can add other _helpful_ links here."
      $success = 1;
      last;
    } elsif ( $footer =~ s/$category_barrier($raw_category|$bracketed_category)$next_category// ) {
      # * "CategoryFoo CategoryBar"
      # * "CategoryFoo | CategoryBar"
      # * "[[Foo]]"
      # * "[[es/Foo]]"
      my $category = $1;
      $category =~ s/^\[\[//;
      $category =~ s/\|.*//;
      $category =~ s/\]\]$//;
      $categories{$category} = '';
    } elsif ( $footer =~ s/^\s*((?:#.*\n)+)\s*($raw_category)(?=\n|\s+\|\s+)// ) {
      # * "## This page is referenced from ...\nCategoryPermalink"
      my ($comment, $category) = ($1, $2);
      $comment =~ s/^#+ *//gm;
      chomp($comment);
      $categories{$category} = $comment;
    } elsif ( $footer =~ s/$category_barrier($raw_category)\s*[-:]\s*(.*?)\s*\n// ) {
      # * "CategoryFoo: comment"
      # * "CategoryFoo - comment"
      my ($category, $comment) = ($1, $2);
      $categories{$category} = $comment;
    } elsif ( $footer =~ s/$category_barrier($raw_category)\s*\/\*\s*(.*?)\s*\*\/\s*\n// ) {
      # * "CategoryFoo /* comment */"
      my ($category, $comment) = ($1, $2);
      $categories{$category} = $comment;
    } else {
      # Unrecognised - assume this is not a footer after all
      last;
    }
  }
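  if ($success) {
    # Emit one template call per category, in alphabetical order
    my $ret = '';
    foreach my $category ( sort keys %categories ) {
      my $reason = $categories{$category};
      if (length $reason) { # test length, so that a comment of "0" is kept
        $ret .= "\n\n{{Category|$category|$reason}}\r\n";
      } else {
        $ret .= "\n\n{{Category|$category}}\r\n";
      }
    }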
    return $ret;
  } else {
    return $original_footer;
  }
}

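# Rewrite the category footer of a whole page: normalise the "add it below"
# boilerplate to a "----" separator, then convert the final footer block.
sub fix_categories {
  my ($contents) = @_;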
  $contents =~ s/\r\n\s*#+\s*(?:-\s*)?If this page belongs to an existing *Category, add it below\.\r\n/\r\n----\r\n/i;
  $contents =~ s/(\s*\r\n----\s*\r\n((?:(?!----\s*\r\n).*\r\n)*))$/fix_categories_inner($1, $2)/e;
  return $contents;
}

= Miscellaneous notes from today =

I stumbled over a couple of things while doing the above, which might be useful
to the migration process.

"CategoryXWindowSystem" is not automatically converted to a link, seemingly
because it contains two adjacent capital letters.  I haven't checked whether
mm2mw.pl handles this edge case, and wiki users may well consider it a bug
that should be quietly fixed during conversion.
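
A minimal sketch for finding other category names that might be affected.
It tests only my guess above (two adjacent capital letters), not MoinMoin's
actual linking rule, and has_adjacent_capitals is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the name contains two adjacent capital
# letters.  This encodes the guess above, not MoinMoin's real WikiName rule.
sub has_adjacent_capitals {
  my ($name) = @_;
  return $name =~ /[A-Z]{2}/ ? 1 : 0;
}

foreach my $name (qw(CategoryXWindowSystem CategoryPermalink)) {
  print "$name\n" if has_adjacent_capitals($name);
}
```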

People sometimes translate categories in different ways.  For example:

 * several German homepages are in KategorieHomepage
 * CategoryFrPortal is in CategoryFrCategory
 * es/DebianForNonCoderContributors is in es/Community

I guess clearing that up is a job for another day?

[0] https://salsa.debian.org/guillem/mm2mw

