
Summary from the debian www/wiki BoF at DC14



[ Please note the cross-post and Reply-To ]

Hi folks,

As promised, here's a quick summary of what was discussed at the BoF
session in Portland. Apologies for the delay - it takes a while to
write these up... :-/

Thanks to the awesome efforts of our video team, the session is
already online [1]. I've taken a copy of the (partial!) Gobby notes
too, alongside my small set of slides for the session. [2]

We only had a small number of attendees at the session in person -
whether that was because of lack of interest or a clash with the other
sessions at the time, I've no idea.

debian-www
==========

I didn't have a huge amount to talk about here, but felt it was worth
trying to start some discussion...

* We're still using CVS for the website, which is a PITA. Git might
  work, but there are a few (potential?) problems:

  + New way of working for our contributors, including translators who
    may not cope with learning a more complex tool

  + Space/time constraints working with a big repo - CVS supports
    partial checkouts better. I'm not convinced that this should
    matter any more, but... Later data comparisons tell us that CVS
    uses ~350M for a checkout, a git clone uses ~540M. An initial git
    clone can also be slow.

  + Our current work-flow (helper scripts, "page outdated" logic,
    translations) is built around CVS and would need a major revamp to
    fit git. Maybe po4a could help here?

  Is anybody interested enough in switching that we'll find enough
  manpower to make the change? CVS is *horrible* (IMHO), but it's a
  lot of work to switch.

No other topics were brought up, so we moved on to the wiki...

debian-wiki
===========

Quick summary of the wiki status:

 * 12,203 pages (non-spam)
 * 12,565 registered user accounts (non-spam)
 * Using Moin 1.9.4 with some local patches (since upgraded to 1.9.7)

Brief discussion of how we've dealt with spammers - the problem is
*believed* to be just about solved now. To edit pages in the wiki, a
user must be logged in with an account. To create an account, they
must register using a valid email address and we validate that
email/account link by sending a URL that needs to be visited. Whenever
anybody attempts to sign up for an account, our scripts attempt (based
on heuristics and history) to detect and block spam sign-ups.
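The email/account validation step described above can be sketched
roughly as follows. This is a minimal illustration only, not the
wiki's actual code: the function names, the token scheme and the
server secret are all assumptions.

```python
import hashlib
import hmac
import secrets

# Hypothetical server-side secret; the wiki's real scheme is not shown here.
SERVER_SECRET = b"replace-with-a-real-secret"

def make_confirmation_token(email: str) -> str:
    """Build an unguessable token tying a random nonce to the email address."""
    nonce = secrets.token_hex(16)
    sig = hmac.new(SERVER_SECRET, (nonce + email).encode(),
                   hashlib.sha256).hexdigest()
    return f"{nonce}.{sig}"

def verify_confirmation_token(email: str, token: str) -> bool:
    """Check that the token presented in the visited URL matches the email."""
    try:
        nonce, sig = token.split(".", 1)
    except ValueError:
        return False
    expected = hmac.new(SERVER_SECRET, (nonce + email).encode(),
                        hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)
```

The server would embed the token in the URL it mails out; visiting the
URL proves the recipient controls the mailbox.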

There were questions about account setup; we clarified that account
holders in the wiki don't need to be DDs. Sign-ups are free for anyone
who cares - please join in!

Wiki anti-spam discussion
=========================

More in-depth explanation of how people appear "spammy" when
attempting to create an account. A typical spammer's sign-up will have:
 * <random alphanumerics>@hotmail.com for email
 * <a totally unrelated set of random characters> for a username
 * an IP on a random Chinese mobile broadband network or known
   spam-haven

The anti-spam checks will score all the information on a sign-up
attempt and will refuse to create an account if the total score is too
high. If people attempt to sign up too many times in succession for an
account from the same spammy-looking email or IP, the IP will be
blacklisted. The blacklist is not just for blocking account sign-ups -
spammers are clearly not interested in Debian and are just looking to
spam. We block the IP so they can't access any of our pages. Too many
obvious spam sign-up attempts from the same network address range will
also result in us blacklisting a network block, or even an entire ISP
in the case of known spam-havens.
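As a rough illustration of the scoring approach described above - the
heuristics, weights, threshold and network addresses here are all
invented for the example, not the wiki's real anti-spam rules:

```python
import re

SPAM_THRESHOLD = 10  # invented threshold, for illustration only

# Invented example data: network blocks previously seen sending spam.
KNOWN_SPAM_NETWORKS = {"203.0.113.", "198.51.100."}

def score_signup(email: str, username: str, ip: str) -> int:
    """Score a sign-up attempt; higher means more spam-like."""
    score = 0
    local = email.split("@", 1)[0]
    # Random-looking alphanumeric local part at a freemail provider.
    if email.endswith("@hotmail.com") and re.fullmatch(r"[a-z0-9]{10,}", local):
        score += 5
    # Username completely unrelated to the email address.
    if username.lower() not in email.lower():
        score += 3
    # Source address in a known spam-haven network block.
    if any(ip.startswith(net) for net in KNOWN_SPAM_NETWORKS):
        score += 6
    return score

def allow_signup(email: str, username: str, ip: str) -> bool:
    """Refuse to create the account if the total score is too high."""
    return score_signup(email, username, ip) < SPAM_THRESHOLD
```

Repeated refusals from the same address would then feed the
IP/network blacklisting described above.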

We have tried in the past using Captchas on the Debian wiki, but it
didn't help much. There are a whole load of problems with Captchas
anyway (e.g. blocking blind people, privacy issues), but the biggest
problem is that the Captchas just did not solve the spam problem for
us! Most of the spam account sign-ups are already coming from botnets
where the spammers have broken Captchas to get free email accounts -
the one for the wiki is no harder for them! Steve implemented Captcha
support for Moin to try this all out, then turned off that support on
the Debian wiki after not very long.

There is a potential problem with Tor exit nodes being blacklisted due
to spammy-looking activity. We'd like to not block the nodes
themselves here - we'll need to work on this with the Tor folks.

Steve showed a small demo of the anti-spam stuff at work, using his
"console" on the wiki, and demonstrated some example spammers that
would be blocked.

There's no perfect solution here - we have to work out spam vs. ham
from a small amount of information, and we can never be *100%*
sure. If a genuine user tries to sign up and is blocked as a
false positive, they should mail the debian-www list or the wiki
admins and we can white-list their email address in that situation.

Gentoo/Arch wiki comparison
===========================

Both Gentoo and Arch have/had really good wikis full of great content
and excellent links to more information. It would be awesome if the
Debian wiki could be as good; this is down to the people supplying and
maintaining the content.

Freezing the wiki on a per-release basis?
=========================================

This has been suggested a few times in the past - freeze the content
in the wiki for each release and create new versions of pages for
future content. That way, it becomes easier to track out-of-date
content.

I'm not convinced - lots of the wiki (not sure of the split!) is *not*
necessarily linked to a particular Debian release so this wouldn't fit
too well I think. This works very well for the Apache folks for their
documentation, but they have a very different setup.

Paul has macros which could help solve this - allow page content to
know what the current release is and maybe show different
content. Maybe that could help, maybe it's solving a different
problem.
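One way such a macro might pick content per release can be sketched
like this - a purely hypothetical illustration, not Paul's actual
Moin macros; the release name and variant keys are assumptions:

```python
# Hypothetical: the wiki knows which release is current stable.
CURRENT_RELEASE = "jessie"  # assumption for the example

def select_content(variants, release=CURRENT_RELEASE):
    """Return the release-specific variant of some page text if one
    exists, falling back to a generic version otherwise.

    `variants` is an assumed mapping of release codename -> text,
    with an optional "generic" fallback entry."""
    return variants.get(release, variants.get("generic", ""))
```

A page could then carry several variants of a paragraph and have the
engine show the one matching the reader's release.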

We haven't spoken to the other distro folks about wiki setups
(e.g. comparing anti-spam).

Wiki engine choice?
===================

We've had the suggestion several times that we should maybe move to a
different wiki engine. We're on Moin and reasonably happy with the
setup so far. Moving to another wiki engine is difficult - it is very
labour-intensive to translate markup, or you choose to start again
with a mostly empty wiki and risk not getting any content. If anybody
is interested in doing a migration, they would need a copy of our
data to work with. The wiki admins are happy to give dumps of the
wiki to anybody interested.

Paul has even written a Moin patch to generate daily dumps like this
(e.g. for offline use), but has so far struggled to get it
reviewed. Steve's patches have been proposed and reviewed upstream,
and there are outstanding comments he needs to resolve. Lack of time
all round. :-(

Content is much more important than the wiki engine itself.

Steve has a script that can walk through the wiki, identify
appropriately-tagged pages and check whether they're out of date,
mailing the most recent editors to ask for review. It's only a
proof-of-concept so far; he might put it into place shortly as a
trial.
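The proof-of-concept could work along these lines. This is a hedged
sketch only: the page schema, the tag name and the review interval
are assumptions, and the actual mailing step is left out.

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(days=365)  # assumed review interval

def find_stale_pages(pages, now=None):
    """Return (name, editors) pairs for tagged pages not edited recently.

    `pages` is an iterable of dicts with 'name', 'tags', 'last_edited'
    (a datetime) and 'recent_editors' keys - an assumed schema, not
    the wiki's real data model.
    """
    now = now or datetime.utcnow()
    stale = []
    for page in pages:
        if "CheckForOutdated" not in page["tags"]:  # hypothetical tag name
            continue
        if now - page["last_edited"] > STALE_AFTER:
            stale.append((page["name"], page["recent_editors"]))
    return stale
```

A wrapper would then mail each page's recent editors asking them to
review the content.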

Templates for wiki pages
========================

Some discussion about moin templates - we're using these already in a
few places (e.g. for BSP pages), please suggest more if you think they
would be useful.

HELP!
=====

We're *always* looking for more help in the wiki. A particular place
where people can help is in triaging the BTS for wiki,debian.org.

Special features in the wiki
============================

We have some cute extra macro features that people have added:

 * DebianBug()
 * Release name, version, dates
 * Message-ID search for mailing list lookup

Also:

 * The CategoryPermalink tag should be used on pages that are
   referenced externally, to make it obvious that they should not be
   moved/renamed/deleted.

Special sprint / BSP for web/wiki?
==================================

It might be very useful to have a specific get-together to work on
features and bugs.

A good example of this is the upcoming semi-planned switch to single
sign-on for the wiki. We'd like to get away from the separate
accounts that everybody has. SSO is something we've been wanting to
do for ages, but we're short of manpower. We'll be working on the
migration when we find some time.

Translations in the wiki
========================

The way we do this is not wonderful, with links to
<LANG>/PageName. Moin has better support for this for its own
internal pages, but we're not sure how we can make better use of that
ourselves.

Wiki infrastructure and performance
===================================

We moved servers a couple of years back, from a dedicated i386 machine
to an amd64 VM hosted by DSA with lots of memory. We're using a
heavily-threaded moin/wsgi setup on that machine and it seems to cope
now. As far as we can tell, we're one of the biggest Moin sites on the
planet. An example that proved this was the page save / notification
performance bug that hit us a couple of years back, which turned out
to be a scalability bug in moin. Overall, performance is looking fine
now.

Mentioned the break-in we had a few years back from the drawing plugin
security hole. We had to reset all the passwords, and a lot of people
with older accounts did not have working email addresses attached to
their accounts. Those people could not recover their accounts
automatically because of that, so would have been locked out. If
anybody is still in that situation, please contact the admins!

The security breach also caused us to move to a new and better setup
with privilege separation to reduce the impact of potential future
attacks. DSA (weasel in particular) were awesome in terms of doing the
re-installation at that time, and fixing up the system
security. Thanks!

Why are the website updates so slow?
====================================

Rebuilds happen from cron rather than on every commit. Why? The
website build takes a very long time due to its size. The
cross-linking in wml is great for generating well-linked content with
macros etc., but on a very large site it takes a long time to generate
all the HTML. It's possible to just rebuild small parts of the site,
but that can be risky and can cause bugs.

Could we use po4a / gettext for www translation?
================================================

Maybe - we need people to work on this to see if we can make it work.

[1] http://meetings-archive.debian.net/Public/debian-meetings/2014/debconf14/webm/Web_and_wiki_BoF.webm
[2] http://www.einval.com/~steve/talks/Debconf14-web-wiki/

-- 
Steve McIntyre, Cambridge, UK.                                steve@einval.com
Welcome my son, welcome to the machine.
