Re: Stanford IDG internal Perl style

To: debian-lint-maint@lists.debian.org
Subject: Re: Stanford IDG internal Perl style
From: Russ Allbery <rra@debian.org>
Date: Mon, 01 Apr 2013 11:57:56 -0700
Message-id: <[🔎] 87sj3am62z.fsf@windlord.stanford.edu>
In-reply-to: <515620EA.70104@thykier.net> (Niels Thykier's message of "Sat, 30 Mar 2013 00:16:58 +0100")
References: <87k3oqdsq7.fsf@windlord.stanford.edu> <515620EA.70104@thykier.net>
Niels Thykier <niels@thykier.net> writes:
> On 2013-03-29 18:24, Russ Allbery wrote:

>> In general, follow the rules in the perlstyle man page and Perl Best
>> Practices, except we use cuddled elses (else on the same line after }).
>> And, of course, follow rules below rather than perlstyle if they
>> conflict.

> I don't think I have read the Perl Best Practises (assuming its the
> O'Reilly one[1]), so I have to come back on that.

> [1] http://refcards.com/docs/vromansj/perl-best-practices/refguide.pdf

This refers to the book by Damian Conway.  Not available for free so far
as I know, although that's a summary.

Thanks for all the typo fixes!

>> Note that these guidelines are for internal projects.  If we release
>> something as open source that needs to be compatible with Perl 5.8
>> rather than 5.10 (which the document guidelines assump), there are the
>> following exceptions:

> s/assump/assume/ ?

Yes.

> I have been considering to use autodie before (possible exception being
> system/exec where it is always not compatible).  I tend to prefer
> exception based I/O instead of checking return values (which gets old
> really soon), so I could be convinced to use autodie by default.  Too
> bad it does not cover all cases (print et al).

Yeah, I really like and prefer autodie.  One does have to be careful in
situations where the file may not exist for some legitimate reason or one
wants to present a decent error message to the user, but there are just so
many cases where a missing file indicates some sort of serious internal
error and it's better to just throw an exception.

>> * Factor out long expressions in the middle of statements.  It's better
>> to read three clear statements than to have them all combined into a
>> more complicated all-in-one longer line.

> Is this:

>    my $result = part1
>       + part2
>       + part3;

>  vs:

>   my $result = part1 + part2 + part3;

> (where partX is a "complex" computation)  Or is it about splitting it
> into temporary variables?

It's the latter: splitting long calculations into intermediate variables.

>> * Dereference variables with arrows for readability.  (ie:
>> "$record->{name}".

> Also in "chaining"?
>   $record->{$last_name}->{$first_name}
>   $matrix->[$i]->[$j]

> vs

>   $record->{$last_name}{$first_name}
>   $matrix->[$i][$j]

No, just the first dereference (so your second example is correct).  In
other words, the goal is to avoid doubled sigils like $$record{name} and
instead prefer $record->{name}.  This is largely because the precedence
for $$record{name} is always confusing and can break easily when the
expression becomes a bit more complex.

>> [...]
>> * For clarity in some fonts, try to make single-character strings more
>> obvious as to their intent:
>> 
>>         ''      q{}
>>         ' '     q{ }
>>         ','     q{,}

> I do not think I understand this one.  Is it to use the second column
> variants instead of the first?

Yes.

The problem here is that '' in a lot of fonts looks like ", and the
difference between '' and ' ' can be hard to see.  It's also easy to
misread strings like ',' and ';' and think the comma or semicolon is part
of the syntax.  Consider, for example:

    print $foo, ',', $bar, ',', $baz;

compared to:

    print $foo, q{,}, $bar, q{,}, $baz;

It's a bit easier to not get visually lost in the commas (at least for
me).  (Of course, for that specific example, interpolation is probably a
better idea, but some of those may have been method calls or osmething
else.)

In general, Perl Best Practices recommends using q{} for most
single-character punctuation strings or strings containing a small amount
(one to three spaces) of whitespace because it makes it a bit clearer.

>> # Naming
>> 
>> [...]
>> 
>> * Name modules using Noun::Adjective::Adjective.
>>   (Disk::DVD::Rewritable)

> Mmm... Is that strictly "one Noun followed by adjective(s)".  Also the
> example appear to be "Noun::Noun::Adjective" (or is a DVD an
> adjective?).

It sort of is, in that you can refer to a DVD Disk, but the rule is not
very strict.

> Anyway I assume "namespacing" is allowed (e.g.  Lintian::$module)

Yes, certainly.  The goal here is to avoid Disk::Rewritable::DVD or,
worse, Rewritable::DVD::Disk, although I think it's probably easier to
think of it in terms of an inheritance tree and have the right naming
structure fall out of that.  (There are some cases where
Disk::Rewritable::DVD would be more correct.)

>> # Formatting and Indentation
>> 
>> Don't use tabs in perl scripts, since they can expand differently in
>> different environments. In particular, please try not to use the mix of
>> tabs and spaces that is the default in Emacs.
>> 
>> Please follow these guidelines for spacing and formatting:
>> 
>> * Each block is indented four spaces. Continuations of commands should
>>   be indented two more spaces unless parenthesized, or indented to line
>>   up one space after the opening parentheses if parenthesized.

> So if parentheses are used it should be lined up as... ?

>   really_long_name(argument1,
>                     argument2,
>                     argument3)

This actually isn't what we do, and I'll change it to fit what we do:

    really_long_name(
        argument1,
        argument2,
        argument3,
    );

which is what perltidy will force.  (It's a bit more complex than that in
that it depends on the length; sometimes it will like:

    really_long_name(argument1,
        argument2, argument3);

But in any case it's a four-space indent.)  perltidy has strong opinions
about this and no way of forcing different behavior, so I find it easier
to just go along with it.

>   if (really_long_condition1
>        && really_long_condition2
>        && really_long_condition3) {

The one space thing isn't something that we do, so I'll pull that out.
For conditions like that, it's lined up under the parens (so like the
above but without the extra space).  But really one should try to avoid
code like that and use intermediate variables or pull the condition into a
function.

>> * Use a space between keywords (if, elsif, while) and functions and
>> parenthesized arguments. Do not put a space between the opening or
>> closing parentheses and the contents. For example:
>> 
>>         if ($foo) {
>>             bar ($baz);
>>         }

> In your comment to #704197, you suggested you had dropped the space
> between the function/sub name and its parentheses?  i.e.

>         if ($foo) {
>             bar($baz);
>         }

> Or did I misunderstand you here?

No, that's just a mistake.  Jon Robertson (who wrote up the style guide)
and I are both so used to putting a space there that we keep forgetting.
But everyone else felt strongly about omitting the space.

>> * Parentheses are optional around print, die, and warn.  Other
>> built-ins should include the parentheses, as should any user-defined
>> functions or methods.  The empty parenthesis should be admitted from
>> any method call taking no arguments, but should exist for regular
>> functions.  Any parentheses should have no space between the keyword
>> and the open parenthesis.

> s/admitted/omitted/ ?

Yes.

>> * map, grep, and sort are a special case; their first argument is a
>> code block and should be set off by { } with spaces between the
>> brackets and the code. This code block should never be within
>> parentheses.

> Is that still map(EXPR, LIST) for the alternative call variant?

I don't think we've ever used that style in any of our code.  I find the
non-block form of map really weird, and would rather write:

    @chars = map { chr($_) } @numbers;

than:

    @chars = map(chr, @numbers);

(from the perlfunc documentation).  The latter looks like a bareword to me
and looks quite confusing.

>> # Control Structures
>> 
>> * Only use postfix if or unless for flow control statements (next,
>> last, return, etc) where there is nothing or almost nothing between the
>> flow control statement and the if/unless.  Code between the flow
>> control statement and the if/unless can make the fact that it's a
>> conditional less obvious and confuse some users.

> I might be inclined to use postfix if/unless for "tag" and die as well?

I would too.  We decided not to internally due to the feelings of folks
other than me who felt that always using the block if structure was
simpler, but there is *so much* of that code in Lintian that I'd be rather
reluctant to rewrite it all as block ifs.

On the other hand, people made a really good argument that code like:

    die "with some error message" if $error;

is really confusing to people who come from languages that don't use
statement modifier forms heavily, and pushes the arguably most important
part (the error condition) out to the right margin past the less
interesting error message.  The same thing could be argued as applying to
tags as well.

>> # Subroutines
>> 
>> [...]
>> * Always return with an explicit return, to make certain that you are
>> not returning an implicit value that may not be what you expect.  Use a
>> bare return to return failure, as 'return undef' will not do what you
>> expect if the function was called in a list context.

> exception being one-line anonymous subs/blocks?

perlcritic makes it hard to make exceptions here without adding magic
comments.  I usually stick an explicit return in anyway when it's a
one-line anonymous sub; it usually isn't too hard to work in.  (Blocks are
different; blocks don't need a return.)

>> * Each function should have a leading comment that briefly describes
>> what the function does and what arguments it takes.  The format should
>> be the following:
>> 
>> [...]

> This one sounds like double work with the POD documentation requirement
> as well (except for internal functions)?

These style guidelines are mostly written for code that tends to have a
lot of internal functions and not a lot of public interfaces, which makes
it less double work.  I find the audiences different even with public
functions and write slightly different documentation, but it is some
degree of double work.  I'm not sure I'd advocate adopting this for
Lintian for public functions, since we're fairly far down the path of
intermixing POD.

>> * Use m{}xsm for matches by default.  x is very useful for readability
>> in longer regular expressions.  m will match multiline strings better,
>> and s will improve the handling of "." to match newlines in any
>> multiline string.  In many cases they aren't needed, but we want to
>> encourage them as a default.

> Given how rarely we seem to be maching multiline strings, I am not sure
> it makes sense to use this rule in Lintian?

The argument in Perl Best Practices is that xms should just be the default
regex flavor and you should just use that everywhere.  I've started doing
that, and I admit that I kind of like it.  There is tons of other advice
on how to structure regexes using /x too (such as always using character
classes instead of \ for metacharacters).

It does make all the regexes more verbose and it's a daunting amount of
conversion work with an existing code base.  I don't think it's a horribly
high priority.  But note that, for example, file names *can* contain
newlines in UNIX, so there's something to be said for avoiding surprises.

>> * Do not use interspersed POD to document functions.  All POD should
>> start after the code is done, with an __END__ statement before it.

> I have never quite understood this; especially not with the requirement
> of having a second description at the function itself.

I wouldn't adopt this for Lintian.  The primary objection to intermixed
POD that I've seen is that a lot of editors deal very poorly with it,
since the syntax is completely different than the surrounding code.
(Personally, I use mmm-mode in Emacs, which does okay, although emacs24
seems likely to make that more of a problem.)  There are other objections
in Perl Best Practices, which mostly amount to Conway's opinion that
internal documentation and user-facing documentation are different and
should be written in different styles with different audiences, something
that I buy in part but not enough to make the duplicate work for a case
like Lintian's Perl modules seem worth it.

>> * Use Exporter to export functions, and only do so by request (via
>> @EXPORT_OK).  Exporting all functions by default can clutter namespaces
>> and lead to conflicts if a module and program add new functions.  The
>> recommended way to do module inheritence with Perl 5.10 and later would
>> be to use Exporter by:

>>         use parent qw(Exporter);

> Why this over "use Exporter qw(import);", which has been possible since
> perl 5.8.3?

Yeah, I just realized that's even better and will change our coding style
accordingly.

>> * Use Getopt::Long::Descriptive for option parsing in any script that
>> takes options. Give each option a corresponding single-character option
>> unless it's rarely used. When listing the options, give the short
>> option first, then |, and then the long option, as in:
>> 
>>         my ($opt, $usage = describe_options(
>                           ^
> Missing a ")" ?

Yeah.

>>           '$0 $o <args>',
>>           [ 'd|delete', 'delete the given object' ],
>>           [ 'e|exclude=s', 'exclude any matching names' ],
>>           [ 'h|help',      'display help' ],
>>         ) or exit 1;

> Perhaps we should convert Lintian to use this?  If only so I don't have
> to remember to update the usage with changes to cmd options. :P

Yeah, exactly.  :)  I'd at least consider it.  We did a survey of all of
the available modules, and while Getopt::Euclid looked rather neat in
terms of not having to duplicate information from the documentation, it's
also full of scary black magic and had some other issues.

> Got anything on lazy loading stuff?  L::Collect is basically one big
> lazy-loader and it could do without requiring us to implement the lazy
> loading itself.  Maybe some smart use of Moose?

We don't -- that strategy is still the best one that I know, although I've
not investigated.

>> * Look at the Stanford::Infrared::* modules to see if they cover any
>> generalized cases you need, or to see if there are any cases you think
>> they should cover.  Stanford::Infrared::General covers general needs
>> used in several places, with functions specific to Stanford ITS
>> infrastructure.  Stanford::Infrared::MySQL covers interfacing with the
>> remctl commands in stanford-server-mysql, and other MySQL setup
>> specific to that infrastructure.  Stanford::Infrared::Wrappers covers
>> things that are not as specific to Stanford, but are useful for our
>> needs (such as an IPC::Open3 wrapper).

> A URL for these ... Yeah, too lazy to google for it. :)

Sadly, not available anywhere public right now.  I'll see if I can fix
that.  I'm not sure if that sort of thing is fair game for CPAN or not,
but I can at least put them up on my web site.

>> * Include the name of the script in error messages. This is
>> particularly important for scripts run from cron or other scripts,
>> although it's useful to have in general and never hurts. The easiest
>> way to do this is to put something like the following.  You should then
>> prefix any warn, die, carp, or croak messages with '$0: ', and use $0
>> in any messages that warn of incorrect option usage.  If you use this
>> format, the first character of the rest of the message should not be
>> capitalized.
>> 
>>         my $fullpath = $0;
>>         $0 =~ s{ ^ .* / }{}xms;

> Alternatively, basename $0.

Oh, yeah, much better.  I forgot File::Basename has been in core since
forever.  Will fix.

> What do you do about croak/exceptions from module functions ?

>   eval { function(....); }
>   if ($@) {
> 	die "$0: $@";
>   }

> Or do you use "$0" inside function where it calls croak?

We usually use it inside the functions when it calls croak if they're in
the same script.  For modules, I prefer to throw uncaught exceptions and
make the script catch them.

(I also changed the style to assign $@ to a variable if it's used.)

-- 
Russ Allbery (rra@debian.org)               <http://www.eyrie.org/~eagle/>
Reply to:
Follow-Ups:
- Re: Stanford IDG internal Perl style
  - From: Niels Thykier <niels@thykier.net>
Prev by Date: [SCM] Debian package checker branch, master, updated. 2.5.11-209-gdc7f4dd
Next by Date: [SCM] Debian package checker branch, master, updated. 2.5.11-210-g0dce35e
Previous by thread: [SCM] Debian package checker branch, master, updated. 2.5.11-209-gdc7f4dd
Next by thread: Re: Stanford IDG internal Perl style
Index(es):
- Date
- Thread