
Re: Consistent formating long descriptions as input data



On Thu, 23 Apr 2009, Manoj Srivastava wrote:

       While I can't speak for the policy team (I have not been
re-delegated yet), I suspect the answer might be to get a working
implementation out in the wild (it does not have to be packages.d.o or
anything official -- even standalone software that takes the output
from grep-dctrl or parses a Packages file will suffice). This will
allow us to see what changes to policy might be needed, if any, for
package descriptions.

Would you consider the tasks pages I announced yesterday [1] as such
an implementation?  I continued to work a bit on this and have two
additions to the preprocessor:

   1. The inlist flag has to be unset not only if a line starts
      in the second column again but also if there is an empty line.
   2. You need to escape '#' signs if they appear as the first
      character of a line.

See the implementation at the end of this mail.
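To make the two rules concrete, here is a small self-contained sketch;
the helper names escape_leading_hash and track_list_state are mine and
not part of the implementation below:

```python
import re

def escape_leading_hash(line):
    # Rule 2: a '#' in the first column would otherwise be parsed
    # by Markdown as a header
    return '\\' + line if line.startswith('#') else line

def track_list_state(lines):
    # Rule 1: clear the inlist flag when a line starts in the
    # second column again *or* when an empty line occurs
    inlist = False
    states = []
    for line in lines:
        if re.match(r'\s+[-*+]\s+', line):
            inlist = True
        elif line == '' or not line.startswith(' '):
            inlist = False
        states.append(inlist)
    return states
```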

       Once we have a working implementation, and a clear idea of what
might need to be changed in package descriptions (for example, we
already know that packages using 'o' as a bullet in unordered lists
will have to be changed to use one of +, -, or *), we can scan the
package descriptions to see how many packages would be affected, and
then decide how to introduce that language into policy (the more
packages affected, the greater the need for a transition plan).
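A scan like that could be sketched as follows; the field names are those
of a standard Packages file, but the sample data is made up for
illustration:

```python
import re

# count package descriptions that use 'o' as a bullet
# (these would need to switch to +, - or *)
o_bullet_re = re.compile(r'^ +o +\S', re.M)

def count_o_bullets(packages_text):
    affected = []
    for paragraph in packages_text.split('\n\n'):
        m = re.search(r'^Package: (\S+)', paragraph, re.M)
        if m and o_bullet_re.search(paragraph):
            affected.append(m.group(1))
    return affected

sample = (
    "Package: foo\n"
    "Description: example\n"
    " Features:\n"
    "  o first item\n"
    "  o second item\n"
    "\n"
    "Package: bar\n"
    "Description: another example\n"
    " Features:\n"
    "  * first item\n"
)
```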

I tried to find some examples which need changes.  You might
like to have a look at my "Debugging Blend":

   http://blends.debian.net/debug/tasks

Some issues are mentioned there - I intend to add better
documentation if needed, but some problems are already apparent.

       I do not see any reason this proposal should not become policy,
eventually, since this deals with the core charter of the technical
policy: standards that packages need to follow to allow for better
integration.

After dealing with the issue, I would summarize as follows:

  1. The preprocessing you have to do for markdown is basically
     the same as what I did myself for turning description text
     into HTML programmatically.  There is no real benefit if your
     main target is only HTML - however, other output formats
     might benefit from using the preprocessing + markup step.
  2. Markdown is probably better at detecting second-level lists
     than I would have been programmatically - so here is
     a benefit.  On the other hand, there are some strange false
     positives for second-level lists.
  3. If we are really doing preprocessing anyway, it would be cheap
     to use 's/\so\s/ * /' so that even this marker would be detected
     as a list marker.  This is in direct contrast to my
     initial suggestion - but consistent if you prefer
     preprocessing anyway.  BTW, I even found non-ASCII
     bullets in the burn package, and because it is QA-maintained
     anyway I took the chance to change this while fixing bug
     #517793.  I think we should catch things like this quite
     quickly, because even apt-cache show failed to display
     the description of burn correctly, so fixing the problem
     myself instead of filing another bug against a QA-maintained
     package seemed reasonable.
  4. I expect there are more preprocessing needs that have not
     yet been detected.
  5. I expect the lintian checks for the markdown format to be
     rather complicated, because there is a lot more freedom in the
     format (which might be an advantage for the editors) and
     some valid markdown input might be rendered successfully
     but into something which conflicts with the intention of the
     author.  Compared to my suggestion of formatting the
     long descriptions according to stricter rules, this adds
     another level of complexity, while the lintian checks
     which would be needed for my suggestion would have been
     really cheap.  I consider this a disadvantage.
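Point 3's substitution could look like this in Python; note that I
anchor it to the start of the line, which is slightly stricter than the
sed expression quoted above:

```python
import re

def normalize_o_bullets(line):
    # rewrite a lone 'o' bullet so Markdown recognises it as a list marker;
    # anchoring to the line start avoids touching an 'o' inside running text
    return re.sub(r'^(\s+)o(\s+)', r'\1*\2', line)
```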

I might note that I am not happy that, with the plain ASCII output
of long descriptions as it is more or less done by current tools,
we will end up with a rendering which does not fit my taste at
all - but I accept that I probably belong to a minority, and if
markdown is widely accepted it serves my initial goal (the tasks
pages) as well.

Kind regards

       Andreas.

[1] http://lists.debian.org/debian-devel/2009/04/msg00815.html

Python implementation:

import re

detect_list_start_re = re.compile(r"^\s+[-*+]\s+")
detect_code_start_re = re.compile(r"^\s")
detect_code_end_re   = re.compile(r"^\S")
detect_url_re        = re.compile(r"[fh]t?tp://")

def PrepareMarkdownInput(lines):
    ret    = ''
    inlist = 0
    incode = 0
    for line in lines:
        # strip the leading space of the description field as well as useless trailing space
        line = re.sub('^ ', '', line.rstrip())
        # a '^\.$' marks a new paragraph in descriptions; Markdown uses an empty line here
        line = re.sub(r'^\.$', '', line)

        # If there is an empty line or a non-indented line, the list or verbatim text ends.
        # It is important to check for empty lines because some descriptions would insert
        # more lines than needed in verbatim mode (see for instance glam2).
        # This check must come before the tab is added below, so such lines
        # do not get a spurious leading tab.
        if incode == 1 and (detect_code_end_re.search(line) or line == ''):
            inlist = 0 # list ends if indentation stops
            incode = 0 # verbatim mode ends if indentation stops
        if detect_code_start_re.search(line):
            if incode == 0: # if a list or verbatim mode starts, Markdown needs an empty line
                ret += "\n"
                incode = 1
                if detect_list_start_re.search(line):
                    inlist = 1
        if incode == 1 and inlist == 0:
            ret += "\t" # add a leading tab if in verbatim but not in list mode
        # Mask a '#' at the first character of a line, which would otherwise lead to
        #   MARKDOWN-CRITICAL: "We've got a problem header!"
        if line.startswith('#'):
            ret += '\\'
        if detect_url_re.search(line):
            # some descriptions put URLs in '<>', which is unneeded and might
            # confuse the parsing of '&' in URLs, which is needed sometimes
            line = re.sub(r'<*([fh]t?tp://[-./\w?=~;&]+)>*', r'[\1](\1)', line)
        ret += line + "\n"
    return ret
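For quick testing, the three per-line rewrites used above can be
exercised in isolation; this is just a restatement of the substitutions
in PrepareMarkdownInput:

```python
import re

def transform(line):
    # strip the control-field indent and trailing whitespace
    line = re.sub('^ ', '', line.rstrip())
    # a lone '.' separates paragraphs in descriptions; Markdown wants a blank line
    line = re.sub(r'^\.$', '', line)
    # unwrap URLs from optional '<>' and turn them into Markdown links
    line = re.sub(r'<*([fh]t?tp://[-./\w?=~;&]+)>*', r'[\1](\1)', line)
    return line
```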


--
http://fam-tille.de
