Re: Script to filter debdiff output [was: Help needed]

To: debian-tex-maint@lists.debian.org
Subject: Re: Script to filter debdiff output [was: Help needed]
From: Florent Rougon <f.rougon@free.fr>
Date: Sat, 27 Jan 2007 18:59:41 +0100
Message-id: <878xfotkoi.fsf@florent.maison>
Mail-followup-to: debian-tex-maint@lists.debian.org
In-reply-to: <20070122081115.GK16833@gamma.logic.tuwien.ac.at> (Norbert Preining's message of "Mon, 22 Jan 2007 09:11:15 +0100")
References: <20070122081115.GK16833@gamma.logic.tuwien.ac.at>

Hi,

Norbert Preining <preining@logic.at> wrote:

> Does one of you know a script (or can create one) that does something
> like an intelligent debdiff, ie files which basename are the same, but
> the LAST entry of the dirname may differ, are ignored.

Here is your script. It does basically what you asked here, and a bit
more than you wished but didn't dare ask: :-)
                         ^^^^^^^^^^^^^^^
                    (is this correct English?)

     - the files aren't ignored if they differ in either mode, owner or
       group (these are significant differences, I'd say);

     - the files can only be ignored if their full paths start with either
       /usr/share/texmf-texlive or /usr/share/doc/texlive
       (the intent here is that unwanted changes, e.g. under /usr/lib,
       don't go unnoticed because of the filtering done here).

       Actually, there is a variable at the top of the file that you
       should customize if this isn't enough for you:

         filter_in_regexes = (r"^/usr/share/texmf-texlive",
                              r"^/usr/share/doc/texlive")

       It's a sequence of Python regular expressions (which are similar
       to the Perl regexes). Only those files whose full path matches at
       least one of the regexes in 'filter_in_regexes' can possibly be
       filtered out by the script (which of course only happens if they
       also fulfill the other conditions).

       Note: the 'r' in front of the Python strings means they are
             written in "raw syntax", i.e. with no special processing of
             backslashes. This is often preferable for readability when
             writing regexes, therefore I almost always use raw strings
             in these cases, even if the particular regexp has no
             backslash (it may need one later...).

             All the details here:

               http://docs.python.org/ref/strings.html
               http://docs.python.org/lib/re-syntax.html
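
To illustrate how such a prefix filter behaves, here is a minimal sketch.
It is not part of the script: the helper name may_be_filtered() and the
/usr/lib sample path are made up for the example.

```python
import re

filter_in_regexes = (r"^/usr/share/texmf-texlive",
                     r"^/usr/share/doc/texlive")
# Compile once into a list, so the regexes can be reused for every path.
filter_in = [re.compile(regex) for regex in filter_in_regexes]

def may_be_filtered(full_path):
    """Hypothetical helper: True if 'full_path' matches at least one regex."""
    return any(rec.match(full_path) for rec in filter_in)

# Paths under the two TeX Live trees are candidates for filtering...
print(may_be_filtered("/usr/share/texmf-texlive/tex/latex/frenchle/french.ldf"))
# ... whereas anything elsewhere is always printed.
print(may_be_filtered("/usr/lib/texmf/example.cnf"))

# Raw strings leave backslashes alone: r"\." is two characters long.
print(len(r"\."))
```

A path that fails this test is never hidden, whatever the other
conditions say.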

The precise algorithm is described in the docstrings for
dump_filtered_section() and dirnames_are_equivalent(), which I'll paste
here:

dump_filtered_section(output, sec, other_sec, filter_in=None):

    Dump section 'sec' in a filtered way, depending on 'other_sec'.

    The algorithm used for the filtering is the following one:

    For every file entry in 'sec':
      if:

        (1) either 'filter_in' is None, or the full path of the entry
            (dirname + basename) matches at least one of the compiled
            regular expressions in 'filter_in', and

        (2) there is a corresponding entry in 'other_sec' with the same file
            basename, and

        (3) the dirnames of these two entries are considered equivalent by
            dirnames_are_equivalent() and

        (4) both entries have the same mode, owner and group

        then:
          do nothing

        else:
          print the file entry unmodified.

    Notes:

      (a) The condition (1), when 'filter_in' is not None, ensures that we
          don't filter out changes that ought to be noticed (e.g., for TeX
          Live, we typically want to filter out those files which fulfill the
          other conditions only if they appear under /usr/share/texmf-texlive/
          or /usr/share/doc/texlive*, but not under /usr/lib/!).

      (b) 'sec' and 'other_sec' typically correspond to those parts of
          debdiff's output labeled "Files in first .deb but not in second"
          and "Files in second .deb but not in first".
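
To make conditions (1) to (4) concrete, here is a condensed sketch of the
same test. It is a simplified illustration, not the script's actual code:
the function name is_filtered_out() and the way the entries of 'other_sec'
are passed as a plain list are assumptions made for this example.

```python
import os
import re

def is_filtered_out(entry, other_entries, filter_in=None):
    """Hypothetical condensed form of conditions (1) to (4)."""
    full_path = os.path.join(entry["dirname"], entry["name"])
    # (1) the full path must match one of the regexes, if any were given
    if filter_in is not None and not any(r.match(full_path) for r in filter_in):
        return False
    for other in other_entries:
        if (other["name"] == entry["name"]                          # (2)
                and (os.path.dirname(other["dirname"])
                     == os.path.dirname(entry["dirname"]))          # (3)
                and other["mode"] == entry["mode"]                  # (4)
                and other["owner and group"] == entry["owner and group"]):
            return True
    return False

filter_in = [re.compile(r"^/usr/share/texmf-texlive")]
old = {"name": "french.ldf", "mode": "-rw-r--r--",
       "owner and group": "root/root",
       "dirname": "/usr/share/texmf-texlive/tex/latex/frenchle"}
# Same file, but the last directory component was renamed.
new = dict(old, dirname="/usr/share/texmf-texlive/tex/latex/le")

print(is_filtered_out(old, [new], filter_in))
# A mode change is a significant difference, so the entry is kept.
print(is_filtered_out(old, [dict(new, mode="-rwxr-xr-x")], filter_in))
```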

dirnames_are_equivalent(dirname1, dirname2):

    Tell whether two dirnames are to be considered equivalent by
    dump_filtered_section().

    Current implementation: two dirnames are considered equivalent if, and
    only if, they are equal or only differ in the last component.
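
A minimal sketch of this comparison, assuming (as appears to be the case
in debdiff output) that the dirnames carry no trailing slash. The function
name same_parent() is made up for the example.

```python
import os.path

def same_parent(dirname1, dirname2):
    """Illustrative: True if equal or differing only in the last component."""
    return os.path.dirname(dirname1) == os.path.dirname(dirname2)

# Only the last component differs (frenchle vs. le): equivalent.
print(same_parent("/usr/share/doc/texlive-doc/latex/frenchle",
                  "/usr/share/doc/texlive-doc/latex/le"))
# An earlier component differs (tex vs. doc): not equivalent.
print(same_parent("/usr/share/texmf-texlive/tex/latex/le",
                  "/usr/share/texmf-texlive/doc/latex/le"))
```

With a trailing slash the result would change, because
os.path.dirname("/a/b/") returns "/a/b" rather than "/a".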


Usage: simply call the script with debdiff's output on stdin. The
       filtered output is written to stdout.

Tests: performed on the texlive-lang-french.debdiff file you uploaded to
       http://www.tug.org/texlive/Debian/tl2007/ a few days ago.
       Unfortunately, the debdiffs aren't there anymore, so I couldn't
       test on other debdiffs.

       As a consequence, there might be some element of debdiff's output
       syntax that the script doesn't know. Just tell me if you
       encounter a problem.

       FYI, the diff between texlive-lang-french.debdiff and the
       corresponding filtered output is the following:

7d6
< -rw-r--r--  root/root   /usr/share/doc/texlive-lang-french/latex/frenchle/FAQ.pdf.gz
9d7
< -rw-r--r--  root/root   /usr/share/doc/texlive-lang-french/latex/frenchle/frenchle.pdf.gz
33d30
< -rw-r--r--  root/root   /usr/share/texmf-texlive/tex/latex/frenchle/french.ldf
35,36d31
< -rw-r--r--  root/root   /usr/share/texmf-texlive/tex/latex/frenchle/frenchle.ldf
< -rw-r--r--  root/root   /usr/share/texmf-texlive/tex/latex/frenchle/frenchle.sty
47d41
< lrwxrwxrwx  root/root   /usr/share/doc/texlive-doc/latex/frenchle/FAQ.pdf.gz -> ../../../texlive-lang-french/latex/frenchle/FAQ.pdf.gz
49d42
< lrwxrwxrwx  root/root   /usr/share/doc/texlive-doc/latex/frenchle/frenchle.pdf.gz -> ../../../texlive-lang-french/latex/frenchle/frenchle.pdf.gz
74,75d66
< -rw-r--r--  root/root   /usr/share/doc/texlive-lang-french/latex/le/FAQ.pdf.gz
< -rw-r--r--  root/root   /usr/share/doc/texlive-lang-french/latex/le/frenchle.pdf.gz
77,79d67
< -rw-r--r--  root/root   /usr/share/texmf-texlive/tex/latex/le/french.ldf
< -rw-r--r--  root/root   /usr/share/texmf-texlive/tex/latex/le/frenchle.ldf
< -rw-r--r--  root/root   /usr/share/texmf-texlive/tex/latex/le/frenchle.sty
81,82d68
< lrwxrwxrwx  root/root   /usr/share/doc/texlive-doc/latex/le/FAQ.pdf.gz -> ../../../texlive-lang-french/latex/le/FAQ.pdf.gz
< lrwxrwxrwx  root/root   /usr/share/doc/texlive-doc/latex/le/frenchle.pdf.gz -> ../../../texlive-lang-french/latex/le/frenchle.pdf.gz

       which shows exactly which files are filtered out by the script.

#! /usr/bin/env python

# tex-filter-debdiff.py --- Filter debdiff output
# Copyright (c) 2007 Florent Rougon
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; version 2 dated June, 1991.
#
# This program is distributed in the hope that it will be useful, but
# WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
# General Public License for more details.
# 
# You should have received a copy of the GNU General Public License
# along with this program; see the file COPYING. If not, write to the
# Free Software Foundation, Inc., 51 Franklin St, Fifth Floor,
# Boston, MA  02110-1301 USA.

import sys, os, re

# The filtering will only be done for files whose path (as reported in the
# debdiff output) matches one of the following regular expressions.
#
# Note: "^/usr/share/doc/texlive" matches /usr/share/doc/texlive-lang-french
#       and many others...
filter_in_regexes = (r"^/usr/share/texmf-texlive",
                     r"^/usr/share/doc/texlive")

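# Compile into a list so that it can be iterated over for every file entry
# (a lazy map object would be exhausted after the first pass on Python 3).
filter_in = [re.compile(regex) for regex in filter_in_regexes]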


# Regular expressions matching the "interesting" sections (those that will
# be filtered).
first_deb_sec_rec = re.compile(r"^Files in first \.deb but not in second$")
second_deb_sec_rec = re.compile(r"^Files in second \.deb but not in first$")


class error(Exception):
    pass

class ParseError(error):
    pass

class ProgramError(error):
    "Exception raised for obvious bugs (when an assertion is false)."


def split_input_into_sections(f):
    sections = []
    # How sections are underlined in the input
    sec_delim = re.compile(r"^-+$")

    section_number = 1

    while True:
        section = {"name": None,
                   "lines": []}
        # Will store the previous line (needed to remember section titles)
        prev_line = None

        # Line number within a section
        line_num = 1
        while True:
            line = f.readline()
            if line in ('', '\n'):
                break

            mo = sec_delim.match(line)
            # Section delimiters are only considered as such iff found on
            # the second line of a section.
            if mo and (line_num == 2):
                if prev_line is None:
                    raise ParseError(
                        "Section %u, line %u (within the section): section "
                        "delimiter not preceded by a section title"
                        % (section_number, line_num))
                
                # The section title will be stored in section["name"];
                # therefore, remove it from section["lines"], i.e., start over
                # since we are on the second line.
                section["lines"] = []
                # Strip the trailing newline before storing the section title
                section["name"] = prev_line[:-1]
                # Store the section delimiter, in order to reproduce the
                # debdiff output verbatim.
                section["title delimiter"] = mo.group(0)
            else:
                # Strip the trailing newline before storing the line
                section["lines"].append(line[:-1])

            prev_line = line
            line_num += 1

        sections.append(section)
        if line == '':                  # EOF
            break
        section_number += 1


    return sections


def locate_interesting_sections(sections):
    first_deb_sec, second_deb_sec = None, None

    for section in sections:
        if section["name"] is not None:
            if first_deb_sec_rec.match(section["name"]):
                first_deb_sec = section
            elif second_deb_sec_rec.match(section["name"]):
                second_deb_sec = section

    return first_deb_sec, second_deb_sec
            

def index_files(section):
    """Build a dictionary whose keys are the file basenames.

    This makes it easy to find everything pertaining to a file given its
    basename.

    """
    if "files" not in section:
        section["files"] = {}
    if "ordered list" not in section:
        # We are going to store in section["ordered list"] a list made of
        # the file entries for every file in section["lines"], in the same
        # order as they appear in section["lines"]. This will be quite useful
        # in dump_filtered_section() to preserve the order of files in the
        # debdiff output.
        section["ordered list"] = []

    d = section["files"]                # 'd' for dictionary

    # The point of having the "sep_after_mode", "sep_after_owner_and_group"
    # and "symlink_arrow" groups is to recreate exactly the debdiff output
    # for each entry (in particular, use the same number of spaces/tabs
    # that were used by debdiff between the various fields).
    nonsymlink_line_rec = re.compile(
        r"^(?P<mode>[^l][^ \t]+)(?P<sep_after_mode>[ \t]+)"
        r"(?P<owner_and_group>[^ \t]+)(?P<sep_after_owner_and_group>[ \t]+)"
        r"(?P<path>.+?)$")
    symlink_line_rec = re.compile(
        r"^(?P<mode>l[^ \t]+)(?P<sep_after_mode>[ \t]+)"
        r"(?P<owner_and_group>[^ \t]+)(?P<sep_after_owner_and_group>[ \t]+)"
        r"(?P<link>.+?)(?P<symlink_arrow> -> )(?P<target>.+)$")

    for line in section["lines"]:
        if line.startswith("l"):
            mo = symlink_line_rec.match(line)
            if not mo:
                raise ParseError(
                    "Looks like a line for a symlink, but doesn't match the "
                    "corresponding regexp:\n\n  '%s'" % line)

            name = os.path.basename(mo.group("link"))
            if name not in d:
                # First time we find a file with basename 'name' -> create
                # a new entry for it.
                d[name] = []

            entry = \
                  {"name": name,
                   "type": "symlink",
                   "dirname": os.path.dirname(mo.group("link")),
                   "mode": mo.group("mode"),
                   "owner and group": mo.group("owner_and_group"),
                   "target": mo.group("target"),
                   "separator after mode": mo.group("sep_after_mode"),
                   "separator after owner and group":
                   mo.group("sep_after_owner_and_group"),
                   "symlink arrow": mo.group("symlink_arrow")}
        else:
            mo = nonsymlink_line_rec.match(line)
            if not mo:
                raise ParseError(
                    "Looks like a line for a file that is not a symlink, but "
                    "doesn't match the corresponding regexp:\n\n  '%s'" % line)

            name = os.path.basename(mo.group("path"))
            if name not in d:
                # First time we find a file with basename 'name' -> create
                # a new entry for it.
                d[name] = []

            entry = \
                  {"name": name,
                   "type": "not a symlink",
                   "dirname": os.path.dirname(mo.group("path")),
                   "mode": mo.group("mode"),
                   "owner and group": mo.group("owner_and_group"),
                   "separator after mode": mo.group("sep_after_mode"),
                   "separator after owner and group":
                   mo.group("sep_after_owner_and_group")}

        # Record all this precious data...
        # ... first, in section["files"][name]:
        d[name].append(entry)
        # and second, append a pointer to the file entry to
        # section["ordered list"], which will allow us to reproduce
        # debdiff's output correctly, preserving the order:
        section["ordered list"].append(entry)


def write_section_title(output, section):
    if section["name"] is not None:
        output.write("%s\n%s\n" % (section["name"],
                                   section["title delimiter"]))


def dump_unfiltered_section(output, section):
    """Dump a section verbatim."""
    # Section title, if any
    write_section_title(output, section)

    # Section contents
    for line in section["lines"]:
        output.write("%s\n" % line)


def dirnames_are_equivalent(dirname1, dirname2):
    """Tell whether two dirnames are to be considered equivalent by dump_filtered_section().

    Current implementation: two dirnames are considered equivalent if, and
    only if, they are equal or only differ in the last component.
    
    """
    return (os.path.dirname(dirname1) == os.path.dirname(dirname2))


def output_file_entry(output, entry):
    # Build the line for this file entry
    line = []
    for component in ("mode", "separator after mode", "owner and group",
                      "separator after owner and group"):
        line.append(entry[component])

    full_path = os.path.join(entry["dirname"], entry["name"])

    if entry["type"] == "symlink":
        line.append(full_path)
        for component in ("symlink arrow", "target"):
            line.append(entry[component])
    elif entry["type"] == "not a symlink":
        line.append(full_path)
    else:
        # Note: no 'section' variable is in scope here, so the message only
        # mentions the entry itself.
        raise ProgramError(
            "Unexpected entry type '%s' for '%s'." %
            (entry["type"], full_path))

    line.append('\n')
    output.write(''.join(line))


def dump_filtered_section(output, sec, other_sec, filter_in=None):
    """Dump section 'sec' in a filtered way, depending on 'other_sec'.

    The algorithm used for the filtering is the following one:

    For every file entry in 'sec':
      if:

        (1) either 'filter_in' is None, or the full path of the entry
            (dirname + basename) matches at least one of the compiled
            regular expressions in 'filter_in', and

        (2) there is a corresponding entry in 'other_sec' with the same file
            basename, and

        (3) the dirnames of these two entries are considered equivalent by
            dirnames_are_equivalent() and

        (4) both entries have the same mode, owner and group

        then:
          do nothing

        else:
          print the file entry unmodified.

    Notes:

      (a) The condition (1), when 'filter_in' is not None, ensures that we
          don't filter out changes that ought to be noticed (e.g., for TeX
          Live, we typically want to filter out those files which fulfill the
          other conditions only if they appear under /usr/share/texmf-texlive/
          or /usr/share/doc/texlive*, but not under /usr/lib/!).

      (b) 'sec' and 'other_sec' typically correspond to those parts of
          debdiff's output labeled "Files in first .deb but not in second"
          and "Files in second .deb but not in first".

    """
    # Section title, if any
    write_section_title(output, sec)

    # Section contents
    for entry in sec["ordered list"]:
        name, dirname = (entry["name"], entry["dirname"])
        full_path = os.path.join(dirname, name)

        passes_through_regexp_filter = False

        if filter_in is None:
            passes_through_regexp_filter = True
        else:
            for regexp in filter_in:
                if regexp.match(full_path):
                    passes_through_regexp_filter = True
                    break

        filtered_out = False

        if passes_through_regexp_filter:
            if name in other_sec["files"]:
                for other in other_sec["files"][name]:
                    # Note: if both entries ('entry' and 'other') have the
                    # same mode, then they are necessarily of the same type
                    # (symlink / not symlink). Therefore, it is useless to
                    # compare the types, since we already compare the modes.
                    if dirnames_are_equivalent(dirname, other["dirname"]) \
                           and (entry["mode"] == other["mode"]) \
                           and (entry["owner and group"] \
                                == other["owner and group"]):
                        filtered_out = True
                        break

        if not filtered_out:
            output_file_entry(output, entry)


def main():
    sections = split_input_into_sections(sys.stdin)
    # Locate the sections "Files in second .deb but not in first"
    # and                 "Files in first .deb but not in second"
    first_deb_sec, second_deb_sec = locate_interesting_sections(sections)

    # It is only useful to index the "interesting" sections if both of them
    # are present (otherwise, we'll just dump them verbatim).
    if (first_deb_sec is not None) and (second_deb_sec is not None):
        for section in first_deb_sec, second_deb_sec:
            index_files(section)

    # No section separator (newline) should be printed before the first section
    print_section_separator = False

    output = sys.stdout

    for section in sections:
        if print_section_separator:
            output.write('\n')
        else:
            print_section_separator = True

        if section["name"] is None:
            dump_unfiltered_section(output, section)
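        elif first_deb_sec_rec.match(section["name"]):
            if second_deb_sec is not None:
                dump_filtered_section(output, first_deb_sec, second_deb_sec,
                                      filter_in)
            else:
                # Nothing to compare against: dump the section verbatim
                dump_unfiltered_section(output, section)
        elif second_deb_sec_rec.match(section["name"]):
            if first_deb_sec is not None:
                dump_filtered_section(output, second_deb_sec, first_deb_sec,
                                      filter_in)
            else:
                # Nothing to compare against: dump the section verbatim
                dump_unfiltered_section(output, section)
        else:
            dump_unfiltered_section(output, section)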
        
                
    sys.exit(0)

if __name__ == "__main__": main()

Have a nice week-end!

-- 
Florent
