
Re: [Distutils] formencode as .egg in Debian ??



At 08:12 PM 11/23/2005 +0100, Matthias Urlichs wrote:
Hi,

Phillip J. Eby:
> I'm thinking that perhaps I should add an option like
> '--single-version-externally-managed' to the install command so that you
> can indicate that you are installing for the sake of an external package
> manager that will manage conflicts and uninstallation needs.  This would
> then allow installation using the .egg-info form and no .pth files.
>
You might shorten that option a bit. ;-)  I agree that this would be a
good option to have.

I try to use very long names for options that can have damaging effects if used indiscriminately. A project that's installed the "old-fashioned way" (which is what this does, apart from adding .egg-info) is hard to uninstall and may overwrite other projects' files. So, it is only safe to use if the files are being managed by some external package manager, and it further only works for a single installed version at a time. So the name is intended to advertise these facts, and to discourage people who are just reading the option list from trying it out to see what it does. :)


> >People will often inspect sys.path to understand where Python
> >is looking for their code.
>
> As I pointed out, eggs give you much better information on this.

The .egg metadata does. That, as you say, is distinct from the idea of
packaging the .egg as a zip file. Most likely, one that includes .pyc
files which were byte-compiled with different file paths. That causes no
problems whatsoever ... until you get obscure ideas like trying to step
through the code with pdb, or opening it in your editor to insert an
assertion or a printf, trying to figure out why your code breaks.  :-/

This is actually what the .egg-info mode was designed for. That is, doing development of the project. A setuptools-based project can run "setup.py develop" to add the project's source directory to sys.path, after generating an .egg-info directory in the project source if necessary. This allows you to do all your development right in your source checkout, and of course all the file paths are just fine, and the egg metadata is available at runtime. You can then deploy the project as an .egg file or directory.

(Also, for the .egg directory format, note that easy_install recompiles the .pyc/.pyo files so their paths *do* point to the .egg contents instead of the original build paths. The issues with zipfiles and precompiled .pyc files are orthogonal to anything about setuptools, eggs, etc.; they will bite you in today's Python no matter what's in the zipfile or who precompiled the .pyc files. I do have some ideas for fixing both of these problems in future versions of Python, but they're rather off-topic for all the lists we are currently talking on.)
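
A small illustration of why stale paths in precompiled bytecode get in the way of debugging: the filename is recorded in the code object when the source is compiled, and it stays the same no matter where the compiled result ends up. The path used here is a made-up placeholder, not a real build location.

```python
# BUILD_PATH is a hypothetical stand-in for the directory a project was
# built in; nothing needs to exist at that location for this demo.
BUILD_PATH = "/tmp/build/hypothetical/project/module.py"

source = "def answer():\n    return 42\n"

# compile() stores the given filename inside the resulting code object,
# which is also what ends up serialized in a .pyc file.
code = compile(source, BUILD_PATH, "exec")

namespace = {}
exec(code, namespace)

# Tracebacks and pdb look up source lines through this recorded path.
recorded_path = namespace["answer"].__code__.co_filename
print(recorded_path)
```

The function still runs normally; only tools that try to open the recorded path to show source lines are affected.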


That's not exactly negotiable. Debian has a packaging format which
resolves generic installation dependencies on its own. Therefore it
cannot depend on Python-specific .egg metadata. Therefore we need a way
to translate .egg metadata to Debian metadata.

Yes, that's precisely what I was suggesting would be helpful. As Vincenzo already mentioned, the egg metadata is a good starting point for defining the Debian metadata. I'm obviously not proposing changing Debian's metadata system. Well, maybe it wasn't *obvious* that I wasn't proposing that, but in any case I'm not. :)


> I remain concerned about how such packages will work with namespace
> packages, since namespace packages mean that two different distributions
> may be supplying the same __init__.py files, and some package managers may
> not be able to deal with two system packages (e.g. Debian packages, RPMs,
> etc.) supplying the same file, even if it has identical contents in each
> system package.
>
Debian packaging has a method to explicitly rename a different package's
file if it conflicts with yours ("dpkg-divert"; it does _not_ depend on
which package gets installed first). IMHO that's actually superior to
randomly executing only one of these files, since you are aware that
there is a conflict (the second package simply doesn't install if you
don't fix it), and thus can handle it intelligently.

The two kinds of possible conflicts are namespace packages, and project-level resources.

A namespace package is more like a Java package than a traditional Python package. A Java package can be split across multiple directories or jar files; it doesn't have to be all in one place. Thus you can have lots of jars with org.apache.* classes in them.

Python, however, requires packages to have an __init__.py file, and by default the entire package is assumed to be in the directory containing the __init__.py file. Python 2.3 added the 'pkgutil' module to the standard library, which lets you create a Java-style "namespace package" by automatically combining package directories found on different parts of sys.path. So, if in one sys.path directory you had a 'zope.interface' package, and in another you had a 'zope.publisher' package, these would be combined, instead of the first one being treated as if it were all of 'zope.*', and the second being completely ignored. However, *each* sys.path directory that contributes a subpackage needs its own zope/__init__.py file for this to work.
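
A minimal sketch of the pkgutil mechanism, assuming two temporary directories stand in for two separate sys.path entries. The namespace name 'acme_ns' and the helper make_project() are hypothetical, chosen so the demo cannot clash with an installed package. Each directory gets its own copy of the namespace __init__.py, which calls pkgutil.extend_path().

```python
import importlib
import os
import sys
import tempfile

NAMESPACE = "acme_ns"  # hypothetical namespace package name

# The snippet every copy of the namespace __init__.py contains.
INIT_SOURCE = (
    "from pkgutil import extend_path\n"
    "__path__ = extend_path(__path__, __name__)\n"
)


def make_project(root, subpackage):
    """Create root/acme_ns/__init__.py plus root/acme_ns/<subpackage>/."""
    ns_dir = os.path.join(root, NAMESPACE)
    sub_dir = os.path.join(ns_dir, subpackage)
    os.makedirs(sub_dir)
    with open(os.path.join(ns_dir, "__init__.py"), "w") as f:
        f.write(INIT_SOURCE)
    with open(os.path.join(sub_dir, "__init__.py"), "w") as f:
        f.write("NAME = %r\n" % subpackage)
    return root


# Two independent directories, each holding one half of the namespace.
first = make_project(tempfile.mkdtemp(), "interface")
second = make_project(tempfile.mkdtemp(), "publisher")
sys.path[:0] = [first, second]
importlib.invalidate_caches()

# Both subpackages import, although they live in different directories.
interface = importlib.import_module(NAMESPACE + ".interface")
publisher = importlib.import_module(NAMESPACE + ".publisher")
print(interface.NAME, publisher.NAME)
```

Without the extend_path() call, only the directory found first on sys.path would be searched, and the import of the second subpackage would fail.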

So, the issue here is that if you install two projects that contain zope.* packages into the *same* directory (e.g. site-packages), then there will be two different zope/__init__.py files installed at the same location, even though they will have the same content (a short snippet of code to activate the namespace mechanism via the pkgutil module or via setuptools' pkg_resources module).
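
To make the collision concrete, here is a rough sketch that takes a file list per project and reports every path claimed by more than one of them. The function name find_shared_files() and the file lists are illustrative assumptions, not output from any real packaging tool.

```python
from collections import defaultdict


def find_shared_files(manifests):
    """Map each path claimed by several projects to its sorted claimants."""
    owners = defaultdict(list)
    for project, paths in manifests.items():
        for path in paths:
            owners[path].append(project)
    return {
        path: sorted(names)
        for path, names in owners.items()
        if len(names) > 1
    }


# Hypothetical file lists, relative to site-packages.
manifests = {
    "zope.interface": ["zope/__init__.py", "zope/interface/__init__.py"],
    "zope.publisher": ["zope/__init__.py", "zope/publisher/__init__.py"],
}

conflicts = find_shared_files(manifests)
print(conflicts)
```

Only the namespace __init__.py shows up as shared; the subpackage files never collide.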

To date, there are only a small number of these namespace packages in existence, but over time they will represent a fairly large number of *projects*. As I go through the breakup of the PEAK meta-project into separate components, I expect to have a dozen or so projects contributing to the peak.* and peak.util.* namespace packages. Ian Bicking's Paste meta-project has a paste.* namespace package spread out in two or three subprojects so far. There has been some off-and-on discussion about whether Zope 3 will move to eggs instead of their own zpkg tool (which has issues on Windows and Mac OS that eggs do not), and in that case they will likely have a couple dozen components in zope.* and zope.app.*.

So, for the long-term solution of wrapping Python projects in Debian packages, the namespace issue needs to be addressed, because renaming each project's zope/__init__.py or whatever isn't going to work very well. There has to be one __init__.py file, or else such projects need to be installed in their own .egg directories or zipfiles to avoid collisions.

The second collision issue with --single-version-externally-managed is top-level resource collisions. Some existing projects that are not egg-based manipulate their install_data operation in such a way that they create files or directories in site-packages directly, rather than inside their own package data structures. Setuptools neither encourages nor discourages this, because it doesn't cause any problems for any egg layout except the .egg-info one -- and the .egg-info one was originally designed to support development, not deployment. In the development scenario, any such files are isolated to the source tree, and for deployment the .egg file or directory keeps each project's contents completely isolated.

So, what I'm saying is that putting all projects in the same directory (as all "traditional" Python installations do) has some inherent limitations with respect to namespace packages and top-level resources, and these limitations are orthogonal to the question of egg metadata. The .egg formats were created to solve these problems (including clean upgrades, multi-version support, and uninstallation in scenarios where a package manager isn't usable), and so the other features that they enable will be increasingly popular as well.

In other words, as people make more use of PyPI (because they now really *can*), more people will put things on PyPI, and the probability of package name conflicts will increase more rapidly. The natural response will be a desire to claim uber-project or organizational names (like paste.*, peak.*, zope.*, etc.), putting individual projects under sub-package names. (For example, someone has already argued that I should move RuleDispatch's 'dispatch' package to 'peak.dispatch' rather than keeping the top-level 'dispatch' name all to myself.)

So, I'm just saying that using the --single-version-externally-managed approach requires that a package manager like Debian grow a way to handle these namespace packages safely and sanely. One possibility is to create dummy packages that contain only the __init__.py file for that namespace, and then have the real packages all depend on the dummy package, while omitting the __init__.py. So, perhaps each project containing a peak.util.* subpackage would depend on a 'python2.4-peak.util-namespace' package, which in turn would depend on a 'python2.4-peak-namespace' package. It's rather ugly, to say the least, but it would work as long as upstream developers never put anything in namespace __init__.py files except for the pkg_resources.declare_namespace() call.
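
A sketch of how the chain of dummy packages could be derived for a nested namespace, following the 'python2.4-peak.util-namespace' naming used above. The function name and the "-namespace" suffix convention are assumptions for illustration; no existing Debian tool is implied.

```python
def namespace_package_chain(namespace, prefix="python2.4"):
    """Return (dummy package, its dependency or None) pairs, innermost first."""
    parts = namespace.split(".")
    names = [
        "%s-%s-namespace" % (prefix, ".".join(parts[:depth]))
        for depth in range(len(parts), 0, -1)
    ]
    # Each dummy package depends on the one for its parent namespace;
    # the outermost namespace has no parent.
    return list(zip(names, names[1:] + [None]))


chain = namespace_package_chain("peak.util")
for package, depends_on in chain:
    print(package, "->", depends_on)
```

A project shipping a peak.util.* subpackage would then depend only on the first entry and pick up the rest transitively.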

(By the way, since part of an egg's metadata lists what namespace packages the project contains code or data for, the generation of these dependencies can be automated as part of the egg-to-deb conversion process.)
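
As a hedged sketch of that automation: to my knowledge setuptools records the list in a file named namespace_packages.txt inside the .egg-info directory, one name per line. The helper names, the sample directory, and the Debian naming scheme below are assumptions for illustration only.

```python
import os
import tempfile


def read_namespace_packages(egg_info_dir):
    """Return the namespace packages listed in an .egg-info directory."""
    path = os.path.join(egg_info_dir, "namespace_packages.txt")
    if not os.path.exists(path):
        return []  # the project declares no namespace packages
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


def namespace_dependencies(egg_info_dir, prefix="python2.4"):
    """Translate declared namespaces into hypothetical dummy-package names."""
    return [
        "%s-%s-namespace" % (prefix, name)
        for name in read_namespace_packages(egg_info_dir)
    ]


# Build a throwaway .egg-info directory to exercise the helpers.
egg_info = os.path.join(tempfile.mkdtemp(), "Example.egg-info")
os.makedirs(egg_info)
with open(os.path.join(egg_info, "namespace_packages.txt"), "w") as f:
    f.write("peak\npeak.util\n")

deps = namespace_dependencies(egg_info)
print(deps)
```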

Or, of course, the .egg directory approach can also be used to bypass all collision issues, but this brings sys.path and .pth files back into the discussion. On the other hand, it can possibly be assumed that anything in a namespace package can be used only after a require() (either implicit or explicit), so maybe the .pth can be dropped for projects with namespace packages. These are possibilities worth considering, since they avoid the ugliness of creating dummy packages just to hold namespace __init__.py files.


