
Re: PyPI and debian/watch



> On Feb 4, 2015, at 10:07 AM, Barry Warsaw <barry@debian.org> wrote:
> 
> On Feb 04, 2015, at 08:08 AM, Donald Stufft wrote:
> 
>> If it gets implemented it'll live at /uscan/ because it exists primarily to
> >> work around the deficiencies that exist in uscan (particularly the difficulty
>> in ignoring url fragments). Everyone else should just use the URLs at /simple/
>> which most systems use with no problem because they can parse the URLs and
>> ignore the URL fragments (or use them for verifying the hash if need be).
> 
> I'll just note that I've found the fragments inconvenient in other settings
> too.  They aren't very user friendly since (IMHO) they add noise that users
> cutting and pasting urls generally don't care about.  They also "feel" odd in
> that the md5 checksum doesn't fit what I think of as a typical fragment.
> Traditionally, they are used to point to an anchor (sub-resource) within the
> parent resource.  That's not the case here.
> 
> http://en.wikipedia.org/wiki/Fragment_identifier
> 
> has this to say:
> 
> """
> Several proposals have been made for fragment identifiers for use with plain
> text documents (which cannot store anchor metadata), or to refer to locations
> within HTML documents in which the author has not used anchor tags:
> 
> As of September 2012 the Media Fragments URI 1.0 (basic) is a W3C
> Recommendation.[12]
> 
> The Python Package Index appends the MD5 hash of a file to the URL as a
> fragment identifier.[13] If MD5 were unbroken (it is a broken hash function),
> it could be used to ensure the integrity of the package.
> 
> https://pypi.python.org ... zodbbrowser-0.3.1.tar.gz#md5=38dc89f294b24691d3f0d893ed3c119c
> """
> 
> So even without the uscan incompatibility (which is just one of the two
> factors leading to a noisy d/watch file), I think there's some value in
> fragment-less URLs.  I understand the checksum isn't being used
> cryptographically here, but maybe thinking ahead to the use of more secure
> algorithms in the future can lead to a more flexible design:
> 
> Legacy (if it indeed needs to be kept for backward compatibility):
> 
> /simple/foo-x.y.z#md5=blah
> 
> then:
> 
> /simple/plain/foo-x.y.z
> /simple/sha256/foo-x.y.z#sha256=blah
> 

Long term, PyPI is going to move away from trying to cram a bunch of information
into a hyperlink and relying on HTML parsing, and instead move the installer
APIs over to something more suited to the task, most likely JSON. At that point
we'll be able to design the API to be more extensible in this regard, since
we'll be able to do something like:

    {
        ...,
        "hashes": {
            "md5": "...",
            "sha256": "..."
        },
        ...
    }

and the client can simply select which hash it wants to use. Long term, the
/simple/ API on PyPI will exist only for legacy purposes, so people still using
versions of pip, easy_install, etc. that only support /simple/ will still be
able to access PyPI.
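
To make the "client selects a hash" idea concrete, here is a minimal sketch in
Python. It assumes a "hashes" mapping shaped like the example above; the
preference order and the function names (select_hash, verify) are hypothetical
and not part of any PyPI API.

```python
import hashlib

# Hypothetical client-side preference order, strongest first.
# Nothing here is specified by PyPI; it is only an illustration.
PREFERRED = ("sha256", "md5")


def select_hash(hashes, preferred=PREFERRED):
    """Return (algorithm, digest) for the first preferred algorithm offered."""
    for algo in preferred:
        digest = hashes.get(algo)
        if digest:
            return algo, digest
    raise ValueError("no supported hash algorithm offered")


def verify(data, hashes, preferred=PREFERRED):
    """Check downloaded bytes against the best hash the server offered."""
    algo, expected = select_hash(hashes, preferred)
    return hashlib.new(algo, data).hexdigest() == expected
```

A client that only understands md5 would pass preferred=("md5",), and a newer
one can put stronger algorithms first without the server changing anything.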

That doesn't really help uscan at all, though, since as far as I know uscan has
no ability to parse JSON.

As far as copy/pasting goes, the /simple/ API is an API; it's not designed to
be consumed by humans but by software. The UI-centric pages at /pypi are the
ones designed for humans (although PyPI currently puts the hash there as well,
Warehouse (aka PyPI 2.0) does not).

The problem here really lies within uscan making assumptions about the
structure of URLs and the content of the HTML on those pages. From looking at
https://wiki.debian.org/debian/watch I'm guessing that it inherited those
assumptions from when FTP was the more common way to distribute files instead
of HTTP(S). That same page also mentions that qa.debian.org runs a number of
"redirectors" for sites like SourceForge and GitHub so perhaps a better answer
is for Debian QA to run a redirector for PyPI instead of PyPI implementing a
redundant API endpoint with a slightly different layout and HTML primarily for
Debian.
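
For what it's worth, the core of such a redirector is small: read the links
from a /simple/ page and split off the fragment. The sketch below is only an
illustration using the Python standard library; the class and function names
are hypothetical, and the HTML and URL layout it would be fed are assumptions
rather than a description of what PyPI or qa.debian.org actually serve.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collect (url, fragment) pairs from every <a href=...> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            # Resolve relative links, then separate the "#md5=..." part.
            url, fragment = urldefrag(urljoin(self.base_url, href))
            self.links.append((url, fragment))


def plain_links(html, base_url):
    """Return fragment-less URLs alongside the fragment that was removed."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

The fragment is returned rather than discarded so that a consumer which does
want the hash can still use it, while something like uscan only sees the
plain URL.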

One note in that regard: the /simple/ indexes don't include the .asc files
when someone has uploaded them, whereas the old URLs that debian/watch used
did. If that is needed, we could easily add them to the /simple/ pages.

---
Donald Stufft
PGP: 7C6B 7C5D 5E2B 6356 A926 F04F 6E3C BCE9 3372 DCFA

