Saturday, May 28, 2005

EasyInstall - A new era in Python package management?

Okay, maybe that's a bit bold. But the new EasyInstall command-line tool can take any number of distutils source package URLs (or local filenames), download them, build them, and install them with or without a .pth file. Better yet, you can also easily upgrade, downgrade, and/or uninstall packages, as well as select desired package versions at runtime.

This is the very first alpha release, so there might be some rough edges. I plan to start systematically working my way through PyPI to test it with different packages. (Well, first I plan to try cases that are likely to be tough.)

EasyInstall doesn't absolve you from needing to have a C compiler, though, if you're installing a package that includes extensions. However, if someone has packaged a Python Egg for that package for your platform, you can simply run EasyInstall on the .egg file's URL (or local filename) to download and install it on your system. So, perhaps we'll soon see people packaging eggs for various platforms, and making those download URLs available.

EasyInstall is at this point still something of a sketch. An ideal version would probably let you search PyPI for the desired packages, without needing to know an exact download URL. Maybe it would have a GUI, too. Who knows? But it's open source, so feel free to create your own extensions. In the meantime, I'll continue to add features to both EasyInstall and the underlying "setuptools" package as I need them.

I also hope that package authors will consider making their packages more "egg-friendly" by using the 'pkg_resources' module supplied by setuptools, and especially by including dependency information in their source distributions (in a 'PackageName.egg-info/depends.txt' file) so that the pkg_resources dependency manager can automatically activate the right package versions at runtime. This is where things will start to get really different, because it will then be a lot easier for people to break up large packages into smaller packages with dependencies.

But before that can start happening, I really need to finish the Python Eggs developer documentation, which is still lacking a bit, contentwise. However, given the widespread interest in "CPAN clones" of late, I felt it was important to get a real package management solution out there right away, before people started trying to implement package managers based on PEP 262 or similar approaches that don't work -- at least, not as well as Python Eggs do -- for applications, multiple version support, or any number of other scenarios.

So, here's hoping that those of you who are working on CPAN clones will take a look at EasyInstall, and see whether you can save some work by importing a few classes from the easy_install and pkg_resources modules. But, if you find that what's there isn't helpful, I'd really like to hear from you. Similarly, if you've found a package that EasyInstall won't build correctly, please let me know via the Distutils-SIG mailing list.

Update: There was a problem with the original EasyInstall "eggsecutable" for Python 2.4 that I uploaded; it seems that the "-m" option to Python will only work with modules that are actual .py files, which means you can't use it inside of eggs. Darn. Oh well, the "eggsecutable" recipe for Python 2.3 also works with 2.4, so I just rebuilt the Python 2.4 egg to work the same way.

Wednesday, May 25, 2005

A "dirt simple" download-and-install CPAN clone

It seems that CPAN clones are in the air of late. Yesterday, Ian Bicking posted a comment here that he was working on an automated download facility, and today, there was a Planet Python post about Uraga, a CPAN clone some other folks are working on.

So it got me to thinking, how simple could this actually be? Bob Ippolito and I had talked about making something like that using eggs, where there would be some sort of manifest file that told you where to download things from. But it occurred to me today that there are already "manifest files" out there on the web that work just fine: web pages and directory listings. For example, SourceForge download pages contain a wealth of links to URLs containing "SomePackage-1.0.zip" and the like. Since Python egg filenames also include Python version and platform information, everything you need to find a file is right there in the links - just parse the HTML page and go.
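
Here's a minimal sketch of that "parse the HTML page and go" idea, using only the standard library. The filename pattern, the `LinkReader` class, and the `find_distributions` helper are all illustrative assumptions; this is not code from setuptools or EasyInstall, and real download pages will have messier names than the pattern allows for.

```python
import posixpath
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed naming convention: Name-Version[-tags].ext, where the version
# starts with a digit and tags carry things like "py2.4-win32" for eggs.
ARCHIVE_RE = re.compile(
    r"^(?P<name>.+?)-(?P<version>\d[^-]*)(?:-(?P<tags>.+))?"
    r"\.(?P<ext>tar\.gz|tar\.bz2|tgz|zip|egg)$"
)


class LinkReader(HTMLParser):
    """Collect (name, version, url) for links that look like distributions."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page they came from.
        url = urljoin(self.base_url, href)
        filename = posixpath.basename(urlparse(url).path)
        match = ARCHIVE_RE.match(filename)
        if match:
            self.found.append((match.group("name"), match.group("version"), url))


def find_distributions(html, base_url):
    """Return every distribution-looking link found in an HTML page."""
    reader = LinkReader(base_url)
    reader.feed(html)
    reader.close()
    return reader.found
```

Feeding it the text of a directory listing plus the listing's own URL gives back a list of tuples, which is all a later "finder" step would need to work from.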

To test the usefulness of this theory, I went to PyPI and randomly selected 43 packages, investigating their download links. Here are the results:
  • 15 packages had no download URL at all
  • 2 had a download URL that went to their homepage, with no direct links to downloadable files
  • 10 had a download URL that pointed directly to a .zip, .tgz, .tar.gz, or .tar.bz2 file with a specific version number in the filename
  • 10 had a download URL that went to an HTML directory listing with links to versioned files
  • 6 had a download URL that went to a "latest version" (i.e. no explicit version number) archive or .py file
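So, about half of the packages (the 20 with either a direct link to a versioned archive or a directory listing of versioned files) could have been processed by a spider hunting for specific versions. And about half of those (the 10 with directory listings) could easily add .egg files to their download listings. Very interesting.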

Of course, not every distributor of a package is going to want to mess with making eggs for different platforms, so a really useful tool is probably going to have to be able to download a source archive and build an egg from it.

Now you may be wondering, why build an egg? If you're going to have to build from source anyway, why not just install the package directly? Because eggs -- even unpacked ones -- let you keep multiple versions of a package on your system, and activate them at runtime.

So now I'm thinking, maybe there should also be an install_egg command for the distutils, that basically builds an egg and then installs it in site-packages (or wherever) for you. Then, we could use that with our hypothetical PyPI spider, to make a complete fetch-and-install utility.

Now, once we have that, let's say that somebody wanted to make a bunch of packages available as eggs for their platform. All they'd need to do is run that fetch-and-install such that it installs to a web-accessible directory, whose contents are visible as a directory listing. Now, somebody who adds the URL of that directory to their spider's search URLs would be able to find and download pre-built eggs for whatever they needed, without needing to do any building.

It's starting to sound an awful lot like what everybody's trying to make, isn't it? So what are the architectural components we need?
  • The "reader": An HTML reader that scans a web page for links to eggs and/or source archives with names that match distutils-standard naming conventions
  • The "finder": A tool that takes a list of candidate start URLs and invokes the reader on them to search for specified package(s), caching the resulting index data
  • The "source catalog": A tool that, given a package name, finds download URLs from PyPI and determines whether they are archives or links that should be passed to the reader
  • The "fetcher": A tool that, given a desired package, consults the finder and the source catalog, trying to download a platform-suitable egg, falling back to finding a source archive and building an egg
  • The "builder": A tool that, given a source archive URL, downloads it, extracts it, finds the setup.py, and builds/installs an egg
  • Some way to decide what version of a package to build/install, if more than one version is available. (e.g. a way to select only stable versions, or whatever)
Interestingly, it might be possible to just repurpose an existing Python web spider to do a lot of this, just by spidering from PyPI with a reasonable external link depth, to build an index of package+version to download URLs. In fact, you could use that spider to simply create an HTML page with all the download links.
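
To make the "index of package+version to download URLs" and the version-selection policy a little more concrete, here's a rough sketch. Everything in it is an assumption for illustration: the `build_index` and `choose_version` names, the idea that a "stable" version is one made only of digits and dots, and especially the naive `version_key` ordering, which is not how any real tool compares versions.

```python
import re
from collections import defaultdict


def build_index(found):
    """Map lowercased package name -> {version: [download URLs]}.

    `found` is an iterable of (name, version, url) tuples, e.g. the
    output of a link-scanning step.
    """
    index = defaultdict(dict)
    for name, version, url in found:
        index[name.lower()].setdefault(version, []).append(url)
    return index


def version_key(version):
    """Naive sort key: numeric release parts first, then any suffix.

    A final release sorts after a pre-release with the same numbers,
    so "0.5a4" < "0.5". Note that "1.0" and "1" compare as different.
    """
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        return ((), False, version)
    release = tuple(int(part) for part in match.group(0).split("."))
    suffix = version[match.end():]
    return (release, suffix == "", suffix)


def is_stable(version):
    # Assumed policy: "stable" means digits and dots only.
    return re.fullmatch(r"\d+(?:\.\d+)*", version) is not None


def choose_version(index, name, stable_only=True):
    """Pick the highest known version of `name`, or None if there is none."""
    candidates = index.get(name.lower(), {})
    if stable_only:
        candidates = [v for v in candidates if is_stable(v)]
    return max(candidates, key=version_key, default=None)
```

The policy questions (stable only? which sites?) end up as parameters, which is probably where most of the real design work would go.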

Given the existing capabilities of the egg runtime, and the assumption that an existing HTML-parsing spider (or browser-emulator) could be made to do the fetching and parsing, the biggest parts remaining are the "builder", and managing the whole thing's configuration. It seems to me there are lots of policy issues, ranging from the trivial (where to put the eggs) to the critical (what versions to allow? code signing? checksums? what download sites do you trust?).

But the interesting thing about all this, I think, is that in a sense we already do have a CPAN: it's called the web. Now all we need is a smart enough client to use it. :)

In the meantime, I've actually managed to squeeze in a few more hours' work on Python Eggs and their documentation. Directory scanning, dependency management, and even namespace packages are implemented now, although some of these features have received rather minimal testing. The documentation has also undergone a significant overhaul to explain many more of the implemented features, although there's still a lot more to write, just to explain what can be done with the current version of the runtime.

Follow-up: I built eggs for the "mechanize" package and its dependencies, and found it only takes a few lines of code to retrieve and analyze the links of a download directory, such as the PEAK projects download directory, or a SourceForge file listing. Of course, actually downloading the list and parsing it can be slow, so an "end-user quality" download tool might need to do a fair amount of tweaking to make the process more friendly for impatient people like me.

Sunday, May 22, 2005

Eggs get closer to hatching...

Well, I finally managed to squeeze out a few cycles to work on Python Eggs, which I'd left virtually untouched since the work Bob Ippolito and I did on them at PyCon. The net result is that this weekend I finished the core dependency resolution engine, the part of the egg runtime that lets eggs specify what other eggs they depend on (including required version(s) and requested optional features), and lets applications request that eggs be searched for and automatically added to sys.path along with their dependencies. There's even a hook that allows you to add support for automatic downloading of dependencies, although no such support will be included with the base system. (Automated downloads just have too many security issues and policy questions, so it'd be crazy to turn them on by default. In any case, GUI applications will want to integrate the download process into their UI in some fashion.)

The two big things that aren't done yet are: 1) actually scanning sys.path directories for .egg files and .egg-info directories, and 2) support for "namespace packages" so that mega-frameworks like Zope, PEAK and Twisted can be split into independent .egg files. In addition to these two big things, there are also a lot of little features and cleanups that would be useful to have. For example, peak.web can't be made .egg-friendly for web components until there's an API equivalent to listdir() for .egg file contents. You can pretty much see all the open issues in the Implementation Status section of the wiki page.

Still, this is an exciting milestone, because the egg system can not only handle cyclic dependencies, report version conflicts, and all sorts of other details, it can now handle "option" dependencies as well. An option is some feature of a package that may or may not be used by a given user of the package, and which may incur other dependencies. For example, let's say I was going to create an .egg for peak.web, with a distribution name of "PEAK-Web". PEAK-Web will depend on PEAK-Core, and also on the WSGIRef library. It also has optional support for FastCGI, but in order to use that support, you would need the FCGIApp library.

In a more simplistic dependency management system, PEAK-Web would have to do one of the following things to support this optional dependency:
  1. depend on FCGIApp (forcing you to install it when you don't need it)
  2. not depend on FCGIApp (forcing you to figure out whether you need it)
  3. create a PEAK-Web-FastCGI package whose only purpose is to depend on PEAK-Web and FCGIApp, which you then depend on in place of depending on PEAK-Web.
These are all ugly, so we invented a better solution for Python Eggs. PEAK-Web will instead define an option called "FastCGI", and it will have an "EGG-INFO/depends.txt" file that looks something like this:
PEAK-Core==0.5a4
WSGIRef>=0.1

[FastCGI]
FCGIApp>=0.1
This tells the egg runtime that PEAK-Web always depends on PEAK-Core and WSGIRef, but it only needs FCGIApp if the FastCGI option is requested.
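
For illustration, here's a small sketch of how a file in that layout could be read. It is not the pkg_resources implementation; `parse_depends` and `requirements_for` are hypothetical helper names, and the only format rules assumed are the ones visible in the example above (one requirement per line, with "[Option]" headers starting a new section).

```python
def parse_depends(text):
    """Split depends.txt-style text into sections.

    Returns a dict mapping option name -> list of requirement strings.
    Requirements that appear before any "[Option]" header are stored
    under the key None, meaning "always required".
    """
    sections = {None: []}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # blank lines just separate sections visually
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1].strip()
            sections.setdefault(current, [])
        else:
            sections[current].append(line)
    return sections


def requirements_for(sections, options=()):
    """Unconditional requirements plus those of each requested option."""
    result = list(sections[None])
    for option in options:
        result.extend(sections.get(option, []))
    return result


EXAMPLE = """\
PEAK-Core==0.5a4
WSGIRef>=0.1

[FastCGI]
FCGIApp>=0.1
"""

sections = parse_depends(EXAMPLE)
print(requirements_for(sections))               # always-needed requirements
print(requirements_for(sections, ["FastCGI"]))  # plus the FastCGI option's
```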

How do you do that? Well, in your application's top-level script, you could call require("PEAK-Web[FastCGI]>=0.5a4"), and this will find and add to sys.path all the necessary eggs, or raise a DistributionNotFound (or VersionConflict) error if the right eggs can't be found, or if two eggs have conflicting version requirements. (Or if an egg that's already on sys.path has a version incompatible with your requested version, or that's incompatible with your request's dependencies' requirements.) While this doesn't eliminate the need for you to be aware of a package's optional features, it does at least eliminate the need to have dummy packages just to bundle optional dependencies.
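
To show the shape of what happens behind a call like that, here's a toy resolver over an in-memory catalog. It's a sketch under heavy assumptions, not the egg runtime: `parse_requirement`, `resolve`, and `CATALOG` are made-up names, the `DistributionNotFound` class is a local stand-in that merely borrows the name, and version specifiers are parsed but deliberately never checked, so there's no VersionConflict handling at all.

```python
import re

# Assumed requirement syntax: Name, optional [Option, ...], optional specifier.
REQ_RE = re.compile(
    r"^\s*(?P<name>[A-Za-z0-9_.-]+?)\s*"
    r"(?:\[(?P<options>[^\]]*)\])?\s*"
    r"(?P<spec>(?:==|>=|<=|!=|<|>).*)?$"
)


class DistributionNotFound(LookupError):
    """Local stand-in: raised when the catalog has no such distribution."""


def parse_requirement(text):
    """Split 'Name[Opt1,Opt2]>=1.0' into (name, [options], spec-or-None)."""
    match = REQ_RE.match(text)
    if match is None:
        raise ValueError("unparseable requirement: %r" % (text,))
    raw_options = match.group("options") or ""
    options = [opt.strip() for opt in raw_options.split(",") if opt.strip()]
    return match.group("name"), options, match.group("spec")


def resolve(requirement, catalog, activated=None):
    """Collect every distribution needed to satisfy `requirement`.

    Returns a dict of name -> set of sections already processed (None is
    the unconditional section). Remembering processed sections is what
    keeps cyclic dependencies from recursing forever.
    """
    if activated is None:
        activated = {}
    name, options, _spec = parse_requirement(requirement)  # spec not checked
    if name not in catalog:
        raise DistributionNotFound(name)
    seen = activated.setdefault(name, set())
    new_sections = ({None} | set(options)) - seen
    seen |= new_sections
    for section in new_sections:
        # Unknown option names are silently treated as having no extra deps.
        for dependency in catalog[name].get(section, []):
            resolve(dependency, catalog, activated)
    return activated


# Hypothetical catalog mirroring the PEAK-Web example from this post.
CATALOG = {
    "PEAK-Web": {None: ["PEAK-Core==0.5a4", "WSGIRef>=0.1"],
                 "FastCGI": ["FCGIApp>=0.1"]},
    "PEAK-Core": {None: []},
    "WSGIRef": {None: []},
    "FCGIApp": {None: []},
}

print(sorted(resolve("PEAK-Web[FastCGI]>=0.5a4", CATALOG)))
```

Asking for plain "PEAK-Web>=0.5a4" instead leaves FCGIApp out of the result, which is the whole point of options: the dummy "PEAK-Web-FastCGI" package never has to exist.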

Anyway, you can't actually use this yet, because I still haven't implemented the part that scans specified directories for .egg files to use. Oops. Hopefully I'll get that done next weekend. In the meantime, if you're adventurous, you can check out the latest setuptools from the Python CVS sandbox and play around with it. Ian Bicking and I also just added some updated documentation to the Building Eggs section of the wiki page, which should make it a bit easier to understand how to package your own or someone else's libraries as .egg files.

Update as of May 23: I squeezed in a few more minutes this evening and managed to actually hack up a halfway working distribution scanner, so the require() API now appears to be working. If anybody wants to start experimenting, I look forward to hearing about your experiences.