Feeds:
Posts
Comments

Posts Tagged ‘Perl’

The CPAN ecosystem is one of the most compelling reasons for the continued growth of the Perl programming language. It has been discussed at length by numerous people, and there have been several attempts to imitate this aspect of the Perl community through projects like: CRAN, CCAN, JSAN and more.

Unfortunately, in equal parts due to its age and design philosophy, the PAUSE system powering CPAN makes it difficult for distributions to be maintained by a group, rather than an individual. The inspiration for this post comes from a discussion I had recently with Florian Ragwitz, who contributes to several key Perl projects, including Catalyst, Moose, DBIx::Class and many more.

Permissions

First, a bit about how permissions on CPAN work.

In order to make a package installable using the CPAN Shell, there must be some mechanism to disambiguate a module name. Consider this simple example:

  1. I upload Acme::Package to CPAN.
  2. Some time passes, and unbeknownst to me, another author uploads a different package, but which is called Acme::Package to CPAN as well.

In the absence of any permission checking, if I then instructed users to install Acme::Package using the CPAN Shell, they would inadvertently install the wrong distribution! This has some rather serious implications: the other Acme::Package is probably quite different from mine, and a malicious author could have taken my software and added a backdoor vulnerability.

CPAN solves this issue by tracking each module namespace separately using the PAUSE Indexer, which assigns upload permissions to users through two mechanisms:

  1. The module namespace registration list.
  2. First-come status (the first uploader of a given package namespace “owns” that namespace).

Going back to the example given, the second uploader of Acme::Package would not have permission to use the namespace. The package will be accepted into the archive, but will not be indexed, meaning that users installing Acme::Package will still get my distribution.

If users want to install the other author’s package (which is marked as an UNAUTHORIZED upload in big red letters on CPAN Search), they would need to explicitly specify AUTHOR/Acme-Package-1.00.tar.gz.

For packages maintained by several people, it is also possible to assign co-maintainer status to others, so that they may also upload a package and have it correctly indexed. This way, two or more people can work on the same package together, and upload it under their own accounts (without causing the upload to be marked unauthorized). Thus, PAUSE credentials do not need to be shared.

This provides a nice solution to the malicious upload problem, but also has implications for team-maintained packages. In particular, consider the case where there are two authors working on Acme::Library.

  1. Alice uploads the first version to CPAN, containing modules: Acme::Library and Acme::Library::Main.
  2. The PAUSE Indexer grants Alice first-come permissions to both Acme::Library and Acme::Library::Main.
  3. Alice grants Bob co-maintainer status on both Acme::Library and Acme::Library::Main.
  4. Bob creates a new Acme::Library::Other module and adds it to the  package.
  5. The PAUSE Indexer grants Bob first-come permissions to Acme::Library::Other.
  6. Subsequent uploads by Alice will cause the upload of Acme::Library::Other to be marked UNAUTHORIZED.

Solutions

Clever Perl authors have attempted to solve this problem in many different ways over the years, but none of them have been widely successful because they all rely on some degree of human interaction.

Shared PAUSE Accounts

Some notable projects have attempted to solve the issue by creating a shared PAUSE user to hold the requisite first-come or module list upload permissions, which may then be granted to all other team members through the existing co-maintainer facility.

Alternatively, since it is easier for smaller projects, many modules simply assign first-come permissions to a single person, who is then in charge of providing co-maintainer permissions to others who would like to work on it.

Both of these approaches have the same limitation: any people uploading new modules must remember to assign first-come permissions to the group or user in question. In our case, Bob should have assigned first-come permissions to Acme::Library::Other to Alice, who then must pass co-maintainer permissions back to Bob. Unfortunately, this almost never happens, and Alice must chase down Bob (who happens to be on vacation in Antarctica) or, alternatively, the already over-worked PAUSE administrators.

Single Uploader

Some projects deal with this issue by sharing a version control system and having all the uploads go through a single person, in our case, Alice. This fixes the permission problem, since first-come permissions are always granted to Alice, but it results in a single point of failure. If there are some serious security issues requiring an immediate release, Alice must be available (and, as luck would have it, she is vacationing in Antarctica at the time).

Enter x_authority

One proposed solution, which is used in projects including Moose and Catalyst, is to use a special field in the CPAN Metadata file (META.yml or META.json) that defines someone as the “authority” for first-come namespaces in a distribution.

This is how it would work for Alice‘s Acme::Library distribution:

  1. Alice uploads a package to CPAN, containing modules: Acme::Library and Acme::Library::Main.
  2. Alice specifies, in META.yml:
    x_authority: cpan:ALICE

    This refers to Alice‘s PAUSE login, and is the person to whom permissions for new modules uploaded in this distribution are assigned.

  3. Alice grants Bob co-maintainer status on both Acme::Library and Acme::Library::Main.
  4. Bob creates a new Acme::Library::Other module and adds it to the package
  5. The PAUSE indexer, seeing the x_authority defined in META.yml, grants Alice (not Bob!) first-come permissions to Acme::Library::Other. At this time, Bob also automatically gets co-maintainer permissions to Acme::Library::Other.
  6. Subsequent uploads by Alice will be indexed properly.

Problems

There are still some outstanding issues that need to be resolved, but the x_authority proposal represents a giant leap forward for team-maintained software.

The name: any keys not part of the CPAN Metadata Specification must be prefixed with “x_” – eventually, once it is used by more people and accepted into the specification, this name will become, simply, “authority.”

Other comaintainers: if Charlie joined the project prior to Bob‘s upload of Acme::Library::Other, then Alice still needs to grant co-maintainer permissions to Charlie. Unfortunately, the PAUSE Indexer cannot automatically grant permissions to him, since it has no notion of a “distribution,” only module namespaces.

Malicious uploaders: in the worst case, if Eve joins the project and maliciously (or unintentionally!) changes the x_authority, she will automatically get first-come permissions on the namespace of any modules she adds. However, this is the same behaviour that we had in the absence of x_authority.

Conclusions

Ultimately, the benefits of this feature (making group maintenance easier) drastically outweigh the cost (only a few small changes need to be made to the PAUSE Indexer). They are unlikely to cause any problems in practice, and the worst-case behaviour is the same as if we did not have x_authority at all.

It isn’t perfect, but it is a solution that requires minimal effort and minimal changes to PAUSE. Eventually, the goal is to create a more sophisticated system that will handle the issues outlined above, as well as more complex ones, such as renaming distributions or moving modules between distributions.

Thanks to Florian Ragwitz for spending some time discussing x_authority at length with me. He and Leon Timmermans proofread this article prior to publication.

Read Full Post »

A specialized storage system known as a Round Robin Database allows one to store large amounts of time series information such as temperatures, network bandwidth and stock prices with a constant disk footprint. It does this by taking advantage of changing needs for precision. As we will see later, the “round robin” part comes from the basic data structure used to store data points: circular lists.

In the short term, each data point is significant: we want an accurate picture of every event that has occurred in the last 24 hours, which might include small transient spikes in disk usage or network bandwidth (which could indicate an attack). However, in the long term, only general trends are necessary.

For example, if we sample a signal at 5-minute intervals, then a 24-hour period will have 288 data points (24hrs*60mins/hr divided by 5 minutes per sample). Considering each data point is probably1 only 4 (float), 8 (double), 16 (quad) bytes, it’s not problematic to store roughly three hundred data points. However, if we continue to store each sample, a year would require about 105120 (365*288) data points; multiplied over many different signals, this can become quite significant.

To save space, we can compact the older data using a Consolidation Function (CF), which performs some computation on many data points to combine it into a single point over a longer period. Imagine that we take an average of those 288 samples at the conclusion of every 24 hour period; in that case, we would only need 365 data points to store data for an entire year, albeit at an irrecoverable loss of precision. Though we have lost precision (we no longer know what happened at exactly 5:05pm on the first Tuesday three months ago), the data is still tremendously useful for demonstrating general trends over time.

Though perhaps not the easiest to learn, RRDtool seems to have the majority of market share (without having done any research, I’d estimate somewhere between 90% and 98%, to account for those who create their own solutions in-house), and for good reason: it gets the job done quickly, provides appealing and highly customizable charts and is free and open source software (licensed under the GNU General Public License).

In a recent project, I learned to use RRDTool::OO to maintain a database and produce some interesting graphs. Since I was sampling my signal once every five minutes, I decided to replicate the archiving parameters used by MRTG, notably:

  • 600 samples store 2 days and 2 hours of data (at full resolution)
  • 700 samples store 14 days and 12 hours of data (where six samples become a 30-minute average)
  • 775 samples store 64 days and 12 hours of data (2-hour average)
  • 797 samples store 797 days of data (24-hour average)

F0r those interested, the following code snippet (which may be rather easily adapted for languages other than Perl) constructs the appropriate database:

archive => {
 rows    => 600,
 cpoints => 1,
 cfunc   => 'AVERAGE',
},
archive => {
 rows    => 700,
 cpoints => 6,
 cfunc   => 'AVERAGE',
},
archive => {
 rows    => 775,
 cpoints => 24,
 cfunc   => 'AVERAGE',
},
archive => {
 rows    => 797,
 cpoints => 288,
 cfunc   => 'AVERAGE',
},

There are also plenty of other examples of this technique in action, mainly related to computing. However, there are also some interesting applications such as monitoring voltage (for an uninterruptible power supply) or indoor/outdoor temperature (using an IP-enabled thermostat).

Footnotes

1. This may, of course, vary depending on the particular architecture

Read Full Post »

I’ve recently been pushing for greater support for Catalyst and MojoMojo on Debian. For the uninitiated, Catalyst is a Model-View-Controller Framework designed for writing web applications. MojoMojo is a Wiki application based on Catalyst that provides a lot of neat features; while it seems less popular than Wikimedia’s MediaWiki software, it’s still got plenty of features other wikis don’t.

Here’s a blurb about it from their homepage:

We also have a bunch of features you won’t find in every wiki, like an attachment system that automatically makes a web gallery of your photos, live AJAX previews as you are editing your text, and a proper full text search engine built straight into the software.

Unfortunately, such a rich feature set comes at a price — this shiny piece of software has a rather large dependency chain. As a result, building the module (after building its prerequisites) from CPAN is both slow and prone to failure, since each module must be individually retrieved, extracted, built, tested and then installed.

To make matters worse, any failure anywhere in the chain (perhaps a new version of a module breaks things) will cause a complete failure to build the module — either Catalyst or MojoMojo — which has some serious implications for production applications.

In Debian, we mitigate this risk by having separate unstable and testing distributions, so if a newer version breaks things in unstable, we will catch it and have a chance to fix it before the package makes it into testing. By packaging these modules for Debian, we get the advantages of a faster installation process (since we’re installing pre-built binaries) combined with better Quality Assurance.

One of the big issues blocking both of these has been missing copyright information for a lot of modules. I’ve worked a lot with Matt S. Trout, one of the primary people behind coordinating the efforts of the Catalyst project, and gathered the necessary information for an upgrade and upload into Debian.

Recently, libcatalyst-modules-perl (version 35) and libcatalyst-modules-extra-perl (version 4) were uploaded to Debian, containing many necessary updates and fixes to improve the Catalyst experience on Debian. The next big push is to get MojoMojo’s dependencies packaged (currently only String::Diff is blocking it, due to missing copyright information).

A bounty of $150 is being offered by one of the MojoMojo developers to the first person who can re-implement the String::Diff functionality in a free/open source way.

Read Full Post »

Okay, so this is a long-awaited follow-up to my first post on the topic of  Debian Perl Packaging. Some of you might note I was pretty extreme in the first post, which is partially because people only really ever respond to extremes when they’re new to things. When you first begin programming, the advice you get is “hey, never use goto statements” — but as your progress in your ability and your understanding of how it works, what it’s actually doing in the compiler — then it might not be so bad after all. In fact, I hear the Linux kernel uses it extensively to provide Exceptions in C. The Wikipedia page on exception handling in various languages shows how to implement exceptions in C using setjmp/longjmp (which is essentially a goto statement). But I digress.

Back to the main point of this writeup. Previously I couldn’t really think of cases where packaging your own modules is really all that useful, especially when packaging them for Debian means that you benefit many communities — Debian, Ubuntu, and all of the distributions that are based on those.

During a discussion with Hans Dieter Piercey after his article providing a nice comparison between dh-make-perl and cpan2dist. (Aside: I feel like he was slightly biased toward cpan2dist in his writeup, but I’m myself biased toward dh-make-perl, so he might be right, even though I won’t admit it.)

I’m really glad for that article and the ensuing dialog, because it really got people talking about what they use Debian Perl packages for, and where it is useful to make your own.

Firstly, if you’ve got an application that depends on some Perl module that isn’t managed by Debian, but you need it yesterday, then you can either install that module via CPAN or roll your own Debian package. The idea here is to make and install the package so you can use it, but also file a Request For Package bug at the same time — see the reportbug command in Debian, or use LaunchPad if you’re on Ubuntu. This way, when the package is officially released and supported, you can move to that instead, and thus get the benefits of automatic upgrades of those packages.

Secondly, if you’ve got an application that depends on some internally-developed modules, then they probably wouldn’t exist on CPAN (some call this Perl code part of the DarkPAN), except in the rare case that a company open sources their work. But corporations will never open source all of their work, even if they consider the implications of providing some of it to the open source community, so at some point or another you’ll need to deal with internal packages. Previously, the best way to handle this was to construct your own CPAN local mirror, and have other machines install and upgrade from it — thus your internal code is easily distributed via the usual mechanism.

One of the advantages of using CPAN to distribute things is that it’s available on most platforms, builds things and runs tests automatically on many platforms. CPANPLUS will even let you remove packages, which was one of the main reasons I am so pro-Debian packages anyway. However, it does mean you’ll need to rebuild the package on other systems, which is prone to failures that cost time and money to track down and fix. CPAN and CPANPLUS are the Perl tradition of distributing packages.

If you are using an environment mostly with Debian systems, however, you may benefit from using a local Debian repository. This way, you only need to upgrade packages in your repository, and they’ll be automatically upgraded along with the rest of your operating system (you do run update and upgrade periodically right?). There is even the fantastic aptcron program to automate this, so there’s really no excuse not to automatically update.

In either case, creating a local package means you will be able to easily remove anything you no longer need via the normal package management tools. You can also distribute the binary packages between machines — though it sometimes depends on the platform (for modules that incorporate C or other platform-specific code that needs to be rebuilt). Generally, most Perl modules are Pure Perl, and thus you can compile and test it once, on one machine, and distribute it to other ones simply by installing the .deb package on other machines. You can copy packages to machines and use dpkg to install them, or better yet, create a local Debian mirror so it’s done automatically and via the usual mechanism (aptitude, etc.)

In conclusion: if you’re going to make your own Debian packages, do so with caution, and be aware of all the consequences (positive and negative) of what you’re doing. As always, a real understanding of everything is necessary.

Read Full Post »

One thing that makes Perl different from many other languages is that it has a rather small collection of core commands. There are only a few hundred commands in Perl itself, so the rest of its functionality comes from its rich collection of modules,  many of which are distributed via the Comprehensive Perl Archive Network (CPAN).

When CPAN first came on the scene, it preceded many modern package management systems, including Debian’s Advanced Packaging Tool (APT) and Ruby’s gem system, among others. As a consequence of its rich history, the CPAN Shell is relatively simplistic by today’s standards, yet still continues to get the job done quite well.

Unfortunately, there are two issues with CPAN:

  1. Packages are distributed as source code which is built on individual machines when installing or upgrading packages.
    • Since packages must be re-built on every machine that installs it, the system is prone to breaking and wastes CPU time and other resources. (The CPAN Testers system is a great way module authors can try to mitigate this risk, though.)
    • Due to wide variation in packages, many packages cause problems with the host operating system in terms of where they install files, or expect them to be installed. This is because CPAN does not (and cannot) know every environment that packages will be installed on.
  2. It does not integrate nicely with package managers
    • The standard CPAN Shell is not designed to remove modules, only install them. Removals need to be done manually, which is prone to human error such as forgetting to clean up certain files, or breaking other installs in the process.
    • It cannot possibly know the policies that govern the various Linux flavours or Unices. This means that packages might be installed where users do not expect, which violates the Principle of Least Surprise.
    • It is a separate ecosystem to maintain. When packages are updated via the normal means (eg, APT), packages installed via CPAN will be left alone (ie, not upgraded).

Here is the real problem: packages installed via CPAN will be left alone. This means that if new releases come out, your system will retain an old copy of packages, until you get into the CPAN Shell and upgrade it manually. If you’re administrating your own system, this isn’t a big problem — but it has significant implications for collections of production systems. If you are managing thousands of servers, then you will need to run the upgrade on each server, and hope that the build doesn’t break (thus requiring your, or somebody else’s, intervention).

One of the biggest reasons to select Debian is because of one of its primary design goal: to be a Universal Operating System. What this means is that the operating system should run on as many different platforms and architectures as possible, while providing the same rich environment to each of them to the greatest extent possible. So, whether I’m using Debian GNU/Linux x86 or Debian GNU/kFreeBSD x64, I have access to the same applications, including the same Perl packages. Debian has automated tools to build and test packages on every architecture we support.

The first thing I’m going to say is: if you are a Debian user, or a user of its derivatives, there is absolutely no need for you to create your own packages. None. Just don’t do it; it’s bad. Avoid it like the goto statement, mmkay?

If you come across a great CPAN package that you’d really like to see packaged for Debian, then contact the Debian Perl Packagers (pkg-perl) team, and let us know that you’d like a package. We currently maintain well over a thousand Perl packages for Debian, though we are by no means the only maintainers of Perl packages in Debian. You can do this easily by filing a Request For Package (RFP) bug using the command: reportbug wnpp.

On-screen prompting will walk you through the rest, and we’ll try to package the module as quickly as possible. When we’re done, you’ll receive a nice e-mail letting you know that your package has been created, thus closing the bug. A few days of waiting, but you will have a package in perfect working condition as soon as we can create it for you. Moreover, you’re helping the next person that seeks such a module, since it will already be available in Debian (and in due time it will propagate to its derivatives, like Ubuntu).

All 25,000+ Debian packages meet the rigorous requirements of Debian Policy. The majority of them meet the Debian Free Software Guidelines (DFSG), too; the ones which are not considered DFSG-free are placed in their own repository, separate from the rest of packages. A current work in progress is machine-parseable copyright control files, which will hopefully provide a way for administrators to quickly review licensing terms of all the software you install. This is especially important for small- and medium-sized businesses without their own intellectual property legal departments to review open source software, which is something that continues to drive many businesses away from using open source.

For the impatient, note this well: packages which are not maintained by Debian are not supported by Debian. This means that if you install something using a packaging tool (we’ll discuss these later) or via CPAN, then your package is necessarily your own responsibility. In the unlikely event that you totally break your system installing a custom package, it’s totally your fault, and it may mean you will have to restore an earlier backup or re-install your system completely. Be very careful if you decide to go this route. A few days waiting to ensure that your package will work on every platform you’re likely to encounter is worth the couple days of waiting for a package to be pushed through the normal channels.

The Debian Perl Packaging group offers its services freely to the public for the benefit of our users. It is much better to ask the volunteers (preferably politely) to get your package in Debian, so that it passes through the normal testing channels. You really should avoid making your own packages in a vacuum; the group is always open to new members, and it means your package will be reviewed (and hopefully uploaded into Debian) by our sponsors.

But the thing about all rules is that there are always exceptions. There are, in fact, some reasons when you might want to produce your own packages. I was discussing this with Hans Dieter Pearcey the other day, and he has written a great follow-up blog post about the primary differences between dh-make-perl and cpan2dist, two packaging tools with a similar purpose but very different design goals. Another article is to follow this one, where I will discuss the differences between the two.

Read Full Post »

When working on some code, I came across a random number generation algorithm (a pseudorandom number generator to be exact, since the numbers aren’t truly random but only designed to “look” like it to the casual observer) called ISAAC. The name stands for “Indirection, Shift, Accumulate, Add, and Count,” which is essentially what it does to a set of state variables, in that sequence.

The algorithm allows sequences of 32-bit numbers to be generated extremely quickly, since the operations it uses are relatively fast on modern processors; in fact, the author, Bob Jenkins, claims it only requires (on average) 18.75 machine processor cycles to generate each 32-bit value.

Even so, the algorithm is designed to be as uniformly distributed as possible.

I decided to test how fast this would be on a relatively modern multiprocessor system. The following is a benchmark on an amd64 machine with 2GB of memory. The benchmark script is available as part of the Math::Random::ISAAC package available via CPAN or its SVN Repository.

Rate ISAAC
PP
MT ISAAC
XS
Math::Random TT800 Core
ISAAC::PP 134/s -54% -79% -82% -90% -98%
MT 292/s 118% -53% -61% -78% -95%
ISAAC::XS 626/s 367% 114% -16% -54% -90%
Math::Random 748/s 458% 156% 19% -45% -88%
TT800 1350/s 907% 362% 116% 80% -78%
Core 6211/s 4534% 2024% 892% 730% 360%

So as we can see, Perl’s built in rand() function is evidently really, really fast; quicker than every other algorithm studied here by a few orders of magnitude. What remains to be seen, then, is whether the quality of random numbers actually generated is any good. Ideally, most random number generators are designed to produce distributions of numbers that are uniformly distributed — that is, every number is equally likely to occur.

Uniform distributions often do not occur in real-life “random” samples – for example, things like height or marks in a class tend to follow more of a normal distribution (or “bell curve”) – the majority of the sample population is near the mean, and it drops off as you go to higher or lower figures. Nonetheless, these types of random numbers are great for uses in computer science — where you want each number to be equally likely, thus ensuring that your sequence is as unpredictable as possible.

In this case, all of the algorithms produce relatively uniform distributions, though the TT800 and Perl rand() core function produce a somewhat jagged distribution. Less interesting distributions are those for ISAAC, Math::Random and the Mersenne Twister.

Read Full Post »

Good news everybody :-)

I found out yesterday that my proposal was one of ten from the Debian project selected for the Google Summer of Code 2009. So right now I’m getting pretty excited to start working on things.

I have quite a few things to do once exams finish – they begin with e-mailing the KDE-Bindings mailing list and finding the most appropriate home for a code repository. Likely this will be in the kde-bindings subversion, if I can get a commit bit there.

Another thing I need to do is contact the current maintainer of perlqt4 and make some plans there. I’ll probably need to take a first look at that module too.

In all, there were about 1000 accepted proposals and 4000 total. In the Debian project, we had 40 proposals and 10 finally accepted, so the odds were roughly the same as the average. I’m glad I get to work on something I love for the summer, but right now I’ve gotta focus on exams.

The good news is that exams will finish in a week, and then I can get coding!

Read Full Post »

Older Posts »