Archive for the ‘Computer Science’ Category

The CPAN ecosystem is one of the most compelling reasons for the continued growth of the Perl programming language. It has been discussed at length by numerous people, and there have been several attempts to imitate this aspect of the Perl community through projects such as CRAN, CCAN and JSAN.

Unfortunately, in equal parts due to its age and design philosophy, the PAUSE system powering CPAN makes it difficult for distributions to be maintained by a group, rather than an individual. The inspiration for this post comes from a discussion I had recently with Florian Ragwitz, who contributes to several key Perl projects, including Catalyst, Moose, DBIx::Class and many more.

Permissions

First, a bit about how permissions on CPAN work.

In order to make a package installable using the CPAN Shell, there must be some mechanism to disambiguate a module name. Consider this simple example:

  1. I upload Acme::Package to CPAN.
  2. Some time passes and, unbeknownst to me, another author uploads a different package to CPAN that is also called Acme::Package.

In the absence of any permission checking, if I then instructed users to install Acme::Package using the CPAN Shell, they would inadvertently install the wrong distribution! This has some rather serious implications: the other Acme::Package is probably quite different from mine, and a malicious author could have taken my software and added a backdoor vulnerability.

CPAN solves this issue by tracking each module namespace separately using the PAUSE Indexer, which assigns upload permissions to users through two mechanisms:

  1. The module namespace registration list.
  2. First-come status (the first uploader of a given package namespace “owns” that namespace).

Going back to the example given, the second uploader of Acme::Package would not have permission to use the namespace. The package will be accepted into the archive, but will not be indexed, meaning that users installing Acme::Package will still get my distribution.

If users want to install the other author’s package (which is marked as an UNAUTHORIZED upload in big red letters on CPAN Search), they would need to explicitly specify AUTHOR/Acme-Package-1.00.tar.gz.
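
With the stock CPAN module, that explicit form can also be used programmatically. Here is a small sketch; the AUTHOR ID and version are simply the placeholders from the example above:

  use CPAN;

  # Install this exact upload, rather than whatever the index
  # resolves Acme::Package to:
  CPAN::Shell->install('AUTHOR/Acme-Package-1.00.tar.gz');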

For packages maintained by several people, it is also possible to assign co-maintainer status to others, so that they may also upload a package and have it correctly indexed. This way, two or more people can work on the same package together, and upload it under their own accounts (without causing the upload to be marked unauthorized). Thus, PAUSE credentials do not need to be shared.

This provides a nice solution to the malicious upload problem, but also has implications for team-maintained packages. In particular, consider the case where there are two authors working on Acme::Library.

  1. Alice uploads the first version to CPAN, containing modules: Acme::Library and Acme::Library::Main.
  2. The PAUSE Indexer grants Alice first-come permissions to both Acme::Library and Acme::Library::Main.
  3. Alice grants Bob co-maintainer status on both Acme::Library and Acme::Library::Main.
  4. Bob creates a new Acme::Library::Other module and adds it to the package.
  5. The PAUSE Indexer grants Bob first-come permissions to Acme::Library::Other.
  6. Subsequent uploads by Alice will cause the upload of Acme::Library::Other to be marked UNAUTHORIZED.

Solutions

Clever Perl authors have attempted to solve this problem in many different ways over the years, but none of them have been widely successful because they all rely on some degree of human interaction.

Shared PAUSE Accounts

Some notable projects have attempted to solve the issue by creating a shared PAUSE user to hold the requisite first-come or module list upload permissions, which may then be granted to all other team members through the existing co-maintainer facility.

Alternatively, since it is easier for smaller projects, many modules simply assign first-come permissions to a single person, who is then in charge of providing co-maintainer permissions to others who would like to work on it.

Both of these approaches have the same limitation: anyone uploading new modules must remember to assign first-come permissions to the group or user in question. In our case, Bob should have assigned first-come permissions for Acme::Library::Other to Alice, who must then pass co-maintainer permissions back to Bob. Unfortunately, this almost never happens, and Alice must chase down Bob (who happens to be on vacation in Antarctica) or, alternatively, the already over-worked PAUSE administrators.

Single Uploader

Some projects deal with this issue by sharing a version control system and having all the uploads go through a single person, in our case, Alice. This fixes the permission problem, since first-come permissions are always granted to Alice, but it results in a single point of failure. If there are some serious security issues requiring an immediate release, Alice must be available (and, as luck would have it, she is vacationing in Antarctica at the time).

Enter x_authority

One proposed solution, which is used in projects including Moose and Catalyst, is to use a special field in the CPAN Metadata file (META.yml or META.json) that defines someone as the “authority” for first-come namespaces in a distribution.

This is how it would work for Alice's Acme::Library distribution:

  1. Alice uploads a package to CPAN, containing modules: Acme::Library and Acme::Library::Main.
  2. Alice specifies, in META.yml:
    x_authority: cpan:ALICE

    This refers to Alice's PAUSE login and identifies the person to whom permissions for new modules uploaded in this distribution are assigned (one way of generating this field is sketched after this list).

  3. Alice grants Bob co-maintainer status on both Acme::Library and Acme::Library::Main.
  4. Bob creates a new Acme::Library::Other module and adds it to the package.
  5. The PAUSE Indexer, seeing the x_authority defined in META.yml, grants Alice (not Bob!) first-come permissions to Acme::Library::Other. At the same time, Bob also automatically gets co-maintainer permissions to Acme::Library::Other.
  6. Subsequent uploads by Alice will be indexed properly.
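
As a concrete illustration of how the field can end up in the generated metadata, here is a minimal, hypothetical Makefile.PL sketch using ExtUtils::MakeMaker's META_MERGE parameter (available in reasonably recent versions); the distribution name and version are placeholders from the example:

  use ExtUtils::MakeMaker;

  WriteMakefile(
      NAME       => 'Acme::Library',
      VERSION    => '1.00',            # placeholder version
      # Extra keys merged into the generated META.yml/META.json; anything
      # not in the official spec must carry the x_ prefix.
      META_MERGE => {
          x_authority => 'cpan:ALICE',
      },
  );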

Problems

There are still some outstanding issues that need to be resolved, but the x_authority proposal represents a giant leap forward for team-maintained software.

The name: any key not part of the CPAN Metadata Specification must be prefixed with “x_”. Eventually, once the field is used by more people and accepted into the specification, its name will become, simply, “authority.”

Other co-maintainers: if Charlie joined the project prior to Bob's upload of Acme::Library::Other, then Alice still needs to grant co-maintainer permissions to Charlie. Unfortunately, the PAUSE Indexer cannot automatically grant permissions to him, since it has no notion of a “distribution,” only module namespaces.

Malicious uploaders: in the worst case, if Eve joins the project and maliciously (or unintentionally!) changes the x_authority, she will automatically get first-come permissions on the namespace of any modules she adds. However, this is the same behaviour that we had in the absence of x_authority.

Conclusions

Ultimately, the benefits of this feature (making group maintenance easier) drastically outweigh the cost (only a few small changes need to be made to the PAUSE Indexer). The changes are unlikely to cause any problems in practice, and the worst-case behaviour is the same as if we did not have x_authority at all.

It isn’t perfect, but it is a solution that requires minimal effort and minimal changes to PAUSE. Eventually, the goal is to create a more sophisticated system that will handle the issues outlined above, as well as more complex ones, such as renaming distributions or moving modules between distributions.

Thanks to Florian Ragwitz for spending some time discussing x_authority at length with me. He and Leon Timmermans proofread this article prior to publication.

Last year, I had a great time participating in the Google Summer of Code with the Debian project. I had a neat project with some rather interesting implications for helping developers package and maintain their work. It’s still a work-in-progress, of course, as many projects in open source are, but I was able to accomplish quite a bit and am proud of my work. I learned a great deal about coding in C and about working with Debian, and I met some very intelligent people.

My student peers were also very intelligent and great to learn from. I enjoyed meeting them virtually and discussing our various projects on the IRC channel as the summer progressed and the Summer of Code kicked into full swing. The Debian project in particular also helps arrange travel grants for students to attend the Debian Conference (this year, DebConf10 is being held in New York City!). DebConf provides a great venue to learn from other developers, both in official talks and in unofficial hacking sessions. As the social aspect is particularly important to Debian, DebConf helps people meet those with whom they work the most, thereby creating lifelong friendships and making open source fun.

I have had several interviews for internships, and the part of my work experience most asked about is my time doing the Google Summer of Code. I really enjoyed seeing a project go from the proposal stage, through setting a reasonable timeline with my mentor and exploring the state of the art, to, most importantly, developing the software. I think this is the sort of indispensable industry-type experience we often lack in our undergrad education. We might have an honours thesis or presentation, but much of the work in the Google Summer of Code actually gets used “in the field.”

Developing software for people rather than for marks is significant in a number of ways, but most importantly it means there are real stakeholders who must be considered at all stages. Proposing brilliant new ideas is important; however, without highlighting the benefits they can have for various users, they simply will not gain traction. Learning how to write proposals effectively is an important skill, and working with my prospective mentor (at the time; he later mentored my project once it was accepted) to develop mine was tremendously useful for my future endeavours.

The way I see it, the Google Summer of Code is in many ways similar to an academic grant (and the stipend is about the same as well). It provides a modest salary (this year it’s US$5000) but, more importantly, personal contact with a mentor. Mentors are typically veterans of software development or of the Debian project and act in much the same role as supervisors do for post-graduate work: they help monitor your progress and propose new ideas to keep you on track.

The Debian Project is looking for more students and proposals. We have a list of ideas as well as application instructions available on our Wiki. As I will be going on an internship starting in May, I have offered to be a mentor this year. I look forward to seeing your submissions (some really interesting ones have already begun to filter in as the deadline approaches).

A specialized storage system known as a Round Robin Database allows one to store large amounts of time series information such as temperatures, network bandwidth and stock prices with a constant disk footprint. It does this by taking advantage of changing needs for precision. As we will see later, the “round robin” part comes from the basic data structure used to store data points: circular lists.
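
To make that concrete, here is a tiny illustrative Perl sketch of the underlying idea (this is not RRDtool's actual implementation): a fixed-size circular buffer in which, once all slots are full, each new sample silently overwrites the oldest one, so the storage never grows.

  use strict;
  use warnings;

  # A fixed number of slots; new samples overwrite the oldest once full.
  my $size  = 288;                     # e.g. one day of 5-minute samples
  my @slots = (undef) x $size;
  my $next  = 0;

  sub push_sample {
      my ($value) = @_;
      $slots[$next] = $value;
      $next = ($next + 1) % $size;     # wrap around: the "round robin" part
  }

  # Samples in order from oldest to newest (undef means "not written yet").
  sub all_samples {
      return @slots[ $next .. $#slots ], @slots[ 0 .. $next - 1 ];
  }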

In the short term, each data point is significant: we want an accurate picture of every event that has occurred in the last 24 hours, which might include small transient spikes in disk usage or network bandwidth (which could indicate an attack). However, in the long term, only general trends are necessary.

For example, if we sample a signal at 5-minute intervals, then a 24-hour period will have 288 data points (24 hrs * 60 mins/hr, divided by 5 minutes per sample). Considering each data point is probably[1] only 4 (float), 8 (double) or 16 (quad) bytes, it’s not problematic to store roughly three hundred data points. However, if we continue to store each sample at full resolution, a year would require about 105120 (365*288) data points; multiplied over many different signals, this can become quite significant.

To save space, we can compact the older data using a Consolidation Function (CF), which performs some computation on many data points to combine them into a single point covering a longer period. Imagine that we take an average of those 288 samples at the conclusion of every 24-hour period; in that case, we would only need 365 data points to store data for an entire year, albeit at an irrecoverable loss of precision. Though we have lost precision (we no longer know what happened at exactly 5:05pm on the first Tuesday three months ago), the data is still tremendously useful for demonstrating general trends over time.
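
As a purely illustrative sketch of what such a consolidation step amounts to (RRDtool handles this internally), averaging each day's worth of five-minute samples in Perl might look like this:

  use strict;
  use warnings;
  use List::Util qw(sum);

  my $samples_per_day = 24 * 60 / 5;   # 288 five-minute samples per day

  # Consolidate raw samples into daily averages: roughly 105120 points
  # for a year collapse into roughly 365.
  sub consolidate_daily {
      my (@raw) = @_;
      my @daily;
      while (my @day = splice @raw, 0, $samples_per_day) {
          push @daily, sum(@day) / @day;
      }
      return @daily;
  }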

Though perhaps not the easiest to learn, RRDtool seems to have the majority of market share (without having done any research, I’d estimate somewhere between 90% and 98%, to account for those who create their own solutions in-house), and for good reason: it gets the job done quickly, provides appealing and highly customizable charts and is free and open source software (licensed under the GNU General Public License).

In a recent project, I learned to use RRDTool::OO to maintain a database and produce some interesting graphs. Since I was sampling my signal once every five minutes, I decided to replicate the archiving parameters used by MRTG, notably:

  • 600 samples store 2 days and 2 hours of data (at full resolution)
  • 700 samples store 14 days and 12 hours of data (where six samples become a 30-minute average)
  • 775 samples store 64 days and 12 hours of data (2-hour average)
  • 797 samples store 797 days of data (24-hour average)

For those interested, the following code snippet (which may be rather easily adapted for languages other than Perl) constructs the appropriate database. It is shown here fleshed out into a minimal RRDTool::OO create() call, with the file name, data-source name and GAUGE type as placeholder assumptions to make the example self-contained:

use RRDTool::OO;

# 'signal.rrd' and the data source name are placeholders; GAUGE is an
# assumption, so pick the type that matches your own signal.
my $rrd = RRDTool::OO->new( file => 'signal.rrd' );

$rrd->create(
    step        => 300,            # one sample every 5 minutes
    data_source => {
        name => 'signal',
        type => 'GAUGE',
    },
    archive => {
        rows    => 600,            # full resolution (5-minute samples)
        cpoints => 1,
        cfunc   => 'AVERAGE',
    },
    archive => {
        rows    => 700,            # 30-minute averages (6 x 5 minutes)
        cpoints => 6,
        cfunc   => 'AVERAGE',
    },
    archive => {
        rows    => 775,            # 2-hour averages (24 x 5 minutes)
        cpoints => 24,
        cfunc   => 'AVERAGE',
    },
    archive => {
        rows    => 797,            # 24-hour averages (288 x 5 minutes)
        cpoints => 288,
        cfunc   => 'AVERAGE',
    },
);
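
Once the database exists, feeding it and drawing a chart is short work. The following is a rough sketch of the rest of the RRDTool::OO workflow, continuing with the $rrd handle from above; read_signal() stands in for whatever produces the measurement, and the image file name is a placeholder:

# Record the current reading (timestamped with the current time).
$rrd->update( read_signal() );

# Render roughly the last 24 hours as a PNG.
$rrd->graph(
    image          => 'signal-day.png',
    vertical_label => 'signal',
    start          => time() - 24 * 60 * 60,
    draw           => {
        type   => 'line',
        color  => '0000FF',
        legend => 'signal (5-minute average)',
    },
);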

There are plenty of other examples of this technique in action, mainly related to computing. However, there are also some interesting non-computing applications, such as monitoring voltage (for an uninterruptible power supply) or indoor/outdoor temperature (using an IP-enabled thermostat).

Footnotes

1. This may, of course, vary depending on the particular architecture
