
Introducing WWW::OPG

While looking at Ontario Power Generation’s official web site, I noticed this number in the bottom right corner of the page:

It shows the amount of power being generated along with the date and time of the last update. After refreshing a few times, I realized that updates occur every five minutes. Curious, I thought I’d whip up a quick module to scrape this information from the web site and produce some nice graphs with RRDtool. To do this I used the open source RRDTool::OO module, which is freely available on the CPAN.

Recognizing that web scraping is not the most reliable means of getting data from a web site, I contacted OPG via e-mail and requested an API for this data. As of the latest iteration of WWW::OPG (version 1.004, already on CPAN), the module reads a smaller machine-readable text file that provides the same data. Thanks to someone I know only as “Rose” from OPG for providing this file, which is much easier to parse and less likely to change.

As OPG supplies roughly 70% of Ontario’s electric power demand, its generation statistics provide a relatively good reflection of our behaviour patterns over time. In the course of this project, I learned how to work with Round Robin Databases (and wrote an article about it) and was able to observe some interesting trends even in the first week of operation:

Power generation for week of 2009-12-25

The graph begins Saturday, December 26th, 2009 (Boxing Day) and continues through the week approaching the new year 2010. These particular trends are interesting because, while two observable peaks occur each day, the overall power consumption (including 95th percentile consumption) seems much lower than usual.

By comparison, consider this graph of the week ending 14 January 2010 (there were some rather long-lasting outages in the data collection which I’m trying to track down, but it still gives a sense of the general trends):

Power generation for week of 2010-01-07

In this case, the 95th percentile consumption is much higher, at about 14 GW rather than 10 GW. Note that the 95th percentile gives a rather good approximation of an infrastructure’s utilization rate, since it indicates the peak power after removing the highest 5% of data points. This means that 95% of the time, power consumption was at or below the given line.

The percentile is more useful than the average because it indicates the minimum capacity needed to satisfy demand 95% of the time, which gives us a simple way to determine whether more infrastructure is needed.

In the specific case of electric power utilities, because electricity is so important for both industrial and commercial use, legal requirements stipulate that demand must always be supplied, barring exceptional circumstances such as failures of distribution transformers. For such utilities, maximum power consumption is therefore a more useful measure for infrastructure planning.
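To compare the three measures discussed here (average, 95th percentile and maximum), here is a minimal sketch in Perl. It uses the simple "nearest rank" definition of a percentile; the sub name percentile and the readings are made up for illustration, and RRDtool's own calculation may differ in details such as rounding.

```perl
use strict;
use warnings;
use POSIX qw(ceil);
use List::Util qw(sum);

# Nearest-rank percentile: the smallest sample such that at least
# $pct percent of all samples are at or below it.
sub percentile {
    my ($pct, @samples) = @_;
    die "no samples\n" unless @samples;
    my @sorted = sort { $a <=> $b } @samples;
    my $rank   = ceil($pct * @sorted / 100);   # 1-based rank
    $rank = 1 if $rank < 1;
    return $sorted[$rank - 1];
}

# Hypothetical readings in MW: 19 ordinary samples and one brief spike.
my @mw = ((10_000) x 10, (12_000) x 9, 20_000);

printf "average:         %.0f MW\n", sum(@mw) / @mw;
printf "95th percentile: %d MW\n",   percentile(95, @mw);
printf "maximum:         %d MW\n",   percentile(100, @mw);
```

With these invented numbers the average is 11400 MW, the 95th percentile is 12000 MW and the maximum is 20000 MW: the single spike barely moves the percentile, but it sets the maximum that a utility would have to plan for.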

A specialized storage system known as a Round Robin Database allows one to store large amounts of time series information such as temperatures, network bandwidth and stock prices with a constant disk footprint. It does this by taking advantage of changing needs for precision. As we will see later, the “round robin” part comes from the basic data structure used to store data points: circular lists.
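As a rough preview of the circular-list idea, the following Perl sketch keeps a fixed number of slots and overwrites the oldest value once they are all in use. The names ring_new, ring_push and ring_values are invented for this illustration; this is not how RRDtool lays out its files on disk, only the general principle.

```perl
use strict;
use warnings;

# A fixed-size circular list: once all slots are full, each new
# value overwrites the oldest one, so storage never grows.
sub ring_new {
    my ($rows) = @_;
    return { rows => $rows, next => 0, count => 0, slots => [] };
}

sub ring_push {
    my ($ring, $value) = @_;
    $ring->{slots}[ $ring->{next} ] = $value;
    $ring->{next} = ($ring->{next} + 1) % $ring->{rows};
    $ring->{count}++ if $ring->{count} < $ring->{rows};
    return;
}

# Return the stored values from oldest to newest.
sub ring_values {
    my ($ring) = @_;
    my $start = $ring->{count} < $ring->{rows} ? 0 : $ring->{next};
    my @idx = map { ($start + $_) % $ring->{rows} } 0 .. $ring->{count} - 1;
    return @{ $ring->{slots} }[@idx];
}

my $ring = ring_new(3);
ring_push($ring, $_) for 1 .. 5;
print join(', ', ring_values($ring)), "\n";   # prints "3, 4, 5"
```

Five values were pushed into three slots, so only the three most recent survive, and the storage used is the same as it would be after five thousand pushes.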

In the short term, each data point is significant: we want an accurate picture of every event that has occurred in the last 24 hours, which might include small transient spikes in disk usage or network bandwidth (which could indicate an attack). However, in the long term, only general trends are necessary.

For example, if we sample a signal at 5-minute intervals, then a 24-hour period will have 288 data points (24 hours * 60 minutes per hour, divided by 5 minutes per sample). Considering each data point is probably[1] only 4 (float), 8 (double) or 16 (quad) bytes, it’s not problematic to store roughly three hundred data points. However, if we continue to store each sample, a year would require 105120 (365*288) data points; multiplied over many different signals, this can become quite significant.

To save space, we can compact the older data using a Consolidation Function (CF), which performs some computation on many data points to combine them into a single point covering a longer period. Imagine that we take an average of those 288 samples at the conclusion of every 24-hour period; in that case, we would only need 365 data points to store data for an entire year, albeit at an irrecoverable loss of precision. Though we have lost precision (we no longer know what happened at exactly 5:05pm on the first Tuesday three months ago), the data is still tremendously useful for demonstrating general trends over time.
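A consolidation step of this kind can be sketched in a few lines of Perl. The sub name consolidate and the sample values are invented for this illustration; RRDtool performs consolidation internally as data arrives, and besides AVERAGE it also offers MIN, MAX and LAST.

```perl
use strict;
use warnings;
use List::Util qw(sum max);

# Consolidate a list of samples: every $cpoints consecutive values
# are combined into a single value by the function $cf.
sub consolidate {
    my ($cpoints, $cf, @samples) = @_;
    die "cpoints must be at least 1\n" unless $cpoints >= 1;
    my @out;
    while (@samples >= $cpoints) {
        push @out, $cf->(splice(@samples, 0, $cpoints));
    }
    return @out;   # a trailing partial group is dropped in this sketch
}

my $average = sub { sum(@_) / @_ };
my $maximum = sub { max(@_) };

# Two days of made-up 5-minute samples, 288 per day.
my @samples = ((10) x 288, (14) x 288);

my @daily_avg = consolidate(288, $average, @samples);
my @daily_max = consolidate(288, $maximum, @samples);
print "daily averages: @daily_avg\n";   # prints "daily averages: 10 14"
```

The 576 original samples shrink to two values, one per day, which is exactly the trade of precision for space described above.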

Though perhaps not the easiest to learn, RRDtool seems to have the majority of market share (without having done any research, I’d estimate somewhere between 90% and 98%, to account for those who create their own solutions in-house), and for good reason: it gets the job done quickly, provides appealing and highly customizable charts and is free and open source software (licensed under the GNU General Public License).

In a recent project, I learned to use RRDTool::OO to maintain a database and produce some interesting graphs. Since I was sampling my signal once every five minutes, I decided to replicate the archiving parameters used by MRTG, notably:

  • 600 samples store 2 days and 2 hours of data (at full resolution)
  • 700 samples store 14 days and 14 hours of data (where six samples become a 30-minute average)
  • 775 samples store 64 days and 14 hours of data (2-hour average)
  • 797 samples store 797 days of data (24-hour average)
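The time span covered by each archive is simply its number of rows, times the number of raw samples folded into each row, times the five-minute sampling step. The short Perl sketch below does that arithmetic; the variable and sub names are my own and are not part of MRTG or RRDtool.

```perl
use strict;
use warnings;

my $step = 300;   # seconds between raw samples (five minutes)

# [rows, raw samples per row] for each of the four archives
my @archives = ([600, 1], [700, 6], [775, 24], [797, 288]);

# Time covered by one archive, in seconds
sub span_seconds {
    my ($rows, $per_row) = @_;
    return $rows * $per_row * $step;
}

my $total_rows = 0;
for my $archive (@archives) {
    my ($rows, $per_row) = @$archive;
    my $seconds = span_seconds($rows, $per_row);
    my $days    = int($seconds / 86_400);
    my $hours   = ($seconds % 86_400) / 3_600;
    printf "%3d rows x %3d samples each: %d days, %g hours\n",
        $rows, $per_row, $days, $hours;
    $total_rows += $rows;
}
print "total rows stored: $total_rows\n";
```

However long the database runs, the total number of stored rows stays the same, which is what keeps the disk footprint constant.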

For those interested, the following code snippet (which may be rather easily adapted for languages other than Perl) constructs the appropriate database:

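use RRDTool::OO;

# The file and data source names below are placeholders.
my $rrd = RRDTool::OO->new(file => 'power.rrd');

$rrd->create(
 step        => 300,   # one sample every five minutes
 data_source => { name => 'power', type => 'GAUGE' },
 # 600 rows at full resolution
 archive => {
  rows    => 600,
  cpoints => 1,
  cfunc   => 'AVERAGE',
 },
 # 700 rows of 30-minute averages (6 samples each)
 archive => {
  rows    => 700,
  cpoints => 6,
  cfunc   => 'AVERAGE',
 },
 # 775 rows of 2-hour averages (24 samples each)
 archive => {
  rows    => 775,
  cpoints => 24,
  cfunc   => 'AVERAGE',
 },
 # 797 rows of 24-hour averages (288 samples each)
 archive => {
  rows    => 797,
  cpoints => 288,
  cfunc   => 'AVERAGE',
 },
);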

There are plenty of other examples of this technique in action, mainly related to computing. However, there are also some interesting applications elsewhere, such as monitoring voltage (for an uninterruptible power supply) or indoor/outdoor temperature (using an IP-enabled thermostat).

Footnotes

1. This may, of course, vary depending on the particular architecture

Catalyst on Debian

Earlier in the year, I wrote a similar article discussing the Catalyst Web Framework and the MojoMojo Wiki software. At the beginning of December 2009, I wrote an article which was published in the Catalyst Advent Calendar. I’m re-posting it here for posterity, and because it is still relevant to others today.

Introduction

Because Catalyst is a rapidly evolving project, packages supplied by operating system vendors like Debian, Fedora, Ubuntu, and many others have historically lagged behind the stable upstream releases. In effect, this limited users of Debian’s package management system to outdated versions of this software.

In 2009, thanks to the efforts of Matt S Trout and many others, Debian’s Catalyst packages have been improving. The idea that Debian’s Perl packages are outdated is itself becoming obsolete. There are many situations where system-wide Debian packages (and similarly, Ubuntu packages) can be preferable to installing software manually via CPAN.

Advantages

Here are some reasons why packages managed by Debian are preferable to installing packages manually:

  • Unattended installation: the majority of our packages require absolutely no user interaction during installation, in contrast to installs via CPAN.
  • Quicker installs for binary packages: since binary packages are pre-built, installing the package is as simple as unpacking the package and installing the files to the appropriate locations. When many modules need to be built (as with Catalyst and MojoMojo), this can result in a significant time savings, especially when one considers rebuilding due to upgrades.
  • No unnecessary updates: if an update only affects the Win32 platform, for example, it does not make sense to waste bandwidth downloading and installing it. Our process separates packages with bugfixes and feature additions from those that have no functional difference to users, saving time, bandwidth, and administrative overhead.
  • Only packages offered by Debian are supported by Debian: if there are bugs in your Debian software, it is our responsibility to help identify and correct them. Often this means coordinating with the upstream software developers (i.e. the Catalyst community) and working toward a solution together – but our team takes care of this on your behalf.
  • Updates occur with the rest of your system: while upgrading your system using aptitude, synaptic, or another package management tool, your Perl packages will be updated as well. This prevents issues where a system administrator forgets to update CPAN packages periodically, leaving your systems vulnerable to potential security issues.
  • Important changes are always indicated during package upgrades: if there are changes to the API of a library which can potentially break applications, a supplied NEWS.Debian file will display a notice (either in a dialog box or on the command line) indicating these changes. You will need to install the “apt-listchanges” utility to see these.

This year has seen greatly improved interaction between the Debian Perl Group and the Catalyst community, which is a trend we’d like to see continue for many years to come. As with any open source project, communicating the needs of both communities and continuing to work together as partners will ultimately yield the greatest benefit for everyone.

Disadvantages

As with all good things, there are naturally some situations where using Debian Perl packages (or, indeed, most operating-system managed packages) is either impossible, impractical, or undesirable.

  • Inadequate granularity: due to some restrictions on the size of packages being uploaded into Debian, there are plenty of module “bundles”, including the main Catalyst module bundle (libcatalyst-modules-perl). Unfortunately, this means you may have more things installed than you need.
  • Not installable as non-root: if you don’t have root on the system, or a friendly system administrator, you simply cannot install Debian packages, let alone our Perl packages. This can add to complexity for shared hosting scenarios where using our packages would require some virtualization.
  • Multiple versions: with a solution like local::lib, it’s possible to install multiple versions of the same package in different locations. This can be important for a number of reasons, including ease of testing and support for your legacy applications. With operating-system based packages, only one version is installed at a time: the most recent one available (and if you are using the stable release, the one with the most recent serious bug/security fixes).
  • Less useful in a non-homogeneous environment: if you use different operating systems, it can be easier to maintain a single internal CPAN mirror (especially a mini-CPAN installation) than a Debian repository, Ubuntu repository, Fedora/RedHat repository, etc.

For my purposes, I use Debian packages for everything because the benefits outweigh the perceived costs. However, this is not the case for everyone in all situations, so it is important to understand that Debian Perl packages are not a panacea.

Quality Assurance

The Debian Perl Group uses several tools to provide quality assurance for our users. Chief among them is the Package Entropy Tracker (PET), a dashboard that shows information like the newest upstream versions of modules. Our bug reports are available in Debian’s open bug reporting system.

If you have any requests for Catalyst-related modules (or other Perl modules) that you’d like packaged for Debian, please either contact me directly (via IRC or email) or file a “Request For Package” (RFP) bug. If you have general questions or would like to chat with us, you’re welcome to visit us at any time – we hang around on irc.debian.org, #debian-perl.

See Also

  • Our IRC channel, irc.debian.org (OFTC), channel #debian-perl
  • Package Entropy Tracker is a dashboard where we can see what needs to be updated. It allows us (and others, if interested!) to easily monitor our workflow, and also contains links to our repository.
  • Our welcome page talks about what we do and how you (yes you!) can join. You don’t need to be a Debian Developer to join the group (in fact, I’m not yet a DD, but I maintain 300+ packages through the group).
  • This guide explains how to file a Request For Package (RFP) bug, so that the modules you use can be added to the Debian archive. Note that Debian is subject to many restrictions, so issues like inadequate copyright information may prevent the package from entering the archive.

Statistics

Here are some statistics of note:

  • We maintain over 1400 packages as of today. For details, see our Quality Assurance report
  • We have quite a few active members; probably around 10 or 20

Acknowledgments

Thanks to Matt S Trout (mst) for working so closely with the group to help both communities achieve our goal of increasing Catalyst’s profile. Also thanks to Bogdan Lucaciu (zamolxes) for inviting us to contribute this article, and Florian Ragwitz (rafl) for his review and feedback.

Everything that is good in nature comes from cooperation. Neither Catalyst, nor Perl, nor Debian Perl packages could exist without the contributions of countless others. We all stand on the shoulders of giants.
