Posts Tagged ‘Algorithms’

A specialized storage system known as a Round Robin Database allows one to store large amounts of time series information such as temperatures, network bandwidth and stock prices with a constant disk footprint. It does this by taking advantage of changing needs for precision. As we will see later, the “round robin” part comes from the basic data structure used to store data points: circular lists.

In the short term, each data point is significant: we want an accurate picture of every event that has occurred in the last 24 hours, which might include small transient spikes in disk usage or network bandwidth (which could indicate an attack). However, in the long term, only general trends are necessary.

For example, if we sample a signal at 5-minute intervals, a 24-hour period yields 288 data points (24 hours × 60 minutes/hour, divided by 5 minutes per sample). Since each data point is probably[1] only 4 (float), 8 (double), or 16 (quad) bytes, storing roughly three hundred of them is not a problem. However, if we keep every sample, a year requires about 105,120 (365 × 288) data points; multiplied across many different signals, this becomes quite significant.
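To put that in perspective, here is a quick back-of-the-envelope calculation in C, assuming 8-byte doubles (just one of the sizes mentioned above):

#include <stdio.h>

/* Rough storage math for a single signal sampled every 5 minutes,
 * assuming each data point is stored as an 8-byte double. */
int main(void)
{
  const int samples_per_day  = 24 * 60 / 5;            /* 288 */
  const int samples_per_year = 365 * samples_per_day;  /* 105120 */
  const int bytes_per_sample = 8;

  printf("per day:  %d samples\n", samples_per_day);
  printf("per year: %d samples, about %d KiB per signal\n",
         samples_per_year,
         samples_per_year * bytes_per_sample / 1024);  /* ~821 KiB */
  return 0;
}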

To save space, we can compact the older data using a Consolidation Function (CF), which combines many data points into a single point covering a longer period. Imagine that at the end of every 24-hour period we take the average of those 288 samples; we would then need only 365 data points to store an entire year, albeit at an irrecoverable loss of precision. Though we no longer know what happened at exactly 5:05pm on the first Tuesday three months ago, the data is still tremendously useful for showing general trends over time.
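In code, an AVERAGE consolidation function is nothing more exotic than a fold over the samples in its window; here is a minimal C sketch (the function name and the 288-sample window are purely illustrative, not any particular tool's API):

#include <stddef.h>

/* Collapse one window of samples (e.g. 288 five-minute samples covering a
 * 24-hour period) into a single consolidated data point by averaging. */
double consolidate_average(const double *samples, size_t count)
{
  double sum = 0.0;

  for (size_t i = 0; i < count; i++)
    sum += samples[i];

  return (count > 0) ? sum / (double)count : 0.0;
}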

Though perhaps not the easiest to learn, RRDtool seems to have the majority of market share (without having done any research, I’d estimate somewhere between 90% and 98%, to account for those who create their own solutions in-house), and for good reason: it gets the job done quickly, provides appealing and highly customizable charts and is free and open source software (licensed under the GNU General Public License).

In a recent project, I learned to use RRDTool::OO to maintain a database and produce some interesting graphs. Since I was sampling my signal once every five minutes, I decided to replicate the archiving parameters used by MRTG, notably:

  • 600 samples store 2 days and 2 hours of data (at full resolution)
  • 700 samples store 14 days and 14 hours of data (where six samples become a 30-minute average)
  • 775 samples store 64 days and 14 hours of data (2-hour average)
  • 797 samples store 797 days of data (24-hour average)

For those interested, the following code snippet (which may be rather easily adapted for languages other than Perl) constructs the appropriate database:

use RRDTool::OO;

# The file name and data source below are illustrative; the archive
# parameters are the MRTG-style ones described above.
my $rrd = RRDTool::OO->new(file => 'signal.rrd');

$rrd->create(
  step        => 300,            # one sample every 5 minutes
  data_source => {
    name => 'signal',
    type => 'GAUGE',
  },
  archive => {
    rows    => 600,              # full 5-minute resolution
    cpoints => 1,
    cfunc   => 'AVERAGE',
  },
  archive => {
    rows    => 700,              # 6 samples -> 30-minute average
    cpoints => 6,
    cfunc   => 'AVERAGE',
  },
  archive => {
    rows    => 775,              # 24 samples -> 2-hour average
    cpoints => 24,
    cfunc   => 'AVERAGE',
  },
  archive => {
    rows    => 797,              # 288 samples -> 24-hour average
    cpoints => 288,
    cfunc   => 'AVERAGE',
  },
);

There are plenty of other examples of this technique in action, mainly related to computing, but there are also some interesting applications such as monitoring voltage (for an uninterruptible power supply) or indoor/outdoor temperature (using an IP-enabled thermostat).

Footnotes

1. This may, of course, vary depending on the particular architecture.

Read Full Post »

When working on some code, I came across a random number generation algorithm (a pseudorandom number generator to be exact, since the numbers aren’t truly random but only designed to “look” like it to the casual observer) called ISAAC. The name stands for “Indirection, Shift, Accumulate, Add, and Count,” which is essentially what it does to a set of state variables, in that sequence.

The algorithm allows sequences of 32-bit numbers to be generated extremely quickly, since the operations it uses are relatively fast on modern processors; in fact, the author, Bob Jenkins, claims it only requires (on average) 18.75 machine processor cycles to generate each 32-bit value.

Even so, its output is designed to be as uniformly distributed as possible.

I decided to test how fast this would be on a relatively modern multiprocessor system. The following is a benchmark on an amd64 machine with 2GB of memory. The benchmark script is available as part of the Math::Random::ISAAC package available via CPAN or its SVN Repository.

                Rate ISAAC::PP    MT ISAAC::XS Math::Random TT800  Core
ISAAC::PP      134/s        --  -54%      -79%         -82%  -90%  -98%
MT             292/s      118%    --      -53%         -61%  -78%  -95%
ISAAC::XS      626/s      367%  114%        --         -16%  -54%  -90%
Math::Random   748/s      458%  156%       19%           --  -45%  -88%
TT800         1350/s      907%  362%      116%          80%    --  -78%
Core          6211/s     4534% 2024%      892%         730%  360%    --

So as we can see, Perl's built-in rand() function is really, really fast; roughly 5 to 50 times quicker than every other generator benchmarked here. What remains to be seen, then, is whether the quality of the random numbers it generates is any good. Ideally, random number generators are designed to produce numbers that are uniformly distributed — that is, every number is equally likely to occur.

Uniform distributions often do not occur in real-life "random" samples – for example, things like height or marks in a class tend to follow more of a normal distribution (or "bell curve"): the majority of the population sits near the mean, and frequency drops off toward higher or lower values. Nonetheless, uniformly distributed random numbers are exactly what we want in computer science, where every value should be equally likely so that the sequence is as unpredictable as possible.
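To make "uniformly distributed" concrete, here is a rough C sketch of the kind of check behind such distribution plots: draw many values, drop them into buckets, and see whether the buckets come out roughly even. C's rand() stands in here for whichever generator is under test.

#include <stdio.h>
#include <stdlib.h>

#define BUCKETS 10
#define DRAWS   1000000

int main(void)
{
  long counts[BUCKETS] = {0};

  srand(42);
  for (long i = 0; i < DRAWS; i++)
    counts[rand() % BUCKETS]++;  /* modulo adds a tiny bias; fine for eyeballing */

  /* For a uniform generator, every bucket should hold roughly DRAWS/BUCKETS values. */
  for (int b = 0; b < BUCKETS; b++)
    printf("bucket %d: %ld (expected ~%d)\n", b, counts[b], DRAWS / BUCKETS);

  return 0;
}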

In this case, all of the algorithms produce relatively uniform distributions, though the TT800 and Perl's core rand() function produce somewhat jagged ones. The distributions for ISAAC, Math::Random and the Mersenne Twister are less interesting, in that they are closer to flat.

Read Full Post »

While working on my Google Summer of Code project today, I came across a bug that pretty much halted my productivity for the day.

Early on in my project, I decided that working with Unicode is hard, among other things. Since I was restricted to using C, I had to find a way to manipulate Unicode easily, and I came across GLib (I'm not entirely sure how; I think I just remembered other projects using it and decided to look it up).

Not only does it have Unicode handling, it also provides a bunch of convenient things like a linked list implementation, memory allocation helpers and so on, all in a way intended to be cross-platform, since this is the library that powers Gtk.

I'm not entirely sure how it differs from Apache's Portable Runtime (APR); maybe it's even a case of Not Invented Here syndrome. In any case, I, not suffering from that particular affliction, decided to be lazy and re-use existing code.

For some reason, GLib’s g_slice_alloc() call was failing. For those of you that don’t know what this does, it is similar to malloc() in standard C – it allocates a chunk of memory and returns it to you, so that you can make use of dynamic memory allocation, rather than everything just being auto variables. In particular, it means you can be flexible and allocate as much or as little memory as you need.

So I spent the entire day trying to figure out why my program was segfaulting. Looking at the output of gdb (the GNU Debugger), the backtrace showed that it was crashing at the allocation statement. No way, I thought; that library is so well-tested, it must be a problem with the way I'm using it.

I changed the code to use malloc() instead of g_slice_alloc(), and the program started crashing right away, rather than after four or five executions with g_slice_alloc(). Not exactly useful for debugging.

I mentioned my frustration with C on the Debian Perl Packager Group channel, and a friend from the group, Ryan Niebur, took a look at the code (accessible via a public repository). After a bit of tinkering, he determined that the problem was that I was using g_slice_alloc instead of g_slice_alloc0, which automatically zeroes memory before returning it.

It stopped the crashing and my program works as intended. I’m still left totally puzzled as to why this bug was happening, and I’m not sure if malloc isn’t supposed to be used with structs, or some other limitation like that.
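To illustrate the difference, here is roughly what separates the two calls, with a made-up Token struct standing in for the real structures in my code:

#include <glib.h>
#include <stdio.h>

/* A hypothetical struct, not the one from my project. */
typedef struct {
  char  *name;    /* a pointer field */
  gsize  length;
} Token;

int main(void)
{
  /* g_slice_alloc() hands back uninitialized memory: tok->name holds
   * whatever garbage happened to be there, so testing or freeing it
   * before it is assigned is undefined behaviour. */
  Token *tok = g_slice_alloc(sizeof(Token));

  /* g_slice_alloc0() zeroes the block first, so every field starts out
   * as 0/NULL and can safely be checked before use. */
  Token *tok2 = g_slice_alloc0(sizeof(Token));

  if (tok2->name == NULL)
    printf("zeroed slice: name starts as NULL\n");

  g_slice_free1(sizeof(Token), tok);
  g_slice_free1(sizeof(Token), tok2);
  return 0;
}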

But thanks to the magic of open source and social coding/debugging, the expertise of many contributes to the success of a single project. It's such a beautiful thing.

Update: There were a lot of questions and comments, mainly relating to the fact that malloc and friends return chunks of memory that may not have been zeroed.

Indeed, this was the first thing I considered, but the line it happened to be crashing on was a line that pretty much just did g_slice_alloc, rather than any of the statements after that.

For those that are curious, all of the code is visible in the public repository.

I do realize that the fixes that have been made are pretty much temporary, and that they are probably just masking a bigger problem. However, I'm at a loss as to what the issue is. Hopefully the magic of open source will work for me again, and one of the many people who have commented will discover it.

Read Full Post »

They say that premature optimization is the root of all evil, and they are right. Most likely, this code performs as well as, or potentially even worse than, checking that strcmp(s1, s2) == 0.

Nonetheless, while working on my Google Summer of Code project, I needed to test whether two strings are equal while ignoring case. As a result I wrote a simple function called strieq(s1, s2), which returns 1 if the strings are equal (ignoring case) and 0 otherwise. This differs from strcmp, which provides a full lexicographical comparison that my particular application didn't require.

So here I publish my code for your viewing pleasure. Possibly you will be able to use it, and while I’ve tested it and tried to make it safe, I’m not an extremely experienced C coder, so I may have missed something. If there is a bug, please report it (via e-mail or in the comments section) so that I can update it.

#ifndef TRUE            /* defined here so the snippet stands alone */
# define TRUE  1
# define FALSE 0
#endif

int strieq(const char *, const char *);
static inline char lower(char);

/* strieq(const char *str1, const char *str2)
 *
 * This function takes two ASCII strings and compares them for equality in a
 * case-insensitive manner. It doesn't work with Unicode strings -- that is,
 * it will only fix case for normal ASCII strings. Everything else is passed
 * through as-is.
 */
int strieq(const char *str1, const char *str2)
{
  while (TRUE)
  {
    if ((*str1 == '\0' && *str2 != '\0') || (*str1 != '\0' && *str2 == '\0'))
      break;

    /* From the above test we know that they are either both non-null or both
     * strings are at the end. The latter case means we have a match.
     */
    else if (*str1 == '\0')
      return TRUE;

    /* Check if the lowercased version of the characters match */
    if (lower(*str1) != lower(*str2))
      break;

    str1++; str2++;
  }
  return FALSE;
}

/* lower(char ch)
 *
 * This function takes a single alphabetic ASCII character and makes it
 * lowercase, useful for comparison.
 */
static inline char lower(char ch)
{
  /* Check that ch is between A and Z */
  if (ch >= 'A' && ch <= 'Z')
    return (ch + ('a' - 'A'));
  /* Otherwise return it as-is */
  return ch;
}
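
A quick sanity check, assuming strieq() and lower() above are compiled into the same file:

#include <stdio.h>

int main(void)
{
  printf("%d\n", strieq("Hello", "hELLo"));  /* 1: equal, ignoring case */
  printf("%d\n", strieq("Hello", "Help"));   /* 0: not equal */
  printf("%d\n", strieq("", ""));            /* 1: two empty strings match */
  return 0;
}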

In case this is an issue:

I, the copyright holder of the above snippet for strieq and lower, hereby release the entire contents therein into the public domain. This applies worldwide, to the extent that it is permissible by law.

In case this is not legally possible, I grant any entity the right to use this work for any purpose, without any conditions, unless such conditions are required by law.

Update: Small fixes to documentation and code; WordPress must have stripped characters when I first inserted the code.

Read Full Post »

One of my Year II-level Computer Science courses is called Data Structures and Algorithms. In a somewhat rigorous and primarily mathematical way, we determine the optimality of different ways of solving the most common problems in computer science.

Not every method of accomplishing a task is the same, so it goes without saying that some will be better than others for a given application. Because we spend a lot of time solving a few common but complex problems, general solutions are indispensable. But comparing algorithms directly is like comparing apples and oranges. Every day, we manage to compare disparate objects by finding a common measure – perhaps the weight, or the size/volume.

In computer science, one such measure is Big Oh time complexity, which tells us the so-called "order" of a function or algorithm. In simple terms, Big Oh gives an upper bound on the number of basic operations (additions, subtractions, "print" statements) required to accomplish a task as the input grows. Our class is concerned with the asymptotic behaviour of the order – that is, what happens as the parameter grows very large ("toward infinity") – for large structures and large sets of data. There is also Big Oh space complexity, which I won't discuss here.

After years of rigorous work by mathematicians, software engineers and computer scientists alike, you might ask yourself, why is it important to understand algorithm analysis? After all, conceivably, there already exists an “optimal” solution to every problem, right?

But that’s only half of the picture. The definition of optimality depends largely on the domain you’re looking at, and this makes a lot of sense – your ability to use these general solutions depends on external constraints. For example, development on embedded systems poses problems for developers because they offer slower processor speeds and have more constrained memory.

However, the scientific method tells us that we must hold all variables constant except the one under inspection. That means that to compare two algorithms properly, we must inspect them along the same dimension (time or space) while ignoring the other factors.

The measure we focus on in Data Structures and Algorithms is the “worst-case time complexity” of algorithms, which is to say, the number of primitive operations that must be conducted to accomplish a given task in the worst case. The definition of what “worst case” is varies from application to application and requires a greater understanding of the algorithm in question.

If an algorithm always requires T(n) = n + 5 primitive operations for an input of size n, then the order of T(n), written O(T(n)), is O(n). Why? The strictly mathematical definition is a bit convoluted (and can be found elsewhere), but the essential idea is that Big Oh notation provides an upper bound on the amount of time an algorithm takes in the worst conceivable case, for data sets approaching infinite size: T(n) is O(g(n)) if there are constants c > 0 and n0 such that T(n) ≤ c·g(n) for every n ≥ n0. Here n + 5 ≤ 2n whenever n ≥ 5, so c = 2 and n0 = 5 do the job. It's not difficult to see why the constant term "5" is dropped: for large data sets, n becomes much larger than the constant shift factor "5", so the constant becomes less and less significant.
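As a concrete (if simplified) illustration, not one taken from the course notes, a linear search over an array behaves much like the T(n) = n + 5 example:

#include <stddef.h>

/* Linear search: in the worst case (key absent or sitting in the last slot)
 * the loop body runs n times, so T(n) = n + c for some small constant c of
 * setup work. Doubling n roughly doubles the work, which is exactly what
 * O(n) captures; the constant is swamped as n grows, so it gets dropped. */
int linear_search(const int *a, size_t n, int key)
{
  for (size_t i = 0; i < n; i++)
    if (a[i] == key)
      return (int)i;  /* found: possibly far fewer than n comparisons */

  return -1;          /* not found: the worst case, n comparisons */
}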

Now we have a way of comparing algorithms. Unfortunately, while this measure of algorithms is simple to learn and good in a basic mathematical sense, it neglects the impact of different technologies on algorithmic performance. Some algorithms are specially designed by Computer Scientists to be optimized for some of these more complicated mechanisms, which is why benchmarking on your target systems is always important.

Modern processors come with all sorts of nifty additions, including technologies like hyper-threading, multiple cores, extra L1/L2/L3 cache and differing instruction set designs like RISC and CISC. Understanding these requires an understanding of the physical properties of the underlying hardware, and I hope to spend some time explaining them in greater depth in later articles.

Read Full Post »