
Posts Tagged ‘Data Structures’

A specialized storage system known as a Round Robin Database allows one to store large amounts of time-series data, such as temperatures, network bandwidth and stock prices, within a constant disk footprint. It does this by taking advantage of the fact that our need for precision diminishes as data ages. As we will see later, the “round robin” part comes from the basic data structure used to store data points: circular lists.

In the short term, each data point is significant: we want an accurate picture of every event that has occurred in the last 24 hours, which might include small transient spikes in disk usage or network bandwidth (which could indicate an attack). However, in the long term, only general trends are necessary.

For example, if we sample a signal at 5-minute intervals, then a 24-hour period will have 288 data points (24 hrs × 60 mins/hr ÷ 5 mins per sample). Considering each data point is probably¹ only 4 (float), 8 (double) or 16 (quad) bytes, it’s not problematic to store roughly three hundred data points. However, if we continue to store every sample, a year would require about 105,120 (365 × 288) data points; multiplied over many different signals, this can become quite significant.

To save space, we can compact the older data using a Consolidation Function (CF), which performs some computation on many data points to combine them into a single point covering a longer period. Imagine that we take an average of those 288 samples at the conclusion of every 24-hour period; in that case, we would only need 365 data points to store data for an entire year, albeit at an irrecoverable loss of precision. Though we have lost precision (we no longer know what happened at exactly 5:05pm on the first Tuesday three months ago), the data is still tremendously useful for demonstrating general trends over time.
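
To make the mechanics concrete, here is a minimal sketch in C (my own illustration, not RRDtool’s actual on-disk format) of a fixed-size circular archive whose consolidation function averages every 288 five-minute samples into a single daily point:

#include <stdio.h>

#define DAILY_SLOTS   288   /* 24 hours of 5-minute samples                */
#define YEARLY_SLOTS  365   /* one consolidated (averaged) point per day   */

/* A fixed-size circular archive: writes wrap around and overwrite the
   oldest entry, so the memory/disk footprint never grows.               */
struct rra {
    double points[DAILY_SLOTS];
    size_t next;            /* index of the slot to overwrite next */
};

static double daily_avg[YEARLY_SLOTS];  /* long-term, low-resolution archive */
static size_t daily_next;

/* Consolidation Function: collapse the day's samples into one average. */
static double consolidate(const struct rra *a)
{
    double sum = 0.0;
    for (size_t i = 0; i < DAILY_SLOTS; i++)
        sum += a->points[i];
    return sum / DAILY_SLOTS;
}

static void add_sample(struct rra *a, double value)
{
    a->points[a->next] = value;
    a->next = (a->next + 1) % DAILY_SLOTS;      /* wrap around: "round robin" */

    if (a->next == 0) {                         /* a full day has gone by */
        daily_avg[daily_next] = consolidate(a);
        daily_next = (daily_next + 1) % YEARLY_SLOTS;
    }
}

int main(void)
{
    struct rra archive = { {0}, 0 };
    for (int i = 0; i < DAILY_SLOTS; i++)
        add_sample(&archive, 20.0 + (i % 10));  /* fake temperature readings */
    printf("first daily average: %.2f\n", daily_avg[0]);
    return 0;
}

Because the index simply wraps around, the oldest data is silently overwritten and the storage never grows, which is exactly the property that keeps the disk footprint constant.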

Though perhaps not the easiest to learn, RRDtool seems to have the majority of market share (without having done any research, I’d estimate somewhere between 90% and 98%, to account for those who create their own solutions in-house), and for good reason: it gets the job done quickly, provides appealing and highly customizable charts and is free and open source software (licensed under the GNU General Public License).

In a recent project, I learned to use RRDTool::OO to maintain a database and produce some interesting graphs. Since I was sampling my signal once every five minutes, I decided to replicate the archiving parameters used by MRTG, notably:

  • 600 samples store 2 days and 2 hours of data (at full resolution)
  • 700 samples store 14 days and 12 hours of data (where six samples become a 30-minute average)
  • 775 samples store 64 days and 12 hours of data (2-hour average)
  • 797 samples store 797 days of data (24-hour average)

For those interested, the following code snippet (which may be rather easily adapted for languages other than Perl) constructs the appropriate database; the file and data-source names shown are only placeholders:

use RRDTool::OO;

# The RRD file name and data-source definition below are placeholders;
# adjust them for the signal actually being sampled.
my $rrd = RRDTool::OO->new( file => 'signal.rrd' );

$rrd->create(
    step        => 300,        # one sample every five minutes
    data_source => {
        name => 'signal',
        type => 'GAUGE',
    },
    archive => {               # 600 rows at full 5-minute resolution
        rows    => 600,
        cpoints => 1,
        cfunc   => 'AVERAGE',
    },
    archive => {               # 700 rows, each a 30-minute average (6 samples)
        rows    => 700,
        cpoints => 6,
        cfunc   => 'AVERAGE',
    },
    archive => {               # 775 rows, each a 2-hour average (24 samples)
        rows    => 775,
        cpoints => 24,
        cfunc   => 'AVERAGE',
    },
    archive => {               # 797 rows, each a 24-hour average (288 samples)
        rows    => 797,
        cpoints => 288,
        cfunc   => 'AVERAGE',
    },
);

There are plenty of other examples of this technique in action, mainly related to computing. However, there are also some interesting applications such as monitoring voltage (for an uninterruptible power supply) or indoor/outdoor temperature (using an IP-enabled thermostat).

Footnotes

1. This may, of course, vary depending on the particular architecture.

Read Full Post »

The first part of my Google Summer of Code project involves the creation of a library to handle Debian package control files, which are, upon closer inspection of the Debian Policy Manual, actually encoded as UTF-8 (8-bit Unicode).

Initially, the files just “looked” like ASCII (a rather common situation with a backward-compatible encoding like UTF-8). You see, UTF-8 is, by design, indistinguishable from ASCII unless characters beyond the ASCII range are used; this is so that files containing only ASCII characters can still be interpreted as UTF-8.

All of this means there is the possibility of “wide characters”, that is, characters that require multiple bytes to encode, such as those used in other languages. Handling these cases in plain C can become a bit tedious.

I had read about the GNOME Project’s GLib but had not looked at it in any depth until now. Much to my surprise, I discovered that it is an entire framework of portable C code, providing I/O manipulation, string handling and common data structures like trees and hashes, among other things. It also provides Unicode-safe handling: text can be decoded into UTF-32 for manipulation and then written back out as UTF-8. I’m all about code reuse and, being lazy and not completely understanding Unicode in depth (after all, there is the common statement that “Internationalisation is hard”), I decided that using GLib was the best way to go.
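
As a quick illustration of what that round trip looks like, here is a minimal sketch (the string literal is just a placeholder, and error handling is kept to the bare minimum):

#include <glib.h>

int main(void)
{
    const gchar *utf8_in = "caf\xc3\xa9";   /* "café" encoded as UTF-8 (placeholder) */
    GError *error = NULL;
    glong items_read = 0, items_written = 0;

    /* Decode UTF-8 into UTF-32 (each gunichar holds one 32-bit code point). */
    gunichar *ucs4 = g_utf8_to_ucs4(utf8_in, -1, &items_read,
                                    &items_written, &error);
    if (ucs4 == NULL) {
        g_printerr("Invalid UTF-8: %s\n", error->message);
        g_error_free(error);
        return 1;
    }

    g_print("%ld bytes decoded into %ld code points\n",
            items_read, items_written);

    /* Re-encode as UTF-8 before writing the text back out. */
    gchar *utf8_out = g_ucs4_to_utf8(ucs4, items_written, NULL, NULL, NULL);
    g_print("%s\n", utf8_out);

    g_free(utf8_out);
    g_free(ucs4);
    return 0;
}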

The unfortunate side effect of this is a bit of wasted space: even if every character fits in 8 bits, 32 bits are required to store each one, meaning 24 wasted bits (3 bytes) per character. An 8 KB ASCII file is roughly 8,000 characters, so it swells to about 32 KB in memory, wasting roughly 24 KB for a file that could very well be a moderately complex Debian control file. All in all, it’s not a big deal, and the text can be converted back to UTF-8 later if desired.

Admittedly, the documentation available online in the GNOME Library (the GLib Reference Manual) leaves something to be desired. In particular, each function is explained in terms of its parameters, input and output, but no trivial example of its use is provided. It is not always clear how some functions are meant to be used: when manipulating strings in GLib 2.20, the documentation describes functions with a signature like:

GString* g_string_append_len(GString *string, const gchar *val, gssize len);

In this case, it’s unclear what the returned GString is useful for. Generally, C functions like strcat (don’t use that, by the way; use strlcat instead) update strings in place, and their return value (a pointer to the destination) is rarely used.

On the other hand, without knowing the intentions of the developers, the design might exist for portability: to make it easier to create bindings in languages that do not have the notion of returning through a pointer, such as Perl. However, in my opinion, such particulars should be left to the implementation of the bindings; the XS glue code, for instance, should have provisions for this.

Update: From reading the body of other functions, it seems that the return value is used to allow nested calls, like: g_string_ascii_down(g_string_append_len(…)). It’s sort of neat, really; shows some design foresight in the library.
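
For example, the following toy snippet (the field value is a made-up placeholder) shows the nesting style that the return value makes possible:

#include <glib.h>
#include <string.h>

int main(void)
{
    const gchar *field = "LIBPARSE-DEBCONTROL-PERL";  /* placeholder value */

    /* Each call returns its GString argument, so calls can be nested. */
    GString *s = g_string_new("package: ");
    g_string_ascii_down(g_string_append_len(s, field, strlen(field)));

    g_print("%s\n", s->str);   /* prints "package: libparse-debcontrol-perl" */

    g_string_free(s, TRUE);
    return 0;
}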

Read Full Post »

One of my Year II-level Computer Science courses is called Data Structures and Algorithms. In a somewhat rigorous and primarily mathematical way, we determine the optimality of different ways of solving the most common problems in computer science.

Not every method of accomplishing a task is the same, so it goes without saying that some will be better than others for a given application. Because we spend a lot of time solving a few common but complex problems, general solutions are indispensable. But comparing algorithms is like comparing apples and oranges: it is difficult to do directly. Every day, though, we compare disparate objects by finding a common measure, perhaps weight, size or volume.

In computer science, one such metric is Big Oh time complexity, which tells us the so-called “order” of a function or algorithm. In simple terms, Big Oh tells us the maximum number of basic operations (additions, subtractions, “print” statements) required to accomplish a task. Our class is concerned with the asymptotic properties of this order, that is, what happens as the input grows very large (“approaches infinity”): large structures and large sets of data. There is also Big Oh space complexity, which I won’t discuss here.

After years of rigorous work by mathematicians, software engineers and computer scientists alike, you might ask yourself, why is it important to understand algorithm analysis? After all, conceivably, there already exists an “optimal” solution to every problem, right?

But that’s only half of the picture. The definition of optimality depends largely on the domain you’re looking at, and this makes a lot of sense: your ability to use these general solutions depends on external constraints. For example, development on embedded systems poses problems for developers because such systems have slower processors and far more constrained memory.

However, the scientific method tells us that we must first hold all variables constant except the one under inspection. That means that to compare two algorithms properly, we must inspect them along the same dimension (time or space) and ignore the other factors.

The measure we focus on in Data Structures and Algorithms is the “worst-case time complexity” of algorithms, which is to say, the number of primitive operations that must be conducted to accomplish a given task in the worst case. The definition of what “worst case” is varies from application to application and requires a greater understanding of the algorithm in question.

If there is an algorithm that always requires T(n) = n + 5 primitive operations to complete for an input of size n, then the order of T(n), written O(T(n)), is O(n). Why? The strictly mathematical definition is a bit convoluted (and can be found elsewhere), but the essential idea is that Big Oh notation provides an upper bound on the number of operations an algorithm will perform in the worst conceivable case, for data sets approaching infinite size. It’s not difficult to see why the constant term “5” is dropped, then: for large data sets, n becomes much larger than the constant shift factor “5”, so the constant becomes less and less significant.
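
As a small illustration (my own, not from the course), consider a function that sums an array: it performs one addition per element plus a constant amount of bookkeeping, so T(n) = n + c and the function is O(n).

#include <stdio.h>

/* Sums an array of n values: n additions plus a constant amount of
   setup and bookkeeping, so T(n) = n + c, which is O(n). */
long sum(const int *values, long n)
{
    long total = 0;               /* constant setup           */
    for (long i = 0; i < n; i++)  /* n iterations             */
        total += values[i];       /* one addition per element */
    return total;                 /* constant wrap-up         */
}

int main(void)
{
    int data[] = { 3, 1, 4, 1, 5 };
    printf("%ld\n", sum(data, 5));   /* prints 14 */
    return 0;
}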

Now we have a way of comparing algorithms. Unfortunately, while this measure is simple to learn and sound in a basic mathematical sense, it neglects the impact of different technologies on algorithmic performance. Some algorithms are specially designed by computer scientists to take advantage of these more complicated mechanisms, which is why benchmarking on your target systems is always important.

Modern processors come with all sorts of nifty additions, including technologies like hyper-threading, multiple cores, extra L1/L2/L3 cache and different instruction-set designs like RISC and CISC. Understanding these requires an understanding of the physical properties of the underlying hardware, and I hope to spend some time explaining them in greater depth in later articles.

Read Full Post »