Wednesday, September 12, 2007

Data problems

This isn't a new observation or anything, but there's been an explosion in storage capacity, with more revolutions still to come... yet our ability to organize and manage that data has improved only incrementally over the last two decades. For example, at the high end, every medium-to-large business drowns in more data than it can manage, yet core relational database (RDBMS) tech has evolved only incrementally, at least commercially - and it remains the tool of choice.

Selecting MySQL instead of Oracle is what passes for cutting edge - yeah, that's bold :P

Companies like Google, whose mission statement is about data organization, of course, don't just fall back on tried and true patterns - they created tools (genuine internal infrastructure) like BigTable to help address these issues.

In a similar vein (courtesy of Slashdot), you should check out the Database Column - a "multi-author blog on database technology and innovation" ("Column", get it? :D). Clearly, there's some interesting thinking going on in this space that will change how data management happens - column-based DB tech vs. row-based is really only the tip of the iceberg, but provides a nice visual metaphor for how sideways things will get.
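For anyone who hasn't bumped into the row vs. column distinction before, here's a toy sketch of the idea. The sample records, field names and the `to_columns` helper are all made up for illustration, and a real column store does far more (compression, on-disk layout, and so on) - this only shows why an aggregate over one field touches less data when values are grouped by column.

```python
# Hypothetical sample data; the field names are illustrative only.
rows = [
    {"id": 1, "region": "east", "sales": 120},
    {"id": 2, "region": "west", "sales": 80},
    {"id": 3, "region": "east", "sales": 200},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: every whole record is visited just to read one field.
total_from_rows = sum(row["sales"] for row in rows)

# Column layout: the aggregate scans a single list and nothing else.
total_from_columns = sum(columns["sales"])
```

Both totals come out the same, of course; the difference is only in how much unrelated data gets dragged along for the ride.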

Interestingly enough, the "middleware" thing I referred to previously was in this space. If I manage to get off my keister with that project some weekend, I'll post a sample application... but don't hold your breath :)


Tuesday, May 29, 2007

The Bandwidth Shell Game

Whilst getting my Slashdot groove on yesterday, I encountered this: Will ISPs Spoil Online Video?

The main thrust of the article is that no ISP can actually deliver the "promised" sustained bandwidth for all users on its network (or even a large percentage of its users) at any one time.

The article is basically true on the facts, and I've touched on the topic of video bandwidth and the 'net in the past, but it's (somewhat) unfair to narrow this to an ISP issue. (I say "somewhat" considering my previous gig was at an ISP, and my current employer offers broadband ISP services, so perhaps I'm not the most objective here...)

For example, every website plans against peak load, not total possible usage - same problem: if everyone shows up at once, you can't access the promised services (paid or free) as advertised/committed. And, more to the point, Google's ever-increasing Gmail mailbox size is also bogus in the same sense - they can offer that much storage because not everyone uses 2+GB for mail (very few do, in fact).

Really, all businesses do capacity planning (online AND offline) to determine pricing (and therefore marketing claims), and bandwidth is no different in this regard.
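To put some (entirely made-up) numbers on that, here's a back-of-the-envelope sketch of oversubscription. The subscriber count, advertised rate and uplink size are hypothetical, not figures from any real ISP, and the "fair share" model is a deliberate simplification of how traffic is actually scheduled.

```python
def contention_ratio(subscribers, advertised_mbps, uplink_mbps):
    """Ratio of bandwidth sold to bandwidth actually provisioned."""
    return (subscribers * advertised_mbps) / uplink_mbps


def per_user_mbps(uplink_mbps, active_users, advertised_mbps):
    """Equal share for each active user, capped at the advertised rate."""
    if active_users == 0:
        return advertised_mbps
    return min(advertised_mbps, uplink_mbps / active_users)


# Hypothetical figures, chosen only to make the arithmetic easy.
SUBSCRIBERS = 1000
ADVERTISED_MBPS = 6
UPLINK_MBPS = 600

ratio = contention_ratio(SUBSCRIBERS, ADVERTISED_MBPS, UPLINK_MBPS)
quiet = per_user_mbps(UPLINK_MBPS, 50, ADVERTISED_MBPS)    # off-peak
busy = per_user_mbps(UPLINK_MBPS, 500, ADVERTISED_MBPS)    # everyone streaming
```

With these assumptions the network is sold 10 times over: 50 active users each get the full advertised 6 Mbps, while 500 simultaneous users get 1.2 Mbps apiece. The plan works right up until the usage profile changes.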

I can't even make a call for the first 30 minutes after American Idol ends - wireless capacity planning never foresaw the Seacrest effect.

And although I went to the Buffalo Wing Factory in Va ("Home of the Flatliner") for some spicy buffalo wings one night after a goodbye party for a departing colleague, they were, in fact, out of Flatliners. Grrrr....

What makes it thorny for most connected users is that the usage profile of the service, and of the Internet, continues to evolve very rapidly, making terms of service quickly seem antiquated. What people should bear in mind, though, is that the terms of service are simply a reflection of the economic and topological constraints of the network itself - usually in place to guarantee some core QoS (Quality of Service) for as many customers as possible.

Nobody's trying to trick anybody, or game the system - but you can't plan for what you don't know, and the increasing interconnectedness of things makes prediction a dicey thing. That is to say, the dumber the network, the less visibility available.

Consider, for example, that P2P applications are good (i.e. lower cost) for the endpoints (origin and destination), but usually mean MORE traffic (i.e. higher cost) for the network itself.


Tuesday, April 10, 2007

Shift Registers and De Bruijn Sequences

I tell ya' - every time you think you've got a novel idea, turns out somebody's been there already... doesn't even seem to matter how small it might be.

For example: I have this minor function I wrote ages ago, which I recently had to re-derive and rewrite. Its purpose was fairly mundane - I wanted to compute the (integer) log base 2 of an integer value that is a power of 2. Pretty simple and esoteric problem, but it pops up fairly often in graphics/sound programming (especially 3d stuff).

There are lots of solutions, and it's not really a significant performance bottleneck or anything, but I always thought my solution was rather novel. The code looks like this: (link)


The idea is pretty simple, but clever (er... if I do say so myself, I suppose :)).

Very (very) briefly: multiplying by a power of 2 is the same as left-shifting by that power (and that power is the log we're looking for). Therefore, we can construct a 32-bit pattern in which each of the 32 possible shifts (0 - 31) leaves a different 5-bit sequence in the top 5 bits. Multiplying that pattern by a power-of-two number then puts a unique sequence into the top 5 bits of the (32-bit) result, which we can shift down and use as an index into a table to "reverse" it into our result. We need the table because, of course, the encoding isn't linear with respect to the domain.

There are a number of constants you could use to meet these criteria, and I always thought of it as a compression/Huffman tree encoding problem. The current constant I computed is 0x04ad19df.

Turns out not only is this a well-known idea, but, further, it's actually a special class of space-filling curves (amongst other things) known as De Bruijn sequences. See the great "Bit Twiddling Hacks" page by Sean Eron Anderson for more information. The constant they computed, which of course also uses a different table, is 0x077CB531.
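Since the linked code isn't reproduced here, this is a rough sketch of the trick rather than my original function. It's in Python, so the `MASK32` constant stands in for the 32-bit unsigned wraparound you'd get for free in C, and the helper names (`build_table`, `make_log2`) are invented for this example. The lookup table is derived from the constant instead of being hard-coded, so the same code works for either constant.

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit unsigned overflow


def build_table(constant):
    """Map the top-5-bit window left by each shift back to the shift amount."""
    table = [None] * 32
    for shift in range(32):
        window = ((constant << shift) & MASK32) >> 27
        if table[window] is not None:
            raise ValueError("constant does not give 32 distinct windows")
        table[window] = shift
    return table


def make_log2(constant):
    """Return a log2 function for powers of two, built around `constant`."""
    table = build_table(constant)

    def log2_pow2(value):
        # `value` is assumed to be a power of two between 1 and 2**31.
        return table[((value * constant) & MASK32) >> 27]

    return log2_pow2


log2_mine = make_log2(0x04AD19DF)     # the constant from this post
log2_theirs = make_log2(0x077CB531)   # the Bit Twiddling Hacks constant

agree = all(log2_mine(1 << i) == log2_theirs(1 << i) == i for i in range(32))
```

The two constants produce different tables, but for power-of-two inputs both functions return the same exponent. Feeding in a value that isn't a power of two gives a meaningless answer, so a caller would need to check that separately.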

Results are the same.

Ah, well. Time to move on, I think..... :)


Wednesday, January 03, 2007

How to make software faster (with hardware)

Over the holidays I also read some nice posts from a Photoshop architect (and engineer) about software performance with regard to both 64-bit computing and the broader proliferation of multi-core processors (with Photoshop specifically, in this case). I've said this many (many) times, but this is a good excuse to repeat it: the best way for hardware designers and vendors (*cough* Intel *cough*) to improve the utility (and therefore value) of the CPU is to improve memory bandwidth between the CPU and RAM. Clock speeds and cores and more processors and fancier instructions will make micro-benchmarks perform better, but RAM access is what will provide BY FAR the most improvement for well-optimized software (read: where algorithms trump assembly).

For example, the biggest advantage that NVidia and ATI provide with their GPUs isn't the specialized HW instructions for rendering graphics (though those help): it's that they can access RAM, oh, about 50x to 100x faster than the CPU can.

That said, it is clear that the dramatic increase in multi-core processors in computers (and units per CPU) will create a new class of software and algorithms to be advantaged by them. Nobody's arguing that their sh!# doesn't stink...
