
A Practical Guide to Making Use of B-trees

By Wirt Atmar

Editor's Note: The Express 3 release of MPE/iX 5.5, now shipping to HP customers on support, includes the first built-in indexes for TurboIMAGE as a free enhancement. Known informally as b-trees, the upgrade promises radically faster database access for HP 3000 systems.

You’re going to like the new b-trees – a lot – if you write queries into IMAGE datasets.

The b-trees are attached only to master datasets (automatic or manual, it makes no difference). But the masters are, of course, themselves attached to detail datasets. What used to be a two-level structure is now three-level. But you don’t have to think about it in that manner. You can’t directly access the information in the b-tree itself. The b-trees are there in support of queries that you couldn’t ask before in IMAGE without a serial search, such as:

find name is GOR@
find city > Chic
find amount ib 50.00,99.99

Each individual query language that uses the new b-tree feature may adopt a slightly different syntax, but the basic behavior will be the same nonetheless. Each one of these queries is now a chained search, rather than a serial search, although they may not look like it.

If the data items NAME, CITY, and AMOUNT are search items in a detail dataset (or key items in a master dataset), and if a b-tree is “attached” to the master dataset appropriate for each of these search items, then each of the query formats above will go out and first create what is now called a “superchain.” That is, the process first searches the b-tree attached to the master dataset for all of the names that begin with GOR and creates a list of all of the fully-specified names that exist in the master dataset that meet this criterion (GORDON, GORE, GORFINKLE, GORTON, GORVEY, etc.). Each one of these name’s chain is then searched in turn and every qualifying entry is marked, exactly as it always has been, as if it were just a single name you were looking for. All of this is done completely automatically, without you having to do anything special at all.
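The superchain process described above can be sketched in a few lines. This is only an illustration: the sorted list stands in for the KSAM b-tree, the dictionary stands in for the master dataset's chains, and all of the names (`superchain`, `master_chains`, the record numbers) are invented for the example, not part of IMAGE.

```python
import bisect

def superchain(index, master_chains, prefix):
    """Sketch of a generic (GOR@) retrieval via an attached b-tree."""
    # Step 1, the b-tree step: find every fully-specified key value
    # in the (sorted) index that begins with the prefix.
    i = bisect.bisect_left(index, prefix)
    keys = []
    while i < len(index) and index[i].startswith(prefix):
        keys.append(index[i])
        i += 1
    # Step 2, the chained step: walk each qualifying key's chain
    # exactly as a single fully-specified FIND always has.
    results = []
    for key in keys:
        results.extend(master_chains.get(key, []))
    return results

index = sorted(["GORDON", "GORE", "GORFINKLE", "SMITH"])
chains = {"GORDON": [101, 102], "GORE": [203], "GORFINKLE": [307], "SMITH": [999]}
print(superchain(index, chains, "GOR"))   # → [101, 102, 203, 307]
```

The point of the sketch is that the serial scan disappears: only the keys with the GOR prefix are visited, and each one is then retrieved by an ordinary chained read.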

Getting started

As for turning the b-tree feature on, that too is surprisingly simple. Three new commands have been added to DBUTIL. They are: ADDINDEX, DROPINDEX, and REBUILDINDEX. The commands are fully explained in DBUTIL’s help. I personally recommend adding b-tree indexes to every master dataset in your databases – unless you have some overwhelming reason to argue that no benefit would be obtained from one or another specific master dataset having a b-tree attached.

B-trees also offer the possibility of changing the manner in which IMAGE databases have been used in the past. In the examples I gave above, AMOUNT is clearly a real number. Ordinarily, it would have been considered a little déclassé to put a hashing index on a real number, but I think that changes now. An automatic master could be attached to a real number field, such as AMOUNT, and with a b-tree attached, you can now ask range questions and get answers back almost immediately.
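A range question over a numeric key works the same way as the generic search, just with two boundaries instead of a prefix. Again, a minimal sketch with invented names, using a sorted list as the b-tree stand-in:

```python
import bisect

def find_between(index, master_chains, lo, hi):
    """Sketch of a range retrieval (e.g. AMOUNT between 50.00 and 99.99)."""
    # The b-tree locates the slice of key values inside the range...
    start = bisect.bisect_left(index, lo)
    stop = bisect.bisect_right(index, hi)
    hits = []
    # ...and each qualifying value is then read by an ordinary chained search.
    for value in index[start:stop]:
        hits.extend(master_chains[value])
    return hits

amounts = sorted([10.00, 49.99, 50.00, 75.25, 99.99, 120.00])
chains = {10.00: [1], 49.99: [2], 50.00: [3, 4], 75.25: [5], 99.99: [6], 120.00: [7]}
print(find_between(amounts, chains, 50.00, 99.99))  # → [3, 4, 5, 6]
```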

What performance increase should you expect with b-trees over serial searches? In our in-house measurements, we’re only seeing about a 100x increase, but that’s simply because our databases are too small. We’re not a production shop, so we don’t have enormously large datasets. But if the NAME field existed in a four-million record dataset and the query above only had thirty records in total for all of the names that began with GOR, you could expect potentially a 5000x increase in retrieval speed, as compared to an MR-NOBUF, high-speed serial search. That’s not a bad bit of performance enhancement – especially given the price of the enhancement.

You almost don’t have to do any benchmarks to understand how well the new b-trees are implemented. Since the b-tree is attached to the master dataset, the b-tree is automatically connected to all of the detail datasets that any single master has paths to.

The KSAM b-tree maintains an exact one-to-one relationship with the values held in the master dataset, data item value for data item value. Thus, for most transactions – because most transactions do not cause a master dataset’s search value to be either added or deleted – there will be absolutely no performance hit due to the presence of the b-tree. You simply can’t do better than that. The addition or deletion of records from the detail datasets will proceed as they always have in the past. The appropriate search chains for that value in that dataset are simply made longer or shorter. The master and its attached b-tree are untouched.

The only time the b-tree is (automatically) modified is when a new indexing value is added to the master dataset or when it is deleted (which occurs only when it disappears from the last detail dataset of the sets which the master dataset indexes). Both of these conditions are relatively rare events for most of the indexing values held in most databases. The only possible exception would be for search fields such as order numbers that are uniquely created on the fly. In this case, the first dataset to add the order number would pay a small price – the addition of one new search value to the master dataset, as it always has been in the past, and one new search value to the b-tree. That’s a new additional cost, but it is very small. What you get for that small cost is that every subsequent detail dataset to which this new order number value is written during this transaction is free, without penalty.
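The maintenance rule in the two paragraphs above can be made concrete with a toy model. Everything here (the class, `btree_writes`, the order-number key) is illustrative, not IMAGE internals; the sorted list again stands in for the KSAM index.

```python
import bisect

class MasterWithBtree:
    """Toy model: a master dataset with an attached b-tree index."""

    def __init__(self):
        self.chains = {}       # master: key value -> its detail chain
        self.btree = []        # stand-in for the attached KSAM b-tree
        self.btree_writes = 0  # how often the index itself is touched

    def add_detail(self, key, record):
        if key not in self.chains:
            # A brand-new key value: one master entry (as always) plus
            # one b-tree entry. This is the only new cost, paid once.
            self.chains[key] = []
            bisect.insort(self.btree, key)
            self.btree_writes += 1
        # An existing key value: the chain just grows; b-tree untouched.
        self.chains[key].append(record)

    def delete_detail(self, key, record):
        self.chains[key].remove(record)
        if not self.chains[key]:
            # The value vanished from its last chain: it leaves the
            # master and, with it, the b-tree.
            del self.chains[key]
            self.btree.remove(key)
            self.btree_writes += 1

db = MasterWithBtree()
for rec in range(5):
    db.add_detail("ORDER-1001", rec)  # only the first add touches the b-tree
print(db.btree_writes)                # → 1
```

Five detail records were written against the same order number, but the index was modified exactly once, which is the source of the "every subsequent write is free" behavior.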

The single largest performance hit will occur when you first turn b-trees on and build the KSAM indexes for all of your masters. Due to a variety of circumstances – the size of your database being the primary one – it’s impossible to estimate how long that will take, but it ain’t long. I drop and add all of the b-tree indexes to our databases on a regular basis. I’ve been pleased with how fast it goes – and we did almost all of our testing on a Series 922. No matter how long it takes, in a production circumstance, it only has to be done once.

When you should index

IMAGE’s R4 and E4 number types are 64-bit numeric formats – where most of the information about the value in a real number resides in the left-most bits. For small, integer numbers stored in R4/E4 formats, the right-most 32 bits will all be zero – and that represents a catastrophe for the hashing algorithm that IMAGE uses. All of these small, integer R4 numbers will hash to the same address in the master dataset, leading to one incredibly long synonym chain (and abysmal performance)!

IMAGE uses the right-most 32 bits because that is where most of the information lies for an integer number. Integer-based formats begin counting in the right-most (least significant) bit. Real numbers, on the other hand, begin counting at the other end, more or less at the left-most (most significant) bits – and that’s where the greatest amount of “information” (unexpectedness) lies for a real number. IMAGE, unfortunately, doesn’t discriminate between reals and integers and always hashes its address values based on the right-most bits, leaving us with the R4 problem.
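You can see the hazard directly. The demonstration below uses IEEE 754 doubles as a stand-in for IMAGE's 64-bit real format (the HP 3000's native real format differs in detail, but shares the property at issue): every small integer stored as a 64-bit real has all-zero right-most 32 bits, so any hash drawn from those bits sends them all to one address.

```python
import struct

def low32(x):
    """Right-most 32 bits of the 64-bit representation of a real number."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits & 0xFFFFFFFF

# Every small integer value, stored as a 64-bit real, carries all of its
# information in the left-most bits; the right-most 32 bits are zero.
print({low32(float(n)) for n in range(1, 1000)})  # → {0}: total collision
```

A thousand distinct values, one hash input: that is the "one incredibly long synonym chain" in miniature. A fractional value such as 0.1, by contrast, does populate its low bits.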

However, if the real numbers are stored as R2/E2 (32-bit) numbers, then there is no real problem (pun intended), so long as we expect to see a fair diversity in the numeric values in the dataset. If, by the worst possible luck, the real-numbered values in the dataset predominantly represent a harmonic series of the current capacity of the master dataset, then we’ll again be saddled with a substantial amount of clustering and long synonym chains, but the cure for such a condition is, as it always has been, to simply change the capacity of the master. Nonetheless, such resonances are an extremely unlikely condition, by any measure.
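The resonance effect, and its classic cure, can be shown with a simplified model. IMAGE's actual hash is not a bare modulo, so treat this purely as an illustration of why values that march in step with the capacity cluster, and why changing the capacity breaks the pattern up:

```python
# Simplified model: hash address = key value mod capacity.
capacity = 101
resonant = [n * capacity for n in range(1, 50)]   # a "harmonic" series of the capacity

# Every resonant value lands on the same address: one long synonym chain.
print(len({v % capacity for v in resonant}))      # → 1

# The classic cure: change the master's capacity and the series spreads out.
new_capacity = 103
print(len({v % new_capacity for v in resonant}))  # → 49
```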

So, the bottom-line moral is: don’t go about indexing your R4/E4 numeric fields without forethought. Absolutely don’t index your R4/E4 fields if they are likely to contain only small integer values. However, R2/E2 numbers shouldn’t give you too many problems.

What b-trees can mean to you

The mental model that we preach to every one of our users, CEO and DP manager alike, is to think of IMAGE as a set of steel filing cabinets, the detail datasets being the large steel drawers, marked with little white cards on the front, saying things such as JOBS, INVOICES, CHECKS, etc., and the master datasets as being 3x5 indexing cards that sit on top of the filing cabinets.

While we actually do explain how b-trees work to everyone, most people simply sit there and politely listen. It’s all interesting, of course, in the same way that it’s interesting for you to know how the electrical capacitance changes in your car engine’s cylinders as the piston nears the top of its stroke, but that’s all it is: interesting. Most people merely want to know how the feature affects their work and how might they take advantage of it.

The important take-home lesson of b-trees is that you no longer need to know a search value in IMAGE in its entirety in order to get a high-speed, indexed search. While we debated for some time what to call the new, b-tree-enhanced search items, we eventually settled on labeling them “full” and “simple” search items. A “simple” search item is a standard, hashed IMAGE search item, where you have to know the value in its entirety; a “full” search item is one against which you can ask for either a high-speed range search or a generic search (if you know the beginning of the value – and it’s a text field).

CSY is to be sincerely congratulated for the quality of work they’ve done on b-trees.

Wirt Atmar is a member of the SIGIMAGE Executive Committee and has been leading the effort for IMAGE b-tree access since the 1980s.