November 2002

Boosting Your e3000 Productivity

IMAGE: Can you increase your capacity too much?

By Bob Green

You have probably heard that increasing the capacity of a TurboIMAGE master dataset will improve performance. So wouldn’t increasing the capacity a lot be really good for performance?

First, let’s review why increasing the capacity can improve the performance of Puts, Deletes and Keyed Retrievals. For this we need to understand how entries are stored in master datasets.

I like to think of master datasets as magic dartboards with the darts as your data entries. (Actually, I am only thinking of master datasets with ASCII-type keys, since they are the only ones that use hashing. If the key field is of a binary datatype, things happen completely differently.)

You start with an empty dartboard and you throw the first dart. This is like adding your first entry.

The dart hits the board at a “random” location (this is the “hashing”), which becomes its “primary” location, the place where it “belongs.” The magic part of the dartboard is that if you throw the same dart again, it always goes to the same spot. This allows you to find it quickly. In TurboIMAGE terms, you perform repeatable mathematical calculations on the unique key value to produce a “hash location” in the master dataset. This is where that dart always lands.
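
To make the idea concrete, here is a minimal sketch in Python of a repeatable hash. The CRC-based function, the sample keys and the capacity are illustrative assumptions, not TurboIMAGE’s actual hashing algorithm:

    import zlib

    CAPACITY = 1009  # hypothetical capacity, not a real dataset's

    def primary_address(key: str, capacity: int = CAPACITY) -> int:
        """Map an ASCII key to a repeatable 'primary' record address."""
        return (zlib.crc32(key.encode("ascii")) % capacity) + 1

    print(primary_address("CUST-004217"))  # same key, same address, every time
    print(primary_address("CUST-004217"))
    print(primary_address("CUST-004218"))  # a similar key usually lands far away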

Does the next dart land right next to the first one? Not likely.

The next dart lands at a new random location and the next and the next. If the hashing is working properly, the darts are spread randomly across the board. As the number of darts increases, the board gets filled and it becomes more likely that one of the darts will hit an existing dart!

When a dart wants a primary location that is already occupied, this is called a collision, and it leads to placing the dart in a nearby “secondary” location. It takes more work to create, find and maintain secondary entries than it does primary entries. As the number of entries approaches the capacity of the master dataset, the percentage that are secondaries will increase (assuming that the distribution is actually random, which it sometimes isn’t, but that is another story).
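
You can see the effect with a short simulation. The sketch below is ordinary Python with a made-up capacity, not TurboIMAGE’s real hashing or placement logic; it simply counts how many new entries find their primary address already taken:

    import random

    def percent_secondaries(n_entries, capacity):
        # Throw n_entries "darts" at random primary addresses and count how
        # many find their primary address already occupied by an earlier dart.
        occupied = set()
        secondaries = 0
        for _ in range(n_entries):
            slot = random.randrange(capacity)
            if slot in occupied:
                secondaries += 1   # a collision: this entry becomes a secondary
            else:
                occupied.add(slot)
        return 100.0 * secondaries / n_entries

    capacity = 100_000
    for fill in (0.50, 0.70, 0.80, 0.90, 0.99):
        pct = percent_secondaries(int(capacity * fill), capacity)
        print(f"{fill:4.0%} full -> about {pct:4.1f}% secondaries")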

Note: TurboIMAGE tries to put colliding entries near their primary location. For this reason, performance only degrades seriously when all the locations in that page of disk space are full and TurboIMAGE has to start reading and examining neighboring pages.

What happens when you increase the capacity of the master dataset?

You must remove all the darts and throw them again, because now each will have a new magic location. In TurboIMAGE terms, you need to start with a new empty master dataset and re-put all the entries. Because there is more space and the hash locations are random, the entries will spread out more, with fewer collisions!
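
Here is the same kind of sketch showing why: the hash address depends on the capacity, so changing the capacity gives every key a new home. Again, the CRC-based hash and the capacities are made up for illustration:

    import zlib

    def primary_address(key, capacity):
        return (zlib.crc32(key.encode("ascii")) % capacity) + 1

    old_capacity, new_capacity = 10_007, 15_013   # hypothetical before/after
    for key in ("CUST-000001", "CUST-004217", "CUST-990042"):
        print(key, primary_address(key, old_capacity), "->",
              primary_address(key, new_capacity))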

This explains why increasing the capacity usually improves the performance of DBPUT, DBDELETE and DBGET mode 7 (keyed).

Now back to our original question: if increasing the capacity by 50 percent reduces collisions and improves performance, wouldn’t increasing the capacity by 500 percent improve it even more? The answer is “Yes and No.”

Yes, most functions will improve by a very small amount. However, once you have reduced the number of secondaries so that 99 percent of them reside in the same disk page as their primary, you can’t improve performance anymore. The minimum time it takes to find an entry is one disk read – once you reach that point, adding unused capacity is pointless.
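
A back-of-the-envelope calculation (with assumed percentages, not measurements) shows why the gain flattens out: the average cost of a keyed lookup can never fall below one disk read.

    def avg_reads(extra_read_fraction):
        # One read for the primary's page, plus an occasional extra read when
        # the entry had to be placed on some other page.
        return 1.0 + extra_read_fraction

    for fraction in (0.10, 0.01, 0.001, 0.0):
        print(f"{fraction:6.1%} of lookups need a second read -> "
              f"{avg_reads(fraction):.3f} reads on average")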

But there is a more important “No” answer.

If you increase the capacity of a master dataset to create a lot of empty space, at least one function, the serial scan, will be much slower.

A serial scan is a series of DBGET mode 2 calls that looks at every entry in a dataset. It is used when you need to select entries by some criterion other than the key field. For example, you want to select all the customers in the 215 area code, but the key field is customer-id.

Each DBGET mode 2 call must find and return the “next” physical entry in the master dataset. If the dataset is mostly empty, the next physical entry could be far away. But the DBGET cannot complete until it has read all the intervening disk space!

To read all the entries in the master dataset, DBGET must read every page in the dataset until it reaches the last entry. If you increase the size of a master dataset from 10,000 pages to 100,000 pages, you have made serial scans take 10 times as long!
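
A small sketch of that cost model (the page size and capacities here are illustrative, not real MPE/iX values): the number of pages a serial scan must read tracks the capacity, not the number of entries actually stored.

    def serial_scan_pages(capacity_entries, entries_per_page=10):
        # A full serial scan reads every page of the dataset, occupied or not.
        return -(-capacity_entries // entries_per_page)   # ceiling division

    entries_stored = 100_000
    for capacity in (100_000, 200_000, 1_000_000):
        pages = serial_scan_pages(capacity)
        print(f"capacity {capacity:>9,}: {pages:>7,} pages to scan "
              f"{entries_stored:,} entries")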

So increasing the capacity of a master dataset drastically above the expected number of entries is usually a very bad idea.

Note: TurboIMAGE has a new feature called Dynamic Dataset Expansion (MDX), which allows colliding entries to be put in an overflow area rather than in an adjacent free secondary location. This can be helpful in cases where the entry values are not hashing randomly, or when you need a quick fix but don’t have time to shut the database down to do a capacity change.
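
Conceptually (and only conceptually; this is not the real MDX data structure), the placement rule changes from “find a nearby free slot” to “append to an overflow area”:

    def place_entry(primary_area, overflow_area, slot, entry):
        # Put the entry at its primary address if free; otherwise send it to
        # the overflow area instead of hunting for a nearby free slot.
        if primary_area[slot] is None:
            primary_area[slot] = entry
        else:
            overflow_area.append((slot, entry))   # logically chained to the slot

    primary = [None] * 8
    overflow = []
    for slot, name in [(3, "A"), (3, "B"), (5, "C")]:
        place_entry(primary, overflow, slot, name)
    print(primary)
    print(overflow)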

For more information on how the internal mechanisms of TurboIMAGE impact performance, read this Robelle tutorial “IMAGE Internals and Performance” at www.robelle.com/library/tutorials/pdfs/imgperf.pdf

 


Copyright The 3000 NewsWire. All rights reserved.