
July 2003

Boosting Your e3000 Productivity

Paths are for People

By Bob Green

Imagine you have a customer order tracking system and you need a weekly report on sales by salesman. Many programmers automatically think of making the salesman code into a TurboIMAGE search key in order to optimize the report. Is this a good idea?

First a little background information, which may be review for many readers.

A path is the structural relationship between a TurboIMAGE master dataset and a TurboIMAGE detail dataset linked through a common search item.

For example, the customer master dataset provides the index for unique customer names as well as the customer name and address. The line item detail dataset provides information on what a customer has ordered: customer number, item number, quantity, price, order number, salesman code, date ordered, date shipped, etc. There can of course be multiple line items per customer, and usually over time there are many. A path between the two datasets allows you to retrieve the line items for a specific known customer number.

A chain is the collection of detail entries that share the same search value; the head of the chain resides in the master entry for that search value and points to the first and the last detail entries on the chain. Each detail entry also carries a forward and a backward pointer, so retrieving the whole chain is relatively easy and fast (a maximum of one disc read per record). This is implemented in TurboIMAGE as a DBFIND followed by chained DBGETs (mode 5).
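
To picture that layout, here is a small Python model of one chain. It is only an illustration of the pointer structure, not real TurboIMAGE intrinsic calls, and the record numbers and values are invented:

  # Model of a chain: the master entry holds the chain head (first/last
  # detail record numbers); each detail entry has forward/backward pointers.
  detail = {
      # record number: (search value, forward ptr, backward ptr, payload)
      101: ("C001", 205, None, "line item A"),
      205: ("C001", 310, 101,  "line item B"),
      310: ("C001", None, 205, "line item C"),
  }
  master = {"C001": {"first": 101, "last": 310, "count": 3}}  # chain head

  def chained_get(search_value):
      """Model of a DBFIND followed by chained DBGETs (mode 5)."""
      head = master[search_value]          # DBFIND: locate the chain head
      rec = head["first"]
      while rec is not None:               # follow the forward pointers
          _value, fwd, _bwd, payload = detail[rec]
          yield rec, payload
          rec = fwd

  for rec, payload in chained_get("C001"):
      print(rec, payload)   # in real IMAGE: at most one disc read per entry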

Every path you add to a TurboIMAGE dataset adds overhead and slows response time, because the pointer chains must be changed for each DBPUT and DBDELETE into the detail dataset. The more paths, the more chains to update. And this overhead usually occurs at the worst possible time: when the end-user is entering or amending an order.

To decide how many inquiry paths you should have for a dataset, you need to know the volatility of the dataset: how often do people ask questions about the data (read access) versus how often they update the search keys in the data (write access)? The read/write ratio tells you whether the dataset is stable or volatile.

If the read/write ratio is low, you should not have many search paths, because the overhead of updating them during the many writes will overwhelm the benefit during the few reads. For example, a dataset that contains active customer orders is probably volatile.

However, if the read/write ratio is high then the dataset is more stable (i.e., it doesn’t change that quickly). You can afford many more search keys in such a dataset. For example, a dataset that archives completed customer orders is probably very seldom (if ever) updated. Spending a second or more to write a record into this dataset makes sense, because it gives you many search keys per record and is only done once.

Suppose you write records into it only 500 times a day: at a second or so per write, the total cost is less than 10 minutes of processing each day. And even if the system is called upon to search for over a million records a day, the generous use of keys makes those searches, and the overall application, work better. The only way to be confident that you have properly indexed your data is to measure, measure, measure.
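
As a rough sketch of that arithmetic (the one-second write cost comes from the discussion above; the other figures are invented for illustration):

  writes_per_day = 500
  seconds_per_write = 1.0          # heavily indexed DBPUT (assumed cost)
  write_cost_minutes = writes_per_day * seconds_per_write / 60
  print(f"daily indexing cost: {write_cost_minutes:.1f} minutes")  # under 10 minutes

  reads_per_day = 1_000_000
  ratio = reads_per_day / writes_per_day
  print(f"read/write ratio: {ratio:,.0f} to 1, a stable dataset; extra paths pay off")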

The purpose of a path is to divide a dataset into small, manageable subsets of entries that can be retrieved, displayed, and updated easily for real people who are sitting at workstations!

A path should divide a dataset into many short chains, rather than a few long chains.

So a report that is run only weekly may be marginally faster with a custom search key, but that extra search key makes the on-line usage slower during the entire week.

Paths are never absolutely necessary. You can accomplish the same inquiry by a serial scan of the entire dataset. The only reason for a path is to make the retrieval faster than a serial scan.

And our weekly report may not even be faster with the search key: if it has to read all or most of the dataset anyway, a serial scan is almost always the fastest approach.

If the number of unique search values is less than the number of entries per disc page, a serial scan will be faster than a chained retrieval. That is because a chained read costs one disc read per entry, while a serial read costs only 1/N of a disc read per entry (where N is the number of entries per page, the blocking factor).
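
A quick way to see the break-even point is to compare the average chain length with the number of pages in the dataset. This Python sketch uses invented figures (say, a handful of salesman codes) purely for illustration:

  def compare(total_entries, unique_search_values, entries_per_page):
      # Retrieving one chain costs roughly one disc read per entry on it,
      # i.e. about total/unique_values reads; a serial scan reads every page.
      chained = total_entries / unique_search_values   # average chain length
      serial = total_entries / entries_per_page        # pages in the dataset
      return chained, serial

  chained, serial = compare(total_entries=1_000_000,
                            unique_search_values=8,    # e.g. 8 salesman codes
                            entries_per_page=13)
  print(f"one chain ~{chained:,.0f} reads, serial scan ~{serial:,.0f} reads")
  # With fewer unique values than entries per page, the serial scan wins.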

It seldom makes sense to add an inquiry path strictly for a batch program. Paths should be reserved for on-line users.

Serial Dataset Scans: Good and Bad

A serial scan is an access method that reads an entire dataset in physical order to find what is wanted, rather than following a search key maintained for that dataset. MPE programmers are usually taught that TurboIMAGE search keys are efficient and serial scans are to be avoided, but that is an over-simplification.

There are situations where a serial scan is faster than a keyed search, and vice versa.

Good Serial Scans

Question: What are two ways to perform the following data retrieval?

We have a large detail dataset called Ord-Line with 2,307,685 records. We have a list of 162,770 Ord-Num key values of interest. We want to extract the Ord-Line records for those key values, and we want them sorted by Ord-Num. The Ord-Line record size is 308 bytes.

Answer: The traditional solution in TurboIMAGE is to call the DBFIND intrinsic for each of the 162,770 Ord-Num key values, then repeatedly call DBGET mode-5 (chained access) to retrieve the Ord-Line detail entries on that key-value chain (261,230 in all). If the list of Ord-Num values is sorted, the Ord-Line records will also come out sorted.

An alternative solution is to call DBGET mode-2 (serial access) for each of the 2,307,685 detail entries (or use Suprtool’s Get command), select out the 261,230 records that match the Ord-Num table, then sort those records. That sounds like a lot of work, but do we know that it will be slower than the traditional search-key method?

The traditional method takes about 162,770 DBFIND disc reads and 261,230 DBGET disc reads, for a total of about 424,000 disc reads. The reason it consumes about one disc read per record accessed is that the access is random and direct: when you retrieve one record, the chance that the next record retrieved is adjacent to it is extremely small.

If you do the serial scan with DBGET, it takes at most 177,514 disc reads, because each disc read transfers a page containing 13 records. Suprtool uses a 50,000-byte buffer and reads 159 records of 308 bytes per disc read, so doing the task with Suprtool takes only about 14,500 disc reads. The Suprtool serial scan reduces disc reads by 96.6 percent.
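
Those estimates are easy to reproduce; the following Python sketch simply restates the figures quoted above:

  ord_line_entries = 2_307_685
  keys_of_interest = 162_770
  matching_entries = 261_230
  entries_per_page = 13
  suprtool_records_per_read = 159    # 50,000-byte buffer, 308-byte records

  keyed = keys_of_interest + matching_entries          # DBFIND + chained DBGETs
  serial_dbget = ord_line_entries / entries_per_page   # one read per page
  serial_suprtool = ord_line_entries / suprtool_records_per_read

  print(f"keyed:           ~{keyed:,.0f} reads")            # ~424,000
  print(f"serial DBGET:    ~{serial_dbget:,.0f} reads")      # ~177,514
  print(f"serial Suprtool: ~{serial_suprtool:,.0f} reads")   # ~14,500
  print(f"reduction: {100 * (1 - serial_suprtool / keyed):.1f}%")  # ~96.6%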

Suprtool can do serial scans much faster than other tools and application programs because it bypasses the overhead of TurboIMAGE by doing NOBUF/MR access to the underlying file of the dataset.

Bad Serial Scans

Question: Will a serial scan be faster than keyed access if you have 1 million entries in a TurboIMAGE master dataset and you need to retrieve 1,000 key values (which you have in a table)? The entries are 400 bytes long and are packed 10 records per page in the master dataset.

Answer: Keyed access takes about 1,000 disc reads, since master entries are stored randomly by hashing the key value. Serial access takes about 100,000 disc reads, since they are packed 10 per page. Serial access with Suprtool takes about 6,500 disc reads, since Suprtool reads 12 pages per access. So even with Suprtool, the keyed access is much faster because we are reading a small portion of the dataset.
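
The same back-of-the-envelope arithmetic, with the figures quoted above (the Suprtool count is taken as given):

  entries = 1_000_000
  keys_needed = 1_000
  entries_per_page = 10

  keyed_reads = keys_needed                  # hashing scatters entries: ~1 read per key
  serial_reads = entries / entries_per_page  # every page, once
  suprtool_reads = 6_500                     # the article's figure for multi-page reads
  fraction_wanted = keys_needed / entries

  print(f"keyed:    ~{keyed_reads:,} reads")
  print(f"serial:   ~{serial_reads:,.0f} reads")
  print(f"Suprtool: ~{suprtool_reads:,} reads")
  print(f"only {fraction_wanted:.1%} of the dataset is wanted, so keyed access wins")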

Question: You have deleted 90,000 old entries from a dataset with a capacity of 110,000. Why does the time to serially read the dataset remain the same?

Answer: Because TurboIMAGE must scan the empty space in the dataset to find the remaining entries. There are no pointers around empty chunks in a dataset.

In a master dataset the entries are distributed randomly over the entire capacity and there is no way to find the next occupied entry after N except by looking at N+1, N+2, and so on, until you find it. The time to read a master dataset is proportional to the capacity, not to the number of entries.

In a detail dataset, the entries are stored chronologically adjacent to each other. As you add entries to a detail dataset, it moves up an EOF marker called the highwater mark. Serial scans must read all the space up to the highwater mark. When you delete an entry, it is put on a delete-chain for reuse on the next add operation, but the highwater mark is not necessarily reduced. The time to read a detail dataset is proportional to the highwater mark, not to the number of entries.
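
A tiny model makes the point: the scan cost is driven by the capacity (master) or the highwater mark (detail), not by how many entries remain. The figures below use the 110,000-entry example from the question above:

  def master_scan_pages(capacity, entries_per_page):
      # Master entries are hashed across the whole capacity, so a serial
      # scan has to examine every page up to the capacity.
      return capacity / entries_per_page

  def detail_scan_pages(highwater_mark, entries_per_page):
      # Detail entries live below the highwater mark; deleted entries go on
      # the delete-chain, but the mark itself does not come down.
      return highwater_mark / entries_per_page

  # After deleting 90,000 of 110,000 entries, both scans still cost the same:
  print(f"{master_scan_pages(capacity=110_000, entries_per_page=10):,.0f} pages")
  print(f"{detail_scan_pages(highwater_mark=110_000, entries_per_page=10):,.0f} pages")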

Moral: Use a dataset packing tool such as Detpack from Adager after you clean out a detail dataset. This will reduce the highwater mark by repacking the entries. For a master dataset, reduce the capacity if you won’t need the empty space in the near future.

Update to last month’s column on Zero Secondaries

Rene Woc of Adager writes:

Bob’s article on zero secondaries in the June NewsWire explains very well the pitfalls and solutions when using “integer keys” in IMAGE master datasets. However, I believe Bob’s article missed a fifth solution, one that guarantees no secondaries with integer keys, if you know how the key values are assigned. This is the most common case.

The solution is described in Fred White’s article “Integer Keys: The Final Chapter,” available from Adager’s web site: www.adager.com/TechnicalPapersHTML/IntegerKeysFinalChapter.html. It’s the solution Adager customers with “integer keys” use.

Any HP 3000 user can always call Adager to obtain the proper capacity values for their particular case.

Update on CI programming

Jeff Vance, the HP engineer in charge of the HP 3000’s Command Interpreter (CI), writes about a prior entry from Robelle Tech:

Though I found many minor nits with Ken Robertson’s article titled “Quick and Dirty Solutions with CI Programming” in the March 2003 edition of the NewsWire, I’d like to address just a couple of points below:

1. In the section titled “Getting Data into and out of Files” the PRINT command is highlighted as a way of reading records from a file (especially reading past the first record). Although this is correct, I’d like to point out that using a message file can be 4+ times faster. And for larger files, a method using alternate entry points yields a performance boost of 31 times or better! These I/O techniques are described in my CI Programming paper on Jazz at: jazz.external.hp.com/papers/SolSymposium_03/CIProgramming_files/frame.htm

2. Under the “Simulating Arrays” section, there is a CAVEAT USER paragraph which, if I understand the author, is completely incorrect. Interactive and scripted CI usage are almost always identical. The differences that are visible are cases where the user is prompted to confirm an action when interactive, e.g. the PURGEGROUP command, which assumes a “YES” in a batch-like environment. So where the author writes, “This only works within command files. If you try to do this interactively the expression evaluation on the Setvar command is performed for the part of the command line after the variable name,” this is incorrect. Variable arrays can be created interactively, in a batch job, or in a script.

3. As of the MPE/iX 5.5 base release, CI string variables can be up to 1024 characters long, which is greater than the 256 byte limit cited in the article.

 


Copyright The 3000 NewsWire. All rights reserved.