************************************* * * * DB/C Newsletter * * November 1994 * * * ************************************* Editor's Notes The first article in this month's Newsletter was first published in The Articulator, a DATABUS newsletter that didn't last very long. The article details how AIM works. As we continue to get questions about AIM, republishing this article may help answer some questions. The second article describes the DB/C Online Support's conversion to the Internet. As we have discussed previously, the Internet is going to have a large impact on application development software and systems. We should all jump on the "infobahn" quickly, if for no other reason than because everybody else is jumping on it. Hopefully, it will be useful and profitable also. Again, there is nothing to report on finalization of the PL/B Standard. don.wills@swc.com THE ASSOCIATIVE INDEX ACCESS METHOD ONE IMPLEMENTATION The most mysterious of all DATABUS language features is the file access method known as AIM. The Associative Index Method (AIM) was one of several advanced features that Datapoint put into the DATABUS language during its growth years in the 1970's. At the time of first release, AIM was unique in the industry. Today, at least three other DATABUS compiler vendors have implementations available that provide the same features as Datapoint's AIM. Similar access methods are beginning to become available for other languages, but none is as widely used as AIM. What is AIM and how is it used? AIM is a floating key file access facility that can be used to randomly access records in a data file. It is useful for a wide variety of applications, many of which are interactive in nature. The easiest way to describe AIM is by using an example. Suppose you have inventory records that contain a part number field, a part description field and other fields. In the following discussion, these inventory records will be used for our example: -REC #- -PART NUM- ------------DESCRIPTION---------------- 0 78186 Machine Screw 10-24 x 2 1 74189 Locknut 10-24 2 STD541425 Locknut 1/4-20 3 STD601003 Self-Tapping Screw #10-24 x 3/8 4 51433 Nylok Nut 5/8-18 5 62335 Spring Washer 6 69229 Blade 7 69385 Washer - Blade 8 54867 Spring - Washer (Belleville Type) 9 84012 Upper Handle 10 STD523117 Handle Bolt 11 83553 Clamp A traditional indexed sequential access method (ISAM) with an index on the part number is sufficient to provide quick access if the part number is known. However if the part number is not known, the ISAM approach becomes much less useful. Suppose we wanted to find a certain type of washer, but did not know its part number or exact description. What is needed is the ability to search through the descriptions looking for the keyword "washer", without regard to case or position within the field. AIM can do this search without having to read all of the records in data file. An AIM search using "washer" as a floating key will find the following three parts from the example file: 62335 Spring Washer; 69385 Washer - Blade; and 54867 Spring Washer (Belleville Type). AIM searches can be done for multiple keywords on an AND basis. For example, to find all screws of size 10-24, a search can be done with two keys: "10-24" and "screw". The AIM lookup using these two keys will correctly find parts "78186 Machine Screw 10-24 x 2" and "STD601003 Self-Tapping Screw #10-24 x 3/8". Searches like this are part of making interactive software more user friendly. This interactive nature is typical of AIM applications. What is so amazing about AIM is that it can accomplish this search with very few disk reads. The number of disk reads is more directly related to the search key than to the size of the file. Even more amazing is the fact that if more information is given for a search (number or size of lookup keys), the retrieval will typically take fewer disk reads. How does AIM work? No, it's not mirrors. AIM is based on an idea attributed to Malcolm C. Harrison in an article published in 1971. The basic concept is known as superimposed coding. A good discussion of this can be found in Knuth's book The Art of Computer Programming volume 3 - Searching and Sorting on pages 559-563. Specific details given in the rest of this article are based on the AIM implementation used in DB/C. (DB/C is the DATABUS product created by Subject, Wills & Company.) However, from what is known about the Datapoint implementation of AIM, all of the concepts are the same. The key to understanding AIM is to recognize how it stores keys. AIM stores key information by first breaking a key field up into 0 or more subkeys. Each of these subkeys is a sequence of exactly three contiguous non-blank characters called a triplet. For example, the description field "Spring Washer" would break up into the following triplets: "spr", "pri", "rin", "ing", "was", "ash", "she" and "her". Each record may have one or more key fields that will be used for AIM lookup. Each one of these fields has its own set of 0 or more triplets. Each triplet is converted to an integer using common hashing techniques. This integer is known as the hash value or hash number. For DB/C, the default is for the hash number to be in a range between 0 and 159 inclusive. The maximum value may be changed when the file is AIMDEXed (using the -Z=nnn option). For Datapoint, the maximum value is unknown and is fixed. The actual hashing algorithm used is not important as long as it provides a uniform mapping of triplets to hash numbers. See the Knuth book for a discussion of hashing algorithms. Once each triplet is converted to a hash number, the logical union of all hash numbers for all key fields of a record is then created for each record. A table of record numbers with associated hash numbers is then created. The following table lists the record numbers from the example parts file along with the hash numbers for the description field of that record: -REC #- --------------------HASH NUMBERS---------------------- 0 8 13 14 18 50 55 67 69 133 137 148 1 13 18 53 78 107 131 148 2 13 53 78 107 114 131 144 148 3 13 18 20 33 46 48 50 55 70 71 105 108 112 120 133 148 4 49 79 88 108 120 139 141 148 5 9 14 37 40 50 71 83 146 6 4 33 37 7 4 33 37 40 50 83 8 9 12 14 16 37 40 50 71 73 76 83 101 108 140 146 150 9 5 36 76 78 82 101 112 10 5 36 76 78 84 140 11 13 16 97 This table of record number/hash numbers is then inverted to be a table of hash number/record numbers. This is how the key information is stored in the AIM file. This is what the inverted table for our example looks like: --HASH NUMBER-- --RECORD NUMBERS-- 4 6 7 5 9 10 8 0 9 5 8 12 8 13 0 1 2 3 11 14 0 5 8 16 8 11 18 0 1 3 20 3 33 3 6 7 36 9 10 37 5 6 7 8 40 5 7 8 46 3 48 3 49 4 50 0 3 5 7 8 53 1 2 55 0 3 67 0 69 0 70 3 71 3 5 8 73 8 76 8 9 10 78 1 2 9 10 79 4 82 9 83 5 7 8 84 10 88 4 97 11 101 8 9 105 3 107 1 2 108 3 4 8 112 3 9 114 2 120 3 4 131 1 2 133 0 3 137 0 139 4 140 8 10 141 4 144 2 146 5 8 148 0 1 2 3 4 150 8 The method used to search for a key is as follows: Break up the lookup key or keys into triplets using the same rules as before. Compute the hash number of each triplet using the same algorithm as before. Get the record numbers associated with each triplet by lookup in the AIM file. The logical intersection of these record numbers is a list of prospective records that may match the lookup key. Read in each prospective record and see if it actually does satisfy the lookup key criteria. Using the example lookup for "washer", the steps are as follows: The key breaks up into "was", "ash", "she" and "her". These four triplets hash to the hash numbers 83, 40, 37 and 50 respectively. Hash number 83 indicates record numbers 5, 7 and 8. Hash number 40 indicates record numbers 5, 7 and 8. Hash number 37 indicates record numbers 5, 6, 7 and 8. Hash number 50 indicates record numbers 0, 3, 5, 7 and 8. The logical intersection of these record numbers is 5, 7 and 8. Upon inspection of these three records, we find that all actually contain the lookup key "washer". Using just "she" from "washer" as a lookup key is interesting. The triplet "she" hashes to 37 which indicates records 5, 6, 7 and 8. On reading each of these four records and checking for the existence of the lookup key "she", we find that record 6 does not contain "she". It turns out that the triplet "ade" also hashes to 37. Thus the description "Blade" (which contains the triplet "ade") found in record 6 gives a false indication of the existence of the triplet "she". The concept of having two different triplets hash to the same hash number is called a collision. In large files, many collisions are common. The performance of AIM is based on the fact that when several triplets are used for a lookup, there will be very few false indications of key matches so that very few records will be read that do not match the lookup key. This explains why lookups using longer keys typically perform better than lookups using shorter keys. This also explains why some lookups take a long time (many false indications because of many collisions) and some lookups are almost instantaneous (no false indications). The actual format of the DB/C .aim file is basically as follows: The first part of the file is header information. Following the header is the inverted hash number/record number information. Using the DB/C default of 160 hash numbers, there are 160 "slots" of hash number/record number information. Slot 0 contains the record numbers that are associated with hash number 0, and etc. The size of each slot is based on the number of records in the data file. One bit will be reserved for each record in the slot. For example, if we assume 200 records, 25 bytes will be reserved for each slot (200 records divided by 8 bits per byte). In each slot, the first bit of the first byte will correspond to record 0, the second bit will correspond to record 1 and etc. If a bit is 1, it means that the corresponding record contains a triplet that hashes to this hash number. If a bit is 0, it means that the record contains no triplets that hash to this hash number. AIM is a sophisticated tool for creating user friendly application software. In the short space of this article, I have attempted to give you a quick overview of its inner workings. Hopefully, by understanding how AIM works, you will be better able to know when and how to fully exploit all of its capabilities. Changes to DB/C Online Support The existing DB/C BBS system will be shut down on December 31, 1994. Online support for DB/C will accomplished via the Internet. Two different methods of communications will be used - Internet email and FTP file transfer. There are many different ways to send and receive Internet email. Compuserve has provided Internet email since 1989. All other major online services also provide Internet email. The method of sending and receiving Internet email is specific to each online service, but Internet email addresses are all the same general form. The general form is "username@place", without the quote characters, but with the "at" sign. User names can contain alpha- numeric characters and periods. For Subject, Wills & Company, the place part of the Internet email name is "swc.com", again without the quote characters. To send email to Subject, Wills & Company concerning either DB/C technical or non-technical matters, send email to this address: dbc@swc.com You can also communicate directly with all Subject, Wills & Company employees by sending email directly to their personal address. In all cases, the email address is the first name, followed by a period, followed by their last name, followed by @swc.com. Here are the personal email addresses of the DB/C folks: jamie.findlay@swc.com dwight.luetscher@swc.com stuart.nelsen@swc.com judi.tamkevic@swc.com don.wills@swc.com There are a few rules for email. Do not send attachments in your email. Not all mail programs handle attachments the same way. Put a phone number in your email message where we can call you. We had one situation where we could not reply via email because there was a problem with the sender's email system that put the wrong return address in his mail header. He sent three messages without phone numbers asking us to respond and we were unable to do anything. Expect email to be handled in 24 hours (Monday-Friday). Use the phone if you have something that needs quicker response. File transfer, both uploading and downloading, is done using the FTP protocol. There are many different programs that use the FTP protocol to do file transfer. The Subject, Wills & Company FTP server is at a machine with this name: ftp.swc.com When you connect with ftp.swc.com, you specify a user name and password. The special user name "anonymous" allows you access to the public directories on the FTP server. It is customary to use your email address as the password for "anonymous", although no password is required. The public directories are read-only. They contain archives of the DB/C Newsletters, current DB/C marketing information, and various other items of general interest. Current DB/C customers are assigned a private FTP user name and password. This allows read-write access to a private directory that is used to move software, test programs, etc. This non-anonymous username also allows access to directories that contain the latest version of the "readme" file and other software (in source, object and executable format). Although not necessary for support, Subject, Wills & Company also provides a World Wide Web server with this name: www.swc.com The WWW server may be accessed with Mosiac, Netscape, Lynx or any other graphical WWW browser. Except for pictures, all information at www.swc.com is also available at ftp.swc.com. There are several choices for gaining access to Internet FTP services. Compuserve will start beta testing their Internet FTP access service this month. It will be generally available by the end of December. Delphi already provides Internet FTP access. Other major online services are also promising FTP access in the near future. For full Internet access, we recommend a PPP or SLIP access provider that is local to your area. Sometimes this is called "IP service". Full Internet access will give you access to email, FTP, WWW and many other useful Internet services. We can help you find an Internet access provider in your local calling area. Just give us a call. To find out more about the Internet, go to your local bookstore. They have dozens of books to help you get started. DB/C Class Schedule The next DB/C class is scheduled for January 1995. Actual dates are not yet finalized. The class is held in the Oak Brook, Illinois office of Subject, Wills and Company. For more information, contact Judi Tamkevic via email at dbc@swc.com or via telephone at (708) 572-0240.