OTMSORT.TXT - Sorting algorithms for 3d graphics released 2-20-95 by Voltaire/OTM [Zach Mortensen] email - mortens1@nersc.gov INTRODUCTION During the past month, I have received many inquiries concerning sorting algorithms used in 3d engines, which are the fastest, etc. I figured I could save myself some IRC and email writing time by compiling a text on the matter that would explain everything in a concise manner. In this text, I will describe various sorting algorithms, and the pros and cons of each. This primarily covers variations of the radix sort, which I have found to be the fastest algorithm. A fast sort is critical to a fast 3d engine, perhaps even more critical than a fast poly fill. In the first version of my 3d engine, I implemented a linear sort (quite a misnomer actually). When I began optimizing, I found that the sort was a huge bottleneck. After I switched to a z-buffered drawing scheme which eliminated the sorting algorithm, my code ran about 7 times faster than it had before. I quickly discovered that z-buffering was not the fastest method either. It requires an enormous amount of memory accesses per polygon, which can be very very slow on a machine without a fast bus and writeback cache. I needed to find an algorithm that would allow me to sort my polygons with the fewest number of memory accesses. The radix sort was the answer I had been looking for. The radix sort is adventageous over all other sorting algorithms because its sorting time as a function of the number of data to be sorted is linear. The sorting time as a function of number of data in most commonly used sorts is exponential, which causes a tremendous slowdown when a large number of polygons are being sorted. THE BUBBLE SORT Here is an example of a bubble sort for (count = 0; count < numData; count++) for (index = 0; index = numData; index++) if (data[count] > data[index]) swap(data[count], data[index]); This is by far the simplest sorting algorithm known to man. It is also the most inefficient sorting algorithm that could possibly be used, literally. In the bubble sort, each element of the set is compared with all other elements, yeilding a total of numData * numData iterations through the sorting loops. As you can see, this is a quadratic function. By doubling the number of data to be sorted with a bubble sort, the sorting time increases FOUR TIMES. The bubble sort is a definite no no if you are remotely interested in speed. THE "LINEAR" SORT The linear sort was taught to me in my High School pascal class. It was a much faster sort than the bubble sort in our examples which required us to sort a set of not more than 20 numbers, so it was the first sort I implemented in my 3d code. However, a closer look shows that it is nothing more than a slight variation of the bubble sort algorithm, with a slight increase in the performance. for (count = 0; count < numData; count++) { x = count; for (index = count + 1; index < numData; index++) if (data[index] > data[x]) x = index; if (x > count) swap(data[x], data[count]); } This algorithm yeilds numData iterations through the outer loop, with an average of (numData / 2) iterations through the inner loop per outer loop iteration. Therefore, the total number of iterations through the inner loop is (numData * numData) / 2. This total is half the total of the bubble sort, but it is still a quadratic relationship. By doubling the number of data, the sort time is doubled. This seems to be a linear increase (hence the name linear sort), but it is not. If the size of the data is increased by a factor of 4, the sort time is increased by a factor of 8 (4 * 4 / 2). An increase by a factor of 8 in the size of the data will result in the sort time increasing by a factor of 32 (8 * 8 / 2). So, this sort is nearly as bad as the bubble sort when sorting VERY large data sets. THE RADIX SORT If you have never heard of the radix sort, you are not alone. If you learned about the radix sort in a programming class and thought it was totally useless, you are like I was. The way the radix sort is usually taught in classes is efficeint for everything but computers. This is because the textbooks usually use base 10 (decimal), which is extremely foreign and difficult to implement in a computer. The idea behind a radix sorting algorithm is to sort the data by each radix, starting with the least significant and moving to the most significant. This is best illustrated by an example. First off, it helps to define radix. A radix is a 'place' in a number. The base 10 radices have common names (1s place, 10s place, 100s place, etc), but the concept can be extended to numbers of any base. Consider the base 10 number 32768. The value of the first (least significant) radix is 8, the second is 6, the third is 7, the fourth is 2, and the fifth is 3. Now consider the base 16 (hexadecimal) number 3f8. The value of the first radix is 8, the second is f, the third is 3. Now the following example will make a bit more sense. base 10 radix sort example data |first |second set |pass |pass ---------------+---------------+--------------- | | 12 |09 |83 65 |38 |73 44 |37 |65 37 |65 |44 03 |44 |38 38 |03 |37 83 |83 |12 09 |73 |09 73 |12 |03 The first pass through a radix sorting algorithm deals ONLY with the least significant radix (in base 10, the least significant radix is the one's place). The data is sorted from greatest to least (or least to greatest depending on the application) based on the least significant radix. After the first pass, the data with the least significant radix of greatest value is at the top of the list. The second pass is similar to the first, with the exception that the resultant data from the first pass is sorted (NOT the original data), and the data is sorted based on the second radix. It is extremely important to preserve the order of the previous pass, so make sure to traverse the list the same way the original data was traversed in the first pass (in this case, top to bottom). Any additional passes to sort additional radices are performed in the same manner, by sorting the data from the previous pass with respect to the radix in question. The radix sort has an advantage over the other sorts presented here, because its sort time as a function of number of data is linear. The number of iterations needed in a radix sort is given by (numData * numRadices) where numRadices is constant for the entire data set. THE BIT SORT (BASE 2 RADIX SORT) Now that we have an algorithm that gives a linear increase in the number of iterations needed to sort larger sets of data, we need to find a way to make each iteration as fast as possible. This can be accomplished by changing the base in which the data to be sorted is interpreted. In base 10, we have to go through each element of the set looking for a 9 in a given radix, then look for an 8, etc. This is quite slow, and fairly difficult to implement on a computer. A better approach than using base 10 is to use base 2, where there are a total of 2 possible values for each radix (0 or 1). It is very easy to test whether or not a binary number contains a 1 in a given place, this can be accomplished by an AND or TEST instruction. Since there are only two possible values for a radix of base 2, it is also extremely easy to sort from greatest to least based on a given radix. Simply put all the '1' radices at the top and all the '0's at the bottom. Here is some example code for the bit sort applied to 16 bit data: // this requires two temporary arrays of [numData] elements, // one for 1s and one for 0s short data[]; // 16 bit data in this example short oneArray[numData]; short zeroArray[numData]; int numOnes; int numZeros; int mask = 1; for (count = 0; count < 16; count++) { numOnes = 0; numZeros = 0; for (index = 0; index < numData; index++) { if (data[index] & mask) { oneArray[numOnes] = data[index]; numOnes++; } else { zeroArray[numZeros] = data[index]; numZeros++; } } // put the 1s back in the original data array first memcpy(data, oneArray, 2 * numOnes); // now put the 0s back memcpy(data + (2 * numOnes), zeroArray, 2 * numZeros); // mask out the next most significant bit next time through mask <<= 1; } This code is considerably faster than either the bubble sort or the linear sort. The inner loop is executed 16 * numData times, and consists of three indirect references, one TEST, one MOV, one INC, and one JZ/JMP (depending on the branch taken) plus loop overhead. The outer loop is actually more costly than the inner loop because of the two block memory transfers. However, the outer loop is only executed 16 times in this example. Even so, this is a lot of iterations through the loop. If we could somehow find a way to cut down on the 16 iterations required per data element while keeping the inner loop very simple, we could get a big increase in performance. THE BYTE SORT (BASE 256 RADIX SORT) The obvious solution to this problem is to increase the base of each radix examined. The next logical base to use is base 256, which can be represented in a single byte. If we were to implement the byte sort the same way we implemented the base 10 radix sort in the original radix sort example, we would look for a value of 255 in the least significant byte of each data element, then look for a 254 in each element, etc. This would yeild (numData * 256 * 2 iterations) of the inner loop if we used 16 bit data. This would be nowhere near as fast as a bit sort. However, we can apply a bit of ingenuity to the byte sort algorithm by taking a hint from the implementation of the bit sort. In the bit sort, we had a separate array for each possible value of a given radix. In a base 2 sort, there were two possible values for each radix, therefore we had two arrays. If we extend this concept to a base 256 sort, we would need 256 arrays, one for each possible value of a radix of base 256. Now, following the bitsort algorithm, we would go through the least significant byte of the data, placing the data in the appropriate array which corresponds to the value of the least significant radix. We would of course repeat the procedure for the next most significant radix and so on until all radices have been considered. In the case of 16 bit data, this method would yeild numData * 2 iterations through the inner loop, which is an 8 fold speed increase over the bit sort. However, there is a rather large problem in the implementation of the byte sort: memory. In the implementation of the bit sort, it is fairly easy to organize code around two arrays. However, a byte sort necessitates that the radix being examined be used as an index to an appropriate array (for example, a radix of value 4 would need to be placed in the '4s' array, a radix of value 33 would need to be placed in the '33s' array, etc). Therefore, a two dimensional data structure needs to be used, with one dimension corresponding to the possible values of radices, and the other index corresponding to the actual data containing radices of a certain value. <-- RADIX --> ^ |00|01|02|03|04|05|06|07|08|09|0A|0B|0C|0D|0E|..|FC|FD|FE|FF | --|----------------------------------------------------------- 00| | | | | | | | | | | | | | | |..| | | | D --|----------------------------------------------------------- A 01| | | | | | | | | | | | | | | |..| | | | T --|----------------------------------------------------------- A ..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|.. --|----------------------------------------------------------- | nn| | | | | | | | | | | | | | | |..| | | | v -------------------------------------------------------------- Our dilemma lies in the fact that there is no way whatsoever to determine the number of data that will contain a given radix at a given time. Therefore, there is no way of knowing how large to make the dimension (nn) when initializing the data structure. We could always make 256 arrays of [numData] elements each, but this would be extremely inefficient in that we would be allocating 256 times the memory we really need. The solution to this problem lies in dynamic data structures. By setting up an array of 256 linked lists, we can assure that we will not be allocating much more memory than we need. The total overhead for a singly linked list is 4 bytes (the size of the starting list pointer), plus 4 bytes per data element. Therefore, the total amount of storage needed for an array of 256 linked lists containing a total of numData elements is 4 * (256 + numData), only 1024 bytes more than we would need to store the data alone. Now comes the task of selecting a data structure to fit our needs. The two types of singly linked lists that could be used are stacks and queues. I have no desire whatsoever to teach a course in data structures in this file; suffice to say that a stack is a last in first out (LIFO) structure, and a queue is a first in first out structure (FIFO). In other words, the first value placed on a stack will be the last one removed, and the first value placed in a queue will be the first one removed. I chose stacks because the operations to place (push) and remove (pop) values on and off the stack are faster than those for queues, and it is very easy to check to see if a stack is empty. The LIFO nature of stacks complicates things a bit between loops, but within a loop a stack is by far the speediest option. Using stacks, the byte sort looks something like this typedef struct { (polygon data goes here) ... // pointer to next polygon in the stack polygon *next; } polygon; polygon *stack1[256], *stack2[256]; // our arrays of stacks // this is the inner sort loop for (count = 0; count < numData; count++) { index = poly[count]->z & 0xFF; // consider only the // least significant radix push(stack1[index], poly[count]); // put the poly in its // proper place } That excruciatingly simple loop will sort the least significant byte. Now, sorting the next most significant byte is a bit more complicated. You must remember that the radix sort algorithm states we must sort the previously sorted data from greatest to least if we want our resultant data to be sorted from greatest to least. So we must start with the highest value for the first radix and count backwards. That means we need to start with the data on stack 255 and count down. for (count = 255; count >= 0; count--) { while (!empty(stack1[count]) { temp = pop(stack1[count]); index = (temp->z & 0xFF00) >> 8; // next radix push(stack2[index], temp); } } After this loop, the data will be in stack2[] with the smallest value at the top of stack2[0], and the largest value at the bottom of stack2[255]. From here, you can have your data sorted from least to greatest or greatest to least depending on how you put it back in the original poly[] array. WELL...WHAT NOW? If you are experienced with data structures, start coding. If you are a bit dilenquent in your knowledge, start reading. I recommend that you code the entire sort in assembler. If you are smart about the way you set things up, your code will consist solely of MOV, AND, and SHL instructions with a JMP thrown in every once in awhile for good measure, and will occupy about 25 lines. I have to confess that the assembler version of my sort is not the most highly optimized, simply because it is inherently so fast. As I said before, my sort takes a whopping 5% of runtime. Since the sort time of a radix sort is linear with respect to data size, this figure is the same whether you are sorting 10 polygons or 10,000. Before I spend long hours looking for places to save a cycle or two, I am going over the slower parts of my code and optimizing them. However, I am positive I would have to spend a great deal of time optimizing my sort if I had used a slower algorithm. FINAL WORDS I am not interested in doing your coding for you. I am happy to answer any questions you may have, but I will not write your code for you, and I refuse to teach a course in data structures. If you don't understand basic stack operations, please don't ask me to explain them to you. Any mail pertaining to this topic will be at best ignored, at worst returned accompanied by a polemic. GREETS & THANKS Siri - for being YOU! Hurricane/OTM - for continuing to give us direction Zilym Limms/OTM - for BWSB, and asm help Phred/OTM - 3d coalescence Rest of OTM - for surviving in the otherwise dead 602 Rich Beerman - we were going to...make a game? :> Jailcoder - first mentioned 'byte sort' to me Darkshade - taught me the bit sort, many other things Daredevil and Tran - pmode/w Otto Chrons - for telling me what I am doing wrong Barry Egerter - showed me how cool 640x480x256 looks Jmagic/Complex - cool intro for TP94 Phantom/Sonic PC - show us a 3d demo already, will you? :> StarScream - I want to see your vector code too :) Kodiak_ -\ ae- - Error - For continued faith in the potential Omni - of the North American scene ior -/ Anyone else I forgot - lame, cliche, copout greet phrase