OTMSORT.TXT - Sorting algorithms for 3d graphics

        released 2-20-95
        by Voltaire/OTM [Zach Mortensen]

        email -
        mortens1@nersc.gov


INTRODUCTION

                During the past month, I have received many inquiries
        concerning sorting algorithms used in 3d engines, which are the
        fastest, etc.  I figured I could save myself some IRC and email
        writing time by compiling a text on the matter that would
        explain everything in a concise manner.  In this text, I will
        describe various sorting algorithms, and the pros and cons of
        each.  This primarily covers variations of the radix sort,
        which I have found to be the fastest algorithm.

                A fast sort is critical to a fast 3d engine, perhaps
        even more critical than a fast poly fill.  In the first version
        of my 3d engine, I implemented a linear sort (quite a misnomer
        actually).  When I began optimizing, I found that the sort was
        a huge bottleneck.  After I switched to a z-buffered drawing
        scheme which eliminated the sorting algorithm, my code ran
        about 7 times faster than it had before.

                I quickly discovered that z-buffering was not the
        fastest method either.  It requires an enormous amount of
        memory accesses per polygon, which can be very very slow on a
        machine without a fast bus and writeback cache.  I needed to
        find an algorithm that would allow me to sort my polygons with
        the fewest number of memory accesses.

                The radix sort was the answer I had been looking for.
        The radix sort is adventageous over all other sorting
        algorithms because its sorting time as a function of the number
        of data to be sorted is linear.  The sorting time as a function
        of number of data in most commonly used sorts is exponential,
        which causes a tremendous slowdown when a large number of
        polygons are being sorted.


THE BUBBLE SORT

        Here is an example of a bubble sort


                for (count = 0; count < numData; count++)
                    for (index = 0; index = numData; index++)
                        if (data[count] > data[index])
                            swap(data[count], data[index]);


                This is by far the simplest sorting algorithm known to
        man.  It is also the most inefficient sorting algorithm that
        could possibly be used, literally.  In the bubble sort, each
        element of the set is compared with all other elements,
        yeilding a total of numData * numData iterations through the
        sorting loops.  As you can see, this is a quadratic function.
        By doubling the number of data to be sorted with a bubble sort,
        the sorting time increases FOUR TIMES.  The bubble sort is a
        definite no no if you are remotely interested in speed.


THE "LINEAR" SORT

                The linear sort was taught to me in my High School
        pascal class.  It was a much faster sort than the bubble sort
        in our examples which required us to sort a set of not more
        than 20 numbers, so it was the first sort I implemented in my
        3d code.  However, a closer look shows that it is nothing more
        than a slight variation of the bubble sort algorithm, with a
        slight increase in the performance.

                for (count = 0; count < numData; count++)
                {
                    x = count;
                    for (index = count + 1; index < numData; index++)
                        if (data[index] > data[x])
                            x = index;

                    if (x > count)
                        swap(data[x], data[count]);
                }


                This algorithm yeilds numData iterations through the
        outer loop, with an average of (numData / 2) iterations through
        the inner loop per outer loop iteration.  Therefore, the total
        number of iterations through the inner loop is (numData *
        numData) / 2.  This total is half the total of the bubble sort,
        but it is still a quadratic relationship.  By doubling the
        number of data, the sort time is doubled.  This seems to be a
        linear increase (hence the name linear sort), but it is not.
        If the size of the data is increased by a factor of 4, the sort
        time is increased by a factor of 8 (4 * 4 / 2).  An increase by
        a factor of 8 in the size of the data will result in the sort
        time increasing by a factor of 32 (8 * 8 / 2). So, this sort is
        nearly as bad as the bubble sort when sorting VERY large data
        sets.


THE RADIX SORT

                If you have never heard of the radix sort, you are not
        alone.  If you learned about the radix sort in a programming
        class and thought it was totally useless, you are like I was.
        The way the radix sort is usually taught in classes is
        efficeint for everything but computers.  This is because the
        textbooks usually use base 10 (decimal), which is extremely
        foreign and difficult to implement in a computer.  The idea
        behind a radix sorting algorithm is to sort the data by each
        radix, starting with the least significant and moving to the
        most significant.  This is best illustrated by an example.

                First off, it helps to define radix.  A radix is a
        'place' in a number.  The base 10 radices have common names (1s
        place, 10s place, 100s place, etc), but the concept can be
        extended to numbers of any base.  Consider the base 10 number
        32768.  The value of the first (least significant) radix is 8,
        the second is 6, the third is 7, the fourth is 2, and the fifth
        is 3.  Now consider the base 16 (hexadecimal) number 3f8.  The
        value of the first radix is 8, the second is f, the third is 3.
        Now the following example will make a bit more sense.


        base 10 radix sort example


        data           |first          |second
        set            |pass           |pass
        ---------------+---------------+---------------
                       |               |
        12             |09             |83
        65             |38             |73
        44             |37             |65
        37             |65             |44
        03             |44             |38
        38             |03             |37
        83             |83             |12
        09             |73             |09
        73             |12             |03


                The first pass through a radix sorting algorithm deals
        ONLY with the least significant radix (in base 10, the least
        significant radix is the one's place).  The data is sorted from
        greatest to least (or least to greatest depending on the
        application) based on the least significant radix.  After the
        first pass, the data with the least significant radix of
        greatest value is at the top of the list.

                The second pass is similar to the first, with the
        exception that the resultant data from the first pass is sorted
        (NOT the original data), and the data is sorted based on the
        second radix.  It is extremely important to preserve the order
        of the previous pass, so make sure to traverse the list the
        same way the original data was traversed in the first pass (in
        this case, top to bottom).

                Any additional passes to sort additional radices are
        performed in the same manner, by sorting the data from the
        previous pass with respect to the radix in question.

                The radix sort has an advantage over the other sorts
        presented here, because its sort time as a function of number
        of data is linear.  The number of iterations needed in a radix
        sort is given by (numData * numRadices) where numRadices is
        constant for the entire data set.


THE BIT SORT (BASE 2 RADIX SORT)

                Now that we have an algorithm that gives a linear
        increase in the number of iterations needed to sort larger sets
        of data, we need to find a way to make each iteration as fast
        as possible.  This can be accomplished by changing the base in
        which the data to be sorted is interpreted.  In base 10, we
        have to go through each element of the set looking for a 9 in a
        given radix, then look for an 8, etc.  This is quite slow, and
        fairly difficult to implement on a computer.  A better approach
        than using base 10 is to use base 2, where there are a total of
        2 possible values for each radix (0 or 1).  It is very easy to
        test whether or not a binary number contains a 1 in a given
        place, this can be accomplished by an AND or TEST instruction.
        Since there are only two possible values for a radix of base 2,
        it is also extremely easy to sort from greatest to least based
        on a given radix.  Simply put all the '1' radices at the top
        and all the '0's at the bottom.  Here is some example code for
        the bit sort applied to 16 bit data:

        // this requires two temporary arrays of [numData] elements,
        // one for 1s and one for 0s

        short data[];                   // 16 bit data in this example
        short oneArray[numData];
        short zeroArray[numData];
        int numOnes;
        int numZeros;

        int mask = 1;

        for (count = 0; count < 16; count++)
        {
            numOnes = 0;
            numZeros = 0;

            for (index = 0; index < numData; index++)
            {
                if (data[index] & mask)
                {
                    oneArray[numOnes] = data[index];
                    numOnes++;
                }
                else
                {
                    zeroArray[numZeros] = data[index];
                    numZeros++;
                }
            }

            // put the 1s back in the original data array first

            memcpy(data, oneArray, 2 * numOnes);

            // now put the 0s back

            memcpy(data + (2 * numOnes), zeroArray, 2 * numZeros);

            // mask out the next most significant bit next time through

            mask <<= 1;
        }

                This code is considerably faster than either the bubble
        sort or the linear sort.  The inner loop is executed 16 *
        numData times, and consists of three indirect references, one
        TEST, one MOV, one INC, and one JZ/JMP (depending on the branch
        taken) plus loop overhead.  The outer loop is actually more
        costly than the inner loop because of the two block memory
        transfers.  However, the outer loop is only executed 16 times
        in this example.  Even so, this is a lot of iterations through
        the loop.  If we could somehow find a way to cut down on the 16
        iterations required per data element while keeping the inner
        loop very simple, we could get a big increase in performance.


THE BYTE SORT (BASE 256 RADIX SORT)


                The obvious solution to this problem is to increase the
        base of each radix examined.  The next logical base to use is
        base 256, which can be represented in a single byte.  If we
        were to implement the byte sort the same way we implemented the
        base 10 radix sort in the original radix sort example, we would
        look for a value of 255 in the least significant byte of each
        data element, then look for a 254 in each element, etc.  This
        would yeild (numData * 256 * 2 iterations) of the inner loop if
        we used 16 bit data.  This would be nowhere near as fast as a
        bit sort.

                However, we can apply a bit of ingenuity to the byte
        sort algorithm by taking a hint from the implementation of the
        bit sort.  In the bit sort, we had a separate array for each
        possible value of a given radix.  In a base 2 sort, there were
        two possible values for each radix, therefore we had two
        arrays.  If we extend this concept to a base 256 sort, we would
        need 256 arrays, one for each possible value of a radix of base
        256.  Now, following the bitsort algorithm, we would go through
        the least significant byte of the data, placing the data in the
        appropriate array which corresponds to the value of the least
        significant radix.  We would of course repeat the procedure for
        the next most significant radix and so on until all radices
        have been considered.  In the case of 16 bit data, this method
        would yeild numData * 2 iterations through the inner loop,
        which is an 8 fold speed increase over the bit sort.

                However, there is a rather large problem in the
        implementation of the byte sort: memory.  In the implementation 
        of the bit sort, it is fairly easy to organize code around two 
        arrays.  However, a byte sort necessitates that the radix being 
        examined be used as an index to an appropriate array (for 
        example, a radix of value 4 would need to be placed in the '4s' 
        array, a radix of value 33 would need to be placed in the '33s' 
        array, etc).  Therefore, a two dimensional data structure needs 
        to be used, with one dimension corresponding to the possible 
        values of radices, and the other index corresponding to the 
        actual data containing radices of a certain value.

                          <-- RADIX  -->

        ^   |00|01|02|03|04|05|06|07|08|09|0A|0B|0C|0D|0E|..|FC|FD|FE|FF
        | --|-----------------------------------------------------------  
          00|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |..|  |  |  |
        D --|-----------------------------------------------------------
        A 01|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |..|  |  |  |
        T --|-----------------------------------------------------------
        A ..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..|..
          --|-----------------------------------------------------------
        | nn|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |..|  |  |  |
        v --------------------------------------------------------------

                Our dilemma lies in the fact that there is no way 
        whatsoever to determine the number of data that will contain a 
        given radix at a given time.  Therefore, there is no way of 
        knowing how large to make the dimension (nn) when initializing 
        the data structure.  We could always make 256 arrays of 
        [numData] elements each, but this would be extremely 
        inefficient in that we would be allocating 256 times the memory 
        we really need.

                The solution to this problem lies in dynamic data 
        structures.  By setting up an array of 256 linked lists, we can 
        assure that we will not be allocating much more memory than we 
        need.  The total overhead for a singly linked list is 4 bytes 
        (the size of the starting list pointer), plus 4 bytes per data 
        element.  Therefore, the total amount of storage needed for an 
        array of 256 linked lists containing a total of numData 
        elements is 4 * (256 + numData), only 1024 bytes more than we 
        would need to store the data alone.

                Now comes the task of selecting a data structure to fit 
        our needs.  The two types of singly linked lists that could be 
        used are stacks and queues.  I have no desire whatsoever to 
        teach a course in data structures in this file; suffice to say 
        that a stack is a last in first out (LIFO) structure, and a 
        queue is a first in first out structure (FIFO).  In other 
        words, the first value placed on a stack will be the last one 
        removed, and the first value placed in a queue will be the 
        first one removed.  I chose stacks because the operations to 
        place (push) and remove (pop) values on and off the stack are 
        faster than those for queues, and it is very easy to check to 
        see if a stack is empty.  The LIFO nature of stacks complicates 
        things a bit between loops, but within a loop a stack is by far 
        the speediest option.


                Using stacks, the byte sort looks something like this

        typedef struct 
        {
            (polygon data goes here)

            ...

            // pointer to next polygon in the stack

            polygon *next;

        } polygon;


        polygon *stack1[256], *stack2[256];     // our arrays of stacks


        // this is the inner sort loop

        for (count = 0; count < numData; count++)
        {
            index = poly[count]->z & 0xFF;      // consider only the 
                                                // least significant radix

            push(stack1[index], poly[count]);   // put the poly in its 
                                                // proper place
        }

                That excruciatingly simple loop will sort the least 
        significant byte.  Now, sorting the next most significant byte 
        is a bit more complicated.  You must remember that the radix 
        sort algorithm states we must sort the previously sorted data 
        from greatest to least if we want our resultant data to be 
        sorted from greatest to least.  So we must start with the 
        highest value for the first radix and count backwards.  That 
        means we need to start with the data on stack 255 and count 
        down.


        for (count = 255; count >= 0; count--)
        {
            while (!empty(stack1[count])
            {
                temp = pop(stack1[count]);
                index = (temp->z & 0xFF00) >> 8; // next radix
                push(stack2[index], temp);
            }
        }

                After this loop, the data will be in stack2[] with the 
        smallest value at the top of stack2[0], and the largest value 
        at the bottom of stack2[255].  From here, you can have your 
        data sorted from least to greatest or greatest to least 
        depending on how you put it back in the original poly[] array.


WELL...WHAT NOW?

                If you are experienced with data structures, start 
        coding.  If you are a bit dilenquent in your knowledge, start 
        reading.  I recommend that you code the entire sort in 
        assembler.  If you are smart about the way you set things up, 
        your code will consist solely of MOV, AND, and SHL 
        instructions with a JMP thrown in every once in awhile for 
        good measure, and will occupy about 25 lines.  I have to 
        confess that the assembler version of my sort is not the most 
        highly optimized, simply because it is inherently so fast.  As 
        I said before, my sort takes a whopping 5% of runtime.  Since 
        the sort time of a radix sort is linear with respect to data 
        size, this figure is the same whether you are sorting 10 
        polygons or 10,000.  Before I spend long hours looking for 
        places to save a cycle or two, I am going over the slower parts 
        of my code and optimizing them.  However, I am positive I would 
        have to spend a great deal of time optimizing my sort if I had 
        used a slower algorithm.


FINAL WORDS

                I am not interested in doing your coding for you.  I am 
        happy to answer any questions you may have, but I will not 
        write your code for you, and I refuse to teach a course in data 
        structures.  If you don't understand basic stack operations, 
        please don't ask me to explain them to you.  Any mail 
        pertaining to this topic will be at best ignored, at worst 
        returned accompanied by a polemic.


GREETS & THANKS

        Siri                    - for being YOU!
        Hurricane/OTM           - for continuing to give us direction
        Zilym Limms/OTM         - for BWSB, and asm help
        Phred/OTM               - 3d coalescence
        Rest of OTM             - for surviving in the otherwise dead 602 
        Rich Beerman            - we were going to...make a game? :>
        Jailcoder               - first mentioned 'byte sort' to me 
        Darkshade               - taught me the bit sort, many other things 
        Daredevil and Tran      - pmode/w 
        Otto Chrons             - for telling me what I am doing wrong 
        Barry Egerter           - showed me how cool 640x480x256 looks 
        Jmagic/Complex          - cool intro for TP94 
        Phantom/Sonic PC        - show us a 3d demo already, will you? :>  
        StarScream              - I want to see your vector code too :)
        Kodiak_                 -\
        ae-                     -  
        Error                   -  For continued faith in the potential 
        Omni                    -  of the North American scene 
        ior                     -/

        Anyone else I forgot    - lame, cliche, copout greet phrase