Sorting String Arrays

QUESTION: I have a large string array (~100000 elements) that need sorting on two fields within each string. The array looks like this:

   array=['F;100', 'ABC;6', 'DE;2', 'DE;10', 'DE;1']

The order required is: a) sort items to left of ';', followed by b) sort items numerically to right of ';'. The correct sort would produce this:

    ABC;6    DE;1    DE;2    DE;10    F;100

A simple sort (Sort(array)) procudes this:

    ABC;6    DE;1    DE;10    DE;2    F;100

The only way I've found to do this is to convert the right-hand part to I4.4 format within a loop and search on the new values. The code looks something like this:

   ......

    for j=0, n_elements(array)-1 do begin
        parts=strsplit(array[j], ';', /extract)
        RH=string(fix(parts[1]), format='(I4.4)')
        new[j]=parts[0]+RH
    endfor
    order=sort(new)

This causes the new array to look like this:

   F;0100    ABC;0006    DE;0002    DE;0010    DE;0001

which is then sorted correctly.

This is pretty slow. Is there a clever way of sorting on two fields like this without using a loop?

ANSWER: Peter Albert couldn't think of a way to sort on two fields, but he did have a valuable suggestion for producing the right-hand side of the string for the proper sort.

I don't know how to sort over two fields at the same time, but I could propose a much faster way of creating your new string array, avoiding the for loop:

   spos = strpos(array, ";")

   left = strmid($
      array, $
      transpose(replicate(0, n_elements(array))), $
      transpose(spos)$
     )

   right = string($
      fix($
         strmid($
            array, $
            transpose(spos+1)),$
        ), format = '(i4.4)'$
      )

The basic idea is to use an array for the variables "first_character" and "length" in the call to StrMid. It has to be the transposed ones, as the first dimension of those arrays determines how many substrings are extracted from each element of the string array - one in your case.

On my PC this method is approximately 10 times faster then the for-loop method.

As usual, JD Smith offered a loop-free way of solving the problem, and a comment about IDL sorting symantics.

Here is an approach that is similar to Peter's approach, but without the loop:

   IDL> array=['F;100', 'ABC;6', 'DE;2', 'DE;10', 'DE;1']
   IDL> pos=transpose(strpos(array,';'))
   IDL> s=sort(strmid(array,0,pos)+string(FORMAT='(I5.5)',strmid(array,pos+1)))
   IDL> print,array[s]
      ABC;6 DE;1 DE;2 DE;10 F;100

IDL has no good way to alter the sorting semantics, to simultaneously sort on multiple fields. Most languages offer the ability to specify a sorting function, which compares two elements for GT, LT, or EQ, using any logic you like. Since IDL doesn't allow this, you're forced to re-cast your entire set as strings or integers.

It is reported that JD's no-loop method was about 10 times faster than the loop method.

Google
 
Web Coyote's Guide to IDL Programming