Chunk Indexing

QUESTION: I know there's got to be an easy way to do this, but my brain is cramped up right now. Given an array input, I'd like to output an array where the index of the of the input array is replicated by the corresponding value in the input array. The input array will always contain positive values.

For example...

   input:  [1, 2, 1, 4]
   output: [0, 1, 1, 2, 3, 3, 3, 3]

   input:  [3, 3, 3, 1]
   output: [0, 0, 0, 1, 1, 1, 2, 2, 2, 3]

I know that I could loop over my input array, replicate the loop variable by the value of the input array at each position and concatenate the resulting arrays, but this solution does not seem like “The IDL Way.”

ANSWER: JD Smith touches on this topic in the Histogram Tutorial, but he expands on the answer here.

This is the “chunk indexing problem,” which is covered in the Histogram Tutorial under “Using the i-Vector.” The i-vector is the leading portion of the reverse indices vector (the part which, narcissistically, indexes itself).

It looks like this:

   IDL> h=histogram(total(n,/CUMULATIVE)-1,/BINSIZE,MIN=0,REVERSE_INDICES=ri)
   IDL> i=ri[0:n_elements(h)-1]-ri[0]

Actually this very example points out a potential problem with the IDL way: it's not terribly transparent. I suppose if it were, it would not have a name with such cachet, and a special section devoted to it on Coyote's Guide to IDL Programming.

Maybe I should provide a little more explanation for what really is an underhanded trick. The trick requires that you understand how REVERSE_INDICES are setup. For a given bin in a histogram, the "i-vector" (or first half) part of REVERSE_INDICES (RI) will contain a series of index pairs into the "o-vector" (or second half) of RI. These two vectors could have been kept separate, saving countless person-hours of head scratching, but they were instead glued together into one ungainly beast. In any case, there will be a pair of indices in the i-vector for each bin in the histogram. This pair tells you the range of elements of REVERSE_INDICES which contain the original indices of the data which fell in that bin. If the bin is empty, the pair will be the same number (i.e. spanning no elements). This is the crux of our trick. We don't care about the data, or the histogram itself, of the indices in the data, just the i-vector.

Here's how it works. Let's imagine we have a histogram which has lots of empty bins, like this:

   |   |   |   |   |
   |   |   |   |   |
   |   |   |   | * |
     0   1   2   3  

Remember, when a bin is empty, the corresponding pair of i-vector indices is the same number. The i-vector for this might look like: [5,5,5,5,6], which is to say:

   Bin  RI Range
   0    5:5 (i.e. empty)
   1    5:5
   2    5:5
   3    5:5
   6    5:6 (i.e. one inside)

For those first three empty bins, there were 4 5's in a row. Why is it 5? Because the histogram has 4 elements. So, we were able to get 4 5's in a row, simply by creating a histogram with an integer 3, like this:

   IDL> h=histogram([3],MIN=0,/BINSIZE,REVERSE_INDICES=ri)

Subtracting off ri[0], and we have 4 0's in a row. Getting closer. How can we arrange to sparsely sprinkle a single drop in all the correct histogram buckets, such that the spacing between them will create the right sort of i-vector? By using total(/CUMULATIVE).

   IDL> n=[1,5,4,1]
   IDL> print,total(n,/CUMULATIVE)
         1.00000      6.00000      10.0000      11.0000

Now we must subtract 1 because it always takes 1 more index for an equivalent set of pairs in the i-vector. You can see that this is now the perfect "sparse sprinkling" to get just what we want:

   IDL> print,histogram(total(n,/CUMULATIVE)-1,MIN=0)
              1           0           0           0           0           1
              0           0           0           1           1

Notice the spacing between the 1's... 1,5,4,1. We're almost there. Now the first part of RI will contain the chunked indices we need:

   IDL> h=histogram(total(n,/CUMULATIVE)-1,MIN=0,REVERSE_INDICES=ri)
   IDL> print,ri[0:n_elements(h)-1]
             12          13          13          13          13          13
             14          14          14          14          15

Simply subtract off the offset:

   IDL> print,ri[0:n_elements(h)-1]-ri[0]
              0           1           1           1           1           1
              2           2           2           2           3

And there you have it. This is not intuitive. It's simply a convenient trick to leverage the speed of HISTOGRAM to do something its designers probably never intended you to do.


Here is another example from the IDL newsgroup:

I need to expand and an array non-uniformly based on its content. I am trying to to the following:

   input array: [1,5,4,1]
   output array should be: [1,0.2,0.2,0.2,0.2,0.2,0.25,0.25,0.25,0.25,1]

Each elements of the input array is basically tuned into FltArr(inputarray[i])/inputarray[i] and the subarray concatenated. Is there a way to do this in one step, without using FOR loops and array concatenation? If not, I can work around this, but knowing for sure that this doesn't work would at least allow me to stop thinking about this problem. :-)

The answer is given like this:

   IDL> n = [1, 5, 4, 1]
   IDL> d = 1./n
   IDL> Print, d
         1.00000     0.200000     0.250000      1.00000
   IDL> h = Histogram(Total(n,/CUMULATIVE)-1, BINSIZE=1, MIN=0, REVERSE_INDICES=ri)
   IDL> indices = ri[0:n_elements(h)-1]-ri[0]
   IDL> Print, d[indices], format='(f4.2)'
         1.00  0.20  0.20  0.20  0.20  0.20  0.25  0.25  0.25  0.25  1.00

Web Coyote's Guide to IDL Programming