Coyote's Guide to IDL Programming

Are FOR Loops Really Evil?

QUESTION: I've read JD Smith's tome on the dangers of the FOR loop, but I remain unconvinced. Don't you have a stable of more secular writers in that den of iniquity over there?

ANSWER: Perhaps you are right. JD is pretty much on the fringe most of the time. Here are a couple of traditional answers from two prominent members of the IDL community: Craig Markwardt and Struan Gray. First, Craig.

So what defines a slow FOR loop? A good question. I think there are two parts to it, and you are on the right track. Theory comes first, then some practical solutions.

IDL is a scripted language, so almost by definition very little optimization can be done. Sure, the IDL code is "compiled," but what that really means is that the IDL statements are converted to some intermediate form, sometimes called bytecode. These bytecodes represent a higher level of abstraction and portability compared to true machine code, but with that comes the burden that each bytecode requires many, many machine code operations to implement.

Therein lies the rub. Even for a simple FOR loop, there's always going to be a lot of "overhead" or extra processing. Darn. This is the price we pay to have IDL be so general purpose. Another thing to keep in mind is that basically everything gets re-evaluated each iteration. When you compare the following two examples:

  for i = 0L, N-1 do x(i) = f(x(i).tag)    ; vs.
  x = f(x.tag)

you should realize that in the first, the values of I, X(I), X(I).TAG and F(X(I).TAG) are re-evaluated every iteration, because IDL is dynamic. In the second example, the operands are evaluated once.

What this really boils down to is: reduce the number of iterations, and do more work per iteration. This minimizes the overhead relative to the "real" processing.

You can do more per iteration with vectorization. The vectorized operations are heavily optimized for speed, but you need to know how to take advantage of them. That is left as an exercise to the reader :-), but obviously it's often non-trivial, and sometimes impossible.
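To make the idea concrete, here is a minimal sketch (the vectors data and scale and the length n are hypothetical) contrasting an element-by-element loop with its vectorized equivalent:

   ; Loop version: operands are re-evaluated and one scalar operation
   ; is performed per iteration.
   result = fltarr(n)
   for i = 0L, n-1 do result[i] = sqrt(data[i]) * scale[i]

   ; Vectorized version: a single pass through optimized array code.
   result = sqrt(data) * scale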

Let's see what this boils down to.

Q: When is the number of iterations, N, too many?

A: When the execution time of the empty loop becomes perceptible:

   FOR I = 0L, N-1 DO BEGIN & END
(This is about a million iterations on my faster machine.)
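If you want to measure this on your own machine, a quick sketch using SYSTIME will do (the iteration count here is arbitrary):

   n = 1000000L                          ; number of iterations to test
   t0 = systime(1)                       ; start time, in seconds
   for i = 0L, n-1 do begin & end        ; the empty loop
   print, 'Empty loop time:', systime(1) - t0, ' seconds'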

Q: How much vectorization should I do?

A: Generally it is sufficient to only vectorize the innermost loop. If you have an image with rows and columns, and you can vectorize the operation at the row-level, you should be safe. This is often described as chunking or banding.
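As a hypothetical illustration of chunking, the fragment below smooths an image one row at a time: the loop runs only over the rows, while each iteration does vectorized work on an entire row.

   ncols = 2048 & nrows = 2048
   image = randomn(seed, ncols, nrows)
   result = fltarr(ncols, nrows, /nozero)
   ; Loop over rows only; each iteration processes a whole row at once.
   for j = 0L, nrows-1 do result[*, j] = smooth(image[*, j], 5)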

Here is a bit of IDL-ku for you to ponder as you think about these questions.

for when=you, must do $
many, many, iterates, $
vectorize (a=chunk)

Next, thoughts from Struan.

Wayne Landsman writes:

> inarr= randomn(seed, 3, 2048,2048)
> outarr = fltarr(2048,2048,/nozero)
>
> (1) for j=0,2047 do for i=0,2047 do outarr[i,j] = median(inarr[*,i,j])
>
> (2) for j=0,2047 do begin
>     for i=0,2047 do begin
>         outarr[i,j] = median(inarr[*,i,j])
>     endfor
>     endfor
>
> Form (1) is slightly faster, but the calculation
> cannot be interrupted with a Control-C.   Also,
> it is my impression that the speed difference
> is less than it used to be, and that form (2)
> is now better optimized.

On my machine (a Mac G3 powerbook, IDL 5.3) (1) is slightly faster for small arrays but the difference is insignificant by the time you are up to 2048x2048.

> (I also assume that the two FOR loops are unavoidable
> here, but I would be delighted to be proved wrong.)

I often take data which consists of 1D spectra on a 2D spatial grid, ending up with arrays which are (Nspecpoints, Ngridx, Ngridy) in dimension. Often I want to do some processing operation on all the individual spectra, and it helps a lot to "unwrap" the array so that you do one loop instead of two nested ones. In your case the code would look like this:

    npoints = 2048
    inarr = reform(inarr, 3, npoints*npoints, /overwrite)
    outarr = fltarr(npoints*npoints, /nozero)
    for i=0L, long(npoints)*npoints - 1 do $
      outarr[i] = median(inarr[*,i])
    inarr = reform(inarr, 3, npoints, npoints, /overwrite)
    outarr = reform(outarr, npoints, npoints, /overwrite)
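(The /OVERWRITE keyword lets REFORM change the array dimensions in place instead of making a copy, which matters for arrays this size; the trailing REFORM calls simply restore the original shapes afterwards.)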

For the sorts of arrays I use (100, 200, 200) this is quite a bit faster, but interestingly enough by the time you get up to arrays like yours the speed advantage has gone. Watch out for overflow on your loop indices.

In this particular case you can actually work without loops altogether:

    inarr = reform(inarr, 3*npoints*npoints, /overwrite)
    outarr = reform(median(inarr, 3), 3, npoints, npoints, /overwrite)
    outarr = reform((temporary(outarr))[1,*,*])
    inarr = reform(inarr, 3, npoints, npoints, /overwrite)

The median filter creates a whole load of "wrong" elements (those whose three-point window straddles two neighbouring pixels), but we can ignore them; eliminating the loop speeds the whole thing up so much that it's worth the overhead of computing them. This version was substantially faster than the other three at all array sizes.

Both of these techniques require the "interesting" dimension to be first, but sandwiching the code with a ROTATE or TRANSPOSE will do that without punitive overhead, so they can be made quite general.
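For instance, if the spectral dimension happened to be last rather than first (a hypothetical layout of [Ngridx, Ngridy, Nspecpoints]), the sandwich might look like this sketch:

    data = transpose(data, [2, 0, 1])   ; now [Nspecpoints, Ngridx, Ngridy]
    ; ... unwrap with REFORM and loop over the spectra as above ...
    data = transpose(data, [1, 2, 0])   ; back to [Ngridx, Ngridy, Nspecpoints]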

I apologise to J.D. for not fitting HISTOGRAM in there somewhere.

FOR Loop Case Study

And, finally, here is an interesting exchange that occurred in the IDL newsgroup on March 4th, 2005. Benjamin Hornberger found that FOR loops were actually faster than loop-free array code in his example, and wondered why. The answer was provided by no less an authority than JD himself. First, Benjamin.

I'm disappointed. I've been told loops suck and arrays rock in IDL. I've read the dimensional juggling tutorial and was striving hard to eliminate all loops from my code. Until today, when suddenly my code with two FOR loops ran twice as fast as my loopless code. My one-loop example fell right in between the others.

The scientific case: image_data is a 3D array, a stack of “channels” of the same image [n_fast_pixels, n_slow_pixels, n_data] (take fast and slow as x and y, if you want). A display is a linear combination of channels. The variable linear_combinations is a 2D array [n_data, n_displays] holding the coefficients for n_displays interesting displays. The entries will typically be -1, 0, or 1.

My example program, Test_Loop, calculates the variable displays, a 3D array [n_fast_pixels, n_slow_pixels, n_displays] holding all these interesting displays in one array. I can do it loopless, or with one or two loops. In the array calculation, I even packed everything into one expression, making the code quite unreadable, to make sure I don't lose time by making unnecessary copies of arrays. With two loops it's twice as fast as without; with one loop it's right in between.

Yes, I also heard that FOR loops are not always evil. Still, it would be great if somebody could shed some light on why loops are faster in this particular case. Do the REBIN/REFORM steps take a lot of time? If I have to execute that code many times, do I gain by saving and reusing the result of the REBIN/REFORM step?
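(For readers who have not seen the dimensional juggling technique, the loop-free calculation Benjamin describes presumably looks something like the sketch below; the variable names are taken from his description, and the details of his actual code may differ.)

   ; Stretch the coefficients and the data cube to a common 4D shape,
   ; multiply, and sum over the channel dimension (dimension 3).
   coeffs = rebin(reform(linear_combinations, 1, 1, n_data, n_displays), $
                  n_fast_pixels, n_slow_pixels, n_data, n_displays, /sample)
   cube = rebin(image_data, n_fast_pixels, n_slow_pixels, n_data, $
                n_displays, /sample)
   displays = total(coeffs * cube, 3)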

By the way, I also hit the point when I didn't have enough memory to hold the 4D intermediate array. Then arrays really suck. The task manager had to come rescue me.

This is all on Win XP, IBM Thinkpad T40 (Pentium M, 1.6 GHz, 512 MB).

  
  IDL> test_loop,fltarr(300,300,10),fltarr(10,20),disps,nloops=0
  took       0.82099986 sec
  IDL> test_loop,fltarr(300,300,10),fltarr(10,20),disps,nloops=1
  took       0.58000016 sec
  IDL> test_loop,fltarr(300,300,10),fltarr(10,20),disps,nloops=2
  took       0.47000003 sec
  

This was followed by JD's response.

Well, as the perpetrator of the tongue-in-cheek embodiment-of-pure-evil FOR loop characterization, I'll add this qualifier: arrays are great, until you run out of memory for them. Then they are paged to disk, and take 1000 times as long to access.

Thought experiment. Total up the first 20 billion prime numbers.

ARRAY method: allocate array of 20 billion long-long(-long-long) integers, pre-compute primes and fill into array, TOTAL.

LOOP method: compute primes in sequence, increment running total as you go.

Can you see why the former is doomed to fail?
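(Twenty billion 64-bit integers come to roughly 160 GB of memory, far more than fits in RAM, so the array would end up paged to disk; the loop's running total needs only a few bytes.)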

There are classes of algorithms where you trade memory size for compute speed. The typical trade-off occurs at a much larger memory size in IDL than it does in many other languages, because IDL is so fast at manipulating huge piles of array data, and so relatively slow at looping through things.

Then, there are other problems where your memory size is fixed (e.g., you have a real data cube of several GB). This is a crucial distinction. In the second case, your data size is dictated. You must try to cut it up into big but manageable chunks and operate on them.

In your case, you're trying to create a few floating-point arrays of size 300x300x10x20, which should easily fit in memory. However, in none of your cases are you accessing a single index at a time, so they are all "array based" at some level. Even your two-loop example is hammering away at sub-images of size 300x300 per iteration.
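(The two-loop case he mentions presumably looks something like this sketch, with the variable names taken from the problem description above; note that each iteration still operates on a full 300x300 sub-image.)

   displays = fltarr(n_fast_pixels, n_slow_pixels, n_displays)
   ; Each iteration adds one scaled 300x300 sub-image, not a single pixel.
   for i = 0, n_displays-1 do $
      for j = 0, n_data-1 do $
         displays[*, *, i] += linear_combinations[j, i] * image_data[*, *, j]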

Why don't you try adding another case:

          else if nloops eq 4 then begin
             ; (The index i is the display index, presumably supplied by an
             ; enclosing loop over displays in the full Test_Loop program.)
             FOR j=0, n_data-1 DO  $
                for k=0,n_fast_pixels-1 do $
                   for l=0,n_slow_pixels-1 do $
                      displays[k, l, i] += linear_combinations[j, i] * $
                                           (image_data)[k, l, j]
  

to see what happens when you iterate over each and every member of the array:

  IDL> test_loop,findgen(300,300,10),findgen(10,20),disps,nloops=0
  took        1.1159251 sec
  IDL> test_loop,findgen(300,300,10),findgen(10,20),disps,nloops=1
  took       0.96256900 sec
  IDL> test_loop,findgen(300,300,10),findgen(10,20),disps,nloops=2
  took        2.0543392 sec
  IDL> test_loop,findgen(300,300,10),findgen(10,20),disps,nloops=4
  took        27.267108 sec
  

Ouch! As Craig has said many times, if you can keep the amount of work you are doing per iteration high, you won't feel the looping "penalty" much, and can actually keep up good speed. As soon as the looping penalty is comparable to the work you are doing per iteration (or much larger, as in the last example), you're out of luck. The fastest case for me, NLOOPS=1, still uses REFORM/REBIN.
