"My IDL Program Speed Improved by a Factor of 8100!!!"

"I'm just your average astrophysicist," says Bruno Schmidt, of Northern Florida State University, "I don't know good programming from shinola. But in one day I took an IDL program that took three days to run (and nearly cost me my girfriend when the system administrator threatened to boot me off the system and scuttle my Ph.D. research) and turned it into a flat-out screaming machine that was done before I could even pour myself a cup of coffee." Schmidt, who had been exposed to a bit of FORTRAN programming as an undergraduate ("...the worst experience of my life!"), recalled the moment for National Enquirer.

"I was, like, 'Yeah, sure.' I'd heard that loops were fairly slow in IDL, but I had no idea when I contacted the IDL Newsgroup for some help that those guys could improve my code (which my girlfriend helped me write, by the way) by some 8100 percent. I mean, even I was impressed by that and I'm used to working with some big numbers!

"The problem was pretty simple. I had two columns of numbers: right ascencion and declination. Well, let's just call them X and Y, okay? Essentially they told me where the stars were in space. All I wanted to know was the 6th closest star to each of the others. Simple, right? I mean, this is the kind of stuff astronomers do pretty much every day. Although I can't remember right this second why it was so all-fired important to my advisor to have this information before he went out for beers with the boys Friday afternoon. He probably had some kind of a bet on, knowing him.

"Here is the code that took three days to run with a 10,000 star file. Pretty straightforward stuff. I mean, you just think it through logically, you know. No fancy pansy. Just basic programming like they teach in FORTRAN 101."

[Ed. Note: You can find a similar star problem involving position matching between two lists in this article.]

D2 = fltarr(u, u)                  ; Allocate the distance array up front so it exists.
k=0
while k lt u do begin              ; I know the number of stars (lines), which is equal to "u"
   i=0				          
   while i lt u do begin		
      dxki = (x(k) - x(i))         ; This variable is the RA distance between point i
                                   ; and the kth point in the array.
					
      dyki = (y(k) - y(i))         ; This variable is the Dec distance between point i
                                   ; and the kth point in the array.
	
      dki = sqrt (dxki^2 + dyki^2) ; Now I define the distance, in 2-d space, 
                                   ; between the two objects i and k
		
      D2(k,i) = dki                ; This array contains all the distance data for all the stars.
                                   ; The reference star is the column number (k).
      i = i+1                      ; For all the values of i.
   endwhile			
k=k+1
endwhile

; Now I have to sort the distances so I can find the star distance of 
; the 6th closest star.

n=0
while n lt u do begin
   D2n = D2(n,*)                        ; This is so that I am sorting column by column.
	
   for j=0, n_elements(D2n)-2 do begin		
      for l=j+1, n_elements(D2n)-1 do begin	
         if D2n[l] lt D2n[j] then begin	; This just arranges distances in increasing order.
            temp=D2n[j]			
            D2n[j]=D2n[l]			
            D2n[l]=temp      ; So the array that I get is now the array of distances
         endif               ; to the nearest neighbours, in order of the neighbour number!
      endfor				
   endfor	

   D2(n,*) = D2n
   n=n+1
endwhile
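
[Ed. Note: The inner pair of loops above is a hand-rolled bubble sort. As the first newsgroup reply below points out, IDL's built-in SORT does the same job in one line. A minimal sketch of that single change, keeping Bruno's variable names:]

n=0
while n lt u do begin
   col = reform(D2(n,*))        ; one reference star's distances
   D2(n,*) = col[sort(col)]     ; SORT returns the permutation indices
   n=n+1
endwhile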

"Anyhow, I was thinking I was probably going to have to make the numbers up (How would my advisor know? He can barely tell a parcsec from a hole in the ground), when I decided to give it one more shot and see what the geeks on the IDL newsgroup could come up with. Boy, was I surprised! I mean, I was thinking maybe a factor of 10 or something. Like maybe I could push it off to Sunday and run the program when the Sys Admin was still hung over. You know, get it finished before he heard about it. But within a half hour someone was back to me with a simple modification that improved the speed of a small file with only 3800 stars from 24 hours to 11 minutes. And that was only the start!

"The first modification looked like this. I guess I didn't realize IDL had a SORT command. Heh, heh! No sense re-inventing the wheel, you know. Next time maybe I'll spend a few minutes poking around in the Index, if you catch my drift."

I tried to streamline your code and eliminate loops using as many array operations as I could. In IDL this can significantly increase the speed over using loops. In the code that I wrote below I used a procedure called WHERETOMULTI that converts 1D indices to 2D indices. You can do a search on this newsgroup for that program. Alternatively you can wait until IDL 6.0 comes out and use the nifty ARRAY_INDICES routine.

The following function returns the distance for the 6th nearest neighbor given x-y coordinates in the order in which the coordinates are given. So d6[i] is the distance to the 6th nearest neighbor for the star located at coordinates (x[i],y[i]).

FUNCTION CALCULATE_6TH_NEIGHBOR,x,y
; Calculate all of the distances
n = n_elements(x)
u = 1+bytarr(n)                 ; vector of n ones
dx = u#x-x#u                    ; outer differences: dx[i,j] = x[j]-x[i]
dy = u#y-y#u
d = sqrt(dx^2+dy^2)             ; the full n x n distance matrix
dsort = sort(d)                 ; 1D indices that sort all n*n distances at once
wheretomulti, d, dsort, col, row
; Now get the sixth neighbor
d6 = fltarr(n)
for i = 0L,n-1 do begin
	index = where(col eq i)                  ; sorted positions belonging to column i
	d6[i] = d[col[index[6]],row[index[6]]]   ; element 0 is the star itself
endfor
return,d6
END

On my 1.7 GHz PC with 1 GB of RAM I processed 1000 x-y pairs in 15.9 seconds. I should mention that when I tried to process 10000 x-y pairs my machine was unable to allocate enough memory to create the arrays. So this may not have solved your problem, but it might lead you in the right direction for how to remove loops to gain speed.
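
[Ed. Note: If you do have IDL 6.0 or later, ARRAY_INDICES can stand in for WHERETOMULTI in the function above. A minimal sketch of the substitution, using the same variable names:]

ind = array_indices(d, dsort)   ; a [2, n*n] array of subscripts
col = reform(ind[0, *])         ; first (column) subscript of each sorted element
row = reform(ind[1, *])         ; second (row) subscript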

"I'd never seen those pound signs (#) before, and didn't know what the hell they were doing in the code, but they sure did something, because this thing was blazing. I was thinking, like, maybe I could still get over to Carolyn's apartment for a little nooky later tonight instead of having to pull another all-nighter. The standard excuse about 'observing tonight' was wearing a little thin, to tell you the truth, and I think Carolyn was starting to wonder (her mother had put this idea into her head, I'm sure) if I had any plans to graduate in the next 10 years or so.

"Anyway, about five minutes later, another suggestion shows up from some Russian dude named Pavel with a terrible accent (you can tell from all the typos in the code). He has this to say."

What you're discovering, meanwhile, is that for large arrays (n on the order of 10^7 points) memory allocation is slower than loops. This is what's happening with Rob's code. The program, while great, expands the data size by at least a factor of 8 more than necessary (counting matrices and index expansion for SORT), hence the incredible memory consumption.

Try the following:

FUNCTION CALCULATE_6TH_NEIGHBOR, x, y
; Calculate all of the distances
n = n_elements(x)
u = 1+bytarr(n)                              ; vector of n ones
d6 = fltarr(n)

dx = u#x                                     ; row i filled with x[i]
dy = u#y                                     ; row i filled with y[i]
for i = 0L, n-1 do dx[0, i] = dx[*, i] - x   ; row i becomes x[i]-x
for i = 0L, n-1 do dy[0, i] = dy[*, i] - y   ; row i becomes y[i]-y
d2 = sqrt(dx^2+dy^2)                         ; distance from every point to every point

for i = 0L, n-1 do begin
	temp = d2[*, i]            ; distances from point i to everything
	ind = sort(temp)
	d6[i] = temp[ind[6]]       ; entry 0 is the point itself, so [6] is the 6th neighbor
endfor
return, d6
END

Not all loops are bad in IDL.

Why are these loops faster? The memory allocation is much smaller, so contiguous chunks are more easily available, and that is faster; all the loops run along rows and are kept short, which is important. In my test, the speed is higher by a factor of 20, with identical results.
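
[Ed. Note: Timings like Pavel's are easy to reproduce with SYSTIME. A minimal harness, assuming whichever version of CALCULATE_6TH_NEIGHBOR above has been compiled:]

n = 1000
x = randomu(seed, n)            ; made-up star positions
y = randomu(seed, n)
t0 = systime(1)
d6 = calculate_6th_neighbor(x, y)
print, 'Elapsed: ', systime(1) - t0, ' seconds'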

"But the kicker was when some guy named 'Smith' shows up. This guy is, like, some kind of genius or something. A big cheese in the IDL Expert Programmers Association, apparently. Very well known (although I'd never heard of him before, but what do I know).

"Anyway, he's got his shit squared away and he lays it all out for me. Even Carolyn, who is a social scientist, for God's sake, can probably understand it. His code takes my 3 day program down to a matter of a few seconds. I've never seen anything like it. It's a friggin' miracle, is what it is! Here is what he has to say."

This is an excellent test problem that demonstrates well the variety of tradeoffs one must make to get good performance out of IDL. Pavel's method demonstrates the vectorized version of a brute force technique, cast in terms of (potentially rather large) arrays. It is essentially the exact analog of your method, but designed to use array operations, which IDL is very good at (especially large arrays -- it's what it was designed for).

The reason why your method performed so poorly is that IDL "for" loops impose a fairly large (compared to other languages) overhead on each cycle. For the curious, this is likely because it must travel around the full interpreter loop on each iteration, possibly compiling things, processing background widget events, looking for keyboard input, processing signals, etc. -- very different from, e.g., a loop in C, which runs quite close to the hardware. As Pavel points out, if you can do a substantial amount of work in each loop iteration -- more than the large constant overhead imposed by that cycle -- "for" loops can give perfectly good performance (despite the quasi-fanatical facetious rantings to the contrary you may find on certain respected IDL resource sites).
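
[Ed. Note: The per-iteration overhead Smith describes is easy to see for yourself. A minimal sketch comparing an explicit loop against the equivalent single array operation:]

n = 1000000L
a = findgen(n)
b = fltarr(n, /NOZERO)
t0 = systime(1)
for i = 0L, n-1 do b[i] = a[i] * 2.0   ; pays the interpreter overhead n times
print, 'Loop:  ', systime(1) - t0, ' seconds'
t0 = systime(1)
b = a * 2.0                            ; one trip through the interpreter
print, 'Array: ', systime(1) - t0, ' seconds'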

So one way to get good speed is to re-cast your problem in terms of the largest array operation you can, and then let IDL do what it does best: chew on those arrays. Let's see how this applies to your problem. Pavel had a few bugs in his code, and I like to build big arrays another way, so here's my version of the vectorized brute force search:

function nth_neighbor, x, y, k
  n = n_elements(x)
  dn = fltarr(n,/NOZERO)
  ;; Squared distances between all pairs, built with REBIN rather than #
  d=(rebin(transpose(x),n,n,/SAMPLE)-rebin(x,n,n,/SAMPLE))^2 + $
    (rebin(transpose(y),n,n,/SAMPLE)-rebin(y,n,n,/SAMPLE))^2
  ;; Sort each row and pick the kth entry (entry 0 is the point itself)
  for i=0L,n-1 do dn[i] = sqrt(d[(sort(d[*,i]))[k],i])
  return, dn
end

What am I doing here? I just make a huge array to compare every point with every other point, all at once. I do this by making pairs of arrays which are the transpose of each other and subtracting them. E.g. for 4 points, computing dx looks like:

     x0 x0 x0 x0     x0 x1 x2 x3 
     x1 x1 x1 x1     x0 x1 x2 x3 
dx = x2 x2 x2 x2  -  x0 x1 x2 x3 
     x3 x3 x3 x3     x0 x1 x2 x3 

So, you see, this 16-element array contains all of the x deltas from each point to each other point (in fact, it wastefully contains them all twice, since for distance purposes x1-x2 and x2-x1 are equivalent, plus it contains the distance from each point to itself). If I do something similar for the y values, square each difference, and add the two, I get a large array which contains the distances from each point to each other point (actually the square of the distances... I save time by skipping the square root until it's needed at the end, since sorting on d^2 and d gives the same result). I sort them row by row, and pick out the nth member of the sorted list.
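
[Ed. Note: You can verify the REBIN/TRANSPOSE trick at the command line with the four-point example above:]

x = [0., 1., 3., 6.]                  ; any four made-up values
dx = rebin(transpose(x), 4, 4, /SAMPLE) - rebin(x, 4, 4, /SAMPLE)
print, dx                             ; row i holds x[i] - x[j] across columns j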

This method works well, and is fast compared to your method, but it has a deep flaw, which you've already run up against. It requires the use of arrays of size n^2 (16 in this small example). This gets big fast! E.g., for 10,000 points, I might need 4 such arrays of size 10,000 x 10,000, which is around 1e8 x 4 x 4 = 1.6 GB of memory! You're entirely limited by how much memory you have, or, more accurately, how fast your disks and virtual memory sub-system are. While very fast for small arrays which can fit in memory, it just doesn't scale up to large data sets. Now you understand why the code choked on 10,000 points.

The reason brute force fails, whether cursed by slowness (in your method) or by huge storage needs (in Pavel's), is all those comparisons. If you have to compare every single point to every single other point, that's n*(n-1)/2 comparisons at best, i.e., O(n^2). But intuitively, why should we compare all the points? We know the points way over on the other side aren't going to be the closest, but we compute their distances and sort on them as if they ever had a shot anyway. But how can we decide whether things are close or not without comparing all their distances? A seeming Catch-22.

One method is triangulation. IDL has a lovely function called TRIANGULATE which can compute a so-called Delaunay triangulation (or DT -- do read about it) which has some nice properties. One of these properties is that the nearest neighbors of a point are either connected to it directly by the DT, or connected through some closer point which is (mathematicians would say the nearest neighbors graph is a "sub-graph" of the Delaunay). Here's a function based on this property:

function nearest_neighbors,x,y,DISTANCES=nearest_d,NUMBER=k
  if n_elements(k) eq 0 then k=6
  
  ;; Compute Delaunay triangulation
  triangulate,x,y,CONNECTIVITY=c
  n=n_elements(x) 
  nearest=lonarr(k,n,/NOZERO)
  nearest_d=fltarr(k,n,/NOZERO)

  for point=0L,n-1 do begin 
     p=c[c[point]:c[point+1]-1] ;start with this point's DT neighbors
     d=(x[p]-x[point])^2+(y[p]-y[point])^2
     for i=1,k do begin 
        s=sort(d)               ; A wasteful priority queue, but anyway
        p=p[s] & d=d[s]
        nearest[i-1,point]=p[0] ; Happy Day: the closest candidate is a NN
        nearest_d[i-1,point]=d[0]
        if i eq k then continue
        
        ;; Add all its neighbors not yet seen
        new=c[c[p[0]]:c[p[0]+1]-1]
        nnew=n_elements(new) 
        already_got=[p,nearest[0:i-1,point],point]
        ngot=n_elements(already_got) 
        wh=where(long(total(rebin(already_got,ngot,nnew,/SAMPLE) ne $
                            rebin(transpose(new),ngot,nnew,/SAMPLE),1)) $
                 eq ngot, cnt)
        if cnt gt 0 then begin 
           new=new[wh]
           p=[p[1:*],new]
           d=[d[1:*],(x[new]-x[point])^2+(y[new]-y[point])^2]
        endif else begin 
           p=p[1:*]
           d=d[1:*]
        endelse 
     endfor
  endfor 
  if arg_present(nearest_d) then nearest_d=sqrt(nearest_d)
  return,nearest
end 

I won't go through all the details. Roughly, it just starts searching out to nearby points on the DT, expanding outwards until it has enough. For each of the n points, instead of computing distances to and sorting n points, it operates on a much smaller number. It gives you back all of the indices (and, optionally, distances) of the k nearest points, and has one quirk: for points on the boundary, the point itself is included first on the list. You could obviously test for this and reject them (since, with half their neighbors "missing", they're suspect anyway). One other nice feature: once you build the DT, you can cache it and use it again and again, if, e.g., you're finding the nearest neighbors interactively.
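
[Ed. Note: A short usage sketch for the function above, with made-up data:]

n = 1000
x = randomu(seed, n)
y = randomu(seed, n)
nn = nearest_neighbors(x, y, NUMBER=6, DISTANCES=d)
print, 'Indices of the 6 nearest to star 0: ', nn[*, 0]
print, 'Distances to those neighbors:       ', d[*, 0]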

Here are some timings comparing the two, for finding the 6th/1st-6th neighbors of a random set of points:

     100 Points, 6 Nearest Neighbors
nxn:    0.0082000494
 DT:     0.030847073

Hmm... nxn does quite well there. But 100x100 is pretty small ;)

    1000 Points, 6 Nearest Neighbors
nxn:      0.79809201
 DT:      0.31871009

Starting to suffer, but we've still got a few MB of memory to spare.

    5000 Points, 6 Nearest Neighbors
nxn:       23.420803
 DT:       1.5785370

Here comes trouble. The relative timings will depend strongly on how much memory you have. Here we're just managing to fit about 400MB in memory, but danger looms ahead.

    7000 Points, 6 Nearest Neighbors
nxn:       152.13264
 DT:       2.3276160

With 800MB or so needed for that big nxn calculation on a 500MB laptop, we really hit the disk this time. Notice how the DT method doesn't miss a beat, just scaling up roughly as n.

Keep in mind that as the number of nearest neighbors you want gets bigger, the brute force method starts looking better and better (i.e. the problem gets harder and harder), but it still won't scale:

    1000 Points, 50 Nearest Neighbors
nxn:      0.81207108
 DT:       3.1622850

And now, just to show it can be done:

      100000 Points, 6 Nearest Neighbors
 DT:       32.612031

I left out nxn since I didn't have a spare 200GB of memory to throw around ;).

So, should we congratulate ourselves for having found a truly fast record-setting algorithm? No, probably not. My code is actually fairly sloppy in the inner loop, re-sorting things that are already mostly sorted, finding and comparing more DT-connected points than strictly necessary to get to the nearest set, concatenating and re-allocating small arrays like mad. The point is, I could afford to be sloppy, since the competition just couldn't scale. By working on the inner loop, I could probably squeeze another factor of 3 out of it in IDL. Carefully re-coded in C, you could easily see a factor of 50-100 speedup. But try something like this before you resort to that.

The take home lesson? IDL probably isn't the best tool if you need raw speed on problems it isn't optimized for (adding arrays and the like), but with a nice collection of tricks in your toolkit, you can take it places it wasn't really designed to go.

"So there you go. I wouldn't have believed it myself if I hadn't of seen it with my own two eyes. My advisor still can't believe it. He even introduced me as his 'prize graduate student' to the guest lecturer at the journal club yesterday. Imagine that. I'm beginning to think that with this kind of speed-up on my programs I might actually get outta here in a couple of years. Who would've thought that a week or so ago!?"
