Set Operations on Arrays A and B

QUESTION: Given two arrays of positive values, A and B, how can I efficiently determine which elements in A and B are unique (the union), which elements in A are also in B (the intersection), and which elements in A are not in B (the difference)?

ANSWER: This answer was provided by Research Systems itself on the IDL newsgroup. Many thanks to the people at RSI who read the newsgroup and take the questions raised there seriously. Here is their post.

A somewhat belated reply to the numerous postings on finding the common elements of vectors:

Given vectors of the type...

   vect_1 = [1,2,3,4,5]
   vect_2 = [3,4,5,6,7]

What is the most efficient way to determine which values that occur in vect_1 also occur in vect_2 (i.e., the values [3,4,5] occur in both vect_1 and vect_2).

Below appear three IDL functions that operate on sets represented by arrays of positive integers. The SetIntersection function returns the common elements, SetUnion returns all unique elements in both arguments, and SetDifference returns the elements (members) in vect_1 but not in vect_2.

These routines are faster than previously published functions, e.g. Contain and Find_Elements.

Hope this helps,

Research Systems, Inc.

Program Update

Please note that the three functions below have been written in a more formal way and added to the Coyote Library. In the process, error handling has been added, and several program bugs have been fixed. They are named cgSetDifference, cgSetIntersection, and cgSetUnion.

   ; Set operators. Union, Intersection, and Difference (i.e. return
   ; members of A that are not in B.)

   ; These functions operate on arrays of positive integers, which need
   ; not be sorted. Duplicate elements are ignored, as they have no
   ; effect on the result.

   ; The empty set is denoted by an array with the first element equal to
   ; -1.

   ; These functions will not be efficient on sparse sets with wide
   ; ranges, as they trade memory for efficiency. The HISTOGRAM function
   ; is used, which creates arrays of size equal to the range of the
   ; resulting set.

   ; For example:

   ;   a = [2,4,6,8]
   ;   b = [6,1,3,2]

   ; SetIntersection(a,b) = [ 2, 6]       ; Common elements
   ; SetUnion(a,b) = [ 1, 2, 3, 4, 6, 8]  ; Elements in either set
   ; SetDifference(a,b) = [ 4, 8]         ; Elements in A but not in B
   ; SetIntersection(a,[3,5,7]) = -1      ; Null Set

   FUNCTION SetUnion, a, b
   IF a[0] LT 0 THEN RETURN, b    ;A union NULL = a
   IF b[0] LT 0 THEN RETURN, a    ;B union NULL = b
   RETURN, Where(Histogram([a,b], OMin = omin)) + omin ; Return combined set
   END

Maarten Sneep has provided an alternative algorithm for creating the union of two arrays, which can potentially be faster and more efficient for sparse arrays. The algorithm is as follows.

   FUNCTION SetUnion, a, b
   superset = [a, b]
   union = superset[Uniq(superset, Sort(superset))]
   RETURN, union
   END

   FUNCTION SetIntersection, a, b

   ; Find the intersection of the ranges.
   mina = Min(set_a, Max=maxa) 
   minb = Min(set_b, Max=maxb)
   minab = mina > minb
   maxab = maxa < maxb

   ; If the set ranges don't intersect, then result = NULL.
   IF ((maxa LT minab) AND (minb GT maxab)) OR $
      ((maxb LT minab) AND (mina GT maxab)) THEN RETURN, -1

   r = Where((Histogram(a, Min=minab, Max=maxab) NE 0) AND  $
             (Histogram(b, Min=minab, Max=maxab) NE 0), count)

   IF count EQ 0 THEN RETURN, -1 ELSE RETURN, r + minab
   END

   FUNCTION SetDifference, a, b  

      ; = a and (not b) = elements in A but not in B

   mina = Min(a, Max=maxa)
   minb = Min(b, Max=maxb)
   IF (minb GT maxa) OR (maxb LT mina) THEN RETURN, a ;No intersection...
   r = Where((Histogram(a, Min=mina, Max=maxa) NE 0) AND $
             (Histogram(b, Min=mina, Max=maxa) EQ 0), count)
   IF count eq 0 THEN RETURN, -1 ELSE RETURN, r + mina
   END

These three functions (in addition to requiring some error checking if you are really going to use them) inspired quite a discussion on the IDL newsgroup. Here is a (slightly) edited representation of that discussion.

   From: Mark Fardal (fardal@weka.astro.umass.edu)
   Subject: matching lists 
   Newsgroups: comp.lang.idl-pvwave
   Date: 2000/03/10 
    
   Hi,

   I have been looking at properties of particles in a simulation, and
   sometimes need to match up the particles in two different subsets.  I
   typically have (quantity A, index #) for one set of particles, and
   (quantity B, index #) for another set, and want to compare quantities
   A and B for the particles that are in both sets. 

   As of late last night I could not think of a good way to do this;
   WHERE inside a for-loop would be very slow.  Maybe I'm missing
   something easy, but in any case here's a solution inspired by the
   recently submitted SETINTERSECTION function.  Hope somebody finds
   it useful.

   Mark Fardal
   UMass

   ;+
   ; NAME: 
   ;        LISTMATCH
   ;
   ; PURPOSE: 
   ;   find the indices where a matches b
   ;   works only for SETS OF UNIQUE INTEGERS, e.g., array indices
   ;
   ;   for example: suppose you have a list of people with their ages and
   ;   social security numbers (AGE, AGE_SS), and a partially overlapping
   ;   list of people with their incomes and s.s. numbers (INCOME,
   ;   INCOME_SS).  And you want to correlate ages with incomes in the
   ;   overlapping subset.  Call
   ;      LISTMATCH, AGE_SS, INCOME_SS, AGE_IND, INCOME_IND
   ;   then AGE[AGE_IND] and INCOME[INCOME_IND] will be the desired
   ;   pair of variables.
   ;
   ; AUTHOR:
   ;   Mark Fardal
   ;   UMass (fardal@weka.astro.umass.edu)
   ;
   ; CALLING SEQUENCE:
   ;   LISTMATCH, a, b, a_ind, b_ind
   ;   
   ; INPUTS:
   ;   a and b are sets of unique integers (no duplicate elements)
   ; 
   ; OUTPUTS:
   ;   a_ind, b_ind are the indices indicating which elements of a and b
   ;   are in common
   ;
   ; RESTRICTIONS:
   ;   if the indices are not unique some matching elements may be skipped...
   ;   or is it worse than that?
   ; EXAMPLE:
   ;   a = [2,4,6,8]
   ;   b = [6,1,3,2]
   ;   listmatch, a, b, a_ind, b_ind
   ;    print, a[a_ind]
   ;       2       6
   ;    print, b[b_ind]
   ;       2       6
   ;   
   ;
   ; MODIFICATION HISTORY:
   ;   none
   ; BUGS:
   ;   tell me about them
   ; ACKNOWLEDGEMENTS: 
   ;   trivial modification of SETINTERSECTION from RSI
   ;
   pro listmatch, a, b, a_ind, b_ind
   minab = min(a, MAX=maxa) > min(b, MAX=maxb) ;Only need intersection of ranges
   maxab = maxa < maxb
   ;If either set is empty, or their ranges don't intersect: 
   ;  result = NULL (which is denoted by integer = -1)
   if maxab lt minab or maxab lt 0 then begin
     a_ind = -1
     b_ind = -1
     return
   endif

   ha = histogram(a, MIN=minab, MAX=maxab, reverse_indices=reva)
   hb = histogram(b, MIN=minab, MAX=maxab, reverse_indices=revb)

   r = where((ha ne 0) and (hb ne 0), count)
   if count gt 0 then begin
     a_ind = reva[reva[r]]
     b_ind = revb[revb[r]]
     return
   endif else begin
     a_ind = -1 
     b_ind = -1 
     return
   endelse

   end

   From: Craig Markwardt (craigmnet@cow.physics.wisc.edu)
   Subject: Re: matching lists 
   Newsgroups: comp.lang.idl-pvwave
   Date: 2000/03/10 
    
   I also submit CMSET_OP, a function I recently posted on my web page.
   (Actually, I'm not sure if Mark is referring to that by
   SETINTERSECTION).

   Advantages are: 
    * works on any numeric or string data type
    * works in order (n1+n2)*log(n1+n2) time or better, rather than n1*n2
    * uses the histogram technique for short integer lists as JD suggests
    * also does "union" and "exclusive or"
    * also does A and NOT B  or vice versa

   Disadvantages: 
    * it removes duplicates, treating the two lists strictly as sets.
    * returns values, not indices

   Craig

   http://cow.physics.wisc.edu/~craigm/idl/idl.html

   From: J.D. Smith (jdsmith@astro.cornell.edu)
   Subject: Re: matching lists 
   Newsgroups: comp.lang.idl-pvwave
   Date: 2000/03/12 

   The flag+shift method which I came up with a few years back (search for
   "Efficient comparison of arrays", circa 1997) was surrounded by a great deal of
   discussion about the best type of algorithm for this method. I'll quote from one
   of my postings at that time:

   ==============================================================================================
   On another point, it is important to note that the problem of finding where b's
   values exist in a (find_elements()) is really quite different from the problem
   that contain() attempts to address: finding those values which are in the
   intersection of the vectors a and b (which may be of similar sizes, or quite
   different).  The former is a more difficult problem, in general, which
   nonetheless can be solved quite rapidly as long as one vector is quite short. 
   But the time taken scales as the number of elements in b, as opposed to the
   comparative size of b (to the total elements in a and b) -- i.e. nearly constant
   with increasing length of b.  Anyway, it is important to understand the various
   scales, sizes and efficiencies in the problem you are trying to solve if you
   hope to come up with an effective solution.
   ==============================================================================================

   The point of which is that cmset_op, while a very complete and feature-packed
   collection of the various *value-based* set algorithms proposed over the years,
   does not solve the *index-based* set intersection operation originally
   requested.  This is a harder problem, also solveable for non-sparse sets with
   histogram and reverse indices, or with the full n-squared array comparison.  

   A sort-based method which solves this problem, but which also lack grace when
   dealing with repeated values (just selecting the last one which exists in the
   intersection), is as follows:

   ;; Return the indices of values in a which exists anywhere in b (one only for repeated values)
   function ind_int_SORT, a, b
      flag=[replicate(0b,n_elements(a)),replicate(1b,n_elements(b))]
      s=[a,b]
      srt=sort(s)
      s=s[srt] & flag=flag[srt]
      wh=where(s eq shift(s,-1) and flag ne shift(flag, -1),cnt)
      if cnt eq 0 then return, -1
      return,srt[wh]
   end

Editor's Note: The IDL Sort command, used in the IND_INT_SORT function above, does not preserve index order for those values that are "equal" in the sorting operation. (This is true at least on Windows and Macintosh platforms through IDL 6.3, and it might possibly be true for some versions of UNIX.) As a result, it is possible for the IND_INT_SORT function to throw array index out of bounds errors. One way to fix this problem, if you are certain you do not have any repeating indices is to return this:

   return, srt[wh] < srt[wh+1]

instead of this:

   return, srt[wh]

Another solution is the use the NASA Astronomy routine BSORT, which reorders duplicates to preserve the original order. This will compromise speed somewhat, but is a good alternative and works reliably.

   Which can be compared to the ARRAY, and HISTOGRAM methods:

   function ind_int_ARRAY, a, b
      Na = n_elements(a)           
      Nb = n_elements(b)
      l = lindgen(Na,Nb)
      AA = A(l mod Na)
      BB = B(l / Na)
      
      ;;compare the two matrices we just created
      I = where(AA eq BB,cnt)
      if cnt eq 0 then return,-1
      return,I mod Na
   end

   function ind_int_HISTOGRAM, a, b
      minab = min(a, MAX=maxa) > min(b, MAX=maxb) 
      maxab = maxa < maxb
      ha = histogram(a, MIN=minab, MAX=maxab, reverse_indices=reva)
      hb = histogram(b, MIN=minab, MAX=maxab)
      r = where((ha ne 0) and (hb ne 0), cnt)
      if cnt eq 0 then return, -1
      return,reva[reva[r]]
   end

   A time comparison of the three methods, revealed the following for unique
   integer vectors:

   1. ARRAY is almost always slowest due to the large memory requirement.  Only for
   very small input vectors (10 elements or so), will ARRAY beat SORT.  It is,
   however, the only method which correctly identifies repeated entries (not used
   in this test).

   2. SORT is faster than HISTOGRAM for sets sparser than about 1 in 20, otherwise
   histogram is faster, nearly independent of input vector size.  This will of
   course depend most critically on the amount of memory you have.  

   3. HISTOGRAM slows down rapidly with increasing sparseness (just more memory
   required).

   For fun, I also simulated the Social Security number test, with 1000x1000 number
   distributed randomly between 0 and 999 99 9999:

   SORT Method: 
   Average Time ===>    0.0045424044
   ARRAY Method: 
   Average Time ===>      0.61496135
   HISTOGRAM Method: 
   % Array requires more memory than IDL can address.
   % Execution halted at:  IND_INT3           27

   As I noted, the histogram will of course fail on trying to allocate the huge
   memory it needs for all those zeroes!  ARRAY only squeaked by, and increasing
   the test to 10,000 SSN's yields:

   SORT Method: 
   Average Time ===>     0.059468949
   ARRAY Method: 
   % Unable to allocate memory: to make array.

   and even the array method fails.  How high can sort go?  It was happy to do an
   index intersection of 1 million x 1 million SSN's ( finding 41734 overlapping
   indices) in 10 seconds.  

   Just to be fair to ARRAY, if a and b are of very different sizes, and one of
   them is quite small (say less than 5 elements or so), it can dominate over SORT
   and HISTOGRAM.

   So, if you want a generic solution which works in the same n log(n) time using a
   fixed amount of memory for any type of integer input data, sparse or not, use
   SORT.  If you know your data is non-sparse (better than 1 in 10 say), you can
   see a speedup of a few with HISTOGRAM.   If you are looking for the points of
   intersection in a huge array of a small set of values, you might consider
   ARRAY.  If you want to do strings or floats or other types of data, you cannot
   use HISTOGRAM.  And if you want information for repeated values in the same
   input vector, you're stuck with ARRAY.

   As always, your results will vary with individual machine and operating
   systems.  Be sure to test using your data and computer if you need to optimize
   for speed.

   JD

   From: Mark Fardal (fardal@weka.astro.umass.edu)
   Subject: Re: matching lists 
   Newsgroups: comp.lang.idl-pvwave
   Date: 2000/03/13 

   Hi,

   Somehow I knew that if I mentioned HISTOGRAM, that would stir up some
   real IDL programmers.  :->

   J.D.'s sort method seems like the winner.  The modest speed advantage
   of the histogram method in certain cases is not important.  If you are
   in a situation where just matching up the elements is the limitation,
   you are probably going to be in trouble doing any analysis with them
   (let alone reading them in).

   The problem of repeated elements, which is the only advantage of 
   WHERE_ARRAY, is not of any concern, at least to me.  The point of 
   the key variables a and b is that they are supposed to be unique 
   identifiers.  I would just like the routine not to break completely 
   in case the same element was copied into the arrays twice.  The
   sort method does fine in that respect (finds the last of the 
   duplicate elements in a and the first in b).

   The only flaw with the sort method is that sooner or later RSI is
   going to break its own SORT function, just like it does with all of
   its other code...

   --The standard where_array, as posted a few years back, and modified slightly for
   --the case of the null intersection, is attached.  It will work with floating
   --point and other data types also.  It works by inflating the vectors input to 2-d
   --and testing for equality in one go.  It will also handle the case of repeated
   --entries. 

   Hope WHERE_ARRAY does not become "standard", since it's clearly inferior
   to the sort method.

   For completeness, using the sort method inside the calling sequence 
   I originally posted would look like:

   pro listmatch, a, b, a_ind, b_ind
     flag=[replicate(0b,n_elements(a)),replicate(1b,n_elements(b))]
     s=[a,b]
     srt=sort(s)
     s=s[srt] & flag=flag[srt]
     wh=where(s eq shift(s,-1) and flag ne shift(flag, -1),cnt)
     if cnt ne 0 then begin
       a_ind = srt[wh]
       b_ind = srt[wh+1] - n_elements(a)
     endif else begin
       a_ind = -1 
       b_ind = -1 
       return
     endelse
   end

   Mark Fardal
   UMass

   From: J.D. Smith (jdsmith@astro.cornell.edu)
   Subject: Re: matching lists 
   Newsgroups: comp.lang.idl-pvwave
   Date: 2000/03/13 

   By the way, I found an implementation I had mentioned a while back on the news
   group but had forgotten about from the nasa lib called "match" which does pretty
   much the same thing.  It's probably less efficient, since it uses an auxiliary
   list of indices in addition to the flag vector, instead of just using the sort()
   results directly, and performs a few more "where" tests as a result.  But a
   similar idea, written first in 1986!  Match() is also more immune to changes in
   sort() than my routine, as a result of carrying around these additional index
   arrays.

   Take a look.

   JD

[Return to IDL Programming Tips]