comp.lang.idl-pvwave archive
Messages from Usenet group comp.lang.idl-pvwave, compiled by Paulo Penteado

Re: Efficiently perform histogram reverse indices like procedure on a string array? [message #80983] Thu, 26 July 2012 14:33
Craig Markwardt
On Wednesday, July 25, 2012 7:39:33 PM UTC-4, Bogdanovist wrote:
> I have an array of a data structure, one tag of which is a string identifier indicating which location the data belongs to. There are many thousands of data points, but only about a dozen or so unique locations.
>
> I make frequent use of the HISTOGRAM function with the reverse_indices in order to carve up data by some identifier, most commonly the time. In this case, I want to divide out the data by site efficiently. I can't use HISTOGRAM on strings, so I need some other approach. There are plenty of ways this can be done, but I'd like some views on the better and most efficient ways to do it.
>
> Take an example, say we have a simple string array
>
> foo=['a','b','c','b','b','a','a','c']
>
> To determine the list of unique strings we could do
>
> sfoo = foo[sort(foo)]
> print,sfoo[uniq(sfoo)]
>
> We can then repeatedly use WHERE to find the indices in the data array(s) corresponding to each site.
>
> Is there a quicker/better way to do this? Repeatedly calling WHERE seems inefficient (certainly HISTOGRAM is way faster when it is usable)

I prefer to do it slightly differently than your other suggestions.

I locate the breakpoints between different runs of strings like this,

ibreaks = where(sfoo[1:*] NE sfoo, ct)

This gives the interior breakpoints. In your case, ibreaks = [2,5], which are the last indices of the 'a' and 'b' runs in sfoo, i.e. the points where 'a' changes to 'b' and 'b' changes to 'c'. Usually I add this little bit of extra post-processing,

if ct EQ 0 then begin
  ibreaks = [0, n_elements(sfoo)]
endif else begin
  ibreaks = [0, ibreaks+1, n_elements(sfoo)]
endelse

You need that little extra 'if' statement to handle the case where you have only one unique string, so there are no breaks at all.

The start of the ith run is indexed by ibreaks[i], and the end of the ith run is indexed by ibreaks[i+1]-1, where i goes from 0 through n_elements(ibreaks)-2 (ibreaks has one more element than there are runs).

I.e. the ith run is given by sfoo[ibreaks[i]:ibreaks[i+1]-1]. Of course you can index back into the original array once you've done this, provided you kept the indices returned by SORT.

Craig
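
The run-boundary approach above can be sketched end to end as follows. This is a minimal sketch rather than Craig's exact code: the names isort, n, nruns and idx are illustrative, and the n GT 1 guard is an added assumption to cover a one-element input. It keeps the indices returned by SORT so that each run can be mapped back to the original foo. It is written as a main-level program (note the trailing end), to be saved to a file and run with .run.

```idl
; Sketch: group a string array into runs of equal values, keeping the
; sort permutation so each run maps back to the original array.
foo = ['a','b','c','b','b','a','a','c']

isort = sort(foo)        ; permutation that sorts foo (illustrative name)
sfoo  = foo[isort]
n     = n_elements(sfoo)

; Interior breakpoints: last index of every run except the final one.
; The n GT 1 guard is an addition, since sfoo[1:*] is not a valid
; range for a one-element array.
ct = 0
if n GT 1 then ibreaks = where(sfoo[1:*] NE sfoo, ct)

if ct EQ 0 then begin
  ibreaks = [0, n]
endif else begin
  ibreaks = [0, ibreaks+1, n]
endelse

; ibreaks now holds nruns+1 boundaries.
nruns = n_elements(ibreaks) - 1
for i = 0, nruns-1 do begin
  idx = isort[ibreaks[i]:ibreaks[i+1]-1]   ; indices into the original foo
  print, sfoo[ibreaks[i]], ': ', idx
endfor
end
```

For the example array, the 'a' group maps back to original indices 0, 5 and 6; the order within a group depends on how SORT arranges equal elements.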
Re: Efficiently perform histogram reverse indices like procedure on a string array? [message #80989 is a reply to message #80983] Thu, 26 July 2012 10:41
Jeremy Bailin
>> Use VALUE_LOCATE to find where in the list of unique indices the
>> elements belong to, and use that index as a number that you can run
>> HISTOGRAM on.
>>
>> (raise your hand everyone who saw that coming...)
>>
>> -Jeremy.
>
> Not me. I had no idea VALUE_LOCATE works on strings. Now that is cool!

Yup, it works on anything that can be sorted.

-Jeremy.
Re: Efficiently perform histogram reverse indices like procedure on a string array? [message #80990 is a reply to message #80989] Thu, 26 July 2012 10:30
ben.bighair
On Wednesday, July 25, 2012 10:17:16 PM UTC-4, Jeremy Bailin wrote:
> Use VALUE_LOCATE to find where in the list of unique indices the
> elements belong to, and use that index as a number that you can run
> HISTOGRAM on.
>
> (raise your hand everyone who saw that coming...)
>
> -Jeremy.

Not me. I had no idea VALUE_LOCATE works on strings. Now that is cool!
Re: Efficiently perform histogram reverse indices like procedure on a string array? [message #80994 is a reply to message #80990] Wed, 25 July 2012 19:17
Jeremy Bailin
On 7/25/12 9:09 PM, Bogdanovist wrote:
> I have an array of a data structure, one tag of which is a string identifier indicating which location the data belongs to. There are many thousands of data points, but only about a dozen or so unique locations.
>
> I make frequent use of the HISTOGRAM function with the reverse_indices in order to carve up data by some identifier, most commonly the time. In this case, I want to divide out the data by site efficiently. I can't use HISTOGRAM on strings, so I need some other approach. There are plenty of ways this can be done, but I'd like some views on the better and most efficient ways to do it.
>
> Take an example, say we have a simple string array
>
> foo=['a','b','c','b','b','a','a','c']
>
> To determine the list of unique strings we could do
>
> sfoo = foo[sort(foo)]
> print,sfoo[uniq(sfoo)]
>
> We can then repeatedly use WHERE to find the indices in the data array(s) corresponding to each site.
>
> Is there a quicker/better way to do this? Repeatedly calling WHERE seems inefficient (certainly HISTOGRAM is way faster when it is usable)

Use VALUE_LOCATE to find where in the list of unique indices the
elements belong to, and use that index as a number that you can run
HISTOGRAM on.

(raise your hand everyone who saw that coming...)

-Jeremy.
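
One way the VALUE_LOCATE suggestion might look in practice is sketched below. It is not code from the thread: the names ufoo, ind, h, ri and idx are illustrative. It assumes every element of foo appears in ufoo, which holds here because ufoo is built from foo itself, so VALUE_LOCATE returns an exact match for each element. It is written as a main-level program, to be saved to a file and run with .run.

```idl
; Sketch: map each string to its position in the sorted unique list,
; then use HISTOGRAM with REVERSE_INDICES on those integer positions.
foo = ['a','b','c','b','b','a','a','c']

sfoo = foo[sort(foo)]
ufoo = sfoo[uniq(sfoo)]           ; sorted unique strings: a, b, c
ind  = value_locate(ufoo, foo)    ; position of each element of foo in ufoo

; One bin per unique string.
h = histogram(ind, min=0, max=n_elements(ufoo)-1, reverse_indices=ri)

for j = 0, n_elements(ufoo)-1 do begin
  if ri[j+1] GT ri[j] then begin
    idx = ri[ri[j]:ri[j+1]-1]     ; indices into foo for the j-th unique string
    print, ufoo[j], ': ', idx
  endif
endfor
end
```

For the example array, h comes out as [3, 3, 2], one count per unique string, and a single HISTOGRAM call replaces the repeated calls to WHERE.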
Re: Efficiently perform histogram reverse indices like procedure on a string array? [message #80995 is a reply to message #80994] Wed, 25 July 2012 18:34
ben.bighair
On Wednesday, July 25, 2012 7:39:33 PM UTC-4, Bogdanovist wrote:
> I have an array of a data structure, one tag of which is a string identifier indicating which location the data belongs to. There are many thousands of data points, but only about a dozen or so unique locations.
>
> I make frequent use of the HISTOGRAM function with the reverse_indices in order to carve up data by some identifier, most commonly the time. In this case, I want to divide out the data by site efficiently. I can't use HISTOGRAM on strings, so I need some other approach. There are plenty of ways this can be done, but I'd like some views on the better and most efficient ways to do it.
>
> Take an example, say we have a simple string array
>
> foo=['a','b','c','b','b','a','a','c']
>
> To determine the list of unique strings we could do
>
> sfoo = foo[sort(foo)]
> print,sfoo[uniq(sfoo)]
>
> We can then repeatedly use WHERE to find the indices in the data array(s) corresponding to each site.
>
> Is there a quicker/better way to do this? Repeatedly calling WHERE seems inefficient (certainly HISTOGRAM is way faster when it is usable)

Hi,

You can convert your strings to unique numbers - it's a bit awkward - and then you may find that the spacing between populated bins makes the whole thing drag when you do the histogram. But here goes...

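foo = ['az','bs','cd','ba','ba','az','aa','c'] ; tricky strings
boo = strtrim(fix(byte(foo)),2) ; byte codes as digit strings; note the 'fix' in there
soo = strjoin(boo) ; collapse the first dimension: one digit string per element
noo = long(soo) ; each character adds 2-3 digits, so only short strings fit in a LONG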

There you go - numbers as unique as the strings you started with.

Cheers,
Ben
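
To illustrate the remark about sparse bins, here is a rough sketch of running HISTOGRAM directly on the numeric codes from the post above. The names h, ri, ibin, nbin and idx are illustrative and not from the thread. With the default bin size of 1, HISTOGRAM allocates a bin for every integer between the smallest and largest code, so for these eight strings it builds on the order of 10^5 bins, of which only six are populated. It is written as a main-level program, to be saved to a file and run with .run.

```idl
; Sketch: histogram the numeric string codes and visit only populated bins.
foo = ['az','bs','cd','ba','ba','az','aa','c']
noo = long(strjoin(strtrim(fix(byte(foo)),2)))   ; encoding from the post above

; Default binsize is 1: one bin per integer from min(noo) to max(noo).
h = histogram(noo, reverse_indices=ri)

ibin = where(h GT 0, nbin)        ; populated bins only
print, 'bins: ', n_elements(h), '  populated: ', nbin

for k = 0, nbin-1 do begin
  j = ibin[k]
  idx = ri[ri[j]:ri[j+1]-1]       ; indices into foo sharing this code
  print, foo[idx[0]], ': ', idx
endfor
end
```

Mapping the codes (or the strings themselves) to compact indices 0..N-1 before calling HISTOGRAM avoids the empty bins.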