Re: Faster way to search/split a string? [message #78448] |
Thu, 24 November 2011 06:41 |
rjp23
Messages: 97 Registered: June 2010
|
Member |
|
|
On Nov 24, 2:16 pm, Rob <rj...@le.ac.uk> wrote:
> On Nov 24, 9:15 am, wlandsman <wlands...@gmail.com> wrote:
>
>> Some thoughts (though I am not certain I understand the situation):
>
>> 1. If you don't need regular expressions then I believe that using STRPOS() is quicker than using STREGEX().
>> 2. If the ID is always in the first 20 characters of ALL_ROWS then I would created a new vector row_id = strmid(all_rows,0,20) and search on that.
>> 3. If the ID is always exactly 20 characters, you could use a program likehttp://idlastro.gsfc.nasa.gov/ftp/pro/misc/match.prototh en find the matching indices in row_id and id. This is similar to David's suggestion of sorting and using VALUE_LOCATE.
>> 4. You want to make only a single call to STRSPLIT() for all 10,000 rows. Since IDL V8.0, STRSPLIT returns a list data type when supplied with an array (since in principle each string element could have a different number of "columns"). If you have an earlier version of IDL -- or if this capability is not available in GDL, then I would use the vector capability of STRMID (http://www.idlcoyote.com/code_tips/strmidvec.html)
>
>> --Wayne
Ah after re-reading #4 a few times I realised it's what I was asking
strmid(variable_names, 0, transpose(strpos(variable_names, '_')))
Thanks :-)
|
|
|
Re: Faster way to search/split a string? [message #78450 is a reply to message #78448] |
Thu, 24 November 2011 06:16  |
rjp23
Messages: 97 Registered: June 2010
|
Member |
|
|
On Nov 24, 9:15 am, wlandsman <wlands...@gmail.com> wrote:
> Some thoughts (though I am not certain I understand the situation):
>
> 1. If you don't need regular expressions then I believe that using STRPOS() is quicker than using STREGEX().
> 2. If the ID is always in the first 20 characters of ALL_ROWS then I would created a new vector row_id = strmid(all_rows,0,20) and search on that.
> 3. If the ID is always exactly 20 characters, you could use a program likehttp://idlastro.gsfc.nasa.gov/ftp/pro/misc/match.proto then find the matching indices in row_id and id. This is similar to David's suggestion of sorting and using VALUE_LOCATE.
> 4. You want to make only a single call to STRSPLIT() for all 10,000 rows. Since IDL V8.0, STRSPLIT returns a list data type when supplied with an array (since in principle each string element could have a different number of "columns"). If you have an earlier version of IDL -- or if this capability is not available in GDL, then I would use the vector capability of STRMID (http://www.idlcoyote.com/code_tips/strmidvec.html)
>
> --Wayne
Thanks for that. Doing #2 made a huge difference along with David's
suggestion.
On a related note but a different problem (and I know people aren't
GDL experts here so apologies) I'm finding that strsplit in GDL is
much slower than in IDL. Has anyone come across this before? I've
tried searching the GDL help but it's obviously not as extensive as
here.
I also just wanted to check that I'm doing this the optimum way for
IDL.
If I have a string array like: array=[abc_001, abc_002, abc_003,
bcdef_001, bcdef_002, cdf_001_001, cdf_001_002]
I've assumed the fastest way to get the start section characters is to
do:
for loop=0, n_elements(array)-1 do begin
start_section[loop]=(strsplit(array[loop], '_', /extract))[0]
endfor
or is there a vectorised way to use strsplit in IDL? (limited to IDL 7
and/or GDL)
|
|
|
Re: Faster way to search/split a string? [message #78454 is a reply to message #78450] |
Thu, 24 November 2011 01:15  |
wlandsman
Messages: 743 Registered: June 2000
|
Senior Member |
|
|
Some thoughts (though I am not certain I understand the situation):
1. If you don't need regular expressions then I believe that using STRPOS() is quicker than using STREGEX().
2. If the ID is always in the first 20 characters of ALL_ROWS then I would created a new vector row_id = strmid(all_rows,0,20) and search on that.
3. If the ID is always exactly 20 characters, you could use a program like
http://idlastro.gsfc.nasa.gov/ftp/pro/misc/match.pro to then find the matching indices in row_id and id. This is similar to David's suggestion of sorting and using VALUE_LOCATE.
4. You want to make only a single call to STRSPLIT() for all 10,000 rows. Since IDL V8.0, STRSPLIT returns a list data type when supplied with an array (since in principle each string element could have a different number of "columns"). If you have an earlier version of IDL -- or if this capability is not available in GDL, then I would use the vector capability of STRMID ( http://www.idlcoyote.com/code_tips/strmidvec.html )
--Wayne
|
|
|
Re: Faster way to search/split a string? [message #78460 is a reply to message #78454] |
Wed, 23 November 2011 10:21  |
David Fanning
Messages: 11724 Registered: August 2001
|
Senior Member |
|
|
Vincent Sarago writes:
> I'm not sure to understand, maybe with examples it will be easier to me.
>
>
> tmp = stregex(all_rows, '^[a-zA-Z0-9]{20}',/extract) ; Array of all IDs
>
> test = uniq[sort(tmp)] ; Array determining witch ID is uniq
>
> for ii = 0, n_elements(test) - 1 do begin
>
> id = tmp[test[ii]]
>
> test = where(tmp eq id, count)
> if count ne 0 then begin
> tmp2 = strsplit(all_rows[test], '',/extract) ; Array of ID + Other but split
>
> void = all_rows[test] ; all row for uniq ID
> subset = stregex(b[ii], string(id[ii], format='("[^",a,"].+")'), /extract)
> ;do what you need to do here
>
>
>
> endif
>
> endfor
Yeah, sorry. Big holiday coming up and company arriving shortly.
Maybe next week. :-)
Cheers,
David
--
David Fanning, Ph.D.
Fanning Software Consulting, Inc.
Coyote's Guide to IDL Programming: http://www.idlcoyote.com/
Sepore ma de ni thui. ("Perhaps thou speakest truth.")
|
|
|
Re: Faster way to search/split a string? [message #78462 is a reply to message #78460] |
Wed, 23 November 2011 10:17  |
Vincent Sarago
Messages: 34 Registered: September 2011
|
Member |
|
|
I'm not sure to understand, maybe with examples it will be easier to me.
tmp = stregex(all_rows, '^[a-zA-Z0-9]{20}',/extract) ; Array of all IDs
test = uniq[sort(tmp)] ; Array determining witch ID is uniq
for ii = 0, n_elements(test) - 1 do begin
id = tmp[test[ii]]
test = where(tmp eq id, count)
if count ne 0 then begin
void = all_rows[test] ; all row for uniq ID
subset = stregex(void, string(id, format='("[^",a,"].+")'), /extract) ; only value (without ID)
;do what you need to do here
tmp2 = strsplit(subset, '',/extract)
endif
endfor
|
|
|
Re: Faster way to search/split a string? [message #78471 is a reply to message #78462] |
Wed, 23 November 2011 08:02  |
Jeremy Bailin
Messages: 618 Registered: April 2008
|
Senior Member |
|
|
On 11/23/11 7:13 AM, Rob wrote:
> I was hoping that someone might have a cleverer way of approaching
> this problem.
>
> The following command is the bootleneck in my code:
>
> row_of_data=strsplit(all_rows[(where(stregex(all_rows, id, /boolean)
> eq 1))[0]],' ', /extract)
>
> I have a large text file with lots of columns of data (which I don't
> know exactly what the columns are until I've read them in). There are
> then say 10000 rows of this data.
>
> This is all read into one large string array (all_rows) which contains
> each row as a single very long string.
> The first 20 characters of the row contain a unique id which I need to
> search the rows for and then extract the entire matching row. This row
> then needs to be split up into it's columns (space delimited).
>
> Hopefully that all makes sense.
>
> The problem is having to do this 10000 times, (once for for each id)
> is very slow and the time to do all of the other stuff done in the
> code, reading, writing, some maths, etc is being dominated by this one
> line.
>
> Any thoughts or suggestions?
>
> Cheers
>
> Rob
>
> p.s. This needs to be GDL compatible as well which I think most
> solutions would be anyway.
This screams like something that should be outsourced to an external
perl or python script. Dealing with strings is not IDL's forte, so why
not use a better tool for that part if it's the bottleneck?
-Jeremy.
|
|
|
Re: Faster way to search/split a string? [message #78473 is a reply to message #78471] |
Wed, 23 November 2011 07:37  |
rjp23
Messages: 97 Registered: June 2010
|
Member |
|
|
On Nov 23, 2:40 pm, David Fanning <n...@dfanning.com> wrote:
> I guess I would try just reading the IDs. Then sort the IDs,
> and use Value_Locate to find all the unique IDs at once. Then
> you could subset your string array and do the string extraction.
> Hard to tell if this would be faster.
>
>> p.s. This needs to be GDL compatible as well which I think most
>> solutions would be anyway.
>
> This is like telling most of us "be sure it works in the
> Martian atmosphere." How the hell would I know!? :-)
>
> Cheers,
>
> David
>
Thanks, I'll give it a go and see.
|
|
|
Re: Faster way to search/split a string? [message #78478 is a reply to message #78473] |
Wed, 23 November 2011 06:40  |
David Fanning
Messages: 11724 Registered: August 2001
|
Senior Member |
|
|
Rob writes:
> I was hoping that someone might have a cleverer way of approaching
> this problem.
>
> The following command is the bootleneck in my code:
>
> row_of_data=strsplit(all_rows[(where(stregex(all_rows, id, /boolean)
> eq 1))[0]],' ', /extract)
>
> I have a large text file with lots of columns of data (which I don't
> know exactly what the columns are until I've read them in). There are
> then say 10000 rows of this data.
>
> This is all read into one large string array (all_rows) which contains
> each row as a single very long string.
> The first 20 characters of the row contain a unique id which I need to
> search the rows for and then extract the entire matching row. This row
> then needs to be split up into it's columns (space delimited).
>
> Hopefully that all makes sense.
>
> The problem is having to do this 10000 times, (once for for each id)
> is very slow and the time to do all of the other stuff done in the
> code, reading, writing, some maths, etc is being dominated by this one
> line.
>
> Any thoughts or suggestions?
I guess I would try just reading the IDs. Then sort the IDs,
and use Value_Locate to find all the unique IDs at once. Then
you could subset your string array and do the string extraction.
Hard to tell if this would be faster.
> p.s. This needs to be GDL compatible as well which I think most
> solutions would be anyway.
This is like telling most of us "be sure it works in the
Martian atmosphere." How the hell would I know!? :-)
Cheers,
David
--
David Fanning, Ph.D.
Fanning Software Consulting, Inc.
Coyote's Guide to IDL Programming: http://www.idlcoyote.com/
Sepore ma de ni thui. ("Perhaps thou speakest truth.")
|
|
|