Extracting coordinates from a string


Consider the following: "MULTILINESTRING((10 10,10 40),(40 40,30 30,40 20,30 10))".
I want to transform this into: [[10,10],[10,40],[40,40],[30,30],[40,20],[30,10]].

My solution
I use the functions split() and replace() to format this. The code I end up with is messy and probably not the most efficient, e.g. my_str.split('((')[1].split('))')[1]...etc

Because I'm doing this on a huge dataset, I'm looking for an efficient way to do it.


Answer

If you're looking for clean code that doesn't do too much, I'd recommend a two-step process involving the re module:

  1. split your string into smaller chunks on commas using str.split
  2. for each chunk, extract coordinates with re.findall

For performance, I'd recommend pre-compiling the regex pattern with re.compile, since we'll be calling it repeatedly inside a loop.

>>> import re
>>> p = re.compile(r'\d+(?:\.\d+)?')
>>> [list(map(int, p.findall(x))) for x in mstring.split(',')]
[[10, 10], [10, 40], [40, 40], [30, 30], [40, 20], [30, 10]]

Note, mstring is your string data.
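
For a large dataset, the same two steps can be wrapped in a small function and applied to every row. This is only a sketch: the helper name parse_coords and the sample list rows are made up for illustration, and it assumes non-negative integer coordinates like the ones in the question.

```python
import re

# Compile once and reuse the pattern for every row.
NUMBER = re.compile(r'\d+(?:\.\d+)?')

def parse_coords(wkt):
    """Return [[x, y], ...] for every point in the string (hypothetical helper)."""
    # Each comma-separated chunk holds exactly one "x y" pair.
    return [list(map(int, NUMBER.findall(chunk))) for chunk in wkt.split(',')]

# Made-up sample rows standing in for the real dataset.
rows = [
    "MULTILINESTRING((10 10,10 40),(40 40,30 30,40 20,30 10))",
    "MULTILINESTRING((1 2,3 4))",
]
parsed = [parse_coords(row) for row in rows]
print(parsed[0])
# [[10, 10], [10, 40], [40, 40], [30, 30], [40, 20], [30, 10]]
```

Keeping the compiled pattern at module level means it is built once, no matter how many rows are processed.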


Details

\d+    # match one or more digits
(?:    # specify non-capturing group
\.     # literal period/decimal
\d+    # match one or more digits
)?     # optional

Semantically, this regex will match integers OR floats (Ajax1234's solution currently only accounts for integers, and is guaranteed to finish searching in fewer cycles).
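
One caveat: the pattern matches decimals, but int() raises ValueError on a string like '10.5'. If your data may contain floats, converting with float instead is one option. A quick sketch, where sample is a made-up input; note that the pattern does not capture a leading minus sign, so this assumes non-negative coordinates.

```python
import re

p = re.compile(r'\d+(?:\.\d+)?')

# Hypothetical input mixing integers and decimals.
sample = "MULTILINESTRING((10.5 10,10 40.25),(40 40,30 30))"

# float() accepts both '10' and '10.5', unlike int().
coords = [list(map(float, p.findall(x))) for x in sample.split(',')]
print(coords)
# [[10.5, 10.0], [10.0, 40.25], [40.0, 40.0], [30.0, 30.0]]
```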