Reading numpy structured arrays from a text file
Numpy has a very nice feature: structured arrays, that is, arrays whose rows have some structure and can store a different type of data in each column.
>>> import numpy as np
>>> arr = np.zeros(10, dtype=[('id', np.uint16),
...                           ('position', np.dtype('3float32')),
...                           ('momentum', np.dtype('3float32'))])
We have defined a structured array in which each row stores the id of a particle (unsigned int), its position (three floats) and its momentum (again three floats).
You can easily select from this array:
>>> arr['position']
>>> arr[arr['id'] == 1]['position']
This is a nice format because:
- Your data has structure. No more off-by-one errors: particle position is labeled.
- Very easy to load from binary files
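To illustrate the binary-file point, here is a minimal sketch of a round trip through `tofile` and `np.fromfile`. The dtype mirrors the example above; the file name and the values are made up for the demonstration.

```python
import os
import tempfile

import numpy as np

# Hypothetical particle layout, mirroring the example above
particle_dtype = np.dtype([
    ("id", np.uint16),
    ("position", np.float32, (3,)),
    ("momentum", np.float32, (3,)),
])

original = np.zeros(4, dtype=particle_dtype)
original["id"] = np.arange(4)
original["position"][:, 0] = 1.5

# Write the raw records to a temporary file and read them back;
# the dtype alone tells numpy how to cut the bytes into records
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "particles.bin")
    original.tofile(path)
    loaded = np.fromfile(path, dtype=particle_dtype)
```

Note that `tofile` stores no metadata, so the reader has to know the dtype (and byte order) in advance.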
Loading from text files is an entirely different matter, because writing to such arrays is kind of a pain.
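A small sketch of where the pain comes from, using the particle dtype from the top of the post (the values are made up): a row has to be assigned as a tuple nested exactly like the dtype, while a parsed text line is just a flat sequence of numbers.

```python
import numpy as np

arr = np.zeros(2, dtype=[
    ("id", np.uint16),
    ("position", np.float32, (3,)),
    ("momentum", np.float32, (3,)),
])

# A row must be assigned as a tuple nested exactly like the dtype
arr[0] = (7, (0.0, 1.0, 2.0), (3.0, 4.0, 5.0))

# A flat row of seven numbers, which is what a parsed text line gives,
# does not match the three-field structure
flat_row = (7, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0)
try:
    arr[1] = flat_row
    flat_assignment_ok = True
except (ValueError, TypeError):
    flat_assignment_ok = False
```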
My requirements were:
- Array structure is the same as source file structure (order of fields is the same)
- Array structure is defined in a single place only: the dtype definition
- Should be fast, which means no copying of large arrays.
Solution is to:
- Read the file line by line, parsing the contents into an unstructured array.
- Create a structured view
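The idea can be sketched on a tiny, hypothetical two-field layout before looking at the real dtype. The flat dtype is written by hand here, and the input lines are made up.

```python
import numpy as np

# Hypothetical layout, much smaller than the real one
nested_dtype = np.dtype([
    ("time", np.float32),
    ("position", np.float32, (3,)),
])
# Same bytes per row, but every scalar is its own field (f0..f3)
flat_dtype = np.dtype("f4, f4, f4, f4")

lines = ["0.5 1 2 3", "1.5 4 5 6"]  # made-up input lines

# Fill the flat array row by row from the parsed text
flat = np.zeros(len(lines), dtype=flat_dtype)
for i, line in enumerate(lines):
    flat[i] = tuple(float(x) for x in line.split())

# Reinterpret the same memory with the nested dtype; no data is copied
nested = flat.view(nested_dtype)
```

The view only works because both dtypes have the same item size and the fields come in the same order, which is why the flat dtype should be derived from the structured one rather than typed by hand.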
Actual dtype used:
    URQMD_DATA_DTYPE = [
        ("time", np.float32),
        ("position", np.dtype("3float32")),
        ("energy", np.float32),
        ("momentum", np.dtype("3float32")),
        ("mass", np.float32),
        ("particle_type", np.float32),
        ("additional", np.dtype("5int32")),
    ]
Helper function that takes a structured dtype and turns it into a flat dtype: the same scalar values in the same order, but one field per scalar and no subarrays:
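    def serialize_dtype(dt):
        dt = np.dtype(dt)
        newdt = []
        for item in dt.descr:
            if len(item) == 2:
                # Scalar field: (name, type string)
                name, type_str = item
                count = 1
            else:
                # Subarray field: (name, type string, shape)
                name, type_str, shape = item
                if len(shape) > 1:
                    raise ValueError("only one-dimensional subarrays are supported")
                count = shape[0]
            # Repeat the scalar type once per element
            for ii in range(count):
                newdt.append(type_str)
        return np.dtype(", ".join(newdt))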
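The helper relies on `dtype.descr`, which yields one entry per field. A quick sketch on a hypothetical two-field dtype shows the two entry shapes it has to handle and how the number of scalar slots follows from them.

```python
import numpy as np

dt = np.dtype([
    ("time", np.float32),
    ("position", np.float32, (3,)),
])

# One entry per field: (name, type string), plus a shape for subarrays
for entry in dt.descr:
    print(entry)

# Count how many scalar slots the flattened dtype needs
n_scalars = sum(entry[2][0] if len(entry) == 3 else 1 for entry in dt.descr)
```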
Here frame is a list of lines from the text file.
    # Create an array without nested structure
    parsed = np.zeros(len(frame), dtype=serialize_dtype(URQMD_DATA_DTYPE))
    for ii, line in enumerate(frame):
        # Parse the line, ignoring whether a value is a float or an int
        data = [float(x) for x in line.split()]
        # Now numpy will convert the single row to the proper types
        parsed[ii] = tuple(data)
    # Create a structured view (no copy!)
    parsed = parsed.view(URQMD_DATA_DTYPE)
Sounds simple, but it took me some time to get it right.