I’ve been doing tons and tons of tabular data manipulation using Python in the past few years, and it’s high time I shared some of the interesting patterns I’ve developed. I’m sure others have managed similar things, but I’m not about to do a literature search right now.
First, some rules of the road for dealing with data easily in Python:
- Read and write data using character-delimited text, like CSV or TSV. You can use Excel or other formats, but these tricks work best with text data. Text also has the advantage that you don’t have to load the entire file into memory at once: very large files can be analyzed quickly, line by line.
- Use the csv module from Python’s standard library.
- Name your columns. Take a moment to save your sanity and use the first line of the file to give a name to each column. You’ll thank yourself later; nearly all of these tricks rely on it.
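To make the line-by-line point concrete, here’s a minimal sketch of streaming a large CSV one row at a time instead of loading it all at once. The function name, file name, and column layout are invented for illustration:

```python
import csv

def sumColumn(fileName, columnIndex):
    # stream the file row by row; only one line is ever in memory
    total = 0.0
    with open(fileName, newline='') as f:
        reader = csv.reader(f, delimiter=',')
        next(reader)  # skip the header line (rule #3)
        for row in reader:
            total += float(row[columnIndex])
    return total
```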
The Quick Load
The first trick gets your data loaded as an array of arrays. It’s mostly useful for smaller files (or larger memory), but it gets you working very quickly and doesn’t immediately rely on rule #3.
import csv

def loadFile(fileName):
    with open(fileName, newline='') as f:
        result = [l for l in csv.reader(f, delimiter=',')]
    return result
Note that I’m putting everything into functions. This helps a bit, since you might want to re-use some of this code in other places. I’m also using what’s called a list comprehension. The csv library, when you make a CSV reader, returns an iterable object, which means we can use for loops and list comprehensions over it. What’s a list comprehension? Imagine you had a list of numbers and wanted to call a function on each of them, creating a new list. You can do it very compactly in Python using list comprehensions:
# l is a list of numbers, foo() does something to a number
def fooIt(l):
    return [foo(x) for x in l]
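For instance, with a made-up foo() that just squares its input:

```python
def foo(x):
    # toy example: square a number
    return x * x

def fooIt(l):
    # build a new list by applying foo to each element
    return [foo(x) for x in l]

fooIt([1, 2, 3])  # returns [1, 4, 9]
```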
We’re going to be doing a lot with list comprehensions, so I’ll be explaining the really useful parts as I go. Now, what if you do have a header? We can probably do one better and make a list of dicts to work with:
import csv

def loadFile(fileName):
    with open(fileName, newline='') as f:
        data = [l for l in csv.reader(f, delimiter=',')]
    header = data[0]
    result = [dict([(header[i], l[i]) for i in range(min(len(header), len(l)))])
              for l in data[1:]]
    return result
Whoa, what was that!? There’s a lot to unpack there (heh) that’s really useful. Let’s start from the inside out. Say we have a line and a header and want to make tuples (an immutable list, which means a tuple is also hashable and usable as a key in a dict) that are header/value pairs for each column. Since we’re working with two lists at once, we can’t simply apply a list comprehension over one or the other of them, as we’d have no way to index into the other one. Instead we use range() to create a list of integers that serve as indices into both. Since we’re coding defensively, we take the minimum of the lengths of the header and the line so that we don’t go out of bounds:
def makeTuples(line, header):
    return [(header[i], line[i]) for i in range(min(len(header), len(line)))]
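A quick sketch of what makeTuples produces, with an invented toy line and header:

```python
def makeTuples(line, header):
    # pair each header name with the value in the same column,
    # stopping at the shorter of the two lists
    return [(header[i], line[i]) for i in range(min(len(header), len(line)))]

pairs = makeTuples(['alice', '30'], ['name', 'age'])
# pairs is [('name', 'alice'), ('age', '30')]
```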
Why would we want to put header and line value pairs into tuples? Because a list of two-tuples can be passed to dict() to create a dict, which effectively gives us a dict comprehension. So by doing:
import csv

def loadFile(fileName):
    with open(fileName, newline='') as f:
        data = [l for l in csv.reader(f, delimiter=',')]
    header = data[0]
    result = [dict(makeTuples(l, header)) for l in data[1:]]
    return result
We make a list of dicts, which basically kinda-sorta works like a list of dynamic objects, and that’s pretty handy in Python. In fact, in order to render your data to JSON, all you have to do is this:
import json

def toJSON(data):
    return json.dumps(data)
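Putting it all together, here’s a minimal end-to-end sketch: write a tiny CSV, load it as a list of dicts using the same pattern as above, and dump it to JSON. The file name and contents are made up:

```python
import csv, json, os, tempfile

# invented sample data for illustration
path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
with open(path, 'w', newline='') as f:
    f.write('name,age\nalice,30\n')

# load the CSV as a list of dicts
with open(path, newline='') as f:
    data = [l for l in csv.reader(f, delimiter=',')]
header = data[0]
records = [dict([(header[i], l[i]) for i in range(min(len(header), len(l)))])
           for l in data[1:]]

print(json.dumps(records))
# prints [{"name": "alice", "age": "30"}]
```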