Data Processing with Python: Part 1

I’ve been doing tons and tons of tabular data manipulation using Python in the past few years, and it’s high time I shared some of the interesting patterns I’ve developed. I’m sure others have managed similar things, but I’m not about to do a literature search right now.

First, some rules of the road for dealing with data easily in Python:

  1. Read and write data using character-delimited text, like CSV or TSV. You can use Excel or others, but these tricks work best with text data. It also has the advantage of not having to load the entire file into memory at once. Very large files can be analyzed quickly line by line, if you can.
  2. Use the csv library from python:
    import csv
  3. Name your columns. Take a chance to save your sanity and use a line to give a name to each column. You’ll thank yourself later. Also, nearly all of these tricks rely on it.

The Quick Load

The first trick gets your data loaded as an array of arrays. It’s mostly useful for smaller files (or larger memory), but it gets you working very quickly and doesn’t immediately rely on rule #3.

import csv
def loadFile(fileName):
    result = [l for l in csv.reader(open(fileName,'rb'),delimiter=",")]
    return result

Note that I’m putting everything into functions. This also helps a bit since you might want to re-use some of this code in other places. I’m using what’s called a list comprehension. The CSV library, when you make a CSV reader, returns an iterable object, which means we can apply for each and list comprehensions over them. What’s a list comprehension? Imagine you had a list of numbers and wanted to call a function for each of them, creating a new array. You can do it very compactly in Python using list comprehensions:

import csv
# l is a list of numbers, foo() does something to numbers
def fooIt(l):
    return [foo(x) for x in l]

We’re going to be doing a lot with list comprehensions, so I’ll be explaining the really useful parts as I go. Now, what if you do have a header? We can probably do one better and make a list of dicts to work with:

import csv
def loadFile(fileName):
    data = [l for l in csv.reader(open(fileName,'rb'),delimiter=',')]
    header = data[0]
    result = [dict([(header[i],l[i])
                    for i in range(min(len(header),len(l)))])
                   for l in data[1:]]
    return result

Whoa, what was that!? There’s a lot to unpack there (heh) that’s really useful. Lets start from the inside out. Let’s say we have a line and a header and want to make tuples (an immutable list, which means that it’s also identifiable and usable as keys in a dict) that are header/value pairs for each column. Since we’re working with two arrays, we can’t actually simply apply a list comprehension over one or the other of them, as we won’t have a way to index into the other one. Instead we use range() to create a new list of integers that serve as an index list into the array. Since we’re coding defensively, we take the minimum of the two lengths of the header and the line so that we don’t go out of bounds:

def makeTuples(line, header):
    return [(header[i],line[i]) for i in range(min(len(header),len(line)))]

Why would we want to put header and line value pairs into tuples? We can do this to effectively create a DICT comprehension, as lists of two-tuples can be passed to dict() to create a dict. So by doing:

import csv
def loadFile(fileName):
    data = [l for l in csv.reader(open(fileName,'rb'),delimiter=',')]
    header = data[0]
    result = [dict(makeTuples(header,l)) for l in data[1:]]
    return result

We make a list of dicts which basically kinda-sorta work like a list of dynamic objects, which is pretty handy in Python. In fact, in order to render your data to JSON, all you have to do is this:

import json
def toJSON(data):
    return json.dumps(data)

JSON is of course a very useful interchange format that is easily understood by Javascript and other programming languages. Next time I’ll show how to be more selective in the data we load, so that we can scale better in memory.

Next: Data Processing with Python: Part 2


2 thoughts on “Data Processing with Python: Part 1

  1. Pingback: Data Processing with Python: Part 2 | Subluminal Messages

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s