Data Processing with Python: Part 2

As I’ve said, I’ve been doing tons and tons of tabular data manipulation using Python in the past few years, and I’m sharing some of the patterns I’ve developed. Please look at Part 1 to see some of the more basic stuff, and review the rules of the road. Below the fold, we will be talking about filtering data by column and row and doing processing without loading the whole file into memory.

Filtering by Column

Sometimes you only need a couple of columns. Let’s say you have an address book and want to map a person’s name to their address, but don’t care about phone numbers and whatever else. You can do that like this:

import csv
def loadAddresses(fileName):
    reader = csv.reader(open(fileName,'rb'),delimiter=",")
    header = reader.next()
    m = dict([(header[i], i) for i in range(len(header))])
    result = dict([(l[m['name']],l[m['address']]) for l in reader])
    return result

The first thing to notice is that we never load the entire file into memory at once. We read the header, build a map from column name to index, and then iterate over the rest of the file with a list comprehension. The comprehension generates a (name, address) tuple for each row, and the dict constructor assembles those pairs into the result, with the ‘name’ column as keys and the ‘address’ column as values. So if we have, say, the white pages in a file, with tons of rows and lots and lots of columns, we never hold more than two rows of data (the header and the current row) in memory at once. Need to do something more complex? Go ahead and use a plain for loop instead:

import csv
def loadAddresses(fileName):
    reader = csv.reader(open(fileName,'rb'),delimiter=",")
    header = reader.next()
    m = dict([(header[i], i) for i in range(len(header))])
    result = {}
    for line in reader:
        result[line[m['name']]] = line[m['address']]
    return result

This does exactly the same thing, but is a little more explicit.
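
If you want to try the pattern without a file on disk, here is a small self-contained sketch. The sample rows are made up, io.StringIO stands in for an open file, and it uses next(reader) rather than reader.next() so it also runs on Python 3:

```python
import csv
import io

# A tiny in-memory "white pages" sample (hypothetical data,
# just for illustration); io.StringIO stands in for an open file.
sample = io.StringIO(
    "name,address,phone\n"
    "Ada,1 Main St,555-0001\n"
    "Bob,2 Elm St,555-0002\n"
)

reader = csv.reader(sample, delimiter=",")
header = next(reader)  # next(reader) instead of reader.next() on Python 3
# map each column name to its index in the row
m = dict([(header[i], i) for i in range(len(header))])
# build {name: address}, one row at a time
result = dict([(l[m['name']], l[m['address']]) for l in reader])

print(result)  # {'Ada': '1 Main St', 'Bob': '2 Elm St'}
```

The phone column is read but never stored, which is the whole point: only the two columns we asked for survive.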

Filtering by Row

Let’s say that you have the white pages, and want to build a dict like above, but you only need the folks living in Manhattan. This is easily accomplished by adding a condition to the list comprehension, like this:

import csv
def loadAddresses(fileName):
    reader = csv.reader(open(fileName,'rb'),delimiter=",")
    header = reader.next()
    m = dict([(header[i], i) for i in range(len(header))])
    result = dict([(l[m['name']],l[m['address']])
                   for l in reader if l[m['city']] == 'Manhattan'])
    return result

Again, if you prefer something a little less compact, you can do this:

import csv
def loadAddresses(fileName):
    reader = csv.reader(open(fileName,'rb'),delimiter=",")
    header = reader.next()
    m = dict([(header[i], i) for i in range(len(header))])
    result = {}
    for line in reader:
        if line[m['city']] == 'Manhattan':
            result[line[m['name']]] = line[m['address']]
    return result
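
Here is the same filter run against a few made-up rows, so you can see exactly which entries survive (the sample data is hypothetical, and next(reader) is used so the sketch also runs on Python 3):

```python
import csv
import io

# Hypothetical sample rows; only the Manhattan entries should survive.
sample = io.StringIO(
    "name,address,city\n"
    "Ada,1 Main St,Manhattan\n"
    "Bob,2 Elm St,Brooklyn\n"
    "Eve,3 Oak St,Manhattan\n"
)

reader = csv.reader(sample, delimiter=",")
header = next(reader)
m = dict([(header[i], i) for i in range(len(header))])
# the trailing 'if' clause drops non-Manhattan rows as we stream
result = dict([(l[m['name']], l[m['address']])
               for l in reader if l[m['city']] == 'Manhattan'])

print(result)  # {'Ada': '1 Main St', 'Eve': '3 Oak St'}
```

Bob, being in Brooklyn, never makes it into the result, and we still only ever hold one row at a time.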

Type Mapping

My final trick for the day combines row filtering with the ability to map data to a particular type. This is especially useful for integers, floats, and the like, and works with any callable that accepts a string as input. One caveat: bool is a trap, because bool('False') is True; any non-empty string is truthy, so booleans usually need a small helper function instead.

import csv
def loadData(fileName):
    reader = csv.reader(open(fileName,'rb'),delimiter=",")
    header = reader.next()
    m = dict([(header[i], i) for i in range(len(header))])
    # This is a set of x,y points with an id, name,
    # and visibility attribute.
    cols = {
        "id": int,
        "x": float,
        "y": float,
        "name": str,
        # careful: bool("False") is True, since any
        # non-empty string is truthy
        "visible": bool,
    }
    result = {}
    for line in reader:
        # make a dict for each line, using the constructor in
        # the cols dict to build the right object.
        lineMap = dict([(c, cols[c](line[m[c]])) for c in cols.keys()])
        result[lineMap['id']] = lineMap
    return result

We’re making JSON-like dict objects again, and using the cols dict to map the columns we’re interested in to the datatypes that we want them to be. Since Python has functions as objects, we can just pass them around and call them once they’ve been dereferenced. That’s what we’re doing at “cols[c](value)”, dereferencing the function and calling it.

These techniques can be applied at the same time, for instance, to filter out data points that aren’t visible.
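
As a sketch of both tricks together, here is a hypothetical points file where only visible rows are kept. The parseBool helper is mine, added because plain bool would treat the string 'False' as true, and the sketch uses next(reader) so it also runs on Python 3:

```python
import csv
import io

# Hypothetical points file; 'visible' is stored as "True"/"False" text.
sample = io.StringIO(
    "id,x,y,name,visible\n"
    "1,0.5,1.5,near,True\n"
    "2,3.0,4.0,far,False\n"
)

# Plain bool() won't do here: bool("False") is True, because any
# non-empty string is truthy. A tiny helper parses the text instead.
def parseBool(s):
    return s == 'True'

cols = {
    "id": int,
    "x": float,
    "y": float,
    "name": str,
    "visible": parseBool,
}

reader = csv.reader(sample, delimiter=",")
header = next(reader)
m = dict([(header[i], i) for i in range(len(header))])
result = {}
for line in reader:
    # type mapping and row filtering in the same pass
    lineMap = dict([(c, cols[c](line[m[c]])) for c in cols.keys()])
    if lineMap['visible']:
        result[lineMap['id']] = lineMap

print(sorted(result.keys()))  # [1]
```

Point 2 is parsed, found invisible, and thrown away before it ever reaches the result dict.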
