Data Processing with Python: Part 2

As I’ve said, I’ve been doing tons and tons of tabular data manipulation in Python over the past few years, and I’m sharing some of the patterns I’ve developed. See Part 1 for the more basic material and for the rules of the road. Below the fold, we will cover filtering data by column and by row, and processing a file without loading all of it into memory.
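As a rough taste of what that can look like, here is a minimal sketch that filters by column and by row while streaming one line at a time. The function name `filter_csv`, the sample columns, and the in-memory `io.StringIO` streams are all assumptions made for illustration; a real script would pass open file objects instead.

```python
import csv
import io


def filter_csv(src, dst, keep_columns, keep_row):
    """Copy rows from src to dst, keeping only some columns and some rows.

    Rows are read and written one at a time, so the whole input is never
    held in memory. All names here are illustrative, not from the post.
    """
    reader = csv.DictReader(src)  # the first line supplies the column names
    writer = csv.DictWriter(
        dst,
        fieldnames=keep_columns,
        extrasaction="ignore",   # silently drop columns we did not ask for
        lineterminator="\n",
    )
    writer.writeheader()
    for row in reader:
        if keep_row(row):        # row filter: any function of the row dict
            writer.writerow(row)


# Hypothetical sample data standing in for a large file on disk.
src = io.StringIO(
    "name,city,amount\n"
    "Ada,London,10\n"
    "Grace,New York,25\n"
    "Linus,Helsinki,7\n"
)
dst = io.StringIO()
filter_csv(src, dst, ["name", "amount"], lambda row: int(row["amount"]) >= 10)
output = dst.getvalue()
print(output)
```

Because both the reader and the writer work row by row, memory use should stay roughly constant regardless of how large the input is.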


Data Processing with Python: Part 1

I’ve been doing tons and tons of tabular data manipulation using Python in the past few years, and it’s high time I shared some of the interesting patterns I’ve developed. I’m sure others have managed similar things, but I’m not about to do a literature search right now.

First, some rules of the road for dealing with data easily in Python:

  1. Read and write data as character-delimited text, such as CSV or TSV. You can use Excel or other formats, but these tricks work best with text data. Text also has the advantage that you don’t have to load the entire file into memory at once: very large files can be analyzed quickly, line by line, whenever the analysis allows it.
  2. Use the csv library from Python’s standard library:
    import csv
  3. Name your columns. Save your sanity and spend one line at the top of the file giving each column a name. You’ll thank yourself later. Also, nearly all of these tricks rely on it.
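Taken together, the three rules might look something like the sketch below. It is only an illustration: the column names, the sample data, and the helper `total_amount` are made up, and the in-memory `io.StringIO` stands in for a real file opened with `open(path, newline="")`.

```python
import csv
import io

# Hypothetical sample data; the first line names the columns (rule 3).
SAMPLE = (
    "name,city,amount\n"
    "Ada,London,10\n"
    "Grace,New York,25\n"
    "Linus,Helsinki,7\n"
)


def total_amount(lines):
    """Sum the 'amount' column, reading one row at a time."""
    reader = csv.DictReader(lines)  # uses the header line for the keys
    total = 0
    for row in reader:              # each row is a dict keyed by column name
        total += int(row["amount"])
    return total


result = total_amount(io.StringIO(SAMPLE))
print(result)  # 42
```

Referring to `row["amount"]` rather than a positional index such as `row[2]` means the code should keep working if columns are later added or reordered.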
