Using pandas DataFrames to process data from multiple replicate runs in Python

Per a recommendation in my previous blog post, I decided to follow up and write a short how-to on using pandas to process data from multiple replicate runs in Python.

If you do research like mine, you’ll often find yourself with multiple datasets from an experiment that you’ve run in replicate. There are plenty of ways to manage and process data nowadays, but I’ve never seen it made as easy as it is with pandas.

Installing pandas

If you don’t already have pandas installed, download it at: http://pandas.pydata.org/getpandas.html

Then follow the typical Python package install process:

cd unzipped-pandas-folder-name/
python setup.py build_ext --inplace
sudo python setup.py install
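
Alternatively, if you already have pip set up, a one-line install should do the trick:

pip install pandas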

Note: if you’re not using pandas in IPython Notebook, the build_ext --inplace part is unnecessary.

Using pandas

Below, I’ll show you the short block of Python code that I use to read in, process, and plot all of the data from my experiments. After that, I’ll break the code block down piece by piece and explain what’s happening.

from pandas import *
from pylab import *  # provides figure(), errorbar(), sqrt(), and friends
import glob

dataLists = {}

# read data
for folder in glob.glob("experiment-data-directory/*"):
    
    # the folder name (e.g., "treatment1") becomes the treatment's key
    dataLists[folder.split("/")[1]] = []
    
    for datafile in glob.glob(folder + "/*.csv"):

        # one DataFrame per replicate run
        dataLists[folder.split("/")[1]].append(read_csv(datafile))

# calculate stats for data
meanDFs = {}
stderrDFs = {}

for key in dataLists.keys():
    
    # merge every run into one DataFrame, then group replicate columns by name
    keyDF = (concat(dataLists[key], axis=1, keys=range(len(dataLists[key])))
            .swaplevel(0, 1, axis=1)
            .sortlevel(axis=1)
            .groupby(level=0, axis=1))
    
    meanDFs[key] = keyDF.mean()
    # SEM = std / sqrt(# replicates), doubled to approximate a 95% confidence interval
    stderrDFs[key] = keyDF.std().div(sqrt(len(dataLists[key]))).mul(2.0)
    
    keyDF = None

# plot data
for column in meanDFs[key].columns:
    
    # don't plot generation over generation - that's pointless!
    if not (column == "generation"):
    
        figure(figsize=(20, 15))
        title(column.replace("_", " ").title())
        ylabel(column.replace("_", " ").title())
        xlabel("Generation")
        
        for key in meanDFs.keys():

            errorbar(x=meanDFs[key]["generation"], y=meanDFs[key][column], yerr=stderrDFs[key][column], label=key)
            
        legend(loc=2)

Here’s one graph from the end product:

[Figure: Prey Fitness over generations, one line per treatment]

Required packages

Along with the pandas package, the glob module is extremely useful for gathering folders and files into a single list so they can be iterated over.

from pandas import *
import glob

Reading data with pandas

glob.glob(str) returns a list of all the files and folders matching a given *nix-style path expression.

for folder in glob.glob("experiment-data-directory/*"):

For example, say experiment-data-directory contains 4 other directories: treatment1, treatment2, treatment3, and treatment4. It will return a list of the directory paths as strings.

print glob.glob("experiment-data-directory/*")
>>> ["experiment-data-directory/treatment1", "experiment-data-directory/treatment2", "experiment-data-directory/treatment3", "experiment-data-directory/treatment4"]

Similarly, glob.glob(folder + "/*.csv") will return a list of all .csv files in the given directory.

print glob.glob("experiment-data-directory/treatment1/*.csv")
>>> ["experiment-data-directory/treatment1/run1.csv", "experiment-data-directory/treatment1/run2.csv", "experiment-data-directory/treatment1/run3.csv"]

Finally, the innermost line of the loop stores all of the pandas DataFrames read in by the pandas read_csv() function. read_csv() is a powerful function that takes care of reading and parsing your csv files into DataFrames. Make sure to have your column titles at the top of each csv file!

        dataLists[folder.split("/")[1]].append(read_csv(datafile))

Thus, dataLists maps "treatment1", "treatment2", "treatment3", "treatment4" to their corresponding list of DataFrames, with each DataFrame containing the data of a single run. More on how powerful DataFrames are below!
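
As a quick sanity check (this snippet is mine, not part of the main script), you can confirm the structure of dataLists before computing any statistics:

for treatment, runs in dataLists.items():
    print treatment, "->", len(runs), "replicate DataFrames"

# e.g.:
# treatment1 -> 3 replicate DataFrames
# treatment2 -> 3 replicate DataFrames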

Statistics with pandas

This bit of code iterates over each treatment.

meanDFs = {}
stderrDFs = {}

for key in dataLists.keys():

And here’s where we see the real power of pandas DataFrames.

The concat() call merges the list of DataFrames into a single DataFrame containing every run’s data for that treatment, keyed by run number.

keyDF = (concat(dataLists[key], axis=1, keys=range(len(dataLists[key])))

The swaplevel() call reorganizes the columns so the run data is grouped on a per-data-column basis instead of a per-run basis.

            .swaplevel(0, 1, axis=1)

The sortlevel() call sorts the column names in ascending order. This is purely for aesthetic purposes.

            .sortlevel(axis=1)

Lastly, the groupby() call groups all of the replicate run data together by column.

            .groupby(level=0, axis=1))

Here’s an example of how this works in practice:

In [12]:  x
Out[12]:
          A         B         C
0 -0.264438 -1.026059 -0.619500
1  0.927272  0.302904 -0.032399
2 -0.264273 -0.386314 -0.217601
3 -0.871858 -0.348382  1.100491

In [13]:  y
Out[13]:
          A         B         C
0  1.923135  0.135355 -0.285491
1 -0.208940  0.642432 -0.764902
2  1.477419 -1.659804 -0.431375
3 -1.191664  0.152576  0.935773

In [14]: glued = pd.concat([x, y], axis=1, keys=['x', 'y'])

In [15]: glued
Out[15]: 
          x                             y                    
          A         B         C         A         B         C
0 -0.264438 -1.026059 -0.619500  1.923135  0.135355 -0.285491
1  0.927272  0.302904 -0.032399 -0.208940  0.642432 -0.764902
2 -0.264273 -0.386314 -0.217601  1.477419 -1.659804 -0.431375
3 -0.871858 -0.348382  1.100491 -1.191664  0.152576  0.935773

In [16]: glued.swaplevel(0, 1, axis=1).sortlevel(axis=1)
Out[16]: 
          A                   B                   C          
          x         y         x         y         x         y
0 -0.264438  1.923135 -1.026059  0.135355 -0.619500 -0.285491
1  0.927272 -0.208940  0.302904  0.642432 -0.032399 -0.764902
2 -0.264273  1.477419 -0.386314 -1.659804 -0.217601 -0.431375
3 -0.871858 -1.191664 -0.348382  0.152576  1.100491  0.935773

In [17]: glued = glued.swaplevel(0, 1, axis=1).sortlevel(axis=1)

In [18]: glued
Out[18]: 
          A                   B                   C          
          x         y         x         y         x         y
0 -0.264438  1.923135 -1.026059  0.135355 -0.619500 -0.285491
1  0.927272 -0.208940  0.302904  0.642432 -0.032399 -0.764902
2 -0.264273  1.477419 -0.386314 -1.659804 -0.217601 -0.431375
3 -0.871858 -1.191664 -0.348382  0.152576  1.100491  0.935773
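
The session above stops just short of the final groupby step. Continuing it (the In [19] step below is my addition, with values computed from the frames above), grouping on the outer column level and taking the mean averages x and y for each column:

In [19]: glued.groupby(level=0, axis=1).mean()
Out[19]: 
          A         B         C
0  0.829349 -0.445352 -0.452496
1  0.359166  0.472668 -0.398651
2  0.606573 -1.023059 -0.324488
3 -1.031761 -0.097903  1.018132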

By storing the data this way in pandas DataFrames, you can do all kinds of powerful operations on the data on a per-DataFrame basis. In the two lines below, I compute the mean and twice the standard error of the mean (SEM = standard deviation / sqrt(number of replicates); doubling it approximates a 95% confidence interval) of every column, for every treatment, over an arbitrary number of replicates.

    meanDFs[key] = keyDF.mean()
    # SEM = std / sqrt(# replicates), doubled to approximate a 95% confidence interval
    stderrDFs[key] = keyDF.std().div(sqrt(len(dataLists[key]))).mul(2.0)
    
    keyDF = None

DataFrames have all kinds of built-in functions to perform standard operations on them en masse: add(), sub(), mul(), div(), mean(), std(), etc. The full list is located at: http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.html
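
As a quick illustration (toy data, nothing to do with my experiments), here are a few of those methods in action:

from pandas import DataFrame

df = DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

print df.mean()          # column means: a = 2.0, b = 5.0
print df.div(2.0)        # every value halved
print df.sub(df.mean())  # center each column on its column mean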

Plotting pandas data with matplotlib

The code below assumes you have a “generation” column that your data is plotted over. If you use another x-axis, it is easy enough to replace “generation” with whatever you named your x-axis column. Note that the outer loop reuses key from the statistics loop above: since every treatment logs the same columns, any treatment’s column list will do.

for column in meanDFs[key].columns:
    
    # don't plot generation over generation - that's pointless!
    if not (column == "generation"):
    
        figure(figsize=(20, 15))
        title(column.replace("_", " ").title())
        ylabel(column.replace("_", " ").title())
        xlabel("Generation")
        
        for key in meanDFs.keys():

            errorbar(x=meanDFs[key]["generation"], y=meanDFs[key][column], yerr=stderrDFs[key][column], label=key)
            
        legend(loc=2)

You can access each column individually by indexing it with the name of the column you want, e.g. dataframe["column_name"]. Since DataFrames expose each column as an array-like Series, it doesn’t take any extra work to pass the data to matplotlib to plot it.
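
For example, plotting one column against another from a single run (assuming a hypothetical "fitness" column) takes just two lines:

df = read_csv("treatment1/run1.csv")
plot(df["generation"], df["fitness"], label="run 1")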

Randy is a PhD candidate in Michigan State University's Computer Science program. As a member of Dr. Chris Adami's research lab, he studies biologically-inspired artificial intelligence and evolutionary processes.

  • Jim

    Can you explain how you derived the StdERR? Basically what is it in relation to the std deviation? (keyDF.std().div(sqrt(len(dataLists[key]))).mul(2.0))

    • Randy Olson (http://www.randalolson.com)

      You can also use SciPy’s sem() function. Replace the stderrDFs line with:

      from scipy import stats as st

      stderrDFs[key] = keyDF.aggregate(lambda x: st.sem(x, axis=1))
      stderrDFs[key] = stderrDFs[key].mul(1.96) # if you want to plot the estimated 95% confidence interval

  • Bill Throndset

    Nice code… I tried this, reading a few .csv files within a directory. It threw an error which I am having trouble understanding:

    912 if len(objs) == 0:
    --> 913 raise Exception('All objects passed were None')
    914
    915 # consolidate data

    However, if I print my dataLists, I see that my .csv files were indeed read:

    %run C:/Users/ggwth/Python/P4DA/iterate_randalolson.com
    {'CSV\\20135142.csv': [], 'CSV\\20135141.csv': [], 'CSV\\20135143.csv': [], 'CSV\\20135140.csv': []}

    Do you have any idea why it is throwing me that error?

    • Randy Olson (http://www.randalolson.com)

      Are any of the csv files empty? Try opening each csv file separately and see if pandas can parse them.

  • Bill Throndset

    They aren't empty; I ran this code:

    import pandas as pd
    from pandas import *
    import glob

    # check each csv file to make sure it has data

    df = pd.read_csv("Python/CSV/20135140.csv")
    print df

    df2 = pd.read_csv("Python/CSV/20135141.csv")
    print df2

    df3 = pd.read_csv("Python/CSV/20135142.csv")
    print df3

    df4 = pd.read_csv("Python/CSV/20135143.csv")
    print df4

    dataLists = {}

    # read data
    for folder in glob.glob("Python/CSV/*"):

        dataLists[folder.split("/")[1]] = []

        for datafile in glob.glob(folder + "/*.csv"):

            dataLists[folder.split("/")[1]].append(read_csv(datafile))

    print dataLists

    And got this output:

    %run C:/Users/ggwth/Python/P4DA/RobOlsonex

    Int64Index: 8341 entries, 0 to 8340
    Data columns (total 6 columns):
    EFT(h) 8341 non-null values
    V.PV [mL] 8341 non-null values
    pH.PV [pH] 8341 non-null values
    DO.PV [%DO] 8341 non-null values
    T.PV [C] 8341 non-null values
    N.PV [rpm] 8341 non-null values
    dtypes: float64(5), int64(1)

    Int64Index: 8341 entries, 0 to 8340
    Data columns (total 6 columns):
    EFT(h) 8341 non-null values
    V.PV [mL] 8341 non-null values
    pH.PV [pH] 8341 non-null values
    DO.PV [%DO] 8341 non-null values
    T.PV [C] 8341 non-null values
    N.PV [rpm] 8341 non-null values
    dtypes: float64(5), int64(1)

    Int64Index: 8341 entries, 0 to 8340
    Data columns (total 6 columns):
    EFT(h) 8341 non-null values
    V.PV [mL] 8341 non-null values
    pH.PV [pH] 8341 non-null values
    DO.PV [%DO] 8341 non-null values
    T.PV [C] 8341 non-null values
    N.PV [rpm] 8341 non-null values
    dtypes: float64(5), int64(1)

    Int64Index: 8341 entries, 0 to 8340
    Data columns (total 6 columns):
    EFT(h) 8341 non-null values
    V.PV [mL] 8341 non-null values
    pH.PV [pH] 8341 non-null values
    T.PV [C] 8341 non-null values
    N.PV [rpm] 8341 non-null values
    F.PV [sL/h] 8341 non-null values
    dtypes: float64(5), int64(1)
    {'CSV\\20135142.csv': [], 'CSV\\20135141.csv': [], 'CSV\\20135143.csv': [], 'CSV\\20135140.csv': []}

    So it is confusing to me why it throws this error:

    912 if len(objs) == 0:
    --> 913 raise Exception('All objects passed were None')

    If you are interested, I’d be happy to send you the 4 .csv files!

    • Randy Olson (http://www.randalolson.com)

      Yeah, email me the Python code and csv files and I’ll take a quick look.

  • bujji

    Before entering the main loop, if we want to first check whether each input file is a csv file or not, what would the code be?

    • Randy Olson (http://www.randalolson.com)

      This code will only work with csv files currently, because the glob.glob() call only grabs csv files. Do you have non-csv files that you’re working with?

  • Simon

    Hi Randy, thanks for the awesome code, I find it really helpful. I'm a transport modeller and often use a lot of simulation runs to generate statistically significant results. This piece of code of yours rocks. However, I have one minor question: what if you only want to take the average for certain columns in all the csv files?

    For example, I have three columns, called "A", "B", "C": A is the road name, B is the coordinates, and C is the traffic flow. They are all the same across 100 simulation runs, with the same number of rows. Can I take the average just for column "C"? I have tried using:
    keyDF.groupby(["A", "B"]).mean()

    but it gives me an error. Can you give me some suggestions? I will really appreciate it!

    Many thanks!
    Simon

    • Randy Olson (http://www.randalolson.com)

      Yup! All you have to do is:


      keyDF["C"].mean()

      That grabs the "C" column and averages its values.

      • Simon

        Hi Randy, thanks for the quick reply. I have tried to implement the keyDF["C"].mean() as you suggested, but the results did not come out as I expected. Here is the code I implemented:

        import csv
        import glob
        from optparse import OptionParser
        from pandas import *

        dataLists = {}

        # read data
        for folder in glob.glob("./results/*"):

            dataLists[folder.split("/")[1]] = []

            for datafile in glob.glob(folder + "/*.csv"):

                dataLists[folder.split("/")[1]].append(read_csv(datafile))

        for key in dataLists.keys():
            keyDF = (concat(dataLists[key], axis=1, keys=range(len(dataLists[key])))
                    .swaplevel(0, 1, axis=1)
                    .groupby(level=0, axis=1))
            DF = keyDF["C"].mean()

        DF.to_csv("test.csv")

        I think I implemented the code wrong; the results I get are something like this:

        A,0,77.33738675966667
        A,1,99.208896702
        A,2,91.78981542166667
        B,0,135.96458894333333
        B,1,119.22963823000002
        B,2,126.14770441
        C,0,277.6703634276667
        C,1,265.66402293133336
        C,2,236.73396117766666

        I'm not exactly sure what's inside keyDF after the for loop?

        • Simon

          Hi Randy,

          I think I was a bit confused when writing my previous post. To clarify, the final result I want is something like this (A is road name, B is location, C is traffic flow):

          A            B                        C
          '1112:1341   lat:51.59; long:-0.19    540
          '1112:87     lat:58.64; long:-0.25    300
          '1158:227z   lat:53.62; long:-0.34    600

          where column C is the average over all 30 simulation runs. By doing keyDF["C"].mean(), I think it will only give the average of the whole column, which is not what I want. I want the average over the 30 simulation runs of the flow on each road. How can I do this?

          Simon

          • Randy Olson (http://www.randalolson.com)

            Email me your raw data & current code and I’ll hack at it for a few minutes. :-)

  • Simon

    You are awesome, Randy :-)!
