I attended a Software Carpentry workshop hosted by Titus Brown and Greg Wilson this week and was introduced to, among many other things, a piece of software that I’ve been looking for ever since I started my graduate program: IPython Notebook. It can easily be installed with the majority of the other packages you’ll need in the Anaconda Python distribution.

I do the majority of my post-experiment data analysis in Python nowadays, since it’s one of the few sanely-designed scripting languages out there with all the functionality I need. What I’ve been missing is a seamless user interface where I can both take notes about my research *and* perform my data analysis in the same location. IPython Notebook finally provides that.

Ever since I announced my conversion from RTF files to IPython Notebook as my primary means of taking research notes, I’ve received a lot of flak about how Python doesn’t support advanced statistical tests, such as bootstrapped confidence intervals, Mann-Whitney-Wilcoxon rank-sum tests, and ANOVA tests. After a day of searching with my lab mates, I finally turned up all the libraries I need:

- Bootstrapped confidence intervals: https://pypi.python.org/pypi/scikits.bootstrap
- MWW RankSum test: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.ranksums.html
- ANOVA: http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.f_oneway.html
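
To give a rough idea of what these tests look like in practice, here is a minimal sketch. The data is synthetic and made up purely for illustration, and the bootstrap is a hand-rolled percentile bootstrap written with NumPy (the helper name `bootstrap_mean_ci` is hypothetical), not the scikits.bootstrap API. The rank-sum and ANOVA calls are the `scipy.stats` functions linked above.

```python
import numpy as np
from scipy import stats

# Synthetic measurements for illustration only (not real experimental data).
rng = np.random.default_rng(42)
control = rng.normal(loc=10.0, scale=2.0, size=50)
treatment = rng.normal(loc=12.0, scale=2.0, size=50)
other = rng.normal(loc=10.5, scale=2.0, size=50)


def bootstrap_mean_ci(data, n_resamples=5000, alpha=0.05, rng=rng):
    """Percentile bootstrap confidence interval for the mean.

    Hypothetical helper: resample the data with replacement, compute the
    mean of each resample, and take the alpha/2 and 1 - alpha/2 percentiles.
    """
    data = np.asarray(data)
    # Each row of idx is one resample (indices drawn with replacement).
    idx = rng.integers(0, len(data), size=(n_resamples, len(data)))
    resampled_means = data[idx].mean(axis=1)
    lower, upper = np.percentile(
        resampled_means, [100 * alpha / 2, 100 * (1 - alpha / 2)]
    )
    return lower, upper


# 95% bootstrapped confidence interval for the control group's mean.
ci_low, ci_high = bootstrap_mean_ci(control)

# Wilcoxon rank-sum test between two independent samples.
rank_stat, rank_p = stats.ranksums(control, treatment)

# One-way ANOVA across all three groups.
f_stat, anova_p = stats.f_oneway(control, treatment, other)

print("95% CI for control mean:", (ci_low, ci_high))
print("Rank-sum test p-value:", rank_p)
print("One-way ANOVA p-value:", anova_p)
```

Each of these fits in a single notebook cell, so the notes, the code, and the results can sit side by side.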

If you can’t find a Python library for a statistical test you need, post here and we’ll try to find it. The IPython notebook has plenty of uses beyond a research notebook, too. For example, Titus Brown recently posted the IPython notebook that he used to generate all of the graphs in one of his recent papers. Imagine the implications for science if scientists actually start showing the code they used to generate their graphs! (No more hiding that outlier point on the side of the graph…)

I’ll be putting up some tutorials and examples of how to use IPython Notebook for exploratory statistical data analysis soon, so stay posted!

Note that it is ‘IPython’, an abbreviation for ‘Interactive Python’, not ‘iPython’.

Fixed, thanks! I’d always seen it as “iPython.”

Thanks for the kind words, Randal! I demoed the notebook two weeks ago during my visit to MSU (at Titus’ invitation) where I gave a couple of talks both on the entire IPython project and on its parallel computing capabilities. Sorry I missed you, but I’m very happy to see this kind of hands-on response from users!

I just wanted to let you know that for statistical machinery, in addition to the basics contained in scipy.stats, you’d probably find both Pandas and Statsmodels quite useful. They provide a fair number of tools for data analysis and statistics, and both projects are very open to new contributors.

Pandas looks right up my alley. Thank you for pointing me to it. Now if we can just get bootstrapping merged into scipy.stats, I could bury my custom library. 🙂

Have you looked into rpy2? It provides a direct interface with the R libraries. A couple colleagues and I have it loading, running stats on, and plotting data in IPython Notebook, but the high-level interface is a little wonky. On the plus side, the low-level interface is really straightforward: rpy2.robjects.r("any R command").

Thank you so much for making such a great tool for scientific computing. I’m not afraid to say that IPython Notebook has significantly changed how I do my research.

I hope we have another chance to meet soon!

P.S. Do you have a preferred medium for feedback about IPython Notebook?

Hi Randal,

Yes, we’ve looked at rpy2: Jonathan Taylor, a friend from the stats dept at Stanford, just coded up the functionality to embed R in whole cells cleanly. I’m right in the middle of finishing up the syntactic support for that, and once we get it merged (a week or so, I hope), you’ll be able to type in one cell:

%%R --inputs=X,Y --outputs=r

a=lm(Y~X)

… etc: rest of R code here

and it will run all your R code nicely, using your python variables X and Y, and leaving you with an output variable ‘r’.

So give it a few weeks, and we’ll have solid R integration.

As for feedback, yes: ideally we have our discussions on our development mailing list. Bug/code-specific conversations tend to happen on the corresponding ticket or Pull Request on github, but for the ‘big picture’ discussions, the -dev list is the right venue. I only caught your blog by accident via a tweet of Titus’, but since I use twitter very rarely, that’s not a reliable channel in general.

That sounds perfect. I think that’ll pretty much erase the final item on the list of “reasons not to use Python/IPython Notebook for stats” that I’ve heard from my colleagues.

Looking forward to the release. Perhaps I’ll put together another hands-on demo for the R integration in a few weeks, then.