The new version 2.13.0 of R has just been released and with the update comes the pain of re-installing all the packages from the old installation on the new one.
Stack Overflow to the rescue! This posting provides a simple two-step process: in the old version, write the list of installed packages to a file on disk; then install the new version and compare the exported list to the packages currently installed with setdiff. I just went through the process and have to say that it is dead easy! Below the code …
#--run in the old version of R: save the names of all installed packages
packages <- installed.packages()[, "Package"]
save(packages, file = "installed_packages.Rdata")

#--INSTALL THE NEW R VERSION--

#--run in the new version: install everything that is not yet present
load("installed_packages.Rdata")
for (p in setdiff(packages, installed.packages()[, "Package"]))
  install.packages(p)
Is Psychology ready for reproducible research?
Today the typical research process in psychology generally looks like this: we collect data; analyze them in many ways; write a draft article based on some of the results; submit the draft to a journal; perhaps produce a revision following the suggestions of the reviewers and editors; and hopefully live long enough to actually see it published. All of these steps are closed to the public except the last one – the publication of the (often substantially) revised version of the paper. Journal editors and reviewers evaluate the written work submitted to them, trusting that the analyses described in the submission were done in a principled and correct way. Editors and reviewers are the only external parties in this process who have an active influence on which analyses are done. After the publication of an article the public has the opportunity to write comments or ask the authors for the actual datasets for re-analysis. Getting access to data from published papers, however, is often hard, if not impossible (Savage & Vickers, 2009; Wicherts, Borsboom, Kats, & Molenaar, 2006). Unfortunately, only the gist of the analyses is described in the paper, so neither exact verification nor innovative additional analyses are possible.
What could solve this problem? Computer science provides a concept called “literate programming,” advocated by one of the field’s grandmasters, Donald Knuth, in 1984. Knuth suggested that documentation (comments in the code) should be just as important as the code itself. This idea was echoed nearly 20 years later when Schwab et al. (2000) argued that “replication by other scientists” is a central aim of and guardian for intellectual quality; they coined the term “reproducible research” for such a process.
Let’s move the research process to a more open, reproducible structure, in which scientific peers have the ability to evaluate not only the final publication but also the data and the analyses.
Ideally, research papers would be submitted in tandem with the original datasets and with analysis code commented in detail. Anybody, not only a restricted group of select reviewers and editors, could reproduce all the steps of the analysis and follow the logic of the arguments not only at the conceptual level but at the analytic level as well. This openness also facilitates easy reanalysis of data. Meta-analyses could be done more frequently and with greater resolution because the actual data are available. Moreover, this configuration would allow us collectively to estimate effects in the population rather than restricting our attention to independent small samples (see Henrich, Heine, & Norenzayan, 2010 for a discussion of this topic).
What do we need to achieve this? From a policy perspective, journals would have to require the submission of data and code together with the draft of each empirical paper. Some journals already provide the option to do this in a supplemental-material section on a voluntary basis (e.g., Behavior Research Methods); some require the submission of all material necessary to replicate the reported results (e.g., Econometrica); most, however, do not offer such a possibility (it is of course possible to provide such materials through private or university web sites, but this is a haphazard and decentralized arrangement).
Tools are the second important part of facilitating this openness. Three open source (free of cost) components could provide the basis for reproducible research:
- R (R Development Core Team, 2010) is widely recognized as the “language of statistics” and builds on writing code instead of the “click and forget” type of analysis that other software packages encourage. R is open source, comes with a large number of extensions for advanced statistical analysis, and can be run on any computer platform, including as a Web-based application (http://www.R-project.org).
- LaTeX was invented to provide a tool for anybody to produce high-quality publications independent of the computer system used (i.e., one can expect the same results everywhere; http://www.latex-project.org/).
- Sweave (Leisch, 2002) connects R and LaTeX providing the opportunity to write a research paper and do the data analysis in parallel, in a well documented and reproducible way (http://www.stat.uni-muenchen.de/~leisch/Sweave/).
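As a minimal illustration of how these tools interlock, here is a sketch of a Sweave document (a .Rnw file; the data and test below are invented for illustration). R code lives in chunks delimited by <<>>= and @, and Sweave replaces each chunk with its output inside the LaTeX source:

```latex
\documentclass{article}
\begin{document}

\section{Results}

% An R code chunk: Sweave runs this and inserts the output into the paper
<<echo=TRUE>>=
x <- rnorm(20, mean = 0.5)   # invented example data
t.test(x)
@

The analysis above is re-run every time the paper is compiled, so the
reported numbers can never drift away from the data.

\end{document}
```

Running Sweave("paper.Rnw") in R produces paper.tex, which is then compiled with LaTeX as usual.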
The power of these different tools comes from the combination of their being open source, their widespread adoption across a wide range of scientific fields, and the fully transparent means by which data analysis is conducted and reported. It levels the playing field: anybody with a computer and an Internet connection can take part in evolving scientific progress.
John Godfrey Saxe famously said: “Laws, like sausages, cease to inspire respect in proportion as we know how they are made.” We should strive to keep this from becoming true of psychology as a science.
Across the street at the Revolutions blog, a nice example of using R with data from the cloud (see another post on this topic here) shows us the distribution of fouls during the just-finished World Cup in a nice bar chart. Even more interesting than the fact that Holland leads this category is the way the data are collected from a Google spreadsheet.
With the following simple line of code:
teams <- read.csv("http://spreadsheets.google.com/pub?key=tOM2qREmPUbv76waumrEEYg&single=true&gid=0&range=A1%3AAG15&output=csv")
We can read a specific part of a spreadsheet hosted on Google into our local R environment. Some details: "&gid=" selects the sheet number, "&range=" the cell range (with ":" URL-encoded as "%3A", e.g. "A1%3AAG15"), and "&output=csv" requests the download in CSV format.
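The parts of such a URL can also be assembled programmatically; a small sketch in base R (the key is the one from the line above, the parameter names are those of the old Google Spreadsheets export API):

```r
key <- "tOM2qREmPUbv76waumrEEYg"
url <- paste0("http://spreadsheets.google.com/pub",
              "?key=", key,               # which published spreadsheet
              "&single=true",
              "&gid=0",                   # sheet number
              "&range=", URLencode("A1:AG15", reserved = TRUE),  # cell range
              "&output=csv")              # ask for CSV format
```

URLencode() with reserved = TRUE takes care of encoding the ":" in the range as "%3A".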
With a few more lines, using the awesome ggplot2:

# FOULS is assumed to be a numeric vector of foul counts, named by country
qplot(names(FOULS), as.numeric(FOULS), geom = "bar", stat = "identity",
      fill = as.numeric(FOULS)) +
  xlab("Country") + ylab("Fouls") + coord_flip() +
  scale_fill_continuous(low = "black", high = "red") + labs(fill = "Fouls")
Two things to note:
c('Fouls') is a handy way to address columns in an R data frame by name
scale_fill_continuous(low="black", high="red") takes care of the color coding of the bars in reference to the number of fouls
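For the two notes above to make sense, a named vector FOULS has to be built from the downloaded data frame first; a minimal sketch with made-up numbers (the column names Team and Fouls are assumptions about the spreadsheet):

```r
# Stand-in for the downloaded spreadsheet; the real data come from read.csv()
teams <- data.frame(Team  = c("Netherlands", "Spain", "Germany"),
                    Fouls = c(28, 19, 17))

FOULS <- teams[, c("Fouls")]   # address a column by name, as in note 1
names(FOULS) <- teams$Team     # label each count with its country
```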
Easy and straightforward – ah – great job Spain 🙂 !!
Jeroen Ooms did for R what Google did for editing documents online: he created several software packages that let you run R, with a nice frontend, over the Internet.
I first learned about Jeroen’s website through his web implementation of ggplot2 – the page is useful for generating graphs with the powerful ggplot2 package without any R knowledge, but it is even more helpful for learning ggplot2, since the view-code panel displays the underlying R code. If you are into random-effects models, another application, connected to lme4, will guide you step by step through model building.
I think this is a great step forward for R and cloud computing!
I really liked Lattice for generating graphs in R until I saw what ggplot2 can do …
One of the big differences between the two is the theory that ggplot2 is based on: clear, modular building blocks that can be applied in a consistent manner to any graph. Both packages are extremely versatile, but at the end of the day I think ggplot2 provides a clearer structure and hence more flexibility …
The learningR people have a long series of posts where they provide ggplot2 code for nearly all the graphs in the Lattice book … worth taking a look at!
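To make the contrast concrete, here is the same grouped scatterplot in both systems, a sketch using the built-in iris data (not from either book):

```r
library(lattice)
library(ggplot2)

# lattice describes the whole plot in a single formula call ...
p_lattice <- xyplot(Sepal.Length ~ Sepal.Width, groups = Species,
                    data = iris, auto.key = TRUE)

# ... while ggplot2 composes it from modular building blocks (layers);
# both objects draw when printed
p_ggplot <- ggplot(iris, aes(Sepal.Width, Sepal.Length, colour = Species)) +
  geom_point()
```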
Here is an in depth
From: The R Flashmob Project
Subject: R Flashmob #2
You are invited to take part in R Flashmob, the project that makes the
world a better place by posting helpful questions and answers about the
R statistical language to the programmer’s Q & A site stackoverflow.com
Please forward this to other people you know who might like to join.
Q. Why would I want to join an inexplicable R mob?
A. Tons of other people are doing it.
Q. Why else?
A. Stackoverflow was built specifically for handling programming questions.
It’s a better mousetrap. It offers search (and is well indexed by search engines),
tagging, voting, the ability to choose the “best” answer to a question, and the ability to
edit questions and answers as technology progresses. It has a karma system to
reward people who are happy to help and discourage MLJs (mailing list jerks).
Q. Do the organizers of this MOB have any commercial interest in stackoverflow?
A. None at all. We’re just convinced it is the best way to help and promote R. All
the content submitted to stackoverflow is protected by a Creative Commons
CC-Wiki License, meaning anyone is free to copy, distribute, transmit, and
remix the information on stackoverflow. All the content on stackoverflow is
regularly made available for download by the public.
INSTRUCTIONS – R MOB #2
Start Date: Tuesday, September 8th, 2009
10:04 AM – US Pacific
11:04 AM – US Mountain
12:04 PM – US Central
1:04 PM – US Eastern
6:04 PM – UK
7:04 PM – Continental W. Europe
5:04 AM (Weds) – New Zealand (birthplace of R)
Duration: 50 minutes
(1) At some point during the day on September 8th, synchronize your watch to
(2) The mob should form at precisely 4 minutes past the hour and not beforehand.
(3) At 4 minutes past the hour, you should arrive at stackoverflow.com, log in,
and post 3 R questions. Be sure to tag the questions “R”. See the posting
guidelines at http://stackoverflow.com/faq to understand what makes a good question.
(4) Follow R Flashmob updates at http://twitter.com/rstatsmob
(5) Post twitter messages tagged #rstats and #rstatsmob during the mob,
providing links to your questions.
(6) During the R MOB, you can chat with other participants on the #R channel
on IRC (freenode). To do this, install the Chatzilla extension on Firefox.
Click “freenode” on the main screen. Then type /join #R in the field at the
bottom of the screen. Then chat.
(7) If you finish posting your three questions within the 50 minutes, stick
around to answer questions and give “up votes” to good questions and answers.
(8) IMPORTANT: After posting, sign the R Flashmob guestbook at
(9) Return to what you would otherwise have been doing. Await
instructions for R MOB #3.
Dan Goldstein posted a short overview of Inference, which allows working with R code in Microsoft Office and Excel.
I want to point to Sweave, which does an excellent job of connecting R to LaTeX. Here is a short demo of Sweave that also connects the approach to the ‘literate programming’ idea of Donald Knuth (author of ‘The Art of Computer Programming’ and creator of TeX).
The basic idea is to combine programming (in the case of R, an analysis) and its documentation in one process. This is especially useful when one returns to an older analysis and tries to find out what was done months ago.
Additionally you find a longer interview with Paul van Eikeren on the same topic here.
David Smith has a very nice code example in which he sets the color of each title word in a plot to the corresponding group's plotting color. The code can be found here. This seems extremely useful for posters and presentations; I doubt, however, that journals would pick up on it …
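A hedged sketch of the idea in base R (not David Smith's actual code; data and colors are made up): each word of the title is drawn separately with mtext(), in the color of the group it names:

```r
cols <- c(setosa = "tomato", versicolor = "green3", virginica = "royalblue")

# Scatterplot with points colored by group
plot(iris$Sepal.Width, iris$Sepal.Length,
     col = cols[as.character(iris$Species)], pch = 19,
     xlab = "Sepal width", ylab = "Sepal length")

# One mtext() call per title word, positioned along the top margin via 'adj'
mtext("setosa",     side = 3, adj = 0.1, col = cols["setosa"],     font = 2)
mtext("versicolor", side = 3, adj = 0.5, col = cols["versicolor"], font = 2)
mtext("virginica",  side = 3, adj = 0.9, col = cols["virginica"],  font = 2)
```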