

Where you end up, having a PhD

The illustrated guide from Kindergarten to PhD …
http://matt.might.net/articles/phd-school-in-pictures/#resources

Posted in Science.


two good things come together

http://docs.latexlab.org/docs
LaTeX and Google Docs together in one nice (free) application – this is a brilliant idea, which gives you the power of LaTeX combined with the excellent collaboration possibilities of Google Docs – if I had a button it would say “I like” …

Here is a screenshot from the project page – with LaTeX code on one side and the final output on the other …

Posted in LaTeX, Tech.


R and the World Cup

Across the street at the Revolution blog, a nice example of using R with data from the cloud (see another post on this topic here) shows us the distribution of fouls during the just-finished World Cup in a nice bar chart. Even more interesting than the fact that Holland rules this category is the way the data are collected from a Google spreadsheet page.

With the following simple code line:
teams <- read.csv("http://spreadsheets.google.com/pub?key=tOM2qREmPUbv76waumrEEYg&single=true&gid=0&range=A1%3AAG15&output=csv")

We can read a specific part of a spreadsheet hosted on Google into our local R environment. Some details: "&gid=" selects the sheet number, "&range=" selects the cell range (A1%3AAG15, where %3A is the URL-encoded colon, i.e. A1:AG15) and "&output=csv" requests the download in CSV format.
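To make the role of each parameter more visible, the same URL can be assembled from its parts. This is only a sketch: the key and values are copied from the line above, while the variable names (base_url, params, sheet_url) are my own, and the final read.csv() call is commented out because it needs network access and a spreadsheet that is still published.

```r
# Rebuild the spreadsheet URL from named query parameters.
# All values are taken from the URL used in the post.
base_url <- "http://spreadsheets.google.com/pub"
params <- c(
  key    = "tOM2qREmPUbv76waumrEEYg",  # spreadsheet key
  single = "true",
  gid    = "0",                        # sheet number
  range  = "A1:AG15",                  # cell range, ":" gets encoded as %3A
  output = "csv"                       # download format
)

# URLencode(reserved = TRUE) escapes reserved characters such as ":".
encoded <- vapply(params, URLencode, character(1), reserved = TRUE)
query <- paste0(names(params), "=", encoded, collapse = "&")
sheet_url <- paste0(base_url, "?", query)
print(sheet_url)

# teams <- read.csv(sheet_url)  # requires network access
```

Changing the range or gid entry is then enough to pull a different block of cells or a different sheet.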

With some more lines, using the awesome ggplot2

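library(ggplot2)
# DF2 is not defined in this post; it is presumably a data frame derived from 'teams'
FOULS <- t(DF2)[, c('Fouls')]
# bar length and fill colour both map to the number of fouls
qplot(names(FOULS), as.numeric(FOULS), geom="bar", stat='identity', fill=as.numeric(FOULS)) + xlab('Country') + ylab('Fouls') + coord_flip() + scale_fill_continuous(low="black", high="red") + labs(fill='Fouls')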

We can produce the following chart:

Two things to note:
c('Fouls') is a handy way to address columns in an R data frame (or matrix) by name
scale_fill_continuous(low="black", high="red") takes care of the color coding of the bars in reference to the number of fouls
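Since qplot() has been deprecated in recent ggplot2 releases, here is a rough equivalent written with ggplot(). It is a sketch under assumptions: the data frame fouls_df and its three rows are placeholder values I made up for illustration (not the World Cup data), and scale_fill_gradient() is used as the explicit form of the black-to-red gradient.

```r
library(ggplot2)

# Placeholder values for illustration only, NOT the World Cup data.
fouls_df <- data.frame(
  country = c("Team A", "Team B", "Team C"),
  fouls   = c(12, 30, 21)
)

# One bar per country; bar length and fill colour both encode the foul count.
p <- ggplot(fouls_df, aes(x = country, y = fouls, fill = fouls)) +
  geom_col() +                                     # bars from the values as given
  coord_flip() +                                   # horizontal bars
  scale_fill_gradient(low = "black", high = "red") +
  labs(x = "Country", y = "Fouls", fill = "Fouls")

print(p)
```

Mapping the same variable to both the bar length and the fill is what makes the bars with the most fouls show up in red.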

Easy and straightforward - ah - great job Spain :-) !!

Posted in R, Statistics.


R goes cloud

Jeroen Ooms did for R what Google did for editing documents online. He created several software packages that help to run R with a nice frontend over the Internet.
I first learned about Jeroen’s website through his implementation of ggplot2. This page lets you generate graphs with the powerful ggplot2 package without any R knowledge; it is even more helpful for learning ggplot2 code, because the View-code panel displays the underlying R code. If you are into random effect models, another package connected to lme4 will guide you step by step through model building.
I think this is a great step forward for R and cloud computing!

Posted in Methods, R, Statistics, Tech.


LaTeX looks more scientific

There are long discussions on the benefits of LaTeX over Word, but this statement from a (not too serious) paper by Andrew Gelman (a Professor of Statistics and Political Science at Columbia University) hits the spot:

Posted in LaTeX.


Accepting to fail (in the name of science)

Very interesting article in WIRED on accepting failure and how ignoring it changes the way scientists make progress (or not).

Good theme for New Year’s resolutions …

In the meantime: Happy Holidays!

Posted in Science.


How WEIRD subjects can be overcome … a comment on Henrich et al.

Joe Henrich published a target article in BBS talking about how economics and psychology base their research on WEIRD (Western, Educated, Industrialized, Rich and Democratic) subjects.

Here is the whole abstract:

Behavioral scientists routinely publish broad claims about human psychology and behavior in the world’s top journals based on samples drawn entirely from Western, Educated, Industrialized, Rich and Democratic (WEIRD) societies. Researchers—often implicitly—assume that either there is little variation across human populations, or that these “standard subjects” are as representative of the species as any other population. Are these assumptions justified? Here, our review of the comparative database from across the behavioral sciences suggests both that there is substantial variability in experimental results across populations and that WEIRD subjects are particularly unusual compared with the rest of the species—frequent outliers. The domains reviewed include visual perception, fairness, cooperation, spatial reasoning, categorization and inferential induction, moral reasoning, reasoning styles, self-concepts and related motivations, and the heritability of IQ. The findings suggest that members of WEIRD societies, including young children, are among the least representative populations one could find for generalizing about humans. Many of these findings involve domains that are associated with fundamental aspects of psychology, motivation, and behavior—hence, there are no obvious a priori grounds for claiming that a particular behavioral phenomenon is universal based on sampling from a single subpopulation. Overall, these empirical patterns suggests that we need to be less cavalier in addressing questions of human nature on the basis of data drawn from this particularly thin, and rather unusual, slice of humanity. We close by proposing ways to structurally re-organize the behavioral sciences to best tackle these challenges.

I would like to make three suggestions that could help to overcome the era of WEIRD subjects and generate more reliable and representative data. These suggestions will mainly touch contrasts 2, 3 and 4 elaborated by Henrich, Heine and Norenzayan. While my suggestions tackle these contrasts from a technical and experimental perspective, they do not provide a general solution for the first contrast on industrialized versus small scale societies. Here are my suggestions: 1) replications in multiple labs, 2) drawing representative samples from a population and 3) Internet based experimentation.
The first suggestion, replication in multiple labs, foremost touches aspects like replication, multiple populations and open data access. For a publication in a journal, a replication of the experiment in a different lab would be obligatory. The replication would then be published with the original, e.g., in the form of a comment. This would ensure that research labs in other states or countries are involved and that very different parts of the population could be sampled. Results of experiments would also be freely available to the public, and the data sharing problem in Psychology, as described in the target article, but also in other fields like Medicine (Savage & Vickers, 2009), would be a problem of the past. Of course such a step would be closely linked to certain standards, on the one hand for building experiments and on the other hand for storing data. While a standard way to build experiments seems unlikely, there are many methods available in computer science to store data in a reusable format, for example through the use of XML (Extensible Markup Language).
The second suggestion is based on drawing representative samples from the population. As described in the target article, research often suffers from a restriction to extreme subgroups of the population, from which generalized results are drawn. However, there is published work that overcomes these restrictions. As an example I would like to use the Hertwig, Zangerl, Biedert and Margraf (2008) paper on probabilistic numeracy. The authors based their study on a random-quota sample from the Swiss population, including indicators such as language, area where the participant lives, gender and age. To fulfill all the necessary criteria, 1000 participants were recruited using telephone interviews. Such studies are certainly more expensive and somewhat restricted to simpler experimental setups (Hertwig et al. used telephone interviews based on questionnaires).
The third suggestion adds additional data collection in a second location: the Internet. The emphasis in the last sentence should be set on ‘add’. Purely Internet based data collection is of course possible, is already often performed and is published in high impact journals. Online experimentation is technically much less demanding than ten years ago due to the availability of ready made solutions for questionnaires or even experiments. The point I would like to make here should not be built on a separation of lab and online based experiments. My suggestion combines these two research locations and enables a researcher to profit from the benefits of both. A possible scenario could include running an experiment in the laboratory first to guarantee, among other things, high control over the situation in order to show an effect with a small, restricted sample. In a second step the experiment is transferred to the Web and run online, admittedly giving away some of the control but providing the large benefit of easy access to a large, diverse sample of participants from different populations. As an example I would like to point to a recent blog and related experiments started by Paolacci and Warglien (2009) at the University of Venice, Italy. These researchers started replicating well known experiments from the decision making literature, like framing, anchoring or the conjunction fallacy, with a service called the Mechanical Turk provided by Amazon. This service is based on the idea of crowdsourcing (outsourcing a task to a large group of people) and gives a researcher easy access to a large group of motivated participants.
Some final words on the combination and possible restrictions of the three suggestions. What would a combination of all three suggestions look like? It would be a replication of experiments, using representative samples of different populations in online experiments. This seems useful from a data quality, logistics and price point of view. However, several issues were left untouched in my discussion, such as the question of independence of the second lab for replication studies, the restriction of representative samples to one country (as opposed to multiple comparisons as routinely found in, e.g., anthropological studies), the differences between online and lab based experimentation or the instances where equipment needed for an experiment (e.g., eye trackers or fMRI) does not allow for online experimentation. Keeping that in mind, the above suggestions draw an idealized picture of how to run experiments and re-use the collected data. Nevertheless, I would argue that such steps could help to reduce the percentage of WEIRD subjects in research substantially.

References
Hertwig, R., Zangerl, M.A., Biedert, E., & Margraf, J. (2008). The Public’s Probabilistic Numeracy: How Tasks, Education and Exposure to Games of Chance Shape It. Journal of Behavioral Decision Making, 21, 457-570.

Paolacci, G., & Warglien, M. (2009). Experimental turk: A blog on social science experiments on Amazon Mechanical Turk. Accessed on November 17th 2009:

Savage, C.J., & Vickers, A.J. (2009). Empirical Study of Data Sharing by Authors Publishing in PLoS Journals. PLoS ONE 4(9): e7078. doi:10.1371/journal.pone.0007078

Posted in Decision Making, Methods, Science.


Lattice versus ggplot2

I really liked Lattice for generating graphs in R until I saw what ggplot2 can do …
One of the big differences between the two is the theory on which ggplot2 is based. There are clear modular building blocks that can be applied in a consistent manner to any graph generated. Both packages are extremely versatile but at the end of the day I think ggplot2 provides a clearer structure and hence more flexibility …
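A small sketch of what I mean by building blocks, using the mtcars data set that ships with R. The object names (base_plot, scatter, smoothed, faceted) are my own, and the choice of variables is arbitrary; the point is only that each block is added with + and can be reused on any plot.

```r
library(ggplot2)

# Data and aesthetic mapping are defined once ...
base_plot <- ggplot(mtcars, aes(x = wt, y = mpg))

# ... and layers, statistics and facets are stacked on top as separate blocks.
scatter  <- base_plot + geom_point()
smoothed <- scatter + geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
faceted  <- smoothed + facet_wrap(~ cyl)   # one panel per number of cylinders

print(faceted)
```

Each intermediate object is a complete plot in itself, so a block such as facet_wrap(~ cyl) can be added to or left off any of them without touching the rest.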

Hadley Wickham (the author of ggplot2) has a book out on ggplot2 at Springer. Some sample chapters can be downloaded from his webpage.

The learningR people have a long series of posts where they provide ggplot2 code for nearly all the graphs in the Lattice book … worth taking a look at!
Here is an in depth

Posted in R, Statistics.


R flashmob

From: The R Flashmob Project
Subject: R Flashmob #2

You are invited to take part in R Flashmob, the project that makes the
world a better place by posting helpful questions and answers about the
R statistical language to the programmer’s Q & A site stackoverflow.com

Please forward this to other people you know who might like to join.

FAQ

Q. Why would I want to join an inexplicable R mob?

A. Tons of other people are doing it.

Q. Why else?

A. Stackoverflow was built specifically for handling programming questions.
It’s a better mousetrap. It offers search (and is well indexed by search engines),
tagging, voting, the ability to choose the “best” answer to a question, and the ability to
edit questions and answers as technology progresses. It has a karma system to
reward people who are happy to help and discourage MLJs (mailing list jerks).

Q. Do the organizers of this MOB have any commercial interest in stackoverflow?

A. None at all. We’re just convinced it is the best way to help and promote R. All
the content submitted to stackoverflow is protected by a Creative Commons
CC-Wiki License, meaning anyone is free to copy, distribute, transmit, and
remix the information on stackoverflow. All the content on stackoverflow is
regularly made available for download by the public.

INSTRUCTIONS – R MOB #2
Location: stackoverflow.com
Start Date: Tuesday, September 8th, 2009
Start Time:
10:04 AM – US Pacific
11:04 AM – US Mountain
12:04 PM – US Central
1:04 PM – US Eastern
6:04 PM – UK
7:04 PM – Continental W. Europe
5:04 AM (Weds) – New Zealand (birthplace of R)
Duration: 50 minutes

(1) At some point during the day on September 8th, synchronize your watch to

http://timeanddate.com/worldclock/personal.html?cities=137,75,64,179,136,37,22

(2) The mob should form at precisely 4 minutes past the hour and not beforehand.

(3) At 4 minutes past the hour, you should arrive at stackoverflow.com, log in,
and post 3 R questions. Be sure to tag the questions “R”. See the posting
guidelines at http://stackoverflow.com/faq to understand what makes a good
question.

(4) Follow R Flashmob updates at http://twitter.com/rstatsmob

(5) Post twitter messages tagged #rstats and #rstatsmob during the mob,
providing links to your questions.

(6) During the R MOB, you can chat with other participants on the #R channel
on IRC (freenode). To do this, install the Chatzilla extension on Firefox.
Click “freenode” on the main screen. Then type /join #R in the field at the
bottom of the screen. Then chat.

(7) If you finish posting your three questions within the 50 minutes, stick
around to answer questions and give “up votes” to good questions and answers.

(8) IMPORTANT: After posting, sign the R Flashmob guestbook at

http://bit.ly/6F8B2

(9) Return to what you would otherwise have been doing. Await
instructions for R MOB #3.

Posted in R, Statistics.


Flashlight paper draft

We submitted our Flashlight paper today. Find a draft at the address below:
Schulte-Mecklenbeck, Michael, Murphy, Ryan O. and Hutzler, Florian, Flashlight – an Online Eye-Tracking Tool (July 13, 2009). Available at SSRN: http://ssrn.com/abstract=1433225

The software will be uploaded to the project page http://schulte-mecklenbeck.com/flashlight today, too.

Have fun playing with it!

Posted in Papers.