January 11, 2019

A starter kit for reproducible research with R

This accompanies my draft short article for the UKES Bulletin: A reproducible workflow for evaluation reports.

Here are just two files which together get you started with a bare-bones reproducible research project.

  • Download and install R and RStudio.
  • Download this little example of a reproducible source file and an accompanying example Excel file describing the heights and ages of a bunch of boys and girls in a school.
  • Save them together in a new folder.
  • Double-click the source file to open it in RStudio. The source file has a couple of commands to read the Excel file, clean it and make a table and a chart, as well as plain text which becomes the headings and body of the document.
  • Press the blue “Knit” button in RStudio and you will get a beautiful Word document.
  • Try editing the text and then press “Knit” again to see what happens.
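
The example source file itself is written for R and is not reproduced here. As a rough illustration of the three steps it performs (read the data, clean it, make a table), here is a minimal sketch in Python rather than R. Everything in it is an assumption made for illustration: the sample rows, the column names and the helper functions are invented, and a small CSV string stands in for the Excel file so that the sketch runs without any extra packages.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Invented sample data standing in for the example Excel file.
RAW = """name,sex,age,height_cm
Ann,girl,11,142
Ben, Boy ,12,150
Cara,GIRL,12,
Dev,boy,11,145
Eve,girl,12,148
"""


def clean_row(row):
    """Return a tidied copy of one row, or None if the height is missing."""
    height = row["height_cm"].strip()
    if not height:
        return None
    return {
        "name": row["name"].strip(),
        "sex": row["sex"].strip().lower(),  # " Boy " and "boy" become the same label
        "age": int(row["age"]),
        "height_cm": float(height),
    }


def load_clean(text):
    """Read the raw text and keep only the rows that survive cleaning."""
    rows = csv.DictReader(io.StringIO(text))
    return [r for r in map(clean_row, rows) if r is not None]


def mean_height_by_sex(rows):
    """Build the summary table: mean height for each sex."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["sex"]].append(r["height_cm"])
    return {sex: mean(values) for sex, values in sorted(groups.items())}


data = load_clean(RAW)
table = mean_height_by_sex(data)
for sex, avg in table.items():
    print(f"{sex:<5} {avg:6.1f}")
```

The raw text is never edited by hand. Every cleaning decision lives in `clean_row`, so re-running the script on an updated file repeats exactly the same steps.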
January 11, 2019

A reproducible workflow for evaluation reports

This is a draft of a short article for the UKES Bulletin.

Problems with trying to reproduce or repeat tables and charts in reports?

Most evaluators have to produce at least a few tables and graphics in their evaluation reports. Here are some problems that might be familiar to you.

… when the client says:

  • ah, those 20 graphics at the end, can you change the font to Arial?
  • ah, those 20 graphics at the end, can you make sure they are based only on data from North Region?
  • in the text, it says there were 3789 refugees, but in the table the total is 3787? The Ambassador is totally hung up on this, we need a 100% definitive answer.
  • we’ve just received some updated data with three more cases, could you just update the whole report to take them into account?

… when the evaluator thinks:

  • oh no, I cleaned all the data by editing the spreadsheet, it took me all day to correct all the village names, there were so many different spellings of the same place, and now they’ve sent me a new version of the spreadsheet and I’ll have to do it all again!
  • oh no, I have a dozen copies of the data which I’ve cleaned and summarised in various ways and now I can’t find which version gave me the table on p. 22.
  • I hope I don’t have to hand over this report to someone else because it would take forever to explain how I did it. If I died tomorrow, no-one would be able to work it out.

If you don’t have these problems, you don’t need this article. If you do, read on.

The “reproducible workflow” as a solution

A good way to reduce some of these problems is to use a “reproducible” workflow. Personally, this workflow has saved me a lot of time and tears, though it did take a while to learn. And if I were an evaluation commissioner working on a project where the tables and charts, and perhaps statistical analyses, were central, I’d want my evaluator to follow a reproducible workflow.

“Reproducibility” has been a buzzword and a hashtag in the quantitative sciences for a while now [1], but it’s not so well-known amongst evaluators or evaluation commissioners.

Here’s how I describe the workflow to clients. There are lots of variants of the workflow, but basically two types: Nerd and Non-nerd. I’ll explain the Nerd Version and its advantages first.

Reproducible workflow, Nerd Version

All tables, graphics and statistical analyses in a reproducible report are produced from a “source file”. This is a text document which looks pretty much like the finished report [2]: it contains the actual text and headings etc. of the report, but in the place of each table or chart there are text instructions for producing that table or chart. At the start of the document there is also some hidden text which gives instructions about which data file is needed, and how to perform any data cleaning, recoding etc. Then each time you save a new version of the report, a statistics program called “R” [3] (r-project.org) loads the raw data, cleans it, does the calculations, replaces the code with the corresponding tables and graphics, and produces a Word, PDF or web document as required.
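
To make the idea of a source file concrete, here is a toy sketch of the principle, written in Python rather than R. It is not how R or Markdown tools work internally. The `{{...}}` placeholder syntax, the sample records and the function names are all invented for illustration. The point is only that the report text and the instructions live in one file, and that every number and table is recomputed from the data each time the report is built.

```python
import re
from statistics import mean

# Invented records; in a real project these would be loaded from the raw data file.
RECORDS = [
    {"sex": "girl", "height_cm": 142.0},
    {"sex": "boy", "height_cm": 150.0},
    {"sex": "boy", "height_cm": 145.0},
    {"sex": "girl", "height_cm": 148.0},
]

# A toy "source file": ordinary report text with placeholders where results belong.
SOURCE = """Heights report

We measured {{n_children}} children.

{{height_table}}
"""


def n_children(records):
    """Number quoted in the running text."""
    return str(len(records))


def height_table(records):
    """Plain-text table of mean height by sex."""
    lines = ["sex    mean height (cm)"]
    for sex in sorted({r["sex"] for r in records}):
        heights = [r["height_cm"] for r in records if r["sex"] == sex]
        lines.append(f"{sex:<6} {mean(heights):.1f}")
    return "\n".join(lines)


# Placeholder name -> function that produces its content.
CHUNKS = {"n_children": n_children, "height_table": height_table}


def render(source, records):
    """Replace each {{name}} placeholder with the output of its chunk."""
    def run_chunk(match):
        return CHUNKS[match.group(1)](records)
    return re.sub(r"\{\{(\w+)\}\}", run_chunk, source)


report = render(SOURCE, RECORDS)
print(report)
```

Because the count in the sentence and the rows in the table come from the same records, the text and the table cannot drift apart. If new cases arrive, calling `render` again updates both at once.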

Some advantages of the reproducible workflow

  • Transparency & verifiable accuracy because the client or others can, if desired, use the “source file” to repeat these calculations, see exactly how they are arrived at, and check for errors. It’s an audit trail for data.
  • Reduction of errors because there is no manual cutting-and-pasting or editing of data in the original data files, or manual editing of tables or graphics: the original data files (e.g. responses to questionnaires) are not touched at all “by human hand” but are loaded from scratch each time, and are cleaned by the software, following instructions written by the evaluator.
  • Faster progress through the project because the evaluator can produce preliminary and pre-final report versions for discussion even before all data collection is completed.
  • More consistent presentation because it is easy to make global changes to the way data is presented (e.g. to change the font size on dozens of graphics with one click).
  • Less work because sets of similar graphics or tables can be produced from a single template using a “loop”.
  • Less work because updating the report with new data takes seconds, not hours.
  • Freshness of external data because the latest versions of any data from other sources, e.g. World Bank data, will be loaded live from online repositories to ensure “freshness”.
  • Openness and extensibility because the client can extend the analyses in later months or years — or hand them over to others if desired — without having to start from scratch.
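
Two of these advantages, cleaning by written instructions and producing similar tables with a loop, can be sketched in a few lines. The sketch below is in Python rather than R, and the records, village names, spelling rules and function names are all invented for illustration.

```python
from collections import Counter

# Invented raw records; this list is only ever read, never edited.
RAW = [
    {"region": "North", "village": "Riverton"},
    {"region": "North", "village": "riverton "},
    {"region": "North", "village": "Hill Side"},
    {"region": "South", "village": "Hillside"},
    {"region": "South", "village": "HILLSIDE"},
    {"region": "South", "village": "Riverton"},
]

# Cleaning rules written down once: variant spelling -> agreed spelling.
SPELLING_FIXES = {"hill side": "hillside"}


def clean_village(name):
    """Normalise spacing and case, then apply the known spelling fixes."""
    key = " ".join(name.split()).lower()
    return SPELLING_FIXES.get(key, key).title()


def clean(records):
    """Return cleaned copies of the records, leaving the originals untouched."""
    return [{**r, "village": clean_village(r["village"])} for r in records]


def village_table(records, region):
    """Single template used for every region's table."""
    counts = Counter(r["village"] for r in records if r["region"] == region)
    lines = [f"Respondents by village, {region} Region"]
    for village, n in sorted(counts.items()):
        lines.append(f"  {village:<10} {n}")
    return "\n".join(lines)


cleaned = clean(RAW)
regions = sorted({r["region"] for r in cleaned})

# The loop: one table per region, all produced from the same template.
tables = {region: village_table(cleaned, region) for region in regions}
for table in tables.values():
    print(table)
    print()
```

When a new version of the spreadsheet arrives, the same spelling rules are applied again automatically. A change to the table layout in `village_table` shows up in every region's table at once.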

Reproducible workflow, Non-nerd Version

If you are not a nerd, and don’t dream of ever becoming one, you can stick to the tools you use already (Excel, Tableau, SPSS, Stata or whatever) and adopt a semi-reproducible workflow, and get many of the same advantages.

I’m not a nerd. How can I make my next report a bit more reproducible?

  • Create a READ ME file somewhere at the top of your project folders which explains in plain language an outline of the steps a human would have to take to load the correct data, process it, and produce the tables and charts in your report. This file shouldn’t be chronological like a diary; it’s a sequence of tasks for reconstructing the key parts of your report, step by step (starting by giving the names and locations of the spreadsheets or other data files which you are using). It’s like a “source file” but just for humans.
  • Never manually edit your original data file(s). Make a copy of the data and clean the copy manually, making a note, at least in outline, of what you did, in your READ ME file. Make sure your READ ME file specifies which is the one definitive version of the original data file as well as the name of the cleaned copy.
  • Make sure you have continuous backups of your data and calculations, e.g. by working within a folder which is synchronised with Dropbox or a similar service. There is no point having reproducible instructions if you lose the instructions (or even the data).

Resources for the reproducible workflow, Nerd Version

There are lots of resources to help you and all of the important things you need are free!

  1. http://ropensci.github.io/reproducibility-guide/sections/introduction/

  2. Most people use a format called Markdown

  3. Or some alternative like Python

This blog by Steve Powell is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, syndicated on r-bloggers and powered by Blot.