January 11, 2019

A reproducible workflow for evaluation reports

This is a draft of a short article for the UKES Bulletin.

Problems with trying to reproduce or repeat tables and charts in reports?

Most evaluators have to produce at least a few tables and graphics in their evaluation reports. Here are some problems that might be familiar to you.

… when the client says:

  • ah, those 20 graphics at the end, can you change the font to Arial?
  • ah, those 20 graphics at the end, can you make sure they are based only on data from North Region?
  • in the text, it says there were 3789 refugees, but in the table the total is 3787? The Ambassador is totally hung up on this, we need a 100% definitive answer.
  • we’ve just received some updated data with three more cases, could you just update the whole report to take them into account?

… when the evaluator thinks:

  • oh no, I cleaned all the data by editing the spreadsheet, it took me all day to correct all the village names, there were so many different spellings of the same place, and now they’ve sent me a new version of the spreadsheet and I’ll have to do it all again!
  • oh no, I have a dozen copies of the data which I’ve cleaned and summarised in various ways and now I can’t find which version gave me the table on p. 22.
  • I hope I don’t have to hand over this report to someone else because it would take forever to explain how I did it. If I died tomorrow, no-one would be able to work it out.

If you don’t have these problems, you don’t need this article. If you do, read on.

The “reproducible workflow” as a solution

A good way to reduce some of these problems is to use a “reproducible” workflow. Personally, this workflow has saved me a lot of time and tears, though it did take a while to learn. And if I were an evaluation commissioner working on a project where the tables and charts, and perhaps statistical analyses, were central, I’d want my evaluator to follow a reproducible workflow.

“Reproducibility” has been a buzzword and a hashtag in the quantitative sciences for a while now [1], but it’s not so well-known amongst evaluators or evaluation commissioners.

Here’s how I describe the workflow to clients. There are lots of variants of the workflow, but basically two types: Nerd and Non-nerd. I’ll explain the Nerd Version and its advantages first.

Reproducible workflow, Nerd Version

All tables, graphics and statistical analyses in a reproducible report are produced from a “source file”. This is a text document which looks pretty much like the finished report [2]: it contains the actual text, headings etc. of the report, but in place of each table or chart there are text instructions for producing that table or chart. At the start of the document there is also some hidden text which gives instructions about which data file is needed and how to perform any data cleaning, recoding etc. Then, each time you save a new version of the report, a statistics program called “R” [3] (r-project.org) loads the raw data, cleans it, does the calculations, replaces the instructions with the corresponding tables and graphics, and produces a Word, PDF or web document as required.
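As a rough illustration of what happens on each rebuild, here is a minimal sketch written in Python rather than R. The data, the column names and the spelling corrections are all invented for the example. The point is only that the raw data is never edited by hand: cleaning, calculation and report text are regenerated from scratch every time.

```python
import csv
import io
from collections import Counter

# Hypothetical raw data, standing in for the untouched original file.
RAW_CSV = """village,region,refugees
Upper Mill,North,120
upper mill ,North,80
Upr. Mill,North,40
Lowfield,South,60
"""

# Cleaning rules live in code, not in hand-edited cells (invented spellings).
SPELLING_FIXES = {"upper mill": "Upper Mill", "upr. mill": "Upper Mill"}


def load_raw(text):
    """Read the raw data afresh; the original is never modified."""
    return list(csv.DictReader(io.StringIO(text)))


def clean(rows):
    """Return cleaned copies of the rows with standardised village names."""
    cleaned = []
    for row in rows:
        name = row["village"].strip()
        name = SPELLING_FIXES.get(name.lower(), name)
        cleaned.append({**row, "village": name, "refugees": int(row["refugees"])})
    return cleaned


def refugees_by_village(rows):
    """Sum the refugee counts per (cleaned) village name."""
    totals = Counter()
    for row in rows:
        totals[row["village"]] += row["refugees"]
    return dict(totals)


def build_report(raw_text):
    """Regenerate the report text and table from the raw data in one go."""
    table = refugees_by_village(clean(load_raw(raw_text)))
    total = sum(table.values())
    lines = [f"There were {total} refugees in total."]
    lines += [f"{village}: {count}" for village, count in sorted(table.items())]
    return "\n".join(lines)


report = build_report(RAW_CSV)
print(report)
```

Because the total in the text and the rows in the table come from the same calculation, they cannot drift apart, and a new version of the data only requires running the script again.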

Some advantages of the reproducible workflow

  • Transparency & verifiable accuracy because the client or others can, if desired, use the “source file” to repeat these calculations, see exactly how they are arrived at, and check for errors. It’s an audit trail for data.
  • Reduction of errors because there is no manual cutting-and-pasting or editing of data in the original data files, or manual editing of tables or graphics: the original data files (e.g. responses to questionnaires) are not touched at all “by human hand” but are loaded from scratch each time, and are cleaned by the software, following instructions written by the evaluator.
  • Faster progress through the project because the evaluator can produce preliminary and pre-final report versions for discussion even before all data collection is completed.
  • More consistent presentation because it is easy to make global changes to the way data is presented (e.g. to change the font size on dozens of graphics with one click).
  • Less work because sets of similar graphics or tables can be produced from a single template using a “loop”.
  • Less work because updating the report with new data takes seconds, not hours.
  • Freshness of external data because the latest versions of any data from other sources, e.g. World Bank data, will be loaded live from online repositories to ensure “freshness”.
  • Openness and extensibility because the client can extend the analyses in later months or years — or hand them over to others if desired — without having to start from scratch.
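
To make the “loop” and “global changes” points concrete, here is a small Python sketch (the records, region names and style setting are invented for illustration). One template function produces a table for every region, and a single setting controls the layout of all of them.

```python
# Hypothetical cleaned records: (region, village, refugees).
RECORDS = [
    ("North", "Upper Mill", 240),
    ("North", "Stonebridge", 35),
    ("South", "Lowfield", 60),
]

# One global setting controls the layout of every table.
STYLE = {"name_width": 14}


def region_table(region, records, style=STYLE):
    """Render one plain-text table from the shared template."""
    rows = [(village, n) for r, village, n in records if r == region]
    width = style["name_width"]
    lines = [f"Refugees by village, {region} Region"]
    lines += [f"{village:<{width}}{count:>6}" for village, count in rows]
    lines.append(f"{'Total':<{width}}{sum(n for _, n in rows):>6}")
    return "\n".join(lines)


# The "loop": one template, as many tables as there are regions.
regions = sorted({r for r, _, _ in RECORDS})
tables = {region: region_table(region, RECORDS) for region in regions}

for table in tables.values():
    print(table)
    print()
```

Changing `name_width` once changes every table, which is the text-only equivalent of changing the font on 20 graphics in one step.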

Reproducible workflow, Non-nerd Version

If you are not a nerd, and don’t dream of ever becoming one, you can stick to the tools you use already (Excel, Tableau, SPSS, Stata or whatever) and adopt a semi-reproducible workflow, and get many of the same advantages.

I’m not a nerd. How can I make my next report a bit more reproducible?

  • Create a READ ME file somewhere at the top of your project folders which outlines in plain language the steps a human would have to take to load the correct data, process it, and produce the tables and charts in your report. This file shouldn’t be chronological like a diary; it’s a sequence of tasks for reconstructing the key parts of your report, step by step, starting with the names and locations of the spreadsheets or other data files which you are using. It’s like a “source file”, but just for humans.
  • Never manually edit your original data file(s). Make a copy of the data and clean the copy manually, making a note, at least in outline, of what you did, in your READ ME file. Make sure your READ ME file specifies which is the one definitive version of the original data file as well as the name of the cleaned copy.
  • Make sure you have continuous backups of your data and calculations, e.g. by working within a folder which is synchronised with Dropbox or similar service. There is no point having reproducible instructions if you lose the instructions (or even the data).

Resources for the reproducible workflow, Nerd Version

There are lots of resources to help you and all of the important things you need are free!

  1. http://ropensci.github.io/reproducibility-guide/sections/introduction/

  2. Most people use a format called Markdown

  3. Or some alternative like Python

October 1, 2018

Articles and presentations related to Theorymaker

(There’s a newer version in my work-in-progress book on causal mapping)




Theorymaker poster — UKES, 2018

Longer blog posts

Theorymaker resources

Draft articles

August 29, 2018

Welcome to the Wiggle Room

Inspired by Judea Pearl’s new “The Book of Why”, and also by Nicky Case’s work on explorables, I’ve started work on an explorable web app.

Pearl talks about “wiggling” variables in a causal network to look at their causal effects. As someone brought up on Meccano, this really makes sense to me. It’s a philosophically interesting idea too.

So my new app is called The Wiggle Room. You can explore the causal effect of one or two influence variables on a consequence variable. Or you can just click the blue button to look at examples.

There are two twists:

  1. There are two main sliders, to show not only the level under intervention but also the base level: what would happen without the intervention. The intervention is actually the difference or “delta” between the two. And the effect on the consequence variable is also a difference; you can see this on the graphs. This helps us to “think in differences”, which I believe is essential for understanding causal networks like theories of change.

  2. The variables are not modelled using continuous numbers. Instead, they are intensity variables, which I’ve also called “lo-hi variables” elsewhere. They vary between a vague minimum and a vague maximum, a bit like a percentage. In the Wiggle Room, I call them percentages because these are familiar to most people.

The Wiggle Room offers pre-sets to construct the different possible functions between (sets of) influence variables and one consequence variable, in the form of “influence shapes” for specific links and “combination shapes” which govern how these influences are combined.

So these are the features:

Each variable (the one or two influence variables, as well as the consequence variable) can have one of several Types:

  • Type:
    • intensity or true-false, possibly also probabilistic true-false (e.g. a 75% probability that it is true)
    • combined with: normal or with negative, i.e. including a “negative bottom half”.
    • = 4 (or 6) possibilities.
    • plus, the “base” or non-intervention level of each variable can be specified and optionally the intervention level. In the case of endogenous variables, these levels are calculated rather than specified.

Each influence can have different shapes and strengths:

  • Shape:
    • linear, slow-start, quick-start, threshold, U-shaped, S-shaped (there could be others, these are the most obvious)
    • combined with: normal or reversed.
    • = 12 possibilities.
  • Strength: 0-100
    • if you have two influence variables, reducing the strength of one makes it less important than the other
    • if no influence variable has a strength of 100%, there is some ambiguity left about the level of the consequence variable. So another “residual” slider appears for the consequence variable, see below.

Plus, the consequence variable (in addition to Type) can have:

  • Combination (how the influence variables combine):
    • soft add, hard add, multiply, smallest, largest, average and similarity
    • the consequence variable can also be “flipped” so 0% becomes 100% and vice-versa.
    • = 14 possibilities
    • strength of (unspecified) residual influences. This is only relevant if the maximum strength of the influence variables is less than 100, i.e. they don’t control it completely. The influence variables share influence between them, so if there are two, with strengths of 80 and 40, the former gets twice the influence of the latter and there is 20 left to “explain” with residual influences.
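
The arithmetic in that last bullet can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the app’s actual code: the function name is invented, the residual is taken to be 100 minus the largest strength, and the rest is shared between the influence variables in proportion to their strengths.

```python
def influence_shares(strengths):
    """Split 100 points of influence between influence variables and residual.

    Assumptions (not taken from the app's code): the residual gets 100 minus
    the largest strength, and the remainder is shared in proportion to the
    strengths. At least one strength must be above 0.
    """
    residual = 100 - max(strengths)
    total = sum(strengths)
    shares = [(100 - residual) * s / total for s in strengths]
    return shares, residual


shares, residual = influence_shares([80, 40])
print(shares, residual)
```

With strengths of 80 and 40, the residual is 20 and the first variable’s share is twice the second’s, as in the example above.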

For the second influence variable (on the left) you can also specify what controls it:

  • Control
    • External factor
    • Our intervention (this links the variable to the first variable)
    • External intervention (in this case, you can explore not only “what if we do x” but also “what if they do y”)

I hope these various settings are fairly intuitive. But they give a bewildering variety of different combinations.

So next time you see someone hasn’t specified the nature of the links leading into a variable in a theory of change, and you’re supposed to guess it, ask them which of these possibilities it is!

Intensity variables

These variables are mostly intensity variables: they can vary between a rough minimum, which we can think of as 0%, and a rough maximum, which we can think of as 100%. (You can also try variables which have a minimum of -100%.) In the Wiggle Room you can also find true-false variables, which we see as being a special case of intensity variables. You can think of them as just having the values 0% and 100%. You can also interpret a value like 50% as meaning a 50% chance of the variable being true.

The variable types also do not need to be restricted to intensity and true-false variables. But these have the significant advantage (especially for people working with theories of change) that interventions and their effects can be conveniently expressed as percentages of the total range of the variable in question.


In practice, intensity variables can be constructed and presented using “rubrics”: by describing (using rich and concrete language) four or five of the different levels they can take, from minimum (0%) to maximum (100%). The use of rubrics is described well elsewhere [1]. This is a more fundamental and important task than the usual practice of trying to pin down a variable with just a brief title (e.g. “inter-ethnic trust”) and then defining it post-hoc and implicitly via the indicators which have been selected for it.

Calculating effects in a causal network: Theorymaker

I think this is a unique explanation of causal effects because it does not rely on statistics - it’s about the causal connections which explain the statistics. Evaluators often have direct (if unreliable) information about causal effects which have nothing to do with correlations. Stakeholders talk to us about causal links and only rarely about correlations. We need to have a way to process this kind of information, and statistics can’t help us.

In the Wiggle Room we only look at simple Theories: single steps in causal networks from one or two variables to another. In a larger causal network like a Theory of Change, there will be several such connections. With intensity variables, we can easily calculate how causal effects ripple through a network. If an intervention on a variable A has a 50% influence on B, and the resulting effect on B has a 50% influence on C, the causal influence of the intervention on C, via B, will be 25% (50% of 50%).
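
As a worked version of that calculation, here is a minimal Python sketch. It assumes a single chain in which every link simply passes on a fixed fraction of the change it receives; the function name is invented for the example.

```python
from math import prod


def chained_influence(link_influences):
    """Influence of the first variable on the last along one causal chain.

    Assumption: each link passes on a fixed fraction of the change it
    receives, so the fractions multiply.
    """
    return prod(link_influences)


a_on_c = chained_influence([0.5, 0.5])
print(a_on_c)  # 0.25

# The same rule shows how quickly effects shrink along longer chains.
five_links = chained_influence([0.5] * 5)
print(five_links)  # 0.03125
```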

So I’m integrating the same algorithm into the experimental version of my existing Theory of Change visualisation tool Theorymaker.

In the existing version you can express causal relationships using indentation:


… and you can add styling like this…

Happiness; colour=orange3

In the new version, you can actually build a causal model using the same ideas.

Happiness; combination=largest
 Relationships; base=.5; intervention=.8
 Money; combination=multiply
  Luck; base=.5; intervention=.5
  Qualifications; base=.5; intervention=.5

Adding e.g. combination=multiply after a variable changes the way that variable combines its influences.
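
To see what such a model computes, here is a minimal Python sketch of the example above. It is not Theorymaker’s own algorithm: it assumes that “largest” means taking the maximum and “multiply” means taking the product of the incoming levels, that every link is linear and at full strength, and it uses an invented nested-dictionary layout in place of the indented text.

```python
from math import prod

# Assumed meanings of two combination names (not taken from Theorymaker's code).
COMBINATIONS = {
    "largest": lambda levels: max(levels),
    "multiply": lambda levels: prod(levels),
}

# The model as nested data: leaves carry levels, parents carry a combination.
MODEL = {
    "name": "Happiness", "combination": "largest", "influences": [
        {"name": "Relationships", "base": 0.5, "intervention": 0.8},
        {"name": "Money", "combination": "multiply", "influences": [
            {"name": "Luck", "base": 0.5, "intervention": 0.5},
            {"name": "Qualifications", "base": 0.5, "intervention": 0.5},
        ]},
    ],
}


def level(variable, scenario):
    """Level of a variable under the 'base' or 'intervention' scenario."""
    if "influences" not in variable:
        return variable[scenario]  # specified directly
    levels = [level(child, scenario) for child in variable["influences"]]
    return COMBINATIONS[variable["combination"]](levels)  # calculated


base = level(MODEL, "base")
with_intervention = level(MODEL, "intervention")
effect = with_intervention - base  # the effect is a difference
print(base, with_intervention, effect)
```

Under these assumptions Money stays at 0.25 in both scenarios, so Happiness follows Relationships from 0.5 to 0.8, and the effect of the intervention is the difference between the two.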

Logic gates

Rick Davies asked about implementing logic gates. All of them are possible with this app.

So a true-false variable like the ones in logic gates is realised just by considering the values 0% and 100% of an intensity variable. (If you want, you can consider the values in between as probabilities of true-false values or fuzzy set membership. I think the generalisation of these combinations to intensity variables is really interesting in terms of Theories of Change and also much more applicable in the real world than just on/off situations.)

In Causal Explorer, you can select the type “true-false” for your variables, but the only thing which changes (at the moment) is that the axis labels become false and true. I might consider a more logic-gate-like display. Then:

  • NOT: with a single influence variable, select “reversed” in the consequence variable section.
  • For all the others, you need to click “Include second influence variable” and then choose a value from “How do the two influence variables combine?”:
    • AND: multiply or smallest (and select “reversed” for NAND)
    • OR: soft add, hard add, or largest (and select “reversed” for NOR)
    • XNOR: similarity (and select “reversed” for XOR)
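
The mapping above can be checked with a short Python sketch. The formulas are assumptions chosen to match the names (for example, soft add as a + b - ab and similarity as 1 - |a - b|); the app’s own definitions may differ between 0 and 1, but any reasonable definition should agree at the corner values 0 and 1.

```python
from itertools import product

# Assumed definitions for levels between 0 and 1; the app's formulas may differ.
COMBINATIONS = {
    "multiply": lambda a, b: a * b,
    "smallest": min,
    "soft add": lambda a, b: a + b - a * b,
    "hard add": lambda a, b: min(1, a + b),
    "largest": max,
    "similarity": lambda a, b: 1 - abs(a - b),
}


def reverse(level):
    """'Reversed' consequence variable: 0 becomes 1 and vice versa."""
    return 1 - level


def truth_table(combine, reversed_=False):
    """Outputs for the inputs (0, 0), (0, 1), (1, 0), (1, 1)."""
    outputs = []
    for a, b in product((0, 1), repeat=2):
        value = combine(a, b)
        outputs.append(reverse(value) if reversed_ else value)
    return outputs


for name, combine in COMBINATIONS.items():
    print(f"{name:<10} {truth_table(combine)} reversed: {truth_table(combine, True)}")
```

Restricted to 0 and 1, multiply and smallest behave as AND, the two adds and largest as OR, and similarity as XNOR, with the reversed versions giving NAND, NOR and XOR.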

In Theorymaker3, you can type something like this (try copying and pasting the whole lot)

variable: type=true-false #this switches all the variables to true-false rather than intensity: the bar at the bottom of the variables changes to show just two values. 
variable: base=hide # hide the base values

a; combination=similarity # or whichever of the above combinations you want to try
 b; intervention=0 # fiddle with these (set them to 0 or 1) to see what happens
 c; intervention=1 # fiddle with these (set them to 0 or 1) to see what happens

I’ve also chosen functions which are more general in the sense that they can combine any number of variables, not just two (though in Causal Explorer you can’t visualise more than two).


These building blocks are certainly not capable of constructing all possible functions, far from it. Alternatives or additions to the influence shapes and the combination shapes are perfectly possible. (The R version of Theorymaker provides the possibility to describe a function directly without any reference to these shapes.)

Most importantly, these tools are primarily for visualisation, exploring and “getting a feel for” possible causal relationships. They shouldn’t be used to try to get spurious certainty (“look, I put in a few vague ideas and numbers came out!”) where there is none. In most real-world cases, a causal chain of more than one or two links will have its effects reduced to almost nothing by noise. These tools are supposed to help model this kind of uncertainty.


Thanks to Rick Davies for suggesting that I add S-shaped / sigmoid to the list of shapes, and for the discussion, also with Martin Klein, which led me to document the logic-gate possibilities.

  1. King, J., McKegg, K., Oakden, J., & Wehipeihana, N. (2013). Rubrics: A method for surfacing values and improving the credibility of evaluation. Journal of Multidisciplinary Evaluation, 9(21), 11–20. Retrieved from http://journals.sfu.ca/jmde/index.php/jmde_1/article/viewFile/374/373


This blog by Steve Powell is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License, syndicated on r-bloggers and powered by Blot.