2. The value of trials
It is one thing having junk departments turning out junk sociologists, but quite another to be turning out junk engineers. If you think this is a point of no importance, imagine the next time you enter a lift... (Brignell 2000, p.12)
Quite properly, our society demands that scientific, design and engineering products are fully tested and made safe before they are unleashed for public use. No one wants, for themselves or others, to fly in a plane that might crash or use a household appliance that might electrocute them. No one wants to eat a food or take a medicine that will poison them. The relative safety of these items is assured by testing them rigorously, and the fact that some planes still crash and some medicines have unintended consequences is an argument for better testing not for no testing. Perhaps in some areas of social science we might agree with Brignell (above) that testing does not really matter since there are no real consequences. Actually, I do not agree for the ethical reasons advanced in section 4.1. But in education there can surely be no doubt that it is intolerable that wide-ranging policy interventions take place routinely, using huge public budgets and affecting the lives and futures of millions of people without anything like the level of testing used for marketing pet food or soap powder.
Governments, particularly in the US, have apparently become increasingly interested in at least talking about the quality of evidence regarding the effectiveness of alternative practices, programmes and policies. The US Congress’s Committee on Education and the Work Force, for example, has been ‘concerned about the wide dissemination of flawed, untested educational initiatives that can be detrimental to children’ (Boruch and Mosteller 2002, p.1). History suggests that this concern is not misplaced, since there are many examples of education interventions that have been widely disseminated on the basis that they are driven by good intentions, seem plausible and are unlikely to do any harm, yet when they have been rigorously evaluated have been found to be ineffective or positively harmful. Such interventions include the ‘Scared Straight’ programme, which aimed to deter delinquent children from a life of crime and was well received and widely implemented, yet in seven good quality trials was found to increase delinquency rates (Petrosino et al. 2000). They also include the Bike-Ed training programme to reduce bicycle accidents among children, which was actually found to increase the risk of injury (Carlin et al. 2000). And, of course, there have been much larger national interventions that have never been tested properly against other uses of the same money or even against what they replaced (including the introduction of the 11+ examination, the creation of specialist schools, or the new diplomas for 14-19 education).
In clinical medicine, the randomised controlled trial (RCT) is well established as the best way of identifying the relative impact of alternative interventions on predetermined outcomes. The salience of this research design is largely due to the random allocation of participants to the alternative treatments in relatively controlled conditions, such that any difference in outcomes between the groups is due either to chance, the likelihood of which can be quantified, or due to the treatment difference. The RCT is most easily applied to the measurement of the efficacy of simple, well-defined interventions, delivered in an ideal research setting, and where there is a short-term impact on an objectively measured outcome. The classic application of the RCT design in clinical medicine is the drug trial where, for example, the relative efficacy of two alternative antihypertensive drugs can be established by recruiting a sample of hypertensive patients, randomly allocating half to receive drug A and half to receive drug B, and then measuring the blood pressure of the patients at some predetermined follow-up point(s). The difference in their mean blood pressure at follow-up should be an unbiased estimate of the difference in treatment efficacy of the two drugs, and appropriate statistical methods can be used to determine how likely it is that this observed difference is due to chance.
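The logic of the two-drug trial described above can be sketched in a few lines of code. What follows is only an illustrative simulation, not an analysis of any real trial: the blood pressure values, the size of the difference between the drugs, the sample size and the function names (run_trial, permutation_p_value) are all assumptions made for the example. The sketch allocates simulated patients at random to drug A or drug B, takes the difference in mean follow-up blood pressure, and then uses a simple permutation test to estimate how often a difference at least that large would arise from chance allocation alone.

```python
import random


def mean(values):
    return sum(values) / len(values)


def run_trial(n_patients=200, drug_b_extra_effect=-10.0, seed=1):
    """Simulate a two-arm trial. Every number here is invented for illustration."""
    rng = random.Random(seed)
    # Hypothetical follow-up blood pressure each patient would have on drug A.
    bp_on_a = [rng.gauss(150, 12) for _ in range(n_patients)]
    # Random allocation: shuffle the patients, first half get A, the rest get B.
    order = list(range(n_patients))
    rng.shuffle(order)
    half = n_patients // 2
    bp_a = [bp_on_a[i] for i in order[:half]]
    # Assumption: drug B lowers blood pressure by a further fixed amount.
    bp_b = [bp_on_a[i] + drug_b_extra_effect for i in order[half:]]
    return bp_a, bp_b


def permutation_p_value(bp_a, bp_b, n_shuffles=2000, seed=2):
    """Share of random relabellings giving a difference at least as large as observed."""
    rng = random.Random(seed)
    observed = mean(bp_b) - mean(bp_a)
    pooled = bp_a + bp_b
    n_a = len(bp_a)
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # pretend the group labels were assigned differently
        diff = mean(pooled[n_a:]) - mean(pooled[:n_a])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_shuffles


bp_a, bp_b = run_trial()
observed, p = permutation_p_value(bp_a, bp_b)
print(f"difference in mean blood pressure (B - A): {observed:.2f}")
print(f"share of chance relabellings at least as extreme: {p:.3f}")
```

The permutation test is used here only because it makes the role of random allocation visible; a conventional trial analysis might use other methods.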
In education, this approach to research is almost certainly under-used, but valuable and easily possible (Fitz-Gibbon 2001). Goodson (1999) randomised Year 2 pupils to undergo either formal or informal testing, and found that the pupils performed better when tested in the informal normal working environment, rather than formal test conditions. Butler (1988) randomised students to three groups, which after testing received either just their numerical grade, or a more detailed comment on their performance, or both of these. Those receiving just comments performed better in subsequent tests, particularly among the sub-group of lower achievers. These studies have provided clear answers to important questions, and are examples of where RCTs can be used as the most effective method in empirically driven knowledge development. The fact that the results have not always been allowed to affect national policies on assessment looks curious in light of current government demands for just this kind of evidence of what works, to form the basis of evidence-based policy-making. It is crucial that policy-makers learn to appreciate the difference in policy and practice terms between evidence generated by a trial and that generated by passive designs (Cook and Gorard 2007). Understanding this difference also helps us to understand the threats to validity, strengths and weaknesses of other designs.
Working towards an experimental design can be an important part of any research enterprise, even where an experiment is not envisaged or even possible. Sometimes a true experiment, such as a large randomised controlled trial, is not necessary, and sometimes it is not possible. An experiment is not necessary in a variety of research situations, including where the research question does not demand it, or where a proposed intervention presents no prima facie case for extended trialling. An experiment may also not be possible in a variety of research situations, including where the intervention has complete coverage, or has already been implemented for a long time, or where it would be impossible to allocate cases or clusters at random. However, a ‘thought experiment’ is always possible, in which the researchers consider no practical or ethical constraints except answering the research question as clearly as possible. Knowing the format and power of experiments gives us a yardstick against which to measure what we do instead, and even helps us to design what we do better. In then having to compromise from this ‘ideal’ to conduct the actual research, the researcher may come to realise how much more they could be doing. Another example is a natural experiment where we design an 'experiment' without intervention, using the same design as a standard experiment but making use of a naturally occurring phenomenon. There might then be more natural experimental designs, more practitioner experiments, and surely more studies with appropriate comparison groups rather than no explicit comparison at all (a situation which reviews show is the norm for UK academic research in education, see below). There might also be more humility about the quality of the findings emanating from the compromise design.
Cook, T. and Gorard, S. (2007) What counts and what should count as evidence, pp.33-49 in OECD (Eds.) Evidence in education: Linking research and policy, Paris: OECD
2.1 Need for comparators
A local paper ran a front page story claiming that Cardiff was the worst area in Wales for unpaid television licences - it had 'topped the league of shame for the second year running'. The evidence for this proposition was that there were more people in Cardiff caught using TV without a licence than in any other 'area' of Wales (and it is important for readers to know that Cardiff is the largest city in Wales). Not surprisingly the next worst area in the league of shame was Swansea (the second city of Wales), followed by Newport, then Wrexham, and so on. Everyone that I have told this story to laughs at the absurdity of the claim, and points out that the claim would have to be proportionate to the population of each area. Cardiff may then still be the worst, but at present we would have to assume that, as the most populous unitary authority in Wales, Cardiff would tend to have the most of any raw-score indicator (including, presumably, the number of people using TV with a licence). Why does this matter? It matters because very similar propositions to the newspaper story are made routinely in social science research, and rather than being sifted out in peer review, they are publicised and often feted.
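The difference between a raw count and a proportionate comparison can be shown with a small sketch. The figures below are invented purely for illustration and are not the newspaper's data or real population figures; the area labels and the function name rate_per_1000 are likewise hypothetical. The point is only that a league table built on counts can reverse once each count is divided by the relevant population.

```python
# Invented figures for illustration only; not real licence or population data.
areas = {
    "Area A (largest)": {"population": 300_000, "caught": 900},
    "Area B": {"population": 200_000, "caught": 700},
    "Area C (smallest)": {"population": 50_000, "caught": 250},
}


def rate_per_1000(area):
    """People caught without a licence per 1,000 residents."""
    return 1000 * area["caught"] / area["population"]


# League table by raw count, as in the newspaper story.
by_count = sorted(areas, key=lambda name: areas[name]["caught"], reverse=True)
# League table by rate, which supplies the missing comparator (population).
by_rate = sorted(areas, key=lambda name: rate_per_1000(areas[name]), reverse=True)

print("worst by raw count:", by_count)
print("worst by rate per 1,000:", by_rate)
```

With these made-up numbers the largest area tops the table of raw counts but comes last once population is taken into account.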
One of the most pervasive, and hard to eliminate, errors in simple data analysis is the omission of a crucial comparator. This allows writers to present one set of results as though they were in contrast to another, as yet, unspecified set. If done smoothly, many readers will never notice the error. Studies with missing comparators are widespread, almost by design, in a lot of what is termed 'qualitative' research. Social exclusion, for example, is commonly investigated through a consideration of the supposedly excluded group by itself, giving the reader no idea of how different the experience of this group actually is from the implicit 'included' group (who are often not even defined).
As a simple example of the power of this error in dealing with numeric data, look at the following question: 'A large survey discovered that fewer than 5% of 21 year-olds who had passed one or more A-levels were unemployed. Why is this not necessarily evidence that passing A-levels helps people to avoid unemployment?'. When I used this as part of an examination for a cohort of 245 second and third year Social Science undergraduates I received some very imaginative replies about the difficulties of establishing comparable qualifications for A-levels, and alternate definitions of unemployment depending on whether full-time undergraduates themselves could be included in the study. Only two candidates pointed out that they would, as a matter of course, require the equivalent rate of unemployment for 21 year-olds without A levels. That is the power of the missing comparator. So widespread has this error become that it can almost be accounted a technique, used most prominently by politicians and by the media in reporting crises in public policy (e.g. Ghouri 1999). One of the main values of trials is therefore that they force us to focus on a fair comparison.
Fitz-Gibbon, C. (2003) Milestones en route to evidence-based policies, Research Papers in Education, 18, 4, 313-329
2.2 Need for good match
Even in the minority of cases where a comparator of any sort is used, it is widespread in the literature for the comparator to be inappropriate. Primary school pupils in the UK are taught about the concept of a fair test, but the idea seems to have evaded most education researchers. Of course, in some studies the comparators are pre-determined or outside the control of the researchers. In these cases it is essential that researchers consider and report the ways in which the comparison might not be fair and so how it might affect or undermine the results.
For example, Hammond and Yeshanew (2007) looked at differences in attainment between schools who took part in a detailed feedback process called PASS, run by NFER, and those who did not take part or who received other kinds of feedback. They concluded that ‘Schools who participated in PASS showed a significant difference (p<0.05) in attainment compared to those who received feedback as part of another project’ (p.103) and so entitled their paper ‘The impact of feedback’. These authors therefore make two common mistakes (see section 2.3). They assume that differences are caused by (impact of) the feedback even though the causal mechanism remains untested, and they use a p-value based on sampling theory to decide on the substantive importance of the difference, even though they also report that ‘no actual samples have been drawn’ (p.102). They seem to think that multi-level modelling, the complex correlation technique they use, means that they have tested causality and that they can ignore the assumptions of sampling theory. However, perhaps the most important flaw in their study is that the schools taking part in PASS differed from the others in at least two important respects. They volunteered to take part in the scheme, and they paid NFER to do so. This means that pre-existing differences in motivation and resources need to be considered as possible explanations of any subsequent differences found in attainment. But the authors are silent on these. These errors are not unusual, and the paper is used here simply as an illustration of the wider phenomenon that goes largely unremarked in review. In addition, it might be considered relevant that this evaluation of PASS was conducted and written by employees of NFER, the organisation selling use of PASS to schools in the first place.
It is possible to try to make better matched comparisons than this, even when the comparators or groups are not determined by the researchers. One approach growing in popularity is termed propensity score matching. This approach builds statistical profiles to estimate how likely individuals are to be part of the treatment group (Rosenbaum and Rubin 1983), but like its competitors (such as the contextualisation used by DCSF in England for supposedly value-added comparisons of school performance) it is considerably weaker than the randomisation routinely used in trials (see section 3.1, and see section 6.1 for a radical alternative).
One of the many weak counter-arguments to the use of trials is that unlike science and medicine, the social world of education is so complex that a host of interactions (such as pupil reactions to the sex or age of the teacher) and personal characteristics (such as motivation) cannot be controlled for. The argument therefore suggests that randomised controlled trials and the like cannot be used in most areas of education. It is all too complicated. This argument is quite wrong and betrays a shocking lack of understanding of the workings of trials. Leaving aside the issue of whether genetics and particle physics really are much easier to study than social science, the matching control of RCTs is completely different in approach to propensity scores or contextualisation. They are randomised controlled (section 3.1). Instead of deliberately matching the two (or more) groups in an experiment in terms of known characteristics and so being open to the charge of a poor match in terms of unknown characteristics, RCTs do not have to consider any characteristics (known or unknown). If we take a large number of cases and allocate them randomly to two groups we are more likely to end up with unbiased groups than if we try to manufacture them in another way. RCTs use randomisation and large numbers of cases precisely because of the complexity of educational interactions. The complexity of the social world is an argument for using RCTs rather than the reverse.
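The claim that random allocation of a large number of cases tends to produce unbiased groups, even on characteristics that nobody has measured, can be illustrated with a short simulation. This is a sketch under stated assumptions rather than a model of any real study: the 'motivation' score, the number of pupils and the volunteering probabilities are all invented. It compares random allocation with a self-selected scheme of the kind discussed above, in which the more motivated are more likely to take part.

```python
import random


def mean(values):
    return sum(values) / len(values)


rng = random.Random(42)

# Hypothetical unmeasured 'motivation' score for 10,000 pupils (invented data).
motivation = [rng.gauss(0, 1) for _ in range(10_000)]

# Random allocation: the researcher never looks at motivation at all.
shuffled = motivation[:]
rng.shuffle(shuffled)
random_treatment, random_control = shuffled[:5_000], shuffled[5_000:]

# Self-selection: assume more motivated pupils are more likely to volunteer.
volunteers, others = [], []
for m in motivation:
    p_volunteer = 0.8 if m > 0 else 0.2  # assumed probabilities
    if rng.random() < p_volunteer:
        volunteers.append(m)
    else:
        others.append(m)

random_gap = mean(random_treatment) - mean(random_control)
selected_gap = mean(volunteers) - mean(others)

print(f"motivation gap under random allocation: {random_gap:+.3f}")
print(f"motivation gap under self-selection:    {selected_gap:+.3f}")
```

Under these assumptions the randomised groups differ very little in motivation, while the self-selected groups differ substantially before any intervention has taken place.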
2.3 Need for causal sequence
One of the main claims of trial designs, their raison d’être in fact, is that they provide a good unbiased test of a causal claim. Intriguingly, despite the lack of experimental work in education (section 1) and the apparent decline of intervention studies over time, the proportion of papers making causal claims has actually grown over time, at least in the US (Robinson et al. 2007). The growth has been greater in studies with no intervention. In particular, complex statistical approaches such as HLM (multi-level modelling) and structural equation modelling are routinely misunderstood by researchers as testing causation, whereas of course they are subject to the same strictures as simple correlations. For example, Malacova (2007) created a post hoc multi-level model related to single-sex teaching. The paper is called the ‘Effects of single-sex education on progress in GCSE’ and many times talks of the effect of single-sex teaching – as in ‘The effect of school type is highly significant’ (p.246). There are many problems with this paper, including the fact that ‘The data are based on the entire population of schools’ (p.238) and so Malacova is misusing the idea of statistical significance. However, the key problem is that the conclusion (and title) cannot be warranted by the methods used (see section 4.2). Multi-level modelling is no more than complex correlation. Both of these mistakes, misuse of probability-based statistics and confusing correlation with cause and effect, are widespread and this paper is used merely as an illustration.
Of course, none of this matters if the concept of causation, on which the apparent pre-eminence of experimental methods rests, is an illusion. It is not possible to detect a cause empirically or prove that one exists philosophically. Effects cannot be deduced from observing causes, nor causes from observing effects (seeing a light bulb going off does not, by itself, allow the observer to deduce whether it has been switched off, whether there is a power failure or whether the bulb is broken, for example; Salmon 1998). It is even possible to imagine and describe social life without reference to causes. Since this is so, and we cannot see, smell, hear, measure or register causes directly, it may be unwise to assume that they exist. In fact, an argument could be advanced that this is the most parsimonious, and therefore the most scientific, explanation of our observations. We can never directly sense a cause. We merely induce the existence of causes from our experience of the association of two or more events, and this is nothing more than a habit of mind - immutable though it appears (Hume 1962). A cause is therefore 'when the occurrence of one event is reason enough to expect the production of another' (Heise 1975). A very similar process is observed in both classical and operant conditioning, where the association of two things leads the conditioned subject to behave in the presence of one thing as though it implied the presence of the other.
Causes are seen by some respected commentators as pre-scientific. Pearson (in Goldthorpe 2001) as early as 1892 was calling the idea of causes a 'mere fetish', which was holding up the advance of correlational techniques in statistics. Russell (in McKim and Turner 1997) argued in 1968 that physics no longer sought causes as they simply do not exist. According to him, causality is a relic of a bygone age, like the theory that infections were caused by demons invading the body perhaps. The best we can apparently hope for is the identification of 'relatively invariant functional relationships among measurable properties'. So Russell, like Pearson, would argue that scientific laws are idealised correlations. Mathematical statements or systems of equations can describe systems but they cannot express either intention or causality. If we drop a ball in a round bowl it will come to rest in the centre. We may predict this, and say that this was 'caused' by gravity, but we can see neither the cause nor the gravity, and the cause itself could not be expressed mathematically. This becomes clearer if we drop two balls in the bowl. We can model the final resting places of both balls mathematically, but we cannot use this to decide which ball is 'causing' the other to be displaced from the centre of the bowl. The events are mutually determined and this system of mutual determination is what the equations express (Garrison 1993). Mathematics can be used to show that systems are, or are not, in equilibrium, and to predict the actual change in the value of one variable(s) if another variable(s) is changed. However, this prediction works both ways. If y=f(x) then there will be a complementary function such that x=f'(y). Which variable is the dependent one (on the left-hand, predicted side) is purely arbitrary. Nothing in mathematics can overcome this. 
Non-causal mutuality (or concomitance) could be a perfectly reasonable and reasonably useful interpretation of many such sets of events.
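The point that a mathematical description of an association works in both directions can be seen with the simplest such description, the correlation coefficient. The sketch below uses invented data and a hand-written helper (pearson, a hypothetical name) to show that the coefficient is the same whichever variable is treated as the 'dependent' one. Note that the data are generated in the code by making y depend on x, yet nothing in the resulting coefficient records that direction.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)
# Invented data: y is constructed from x plus noise.
x = [rng.gauss(0, 1) for _ in range(500)]
y = [2 * xi + rng.gauss(0, 1) for xi in x]

r_xy = pearson(x, y)  # treating y as the outcome
r_yx = pearson(y, x)  # treating x as the outcome

print(f"correlation of x with y: {r_xy:.4f}")
print(f"correlation of y with x: {r_yx:.4f}")
```

The two values agree, so the association alone cannot say which variable, if either, is doing the causing.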
A perfectly plausible alternative is one based purely on random events. A large table of pseudo-random numbers can contain arithmetic sequences, and passages of repetition, without us denying their essential randomness. The sequence '0 1 2 3 4 5 6 7 8 9' is as likely to be generated randomly as any other sequence of ten digits, such as '3 2 7 5 8 8 4 5 1 9'. Both are equally 'random' in the sense that we mean when describing such tables. In the same way perhaps the apparent regularities and repetitions that we observe more generally would be expected in a large (possibly infinitely large) universe. On this admittedly rather extreme view, all scientific propositions are like the behaviour of a pigeon in a Skinner box repeating pointless actions in the face of an accidental reinforcement schedule. However, this view, while intellectually coherent, means the end of scientific endeavour and, by definition, is not one that can be logically espoused by anyone engaged in publicly relevant research. Similarly, an economist believing that market indicators were actually following a 'random walk' could not earn a living as a predictor of these indicators, except as a charlatan.
Another position worthy of consideration in relation to the existence of causes is that they exist alongside non-caused events. One version of this stance was taken by those advancing the cosmological (first cause) argument for the existence of a god. Their argument was that everything has a cause, so it is possible to follow the causal chain back to the first cause which was, for the want of a better term, god. Ignoring the simple counter-argument that the existence of a first cause actually refutes the first premise (i.e. that everything has a cause), it is clear that such advocates are allowing both causes and non-caused phenomena to exist in the same universe. The same approach is now followed by economists who present evidence for rational choices as a causing agent. These choices, such as those involved in human capital theory, do not appear to work for individuals but only at aggregated levels. One interpretation therefore is that individuals operate using idiosyncratic processes that only appear to be rational when grouped. More overtly, this position was adopted in the twentieth century by physicists and others believing that events at some levels are random (uncertain) while at higher levels of analysis they are patterned. In social science this belief appears in models in which the predictable components of behaviour are seen as causal in nature, and the unpredicted (and unpredictable) parts are seen as random error terms or individual whimsy (Pötter and Blossfeld 2001).
An alternative view is that this position, while as logically possible as a random universe, is invalid for the practising social scientist. The number of potential explanations for any finite set of observations is actually infinite (created by simply adding more and more redundant clauses to a proposition for example). We overcome this practical problem, and foster cumulation, by concentrating only on the simplest explanations available. These are the most parsimonious, seeking to explain the observations we make without using additional propositions for which there is not already evidence. They are also the easiest to test, and to falsify in the Popperian model. We have no direct evidence to decide between explanations based on causes or on random events (Arjas 2001), so to use either one of them in an explanation involves making an assumption. To explain a set of observations using both involves making two assumptions, and is therefore unparsimonious. We have enough trouble establishing whether causes exist or not. To allow them to exist alongside unrelated phenomena makes most social scientific propositions completely untestable (for the falsification of a purported cause can always be gainsaid by the 'whimsy' element). Perhaps this is why social science shows so little practical progress over time.
Uncertainty could also be merely unpredictability, and it would be arrogant to assume that if we cannot yet predict a set of events then there is no more predicting to be done. Chaos theory is clearly causal but it allows for unpredictability due to complications in computation from the initial states (Gleick 1988). This unpredictability could stem from our inability to predict causatory events, or from our misunderstanding of the basic randomness of events. Both explanations are plausible, but currently untestable. Using both processes together is unnecessary, and trying to combine them into one description often leads to logical difficulties anyway. For example if sub-atomic events are really random, but have an effect on larger processes which are themselves causes, then following the causal chain argument the larger 'causes' are themselves randomly determined and therefore random. And if 'random' events can have a cause then they are not random, by definition.
The problem with causation is not that there are events that it cannot explain, but that it is itself impossible to observe. Therefore, there is no value in mixing it up with a model such as intention (Gambetta 1987) which is also perfectly capable of explaining decisions by itself but which is also not open to observation by social scientists. Given that there is no way of deciding between them empirically, either causation or intention can be adopted (it makes little practical difference which at this stage). There is no empirical justification for working with both at the same time (any more than there is for working with causation and randomness). Rather, in a causal explanation, an intention or an individual choice can be an outcome (of social or family background for example) as well as a cause. The argument is actually about the nature of the cause (or effect), not about whether it is a cause. When psychologists argue the nature/nurture controversy, or sociologists debate the relative importance of structure and agency, for example, they are simply arguing about what the relevant causes are.
One way of viewing causation is as a stable association between two elements. Where one is present the other is also, and when one is absent the other is also. It is the constant conjunction that suggests that all possible futures will be like all pasts (Hume 1962). This view of causation has two main problems: we know that it opens us to superstition, and it does not allow for intermittent association. Skinner's accidental reinforcement schedule is a powerful reminder of the dangers of allowing causal models to be based only on association. Skinner's intermittent reinforcement schedule shows us how difficult it might be to shake such causal models once they have been accepted. We can be easily fooled by association (hence the common caveats about correlations in standard textbooks), especially where these associations involve large numbers and are backed by expertise or apparent authority (Brighton 2000). This point was made recently by Johnson (2001) in relation to the false distinction in the US between 'causal-comparative' studies using analysis of variance techniques and 'correlational' studies. Comparative models do not provide positive evidence of causation in non-experimental designs. It is, perhaps, simply their increasing complexity and the apparent authority of the statisticians who understand them that makes others prepared to accept this falsehood.
In evaluating whether a possible causal theory makes sense, de Vaus (2001) suggests in addition to explaining the co-variation and time sequence, and being plausible, that the proposed dependent variable must be capable of change. While the sex of the student could affect the outcome of a job interview, the reverse could not be true. Sex would be unchanged by the interview. In fact, we can go further than saying the dependent variable must be capable of change. It must be able to be changed by the independent variable. If there is a relationship between the level of poverty among sixteen-year-olds and their examination results, then the only causal model that makes sense in the short-term is one where poverty affects examination results. A possible characteristic of a good causal model is an explanatory process or theory that takes these restrictions on plausibility into account. If causation is a generative process then something must be added to the statistical association between an intervention and an outcome for the model to be convincing. The cause must be tied to some process that generates the effect. The standard example is the clear relationship between smoking and lung cancer. The statistical conjunction and the observations from laboratory trials were elucidated by the isolation of carcinogens in the smoke, the pathological evidence from diseased lungs and so on. From this complex interplay of studies and datasets emerges an explanatory theory - the kind of theory that generates further testable propositions. This is the key role for theory-building in research.
This brings us back to the role of experiments. Another way of viewing causation is via the effect of an intervention. If causes are not susceptible to direct observation, but what they 'cause' is effects, then at least those effects must be observable. Causes are really only susceptible to testing by intervening and measuring, the technique of randomised controlled trials and related designs. We should therefore probably follow the principle of 'no causation without manipulation'. This is the approach used by Pavlov in so far as classical conditioning involved a causal model of learning and extinction. Koch used a very similar approach of intervening and treatment removal to show causation in infections (Cox and Wermuth 2001). Unfortunately in a social science where the subject of study is people we cannot usually expose the same people both to the treatment and not, as might be possible by using two near identical cases in Physics for example. We therefore use statistical approaches (such as random allocation to groups) to overcome this limitation. And this, of course, may be why probabilistic models of causation emerge. They may reflect, not the reality of the study, but the practical limitation of our experimental designs when dealing with people.
Having resolved this, in practical terms cause/effect is still difficult to isolate. Given the design bias, and sampling and measurement errors in all our work we may end up with estimates rather than simple, almost mechanical, cause and effect models. While perhaps disappointing to some, this is actually inevitable. Our role as researchers is to minimise the bias and the sampling and measurement errors. Statistics, as popularly conceived, can only help with the least important of these - the sampling error. Overcoming the rest of the error, the bulk of it in any design, is to do with rigour. Rigour should transcend any specific design, approach or method. It is certainly not the prerogative of experiments.
Gorard, S. (2002a) The role of causal models in education as a social science, Evaluation and Research in Education, 16, 1, 51-65
How to reference this page:
Gorard, S. (2007) Experimental Designs. London: TLRP. Online at http://www.tlrp.org/capacity/rm/wt/gorard (accessed