 Capacity building resources

Experimental Designs

Stephen Gorard


Stephen is ...


1. Introduction
            1.1 We all need trials
            1.2 The basic idea
2. The value of trials
            2.1 Comparators
            2.2 Matching groups
            2.3 Causal models
3. Simple trial designs
            3.1 Sampling
            3.2 Design
4. Related issues
            4.1 Ethical considerations
            4.2 Warranting conclusions
            4.3 The full cycle of research
5. Resources for those conducting trials
            5.1 Reporting trials
            5.2 Tips for those in the field
6. Alternatives to trials
            6.1 Regression discontinuity
            6.2 Design experiments
7. Some examples of trials
            7.1 STAR
            7.2 High/Scope
            7.3 ICT
8. References


4. Related issues

4.1 Ethical considerations

A key ethical concern for those conducting or using publicly-funded education research ought to be the quality of the research, and so the robustness of the findings, and the security of the conclusions drawn. Until recently, very little of the writing on the ethics of education research has been concerned with quality. The concern has been largely for the participants in the research process, which is perfectly proper, but this emphasis may have blinded researchers to their responsibility to those not participating in the research process. The tax-payers and charity-givers who fund the research, and the general public who use the resulting education service, have the right to expect that the research is conducted in such a way that it is possible for the researcher to test and answer the questions asked. Generating secure findings for use could involve a variety of factors including care and attention, sceptical consideration of plausible alternatives, independent replication, transparent prior criteria for success and failure, use of multiple complementary methods, and explicit testing of theoretical explanations through randomised controlled trials or similar experimental designs.

Although their importance is perhaps overplayed by some writers, there will be at least some ethical considerations in any piece of research. Consider this example. NHS Direct is a telephone helpline set up to relieve pressure on other UK National Health Service activities. Callers can ask for help and advice, or reduce their anxiety about minor injuries or repetitive illness, without going to their General Practitioner or to hospital out-patients. Research reported by Carter (2000) found serious shortcomings in this new service. The evidence was collected by making a large number of fake calls to test the consistency, quality and speed of the advice given. In ethical terms, is this OK?

One argument against this study is that it has misused a procedure intended to relieve pressure on an already pressurised and potentially life-saving public service. By conducting the research via bogus calls, it is at least possible that individuals have suffered harm as a consequence. One argument for the study would be that realistic (and therefore 'blind', see above) evaluations are an essential part of improving public services, and that the longer-term objective of the study was to produce an amelioration of any shortcomings discovered. If, for the sake of argument, NHS Direct was actually a waste of public funds it would be important to find this out at an early stage and redirect its funding to other approaches. This, in a nutshell, is the major issue facing ethics and research. Researchers will not want to cause damage knowingly, but is it worth them risking possible harm to some individuals for a greater overall gain? As with most decisions I am faced with, I do not have a definite answer to this one. Or rather, my definite answer is 'it depends'.

It depends, of course, on the quality of the research being conducted. Most observers would agree with this on reflection, but it is seldom made explicit in any discussion of ethics. It would, for example, be entirely reasonable to come to opposite conclusions about the example above dependent on the quality of the study. If calling the help-line for research purposes runs a risk of replacing other genuine callers then it has to be considered whether the value of the research is worth that risk. The risk can only be judged against the purpose and rigour of the research. If, for example, the study found that the line was working well, then no more research is needed (and the study has served its evaluative purpose). If the study found problems, and as a result these could be ameliorated (although it is clearly not the full responsibility of the researcher if they are not), then the study could claim to be worthwhile. The one outcome that would be of no use to anyone is where the research is of insufficient quality to reach a safe and believable conclusion either way. In this case, all of the risk has been run for no reason and no gain. From this it would not be too much of a stretch to say that, in general, poor research leading to indefinite answers tends to be unethical in nature, while good trustworthy research tends to be more ethical.

In many of the fields in which we wish to research, our influence over ethical situations is marginal. One may have to 'befriend' convicted serial killers, however repugnant the task, in order to find out about their motivations (if this is felt to be important to know). Our control over the quality of our work is generally much greater than our control over ethical factors. Thus, ethically, the first responsibility of all research should be to quality and rigour. If it is decided that the best answer to a specific research question is likely to be obtained via an experimental design for example, then this is at least part of the justification in ethical terms for its use. In this case, an experiment may be the most ethical approach even where it runs a slightly greater risk of 'endangering' participants than another less appropriate design. Pointless research, on the other hand, remains pointless however 'ethically' it appears to be conducted. Good intentions do not guarantee good outcomes. Such a conclusion may be unpalatable to some readers, but where the research is potentially worthwhile, and the 'danger' (such as the danger of wasting people's time) is small relative to the worth, my conclusion is logically entailed in the considerations above. I am, of course, ruling out entirely all actions, such as violence or abuse, that we would all agree are indefensible in any research situation.

Reinforcement for this conclusion comes from a consideration of the nature of funding for research. Whether financed by charitable donations or public taxation, research must attempt to justify the use of such public funds by producing high quality results. If the best method to use to generate safe conclusions to a specific question is an experiment (for example), then there should be considerable ethical pressure on the researcher to use an experiment.

The application of experimental designs from clinical research to educational practice does, however, highlight specific ethical issues (Hakuta 2000). In a simple experiment with two groups, the most common complaint is that the design is discriminatory. If the control group is being denied a treatment in order for researchers to gain greater knowledge about it, this could be deemed unethical. But this approach is only unethical if we know which group is to be disadvantaged. In most designs, of course, the whole purpose is to decide which treatment is better (or worse). We need evidence of what works before the denial of what works to one group can be deemed discriminatory. Perhaps a study would only be unethical if you couldn't find anyone who believed that the experimental group is not advantaged. In our current state of relative ignorance about public policy and human behaviour, it is as likely that the treatment will be the inferior approach for some, as that doing nothing to find out what works will damage the chances of others. An analogy for our present state of affairs might be the development of powered flight. All aeroplanes and flying machines designed around 1900 were based on the same Newtonian aerodynamics in theory. In testing, some of them flew and some crashed, despite the belief of all designers that their own machine would work. It was only the testing that sorted one group from the other. To strain the analogy a little, one could hardly argue that it would be more ethical for us all to fly in planes that had not been tested. For some reason, most discussions of ethical considerations in research focus on possible harm to the research participants, to the exclusion of the possible harm done to future users of the evidence which research generates. They almost never consider the wasted resources, and worse, used in implementing treatments and policies that do not work (see Torgerson and Torgerson 2001). 
In the UK it is legally impossible to market a new powder for athlete's foot without testing it, but we spend billions of pounds on public policies for crime, housing, transport and education that affect millions of people without any real idea of whether they will work. How ethical is that?

On the other hand, is it fair to society (rather than just the control group) to use an intervention without knowing what its impact will be? Would it be reasonable, for example, to try not jailing people sentenced for violent crimes simply to see if this led to less re-offending (de Leon et al. 1995)? Again the answer would have to be - it depends. What we have to take into account is not simply what is efficient or expedient but what is right or wrong. This judgement depends on values, and values are liable to change over time. In fact, doing the work of research can itself transform our views of what is right and wrong. If an alternative punishment to prison led to less violent crime, who would object (afterwards)? Would we have oxygen treatments for neonates, or drugs for heart diseases, if we were dominated by short-term ethical considerations? Ideally, we should test all public and social interventions before using them more widely. The problems above are also shared with disciplines like history (archaeology, palaeontology, astronomy etc.), but the difference here is that history (like the others) is constrained to be non-experimental and is, in effect, making the best of what is possible. Social science research has no such general constraint about experiments (although it applies to some research questions).

Is deception of the participants in an experiment OK? Should we always tell the truth? Should we encourage others to behave in ways they may not otherwise (by making racist statements for example)? What is the risk to the participants? Can we assure confidentiality? Moral judgements such as these require deliberation of several factors, and there is seldom a clear-cut context-free principle to apply. There are two main contradictory principles in play here: respect for the welfare of participants, and finding the truth. The right to 'know' is an important moral principle after all, even where the consequences might hurt some individuals (such as those with a commercial interest in our ignorance). We can never fully ignore the consequences of our study and we need to be tentative in our claims, as even experiments lead only to possible knowledge. Nevertheless, we also need virtues such as honesty to behave as researchers, to publish results even when they are painful or surprising (and the question 'could you be surprised by what you find?' is for me one criterion of demarcation between research and pseudo-research), and the courage to proceed even if this approach is unpopular.


Gorard, S. (2002b) Ethics and equity: pursuing the perspective of non-participants, Social Research Update, 39, 1-4

4.2 Warranting conclusions

Some of the criticism of education research in the US, UK and elsewhere during the 1990s was concerned with relevance. But education is an applied field of research. I do not find, as I review evidence for different projects, much published research that has no relevance to some important or useful component of education. The criticism should more properly be directed to the poor quality of much research, where even though the findings may have relevance they still cannot be used safely. In response to these perceived deficiencies, formal capacity-building activities have tended to focus on solutions in terms of methods, such as having more quantitative work, more systematic reviews, or more experiments. These, to my mind, are not the answer in themselves. The answer for me lies in genuine curiosity, coupled with outright scepticism. These characteristics lead a researcher to suit methods to purpose, try different approaches, replicate and triangulate, and to test their findings. They lead researchers to consider carefully the logic and hidden assumptions on the path from evidence to conclusions, automatically generating caveats and multiple plausible interpretations from the standard query – ‘if my conclusions are actually incorrect, then how else could I explain what I have found?’. Some improvement may come from researcher development, but, somewhat pessimistically for an educator, I have come to believe that the role of capacity-building is limited here (Gorard 2005).

Research itself is quite easy. Everyone (even an infant) does it every day by gathering information to answer a question and so solve a problem (e.g. to plan a rail journey, Booth et al. 1995). In fact most of what we 'know' is research-based, but reliant on the research of others (such as the existence of Antarctica). Where we have no other choice we may rely on our judgement of the source of that information (an atlas may be more reliable than memory, the rail enquiries desk may be more reliable than last year's timetable). But where we have access to the research findings on which any conclusions are based we can also examine their quality and the warrant that connects the two. Similarly when we present our own research findings, we need to give some indication, via caveats, of the extent to which we would be prepared to bet on them being true, or the extent to which we would wish others to rely on them being true. This is part of our 'warrant'. Obviously, producing high quality research is important but even high quality work can lead to inappropriate conclusions.

Huck and Sandler (1979) remind readers of a silly example in order to make an important point about warrants. An experimental psychologist trains a flea to jump in response to hearing a noise. Every time the noise is made the flea jumps. They then cut the legs off the flea, and discover that it no longer jumps when the noise is made. Conclusion: cutting off the legs has affected the flea's hearing. Of course, this is clearly nonsense but, as with the politician's error, it is likely that we have all been persuaded by similar conclusions. If a physiologist cuts out a piece of someone's brain, and the person can no longer tell us about a memory (or perform a skilled action) that they were able to previously, then is this evidence that the specific memory or skill was 'stored' in that section of brain? Many such claims have been made, and early maps of brain function were based on just this approach. However, the same effect of inability to report recall of memory (or skill) could have been achieved by cutting people's tongues out, or removing their hearts. All three operations may prevent memory recall for different reasons without showing that the part of the body removed in each case is the site of the memory.

Brignell (2000) provides another example. The chemical industry routinely uses a chemical called 'dihydrogen monoxide'. While tremendously useful, this chemical often leads to spillages, and finds its way into our food supply. It is a major component of acid rain, and a cause of soil erosion. As a vapour it is a major greenhouse gas. It is often fatal when inhaled, and is a primary cause of death in several UK accidents per year. It has been found in the tumours of terminally ill patients. What should we do about it? In a survey the clear majority of respondents believed that water, for that is what it is, should be either banned or severely regulated. All of those statements about water are basically 'true', yet clearly none of them means that water should be banned. Now replace water with another, less abundant, chemical. How do you feel about banning it now? You have no obvious reason to change your mind. Yet you have probably accepted just the kind of evidence we have about water as grounds for banning other chemicals. Do you see how difficult, but also how important, the warrants for research conclusions are? In both the flea and the water example the problem was not principally the research quality (or put another way the problem was separate from any reservations we may have about quality). The problem was that the conclusions drawn were not logically entailed by the research evidence itself.

The warrant of an argument can be considered to be its general principle - an assumption that links the evidence to the claim made from it (Booth et al. 1995). Claims must be substantive, specific, and contestable. The evidence on which they are based ought to be precise, sufficient, representative, authoritative, and clear to the reader (as far as possible). In logical terms, if we imagine that our simplified research evidence is that a specific phenomenon (A) has a certain characteristic (B), then our evidence is that A entails B. If we want to conclude from this that phenomenon A therefore also has the characteristic C, then the third component of our syllogism (the classic form of our argument) is missing or merely implied. This third component is that everything with characteristic B also has characteristic C. Thus, our complete syllogism is:
This A is B
All B are C
Therefore, this A is also C.
The first part (A is B) may be likened to the evidence in a research study (e.g. water can be fatal), and the third (A is C) to the conclusion (e.g. water should be banned), while the second (all B are C) is like the warrant (e.g. everything that can be fatal should be banned). In research this step is often missed, as it is tacitly assumed by the author and the reader. However, where the research is intended to change the views of others it is necessary to make the warrant explicit. It can be challenged, but unlike a challenge to the evidence it is not about quality but rather about the relevance of the evidence to the conclusion. In the water example the warrant is clearly nonsense. Water can be fatal, but we cannot ban everything that could be fatal. But accepting that this warrant is nonsense also means that no evidence, however good, can be used with this precise format of argument to justify banning anything at all.
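
The role of the warrant in the syllogism above can be checked mechanically. The following is a minimal sketch in Python, not part of the original argument: the names `subsets` and `conclusion_always_follows`, and the three-item universe, are illustrative assumptions. It tries every possible membership of B and C and looks for a case in which the premises hold but the conclusion fails. It tests only the form of the argument, and says nothing about whether a given warrant is true.

```python
from itertools import combinations


def subsets(universe):
    """Yield every subset of the universe as a frozenset."""
    items = list(universe)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def conclusion_always_follows(use_warrant, universe=("a", "x", "y")):
    """Return True if 'this A is C' holds in every case where the premises hold.

    'a' stands for the phenomenon studied (hypothetical label).
    Evidence: a has characteristic B.
    Warrant (optional): everything with characteristic B also has C.
    """
    for B in subsets(universe):
        for C in subsets(universe):
            evidence = "a" in B
            warrant = B <= C  # every member of B is also a member of C
            premises_hold = evidence and (warrant or not use_warrant)
            if premises_hold and "a" not in C:
                return False  # counterexample: premises true, conclusion false
    return True


with_warrant = conclusion_always_follows(use_warrant=True)
without_warrant = conclusion_always_follows(use_warrant=False)
print(with_warrant, without_warrant)  # True False
```

With the warrant included no counterexample exists, so the form is valid. With the warrant removed the evidence alone does not deliver the conclusion, which is why a tacit warrant has to be stated and then defended in its own right.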

The warrant may be part of the research design but it is independent of any particular method of data collection (de Vaus 2001). Methods cannot be judged in isolation from the questions they are intended to illuminate (National Research Council 2002). The results should be disclosed to critique, and the conclusions drawn based on an explicit coherent chain of reasoning which rules out all plausible counter-explanations, and is intended to be persuasive to a sceptical reader (rather than playing to a gallery of existing 'converts', for example). The first question to be asked of any evidence presented in support of a model of a social process is 'but what else might this mean?'. The ability to discern rival explanations, while varying considerably between individuals, probably grows with practice (Huck and Sandler 1979). It is a key skill for good research (but manifestly not a necessary one for 'success' in a research career). But, perhaps more importantly, it is a key skill for everyone to have as a consumer of research - so we won't get fooled again. One way of improving this skill is to learn to recognise common forms of misleading argument. For example, the 'fallacy of affirming the consequent' is quite commonly encountered in social science. The fallacy argues that if A is true then B will follow. Then if B appears it is taken by some researchers to mean that A is true. While seductive, there is no logic to this argument unless it starts more strongly with 'only if'. Otherwise exactly the same argument can be made with Z (or anything else) substituted for A.
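
The fallacy of affirming the consequent described above can be exposed with a simple truth table. This is an illustrative sketch in Python, with `implies` and `is_valid` as hypothetical helper names. It treats 'if A then B' as the ordinary material conditional, which is an assumption about how the everyday phrasing should be formalised, and reads 'B only if A' as 'if B then A'.

```python
from itertools import product


def implies(p, q):
    """Material conditional: 'if p then q' is false only when p is true and q is false."""
    return (not p) or q


def is_valid(premises, conclusion):
    """Valid means no truth assignment makes every premise true and the conclusion false."""
    for a, b in product([True, False], repeat=2):
        if all(premise(a, b) for premise in premises) and not conclusion(a, b):
            return False
    return True


# Modus ponens: if A then B; A is true; therefore B.
modus_ponens = is_valid(
    [lambda a, b: implies(a, b), lambda a, b: a],
    lambda a, b: b,
)

# Affirming the consequent: if A then B; B appears; therefore A.
affirming_consequent = is_valid(
    [lambda a, b: implies(a, b), lambda a, b: b],
    lambda a, b: a,
)

# The stronger 'only if' version: B only if A; B appears; therefore A.
only_if_version = is_valid(
    [lambda a, b: implies(b, a), lambda a, b: b],
    lambda a, b: a,
)

print(modus_ponens, affirming_consequent, only_if_version)  # True False True
```

The counterexample for the fallacy is the row where A is false and B is true: B has appeared for some reason other than A, which is exactly the rival explanation a sceptical reader should look for.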

The boxing off of plausible rival explanations is therefore generally at the heart of effective warrants. For any real system of variables there is a nearly infinite number of models that could explain them (Glymour et al. 1987), in the same way that an infinite number of equations can join any two points on a graph. Therefore, no one can consider all possible theories to explain any finding - so that in social science, as in natural science, every 'law' that is ever proposed is quite literally false. The purpose of the warrant is to show readers that the proposed explanation is the best we have at this point in time. As we have seen, a useful short-cut is to employ parsimony to eliminate many of the potential alternatives (cf. the canon attributed to Morgan 1903, ‘In no case may we interpret an action as the outcome of the exercise of a higher psychical faculty, if it can be interpreted as the outcome of one which stands lower in the psychological scale’, p. 53). It is, for example, simpler, and usually safer, for a doctor to diagnose a complaint of headache, neck stiffness, fever and confusion as meningitis, rather than as a combination of brain tumour, whiplash, tuberculosis and acute porphyria. Of course, the latter could be correct, but parsimony encourages us to eliminate the more mundane and simplest explanations first. We therefore limit our potential explanations to those that employ the fewest (ideally no) assumptions for which we have no direct evidence. This boxing off of plausible rival explanations is what a trial design leads us to, making a warranted causal claim easier to sustain. Thinking about warrants is also partly what the idea of thought experiments is about (section 1).
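
The remark about equations joining two points on a graph can be made concrete. The sketch below is illustrative only: the function name `curve`, the two points, and the values of `c` are all assumptions chosen for the example. It builds a family of curves that agree exactly at both points yet disagree everywhere else.

```python
def curve(c, p1=(0.0, 1.0), p2=(2.0, 5.0)):
    """Return a function passing through p1 and p2 for any value of c.

    The straight line through the two points is bent by adding
    c * (x - x1) * (x - x2), a term that is zero at both points.
    """
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)

    def f(x):
        return y1 + slope * (x - x1) + c * (x - x1) * (x - x2)

    return f


# Four rival 'models', all equally consistent with the two observations.
models = [curve(c) for c in (-3.0, 0.0, 0.5, 10.0)]

fits_both_points = all(f(0.0) == 1.0 and f(2.0) == 5.0 for f in models)
predictions_at_1 = [f(1.0) for f in models]

print(fits_both_points)   # True
print(predictions_at_1)   # [6.0, 3.0, 2.5, -7.0]
```

Since `c` can take any value, the two observations cannot choose between the models. Parsimony would favour the straight line (`c = 0.0`), but only further evidence, such as an observation at x = 1, can actually rule the others out.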


Gorard, S. (2002c) Fostering scepticism: the importance of warranting claims, Evaluation and Research in Education, 16, 3, 136-149

4.3 The full cycle of research

The power of the experiment comes not from the design alone but from the power of the questions to which experiments can be addressed. Such designs should therefore be additional to, not replacements for, other recognised modes such as detailed case studies and secondary analysis. My summary would be that experiments can be powerful but they are not 'magic bullets'. Research is not high quality just because it is experimental. If it is high quality and experimental then it is probably as good as we are ever going to achieve in social science research.

It is helpful to consider the research enterprise as a cycle of complementary phases and activities, because this illustrates how all methods can have an appropriate place in the full cycle of research. Experimental designs, like in-depth work or secondary analysis, have an appropriate place in the cycle of research from initial idea to development of the results. The main reason to emphasise experiments at this point in time is not because they are more important than other phases in the cycle, but because they represent a stage of work that is largely absent in education research. If nearly all of education research were currently conducted as laboratory experiments then I would be one of the commentators pleading for more and better in-depth work or secondary analysis, for example. Other weak points in the cycle are currently the systematic synthesis of what we already know in an area of work, the design or engineering of what we already know into usable products for policy and practice, and the longer-term monitoring of the real-world utility of these products (Gorard with Taylor 2004, Gorard et al. 2004).

Randomised trials can be expensive both in monetary terms, and more particularly in terms of their demands on research subjects and researchers. It is, therefore, morally dubious to conduct a trial until there is a reasonable basis on which to believe that the intervention is likely to be effective (and also perhaps morally dubious to deny the treatment to the control group once that basis has been established!). In the context of drug trials, basic pre-clinical science and further applied pharmacological research precede small-scale trials. Only a minority of potential new treatments emerge as being of sufficient promise (and safety) to warrant definitive testing in a large-scale clinical trial.

The Medical Research Council (MRC, 2000) model for complex health education interventions suggests that interventions are most likely to be successful if they are based on sound theoretical concepts (Campbell et al. 2000). In this model, the first phase would involve the initial design of an intervention based on current theoretical understanding, with an explicit underlying causal explanation for its proposed effect. The second phase involves the formative evaluation of that intervention, using qualitative approaches such as interviews, focus groups, observation and case studies to identify how the intervention is working, the barriers to its implementation, and how it may be improved. The third phase is a feasibility study of the intervention, or full pilot study, involving both measurement and in-depth feedback. This phase also sees the main development of the alternative treatments or controls. The fourth phase is the trial itself, and the fifth might be the scaling up and 'marketing' of the results.

Traditionally, trials have required that the interventions being tested are standardised and uniformly delivered to all participants. However, since educational interventions are so dependent on the quality of delivery, the value of trials predicated on 'ideal' conditions is limited. For example, some smoking education interventions have been found to work well in efficacy trials, when delivered by enthusiastic teachers with ample curriculum time, yet when implemented in actual practice they have not been found to be effective, and the researchers have not necessarily known why (Nutbeam et al. 1993). It is therefore better to take a pragmatic approach, with the intervention delivered in the trial in a lifelike way. This approach sacrifices standardisation for realism, and means that the natural variability in delivery that occurs between practitioners must be recorded and monitored by in-depth means (perhaps video recording) as well as by more traditional outcome measures. In summary, the ‘trial design ensures that an unbiased estimate of the average effect of the intervention is obtained, while the qualitative research provides useful further information on the external factors that support or attenuate this effect’ (Moore 2002, p.5).
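
The claim that a trial gives an unbiased estimate of the average effect, even when the effect varies from one participant to the next, can be illustrated with a small worked example. The sketch below uses invented scores for six hypothetical pupils (they are not data from any study) and assumes simple random allocation of half the pupils to treatment. Each pupil has a score with and without the intervention, only one of which would ever be observed in a real trial. The code enumerates every possible allocation and averages the resulting estimates.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical scores for six pupils (illustrative numbers only).
# untreated[i] is pupil i's score without the intervention, treated[i] with it.
# The gain differs by pupil, standing in for variable quality of delivery.
untreated = [40, 55, 60, 48, 70, 52]
treated = [45, 54, 70, 50, 78, 52]

n = len(untreated)
true_average_effect = Fraction(sum(treated) - sum(untreated), n)


def estimate(treatment_group):
    """Difference in mean outcome between the treatment and control groups."""
    control_group = [i for i in range(n) if i not in treatment_group]
    mean_treated = Fraction(sum(treated[i] for i in treatment_group), len(treatment_group))
    mean_control = Fraction(sum(untreated[i] for i in control_group), len(control_group))
    return mean_treated - mean_control


# Under random allocation every choice of three pupils for treatment is
# equally likely, so the expected estimate is the plain average over all of them.
allocations = list(combinations(range(n), n // 2))
estimates = [estimate(group) for group in allocations]
expected_estimate = sum(estimates, Fraction(0)) / len(estimates)

print(true_average_effect, expected_estimate)  # 4 4
```

Any single allocation can give an estimate above or below the true figure, so unbiasedness is a property of the design across allocations rather than a guarantee about one trial. The average also says nothing about why some pupils gained and others did not, which is where the in-depth work described above comes in.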


Ercikan, K. and Roth, W.-M. (2006) What good is polarizing research into qualitative and quantitative?, Educational Researcher, 35, 5, 14-23

Gorard, S. (2006) Towards a judgement-based statistical analysis, British Journal of Sociology of Education, 27, 1, 67-80


How to reference this page: Gorard, S. (2007) Experimental Designs. London: TLRP. Online at (accessed )

Creative Commons License TLRP Resources for Research in Education by Teaching and Learning Research Programme is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 2.0 UK: England & Wales License



