
Experimental Designs

Stephen Gorard




1. Introduction
            1.1 We all need trials
            1.2 The basic idea
2. The value of trials
            2.1 Comparators
            2.2 Matching groups
            2.3 Causal models
3. Simple trial designs
            3.1 Sampling
            3.2 Design
4. Related issues
            4.1 Ethical considerations
            4.2 Warranting conclusions
            4.3 The full cycle of research
5. Resources for those conducting trials
            5.1 Reporting trials
            5.2 Tips for those in the field
6. Alternatives to trials
            6.1 Regression discontinuity
            6.2 Design experiments
7. Some examples of trials
            7.1 STAR
            7.2 High/Scope
            7.3 ICT
8. References


3. An introduction to trial designs

3.1 Sampling

There are two elements of sampling when conducting or assessing a trial. The first concerns the size of the sample, in much the same way as when conducting a survey or other study. The second concerns the random allocation of the overall sample into two or more treatment groups.

How large should a sample be? There are several methods to help decide on an appropriate sample size for any trial, but my general advice is to have as large a sample as possible. The sample must be large enough to accomplish what is intended by the analysis. Small samples can lead to the loss of potentially valuable results, and are equivalent to a loss of power in the test used for analysis. It is also the case that the actual number in your sample is not always a great determinant of your time or cost.

If you are looking for a difference between the groups in your trial, then your success or failure is determined mainly by four things. (Even if no difference exists in fact, you still need a sample of sufficient size to convince others that you could have found a difference if there had been one.) First, there is the effect size of the phenomenon you are studying (or, of course, its rarity). In social science research effect sizes are often very small. For example, studies of the impact of social work interventions have struggled to find evidence of any beneficial effect at all. Studies of the impact of schools on student examination results suggest that around 85-95% of the variation in results is due to the prior attainment and characteristics of the individual students. Only 5-15% at most is due to the impact of teachers, departments and schools, and any error component. Therefore, looking at differences between schools in terms of curricular development or management style involves examining small differences within what is already a fairly small difference between schools. In both cases, you would need a very large sample in order to have a chance of finding an impact of social work or schools. The smaller the effect size, the larger the sample you need to find it.
Second, there is the variability of the phenomenon you are studying. The more variable the thing (or things) you are studying, the larger the sample needed. Imagine you were trying to find the average height of a group of people. If they are all of the same height then you only need a sample of one to be perfectly accurate in your measurement, but the more variation there is in the heights of this population the more people you need to measure to make sure the first few are not extreme scores. As another example, if you are interested in comparing the examination results by sex in two schools, the results may be quite similar in many respects. The difference between the highest and lowest achievers in either school is likely to be much larger than the differences between the schools or between the sexes. If boys and girls are gaining fairly similar results in both schools, then the effect size you are looking for (the difference between the sexes) is small in comparison to the overall variability of your chief variable (examination results).

Third, there is the 'power' of the statistical test that you use to discern the pattern. In summary, power is an estimate of the ability of the test you are using to separate the effect size from random variation. Fourth, there is the sample size.

To summarise: successful identification of social patterns is assisted by a strong effect, measures of low variability, a powerful test, and a large sample. A change in any one of these factors is equivalent to a change in any of the others. Increasing the effect size therefore has the same effect as using a more powerful test, or decreasing the variability of the measure. However, of these four aspects only the sample size is clearly under the control of the researcher. Research questions are driven by importance, relevance, curiosity, serendipity and autobiography. Researchers do not decide what to research because of its variability or its effect size. Similarly, you will generally use the most powerful test that your design allows. Selecting a large sample is therefore your only direct means of influencing your chances of success. Note that even if you were to find no pattern, this lack of pattern will only be convincing to your audience if the sample was large enough to have found one if it did exist.

As noted above, your resources for the research, including the time and money available, are probably a strong influence on your chosen sample size, but do try not to exaggerate their importance. On the other hand, if your consideration of the other factors suggests that you need a sample size that simply cannot be achieved with the resources available to you, then the study must be modified. Do not go ahead in the knowledge that your sample size is totally unsatisfactory for the work you are doing.

An increase in the size of your sample is equivalent to an increase in the power of any statistical test or model that you use. Power is a measure of the test's ability to separate out genuine effects from random variation. In theory, you can try to estimate the sample size required more precisely via a 'power analysis'. In practice, power analyses are somewhat unrealistic, needing to be conducted in advance for each possible variable in the study, and requiring that the variance and effect size of each measure is known in advance (somehow). If a statistician devised a new and more powerful test whose use with existing data was able to settle debates that social scientists had been having for decades, or conversely to throw doubt on other more established explanations, they would be rightly famous. Improvements in method are thus often the precursor to an improvement in knowledge. Yet you could achieve exactly the same effect by using a larger sample than you anticipated, or than is normal in your field. Torgerson and Torgerson (2008) suggest, as a pragmatic compromise, that for any effect size sought, a minimum sample of 32 cases divided by the square of the effect size will provide about 80% power (i.e. detect the effect, if present, about 80% of the time). Of course, if the proposed analysis divides each treatment group into further sub-groups, such as males and females, then an even larger sample is needed (see Gorard 2003, chapter 3).
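The Torgerson and Torgerson (2008) rule of thumb can be sketched as a one-line calculation (an illustrative sketch only, not a substitute for a full power analysis; the function name is invented):

```python
import math

def required_sample(effect_size):
    """Rough minimum total sample for about 80% power, using the
    Torgerson and Torgerson (2008) rule of thumb: 32 / (effect size squared)."""
    return math.ceil(32 / effect_size ** 2)

# A small effect needs a far larger sample than a moderate one:
print(required_sample(0.5))  # moderate effect -> 128 cases
print(required_sample(0.2))  # small effect -> 800 cases
```

Note how quickly the requirement grows as the effect size shrinks, which is the point made above about small effects in social science.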

The total number of cases can be sought as volunteers, in organisational units (see below), or sampled randomly from a known population. Only in the latter case can the results of the trial be generalised to the wider known population; in the former cases, the trial result is directly valid only for the groups taking part. This is a point of some confusion both for new analysts and for readers of trial reports, because generalising to a population, as is done with survey data, sounds similar to the statistical analysis conducted on the trial treatment groups (see below). In fact, only the latter is what pioneers like Fisher intended. To recap: randomly sampling the cases to take part in a trial (before allocation to groups) allows the results of the trial to be generalised to all those cases with a non-zero chance of having been selected (the population). Randomly allocating the cases in the trial (whether volunteers or sampled) to different treatment groups avoids selection bias and allows us to see whether the treatments have had any differential impact, generalising the results to all cases in the trial (but not to any wider population). In both stages of sampling, missing cases are a major source of bias, and need to be avoided as far as possible (see section 5). It is usually possible to allocate all cases agreeing to take part to one of the treatment groups, and as long as the outcomes for anyone dropping out are known and they are included in the final analysis, little harm is done (intention-to-treat/teach analysis).
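The random allocation step can be sketched as follows (a minimal sketch with invented case identifiers; real trials would also log the allocation so that dropouts can still be analysed in their assigned group):

```python
import random

def allocate(cases, n_groups=2, seed=None):
    """Randomly allocate every case to one of n_groups treatment groups,
    avoiding selection bias in who receives which treatment."""
    rng = random.Random(seed)
    shuffled = list(cases)
    rng.shuffle(shuffled)
    # Deal the shuffled cases out to the groups in turn
    return [shuffled[i::n_groups] for i in range(n_groups)]

groups = allocate(["case%02d" % i for i in range(20)], n_groups=2, seed=1)
# Every case is allocated, so anyone who later drops out can still be
# counted in the group they were assigned to (intention-to-treat).
```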

In education, our samples are often of organisational units such as classes or schools rather than individual learners. Sampling groups of cases in this way is known as cluster sampling, and as long as the units are selected randomly the same logic of experiments can be adopted. Using a clustered sample implies not so much a difference in selection procedures as a difference in defining the population units. The cases we are interested in often occur in natural clusters such as institutions. So we can redefine our population of interest to be the clusters (institutions) themselves and then select our sample from them using one of the above procedures. The institutions become the cases, rather than the individuals within them. This has several practical advantages. It is generally easier to obtain a list of clusters (employers, schools, voluntary organisations, hospitals etc.) than a complete list of the people in them. And if we use many of the individuals from each cluster in our selected sample, we can obtain results from many individuals with little time and travel, since they will be concentrated in fewer places.

For example, in a survey of teachers we might select a random sample of 100 of the 25,000 schools in England and Wales, and then use the whole staff of teachers in each of these selected schools. It is important that the odds of a cluster being selected are in proportion to the number of individuals it represents (i.e. schools with more teachers should be more likely to be picked). Despite this complication in the calculation (and the need to have at least some information about each cluster), this approach is growing in popularity. Its chief drawback is the potential bias introduced if the cases in a cluster are too similar to each other. People in the same house tend to be more similar to each other than to those in other houses, and the same applies, to a lesser extent, to the hamlets where the houses are (people in each post-code area may tend to be similar), and to the regions and nations where they live (and so on). This suggests that we should try to sample more clusters, and use correspondingly fewer cases in each cluster. As usual, the precise compromise between resource limitations and the ideal is a judgement for the researcher. Being aware of, and recording, this judgement is probably the most important safeguard against the undue influence of bias. More complex techniques for dealing with clusters include Bayesian methods, multi-level modelling, and robust variance estimation (Gorard 2007).
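Selecting clusters with probability proportional to their size can be sketched like this (a simplified sketch: the school names and sizes are invented, and `random.choices` samples with replacement, whereas a real design would usually draw schools without replacement):

```python
import random

# Invented example: schools (clusters) and their number of teachers
schools = {"school_a": 120, "school_b": 40, "school_c": 15, "school_d": 80}

rng = random.Random(42)
# Weighting by staff size makes larger schools proportionally
# more likely to be picked, as the text requires
picked = rng.choices(list(schools), weights=list(schools.values()), k=2)
```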

3.2 Basic design

The logic of an experiment relies on the only difference between the groups being due to the treatment. Under these conditions, the experiment is said to lead to valid results. There are several threats to this validity in experiments. Some of these are obvious, some less so. An often cited, but still useful, summary of many of these potential threats comes from Campbell and Stanley (1963) and Cook and Campbell (1979). These are conveniently grouped under eight headings, discussed briefly here. It is important to remember that all of the problems facing experiments apply with equal or even greater force to all other research designs. The experiment is currently the most theoretically-based and considered design available, and it has led to considerable research cumulation in many fields of endeavour of the kind that other, perhaps weaker, designs have yet to achieve.

Some people taking part in experiments may have other experiences during the course of the study that affect their recorded measurement but which are not under experimental control. An example could be a fire alarm going off during the exposure to one of the treatments (e.g. during the maths lecture for one of the groups above). Thus, an 'infection' or confounding variable enters the system and provides a possible part of the explanation for any observed differences between the experimental groups.

By design, the post-treatment measure (or posttest) is taken at some time after the start of the experiment or, put more simply, experiments require the passage of time. It is possible therefore that some of the differences noted stem from confounding factors related to this. These could include ageing (in extreme cases), boredom, and practice effects. Time is important in other ways. If, for example, we are studying the effect of smoking prevention literature among 15 year-olds, when is the payoff? Are we concerned only with immediate cessation or would we call the treatment a success if it lowered the students' chances of smoking as adults? To consider such long-term outcomes is expensive and not attractive to political sponsors (who usually want quick fixes). A danger for all social policy research is therefore a focus on short-term changes. Even where the focus is genuinely on the short term, some effects can be significant in size but insignificant in fact because they are so short-lived. Returning to the smoking example, would we call the treatment a success if it lowered the amount of smoking at school for the next day only?

Experimenters need to watch for what has been termed a 'Hawthorne' effect. A study of productivity in a factory (called Hawthorne) in the 1920s tried to boost worker activity by using brighter lighting (and a range of other treatments). This treatment was a success. Factory output increased, but only for a week or so before returning to its previous level. As there was apparently no long-term benefit for the factory owners, the lighting level was reduced to the status quo ante. Surprisingly, this again produced a similar short-term increase in productivity. This suggests that participants in experiments may be sensitive to almost any variation in treatment (either more or less lighting) for a short time. The simple fact of being in an experiment can affect participants' behaviour. If so, this is a huge problem for the validity of almost all experiments and is very difficult to control for in a snap-shot design. It can be seen as a particular problem for school-based research, where students might react strongly to any change in routine regardless of its intrinsic pedagogical value (and the same issue arises with changes of routine in prisons and hospitals). Of course, the Hawthorne effect could be looked at in another way (e.g. Brown 1992). If you were not interested in generating knowledge in your research, but literally only concerned with what works, then adopting Hawthorne-type techniques deliberately could be seen as a rational approach. Since production increased both when lighting levels were increased and when they were decreased, some of the factory owners were naturally delighted with the results (although this part of the story is seldom told in methods textbooks).

The very act of conducting a test or taking a measure can produce a confounding effect. People taking part may come to get used to being tested (showing less nervousness perhaps). Where the design is longitudinal they may wish to appear consistent in their answers when re-tested later, even where their 'genuine' response has changed. A related problem can arise from the demand characteristics of the experimenter who can unwittingly (we hope) indicate to participants their own expectations, or otherwise influence the results in favour of a particular finding. Such effects have been termed 'experimenter effects' and they are some of the most pernicious dangers to validity. In addition, apparently random errors in recording and analysing results have actually been found to favour the experimental hypothesis predominantly (Adair 1973). If the researcher knows which group is which and what is 'expected' of each group by the experimental hypothesis then they can give cues to this in their behaviour.

Traditionally, this effect has been illustrated by the story of a horse that could apparently count (Clever Hans). Observers asked Hans a simple sum (such as 3+5), and the horse tapped its hoof that number of times (8). This worked whether the observers were believers or sceptics. It was eventually discovered that it failed only when the observer did not know the answer (i.e. when they were 'blind', see below). What appeared to be happening was that the horse tapped its hoof in response to the question and, after tapping the right number of times, was able to recognise the sense of expectancy, or frisson of excitement, that ran through the observers waiting to see whether it would tap again. The horse presumably learnt that, however many times it had tapped, if it stopped at that moment it would receive praise and a sugar lump. Social science experiments generally involve people both as researchers and as participants. The opportunities for just such an experimenter effect (misconstruing an attempt to please the experimenter as a real result) are therefore very great. If we add to these problems the other impacts of the person of the researcher (stemming from their clothes, sex, accent, age etc.), it is clear that the experimenter effect is a key issue for any design.

'Contamination' can also enter an experimental design through changes in the nature of the measurements taken at different points. Clearly we would set out to control for (or equalise) the researcher used for each group in the design, and the environment and time of day at which the experiment takes place. However, even where both groups appear to be treated equally, the nature of the instrument used can be a confounding variable. If the instrument used, or the measurement taken, or the characteristics of the experimenter change during the experiment, this could have a differential impact on each group. For example, if one group contains more females and another more males, and the researcher taking the first measure is male while the researcher taking the second measure is female, then at least some of the difference between the groups could be attributable to the nature of same- and different-sex interactions. Note that this is so even though both groups had the same researcher on each occasion (i.e. they appeared to be treated equally at first sight).

In most experiments the researcher is not concerned with individuals but with aggregate or overall scores (such as the mean score for each group). When such aggregate scores are near to an extreme value they tend to regress towards the mean score of all groups over time almost irrespective of the treatment given to each individual, simply because extreme scores have nowhere else to go. In the same way perhaps that the children of very tall people tend to be shorter than their parents, so groups who average zero on a test will tend to improve their score next time, and groups who score 100% will tend towards a lower score. They will regress towards the mean irrespective of other factors. If they show any changes over time these are the only ones possible, so random fluctuations produce 'regression'. This is a potential problem with designs involving one or more extreme groups.
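Regression to the mean is easy to demonstrate by simulation (a sketch with invented numbers: each 'score' is a fixed underlying ability plus random noise, and no treatment is applied at all):

```python
import random

rng = random.Random(0)
abilities = [rng.gauss(50, 10) for _ in range(5000)]
test1 = [a + rng.gauss(0, 10) for a in abilities]  # first measurement
test2 = [a + rng.gauss(0, 10) for a in abilities]  # retest, same people

# Select the extreme group: the 500 lowest scorers on the first test...
lowest = sorted(range(5000), key=lambda i: test1[i])[:500]
mean1 = sum(test1[i] for i in lowest) / 500
mean2 = sum(test2[i] for i in lowest) / 500
# ...their retest mean drifts back towards the overall mean of about 50,
# with no intervention whatsoever
print(round(mean1, 1), round(mean2, 1))
```

Any trial that recruits an extreme-scoring group and then claims credit for their 'improvement' is vulnerable to exactly this artefact.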

As with any design, biased results are obtained via experiments in which the participants have been selected in some non-random way. Whenever a subjective value judgement is made about selection of cases, or where there is a test that participants must 'pass' before joining in, there is a possible source of contamination. This problem is overcome to a large extent by the use of randomisation both in selecting cases for the study and in allocating them to the various treatment and control groups, but note the practical difficulties of achieving this.

A specific problem arising from the extended nature of some experiments is dropout among participants, often referred to by the rather grim term 'subject mortality'. Even where a high quality sample is achieved at the start of the experiment this may become biased by some participants not continuing to the end. As with non-response bias, it is clearly possible that those people less likely to continue with an experiment are systematically different from the rest (perhaps in terms of motivation, leisure time, geographic mobility and so on). Alternatively, it is possible that the nature of the treatment may make one group more likely to drop out than another.

Perhaps the biggest specific threat to experiments in social science research comes from potential diffusion of the treatments between groups. In a large-scale study using a field setting it is very difficult to restrict the treatments to each experimental group, and it is therefore all too easy to end up with an 'infected' control group. Imagine the situation where new curriculum materials for Key Stage Two Geography teaching are being tested out in schools with one experimental group of students and their results compared to a control group using more traditional curriculum material. If any school contains students from both groups it is almost impossible to prevent one child helping another with homework by showing them their 'wonderful' new books. Even where the children are in different schools this infection is still possible through friendship or family relationships. Cross-infection in these circumstances can come from the teachers themselves who tend to be collaborative and collegial, and very keen to send their friends photo-copies of the lesson plans that they have just been given. For these teachers, teaching the next lesson is understandably more important than taking part in a national trial. On the other hand, if the experimental groups are isolated from each other, by using students in different countries for example, then we are introducing greater doubt that the two groups are comparable anyway. Similar problems arise in other fields, perhaps most notably the sharing of drugs and other treatments in medical trials.

As you can imagine, given these and other potential limitations of experimental evidence, there will always be some room for doubt about the findings even from a properly conducted experiment. It is important, however, to note two points. First, there are some things we can do with our basic design to counter any possible contamination, such as making as much of it as blind as possible so that the researchers analyse and enter data without knowing which group each case is in. Second, the experiment remains the most completely theorised and understood method in social science. With its familiarity comes our increased awareness of its limitations, but other and newer approaches will have as many and more problems. Worse, other designs will have dangers and limitations that we are not even aware of yet.

The basic experimental design takes care of several possible threats to validity. The random allocation of participants to groups reduces selection bias, so that the only systematic difference between the groups is the treatment, and the control group gives us an estimate of the differences between pre and post-test regardless of the intervention. Designs usually get more complex to control for any further threats to internal validity. In psychology in particular some very large, and sometimes rather unwieldy, approaches are used. A 'factorial design' uses one group for each combination of all the independent variables, of which there may be several. So for an experiment involving three two-way independent variables there would be eight conditions plus at least one control group. The effects of these variables would be broken down into the 'main effects' (of each variable in isolation) and the 'interaction effects' (of two or more variables in combination).
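The eight conditions for three two-way independent variables can be enumerated directly (a sketch; the three variable names are invented for illustration):

```python
from itertools import product

# Three invented two-level independent variables
factors = {"new_materials": (False, True),
           "small_class": (False, True),
           "daily_homework": (False, True)}

# One experimental group per combination of levels
conditions = list(product(*factors.values()))
print(len(conditions))  # 2 x 2 x 2 = 8 conditions, plus at least one control
```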

As you may imagine, the analysis of such advanced designs becomes accordingly more complex. Despite the fact that undergraduates are routinely taught these designs, they do not always, in my experience, either appreciate or understand them, and they even more rarely use them properly. I came across an entire student cohort of psychologists who were 'sharing' the syntax instructions (i.e. a computer program) to run a multivariate analysis of variance with their dissertation data. The syntax was given to them by a member of staff who appeared to believe that it could be used without explanation, for all and any experimental designs. None of the students I spoke to had the faintest idea what the numbers generated by this program meant.

Factorial designs are anyway sometimes used in situations where they are not necessary (perhaps only because 'we have the technology'). When faced with considerable complexity in the topic of an investigation, I feel that a more helpful response is to seek greater simplicity of approach rather than greater sophistication. For example, it is clear that the pretest phase in an experiment can sensitise people for their subsequent posttest (an experience/instrumentation effect). So we could use at least four groups and alternate both the treatment and whether there is a pretest or not. A simpler variant with the same advantage is the posttest-only design (Table 3.1). If the sample is large enough, it is possible to do away with the pretest and assume that the randomly allocated groups would have had equivalent mean scores before treatment. As this is even simpler than the basic design, we can be even more confident that it is only the intervention which causes any difference between groups. Problems are quite often solved in this way, via simplification of the process.

Table 3.1 - The posttest only experimental design

Group A:   random allocation   treatment      posttest
Group B:   random allocation   no treatment   posttest
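In a posttest-only design, the analysis reduces to comparing the posttest scores of the randomly allocated groups (a sketch with invented scores; a real analysis would also report an effect size and some indication of its uncertainty):

```python
def posttest_difference(group_a, group_b):
    """Estimate the treatment effect as the difference in posttest means,
    relying on random allocation to make the groups equivalent beforehand."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    return mean_a - mean_b

# Invented posttest scores for the treated and control groups
treated = [64, 70, 58, 66, 71]
control = [60, 62, 55, 59, 64]
print(posttest_difference(treated, control))  # difference in means
```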
Since the researcher can have a social impact on the outcomes of an experiment, this needs to be controlled for in the design, if possible, and made visible in the reporting of results. There are various standard techniques to overcome the experimenter effect, though it is doubtful that all would be available for use in a small-scale student project. To start with it is important that the participants are 'blind' in that they do not know the precise nature of the experiment until it is complete. Ideally the experimenter should also be 'blind' in not knowing which group any participant belongs to (and this is also some protection against the ethical quandary of running a real-life experiment when you already believe, but have no publishable evidence, that one treatment is better than another). This double-blind situation is sometimes maintained by means of a placebo (the name deriving from drug trials) in which everyone appears to undergo the same treatment even though some of the treatment is phoney or empty (equivalent to a sugar pill rather than a drug). Finally, if practical, it is better to have a 'triple blind' situation in which the person coding and analysing the data does not know until later which is the experimental group.
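The 'triple blind' step can be approximated by recoding group membership before the data reach the analyst (a sketch; the group and label names are invented):

```python
import random

def blind_codes(group_names, seed=None):
    """Replace real group names with opaque labels so that the person
    coding and analysing the data does not know which group is which
    until the analysis is complete."""
    rng = random.Random(seed)
    labels = ["group_%d" % (i + 1) for i in range(len(group_names))]
    rng.shuffle(labels)
    return dict(zip(group_names, labels))

coding = blind_codes(["treatment", "control"], seed=7)
# The analyst works only with the opaque labels; the mapping is kept
# sealed and revealed after the analysis is finished.
```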

Another way of achieving the same end is to automate the experiment and thereby minimise social contact (often not possible of course). Another is to sub-contract the experiment to someone else who does not know the details. You could, for example, offer to conduct an experiment for a colleague in return for them conducting yours. Other ways of minimising experimenter bias include getting more than one account of any observation, by using several people as observers and looking at the inter-rater reliability of all measurements taken, or by a triangulation of methods wherein the experimental findings are checked against evidence from other sources. All of these are good, and many can be used in combination.




How to reference this page: Gorard, S. (2007) Experimental Designs. London: TLRP. Online at (accessed )

Creative Commons License TLRP Resources for Research in Education by Teaching and Learning Research Programme is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 2.0 UK: England & Wales License



