Millard Biology statistics made simple using Excel Neil Millard Spreadsheet programs such as Microsoft Excel can transform the use of statistics in A-level science Statistics is an area that most A-level biology students (and their teachers! ) find difficult. The formulae are often complicated, the calculations tedious, degrees of freedom mysterious, and probability tables confusing. But in fact students need no longer grapple with any of these. In real life, biologists and statisticians rarely use calculation and tables these days, but instead use statistical packages such as Maintain or SPAS.

But it isn’t even necessary to buy an expensive statistics package, since spreadsheet software such as Excel has most of the common statistical tests built-in. When using statistics, the first hurdle is to decide which statistical test to use. Figure 1 (overleaf) is a flow chart showing when to use the various tests described in this article. There are many other possible statistical tests, but this flow chart should be more than sufficient for A-level biology students. It briefly summarizes the Excel formulae and how to interpret the results, so it can be used as a handy guide on its own once the student is familiar with the tests.

This flow chart should be used when designing an experiment, not after the experiment is complete. This will ensure that the correct kind of data are collected so that the statistical test will be valid. The rest of the article describes in detail how to carry out these tests using Excel and how to interpret the results. It is divided into five sections: ABSTRACT Modern spreadsheet software, such as Microsoft Excel, can transform the use of statistics in biology. Instead of being difficult to do and to interpret, statistical tests become simple to do and much easier to interpret.

This article describes when and owe to carry out many of the most common tests (including mean, standard deviation, confidence limits, correlation, regression, t-test, xx-test and NOVA) using Excel. 1 Descriptive statistics mean, median, mode standard deviation, standard error, confidence interval scatter graphs, bar graphs error bars, lines Pearson coefficient, Superman coefficient linear regression 2 Graphing data 3 Association statistics Frequency statistics xx-test xx-test of association Most school biology experiments will involve some kind of measurement, such as time, length, mass, temperature, absorbency, etc. ND in a well-designed experiment there should be a number of repeats (or replicates) of each measurement. Once some measurements have been collected the first Job is usually to summaries them using descriptive statistics. Excel has formulae for the three measures of the centre of a distribution of replicates. The arithmetic mean is given by the formula: -AVERAGE (range) The median is given by the formula: -MEDIAN (range) And the mode is given by the formula: -MODE (range) School Science Review, December 2001 , 83(303) Figure 1 Flow chart used to choose an appropriate statistical test. These formulae are illustrated in Figure 2. In many cases the quantities measured in biology will show a normal distribution, and so the mean is the most appropriate statistic to use. It is also the one students are most likely to know already, and to be able to do by hand. The median and mode are less likely to be needed for experimental data, but some A-level specifications require a knowledge of them. It is unfortunate that Excel uses the word ‘average’ for ‘mean’, as some textbooks use average as a general term to refer to any measure of the centre of a distribution.

A tactician will tell you that there is no point in calculating a mean without also calculating some measure of the variation or spread of the measurements, but shows five different measures of the spread, and shows how easy they are to calculate using Excel. The range is given by the Excel formula: =MAX (range) – MIN (range) This is used in calculations, but has little use as a descriptive statistic since it is not in the same units as the measurements.

The standard deviation (SD) is given by the Excel formula: =STEED (range) This is common (since it is fairly easy to calculate by hand) and it gives a good indication of the variability of a set of data. However it is not the best statistic to use when comparing different sets of data, especially if the data sets are different sizes. The standard error of the mean (SE) is given by the formula: =STEED (range) / SORT (COUNT (range)) This gives an indication of the confidence of the mean, and is often used as an error measurement simply because it is small rather than for any good statistical reason.

The 95% confidence interval (C’) is given by the formula: ?CONFIDENCE (0. 05, STEED (range), COUNT (range)) This is the simplest, but least useful. The variance is given by the Excel formula: =VARY (range) Figure 2 Eight descriptive statistics. The MODE formula returns #N/A because no values are duplicated, so there is no modal value in these data. Note that Excel will always return the results of a calculation to about 8 decimal places. This is usually meaningless, and cells with calculated results should always be formatted to a more sensible precision (Format menu > Cells > Number tab > Number). 25 The value of 0. 5 is used to give the 95% (0. 95) confidence interval, and different values can be used for different levels of confidence, such as 0. 1 for a 99% confidence interval. There is a 95% probability that the true mean lies within В± CLC from the measured mean, and the upper and lower values of this range are called the confidence limits. Of these five, the 95% confidence interval is the most useful measure of the dispersion of data around the mean, and also the easiest to understand. It is not as well known as the others because it is so difficult to calculate, but using Excel it is no more difficult to calculate than the others.

It is the preferred on a graph. Students should always be encouraged to calculate a CLC whenever they lactate a mean, and to refer to it whenever they evaluate their data. If the CLC is small compared to the mean then the mean is reliable, but if the CLC is large compared to the mean then the mean is unreliable. In Figure 2 the two sets have the same meaner but different spreads, and the statistics all show that the data in group A have a smaller spread and are therefore more reliable than those in group B. Then the red arrow at the end of the X Values box.

Then highlight the cells containing the X data in the spreadsheet and press the red arrow again. Repeat for the Y Values box. In Chart Options, the most important tasks are to type in suitable titles for the graph and the two axes. You can also turn off grisliness and legend, which makes the chart look better. 4 In Graph Location, Just press Finish. This puts the chart beside the data so you can see both. Excel graphs are quite flexible and almost everything about them can be changed. Just double-click (or sometimes right-click) on the part you want to change.

For example, you can move and re-shape the graph; change the background color (white is usually best); change the shape and size of the markers points); Join the points; change the axes scales and tick marks; or add a trend line or error bars. Students should be discouraged from using AD or shadow effects, which only serve to obscure the graph trend. It is worth taking some time to get the graph right, because you can use an existing graph as a template. Simply type the new data in place of the existing data, and the graph automatically changes. The sheet can then be saved as a new file.

Graphs are an important part of data analysis and are closely connected to statistics, since the choice of graph is connected to the choice of statistical test, as the flow hart in Figure 1 shows. If you are investigating an association between two variables, then you should plot a scatter graph; if you are comparing different sets of data, you should plot a bar graph; and if you are collecting frequency data, then you may plot a bar or pie chart, or a graph may not be appropriate. In Excel it is quite easy to plot these graphs, as well as many other types. First enter the data into columns or rows, and select them.

Then click on the chart wizard (or Insert menu > Chart). This wizard has four steps: 1 In Graph Type, select the type you want and press Next. Choose Column’ for bar charts or ‘XX (Scatter)’ for line and scatter graphs. Do not choose ‘Line’, which plots the data against row number. This is a very common mistake. 2 In Source Data, if the sample graph looks about right, then Just press Next. If it looks wrong, you can correct it by clicking on the series tab, and Error bars If you are plotting averages on a scatter or bar graph, then error bars are a very good way to illustrate the confidence of the data on the graph.

Again, they are awkward to do by hand, but quite easy with Excel, and students should be encouraged to use error bars as a matter of course. Error bars usually show В± C’, although you could also plot them from SD or SE. Double-click on any data point or bar to get the Format Data Series dialogue box, and choose the Y Error Bars tab. Click intervals. Repeat for the Custom – box, and then press K. Error bars are useful for the evaluation part of student investigations. Small error bars suggest reliable data; large error bars suggest dubious data.

A line of best fit should pass through the error bars, and a good question to address in an evaluation is ‘Could I draw a different line through my error bars? In other words, do the data support a different conclusion? ). Figure 3 shows a graph where a curve has been drawn, but in fact a straight line would also pass through the error bars, so a linear relation is also supported by the data. 26 Figure 3 A scatter graph showing error bars. A curved line of best fit has been drawn through the data points, but in fact a straight line can also be drawn within the error bars, so a linear relationship is not ruled out.

Lines Scatter graphs often have lines, which either Join the data points or form a smooth ‘line of best fit’ (or trend line) through the middle of the points. The choice depends on the circumstances, but generally, if there should be a continuous smooth relation between X and Y, then a trend line is appropriate; otherwise the points should be Joined by straight lines. Trend lines are best drawn on the graph by hand, unless you want a linear regression line (see below). To Join the points with lines: double-click on any data point, select the Patterns tab and click on Line-Automatic.

It is not usually a good idea to have Excel draw a curved or smoothed line, as these curves can be highly misleading and can create spurious peaks and troughs for which there is no evidence. Superman rank-order correlation coefficient (RSI) for data that are not normally distributed (non-parametric data). Both vary from +1 (perfect correlation) through O (no correlation) to -1 (perfect negative correlation). In Excel the Pearson coefficient can be found by two alternative formulae: ?SORREL (range 1, range 2) -PEARSON (range 1, range 2) A common task in data analysis is to investigate an association between two variables.

This can be a correlation to see if two variables vary together, or a regression to see how one variable affects another. We’ll see how to do each of these There is no direct formula for the Superman coefficient, but it can be calculated by first making two new columns for the ranks of the original data. For each of the two variables the largest value is given a rank of 1, the next largest a rank of 2, and so on. This can most simply be done by hand, or for large data sets, by using Excels ?RANK command.

The Superman coefficient is then simply the Pearson coefficient calculated on the rank data, ignoring the original data. Both coefficients are demonstrated in Figure 4. This shows measurements on the size of breeding pairs of penguins to see f there is a correlation between the sizes of the two sexes. The Superman coefficient RSI (0. 77) is more conservative than the Pearson coefficient r (0. 88), but both show a strong positive correlation. Correlation A correlation tells us whether the two variables vary together, I. E. As one goes up the other goes up (or goes down).

The most common tests for correlation are the Pearson product-moment correlation coefficient (r) for normally-distributed (parametric) data, and the Linear regression Regression is used when we have reason to believe that changes in one variable cause the changes in the other. A correlation is not evidence for a causal relationship, but very often we are aware of a causal relationship and we design an experiment to School Science Review, December 2001 , 83(303) 27 Figure 4 Two types of correlation coefficient. The data are the lengths of a leg bone (in mm) in penguin mating pairs.

The Pearson coefficient r can be calculated directly from the data, but the Superman coefficient RSI must be calculated from the ranks of the data. The ranks can either be entered by hand or calculated using Excels ?RANK formula. Investigate it further. The simplest kind of causal relationship is a straight-line allegations, and this can be analyses using linear regression. This fits a straight line to the data using a least squares method, and gives the values of the slope and intercept that define the line (m and c in the equation y = mix + c).

There are several different ways of calculating the slope and intercept of a linear regression line in Excel, but the simplest is to plot a scatter graph and use the ‘Trending’ feature of the graph. Right-click on any data point on the graph, select Add Trending, and choose Linear. Click on the Options tab, and select Display equation on chart. You can also hose to set the intercept to be zero (or some other value), and to display re (the square of the Pearson correlation coefficient). The full equation with the slope and shows data obtained from counting a yeast cell suspension in a hammertoe’s and in a colorimeter.

We expect a linear causal relationship between cell density and turbidity, so this is a good occasion to use regression, and we can use the equation to predict the cell count for a given absorbency. 4 Comparative statistics Another common task in data analysis is to compare two or more sets of data to determine whether they are basically the same (I. . They could come from the same population) or one set is significantly different from the others. To start with, it is good practice to calculate the meaner and CICS for the different groups and plot a bar chart with the CICS represented as error bars.

This gives us a good visual idea of how different the groups are. It is sometimes thought that if the error bars don’t overlap, then there must be a significant difference between the data, but this is not necessarily true, and a statistical test of comparison is needed to test for significant differences. The end result of such tests is a probability (P) that the ‘null hypothesis’ (which always states that there is no difference between the sets of data) is true. In biology we usually accept differences as being significant if P is less than 5%, so if P < 5% then we can say that there is a significant difference between the sets of data (i. . reject the null hypothesis). If P > 5% then we can say that there is no significant difference between the sets of data (I. E. Accept the null hypothesis). 28 Figure 5 Linear regression. The scatter graph has a trend line with the regression equation displayed. In this case the intercept was fixed at zero, which is appropriate or these data. T-test The most common comparative statistical test is the t-test, which is used when there are Just two sets of normally-distributed data to compare.

In Excel this is performed by the formula =TEST (range 1, range 2, tails, type) the value of a paired test: although the data are quite varied, with quite a high C’, the pulse rate increased in each individual leading to the high significance of the conclusion. Excel provides two more formulae involving the t-test. The formula =DIDST (t, doff, tails) which returns P directly (not the t statistic itself, which is not reported and we don’t nearly want the two-tailed test, which tests for differences regardless of sign.

Type can either be 1 for paired data (when the two sets are from the same individuals) or 2 for unpaired data (where the sets are from different individuals), and both are common. Both kinds oft-test are demonstrated in Figure 6. In the unpaired test (type 2) the yield of potatoes in 10 plots treated with one fertilizers was compared to that in 10 plots treated with another fertilizers. Fertilizers B delivers a larger mean yield, but the t-test P shows that there is an 8% probability that these two sets are not really different. Since this is more than 5% we must conclude that fertilizers B is not significantly better than fertilizers A.

In the paired test (type 1), the pulse rate of 8 individuals was measured before and after a large meal. The mean pulse rate is a little higher after the meal, and the t-test P shows that there is only a tiny 0. 006% probability that the before and after data are the same. So the pulse rate does significantly increase after a meal. This demonstrates can be used to replace a look-up table, since it returns the probability corresponding to the given t value and degrees of freedom. The formula =TINT (P, doff) goes the reverse: it returns a t value corresponding to a given probability P.

You shouldn’t normally need these formulae. Mann-Whitney U-Test The t-test requires that the data be continuous and normally distributed. Sometimes this is not possible, for example if the data are calculated rather than measured, or if the data are counted. In this case the Mann-Whitney IS- test, which is the non-parametric equivalent of the t-test, should be used. Unfortunately Excel does not support this test. You could use Excel to calculate the U statistic, but you would still need probability tables to find the significance. 9 Figure 6 Two kinds of t-test.

Example of this is