# Variation, Standard Deviations, and Z-Scores

In the previous section, we saw how measures such as the average, the median, the range, and quartiles help us understand the general characteristics of a dataset. We also saw how the standard boxplot function in R can help us quickly visualize differences between datasets. In addition to these calculations that describe the general shape of a set of numeric data, there is another category of statistics that helps us understand the level of variation within a dataset and identify whether a particular point in that dataset is markedly different from the others in the group. These calculations are the variance, the standard deviation, and the z-score. ((All are addressed in Kinney 1982 pp. 51 - 60.)) In concrete terms, based on the examples from the previous section, these calculations let us quantify the phenomenon illustrated in the boxplot: Aeschylus' plays tend to be closer to each other in length than those of Euripides. They will also enable us to begin to talk about how different the longest and shortest plays written by Euripides are when compared to the complete body of his work.

The first two metrics - the standard deviation and the variance - are closely related. In fact, calculating the variance is an intermediate step in calculating the standard deviation. Both measures show how far the values in a dataset differ from the arithmetic mean. They are determined by subtracting the mean from each value in the dataset, squaring each of these differences (i.e. multiplying each by itself), calculating the arithmetic mean of the resulting list of numbers (this number is the variance), and then taking the square root of the result (this number is the standard deviation). ((See Kinney 1982 p. 53 and the concise description at http://www.techbookreport.com/tutorials/stddev-30-secs.html.))
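
The steps above can be sketched in R on a small vector. The word counts here are made-up values chosen for easy arithmetic, not figures from the corpus. One caveat worth knowing: R's built-in `var()` and `sd()` divide by n - 1 rather than n (the "sample" versions), so they return slightly larger numbers than the procedure described above when given the same data.

```r
# Hypothetical word counts for five plays (illustrative values only)
x <- c(4000, 5000, 6000, 7000, 8000)

deviations <- x - mean(x)      # subtract the mean from each value
squared    <- deviations^2     # square each difference
variance   <- mean(squared)    # average of the squares: the variance (2,000,000)
std_dev    <- sqrt(variance)   # square root: the standard deviation (about 1,414)

# R's built-ins divide the sum of squares by n - 1 instead of n
n <- length(x)
sample_variance <- sum(squared) / (n - 1)
all.equal(sample_variance, var(x))   # TRUE
```

For a dataset of any real size the two versions are close, but the difference is noticeable with only a handful of plays.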

The R functions used to calculate these two values are `var()` and `sd()`. To calculate the variance and standard deviation of the length of Aeschylus' plays, the commands would be `var(trag.length[trag.length$Author == "Aeschylus", "Word.Count"])` and `sd(trag.length[trag.length$Author == "Aeschylus", "Word.Count"])`, giving the results 1,273,040 and 1,128.291 respectively. As Kinney points out, the standard deviation is generally the more useful metric, largely because it is expressed in the same terms as the item being measured. In this case, the variance is expressed as 'number of words squared' while we can talk about the standard deviation in terms of number of words.

If we calculate the standard deviations for all three of our authors, we can see that Aeschylus' and Sophocles' plays do tend to be closer to the average than those of Euripides.

Standard Deviation of Lengths of Plays by Aeschylus, Sophocles, and Euripides

| | Aeschylus | Sophocles | Euripides | Total |
|---|---|---|---|---|
| Mean | 5,728 | 8,521 | 7,799 | 7,504 |
| Standard Deviation | 1,128 | 1,132 | 1,595 | 1,699 |
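
A summary like this one can be produced by grouping the word counts by author. The sketch below is one possible approach, not necessarily how the figures above were generated. It uses a hypothetical data frame named `plays` that borrows the `Author` and `Word.Count` column names from `trag.length` but contains made-up values.

```r
# Hypothetical stand-in for trag.length: same column names, made-up values
plays <- data.frame(
  Author = c("Aeschylus", "Aeschylus", "Aeschylus",
             "Euripides", "Euripides", "Euripides"),
  Word.Count = c(5000, 5500, 6000, 4000, 8000, 9000)
)

# Split the word counts by author, then summarize each group
by_author <- split(plays$Word.Count, plays$Author)
means <- sapply(by_author, mean)
sds   <- sapply(by_author, sd)

# One column per author, one row per statistic
round(rbind(Mean = means, "Standard Deviation" = sds))
```

With the real data, replacing `plays` with `trag.length` should give one column per author in the same layout.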

While the standard deviation gives us a sense of how close the items in a dataset are to the mean, it also provides us with a tool for understanding how similar or different a single data point is in relation to all of the others. Imagine, for example, that we want to know whether Euripides' Hippolytus is unusually long when compared to other works by Euripides. This play is 8,157 words long, the mean length of a play by Euripides is 7,799 words, and the standard deviation is 1,595 words. Because the difference between the length of the Hippolytus and the mean is 358 words, well under 1,595, we would say this play is within one standard deviation of the mean.

Generally speaking, in a dataset with a normal distribution (i.e. one whose values follow the traditional bell curve), some 68% of the data should fall within one standard deviation of the mean and a full 95% of the data should fall within two standard deviations of the mean. ((As described by http://www.robertniles.com/stats/stdev.shtml.))
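
These percentages can be checked against R's built-in normal distribution function, `pnorm()`, which returns the share of a normal distribution lying below a given number of standard deviations. The helper name `within_k_sd` is an illustrative choice, not a standard R function.

```r
# Share of a normal distribution within k standard deviations of the mean
within_k_sd <- function(k) pnorm(k) - pnorm(-k)

# Percentages for one and two standard deviations
round(within_k_sd(1:2) * 100, 1)   # 68.3 95.4
```

The figures only hold to the extent that the data really is normally distributed, which is worth checking before relying on them for a small corpus.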

The distance of a given value from the mean can also be measured using a metric known as a z-score. The z-score states how far a particular number in a dataset lies from the mean, expressed in terms of the standard deviation. It is calculated by subtracting the mean from the value you want to measure and then dividing the result by the standard deviation. ((http://cran.r-project.org/doc/contrib/Verzani-SimpleR.pdf p. 46)) Using our example above, we could calculate the z-score by hand using the formula `(8157 - 7799) / 1595`, which gives a result of roughly `0.224`. In R, this score can be calculated for any particular set of data using the following four lines of code:

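```r
# Word count of the play we want to measure
text <- trag.length[trag.length$Play == "Hippolytus", "Word.Count"]
# Standard deviation and mean of all of Euripides' plays
stddev <- sd(trag.length[trag.length$Author == "Euripides", "Word.Count"])
m <- mean(trag.length[trag.length$Author == "Euripides", "Word.Count"])
# z-score: distance from the mean in units of standard deviation
z <- (text - m) / stddev
```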

The result can be displayed as follows:

```
z
[1] 0.2241446
```

If you are feeling adventurous, the above code can be combined into a single line as follows:

`(trag.length[trag.length$Play == "Hippolytus", "Word.Count"] - mean(trag.length[trag.length$Author == "Euripides", "Word.Count"])) / sd(trag.length[trag.length$Author == "Euripides", "Word.Count"])`

A z-score can be either a positive or a negative number. A positive number simply tells us that the number we are considering is larger than the mean while a negative value tells us that it is smaller than the mean.
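
The sign behavior is easy to see with a small helper function. The name `z_score` is a hypothetical choice for this sketch. The inputs are the rounded figures quoted in this section for Euripides (mean 7,799, standard deviation 1,595), together with the lengths of the Hippolytus (8,157 words) and the Cyclops (4,104 words, discussed later in this section). Because the inputs are rounded, the results differ slightly in the later decimal places from what R reports when working from the full dataset.

```r
# Hypothetical helper: distance from the mean in units of standard deviation
z_score <- function(value, m, s) (value - m) / s

# Rounded figures quoted in the text for Euripides
z_hippolytus <- z_score(8157, 7799, 1595)   # longer than the mean: positive
z_cyclops    <- z_score(4104, 7799, 1595)   # shorter than the mean: negative

sign(c(z_hippolytus, z_cyclops))   # 1 -1
```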

The value of the z-score lies in its ability to help us estimate how likely a particular value is to appear in our dataset. As noted above, when the data is normally distributed, 68% of the numbers will be within one standard deviation of the mean and 95% of the numbers will be within two standard deviations. If we go out to three standard deviations, we should account for a full 99.7% of our data.
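
Read the other way around, these percentages tell us how rarely a value should fall beyond a given z-score. A minimal sketch follows, assuming the data really is normally distributed, which a small corpus of plays may not be. The helper name `two_sided_prob` is illustrative only.

```r
# Probability, under a normal distribution, of a value lying at least
# |z| standard deviations from the mean in either direction
two_sided_prob <- function(z) 2 * pnorm(-abs(z))

# Chance of falling beyond one, two, and three standard deviations
round(two_sided_prob(c(1, 2, 3)), 4)
```

The results are simply the complements of the 68%, 95%, and 99.7% figures: roughly 32%, 5%, and 0.3%.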

A large z-score by itself is not prima facie proof of a literary phenomenon, but it does serve as a useful flag for an area that needs further exploration. When we are looking at quantitative data, a z-score beyond +/- 2 is unusual, while a score beyond +/- 3 is extremely unusual and bears further scrutiny. For example, Euripides' Cyclops is 4,104 words long. The z-score for this value when compared to all of Euripides' plays is `-2.316208`. ((We can calculate the z-score for the Cyclops using the code `(trag.length[trag.length$Play == "Cyclops", "Word.Count"] - mean(trag.length[trag.length$Author == "Euripides", "Word.Count"])) / sd(trag.length[trag.length$Author == "Euripides", "Word.Count"])`. Note the parentheses around the subtraction: without them, R performs the division first.)) It is, therefore, reasonable to talk about this play as being significantly shorter than Euripides' other plays and to begin to look for reasons why this might be true.
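
Rather than checking plays one at a time, the same flagging rule can be applied to a whole set of values at once. The sketch below uses R's `scale()` function, which subtracts the mean and divides by the standard deviation. The play names and word counts are made up for illustration and are not the Euripides corpus.

```r
# Hypothetical word counts (made-up values, not the Euripides corpus)
word_counts <- c(Play.A = 7600, Play.B = 8100, Play.C = 7900,
                 Play.D = 8300, Play.E = 7700, Play.F = 8000,
                 Play.G = 7800, Play.H = 4100)

# scale() returns a one-column matrix; convert it back to a named vector
z <- setNames(as.numeric(scale(word_counts)), names(word_counts))

# Keep only the plays more than two standard deviations from the mean
z[abs(z) > 2]
```

With these invented numbers only `Play.H` is flagged. Against the real data, the vector of word counts for one author could be passed to `scale()` in the same way.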