Averages, Range, Interquartile Measures, and Boxplots

In addition to the length of a particular work, it can also be useful to know something about the general characteristics of a body of work, such as the average length of a Greek tragedy. The average - or arithmetic mean - is calculated by taking the sum of the items in a list and dividing it by the number of items in the list. For example, to calculate the average length of a play written by Aeschylus, we would add up the number of words in each play and divide by seven. In R, we complete this calculation with the command mean(). To calculate the mean length of Aeschylus' plays, use the command mean(trag.length[trag.length$Author=="Aeschylus", "Word.Count"]) to get the result of 5728.143, or the command mean(trag.length[trag.length$Author=="Sophocles", "Word.Count"]) to get the result of 8521.571 for Sophocles' plays.
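The calculation described above can be sketched in R as follows. The data frame here is only a stand-in for trag.length: it holds just the seven Sophocles word counts listed later in this section, and it assumes the real data frame has the same Author and Word.Count columns. The names soph.only, counts, by.hand, and built.in are hypothetical.

```r
# Stand-in for trag.length containing only Sophocles' plays (assumption:
# the real data frame has the same Author and Word.Count columns).
soph.only <- data.frame(
  Author = rep("Sophocles", 7),
  Word.Count = c(7177, 7363, 7914, 8702, 8830, 9280, 10385)
)

# Select the word counts for one author, using the same indexing as the text.
counts <- soph.only[soph.only$Author == "Sophocles", "Word.Count"]

# The mean is the sum of the values divided by the number of values.
by.hand <- sum(counts) / length(counts)
built.in <- mean(counts)

print(by.hand)   # 8521.571
print(built.in)  # 8521.571
```

Both lines print the same number, which shows that mean() is simply shorthand for the sum divided by the count.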

It is important to realize that the mean value can be extremely misleading depending on the nature of our underlying data. For example, the mean of the list of numbers 1, 1, 1, 1, 1, 1, 1, 1000 is 125.875 - a number that bears very little relationship to any of the actual values in the list. Statisticians therefore also use other metrics that help characterize the distribution of values in a dataset, including the range, the median value, and the interquartile range.
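A minimal sketch of this example in R, comparing the mean with the median for the same list (the variable name x is arbitrary):

```r
# Seven ones and a single very large value, as in the example above.
x <- c(1, 1, 1, 1, 1, 1, 1, 1000)

mean(x)    # 125.875, pulled upward by the single value of 1000
median(x)  # 1, the value that nearly every item in the list actually has
```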

The range provides one extremely useful set of information about the dataset. It consists of the minimum value and the maximum value in a list. For the extremely simple list in the previous paragraph, the minimum value would be 1 and the maximum value would be 1000. Both numbers can be calculated at once using the R command range(). If you are only interested in one of them, you can use the R commands min() and max().
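As a short illustration, the three commands can be applied to the same simple list of seven ones and one thousand (the variable names are arbitrary):

```r
x <- c(1, 1, 1, 1, 1, 1, 1, 1000)

# range() returns the minimum and the maximum together as a two-item vector.
r <- range(x)
print(r)       # 1 1000

# min() and max() return each end of the range separately.
lowest <- min(x)
highest <- max(x)
print(lowest)  # 1
print(highest) # 1000
```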

The median value is quite simply the mid-point of a set, where half of the values will be larger than the median and the other half will be smaller. For a small dataset with an odd number of values, the median can be determined by arranging the set in numeric order and then taking the middle value. For example, the lengths of Sophocles' seven plays in ascending order are: 7177 7363 7914 8702 8830 9280 10385. The median value of this list is 8702; three of Sophocles' plays are longer than this and three are shorter. ((Kenny(1982) pp. 37-39 describes the equation used to calculate the median value for a large data set.)) The R command to determine the median value for Sophocles' plays is median(trag.length[trag.length$Author=="Sophocles", "Word.Count"]).
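The sort-and-take-the-middle procedure can be checked against median() with a short sketch. It uses the seven Sophocles word counts given above, typed in directly rather than read from trag.length, and the variable names are arbitrary:

```r
# The seven word counts, deliberately out of order.
soph <- c(9280, 7177, 10385, 8702, 7363, 8830, 7914)

# Arrange the values in ascending order and pick the middle position.
sorted <- sort(soph)
middle <- (length(sorted) + 1) / 2   # position 4 of 7

by.hand <- sorted[middle]
built.in <- median(soph)

print(by.hand)   # 8702
print(built.in)  # 8702
```

With an even number of values there is no single middle item; in that case median() returns the average of the two middle values.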

A set of measures closely related to the median is the quartiles of a dataset. Whereas the median gives the value at the midpoint of a dataset, where half the values are smaller and half are larger, the quartiles provide similar cut-off values for 25%, 50% and 75% of the items in the list respectively. For example, the value of the first quartile is the number where 25% of the values are smaller and 75% are larger. The interquartile range is the span between the first quartile (25%) and the third quartile (75%), which contains the middle half of the values. ((Kenny(1982) pp. 58-59 describes the equation used to calculate quartiles.)) The quartiles are calculated using the R command quantile. Using our small Sophocles dataset again, we can issue the command quantile(trag.length[trag.length$Author=="Sophocles", "Word.Count"]) and get the result:

     0%     25%     50%     75%    100%
 7177.0  7638.5  8702.0  9055.0 10385.0
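
As a small sketch of how the quartiles relate to the interquartile range, the same seven Sophocles word counts can be typed in directly (rather than read from trag.length) and passed to quantile() and to R's IQR() command. The variable names are arbitrary:

```r
soph <- c(7177, 7363, 7914, 8702, 8830, 9280, 10385)

# quantile() returns a named vector; individual quartiles can be picked by name.
q <- quantile(soph)
first.quartile <- q[["25%"]]
third.quartile <- q[["75%"]]
print(first.quartile)  # 7638.5
print(third.quartile)  # 9055

# The interquartile range is the distance between those two values.
by.hand <- third.quartile - first.quartile
built.in <- IQR(soph)
print(by.hand)   # 1416.5
print(built.in)  # 1416.5
```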

Taken together, these values can give us a good sense of the central tendencies and general characteristics of our data. R in fact makes it possible to generate all of these statistics with a single command, summary(). The command summary(trag.length[trag.length$Author=="Sophocles", "Word.Count"]) gives us the output:

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   7177    7638    8702    8522    9055   10380

Note that summary() has rounded these figures to four significant digits, which is why the maximum of 10385 appears here as 10380.

When the summary command is used in conjunction with the tapply function described previously, we can quickly compare the characteristics of the tragedies written by Sophocles, Aeschylus, and Euripides: tapply(trag.length[, "Word.Count"], trag.length[, "Author"], summary)

Author      Min.   1st Qu.  Median  Mean   3rd Qu.  Max.
Aeschylus   4939   5152     5297    5728   5685     8187
Sophocles   7177   7638     8702    8522   9055     10380
Euripides   4104   7128     7787    7799   9029     10030
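
To see how tapply splits a column by group before applying a function, here is a minimal sketch on a toy data frame. The authors "A" and "B" and their word counts are made up purely for illustration and are not taken from trag.length; only the column names follow the text.

```r
# Toy data: two hypothetical authors with invented word counts.
toy <- data.frame(
  Author = c("A", "A", "A", "B", "B", "B", "B"),
  Word.Count = c(10, 20, 30, 100, 200, 300, 400)
)

# Apply summary() separately to each author's word counts.
by.author <- tapply(toy[, "Word.Count"], toy[, "Author"], summary)

# Each author's summary can be retrieved by name.
median.a <- by.author[["A"]][["Median"]]
median.b <- by.author[["B"]][["Median"]]
print(median.a)  # 20
print(median.b)  # 250

# Any other function can be substituted for summary, for example median.
medians <- tapply(toy[, "Word.Count"], toy[, "Author"], median)
print(medians)
```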

The R graphics library also includes a command to generate a boxplot that concisely presents all of this data in a visual form. The command to generate this graph is boxplot(trag.length[, "Word.Count"] ~ trag.length[, "Author"], main="Word Lengths of Tragedies by Aeschylus, Sophocles, and Euripides", ylab="Length in Words", xlab="Author", col=(c("azure3"))) This command has the same basic structure as other commands we have used but with a few more options.

[Figure: Boxplot showing the length of tragedies by Aeschylus, Sophocles, and Euripides]

The first part of the command -- trag.length[, "Word.Count"] ~ trag.length[, "Author"] -- is the data we want to graph. This formula tells the boxplot command to graph the word lengths of the tragedies grouped by author. Everything after this defines formatting for the chart: main is the title of the chart, xlab is the label for the x-axis, ylab is the label for the y-axis, and the col argument defines the color of the plotted rectangle.

The boxplot presents the interquartile range from 25% to 75% as a rectangle on the chart. The median for the dataset is plotted as a solid black line across the rectangle, while the range is plotted with dotted lines (the whiskers) extending from the central rectangle up toward the maximum and down toward the minimum. By default, R draws each whisker only as far as the most extreme value that lies within 1.5 times the height of the rectangle; any value beyond that is drawn as an individual point. This graph allows us to see easily that Aeschylus' plays are typically shorter than those by Sophocles and Euripides. We can also see that the lengths of Aeschylus' plays vary substantially less than those of Euripides or Sophocles.
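
The numbers behind a single box can be inspected without drawing anything, using R's boxplot.stats() command. This sketch again types in the seven Sophocles word counts directly rather than reading them from trag.length, and the variable names are arbitrary:

```r
soph <- c(7177, 7363, 7914, 8702, 8830, 9280, 10385)

b <- boxplot.stats(soph)

# Five values: lower whisker, bottom of the box, median, top of the box,
# upper whisker.
print(b$stats)  # 7177.0  7638.5  8702.0  9055.0 10385.0

# Values that would be drawn as individual points beyond the whiskers.
print(b$out)    # numeric(0): none for Sophocles

# The line inside the box is the median, not the mean.
center.line <- b$stats[3]
print(center.line)  # 8702
print(mean(soph))   # 8521.571
```

For this small dataset the edges of the box coincide with the 25% and 75% values reported by quantile().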
