Analyzing Literary Data
The first step for any scholar studying literature with statistical methods is to develop literary questions that can usefully be explored quantitatively.
The sorts of questions that can be answered with statistical methods can be grouped into several broad categories.
- Basic Facts:
  - Which character speaks most frequently in a text?
  - What words are most commonly spoken by an individual speaker or in a segment of text?
  - What is the average sentence length in the work as a whole, in the speech of individual characters, or in a specific segment of the text?
- Linguistic Questions:
  - What verb tenses are used most frequently by a character or an author?
  - How are grammatical features such as participles, particles, and pronouns distributed through a text?
- Thematic Questions:
  - How does a particular author write about a specific theme (e.g. Donald Hardy's book The Body in Flannery O'Connor's Fiction)?
  - How does an author construct narrative or characters (e.g. Michaela Mahlberg's book Corpus Stylistics and Dickens's Fiction)?
  - How does an author's use of pronouns and other extremely common words contribute to the creation of literary characters (e.g. John Burrows's book Computation into Criticism)?
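Questions in the "Basic Facts" category can be answered with a few lines of base R. The sketch below uses a tiny invented data frame of speakers and speeches (the names and lines are placeholders, not real data from any text) to count who speaks most often, tally one speaker's vocabulary, and compute average sentence length in words.

```r
# Toy data: invented speakers and speeches, for illustration only
speeches <- data.frame(
  speaker = c("Pip", "Havisham", "Pip", "Joe"),
  text = c("I am going to London.",
           "Love her. Love her. Love her.",
           "So I will.",
           "Ever the best of friends."),
  stringsAsFactors = FALSE
)

# Which character speaks most frequently (by number of speeches)?
sort(table(speeches$speaker), decreasing = TRUE)

# What words does one speaker use most often?
words <- tolower(unlist(strsplit(
  gsub("[[:punct:]]", "", speeches$text[speeches$speaker == "Havisham"]),
  "\\s+")))
sort(table(words), decreasing = TRUE)

# Average sentence length in words, across all speeches:
# split each speech into sentences at end punctuation, then count words
sentences <- unlist(strsplit(speeches$text, "(?<=[.!?])\\s+", perl = TRUE))
avg_len <- mean(lengths(strsplit(gsub("[[:punct:]]", "", sentences), "\\s+")))
avg_len
```

Real studies would load a full text (for example from a Project Gutenberg file) into such a data frame, but the counting logic stays the same.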
After a specific question or research approach has been selected, the next step is to gather data that can address it in a format that R can read. In any quantitative literary or linguistic study, gathering and formatting the data will be one of the largest components of the task. Many texts are available online, but - unless you are lucky - the texts that you want to study will not come with the elements you want to study marked in a computationally actionable way.
For example, a scholar wanting to study Miss Havisham's language in Dickens's Great Expectations could obtain an electronic edition of the text from Project Gutenberg, but would then need to work through the text, mark up all of the sentences spoken by Miss Havisham, and place them in a format that is readable by R. A study of the language of all of Dickens's characters would require substantially more effort.
Some questions - while still requiring data preparation - are easier to address. Chapters in books, individual poems, or equal-sized chunks of text from a novel can all simplify data acquisition, provided these units also correspond to the research question being asked.
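Dividing a text into equal-sized chunks is one of the simpler preparation steps. A minimal sketch in base R, where a short vector of numbered strings stands in for a tokenized novel:

```r
# A stand-in for a tokenized text: 10 "words"
tokens <- paste0("word", 1:10)

# Assign each token to a chunk of (up to) 4 tokens, then split
chunk_size <- 4
chunk_id <- ceiling(seq_along(tokens) / chunk_size)
chunks <- split(tokens, chunk_id)

length(chunks)    # number of chunks
lengths(chunks)   # tokens per chunk; the final chunk may be shorter
```

With 10 tokens and a chunk size of 4 this yields three chunks of 4, 4, and 2 tokens; for a real text, `tokens` would come from splitting the full document on whitespace.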
One of the first and most important questions when analyzing any textual data is whether or not to lemmatize the underlying language. Lemmatization means taking each form that appears in the text and resolving it to its dictionary headword, so that all inflected forms are grouped under a single lexical form. For example, the verbs running, runs, and ran would all resolve to the form run.
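In R, the result of lemmatization can be represented as a simple lookup table mapping inflected forms to lemmas. The toy table below is hand-built for illustration; in practice the mapping would come from a morphological parser such as the tools discussed next.

```r
# Toy lemma lookup table (hand-built; a real study would use a parser)
lemmas <- c(running = "run", runs = "run", ran = "run", run = "run")

tokens <- c("running", "ran", "runs")
lemmatized <- unname(lemmas[tokens])
lemmatized   # all three forms resolve to "run"
```

Counting `lemmatized` rather than `tokens` then groups all inflections of a verb under one entry in a frequency table.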
For Ancient Greek and Latin texts, morphological data can be obtained from the Perseus Digital Library open source downloads at http://www.perseus.tufts.edu/hopper/opensource. English-language texts can be parsed with the Stanford CoreNLP software (available from http://nlp.stanford.edu/software/corenlp.shtml) or with MorphAdorner (available from http://morphadorner.northwestern.edu/). Specific guidance on how to create a parsed text can be found on the Preparing Literary Data page.
Previous: Using Data Frames
Next: Count Records In A Table