Literature: Just Another Data Set?


by Josh Trapani

How problematic is the application of Big Data methods to literature? That is: is there a problem in taking huge swaths of literature and subjecting them to data mining techniques, as reported, for example, in this recent New York Times article on the most influential 19th-century authors? Is there something about literature – as opposed to, say, daily weather phenomena, stock market movements, or global transactions at Walmart – that makes it irreducible to data?

A much-discussed essay in the Los Angeles Review of Books claims there is (subsequent thoughtful replies make counterarguments, and many others around the internet have chimed in). This discussion strikes me as one that could turn into a pointless food fight or, more positively, could start moving scholars to a place where the “two cultures” begin to understand each other a bit better than they have in the past. (I’m not the first person to think of that, either: see here.)

I haven’t read much to suggest that the worry is practical: that “trendy” digital humanities research will take scarce resources away from other avenues of humanities and social sciences research before producing a bunch of low-quality garbage studies and ultimately fizzling out. This quite valid worry (at least about the funding!) is lurking well beneath the surface, if it’s there at all.

The LA Review of Books article makes the especially odd assertion – one I haven’t seen addressed much in the replies or the comments – that one reason literature can’t be reduced to data is that what we know of it is so incomplete. “The information we have about the past is, in almost every case, fragmentary. There are always masses of data which are simply missing or which cannot be untangled.”

I say: so what?

This is like telling my paleontology buddies that they can’t study the history of life because the fossil record is incomplete. Make that assertion – I dare you – and prepare to be pummeled by thousands of (empty, of course) bottles of Sierra Nevada Pale Ale and Bell’s Oberon (and dozens of pyritized coprolites too, just for symbolic value).

Also prepare to be shown heaps of contrary evidence.

One reason that paleontologists can study an incomplete fossil record is that, in evolutionary theory, they have a solid groundwork for such study. This theoretical underpinning helps them ask sensible questions and pose testable hypotheses. Analysis of large data sets is merely another way to test hypotheses and answer questions. Of course there is a risk of hotshot data analysts who don’t really know the subject finding patterns where there aren’t any or missing nuances in the data (see my review of The Signal and The Noise by Nate Silver for more on that issue). But that risk is no greater in literary analysis than in anything else.

Remember that “Big Data,” for all the hype, consists merely of: 1) new information sources (like millions of newly digitized books or long DNA sequences) and 2) tools, especially statistical ones, for managing and analyzing them. (To digress momentarily back to the practical: I can think of a whole lot worse from a workforce perspective than having more humanities PhDs graduate with quantitative analysis skills.) But to run large data sets through various analyses with no purpose in mind – just to see if something that might be a pattern emerges – is an answer seeking a question, and it’s shoddy work regardless of discipline.

Others have pointed out that some questions are going to be unanswerable with any data set. Most of the questions I’ve seen addressed with literary data sets seem more social sciences-oriented (cultural, linguistic, historical) than truly literary (for example, here). As for determining whether any particular question is well-answered…well, the devil is in the details, and the New York Times write-up of a research study isn’t the place to find those details.

Readers better versed in the humanities than I am (and remember folks, this might be an online book review, but I’m a scientist by training and an analyst in my day job) might also take some comfort in knowing that these same tools are being put to use on science itself. Textual analysis tools are being applied to the inputs and outputs of scientific research (grant proposals, publications, patents) to better understand both: 1) the process of science and innovation (the National Science Foundation even has a Science of Science and Innovation Policy program, and I know enough to know that development of a coherent theoretical framework is one of the major goals of scholarship in this nascent field) and 2) the areas in which science is progressing, at a much finer granularity than would be possible simply by coding projects by discipline. To say that science isn’t reducible to data doesn’t make any sense. So why would literature be different from science in this particular regard?

Both are human endeavors, with all that entails.
