Source: It’s Always Been About Big Data

In 1668 the British Royal Society undertook its first official publication, An Essay Towards a Real Character, and a Philosophical Language, penned by the Society’s first secretary, John Wilkins. Wilkins’ ambitious work proposed to create a universal language, intelligible to all nations and peoples, as a means of facilitating and accelerating the production of knowledge. The bulk of the multi-volume work consists of an extended taxonomy of the world, in which Wilkins attempts to understand and categorize everything, literally, as a means of creating a universally navigable system of linguistic representation.


Wilkins’ universal taxonomy may seem to have little connection to the problem of big data in the digital humanities, but Wilkins stood at the same representational event horizon as does the digital humanist trying to structure, in one form or another, a world of endless, semantically related data. There has, in fact, never been anything other than big data. The invention of “data” as a form of binary-stored information represents an ontological lie: the pretense that some set of information can have meaning outside of its connection with the universe of signification.

Because in computing’s infancy we could only physically store a limited number of ones and zeros in either physical or working memory, we began to think of each cluster of ones and zeros, each file, as having a discrete existence of its own, that was somehow separate and different from the rest of our discursive universe. We allowed the limitation of the machine to lull us into the belief that the “black box” held some magical power to create independent, contained realities.

The ubiquity of the network revealed the fallacy of this dream, both functionally and theoretically. The moment most of us started carrying the network around in our pockets, we became immediately dissatisfied with stand-alone applications, data-stores, and personalities. The desire to connect my map, my journal, and my address book not only to each other but also to your map, your journal, and your address book demanded we give up the illusion of a stand-alone discursive universe and recognize that these ones and zeros are simply one of many languages we use to write ourselves.

And so, we find ourselves yet again on the brink of Wilkins’ dilemma—that of defining an Ur-language capable of semantically unifying the complete discursive universe. A common problem lies at the heart of all engagement with data. All data manipulations short of a Dadaist artificial intelligence rest on the edge of an ontological razor. You cannot visualize, link, search, or browse any set of data without first somehow structuring said data according to some discriminating system that says, at a minimum, “This is like that, and that is like this!” And before we can say this, we must first agree on the very boundaries of the this.

In order to link, order, or display data based upon dates, for example, we must first have an idea of date. In order to link, order, or display data by place, we must first have an idea of place. In order to link, order, or display data based upon anything other than a totally random, meaningless presentation, we must first have ordered our universe such that we recognize meaningful categories upon which we can discriminate and, hence, understand the data.

We thus find ourselves standing at the same ontological brink as did John Wilkins in 1668. The volume of our digital discourse is such that we can no longer pretend that data exists or has meaning that is somehow separate from the entirety of our discursive realities. And we can no longer pretend that the basic problem we face—that of structuring this universe—is fundamentally different from that of our predecessors. Digital tools certainly increase the speed with which we can test new discursive formulations and map links between those composed in different languages. But, in the final estimation, the very discrimination that constitutes the boundary between one datum and the next is already pre-determined by our ontological history.

It is, of course, the dream of “big data” that the sheer presence of numerically staggering data pools will spontaneously offer a solution to this problem, as we can rely on the data itself to reveal its own taxonomies. But here we find ourselves on the equally slippery grounds of apophenia and religiosity—of patterns that have no meaning or, alternatively, that have one and only one meaning. We find ourselves, as it were, once again on the doorstep of the Enlightenment.