Big Data and the Scientific Method

The first in a series of three articles on the implications and challenges in Big Data.

The term “Big Data” suggests that the concept, however defined, is a Big Deal.  But the practice of collecting, storing, and analyzing volumes of data in ways that were not possible only a few years ago is both older and bigger than what people typically mean by “Big Data.”

The ability to store and retrieve data, and to convert data into information, is an essential aspect of being human.  The development of language, and of oral tradition, is perhaps the most profound development in human history.  No one is quite sure when this happened; speech may have emerged a hundred thousand years ago, or perhaps earlier.  Certainly by 50,000 years ago, speech had become established in what we now view as the human race.  The ability to transmit information derived from observation and experience to others, and to successive generations, through words within the construct of a language completely changed the nature of what human beings could accomplish.  Transmission of information supplementing genetic transmission is an evolutionary leap that redefined the meaning of “human.”

Subsequent advances in information transmission came slowly at first, and then with increasing frequency.  Around 30,000 years ago, humans began recording information outside of memory, first as tallies of accounts (numbers), and then in cave paintings and other markings.  Much later (perhaps only 5,000 years ago) written language representing words and ideas appeared in various forms, permitting the recording of history, and the development and transmission of cultures and religions, in large volumes of written material rather than merely through oral tradition.  The development of writing instruments and of paper extended the reach of the written word.  In more modern times, a huge advance occurred with the printing press, which permitted mass production of written material and the democratization of information.  Arguably, the invention of the printing press enabled the Reformation, and the Peace of Westphalia that established the notion of the nation-state separate from local custom and religion.  Further democratization occurred with the development of pamphlets and newspapers, and the total amount of stored data greatly increased with the development of photography and then videography.  The twentieth century saw the development of computers, microelectronics, personal computers, and digital communication devices.  Each advance profoundly changed humanity’s interaction with data, and how we assemble, store, and transmit information.

With the further development of microprocessors and digital storage media, another profound change has occurred.  What are we to make of the fact that in 1993, an estimated 3% of the world’s data was stored digitally, but by 2007, 94% of all recorded data was digital[i] (and undoubtedly an even higher percentage by 2013)?  Much of that volume may be imagery and video, but it is clear that there has been a sea change in the sensing, recording, and retrieval of data, from analog forms to digital representations.  And while this transformation is occurring, the sheer volume of data has been growing at a rate that is at least exponential.  With new high-capacity disks, RAID storage systems, flash memory and solid-state drives, and now readily available cloud storage, it has become feasible to keep and maintain data from all events and activities in people’s lives.  The same researchers who observed the reversal in the balance of analog and digital storage also estimate that the amount of data has grown exponentially since 1986, at a compounded annual growth rate of 25%.[ii]
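A 25% compounded annual growth rate is worth pausing on.  A short calculation (a sketch using only the rate and dates cited above) shows what that figure implies:

```python
import math

rate = 0.25  # the compounded annual growth rate cited in the text
# Years for the stored volume to double at that rate.
doubling_time = math.log(2) / math.log(1 + rate)
# Total growth factor over the 1986-2007 span the researchers measured.
factor_1986_2007 = (1 + rate) ** (2007 - 1986)
print(round(doubling_time, 1), round(factor_1986_2007))
```

At that rate, the world’s stored data doubles roughly every three years, and grew by a factor of about one hundred over the two decades measured.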

The impact of this rapid transformation is, we assert, as profound as any of the historical revolutions that changed humankind’s interaction with information, on par with the invention of language.  The term “Big Data” grossly understates how big a deal the transformation to massive digital recording really is.

One of the principal impacts of this change is on the method by which we, as a human race, extract information and develop theories to explain phenomena.  The scientific method has served mankind well for centuries: an iterative process of observation, hypothesis, devising of tests, and refinement and/or validation.  The development of modeling and simulation capabilities with massive computer processing power allows for greater use of computational models in the testing phase of the scientific method.  But an even greater modification to the scientific method is afforded by the existence of massive stores of digitally recorded sensor data.  Instead of using a few observations and developing hypotheses based on human intuition, it is now possible to comb through massive amounts of observed data, and to develop hypotheses based on computed correlations.

For example, throughout history, marketers have attempted to convince people to buy things based on good guesses as to what might persuade them.  Now, online advertisers can follow trails of “clicks” and patterns of purchases, and deduce persuasive patterns far more easily.  This is the basis of many commercial endeavors that use internet and web tracking.

But scientific discovery can also make use of massive data stores to develop better theories of natural phenomena.  Cosmological observations, for example, have enabled us to deduce the presence of planets circling distant stars, based on analysis of patterns in stellar intensity data.  Medical data, including statistics computed over sequenced DNA, is permitting us to identify genetic causes of certain diseases.  Processing of the data gathered from particle accelerators (in particular, the Large Hadron Collider) yields massive amounts of information that has allowed scientists to deduce the existence of the Higgs boson.
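The exoplanet case gives a feel for what “analysis of patterns in intensity data” means in practice.  The toy sketch below runs on synthetic data (the dip depth, period, and detection threshold are illustrative assumptions, not real survey parameters): it flags the periodic dimming that a transiting planet imprints on a star’s light curve.

```python
import random

# Synthetic light curve: flux near 1.0 with small measurement noise.
random.seed(1)
n = 2000
flux = [1.0 + random.gauss(0, 0.001) for _ in range(n)]

# Inject 1%-deep, 10-sample-wide transits every 400 samples (a hypothetical planet).
for start in range(100, n, 400):
    for i in range(start, min(start + 10, n)):
        flux[i] -= 0.01

def find_dips(series, depth=0.005):
    """Return (start, end) index ranges where flux drops below baseline - depth."""
    baseline = sum(series) / len(series)
    dips, in_dip, start = [], False, 0
    for i, f in enumerate(series):
        low = f < baseline - depth
        if low and not in_dip:
            in_dip, start = True, i
        elif not low and in_dip:
            in_dip = False
            dips.append((start, i))
    if in_dip:
        dips.append((start, len(series)))
    return dips

dips = find_dips(flux)
print(len(dips), dips[0])
```

The detector recovers all five injected transits; the regular spacing of the recovered dips is the pattern from which an orbital period, and hence a planet, would be inferred.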

And while scientific inquiry has been transformed, the study of human society has likewise been unleashed by the collection of data on nearly every human on the planet.  There are, of course, significant concerns about individual privacy; but since companies and authorities are collecting data on all of our transactions, the locations of our devices, and the health and status of our machines, it will soon be possible to track the movements and behavior of just about every human being.  The potential to understand human behavior as a function of stimulus and history is both stunning and frightening.  If the data is anonymized and analyzed statistically, then great good can come from the analysis.  If the data is used to discover and “target” individuals, then there are more sinister possibilities.

In any case, the modification to the scientific method, whether for marketing, science, or sociology, goes deeper than just the amount of data being observed.  Instead of having to intuit relationships between observable variables, computer analysis can now examine groups of variables, sometimes correlated across multiple databases, to look for statistical relationships (a form of correlation) between them.  Algorithmic methods can be used to look for constraints that indicate a statistical relationship between observables.  The formulation can be extended to account for “noise” in the observed data by allowing for approximate relationships, and also to the case of variables that take on discrete values (as opposed to continuous numerical values).
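A minimal sketch of such an algorithmic screen, on synthetic data (the variable names, noise level, and threshold are illustrative assumptions): compute a correlation for every pair of variables and report the pairs whose relationship survives the noise.  The same idea extends to discrete-valued variables by swapping the correlation coefficient for a contingency-table statistic such as chi-squared.

```python
import math
import random
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def screen(table, threshold=0.8):
    """Report every pair of variables whose |correlation| exceeds the threshold."""
    hits = []
    for u, v in combinations(sorted(table), 2):
        r = pearson(table[u], table[v])
        if abs(r) >= threshold:
            hits.append((u, v, round(r, 3)))
    return hits

# Synthetic observations: 'pressure' tracks 'temperature' with noise,
# while 'humidity' is independent of both.
random.seed(0)
temperature = [random.gauss(20, 5) for _ in range(1000)]
pressure = [1000 + 2 * t + random.gauss(0, 3) for t in temperature]  # approximate (noisy) relationship
humidity = [random.gauss(50, 10) for _ in range(1000)]

hits = screen({"temperature": temperature,
               "pressure": pressure,
               "humidity": humidity})
print(hits)
```

Only the temperature–pressure pair survives the screen; the noisy dependence is recovered without anyone having hypothesized it in advance, which is exactly the inversion of the traditional method described above.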

The change is the ability to analyze, by algorithmic means, large amounts of data collected using “big data” techniques, whether stored on cloud-based disks, on flash drives in home laptops, or in massive government databases.  Until recently, discovery and analysis progressed largely through empirical means, dependent on an analyst’s ability to intuit relationships in data that was sparse, rare, and displayed in analog form.  We can now use computer programs to mine massive amounts of data and suggest relationships among variables that might never have been examined by humans.  Empirical testing and laborious search can give way to automated analysis of massive databases, suggesting relationships that can be used to develop hypotheses of causality, and models to describe behavior, far more rapidly.

This is a quiet revolution: sitting in the midst of a change that unfolds over only a few years masks the momentousness of the event in the course of human history.  This massive bulk of digital data, and the information derived from it, can be shared with societies throughout the world, and with generations to come.  Not all of the changes will be good.  It is often said that information is power, and with the ability to extract and share information by automated means, concentrations of power are indeed possible.  Already, political parties use massive databases for fundraising and campaigning.  Perhaps knowing and understanding human behavior too well will threaten individual choice and independence.  From influencing what we buy to dictating for whom we vote, our very selves might become predictions based on correlations of variables.

Lurking behind the development of “big data” analytics is a transformation comparable to the invention of language in its ability to transform the meaning of being human.  The opportunities and challenges afforded by this new way of deducing information from observations, now massive digital observations, can reasonably be called a revolution in humankind’s means of executing the scientific method of understanding phenomena.

While our ability to record data and to analyze massive databases is exploding, and while scientists and analysts develop new skills at executing automated data analysis in place of intuition and empirical search, a question arises: are we prepared to accept the implications and consequences of this massive revolution in the way humankind handles data?  Preparation might imply new policies, new rights, and new approaches to the sharing of information.  Most likely, these policy and procedural reforms will be enacted post facto, as the consequences of the big data revolution unfold.  But since at least some of the directions and implications are apparent now, it would make sense to implement policies now that steer the use of data analytics toward the development of knowledge that is beneficial, or at least not harmful, to mankind as a whole.

