Big Data is shackling mankind’s sense of creative wonder

by Ashutosh Jogalekar

Primitive science began when mankind looked upward at the sky and downward at the earth and asked why. Modern science began when Galileo and Kepler and Newton answered these questions using the language of mathematics and started codifying them into general scientific laws. Since then scientific discovery has been constantly driven by curiosity, and many of the most important answers have come from questions of the kind asked by a child: Why is the sky blue? Why is grass green? Why do monkeys look similar to us? How does a hummingbird flap its wings? With the powerful tool of curiosity came the even more powerful fulcrum of creativity around which all of science hinged. Einstein’s imagining himself on a light beam was a thoroughly creative act; so were Ada Lovelace’s thoughts about a calculating machine as doing something beyond mere calculation, James Watson and Francis Crick’s DNA model-building exercise, and Enrico Fermi’s sudden decision to put a block of paraffin wax in the path of neutrons.

What is common to all these flights of fancy is that they were spontaneous, often spur-of-the-moment, informed at best by meager data and mostly by intuition. If Einstein, Lovelace and Fermi had paused to reconsider their thoughts because of the absence of hard evidence or statistical data, they might at the very least have been discouraged from exploring these creative ideas further. And yet that is what I think the future Einsteins and Lovelaces of our day are in danger of doing. They are in danger of doing this because they are increasingly living in a world where statistics and data-driven decisions are becoming the beginning and end of everything, where young minds are constantly cautioned not to speculate before they have enough data.

We live in an age where Big Data, More Data and Still More Data seem to be all-consuming, looming over decisions both big and mundane, from driving to ordering pet food to getting a mammogram. We are being told that we should not make any decision pending its substantiation through statistics and large-scale data analysis. Now, I will be the first one to advocate making decisions based on data and statistics, especially in an era where sloppy thinking and speculation based on incomplete or non-existent data seem to have turned into the very air which the media and large segments of the population breathe. Statistics has especially been found to be both paramount and sorely lacking in making decisions, and books like Daniel Kahneman’s “Thinking, Fast and Slow” and Nate Silver’s “The Signal and the Noise” have stressed how humans are intrinsically bad at probabilistic and statistical thinking and how this disadvantage leads to them consistently making wrong decisions. It seems that a restructuring of our collective thinking process that is grounded in data would be a good thing for everyone.

But there are inherent problems with implementing this principle, quite apart from the severe limitations on creative speculation that an excess of data-based thinking imposes.

Firstly, except in rare cases, we simply don’t have all the data that is necessary for making a good decision. Data itself is not insight; it’s simply raw material for insight. This problem is seen in the nature of the scientific process itself; in the words of the scientist and humanist Jacob Bronowski, in every scientific investigation we decide where to make a “cut” in nature, a cut that isolates the system of interest from the rest of the universe. Even late into the process, we can never truly know whether the part of the universe we have left out is relevant. Our knowledge of what we have left out is thus not just a “known unknown” but often an “unknown unknown”. Secondly and equally importantly, the quality of the data often takes second place to its quantity; too many companies and research organizations seem to think that more data is always good, even when more data can mean more bad data. Thirdly, even with a vast amount of data, human beings are incapable of digesting this surfeit and making sure that their decisions include all of it. And fourthly and most importantly, making decisions based on data is often a self-fulfilling prophecy; the hypotheses we form and the conclusions we reach are inherently constrained by the data. We get obsessed with the data that we have and develop tunnel vision, and we ignore the importance of the data that we don’t have. This means that all our results are only going to be as good as the existing data.

Consider a seminal basic scientific discovery like the detection of the Higgs boson, nearly fifty years after the prediction was made. There is little doubt that this was a supreme achievement, a technical tour de force that came about only because of the collective intelligence and collaboration of hundreds of scientists, engineers, technicians, bureaucrats and governments. The finding was of course a textbook example of how everyday science works: a theory makes a prediction and a well-designed experiment confirms or refutes the prediction. But how much more novelty would the LHC have found had the parameters been significantly tweaked, had the imagination of the collider and its operators been set loose? Maybe it would not have found the Higgs then, but it would have discovered something wholly different and unexpected. There would certainly have been more noise, but there would also have been more signal that would have led to discoveries which nobody predicted and which might have charted new vistas in physics. One of the major complaints about modern fundamental physics, especially in areas like string theory, is that it is experiment-poor and theory-rich. But experiments can only find something new when they don’t stay too close to the theoretical framework. You cannot always let prevailing theory dictate what experiments should do.

The success of the LHC in finding the Higgs and nothing but the Higgs points to the self-fulfilling prophecy of data that I mentioned: the experiment was set up to find or disprove the Higgs and the data contained within it the existence or absence of the Higgs. True creative science comes from generating hypotheses beyond the domain of the initial hypotheses and the resulting data. These hypotheses have to be confined within the boundaries of the known laws of nature, but there still has to be enough wiggle room to at least push against these boundaries, if not try to break free of them. My contention is that we are gradually becoming so enamored of data that it is clipping and tying down our wings, not allowing us to roam free in the air and explore daring new intellectual landscapes. It’s very much a case of the drunk under the lamppost, looking for his keys there because that’s where the light is.

A related problem with the religion of “dataism” is the tendency to dismiss anything that constitutes anecdotal evidence, even if it can lead to creative exploration. “Yes, but that’s an n of 1” is a refrain that you must have heard from many a data-entranced statistics geek. It’s important to not regard anecdotal evidence as sacrosanct, but it’s equally wrong in my opinion to simply dismiss it and move on. Isaac Asimov reminded us that great discoveries in science are made when an odd observation or fact makes someone go, “Hmm, that’s interesting”. But if instead, the reaction is going to be “Interesting, but that’s just an n of 1, so I am going to move on”, you are potentially giving up on hidden gems of discovery.

With anecdotal data also comes storytelling, which has always been an integral part not just of science but of the human experience. Both arouse our sense of wonder and curiosity; we are left fascinated and free to imagine and explore precisely because of the paucity of data and the lone voice from the deep. Very few scientists and thinkers drove home the importance of taking anecdotal storytelling seriously as well as the late Oliver Sacks. If one reads Sacks’s books, every one of them is populated with fascinating stories of individual men and women with neurological deficits or abilities that shed valuable light on the workings of the brain. If Sacks had dismissed these anecdotes as insufficiently data-rich, he would have missed discovering the essence of important neurological disorders. Sacks also extolled the value of looking at historical data, another source of wisdom that would very easily be dismissed by hard scientists who think all historical data suspect because of its absence of large-scale statistical validation. Sacks regarded historical reports as especially neglected and refreshingly valuable sources of novel insights; in his early days, his insistence that his hospital’s weekly journal club discuss the papers of their nineteenth century forebears was met largely with indifference. But this exploration off the beaten track paid dividends. For instance, he once realized that he had rediscovered a key hallucinogenic aspect of severe migraines when he came across a paper on similar self-reported symptoms by the English astronomer John Herschel, written more than a hundred years earlier. A data scientist would surely dismiss Herschel’s report as nothing more than a fluke.

The dismissal of historical data is especially visible in our modern system of medicine which ignores many medical reports of the kind that people like Sacks found valuable. It does an even better job ignoring the vast amount of information contained in the medical repositories of ancient systems of medicine, such as the Chinese and Indian pharmacopeias. Now, admittedly there are a lot of inconsistencies in these reports so they cannot all be taken literally, but neither is the process of ignoring them fruitful. Like all uncertain but potentially useful data, they need to be dug up, investigated and validated so that we can keep the gold and throw out the dross. The great potential value of ancient systems of medicine was made apparent when two years ago, the Nobel Prize for medicine was awarded to Chinese medicinal chemist Tu Youyou for her lifesaving discovery of the antimalarial drug artemisinin. Tu was inspired to make the discovery when she found a process for low-temperature chemical extraction of the drug in a 1600-year-old Chinese text titled “Emergency Prescriptions Kept Up One’s Sleeve”. This obscure and low-visibility data point would have been certainly dismissed by statistics-enamored medicinal chemists in the West, even if they had known where to find it. Part of recognizing the importance of Eastern systems of medicine consists in recognizing their very different philosophy; while Western medicine seeks to attack the disease and is highly reductionist, Eastern medicine takes a much more holistic approach in which it seeks to modify the physiology of the individual itself. This kind of philosophy is harder to study in the traditional double-blinded, placebo-controlled clinical trial that has been the mainstay of successful Western medicine, but the difficulty of implementing a particular scientific paradigm should not be an argument against its serious study or adoption.

As Sacks’s and Tu’s examples demonstrate, gems of discovery still lie hidden in anecdotal and historical reports, especially in medicine where even today we understand so little about entities like the human brain.

Whether it’s the LHC or medical research, the practice of gathering data and relying only on that data is making us stay close to the ground when we could have been soaring high in the air without these constraints. Data is critical for substantiating a scientific idea, but I would argue that it actually makes it harder to explore wild, creative scientific ideas in the first place, ideas that often come from anecdotal evidence, storytelling and speculation. A bigger place for data leaves increasingly smaller room for authentic and spontaneous creativity. Sadly, today’s publishing culture also leaves little room for pure speculation-driven hypothesizing. As just one example of how different things have become in the last sixty years or so, in 1960 the physicist Freeman Dyson wrote a paper in Science speculating on possible ways to detect alien civilizations based on their capture of heat energy from their parent star. Dyson’s paper contained enough calculations to make it at least a mildly serious piece of work, but I feel confident that in 2017 his paper would probably get rejected from major journals like Science and Nature which have lost their taste for interesting speculation and have become obsessed with data-driven research.

Speculation and curiosity have been mainstays of human thinking since our origins. When our ancestors sat around fires and told stories of gods, demons and spirit animals to their grandchildren, it made the wide-eyed children wonder and want to know more about these mysterious entities that their elders were describing. This feeling of wonder led the children to ask questions. Many of these questions led down wrong alleys, but the ones that survived later scrutiny launched important ideas. Today we would dismiss these undisciplined mental meanderings as superstition, but there is little doubt that they involve the same kind of basic curiosity that drives a scientist. There is perhaps no better example of a civilization that went down this path than ancient Greece. Greece was a civilization full of animated spirits and gods that controlled men’s destinies and the forces of nature. The Greeks certainly found memorable ways to enshrine these beliefs in their plays and literature, but the same cauldron that imagined Zeus and Athena also created Aristotle and Plato. Aristotle and Plato’s universe was a universe of causes and humors, of earth and water, of abstract geometrical entities divorced from real world substantiation. Both men speculated with fierce abandon. And yet both made seminal contributions to Western science and philosophy even as their ideas were accepted, circulated, refined and refuted for the next two thousand years. Now imagine if Aristotle and Plato had refused to speculate on causes and human anatomy and physiology because they had insufficient data, if they had turned away from imagining because the evidence wasn’t there.

We need to remember that much of science arose as poetic speculations on the cosmos. Data kills the poetic urge in science, an urge that the humanities have recognized for a long time and which science has had in plenty. Richard Feynman once wrote,

“Poets say that science takes away the beauty of the stars and turns them into mere globs of gas atoms. But nothing is ‘mere’. I too can see the stars on a desert night, but do I see less or more? The vastness of the heavens stretches my imagination; stuck on this carousel my little eye can catch one-million-year-old light…What men are poets who can speak of Jupiter as if he were a man, but if he is an immense spinning sphere of methane and ammonia must be silent?”

Feynman was speaking to the sense of wonder that science should evoke in all of us. Carl Sagan realized this too when he said that not only is science compatible with spirituality, but it’s a profound source of spirituality. To realize that the world is a multilayered, many-splendored thing, to realize that everything around us is connected through particles and forces, to realize that every time we take a breath or fly on a plane we are being held alive and aloft by the wonderful and weird principles of mechanics and electromagnetism and atomic physics, and to realize that these phenomena are actually real as opposed to the fictional revelations of religion, should be as much a spiritual experience as anything else in one’s life. In this sense, knowing about quantum mechanics or molecular biology is no different from listening to the Goldberg Variations or gazing up at the Sistine Chapel. But this spiritual experience can come only when we let our imaginations run free, constraining them in the straitjacket of skepticism only after they have furiously streaked across the sky of wonder. The first woman, when she asked what the stars were made of, did not ask for a p value.