Big Data is THE hot topic of the moment. Some fear it; others praise its incredible possibilities and the new era it opens for humanity. According to some, it would even enable the advent of a science without theories. But what exactly is Big Data?
On Paristechreview.com, Henri Verdier, president of Cap Digital (the competitiveness cluster for the digital content and services industry), answers with clarity the questions we are asking ourselves.
ParisTech Review. In the last two or three years, the issue of Big Data has pervaded the public debate, generating enthusiasm and doubts… And yet, we sometimes have a hard time defining what exactly we are talking about. Could you briefly explain?
Henri Verdier. This confusion isn’t surprising, because not only is it a recent theme, but more importantly, there is an ongoing political and economic confrontation around its definition. The term “Big Data” refers to at least three different phenomena.
In a narrow sense, it refers to new information technologies in the field of massive data processing.
In a broader sense, it refers to the economic and social transformations induced by these technologies.
And last, some analysts present it as an epistemological break: the leap from hypothetico-deductive methods – on which modern science is built – to inductive logics, which are very different from the former.
Moreover, the boom of Big Data covers enormous interests and this probably adds to the confusion. For example, IBM rebuilt its empire from ashes in this field. Other giants, such as Google and Facebook, are also very heavily involved. It’s an area that draws the attention of consultants and service providers, and all these people tend to exaggerate the impact of the technologies that they are trying to sell.
Does this mean that we are witnessing a mere bubble, a fad?
Definitely not. But precisely because of the significance of this evolution, we must keep a cool head and examine rationally what’s happening before our eyes. First things first: the technologies involved.
The first phenomenon is the deluge of data generated by servers that currently store volumes of information unfathomable only a few years ago (information available in digital format has increased from 193 petabytes in 1996 – the equivalent of all the books printed hitherto by humanity – to 2.7 zettabytes in 2012, a roughly 14,000-fold increase). This explosion is made possible by technical progress, but it is also fueled by new behaviors. You and I, everyone, every day, increasingly produce and exchange messages: tweets, posts, comments, SMS, emails, and so on. With the popularity of the “quantified self”, which consists of collecting personal data and sharing it, generating raw data has even become a new way of being-in-the-world. But we also produce data without knowing it: when buying a product in a supermarket, when clicking on a newspaper article, or by allowing our smartphones to geolocate us. (…)
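As a sanity check on these orders of magnitude, the growth factor can be verified with a few lines of arithmetic (a back-of-the-envelope sketch; the 1996 and 2012 figures are the ones quoted above):

```python
# Back-of-the-envelope check of the data-growth figures quoted above.
PETABYTE = 10**15   # bytes
ZETTABYTE = 10**21  # bytes

data_1996 = 193 * PETABYTE   # digital information available in 1996
data_2012 = 2.7 * ZETTABYTE  # digital information available in 2012

growth = data_2012 / data_1996
print(f"growth factor 1996 -> 2012: {growth:,.0f}x")  # roughly 14,000x
```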
The second new feature is the ability to deal with this new data. In a certain way, it is not quantity that defines Big Data, but rather a certain relationship with data, a certain way of playing with it. We learn every day how to manage, measure, interpret better – and at cheaper costs. (…)
Even today, in Silicon Valley, one can observe the rise of Big Data hardware technology: players like Facebook, SAP, IBM and Goldman Sachs have organized and funded programs to learn how to manage massive amounts of data. One of the challenges for them is to deal with Google, of course, which has emerged as the actor par excellence in Big Data processing. (…)
This technological shift is not only a question of volume. It is often said that Big Data relies on the “three Vs”: variety, velocity and volume. Big Data information technology is renewed every day to deal with large amounts of data, often poorly structured, in lightning-short timeframes (as, for example, in high-frequency trading).
Performance, therefore, covers the amount of data processed, the diversity of sources and the search for real-time answers. This new available power opens the door to new strategies of data processing. You learn to handle complete distributions, to play with probabilities, to translate problems into automatic decision systems, to build new visualizations for new rules of interaction with data.
A new school of computer science is rising, a new way of programming, partly inspired by the hacker culture. The members of this community focused on the hardware in the 1970s, on software in the 1980-1990s, on contents in the 2000s. And now, they are focusing on data. (…)
A new computer science, or better said, a new philosophy of computer science: does that suppose different professional skills?
Indeed, yes. Today, a new profession is emerging: the data scientist, who could be defined as follows: data scientists are basically good mathematicians and, more specifically, statisticians; next, they are good computer engineers and, if possible, excellent hackers, capable for instance of installing three virtual machines on a single server; last, and this is a crucial point, they are able to provide strategic advice, because most organizations today are completely unprepared for Big Data. It is possible that these different functions will split apart in the future. But for now, we need all three skills.
To these three basic skills, I might add data visualization: being able to give a shape, a readable shape, to calculations is absolutely crucial if we want Big Data to be of any use.
Precisely, what does it do? What are the applications of these new skills?
Generally speaking, producing and capturing data creates value. The question, of course, is knowing how and where.
On some subjects, applications already exist: marketing comes to mind, of course, where ad targeting is made possible by the cloud processing of the data generated by each user. Another example is customization, as performed by Amazon, which is capable of suggesting books or movies amazingly close to your personal tastes. (…)
But these are just obvious examples. Big Data allows many other things. It can, for example, assist organizations in analyzing complex problems and taking into account the variability of these situations, instead of always thinking in terms of the “average customer”, the “average patient” or the “average voter”…
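The point about averages can be made concrete with a toy sketch (the numbers below are invented for illustration, not taken from the interview): the mean of a customer population can describe nobody at all, while the distribution reveals the real segments.

```python
import statistics

# Hypothetical monthly spend (EUR) of ten customers: two distinct segments,
# occasional buyers and heavy buyers -- invented numbers for illustration.
spend = [10, 12, 9, 11, 10, 190, 210, 205, 195, 200]

mean_spend = statistics.mean(spend)
print(f"the 'average customer' spends: {mean_spend:.0f} EUR")  # ~105 EUR

# Yet no actual customer spends ~105 EUR: looking at the distribution
# rather than the mean reveals the two segments the average hides.
light = [s for s in spend if s < 100]
heavy = [s for s in spend if s >= 100]
print(f"light segment mean: {statistics.mean(light):.0f} EUR")  # ~10 EUR
print(f"heavy segment mean: {statistics.mean(heavy):.0f} EUR")  # ~200 EUR
```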
(…) In a completely different field, high-frequency trading is also one possible application of Big Data. It’s not only about multiplying financial transactions, but also an asset for overtaking other operators, by responding quickly to their operations and by using more efficient communication channels.
We could also mention the emerging field of the feedback economy, based on constant iterations to optimize supply, both in terms of available stocks and in terms of prices. Or personal assistants such as Siri, which you can train yourself. Or applications such as Dr. Watson, which provide diagnostic support to high-end hospital teams.
Precisely, a system like Dr. Watson raises the problem of the reliability of interpretations drawn from Big Data.
True. In this case, it is just an aid to diagnosis; it doesn’t replace a doctor’s visit. But we would be wrong to stop at this conclusion. There are situations where you do not have any reliable data. The UN, for example, receives economic data that is several years old, and sometimes even distorted. Epidemiology works with data that is both expensive and slow to produce. And yet, one can follow an influenza or dengue epidemic with simple queries on Google. Monitoring an epidemic in real time and with free data is priceless! Information produced by Big Data is often based on imperfect or incomplete sources and is therefore neither absolutely certain nor guaranteed nor reliable. But oddly enough, because of the law of large numbers, it is often an effective source of information.
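The law-of-large-numbers argument can be illustrated with a small simulation (a generic sketch, not the actual methodology behind Google’s epidemic tracking): each individual observation is extremely noisy, yet aggregating millions of them recovers the underlying signal.

```python
import random

random.seed(42)

# Hypothetical true share of flu-related queries in the population.
true_rate = 0.05

# A single observation is almost useless: it is just True or False.
one_observation = random.random() < true_rate

# But averaging a million such noisy observations converges on the
# underlying rate -- the law of large numbers at work.
n = 1_000_000
estimate = sum(random.random() < true_rate for _ in range(n)) / n
print(f"estimated rate from {n:,} noisy observations: {estimate:.4f}")
```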
But how should we interpret these phenomena? Among the recurring debates around Big Data, there is the idea of a scientific revolution, including the horizon of a “science without theory” as prophesied by Chris Anderson.
Again, we must make some distinctions. There’s definitely something important at work in the social sciences, especially in marketing and sociology, which have never had the pretension of unveiling universal laws. In these disciplines, Big Data not only leads to a greater ability to process data, but also to a form of liberation in the way we organize this data. For instance, when mapping 30 million blogs, new sociological categories appear which sociologists would never have thought of. In short, sociological categories emerging from pure empirical observation can be far more relevant than previous conventional categories.
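This idea of categories emerging from the data itself, rather than being imposed beforehand, is what clustering algorithms do. Here is a minimal k-means sketch on invented two-dimensional “blog features” (the real mapping of 30 million blogs used far richer representations; the two features and the group structure below are assumptions for illustration):

```python
import random

random.seed(0)

# Invented toy data: each "blog" reduced to two numeric features
# (say, a politics score and a lifestyle score). The data contains
# two latent groups, but no labels are given to the algorithm.
blogs = ([(random.gauss(1, 0.3), random.gauss(1, 0.3)) for _ in range(50)]
         + [(random.gauss(5, 0.3), random.gauss(5, 0.3)) for _ in range(50)])
random.shuffle(blogs)

def kmeans(points, k=2, iters=20):
    """Plain k-means: the categories emerge from the data, none are given."""
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: (p[0] - centroids[c][0]) ** 2
                                          + (p[1] - centroids[c][1]) ** 2)
            clusters[i].append(p)
        # Move each centroid to the mean of its cluster.
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids

centers = sorted(kmeans(blogs))
print(centers)  # two centers, near (1, 1) and (5, 5): the groups were "discovered"
```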
This is what led Chris Anderson, the editor of Wired magazine, to formulate his idea of a “science without theory”, which would use inductive rather than deductive logic. In this model, truth almost spontaneously sprouts from data. And indeed, thanks to “machine learning”, we sometimes end up in situations where we are able to predict, with equations we don’t really know, results we can’t really explain! (…) This doesn’t prevent us from making the effort to understand: statisticians emphasize that any serious work on Big Data requires understanding the processes of data generation as well as their evolution. At the same time, data management is always based on causal inferences that should be stated and understood.
Public authorities often have access to large statistical databases: have they taken hold of the subject?
There is a clear interest, and some remarkable initiatives. A city like New York, for example, has formed a small team of data scientists who were able to extract precise information from the masses of public data available from the city. For instance, they identified the areas and streets where fires were most likely to occur. This information provided the right indications for safety inspections and consequently helped reduce the number of fires. They also developed an algorithm to detect tax fraud. And it works!
Hence the “Big Brother” label often associated with Big Data?
The previous examples use statistical – not personal – data. But indeed, the developments of Big Data should be confronted with the contemporary obsession for transparency, often flavored with a certain naïveté. This is disturbing. (…)
Personally, I think that we have a number of issues to deal with that are far more serious than privacy in the strict sense. Privacy will be protected, one way or another. Following Daniel Kaplan’s observations, I might add that beyond the concern for privacy issues, there is another equally important issue about which there haven’t been so many studies: I’m referring to automatic decision-making. In the near future, the operations by which an online merchant sets the price of an item or service will no longer be based on an ensemble of average buyers, but on the price that you, specifically, are willing to pay. One can well imagine a website able to trace your buyer profile and offer you a price according to this profile. Not just any price, of course, but the highest price you are willing to pay. It is quite possible that this type of profiling will soon become the basis for relationships in many environments. And this is certainly a matter of concern.
Getting back to public data, it’s not only limited to administrative use. Sometimes, the most interesting developments occur when the public sphere renounces its monopoly on certain data and organizes the possibility for others to work on it. GPS, originally developed by the U.S. Army, is a classic example of this strategy.
The open data movement is also a major issue. (…) One of the most recent developments is known as smart disclosure: a strategy of returning data to those who produce it, so that they can make use of it. The best example, in my opinion, is the Blue Button for American veterans. When using certain online services, they press the button and the service is personalized, boosting its efficiency. Incidentally, in this example, it is not really a question of “returning” data to citizens but rather of allowing them to retransmit it to whomever they want.
There is a potential political agenda here; let’s summarize some of the issues at stake. The first is the possibility of using Big Data to quickly measure and improve the effectiveness of public policies. The second is to open up the most relevant public data, or to target it, in order to rally private and social actors in support of public policies. And the last is to promote smart disclosure in order to offer new services to citizens.