Data Science and Prediction

  • Published : April 15, 2013
Working paper CeDER-12-01 May 2012
http://hdl.handle.net/2451/31553

Data Science and Prediction
Vasant Dhar
Professor, Stern School of Business
Director, Center for Digital Economy Research
March 29, 2012

Abstract

The use of the term “Data Science” is becoming increasingly common along with “Big Data.” What does Data Science mean? Is there something unique about it? What skills should a “data scientist” possess to be productive in the emerging digital age characterized by a deluge of data? What are the implications for business and for scientific inquiry? In this brief monograph I address these questions from a predictive modeling perspective.

Electronic copy available at: http://ssrn.com/abstract=2086734

1. Introduction

The use of the term “Data Science” is becoming increasingly common along with “Big Data.” What does Data Science mean? Is there something unique about it? What skills should a “data scientist” possess to be productive in the emerging digital age characterized by a deluge of data? What are the implications for scientific inquiry?

The term “Science” implies knowledge gained by systematic study. According to one definition, it is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe.1 Data Science might therefore imply a focus on data and, by extension, Statistics, which is the systematic study of the organization, properties, and analysis of data and their role in inference, including our confidence in such inference. Why then do we need a new term, when Statistics has been around for centuries? The fact that we now have huge amounts of data should not in and of itself justify the need for a new term.

The short answer is that Data Science is different in several ways. First, the raw material, the “data” part of Data Science, is increasingly heterogeneous and unstructured – text, images, and video, often emanating from networks with complex relationships among their entities. Figure 1 shows the relative expected volumes of unstructured and structured data between 2008 and 2015, projecting a difference of almost 200 petabytes in 2015 compared to a difference of 50 petabytes in 2012. Analysis, including the combination of the two types of data, requires integration, interpretation, and sense making, increasingly based on tools from linguistics, sociology, and other disciplines.

Secondly, the markup languages, tags, and similar annotations now proliferating are designed to let computers interpret data automatically, making computers active agents in the process of sense making. In contrast to early markup languages such as HTML, which were about displaying information for human consumption, the majority of the data now being generated by computers is for consumption by other computers. In other words, computers are increasingly doing the background work for each other. This allows decision making to scale: it is becoming increasingly common for the computer to be the decision maker, unaided by humans. The shift from humans towards computers as decision makers raises a multitude of issues ranging from the costs of incorrect decisions to ethical and privacy issues. These fall into the domains of business, law, and ethics, among others.

1 The Oxford Companion to the History of Modern Science. New York: Oxford University Press, 2003.


Figure 1: Projected Growth in Unstructured and Structured Data

From an epistemological perspective, the data explosion makes it productive to revisit the age-old philosophical debate on the limits of induction as a scientific method for knowledge discovery. Specifically, it positions the computer as a credible generator and tester of hypotheses by ameliorating some of the known errors associated with statistical induction. Machine learning, which is characterized by statistical induction aimed at generating robust predictive models, becomes central to...