“Big Data”: A Brief History

World War II spurred the creation of a variety of computing technologies and devices. These early computers included:

  • Bletchley Park’s top-secret Colossus
  • The German Z3
  • ENIAC, the U.S. Army’s “Electronic Numerical Integrator And Computer”

In the 1950s and 1960s almost every branch of business and government was getting excited about data.

  • 1955: The NSA was processing 37 tons of intercept material per month. Volume issues like these prompted the government to fund the creation of new supercomputers.
  • 1956: IBM unveiled the IBM 305 RAMAC, the first commercial computer with a hard drive (it weighed over a ton). Built for real-time analysis, the 305 could maintain records fast enough to keep up with demand, providing random data access and eliminating peak loads.
  • 1966: Statisticians at North Carolina State University began work on a computerized statistics program capable of processing vast amounts of U.S. agricultural data. The resulting software package was named the Statistical Analysis System. We call it SAS.

By the 1970s, big data had advanced to such a stage that everyday citizens were beginning to feel overwhelmed by developments (sound familiar?). During this era, they were:

  • Shocked to learn about the amount of personal information that large credit reporting agencies were accumulating and storing
  • Angered by the idea of the government creating a centralized national data center (the controversy would lead to the Privacy Act of 1974)
  • Excited to read about concepts like the Ethernet
  • Intrigued by the possibilities of personal computers

Statisticians were way ahead of them. In 1977, during its 41st session, the International Statistical Institute (ISI) established a new section, the International Association for Statistical Computing (IASC).

For people outside the cybernetics loop, it was a decade of optimism. Those in the know, however, foresaw that a perfect storm of data technologies was coming. They just weren’t sure when. Then inexpensive home computers became widely available. The World Wide Web was created. And the storm broke.

Though the Internet had been in development since the early 1960s (it began its life as a Department of Defense project called ARPANET), it wasn’t until 1989 that Tim Berners-Lee proposed the idea that information could be shared globally through a hypertext system.

This “Information Superhighway” changed everything. The world went from the equivalent of horse-drawn carts to automobiles overnight. The volume of global data exploded.

  • 1994: The first full-text Web search engine, WebCrawler, appears. Before this, only page titles were being recorded.
  • 1996: Usama Fayyad, Gregory Piatetsky-Shapiro, and Padhraic Smyth publish From Data Mining to Knowledge Discovery in Databases – “In our view, KDD [Knowledge Discovery in Databases] refers to the overall process of discovering useful knowledge from data.”
  • 1998: Google goes live. The company makes a name for itself with sophisticated search algorithms such as PageRank, analyzing the hyperlinks between pages as well as the text on them.
  • 1999: Napster launches and brings peer-to-peer file sharing to a mass audience.

Innovation accelerated to meet demand. As the Web continued to explode, computer scientists worked feverishly to develop new data analysis and storage technologies. Hard drives quickly became cheaper and cheaper. Processing power increased radically. And the data expanded to fill the space available.

Investors fell over themselves in their eagerness to get in on the action. They saw the potential of data-driven start-ups and threw their money behind businesses like Amazon and Zappos. When the dot-com bubble burst in March 2000, the NASDAQ index had more than doubled its value from only a year before.

But there was one very large problem. In a 1997 paper, Michael Cox and David Ellsworth, researchers at NASA’s Ames Research Center, articulated the issue they were having with supercomputers:

“Visualization provides an interesting challenge for computer systems: data sets are generally quite large, taxing the capacities of main memory, local disk, and even remote disk. We call this the problem of big data.”

Two years later, writing with Steve Bryson, Robert Haimes and David Kenwright, they returned to the issue:

“Very powerful computers are a blessing to many fields of inquiry. They are also a curse; fast computations spew out massive amounts of data. Where megabyte data sets were once considered large, we now find data sets from individual simulations in the 300GB range.

“But understanding the data resulting from high-end computations is a significant endeavor. As more than one scientist has put it, it is just plain difficult to look at all the numbers. And as Richard W. Hamming, mathematician and pioneer computer scientist, pointed out, the purpose of computing is insight, not numbers.”

This was a very real issue. When the Sloan Digital Sky Survey (SDSS) started collecting data in 2000, it accumulated more data in its first weeks than had been collected in the entire history of astronomy.

And it wasn’t just scientists who were panicking. In every industry, messy, magnificent, unstructured data sets were becoming far too unruly to handle. Volume, velocity and variety (terms that first appeared together in the title of Doug Laney’s 2001 research note) were creating an unsustainable strain on conventional data management approaches. In 2001, William Cleveland of Bell Labs published Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics.

“This document describes a plan to enlarge the major areas of technical work of the field of statistics. Because the plan is ambitious and implies substantial change, the altered field will be called ‘data science.’”

The problem, Cleveland believed, was a lack of shared knowledge. Computer scientists didn’t know enough about data analysis; statisticians didn’t know enough about computing environments.

In his paper, he suggested a merger of minds – statisticians would team with computer scientists to create a powerful and united force for innovation.

The timing was impeccable. In that same year:

  • A draft of the human genome appeared in the journals Science and Nature.
  • Wikipedia was launched.
  • Tim Berners-Lee popularized the term “Semantic Web” – a place in which computers become capable of analyzing all data on the web.

In response to these challenges, researchers, scientists, government agencies and businesses of all shapes and sizes stepped up their data-related efforts.

  • 2002: The Defense Department began to develop Total Information Awareness (TIA), an initiative that would unite a variety of DARPA projects and fuse existing governmental data sets into a grand database. Data about individuals could then be analyzed to spot suspicious behavior and threats. Predictive modeling, biometrics, language processing and other data mining technologies all came into play.
  • 2004: Five men launched a new data mining company called Palantir. Developed through pilot projects facilitated by In-Q-Tel (the CIA’s venture arm), Palantir built on software created at PayPal to detect fraud. Using artificial intelligence alone wasn’t enough, the company argued. Human analysts were needed to explore data from many sources.
  • 2005: Apache Hadoop, an open-source framework for storing and processing large-scale data sets on clusters of commodity hardware, appeared. Hadoop implemented the MapReduce programming model that Google had described, which allowed data miners to split a job into pieces, distribute them across multiple computing nodes and process them in parallel, achieving far greater speed.
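
For readers who want to see the split-and-process-in-parallel idea behind Hadoop in miniature, here is a rough sketch of a map/reduce word count. It is only an illustration under stated assumptions, not Hadoop's actual API: the function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example, and worker threads on one machine stand in for the separate computing nodes of a real cluster.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_phase(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(mapped_chunks):
    """Shuffle step: group all emitted values by key across every chunk."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce step: combine the values collected for each key into one total."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(chunks, workers=3):
    # Assumption for illustration: each worker thread plays the role of one
    # node. A real cluster would run the map step on separate machines.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(map_phase, chunks))
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    # The input is already split into chunks, one per simulated node.
    chunks = ["big data big ideas", "data grows", "big data keeps growing"]
    print(word_count(chunks))
```

The point of the design is that each map call looks at only its own chunk, so the chunks can be handled independently and at the same time; the results only need to be brought together at the shuffle and reduce steps.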

And still the data grew.

  • 2003: According to IDC and EMC studies, the amount of digital information created by computers and other data systems surpassed the amount of information created in all of human history up to that year.
  • 2005-2006: Social media sites such as YouTube, Twitter and Facebook appeared on the scene. At Facebook, Jeffrey Hammerbacher assembled a team to mine the site’s data to improve service and generate targeted advertising. At Amazon, employees worked to keep the world’s three largest Linux databases (with capacities of 7.8 TB, 18.5 TB, and 24.7 TB) running smoothly.
  • 2007: Apple released its first-generation iPhone in June. Data generated from smartphones and mobile devices created a whole new field of big data mining. The world’s effective capacity to exchange information through telecommunications networks reached 65 exabytes (up from 2.2 exabytes in 2000).
  • 2008: The number of devices connected to the Internet exceeded the world’s population.
  • 2009: India’s government launched the Unique Identification Authority of India. The goal was to fingerprint, photograph and iris-scan every single citizen in the country (1.2 billion), assign each person a unique 12-digit ID number and enter this data into the world’s largest biometric database.
  • 2011: On the game show Jeopardy!, IBM’s Watson scanned and analyzed 4 terabytes (200 million pages) of data in seconds to defeat two of the show’s human champions.
  • 2012: The Obama administration announced the Big Data Research and Development Initiative. This push to make better use of big data in government agencies consists (as of 2013) of 84 programs in six departments.

In 2011, the McKinsey Global Institute released a seminal report, Big Data: The Next Frontier for Innovation, Competition, and Productivity.

“The amount of data in our world has been exploding, and analyzing large data sets – so-called big data – will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus…

“Leaders in every sector will have to grapple with the implications of big data, not just a few data-oriented managers. The increasing volume and detail of information captured by enterprises, the rise of multimedia, social media, and the Internet of Things will fuel exponential growth in data for the foreseeable future.”

They weren’t just whistling Dixie. Big data now encompasses:

  • RFID, sensor networks, information from social networks, Internet documents, Internet search indexing and call detail records
  • Data sets from astronomy, atmospheric science, genomics, biogeochemistry, biology, interdisciplinary scientific research and military surveillance
  • Medical records, photography archives, video archives and large-scale e-commerce

Technologies and tools continue to evolve to meet this growth. Predictive modeling. Natural-language processing. Machine learning. Artificial intelligence. Cloud-based storage. The list goes on and on.

With developments happening so quickly, it would be foolish to attempt to predict the future.

“Innovation is serendipity, so you don’t know what people will make,” Tim Berners-Lee once said.

But if we know anything, we know one thing. Big data is only going to get bigger.