All forms of human communication have some structure (e.g., language), and machine-generated data typically has a structure because it is designed to have one. What we have is a continuum that runs from a highly rigid structure, defined before the data is processed and mined, to a highly flexible structure, defined after the data is processed and mined. The first end of the continuum gave rise in the 1970s to technologies such as relational databases that exploited the structure imposed on the data. The focus on “structured” data, i.e., data with a predefined structure, continued until the 2000s, when the increased mining of text documents (in the form of web pages) by online search and other web-based companies gave rise to tools and technologies specifically designed to manage “unstructured” data, i.e., data without a predefined structure.
Back in 2001, Doug Laney from Meta Group (an IT research company acquired by Gartner in 2005) wrote a research paper in which he stated that e-commerce had exploded data management along three dimensions: volume, velocity, and variety. These are called the three Vs of big data.
Another characteristic usually associated with big data is that the data is unstructured, which is not exactly true. The confusion stems from a common belief that data is unstructured if it does not conform to a predefined format, model, or schema. An e-mail message is typically used as an example of unstructured data; although the body of the e-mail could be considered unstructured, it is part of a well-defined structure that follows the specifications of RFC 2822 and contains a set of fields that include From, To, Subject, and Date. The same is true of Twitter messages: the body of the message, or tweet, can be considered unstructured, but it is part of a well-defined structure.
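To make the e-mail example concrete, here is a minimal sketch that uses Python's standard `email` module to separate the structured header fields from the free-text body. The sample message, the addresses, and the helper name `split_structured_and_free_text` are assumptions made up for illustration.

```python
from email import message_from_string

# A made-up message: header fields, a blank line, then the free-text body.
RAW_MESSAGE = """\
From: alice@example.com
To: bob@example.com
Subject: Quarterly numbers
Date: Mon, 05 Mar 2012 09:30:00 +0000

Hi Bob, the numbers look good. Let's talk tomorrow.
"""


def split_structured_and_free_text(raw):
    """Return (header fields, body) for a simple single-part message."""
    msg = message_from_string(raw)
    # The header fields are the predefined, well-structured part.
    headers = {name: msg[name] for name in ("From", "To", "Subject", "Date")}
    # The body is the part usually described as unstructured.
    body = msg.get_payload()
    return headers, body


headers, body = split_structured_and_free_text(RAW_MESSAGE)
print(headers["Subject"])
print(body.strip())
```

The header fields can be queried like columns in a table, while the body would need text mining to yield anything comparable.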
“Big data are high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.”
Social media is just one component of big data. The second category of big data is machine data. A very large number of firewalls, load balancers, routers, switches, and computers support our digital footprint. All of these systems generate log files, ranging from security and audit logs to web site logs that describe what a visitor has done, including the infamous abandoned shopping carts.
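As an illustration of how such machine data carries structure, the following sketch parses one web server log line into named fields. It assumes the widely used Common Log Format; the sample line, the pattern, and the helper name `parse_line` are hypothetical and not taken from any particular system.

```python
import re

# Assumed layout (Common Log Format):
# host ident user [timestamp] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)

# A made-up log line for a visitor reaching a checkout page.
SAMPLE = (
    '203.0.113.7 - - [05/Mar/2012:09:30:00 +0000] '
    '"GET /cart/checkout HTTP/1.1" 200 5120'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    return record


record = parse_line(SAMPLE)
print(record["host"], record["path"], record["status"])
```

Once lines are turned into records like this, questions such as which visitors reached the cart but never completed a purchase become ordinary queries over fields.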
The Nissan Leaf is an all-electric car. It has a system called CARWINGS, which not only offers the traditional telematics service and a smartphone app to control all aspects of the car but also wirelessly transmits vehicle statistics to a central server. Each Leaf owner can track their driving efficiency and compare their energy economy with that of other Leaf drivers.
Some group these alternative data processing approaches under the name NoSQL and categorize them according to the way they store the data, such as key-value stores and document stores.
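The difference between the two storage styles named above can be sketched with plain Python data structures. This is a toy model under stated assumptions, not the API of any real NoSQL product; the names `kv_store`, `doc_store`, and `find` are hypothetical.

```python
import json

# Key-value store: the value is an opaque blob, retrievable only by its key.
kv_store = {}
kv_store["user:42"] = json.dumps({"name": "Ada", "city": "London"})

# Document store: the store sees the fields inside each document,
# and documents need not share the same set of fields.
doc_store = [
    {"_id": 1, "name": "Ada", "city": "London"},
    {"_id": 2, "name": "Grace", "city": "New York", "languages": ["COBOL"]},
]


def find(docs, **criteria):
    """Return the documents whose fields equal all the given criteria."""
    return [
        doc for doc in docs
        if all(doc.get(field) == value for field, value in criteria.items())
    ]


# Key-value access: fetch by key, then decode the blob in the application.
user = json.loads(kv_store["user:42"])
print(user["city"])

# Document access: query by a field inside the documents.
print([doc["_id"] for doc in find(doc_store, city="London")])
```

In the key-value case the application must know the key and interpret the value itself; in the document case the store can answer questions about the content, which is what distinguishes the two categories.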