The 5 V's of Big Data

As part of a research proposal for my Honours-level studies I had to prepare a literature review on a chosen topic. I opted for the general topic of "Big Data". Below is an extract from that literature review, posted here for the benefit of anyone who may also have an interest in the topic.

Although the concept of Big Data has been around for over 100 years (Santovena 2013), it is important to note that data is being generated and collected at a scale never before witnessed in human history (Keim 2014), with over 90 percent of the world's data generated within the last two years (Santovena 2013). (Bizer, Boncz et al. 2011) states that without a way of deriving value from this vast amount of data, all the engineering performed on the data is a futile exercise.

Various concepts from Cloud and Semantic Technologies have been proposed to address the challenges associated with deriving value from Big Data (Ji, Li et al. 2012), (Santovena 2013), (Thirunarayan, Sheth 2013), (Tachmazidis, Antoniou et al. 2014), (Tsuchiya, Sakamoto et al. 2011). (Ji, Li et al. 2012) states that researchers have been unable to agree on the core attributes of Big Data, yet notes that various definitions of Big Data do exist, and agrees with (Santovena 2013) and (Thirunarayan, Sheth 2013) that Big Data can be characterised by the 3 V's of Big Data, namely Volume, Velocity and Variety. (Thirunarayan, Sheth 2013) goes a step further and adds two additional characteristics – Veracity and Value – although stating that Value is derived from applying analytics to the data.

These characteristics will be discussed one at a time, along with the roles Semantic and Cloud Technologies play in addressing the challenges associated with each.

Semantic metadata, such as that employed in the Semantic Web, can be embedded into existing data and addresses challenges relating to varying data sources (Stegmaier, Seifert et al. 2014); this can enhance the ability not only to find and extract value from data, but also to indicate relationships among data from varying sources (Thirunarayan, Sheth 2013).
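To make the idea of semantic metadata a little more concrete, the sketch below annotates records from two hypothetical sources with a shared vocabulary so that a relationship between them can be queried. It assumes Python with the rdflib library, which is my choice for illustration and is not prescribed by any of the cited authors; the vocabulary terms and URIs are made up.

```python
# Minimal sketch (assumption: rdflib as the RDF toolkit; the cited papers do not
# prescribe a specific library). Two records from different sources are annotated
# with semantic metadata so that a relationship between them becomes explicit.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/")          # hypothetical vocabulary
g = Graph()

# Record from source A: a sensor reading
reading = EX["reading/42"]
g.add((reading, RDF.type, EX.TemperatureReading))
g.add((reading, EX.celsius, Literal(21.5)))
g.add((reading, EX.observedAt, EX["station/JHB-01"]))

# Record from source B: a weather station described independently
station = EX["station/JHB-01"]
g.add((station, RDF.type, EX.WeatherStation))
g.add((station, RDFS.label, Literal("Johannesburg station 01")))

# Because both sources refer to the same URI, a query can now join them.
q = """
SELECT ?label ?celsius WHERE {
    ?r a ex:TemperatureReading ; ex:celsius ?celsius ; ex:observedAt ?s .
    ?s rdfs:label ?label .
}
"""
for row in g.query(q, initNs={"ex": EX, "rdfs": RDFS}):
    print(row.label, row.celsius)
```

The point of the sketch is that the two sources never had to agree on a schema beforehand; referring to the same URI is what makes the relationship discoverable.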

Cloud Computing has been described as the intention to aggregate resources to create powerful computing infrastructures (Ji, Li et al. 2012). (Mell, Grance 2011) defines cloud computing as a model for accessing a shared pool of computing resources that can be instantaneously acquired and decommissioned when no longer required.

Firstly, the defining characteristic of Big Data, namely Volume, will be discussed. (Santovena 2013) notes that according to IBM, 2.5 quintillion bytes of data are being generated on a daily basis. (Thirunarayan, Sheth 2013) states that in excess of 40 billion sensors are deployed world-wide, with 250 terabytes (TB) of sensor data being generated per Boeing 737 flight from New York to Los Angeles. (Santovena 2013) observes that Big Data can be considered high-volume when business value can no longer be generated using native data storage or processing technologies – or when a specific technology has been implemented to cater for these large amounts of data. Big Data volume creates challenges when the data has to be abstracted and summarised into human-understandable pieces of information (Thirunarayan, Sheth 2013). To deal with the interpretive challenge of Big Data Volume, (Thirunarayan, Sheth 2013) proposes Semantic Scalability through Semantic Perception, which involves semantically integrating large amounts of heterogeneous data and applying background knowledge about the data in question, as well as perceptual interpretation, to deduce useful information.
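Thirunarayan and Sheth describe Semantic Perception as a rich, ontology-driven process; the toy sketch below only illustrates the basic shape of the idea, namely turning raw values into human-understandable abstractions by consulting background knowledge. The sensor names, thresholds and labels are entirely hypothetical.

```python
# Illustrative sketch only: abstracting raw sensor values into human-understandable
# observations using a small table of background knowledge. The thresholds and
# labels below are hypothetical; Thirunarayan and Sheth (2013) describe a far
# richer, ontology-driven process ("Semantic Perception").
BACKGROUND_KNOWLEDGE = {
    "body_temp_c": [(38.0, "fever"), (36.0, "normal"), (0.0, "hypothermia")],
    "heart_rate":  [(100, "tachycardia"), (60, "normal"), (0, "bradycardia")],
}

def perceive(sensor: str, value: float) -> str:
    """Return the first abstraction whose lower bound the value meets."""
    for lower_bound, label in BACKGROUND_KNOWLEDGE[sensor]:
        if value >= lower_bound:
            return label
    return "unknown"

raw_readings = [("body_temp_c", 38.4), ("heart_rate", 72)]
summary = {sensor: perceive(sensor, value) for sensor, value in raw_readings}
print(summary)   # {'body_temp_c': 'fever', 'heart_rate': 'normal'}
```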

Big Data volume also creates challenges relating to data storage – specifically in view of traditional Database Management Systems (DBMSs), or relational/structured storage systems, which perform well under high transaction loads (Santovena 2013) but do not handle large-scale data well, due to the potential bottleneck created by the database server (Ji, Li et al. 2012). (Santovena 2013) notes that the nature of Big Data is usually unstructured, or at best semi-structured, and is best suited to non-relational databases, with (Ji, Li et al. 2012) also proposing Distributed File Systems as a solution to consider for this challenge. Big Data creates additional challenges in terms of the amount of time required to process these vast amounts of data, with (Ji, Li et al. 2012) suggesting parallelization techniques to enhance processing performance, or the concept of the Distributed Data Store, which distributes data read/write requests; this not only has a positive impact on performance, but also negates the problem of a single point of failure, adding to system stability (Tsuchiya, Sakamoto et al. 2011). (Ji, Li et al. 2012) indicates that Big Data quantity is growing much more rapidly than storage capacity, which requires re-thinking the existing information framework, and states that the top seven Big Data drivers are data generated by the Internet, Science, Finance, Mobile Devices, Sensors, Streaming Data and RFID.
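A minimal map-and-merge sketch of the parallelization idea follows, with chunks of a hypothetical event log counted in separate worker processes and the partial results merged afterwards. The cited authors do not prescribe this specific approach or library; it simply shows how splitting work over processes shortens processing time for large inputs.

```python
# A minimal map-reduce style sketch of the parallelisation idea (Ji, Li et al. 2012
# do not prescribe this specific approach): chunks of a large log are counted in
# parallel worker processes and the partial results are merged afterwards.
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def count_events(chunk):
    """Map step: count event types within one chunk of records."""
    return Counter(record["event"] for record in chunk)

def merge(a, b):
    """Reduce step: combine two partial counts."""
    return a + b

if __name__ == "__main__":
    # Hypothetical data standing in for a far larger distributed dataset.
    records = [{"event": "click"}, {"event": "view"}, {"event": "click"}] * 1000
    chunks = [records[i:i + 500] for i in range(0, len(records), 500)]

    with Pool() as pool:
        partial_counts = pool.map(count_events, chunks)

    totals = reduce(merge, partial_counts, Counter())
    print(totals)   # Counter({'click': 2000, 'view': 1000})
```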

The second characteristic of Big Data – Variety – is exemplified in that the data originating from these varying sources is usually heterogeneous in nature (Tachmazidis, Antoniou et al. 2014), varying in structure (Santovena 2013) and originating from varying data sources (Stegmaier, Seifert et al. 2014). In addition, (Santovena 2013) notes that previously unused data which becomes accessible due to new processing techniques also exacerbates this issue. Often, in order for this variety of data to generate value, information from multiple sources has to be combined (Tachmazidis, Antoniou et al. 2014). (Thirunarayan, Sheth 2013) proposes the concept of Hybrid Representation and Reasoning as a means to address the challenges created by the Variety characteristic of Big Data. This involves using semantic metadata to enhance interoperability and integration between heterogeneous sources of data, with the automated or manually guided generation and disambiguation of annotations. Big Data Variety creates challenges in terms of data storage and retrieval, as with Big Data Volume, when considering traditional relational database systems. (Santovena 2013) notes that RDBMSs are not suited to handling a variety of information types, with (Ji, Li et al. 2012) supporting this view by stating that iterating over large amounts of data is slow without indices, and that indices are currently applied to simple data types, whereas Big Data types are becoming increasingly complex. In addition, (Ji, Li et al. 2012) notes that due to these challenges data may need to be re-organised to be compatible with existing technology, which creates a new set of problems in Big Data management. (Ji, Li et al. 2012) proposes appropriate indexing of Big Data with the application of pre-processing techniques to enhance speed, and suggests Distributed File Systems as a possible means of addressing the problem of a variety of data sources, noting that ES2 (an elastic storage system from the epiC project) provides efficient loading of data from multiple sources.
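As a small illustration of why combining information from multiple sources is awkward, the sketch below normalises a CSV source and a JSON source with different field names into one common record shape before analysis. The sources, field names and target schema are hypothetical and not taken from the cited papers.

```python
# Hedged sketch of the integration problem behind the Variety characteristic:
# two sources describe the same kind of entity with different structures, and a
# small mapping layer normalises them into one schema before analysis.
import csv
import io
import json

CSV_SOURCE = "id,full_name,city\n1,Alice Smith,Cape Town\n2,Bob Jones,Durban\n"
JSON_SOURCE = '[{"customer_id": 3, "name": {"first": "Carol", "last": "King"}, "location": "Pretoria"}]'

def from_csv(text):
    for row in csv.DictReader(io.StringIO(text)):
        yield {"id": int(row["id"]), "name": row["full_name"], "city": row["city"]}

def from_json(text):
    for obj in json.loads(text):
        yield {
            "id": obj["customer_id"],
            "name": f'{obj["name"]["first"]} {obj["name"]["last"]}',
            "city": obj["location"],
        }

unified = list(from_csv(CSV_SOURCE)) + list(from_json(JSON_SOURCE))
for record in unified:
    print(record)
```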

Apart from Volume and Variety, Big Data Velocity also creates specific challenges that need addressing. (Santovena 2013) states that Velocity refers to a high rate of data influx and outflow, and that data has evolved into a continuous flow, with data snapshots in business analytics making way for real-time monitoring. In addition, (Santovena 2013) notes that the rate at which data is collected is constantly growing; this notion is supported by (Thirunarayan, Sheth 2013), who states that the amount of data generated is growing rapidly. Apart from the voluminous nature of this stream of data, (Stegmaier, Seifert et al. 2014) notes that these data sources are also dynamic, with (Santovena 2013) pointing out that high velocity can be characterised by changing velocity within a single data set, or changing rates of data influx between multiple data sets. This stream of large volumes of data poses unique challenges: focusing on and classifying relevant data becomes problematic, incremental data processing is affected, and the dynamic nature of the data impacts the utilisation of relevant background knowledge in decision-making (Thirunarayan, Sheth 2013). (Tsuchiya, Sakamoto et al. 2011) also mentions possible scenarios where a sudden influx of data can occur due to the occurrence of some event. To address this issue of highly dynamic data, (Thirunarayan, Sheth 2013) introduces the concept of Continuous Semantics, which involves using social-knowledge sources to update and create semantic models in order to gain useful information from high-velocity data. Cloud Technologies in the form of Distributed Data Stores and Parallel Processing provide possible solutions for the challenges related to Big Data Velocity. As already mentioned, the concept of the Distributed Data Store positively impacts performance due to the distribution of read and write requests to storage (Tsuchiya, Sakamoto et al. 2011). (Tachmazidis, Antoniou et al. 2014) outlines two models of Parallel Processing, or Distributed Computing, that can be utilized for large-scale data processing, namely partitioning of processing by rule or by data, and states that the data partitioning approach is favourable, as differing rule-set characteristics may cause improper workload balancing and thus impact scalability. (Ji, Li et al. 2012) agrees that parallel processing is essential for enhancing scalability and increasing performance in the processing of Big Data.
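The sketch below contrasts the two partitioning strategies in miniature. The rules and records are hypothetical, and the "workers" are plain function calls standing in for nodes in a real cluster; it is only meant to show why partitioning by data balances load more evenly than partitioning by rule.

```python
# Compact sketch of the two parallel-processing strategies mentioned by
# Tachmazidis, Antoniou et al. (2014). The rules and records are hypothetical,
# and a real deployment would dispatch each partition to a separate node or
# process rather than calling functions in a loop.

def high_temp(record):      # rule 1
    return record["temp"] > 30

def low_battery(record):    # rule 2
    return record["battery"] < 20

RULES = {"high_temp": high_temp, "low_battery": low_battery}

records = [
    {"id": 1, "temp": 34, "battery": 80},
    {"id": 2, "temp": 22, "battery": 10},
    {"id": 3, "temp": 31, "battery": 15},
]

def by_data(partition):
    """Data partitioning: each worker applies *all* rules to its slice of records."""
    return [(name, r["id"]) for r in partition for name, rule in RULES.items() if rule(r)]

def by_rule(rule_name, all_records):
    """Rule partitioning: each worker applies *one* rule to the whole dataset."""
    return [(rule_name, r["id"]) for r in all_records if RULES[rule_name](r)]

# Data partitioning balances load by record count, regardless of how costly
# individual rules are -- the property the authors favour.
partitions = [records[:2], records[2:]]
print(sum((by_data(p) for p in partitions), []))

# Rule partitioning can skew load when one rule is far more expensive than another.
print(sum((by_rule(name, records) for name in RULES), []))
```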

The challenges connected to Data Quality, or Data Veracity as coined by (Bizer, Boncz et al. 2011), relate to various possible traits of data: data may be incomplete, invalid, inconsistent, suffering from compromised integrity, inaccurate or simply out of date (Santovena 2013). Trustworthiness of data should also be considered, especially on open mediums where anyone can publish information (Bizer, Boncz et al. 2011). In addressing the challenges relating to Data Quality, or Veracity, (Thirunarayan, Sheth 2013) proposes the idea of Gleaning Trustworthiness, which involves a semantics-based approach that combines physical and human-sourced sensor data in order to improve the trustworthiness of sensor data. (Bizer, Boncz et al. 2011) proposes an additional method of identifying trustworthy information in the form of a ranking system in which previous users rate information in terms of usefulness and quality. The literature did not explicitly cover Data Quality in terms of Cloud Technology, but (Santovena 2013), (Thirunarayan, Sheth 2013) and (Ji, Li et al. 2012) touch on the subjects of Data Privacy and Security, which, when compromised, could have an adverse impact on Data Quality. In addition, (Ji, Li et al. 2012) notes that the use of third-party services and vendors in the Cloud Technology sphere adds to security concerns, along with the magnitude of data creating challenges in securing information.
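A minimal sketch of the rating idea follows: each source carries usefulness/quality ratings from previous users, and an aggregate score filters out sources below a trust threshold. The scoring formula, scale and threshold are my own illustrative choices, not those of Bizer, Boncz et al.

```python
# Minimal sketch of the user-rating idea from Bizer, Boncz et al. (2011): sources
# carry ratings from previous users, and a simple aggregate score is used to
# filter out sources whose trustworthiness falls below a threshold. The scoring
# formula and threshold are illustrative assumptions, not the authors' design.
from statistics import mean

ratings = {
    # source -> list of usefulness/quality ratings on a 1-5 scale (hypothetical)
    "sensor_feed_a": [5, 4, 5, 4],
    "scraped_forum": [2, 1, 3, 2],
    "open_dataset":  [4, 4, 3],
}

TRUST_THRESHOLD = 3.5

def trust_score(source):
    return mean(ratings[source])

trusted = [s for s in ratings if trust_score(s) >= TRUST_THRESHOLD]
print(trusted)   # ['sensor_feed_a', 'open_dataset']
```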

Not only a characteristic, but also a key challenge of Big Data is the concept of deriving Value (Thirunarayan, Sheth 2013), with (Bizer, Boncz et al. 2011, p. 1) going so far as to state that without the ability to derive meaning from Big Data, all the engineering around it is no more than a "bunch of cool tricks". (Santovena 2013) further states that the application of data collection and visualisation techniques should always be targeted towards the related business case and objectives, and that Big Data endeavours should not be started before the underlying business problem is clearly understood. In addition, (Santovena 2013) notes that many Big Data initiatives fail because the business outcomes are not properly defined, which in turn has a negative impact on business adoption of Big Data initiatives. (Jacobs 2009) makes the observation that Big Data requires new thinking around how data should be structured to be useful for analysis, and that traditional RDBMS technology is limited by speed and functionality to the extent that it fails to address this issue. Since Value is derived from the other four characteristics of Big Data, namely Volume, Variety, Velocity and Veracity, addressing the challenges associated with these characteristics will also impact Value generation.

Data Visualisation is one way of expressing the Value of Big Data, in that relationships between, or characteristics of, information can be reported on or analysed (Santovena 2013). The graphical nature of Data Visualisation is appealing since it effectively utilises the human perceptual ability to visually identify patterns and trends (Santovena 2013). (Thirunarayan, Sheth 2013) and (Lam 2013) make the combined observation that Data Visualisation and Analysis are about enhancing human understanding in order to assist decision-making, not about the underlying algorithms or data. Apart from decision-making, involving human creativity, flexibility and background knowledge in the Data Analysis process could prove valuable in gaining insights from large volumes of data (Keim 2014). Some challenges of Data Visualisation stem directly from the other Big Data characteristics, including the speed at which relevant information can be accessed and processed, and doing so reliably (Santovena 2013), (Lam 2013). In addition to these challenges, Data Visualisation brings the challenge of rendering large amounts of data at acceptable speeds (Lam 2013).
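One common way of coping with the rendering challenge, shown in the sketch below, is to aggregate or downsample a series before handing it to the plotting library, so that thousands rather than millions of points are drawn. Bucket-wise averaging is my own illustrative choice here and is not a technique prescribed by the cited sources.

```python
# One common mitigation for the rendering challenge Lam (2013) raises: aggregate
# or downsample before plotting so the renderer handles thousands of points
# instead of millions. The bucket-averaging below is an illustrative assumption.
import random
import matplotlib.pyplot as plt

# Hypothetical "large" series standing in for a much bigger dataset.
raw = [random.gauss(0, 1) + i / 100_000 for i in range(1_000_000)]

BUCKET = 1_000   # plot 1,000 averaged points instead of 1,000,000 raw ones
downsampled = [
    sum(raw[i:i + BUCKET]) / BUCKET
    for i in range(0, len(raw), BUCKET)
]

plt.plot(downsampled)
plt.xlabel(f"bucket of {BUCKET} samples")
plt.ylabel("mean value")
plt.title("Downsampled view of a large series")
plt.savefig("downsampled.png")
```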

References

Berners-Lee, T., Hendler, J. & Lassila, O. 2001, "The semantic web", Scientific American, vol. 284, no. 5, pp. 28-37.

Bhana, A. 2006, "Participatory action research: A practical guide for realistic radicals" in Research in practice: Applied methods for the social sciences, eds. M. Terre Blanche, K. Durrheim & D. Painter, Juta and Company Ltd, pp. 429.

Bizer, C., Boncz, P., Brodie, M.L. & Erling, O. 2011, "The meaningful use of big data: four perspectives - four challenges", SIGMOD Record, vol. 40, no. 4, pp. 56-60.

Creswell, J. 2009, Research design: Qualitative, quantitative, and mixed methods approaches, 3rd ed, SAGE Publications, Incorporated, London, UK.

Jacobs, A. 2009, "The pathologies of big data", Communications of the ACM, vol. 52, no. 8, pp. 36-44.

Ji, C., Li, Y., Qiu, W., Awada, U. & Li, K. 2012, "Big data processing in cloud computing environments", Proceedings of the 2012 12th International Symposium on Pervasive Systems, Algorithms and Networks, IEEE Computer Society, pp. 17.

Henning, E., Van Rensburg, W. & Smit, B. 2004, Finding your way in qualitative research, Van Schaik Publishers, Pretoria, SA.

Keim, D.A. 2014, "Exploring Big Data using Visual Analytics", EDBT/ICDT Workshops, pp. 160.

Kelly, K. 2006, "From encounter to text: Collecting data in qualitative research" in Research in practice: Applied methods for the social sciences, eds. M. Terre Blanche, K. Durrheim & D. Painter, Juta and Company Ltd, pp. 285.

Lam, H. 2013, "How to Display Big Data for Analysis", The Role of Visualization in the Big Data Era: An End to a Means or a Means to an End?, eds. A. Dasgupta, D. Fisher, C. Scheidegger, D. Keim & R. Kosara, 13/10/2013, pp. 1.

Lindegger, G. 2006, "Research methods in clinical research" in Research in practice: Applied methods for the social sciences, eds. M. Terre Blanche, K. Durrheim & D. Painter, Juta and Company Ltd, pp. 455.

Mell, P. " Grance, T. 2011, "The NIST definition of cloud computing";,

Sagiroglu, S. & Sinanc, D. 2013, "Big data: A review", Collaboration Technologies and Systems (CTS), 2013 International Conference on, IEEE, pp. 42.

Santovena, A. 2013, Big Data: Evolution, Components, Challenges and Opportunities, Massachusetts Institute of Technology.

Stegmaier, F., Seifert, C., Kern, R., Höfler, P., Bayerl, S., Granitzer, M., Kosch, H., Lindstaedt, S., Mutlu, B., Sabol, V., Schlegel, K. & Zwicklbauer, S. 2014, "Unleashing Semantics of Research Data", eds. T. Rabl, M. Poess, C. Baru & H.-A. Jacobsen, Springer, Berlin, Germany.

Sumathi, S. & Esakkirajan, S. 2007, Fundamentals of relational database management systems, Springer.

Tachmazidis, I., Antoniou, G. & Faber, W. 2014, "Efficient Computation of the Well-Founded Semantics over Big Data", arXiv preprint arXiv:1405.2590.

Thirunarayan, K. & Sheth, A. 2013, "Semantics-Empowered Approaches to Big Data Processing for Physical-Cyber-Social Applications", Proc. AAAI 2013 Fall Symp. Semantics for Big Data.

Terre Blanche, M. & Durrheim, K. 2006, "Histories of the present: Social science research in context" in Research in practice: Applied methods for the social sciences, eds. M. Terre Blanche, K. Durrheim & D. Painter, Juta and Company Ltd, pp. 1.

Terre Blanche, M., Kelly, M. & Durrheim, K. 2006, "Why qualitative research?" in Research in practice: Applied methods for the social sciences, eds. M. Terre Blanche, K. Durrheim & D. Painter, Juta and Company Ltd, pp. 271.

Tredoux, C. & Smith, M. 2006, "Evaluating research design" in Research in practice: Applied methods for the social sciences, eds. M. Terre Blanche, K. Durrheim & D. Painter, Juta and Company Ltd, pp. 160.

Tsuchiya, S., Sakamoto, Y., Tsuchimoto, Y. & Lee, V. 2011, "Big data processing in cloud environments", Fujitsu, vol. 62, no. 5, pp. 522-530.

Van der Riet, M. & Durrheim, K. 2006, "Putting design into practice: Writing and evaluating research proposals" in Research in practice: Applied methods for the social sciences, eds. M. Terre Blanche, K. Durrheim & D. Painter, Juta and Company Ltd, pp. 80.

Vicknair, C., Macias, M., Zhao, Z., Nan, X., Chen, Y. & Wilkins, D. 2010, "A comparison of a graph database and a relational database: a data provenance perspective", Proceedings of the 48th annual Southeast regional conference, ACM, pp. 42.

Wassenaar, D. 2006, "Ethical issues in social sciences research" in Research in practice: Applied methods for the social sciences, eds. M. Terre Blanche, K. Durrheim & D. Painter, Juta and Company Ltd, pp. 60.
