Automatic Big Data Provenance Capture at Middleware Level in Advanced Big Data Frameworks

Chacko, Anu Mary; Kumar, S. D. Madhu; Cuzzocrea, Alfredo Massimiliano

IoT devices generate huge amounts of data, commonly termed ‘Big Data’, which must be reliably stored and analyzed. Capturing the provenance of such data provides a mechanism to explain the results of data analytics and lends greater trustworthiness to the insights gathered from them. Capturing the provenance of data stored in NoSQL databases helps to explain how the data reached its current state. A holistic explanation of analytics results can be achieved by combining the provenance information of the data with the results of the analytics. This chapter explores the challenges of automatic provenance capture at the middleware level in three different contexts: in an analytics framework like MapReduce, in NoSQL data stores analyzed using the MapReduce framework, and in NoSQL stores with SQL front ends. The chapter also shows how the provenance captured in the MapReduce framework is useful, apart from its use in debugging, for improving future executions of job re-runs and for anomaly detection.

Automatic Big Data Provenance Capture at Middleware Level in Advanced Big Data Frameworks / Chacko, A.M., Kumar, S.D.M., Cuzzocrea, A.M. - Print. - (2017), pp. 219-239.