A knowledge-based platform for Big Data analytics based on publish/subscribe services and stream processing

by Christian Esposito, Massimo Ficco, Francesco Palmieri, Aniello Castiglione

Knowledge-Based Systems

Accepted Manuscript


PII: S0950-7051(14)00181-6

DOI: http://dx.doi.org/10.1016/j.knosys.2014.05.003

Reference: KNOSYS 2844

To appear in: Knowledge-Based Systems

Received Date: 13 January 2014

Revised Date: 8 April 2014

Accepted Date: 1 May 2014

Please cite this article as: C. Esposito, M. Ficco, F. Palmieri, A. Castiglione, A Knowledge-Based Platform for Big Data Analytics Based on Publish/Subscribe Services and Stream Processing, Knowledge-Based Systems (2014), doi: http://dx.doi.org/10.1016/j.knosys.2014.05.003

This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

A Knowledge-Based Platform for Big Data Analytics Based on Publish/Subscribe Services and Stream Processing

Christian Esposito (a,1,∗), Massimo Ficco (b,2), Francesco Palmieri (b,2), Aniello Castiglione (c,3)

(a) Institute of High Performance Computing and Networking (ICAR), National Research Council, Via Pietro Castellino 111, I-80131 Napoli, Italy.

(b) Department of Industrial and Information Engineering, Second University of Naples, Via Roma 29, I-81031 Aversa (CE), Italy.

(c) Department of Computer Science, University of Salerno, Via Ponte don Melillo, I-84084 Fisciano (SA), Italy.

Abstract

Improving Big Data analytics is considered imperative for increasing the operating margin of both public and private enterprises, and represents the next frontier for their innovation, competition, and productivity. Big Data are typically produced in different sectors of these organizations, often geographically distributed throughout the world, and are characterized by large size and variety. Therefore, there is a strong need for platforms that can handle ever larger amounts of data in contexts characterized by complex event processing systems and multiple heterogeneous sources, and that can deal with the issues of efficiently disseminating, collecting and analyzing such data in a fully distributed way.

In such a scenario, this work proposes a way to address two fundamental issues: data heterogeneity and the need for advanced processing capabilities. We present a knowledge-based solution for Big Data analytics, which consists of applying automatic schema mapping to cope with data heterogeneity, as well as ontology extraction and semantic inference to support innovative processing. This solution, based on the publish/subscribe paradigm, has been evaluated in a simple experimental proof of concept in order to determine its performance and effectiveness.

Keywords: Publish/Subscribe Services, Interoperability, Schema Matching, Semantic Search, Complex Event Processing, Big Data Analytics, Ontologies.

∗Corresponding author.

Email addresses: christian.esposito@na.icar.cnr.it (Christian Esposito), massimo.ficco@unina2.it (Massimo Ficco), francesco.palmieri@unina.it (Francesco Palmieri), castiglione@ieee.org, castiglione@acm.org (Aniello Castiglione)

1 Christian Esposito is a fixed-term researcher at the Institute of High Performance Computing and Networking (ICAR), located in Napoli (Italy). Office Telephone Number: (+39) 081 6139508 - Fax Number: (+39) 081 6139531.

2 Massimo Ficco and Francesco Palmieri are Assistant Professors at the DIII Department of the Second University of Naples (SUN), located in Aversa (Italy). Office Telephone Number: (+39) 081 5010505 - Fax Number: (+39) 081 5010203.

3 Aniello Castiglione is Network and Security manager at the DIA Department of the University of Salerno. Office Telephone Number: (+39) 089 969594 - Fax Number: (+39) 089 969600.

Preprint submitted to Journal of Knowledge-Based Systems, May 8, 2014

1. Introduction

In the current state of the art, large and complex ICT systems are designed from a system-of-systems perspective, i.e., as a large number of components integrated by means of middleware adapters/interfaces over a wide-area communication network. Such systems usually generate large, loosely structured data sets, often known as Big Data because of their huge size and high degree of complexity, that need to be effectively stored and processed [1, 2]. Concrete examples can be taken from the application domains of environmental monitoring, intrusion/anomaly detection, healthcare management, and online analysis of financial data, such as stock price trends. The analysis of such data sets is becoming vital for the success of a business or for the achievement of the ICT mission of the organizations involved. Therefore, there is a need for extremely efficient and flexible data analysis platforms to manage and process such data sets, sometimes on an online/timely basis. However, their huge size and variety limit the applicability of traditional data mining approaches, which typically rely on a centralized collector that stores and processes the data and that can become an unacceptable performance bottleneck. Consequently, the demand for a more distributed approach to the scalable and efficient management of Big Data is strongly increasing in the current business arena.

The well-known MapReduce paradigm [3] has attracted great interest, and is currently considered the framework of choice for large-scale data processing. Its successful adoption in both industry and academia is motivated by its simplicity, scalability and fault-tolerance features, and is further boosted by the availability of Hadoop, an open-source implementation offered by Apache [4]. Despite this success and these benefits, MapReduce exhibits several limitations that make it unsuitable for the full spectrum of large-scale data processing needs. In particular, as described in detail in [5], the MapReduce paradigm is affected by several performance limitations, which introduce high latency in data access and make it unsuitable for interactive use. As a matter of fact, Hadoop is built on top of the Hadoop Distributed