Credit: Data Science Central
In many business scenarios it is no longer acceptable to wait hours, days or weeks for the results of analytic processes. Psychologically, people expect real-time or near real-time responses from the systems they interact with. Real-time analytics is closely tied to infrastructure issues, and the recent move to technologies such as in-memory databases is beginning to make ‘real-time’ look achievable in the business world, not just in the computer science laboratory.
Handling large amounts of streaming data, ranging from structured to unstructured and from numerical feeds to micro-blog streams, is challenging in a Big Data context because the data, beyond its sheer volume, is very heterogeneous and highly dynamic. It also calls for scalability and high throughput, since data collected for a disaster area can easily occupy terabytes in binary GIS formats, and data streams can show bursts of gigabytes per minute.
The capabilities of existing systems to process such streaming information and answer queries in real time for thousands of concurrent users are limited. Approaches based on traditional solutions such as Data Stream Management Systems (DSMS) and Complex Event Processors (CEP) are generally insufficient for the challenges posed by stream processing in a Big Data context: the analytical tasks required are so knowledge-intensive that automated reasoning is also needed.
The problem of effective and efficient stream processing in a Big Data context is far from solved, even considering the recent breakthroughs in NoSQL databases and parallel processing technologies.
A holistic approach is needed for developing techniques, tools, and infrastructure that span the areas of inductive reasoning (machine learning), deductive reasoning (inference), high-performance computing (parallelization) and statistical analysis, adapted to allow continuous querying over streams (i.e., on-line processing).
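To make the idea of continuous querying concrete, here is a minimal sketch of on-line processing: a sliding-window aggregate that is refreshed with each arriving item in O(1), rather than recomputed over stored data. The class name and window size are illustrative, not drawn from any particular system.

```python
from collections import deque

class SlidingWindowAverage:
    """Continuously maintain the mean of the last `size` stream items.

    Each update does one append, at most one eviction, and adjusts a
    running sum, so the query result is available after every arrival.
    """
    def __init__(self, size):
        self.size = size
        self.window = deque()
        self.total = 0.0

    def update(self, value):
        self.window.append(value)
        self.total += value
        if len(self.window) > self.size:
            # Evict the oldest item once the window is full.
            self.total -= self.window.popleft()
        return self.total / len(self.window)

# Feed a stream of readings; the answer is refreshed per item.
agg = SlidingWindowAverage(size=3)
results = [agg.update(v) for v in [10, 20, 30, 40]]
# windows: [10], [10, 20], [10, 20, 30], [20, 30, 40]
```

The same incremental pattern generalizes to counts, sums, and sketch-based statistics, which is what makes querying over unbounded streams feasible at all.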
One of the most open Big Data technical challenges of primary industrial interest is the proper storage, processing and management of huge volumes of data streams. Some interesting academic and industrial approaches have started to mature in recent years, e.g., approaches based on the MapReduce model that provide a simple and partially automated way to parallelize stream processing over cluster or data centre computing and storage resources. However, Big Data stream processing often poses hard or soft real-time requirements for the identification of significant events, because detecting them with too high a latency can make the result completely useless.
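The MapReduce idea mentioned above can be sketched for streams as micro-batching: the stream is cut into small chunks, a map step counts events within each chunk in parallel, and a reduce step merges the partial counts. This is a toy sketch of the model, not the API of Hadoop, Spark Streaming, or any specific product; the event names are made up.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def map_chunk(chunk):
    """Map step: count events of each type within one micro-batch."""
    return Counter(chunk)

def merge(a, b):
    """Reduce step: merge partial counts from two map tasks."""
    a.update(b)
    return a

def process_stream(micro_batches, workers=4):
    """Run the map step over micro-batches in parallel, then reduce."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(map_chunk, micro_batches)
    return reduce(merge, partials, Counter())

# Three micro-batches of event labels arriving from a stream.
batches = [["error", "ok", "ok"], ["ok", "error"], ["warn"]]
totals = process_stream(batches)
# totals == Counter({'ok': 3, 'error': 2, 'warn': 1})
```

The latency concern raised in the text shows up directly here: the batch interval bounds how stale a detected event can be, so shrinking it trades throughput for responsiveness.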
New Big Data-specific parallelization techniques and (at least partially) automated distribution of tasks over clusters are crucial elements for effective stream processing. Achieving industrial grade products will require:
1. New techniques to associate quality preferences/requirements to different tasks and to their interworking relationships;
2. New frameworks and open APIs for the quality-aware distribution of stream processing tasks, with minimal development effort requested by application developers and domain experts.
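As a rough illustration of what quality-aware distribution of stream processing tasks could mean, the following sketch greedily places tasks that declare a latency budget onto nodes that advertise a typical processing latency and a slot capacity. All names, fields and the greedy policy are hypothetical; a real framework would need richer quality models and interworking constraints.

```python
def place_tasks(tasks, nodes):
    """Greedy quality-aware placement (illustrative only).

    tasks: {task_name: latency_budget_ms}
    nodes: {node_name: (typical_latency_ms, free_slots)}
    Tasks with the tightest budgets are placed first, on the fastest
    node that still has a free slot and meets the budget.
    """
    placement = {}
    free = {name: slots for name, (lat, slots) in nodes.items()}
    for task, budget in sorted(tasks.items(), key=lambda kv: kv[1]):
        candidates = [
            (lat, name) for name, (lat, slots) in nodes.items()
            if lat <= budget and free[name] > 0
        ]
        if not candidates:
            placement[task] = None  # quality requirement cannot be met
            continue
        lat, name = min(candidates)
        placement[task] = name
        free[name] -= 1
    return placement

nodes = {"fast": (5, 1), "slow": (50, 2)}  # (latency_ms, slots)
tasks = {"alert": 10, "report": 100, "digest": 60}
placement = place_tasks(tasks, nodes)
# placement == {'alert': 'fast', 'digest': 'slow', 'report': 'slow'}
```

The point of an open API here would be that application developers only declare the budgets, while the framework handles placement; the domain expert never writes scheduling code.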