
Support Continuous Processing Mode in Spark Streaming #1422

Open

Description

@jbaiera

By default, Spark's streaming capabilities follow a "micro-batching" model, where data is collected into a batch over a window of time. At the end of that window, a batch job is launched on the cluster to process the records that fall into it. Once the "micro-batch" is complete, Structured Streaming persists information about the completed batch in the write-ahead log (WAL).
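For concreteness, a minimal sketch of the default micro-batch path in Scala, assuming a Kafka topic named `events` as a placeholder source and an Elasticsearch index `events` as the target (names and paths are illustrative only):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("microbatch-to-es").getOrCreate()

// Read a stream of records from Kafka (source settings are placeholders).
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "events")
  .load()

// With the default trigger, Spark groups incoming records into a
// micro-batch, runs a batch job over it, and records the completed
// batch in the WAL under the checkpoint location before moving on.
val query = stream.selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("es") // elasticsearch-hadoop structured streaming sink
  .option("checkpointLocation", "/tmp/checkpoints/events")
  .start("events") // target index
```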

The new continuous processing mode, however, follows a more reactive approach to processing streaming data, similar to how Flink and Storm operate. Instead of batching up data over a windowed time frame and processing the records in batches, the framework starts a set of long-running workers on the cluster that continuously read data from the sources and process it into the sinks.
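Opting into this mode in Structured Streaming is a one-line change on the query: passing `Trigger.Continuous` to the writer. A minimal sketch, using the console sink (one of the sinks Spark currently supports in continuous mode; whether our sink works here is exactly what this issue is about):

```scala
import org.apache.spark.sql.streaming.Trigger

// `stream` is the streaming DataFrame from the sketch above. The interval
// given to Trigger.Continuous controls how often progress is checkpointed,
// not a batching window; records flow through long-running workers as
// they arrive.
val continuousQuery = stream.selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))
  .start()
```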

We should make sure that we are compatible with this mode of operation and officially document it. This will require integration tests against the continuous run mode. I also anticipate changes in how we flush data to Elasticsearch: the sink implementation for structured streaming batches records internally, flushing them to Elasticsearch as its buffers fill up or when the sink is closed at the end of a micro-batch. Since the whole point of continuous mode is to reduce the latency of accepting a piece of data, we should explore streaming that data directly to ES if possible, or at least bound how long data is allowed to accumulate in the sink implementation (see the sketch below).
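As an interim mitigation, a hedged sketch: shrink the sink's internal buffer via the existing batch settings so records are flushed to Elasticsearch almost immediately. The specific values below are illustrative only and trade indexing overhead for latency:

```scala
// Existing es-hadoop bulk settings; small values force frequent flushes.
// `stream` is the streaming DataFrame from the first sketch.
val lowLatencyQuery = stream.selectExpr("CAST(value AS STRING)")
  .writeStream
  .format("es")
  .option("checkpointLocation", "/tmp/checkpoints/events-low-latency")
  .option("es.batch.size.entries", "1")  // flush after every record
  .option("es.batch.size.bytes", "16kb") // or on a small byte threshold
  .start("events")
```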
