Pi-Edge-Spark is a distributed real-time ETL and streaming analytics platform
built for edge clusters such as Raspberry Pi networks.
It combines PySpark Structured Streaming, Redis Streams, and MinIO
to provide an online data-processing framework where each edge node continuously
produces sensor or event data, and a central Spark cluster performs live analytics,
aggregation, and anomaly detection.
- Dynamic edge discovery – supports a variable number of worker nodes.
- Secure architecture – no hard-coded IPs; cluster info is injected at runtime (see the sketch below).
- Streaming ETL – continuous ingestion + online aggregation.
- Edge-to-Cloud bridge – lightweight message broker (Redis Streams / Redpanda).
- Hybrid storage – MinIO (S3-compatible) for history, TimescaleDB for metrics.
- Airflow orchestration – DAGs trigger ETL & upload jobs automatically.
- Visual analytics – optional Grafana or Streamlit dashboards.
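To illustrate the runtime-injection idea, the master's address can be discovered on startup and written into `conf/spark-env.sh` so that nothing is hard-coded. The sketch below is a hypothetical version of what `scripts/init_cluster_env.py` might do; the UDP-socket probe and the exact environment variables are assumptions, not the confirmed implementation.

```python
# Hypothetical sketch of runtime cluster-info injection (cf. scripts/init_cluster_env.py).
# The UDP-socket probe and the spark-env.sh template are assumptions for illustration.
import socket
from pathlib import Path


def detect_local_ip() -> str:
    """Find the IP of the interface that routes to the LAN, without hard-coding it."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect(("8.8.8.8", 80))  # UDP connect: no packets are actually sent
        return s.getsockname()[0]


def write_spark_env(master_ip: str, conf_dir: str = "conf") -> Path:
    """Render spark-env.sh so workers can join the detected master."""
    env_file = Path(conf_dir) / "spark-env.sh"
    env_file.write_text(
        f"export SPARK_MASTER_HOST={master_ip}\n"
        "export SPARK_MASTER_PORT=7077\n"
    )
    return env_file


if __name__ == "__main__":
    ip = detect_local_ip()
    print(f"Detected master IP {ip}; wrote {write_spark_env(ip)}")
```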
```
┌────────────────────────────────────────────────┐
│ Airflow DAGs                                   │
│  • Schedule ETL and upload jobs                │
│  • Monitor cluster health                      │
└────────────────────────────────────────────────┘
                        │
                        ▼
┌────────────────────────────────────────────────┐
│ Spark Structured Streaming Cluster             │
│  • Subscribe to Redis / Kafka topics           │
│  • Aggregate, clean, and detect anomalies      │
│  • Output → MinIO / TimescaleDB                │
└────────────────────────────────────────────────┘
                        ▲
                        │
┌──────────────┬──────────────┬──────────────┐
│ Edge Node 1  │ Edge Node 2  │ Edge Node N  │
├──────────────┼──────────────┼──────────────┤
│ • Sensor data│ • File tail  │ • MQTT input │
│ • Python pub │ • Redis pub  │ • local ETL  │
└──────────────┴──────────────┴──────────────┘
                      │
                      ▼
             Message Broker Layer
      (Redis Streams / Kafka / Redpanda)
```
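To make the Edge → Broker path concrete, a node-side producer can serialize each reading as JSON and append it to a Redis Stream. The following is a minimal sketch along the lines of `scripts/edge_producer.py`; the stream key `sensor:readings`, the `BROKER_HOST`/`NODE_ID` environment variables, and the `read_sensor()` stub are illustrative assumptions.

```python
# Minimal edge-producer sketch (cf. scripts/edge_producer.py).
# Stream key, env-var names, and read_sensor() are illustrative assumptions.
import json
import os
import random
import time

import redis  # pip install redis


def read_sensor() -> dict:
    """Stand-in for a real sensor read (e.g. temperature over GPIO/I2C)."""
    return {"temperature": round(random.uniform(18.0, 30.0), 2)}


r = redis.Redis(host=os.getenv("BROKER_HOST", "localhost"), port=6379)

while True:
    event = {"node_id": os.getenv("NODE_ID", "edge-1"), "ts": time.time(), **read_sensor()}
    # XADD appends the event to the stream; the central cluster consumes it downstream.
    r.xadd("sensor:readings", {"payload": json.dumps(event)})
    time.sleep(1.0)
```

The table below summarizes the role of each layer in the stack.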
| Layer | Component | Description |
|---|---|---|
| Edge | Python Stream Producer | Continuously reads local sensors or logs and publishes structured JSON messages to Redis Streams or Redpanda. |
| Broker | Redis Streams / Redpanda | Acts as the lightweight message queue between Edge nodes and the central Spark cluster. |
| Compute | Spark Structured Streaming | Performs real-time aggregation, cleaning, and anomaly detection across all Edge nodes (sketched below). |
| Storage | MinIO / TimescaleDB | Stores processed results, historical archives, and time-series metrics. |
| Orchestration | Airflow | Automates scheduled ETL runs, uploads, and cluster monitoring workflows. |
| Visualization | Grafana / Streamlit | Provides real-time dashboards and system health visualization for devices and KPIs. |
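For the Compute layer, the core job (`spark_jobs/streaming_etl.py`) could look roughly like the sketch below. It uses Spark's built-in Kafka source, which also works against Redpanda since Redpanda is Kafka-API-compatible; consuming Redis Streams directly would need a separate bridge or connector. The topic name, record schema, window size, anomaly threshold, and sink paths are assumptions for illustration.

```python
# Rough sketch of the Structured Streaming ETL (cf. spark_jobs/streaming_etl.py).
# Topic, schema, window, threshold, and sink paths are illustrative assumptions.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("pi-edge-streaming-etl").getOrCreate()

schema = StructType([
    StructField("node_id", StringType()),
    StructField("ts", DoubleType()),
    StructField("temperature", DoubleType()),
])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "sensor.readings")
       .load())

events = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", schema).alias("e"))
          .select("e.*")
          .withColumn("event_time", F.col("ts").cast("timestamp")))

# 1-minute tumbling-window average per node; flag simple out-of-range readings.
agg = (events
       .withWatermark("event_time", "2 minutes")
       .groupBy(F.window("event_time", "1 minute"), "node_id")
       .agg(F.avg("temperature").alias("avg_temp"))
       .withColumn("anomaly", F.col("avg_temp") > F.lit(28.0)))

query = (agg.writeStream
         .outputMode("append")
         .format("parquet")
         .option("path", "s3a://pi-edge/processed/")  # MinIO via the s3a connector
         .option("checkpointLocation", "s3a://pi-edge/checkpoints/etl/")
         .start())

query.awaitTermination()
```

The repository is organized as follows: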
```
Pi-Edge-Spark/
├── conf/
│   ├── cluster.yaml              # Cluster configuration (auto or manual worker list)
│   └── spark-env.sh              # Auto-generated Spark master environment file
│
├── data/
│   ├── raw/                      # Edge-sourced raw CSV or JSON input data
│   └── processed/                # Processed & aggregated data outputs
│
├── dags/
│   └── pi_edge_streaming_dag.py  # Airflow DAG for ETL orchestration and upload
│
├── scripts/
│   ├── init_cluster_env.py       # Detects master IP, writes spark-env.sh
│   ├── edge_producer.py          # Example edge node streaming data producer
│   ├── upload_to_minio.py        # Uploads processed results to MinIO
│   ├── setup_minio.py            # Initializes MinIO bucket and access policy
│   └── run_local_test.sh         # Local cluster test runner script
│
├── spark_jobs/
│   ├── streaming_etl.py          # Core Spark Structured Streaming job
│   └── batch_etl.py              # Offline fallback ETL process
│
├── docker/
│   ├── docker-compose.yml        # Optional: Spark + Redis + MinIO stack
│   └── airflow.dockerfile        # Lightweight Airflow image (for Pi/ARM)
│
├── requirements.txt              # Python dependency list
└── README.md                     # Project documentation
```
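As a final illustration, `dags/pi_edge_streaming_dag.py` could tie the pieces together with something like the sketch below: one task submits the batch ETL job and a second uploads the results to MinIO. The schedule, file paths, the `spark_master` Airflow Variable, and the spark-submit options are assumptions, not the DAG's confirmed contents.

```python
# Hypothetical orchestration sketch (cf. dags/pi_edge_streaming_dag.py).
# Schedule, paths, the spark_master Variable, and CLI flags are assumptions.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="pi_edge_streaming",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    run_batch_etl = BashOperator(
        task_id="run_batch_etl",
        bash_command=(
            "spark-submit --master spark://{{ var.value.spark_master }}:7077 "
            "/opt/pi-edge-spark/spark_jobs/batch_etl.py"
        ),
    )

    upload_results = BashOperator(
        task_id="upload_to_minio",
        bash_command="python /opt/pi-edge-spark/scripts/upload_to_minio.py",
    )

    run_batch_etl >> upload_results
```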