# 🧠 Pi-Edge-Spark

Pi-Edge-Spark is a distributed real-time ETL and streaming analytics platform
built for edge clusters such as Raspberry Pi networks.

It combines PySpark Structured Streaming, Redis Streams, and MinIO
to provide an online data-processing framework where each edge node continuously
produces sensor or event data, and a central Spark cluster performs live analytics,
aggregation, and anomaly detection.


## ✨ Features

- **Dynamic edge discovery** – supports a variable number of worker nodes.
- **Secure architecture** – no hard-coded IPs; cluster info is injected at runtime.
- **Streaming ETL** – continuous ingestion plus online aggregation.
- **Edge-to-Cloud bridge** – lightweight message broker (Redis Streams / Redpanda).
- **Hybrid storage** – MinIO (S3-compatible) for history, TimescaleDB for metrics.
- **Airflow orchestration** – DAGs trigger ETL and upload jobs automatically.
- **Visual analytics** – optional Grafana or Streamlit dashboards.

## 🧱 Architecture Overview

```
┌──────────────────────────────────────────────┐
│               Airflow DAGs                   │
│  • Schedule ETL and upload jobs              │
│  • Monitor cluster health                    │
└──────────────────────────────────────────────┘
                       │
                       ▼
┌──────────────────────────────────────────────┐
│      Spark Structured Streaming Cluster      │
│  • Subscribe to Redis / Kafka topics         │
│  • Aggregate, clean, and detect anomalies    │
│  • Output → MinIO / TimescaleDB              │
└──────────────────────────────────────────────┘
                       ▲
                       │
┌──────────────┬──────────────┬──────────────┐
│ Edge Node 1  │ Edge Node 2  │ Edge Node N  │
├──────────────┼──────────────┼──────────────┤
│ • Sensor data│ • File tail  │ • MQTT input │
│ • Python pub │ • Redis pub  │ • local ETL  │
└──────────────┴──────────────┴──────────────┘
                       │
                       ▼
             Message Broker Layer
      (Redis Streams / Kafka / Redpanda)
```
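As a rough sketch of the edge side of this diagram, a producer might build flat string-valued records and append them to a Redis Stream with `XADD`. The stream name `sensor:readings`, broker host, and field layout below are illustrative assumptions, not the repository's actual schema:

```python
import json
import random
import socket
import time

STREAM = "sensor:readings"  # hypothetical stream name

def build_reading(node_id: str) -> dict:
    """Build one structured reading as a flat dict of strings,
    the field format Redis Streams' XADD expects."""
    return {
        "node": node_id,
        "ts": str(time.time()),
        "payload": json.dumps({"temp_c": round(random.uniform(18.0, 32.0), 2)}),
    }

def run(broker_host: str = "redis.local", interval_s: float = 1.0) -> None:
    """Publish readings forever. Needs `pip install redis` and a reachable broker."""
    import redis  # imported here so the sketch stays importable without redis

    r = redis.Redis(host=broker_host, port=6379)
    node_id = socket.gethostname()
    while True:
        r.xadd(STREAM, build_reading(node_id))  # append one entry to the stream
        time.sleep(interval_s)
```

Keeping the payload as a JSON string inside a flat dict matters because Redis Stream entries are field-value maps of bytes/strings, not nested objects.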

βš™οΈ Core Components

| Layer | Component | Description |
|-------|-----------|-------------|
| Edge | 🐍 Python Stream Producer | Continuously reads local sensors or logs and publishes structured JSON messages to Redis Streams or Redpanda. |
| Broker | 🧩 Redis Streams / Redpanda | Acts as the lightweight message queue between edge nodes and the central Spark cluster. |
| Compute | 🔥 Spark Structured Streaming | Performs real-time aggregation, cleaning, and anomaly detection across all edge nodes. |
| Storage | ☁️ MinIO / TimescaleDB | Stores processed results, historical archives, and time-series metrics. |
| Orchestration | ⚙️ Airflow | Automates scheduled ETL runs, uploads, and cluster-monitoring workflows. |
| Visualization | 📊 Grafana / Streamlit | Provides real-time dashboards and system-health visualization for devices and KPIs. |
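The anomaly-detection step in the compute layer can be sketched as a rolling z-score test: flag a reading that deviates from the recent window's mean by more than a few standard deviations. How `spark_jobs/streaming_etl.py` actually flags anomalies is not documented here, so treat the threshold logic below as an assumption:

```python
import math
from collections import deque

class RollingAnomalyDetector:
    """Flag readings more than `k` standard deviations away from the
    mean of the last `window` values (a plain rolling z-score test)."""

    def __init__(self, window: int = 50, k: float = 3.0):
        self.values = deque(maxlen=window)
        self.k = k

    def update(self, x: float) -> bool:
        """Add one reading; return True if it is anomalous vs. the window so far."""
        anomalous = False
        if len(self.values) >= 2:
            mean = sum(self.values) / len(self.values)
            var = sum((v - mean) ** 2 for v in self.values) / len(self.values)
            std = math.sqrt(var)
            if std > 0:
                anomalous = abs(x - mean) > self.k * std
            else:
                anomalous = x != mean  # zero variance: any different value stands out
        self.values.append(x)
        return anomalous
```

In a Structured Streaming job, per-device logic like this would typically live in a stateful operation (e.g. `foreachBatch` or `applyInPandasWithState`) keyed by node ID, rather than in a plain Python loop.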

## 📂 Project File Structure

```
Pi-Edge-Spark/
├── conf/
│   ├── cluster.yaml              # Cluster configuration (auto or manual worker list)
│   └── spark-env.sh              # Auto-generated Spark master environment file
│
├── data/
│   ├── raw/                      # Edge-sourced raw CSV or JSON input data
│   └── processed/                # Processed & aggregated data outputs
│
├── dags/
│   └── pi_edge_streaming_dag.py  # Airflow DAG for ETL orchestration and upload
│
├── scripts/
│   ├── init_cluster_env.py       # Detects master IP, writes spark-env.sh
│   ├── edge_producer.py          # Example edge node streaming data producer
│   ├── upload_to_minio.py        # Uploads processed results to MinIO
│   ├── setup_minio.py            # Initializes MinIO bucket and access policy
│   └── run_local_test.sh         # Local cluster test runner script
│
├── spark_jobs/
│   ├── streaming_etl.py          # Core Spark Structured Streaming job
│   └── batch_etl.py              # Offline fallback ETL process
│
├── docker/
│   ├── docker-compose.yml        # Optional: Spark + Redis + MinIO stack
│   └── airflow.dockerfile        # Lightweight Airflow image (for Pi/ARM)
│
├── requirements.txt              # Python dependency list
└── README.md                     # Project documentation
```
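For example, a script like `scripts/init_cluster_env.py` ("detects master IP, writes spark-env.sh") could derive the master's outbound IP and render `conf/spark-env.sh` along the following lines. The exact environment variables the real script writes are not shown in this README, so the set below is a guess:

```python
import socket
from pathlib import Path

def detect_master_ip() -> str:
    """Find this host's outbound IP without sending traffic:
    connect() on a UDP socket only selects the local interface."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.connect(("8.8.8.8", 80))
        return s.getsockname()[0]

def render_spark_env(master_ip: str) -> str:
    """Render spark-env.sh contents for the given master IP
    (variable set is an assumption, not the repo's verified output)."""
    return (
        "#!/usr/bin/env bash\n"
        f"export SPARK_MASTER_HOST={master_ip}\n"
        "export SPARK_MASTER_PORT=7077\n"
    )

def write_spark_env(path: Path, master_ip: str) -> None:
    """Write the rendered file, e.g. to conf/spark-env.sh."""
    path.write_text(render_spark_env(master_ip))
```

Because no IPs are hard-coded anywhere else, regenerating this one file is enough to repoint workers after the master's address changes.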

## About

Edge-Orchestrated Data Processing with Apache Spark on Raspberry Pi
