This project is a Big Data ETL pipeline designed to process and analyze Brazilian e-commerce data. It implements data ingestion, transformation, and storage using Apache Spark, Hadoop, SQL, and cloud services such as Azure and Databricks, and is structured for scalable, efficient data processing that supports advanced analytics and insights.
The pipeline follows a hybrid architecture that combines Azure Data Factory for data ingestion, Azure Databricks for data transformation, and Azure Synapse for data aggregation and visualization. It leverages the Delta Lakehouse Medallion Architecture on Databricks to ensure a structured data flow from raw ingestion to refined insights.
Data Ingestion:
- Data is ingested from multiple sources including GitHub (via HTTP) and SQL tables.
- Azure Data Factory is used to move the data to Azure Data Lake Storage Gen2 (ADLS Gen2).
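As a minimal sketch of this landing step (in the project itself the movement is an Azure Data Factory copy activity; the storage account, container, and source URL below are hypothetical), a raw file can be pulled over HTTP and written to ADLS Gen2 with the Azure SDK:

```python
# Minimal sketch: fetch a raw CSV over HTTP and land it unchanged in the
# ADLS Gen2 bronze container. Account URL, container, and paths are placeholders.
import requests
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

SOURCE_URL = "https://raw.githubusercontent.com/<org>/<repo>/main/orders.csv"  # hypothetical
ACCOUNT_URL = "https://<storage-account>.dfs.core.windows.net"                 # hypothetical

# Download the raw file from GitHub.
response = requests.get(SOURCE_URL, timeout=60)
response.raise_for_status()

# Write it as-is into the raw landing zone.
service = DataLakeServiceClient(account_url=ACCOUNT_URL, credential=DefaultAzureCredential())
bronze = service.get_file_system_client("bronze")
bronze.get_file_client("ecommerce/orders/orders.csv").upload_data(response.content, overwrite=True)
```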
Data Transformation:
- Azure Databricks cleans and transforms the data.
- MongoDB data is used to enrich the information during transformation.
- The transformed data is stored back in ADLS Gen2.
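A hedged sketch of what this transformation step could look like in a Databricks notebook (storage paths, the MongoDB connection URI, and the join key are assumptions; the MongoDB read uses the Spark connector's `mongodb` source from connector v10+):

```python
# Sketch: clean raw orders, enrich them with MongoDB customer data, and write
# the result back to ADLS Gen2 as Delta. All identifiers are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()  # already provided in Databricks notebooks

# Read raw data from the bronze zone.
orders = (
    spark.read.option("header", True)
    .csv("abfss://bronze@<storage-account>.dfs.core.windows.net/ecommerce/orders/")
)

# Read enrichment data via the MongoDB Spark connector (v10+; older versions
# use the "mongo" source instead).
customers = (
    spark.read.format("mongodb")
    .option("connection.uri", "mongodb+srv://<user>:<password>@<cluster>/")
    .option("database", "ecommerce")
    .option("collection", "customers")
    .load()
)

# Clean, enrich, and persist back to ADLS Gen2.
enriched = (
    orders.dropDuplicates(["order_id"])
    .withColumn("order_purchase_timestamp", F.to_timestamp("order_purchase_timestamp"))
    .join(customers, on="customer_id", how="left")
)
enriched.write.format("delta").mode("overwrite").save(
    "abfss://silver@<storage-account>.dfs.core.windows.net/ecommerce/orders_enriched/"
)
```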
Data Storage and Visualization:
- Data from ADLS Gen2 is processed using Azure Synapse Analytics.
- Data is visualized using tools like Power BI, Tableau, and Fabric.
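As one hedged example of the Synapse side (the workspace name, storage account, and Delta path are placeholders; it assumes the gold tables are queryable through Synapse serverless SQL with the ODBC Driver 18 for SQL Server installed), the gold layer can be queried directly from Python:

```python
# Sketch: query a gold-layer Delta table through Synapse serverless SQL.
# Workspace, storage account, and path are placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=<workspace>-ondemand.sql.azuresynapse.net;"
    "Database=master;"
    "Authentication=ActiveDirectoryInteractive;"
)

query = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/gold/ecommerce/revenue_by_product/',
    FORMAT = 'DELTA'
) AS gold_rows;
"""

for row in conn.execute(query):
    print(row)
```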
The pipeline leverages the Delta Lakehouse Medallion Architecture on Databricks to organize data in three layers:
Bronze Layer:
- Raw data ingestion without schema enforcement.
- Acts as a landing zone for batch and streaming data.
Silver Layer:
- Data is cleaned, enriched, and structured.
- Schema enforcement is applied during transformation.
Gold Layer:
- Aggregated data for business insights.
- High-quality, ready-for-analysis data.
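A condensed PySpark sketch of the three layers (storage account, paths, and column names are hypothetical):

```python
# Condensed medallion flow: bronze (raw, no schema enforcement) -> silver
# (cleaned, schema enforced) -> gold (aggregated). Paths are placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
base = "abfss://lake@<storage-account>.dfs.core.windows.net/ecommerce"

# Bronze: land raw data as-is.
raw = spark.read.option("header", True).csv(f"{base}/raw/order_items/")
raw.write.format("delta").mode("append").save(f"{base}/bronze/order_items/")

# Silver: enforce types, drop bad rows and duplicates.
silver = (
    spark.read.format("delta").load(f"{base}/bronze/order_items/")
    .select(
        F.col("order_id").cast("string"),
        F.col("product_id").cast("string"),
        F.col("price").cast("double"),
    )
    .dropna(subset=["order_id", "product_id"])
    .dropDuplicates(["order_id", "product_id"])
)
silver.write.format("delta").mode("overwrite").save(f"{base}/silver/order_items/")

# Gold: aggregate for business insights (e.g., revenue per product).
gold = silver.groupBy("product_id").agg(F.sum("price").alias("total_revenue"))
gold.write.format("delta").mode("overwrite").save(f"{base}/gold/revenue_by_product/")
```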
Data Sources:
- MySQL-DB-Data: relational source tables ingested with Azure Data Factory.
- MongoDB-Data: document collections used to enrich the data during transformation.
The ETL pipeline is hosted and orchestrated using Azure services:
- Azure Data Factory: Data movement and orchestration.
- Azure Databricks: Data transformation using Spark.
- Azure Synapse: Data aggregation and visualization.
The final stage of the pipeline exposes the processed data in the Gold Layer, ready for analysis and visualization.
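Orchestration lives in Data Factory itself, but for reference, a run can also be triggered programmatically; a hedged sketch with the Azure SDK (subscription, resource group, factory, and pipeline names are placeholders):

```python
# Sketch: trigger an Azure Data Factory pipeline run and poll its status.
# All identifiers below are placeholders.
import time
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

run = client.pipelines.create_run(
    resource_group_name="<resource-group>",
    factory_name="<data-factory>",
    pipeline_name="ecommerce-etl-pipeline",  # hypothetical pipeline name
)

# Poll until the run leaves the in-progress states.
status = "InProgress"
while status in ("Queued", "InProgress"):
    time.sleep(30)
    status = client.pipeline_runs.get("<resource-group>", "<data-factory>", run.run_id).status

print(f"Pipeline run {run.run_id} finished with status: {status}")
```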
- Clone the repository:
  git clone https://BigData-ETL-Pipelines-Ecommerce.git
- Set up the environment:
  - Install Apache Spark, Hadoop, and the required libraries.
  - Configure the Azure and Databricks integration (a configuration sketch follows this list).
- Run the data transformation:
  - Use Databricks notebooks to execute the ETL tasks.
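A hedged sketch of that configuration step, assuming a service principal whose credentials sit in a Databricks secret scope (all identifiers are placeholders; `spark` and `dbutils` are provided automatically in Databricks notebooks):

```python
# Sketch: point a Databricks notebook at ADLS Gen2 via OAuth with a service
# principal. Storage account, tenant, and secret names are placeholders.
storage_account = "<storage-account>"

spark.conf.set(f"fs.azure.account.auth.type.{storage_account}.dfs.core.windows.net", "OAuth")
spark.conf.set(
    f"fs.azure.account.oauth.provider.type.{storage_account}.dfs.core.windows.net",
    "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.id.{storage_account}.dfs.core.windows.net",
    "<application-id>",
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.secret.{storage_account}.dfs.core.windows.net",
    dbutils.secrets.get(scope="<scope>", key="<secret-key>"),
)
spark.conf.set(
    f"fs.azure.account.oauth2.client.endpoint.{storage_account}.dfs.core.windows.net",
    "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
)

# Quick sanity check: list the bronze container.
display(dbutils.fs.ls(f"abfss://bronze@{storage_account}.dfs.core.windows.net/"))
```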
Data processed through this pipeline can be visualized using Power BI, Tableau, or Fabric, allowing for comprehensive insights and business intelligence applications.
Contributions to improve data processing efficiency, scalability, or analytics features are welcome. Please open a pull request with a detailed description of changes.
This project is licensed under the MIT License.

