BigData-ETL-Pipelines-Ecommerce

Project Description

This project is a comprehensive Big Data ETL pipeline designed to process and analyze Brazilian e-commerce data. It implements data ingestion, transformation, and storage using Apache Spark, Hadoop, and SQL, together with cloud services such as Azure and Databricks. The project is structured for scalable and efficient data processing, enabling advanced analytics and insights.

Project Flow and Architecture

The architecture of the pipeline follows a hybrid model combining Azure Data Factory for data ingestion, Azure Databricks for data transformation, and Azure Synapse Analytics for aggregation and visualization. The system leverages the Lakehouse Medallion Architecture on Databricks to ensure a structured data flow from raw ingestion to refined insights.

Pipeline Flow

  1. Data Ingestion:
    • Data is ingested from multiple sources, including GitHub (via HTTP) and SQL tables.
    • Azure Data Factory moves the data to Azure Data Lake Storage Gen2 (ADLS Gen2).
  2. Data Transformation:
    • Azure Databricks cleans and transforms the data (a PySpark sketch follows this list).
    • MongoDB data enriches the records during transformation.
    • The transformed data is stored back in ADLS Gen2.
  3. Data Storage and Visualization:
    • Data from ADLS Gen2 is processed using Azure Synapse Analytics.
    • Data is visualized using tools like Power BI, Tableau, and Microsoft Fabric.
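
For illustration, here is a minimal PySpark sketch of the transformation step, assuming the raw files landed by Data Factory sit under a bronze path in ADLS Gen2. The storage account, paths, and column names are placeholders, not the repo's actual schema.

  from pyspark.sql import SparkSession, functions as F

  spark = SparkSession.builder.appName("ecommerce-transform").getOrCreate()

  # Hypothetical ADLS Gen2 locations; replace <account> with a real storage account.
  RAW = "abfss://bronze@<account>.dfs.core.windows.net/olist/orders"
  OUT = "abfss://silver@<account>.dfs.core.windows.net/olist/orders_clean"

  orders = spark.read.option("header", True).csv(RAW)

  cleaned = (orders
             .dropDuplicates(["order_id"])                # drop duplicate orders
             .filter(F.col("order_status").isNotNull())   # drop incomplete rows
             .withColumn("order_purchase_ts",
                         F.to_timestamp("order_purchase_timestamp")))

  # The MongoDB category enrichment would join in here (see Data Sources below).
  cleaned.write.format("delta").mode("overwrite").save(OUT)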

Architecture Diagram

(Diagram: Data Factory ingestion → Databricks transformation → Synapse analytics and visualization.)

Data Handling with Medallion Architecture

The pipeline leverages the Delta Lakehouse Medallion Architecture on Databricks to organize data in three layers (a Delta sketch follows the list):

  1. Bronze Layer:
    • Raw data is ingested without schema enforcement.
    • Acts as a landing zone for batch and streaming data.
  2. Silver Layer:
    • Data is cleaned, enriched, and structured.
    • Schema enforcement is applied during transformation.
  3. Gold Layer:
    • Aggregated data for business insights.
    • High-quality, analysis-ready data.
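
As a hedged illustration of the three layers as Delta tables in PySpark (the paths, schema, and aggregation are assumptions for the sketch, not the project's exact code):

  from pyspark.sql import SparkSession, functions as F

  spark = SparkSession.builder.appName("medallion").getOrCreate()

  bronze = "abfss://bronze@<account>.dfs.core.windows.net/olist/order_items"
  silver = "abfss://silver@<account>.dfs.core.windows.net/olist/order_items"
  gold = "abfss://gold@<account>.dfs.core.windows.net/olist/revenue_by_seller"

  # Bronze: land the raw file as-is, with no schema enforcement.
  raw = spark.read.option("header", True).csv("/mnt/landing/order_items.csv")
  raw.write.format("delta").mode("append").save(bronze)

  # Silver: enforce types and basic quality rules.
  clean = (spark.read.format("delta").load(bronze)
           .withColumn("price", F.col("price").cast("double"))
           .dropna(subset=["order_id", "seller_id"])
           .dropDuplicates(["order_id", "order_item_id"]))
  clean.write.format("delta").mode("overwrite").save(silver)

  # Gold: aggregate into an analysis-ready table.
  (spark.read.format("delta").load(silver)
        .groupBy("seller_id")
        .agg(F.sum("price").alias("total_revenue"),
             F.count("order_id").alias("order_count"))
        .write.format("delta").mode("overwrite").save(gold))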

(Diagram: Bronze → Silver → Gold medallion flow on the Azure pipeline.)

Data Sources

  • MySQL-DB-Data:
    • Contains structured data from relational databases.
    • Example screenshot: MySQL Data
  • MongoDB-Data:
    • Contains semi-structured product category data.
    • Example screenshot: MongoDB Data
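
Both sources can be read straight into Spark on Databricks. The sketch below assumes the MySQL JDBC driver and the MongoDB Spark connector (v10+) are installed on the cluster; hosts, credentials, and database names are placeholders.

  from pyspark.sql import SparkSession

  spark = SparkSession.builder.appName("source-reads").getOrCreate()

  # Structured data from MySQL over JDBC.
  mysql_df = (spark.read.format("jdbc")
              .option("url", "jdbc:mysql://<host>:3306/ecommerce")
              .option("driver", "com.mysql.cj.jdbc.Driver")
              .option("dbtable", "orders")
              .option("user", "<user>")
              .option("password", "<password>")
              .load())

  # Semi-structured product categories from MongoDB.
  mongo_df = (spark.read.format("mongodb")
              .option("connection.uri", "mongodb://<host>:27017")
              .option("database", "ecommerce")
              .option("collection", "product_categories")
              .load())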

Building Pipelines on Azure Cloud

The ETL pipeline is hosted and orchestrated using Azure services (an ADLS Gen2 access sketch follows the list):

  • Azure Data Factory: Data movement and orchestration.
  • Azure Databricks: Data transformation using Spark.
  • Azure Synapse: Data aggregation and visualization.
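
One common way to wire Databricks to ADLS Gen2 is OAuth with a service principal, sketched below. The storage account, application IDs, and secret scope are placeholders, and dbutils is available only inside Databricks notebooks.

  # Standard ABFS OAuth settings for ADLS Gen2 access from a Databricks cluster.
  account = "<storage_account>"
  base = f"{account}.dfs.core.windows.net"
  spark.conf.set(f"fs.azure.account.auth.type.{base}", "OAuth")
  spark.conf.set(f"fs.azure.account.oauth.provider.type.{base}",
                 "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
  spark.conf.set(f"fs.azure.account.oauth2.client.id.{base}", "<application_id>")
  spark.conf.set(f"fs.azure.account.oauth2.client.secret.{base}",
                 dbutils.secrets.get(scope="<scope>", key="<key>"))  # notebook-only helper
  spark.conf.set(f"fs.azure.account.oauth2.client.endpoint.{base}",
                 "https://login.microsoftonline.com/<tenant_id>/oauth2/token")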


Final Data Transformation View

The final view of the pipeline shows the processed data in the Gold Layer, ready for analysis and visualization.

(Screenshot: Gold Layer view.)

Usage Instructions

  1. Clone the repository:
    git clone https://github.com/ragztigadi/BigData-ETL-Pipelines-Ecommerce.git
  2. Set up the environment:
    • Install Apache Spark, Hadoop, and the required libraries.
    • Configure the Azure and Databricks integration.
  3. Run the data transformation:
    • Use the Databricks notebooks to execute the ETL tasks.

Results and Visualization

Data processed through this pipeline can be visualized using Power BI, Tableau, or Microsoft Fabric, enabling comprehensive insights and business-intelligence applications.
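
For example, a Gold-layer table can be pushed into a Synapse dedicated SQL pool with the Databricks Synapse connector so the BI tools can query it. The JDBC URL, target table, and staging directory below are placeholders.

  # Hedged sketch: Delta Gold table -> Azure Synapse dedicated SQL pool.
  (spark.read.format("delta")
        .load("abfss://gold@<account>.dfs.core.windows.net/olist/revenue_by_seller")
        .write.format("com.databricks.spark.sqldw")
        .option("url", "jdbc:sqlserver://<workspace>.sql.azuresynapse.net:1433;database=<db>")
        .option("forwardSparkAzureStorageCredentials", "true")
        .option("dbTable", "gold.revenue_by_seller")
        .option("tempDir", "abfss://tmp@<account>.dfs.core.windows.net/synapse-staging")
        .mode("overwrite")
        .save())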

Contribution Guidelines

Contributions to improve data processing efficiency, scalability, or analytics features are welcome. Please open a pull request with a detailed description of changes.

License

This project is licensed under the MIT License.
