This project demonstrates a scalable data engineering solution for extracting, transforming, and storing financial data using the Alpha Vantage API. The pipeline is designed to extract historical stock data for multiple symbols, transform it into a structured format, and store it in a cloud-based data lake for downstream analytics.
- Data Extraction: Retrieves daily time-series stock data for given symbols using the Alpha Vantage API.
- Data Transformation: Cleanses and transforms raw JSON data into a structured format (e.g., date, open, high, low, close, volume).
- Data Storage: Saves the transformed data as `.csv` files to an Azure Data Lake for efficient querying and analysis.
- Cloud Orchestration: The pipeline is orchestrated and executed in Databricks Jobs.
- Incremental Loading: Supports loading new data incrementally to avoid duplicate processing.
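
The transformation, storage format, and incremental-loading features above can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the project's actual code: the function names (`transform_daily`, `filter_new_rows`, `rows_to_csv`), the `symbol` column, and the "last loaded date" watermark are all hypothetical, and the JSON key names assume the usual layout of a `TIME_SERIES_DAILY` response.

```python
import csv
import io
from datetime import date

# Assumed key names in a TIME_SERIES_DAILY payload, mapped to output columns.
FIELD_MAP = {
    "1. open": "open",
    "2. high": "high",
    "3. low": "low",
    "4. close": "close",
    "5. volume": "volume",
}
COLUMNS = ["symbol", "date", "open", "high", "low", "close", "volume"]


def transform_daily(symbol, payload):
    """Flatten the raw JSON into a list of row dicts, sorted by date."""
    series = payload.get("Time Series (Daily)", {})
    rows = []
    for day, values in series.items():
        row = {"symbol": symbol, "date": date.fromisoformat(day)}
        for raw_key, column in FIELD_MAP.items():
            number = float(values[raw_key])
            row[column] = int(number) if column == "volume" else number
        rows.append(row)
    return sorted(rows, key=lambda r: r["date"])


def filter_new_rows(rows, last_loaded):
    """Keep only rows newer than the last stored date (hypothetical watermark)."""
    if last_loaded is None:
        return rows
    return [r for r in rows if r["date"] > last_loaded]


def rows_to_csv(rows):
    """Serialize rows to CSV text; the caller decides where to write it."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "date": row["date"].isoformat()})
    return buffer.getvalue()


# Hand-written sample payload for illustration only; not real market data.
sample = {
    "Time Series (Daily)": {
        "2024-01-03": {"1. open": "11.0", "2. high": "12.0", "3. low": "10.5",
                       "4. close": "11.5", "5. volume": "2000"},
        "2024-01-02": {"1. open": "10.0", "2. high": "11.0", "3. low": "9.5",
                       "4. close": "10.5", "5. volume": "1000"},
    }
}

rows = transform_daily("IBM", sample)
new_rows = filter_new_rows(rows, last_loaded=date(2024, 1, 2))
csv_text = rows_to_csv(new_rows)
```

In a Databricks job, the same row dicts could be handed to Pandas or PySpark instead of the `csv` module; the watermark would typically be read from whatever is already stored in the data lake.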
- Python: Core language for extraction and transformation logic.
- Databricks: Cloud platform for executing the pipeline.
- Azure Data Lake Storage (ADLS): Destination for storing transformed data.
- Alpha Vantage API: Source of financial data.
- Pandas & PySpark: Data manipulation and transformation.
- Git: Version control and collaboration.
The pipeline uses the Alpha Vantage API for financial market data.
- API Details:
  - URL: https://www.alphavantage.co/
  - Function: `TIME_SERIES_DAILY`
  - Parameters:
    - `symbol`: Stock ticker symbol (e.g., IBM, MSFT, GOOGL).
    - `apikey`: Your Alpha Vantage API key.
- Rate Limits: The free tier of Alpha Vantage limits API calls to 5 per minute and 25 per day. For higher limits, a premium plan is required.
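
As a rough sketch of how a request could be built from the function and parameters listed above while staying under the per-minute limit, consider the following. The `/query` path is an assumption (only the base URL is given here), and `build_url`, `fetch_daily`, and `fetch_many` are hypothetical helper names, not part of this repository.

```python
import json
import time
import urllib.parse
import urllib.request

# Assumed endpoint path; only the base URL is documented above.
BASE_URL = "https://www.alphavantage.co/query"
# 5 calls per minute on the free tier means at least 12 seconds between calls.
MIN_SECONDS_BETWEEN_CALLS = 60 / 5


def build_url(symbol, api_key):
    """Build the request URL for one symbol's daily time series."""
    params = {"function": "TIME_SERIES_DAILY", "symbol": symbol, "apikey": api_key}
    return f"{BASE_URL}?{urllib.parse.urlencode(params)}"


def fetch_daily(symbol, api_key, timeout=30):
    """Download and decode the JSON payload for one symbol."""
    with urllib.request.urlopen(build_url(symbol, api_key), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def fetch_many(symbols, api_key, fetch=fetch_daily, sleep=time.sleep):
    """Fetch several symbols in sequence, pausing between calls.

    `fetch` and `sleep` are injectable so the loop can be exercised
    without network access or real waiting.
    """
    results = {}
    for index, symbol in enumerate(symbols):
        if index > 0:
            sleep(MIN_SECONDS_BETWEEN_CALLS)
        results[symbol] = fetch(symbol, api_key)
    return results
```

For example, `fetch_many(["IBM", "MSFT", "GOOGL"], api_key)` would make three calls roughly 12 seconds apart. This sketch does not track the 25-calls-per-day cap, so a real job would also need to limit the number of symbols per run.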
- Python 3.x installed on your machine.
- A Databricks workspace.
- An Azure Data Lake Storage account.
- Git installed and configured.
- Clone the repository:
    git clone https://github.com/your-username/your-repo.git
    cd your-repo