
URLFetchX: High-Performance Async URL Processor

Python 3.8+ · License: MIT

URLFetchX is a robust, high-performance Python script for concurrently fetching, processing, and saving data from a large list of URLs. It's built with modern asynchronous libraries (asyncio, aiohttp) and utilizes parallel processing for CPU-bound tasks.

Core Features

  • High Concurrency: Utilizes asyncio and aiohttp to handle hundreds of network requests simultaneously.
  • Parallel CPU Processing: Leverages concurrent.futures.ProcessPoolExecutor to run CPU-intensive tasks on all available cores without blocking network I/O.
  • Robust Error Handling: Automatically retries failed requests with configurable limits and delays (sketched in code after this list).
  • Resource Management: Implements backpressure to prevent the request queue from growing too large and consuming excess memory.
  • Detailed Reporting: Provides real-time, color-coded console output of the status of each URL (Alive, Dead, Retry).
  • Structured Output: Saves processed results and failures cleanly to a CSV file.
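
As an illustration of the retry behaviour, the fragment below shows one common way to combine aiohttp with a bounded retry loop. It is a minimal sketch, not the script's actual code: fetch_with_retry is a hypothetical helper, and the RETRY_LIMIT / RETRY_DELAY values simply mirror the configuration variables described under Customization.

    import asyncio
    import aiohttp

    RETRY_LIMIT = 3      # illustrative values; the real ones live on URLProcessor
    RETRY_DELAY = 2.0

    async def fetch_with_retry(session: aiohttp.ClientSession, url: str):
        """Fetch one URL, retrying up to RETRY_LIMIT times before reporting it as dead."""
        for attempt in range(1, RETRY_LIMIT + 1):
            try:
                async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as resp:
                    resp.raise_for_status()
                    return url, await resp.text()    # Alive: hand the body to the CPU stage
            except (aiohttp.ClientError, asyncio.TimeoutError):
                if attempt == RETRY_LIMIT:
                    return url, None                 # Dead: out of retries
                await asyncio.sleep(RETRY_DELAY)     # Retry: wait, then try again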

How It Works

The script follows a pipeline architecture, decoupling the stages of work so that slow network I/O never blocks CPU-bound processing; a code sketch of this wiring follows the numbered steps below.

[Input File] -> [URL Queue] -> [Fetch Workers] -> [Network Result Queue] -> [CPU Process Workers] -> [Processed Result Queue] -> [Writer Worker] -> [Output CSV]

  1. Load: URLs are loaded from the input file into an initial queue.
  2. Fetch: Asynchronous workers pick URLs from the queue, fetch their content over the network, and place the results (or failures) into a network results queue.
  3. Process: A pool of parallel processes picks successful network results, performs a CPU-heavy data processing task on the content, and places the final result into a processed results queue.
  4. Write: A single asynchronous writer takes final results and writes them row-by-row to the output CSV file.
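
In code terms, each arrow in the diagram is an asyncio.Queue, and the CPU stage hands work to a ProcessPoolExecutor so the event loop never blocks. The sketch below shows that hand-off pattern under simplified assumptions; the queue names and the placeholder cpu_heavy_processing body are illustrative, not the script's exact implementation.

    import asyncio
    from concurrent.futures import ProcessPoolExecutor

    def cpu_heavy_processing(content: str, url: str) -> str:
        """Placeholder for the real processing logic (see Customization below)."""
        return f"{url}: {len(content)} characters"

    async def cpu_stage(network_queue: asyncio.Queue, processed_queue: asyncio.Queue,
                        pool: ProcessPoolExecutor) -> None:
        """Move items from the network queue, through a worker process, to the processed queue."""
        loop = asyncio.get_running_loop()
        while True:  # in the real script, workers are shut down once all URLs are handled
            url, content = await network_queue.get()
            # Run the CPU-bound step in a separate process so fetching continues unblocked.
            result = await loop.run_in_executor(pool, cpu_heavy_processing, content, url)
            await processed_queue.put((url, result))
            network_queue.task_done()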

Prerequisites

  • Python 3.8 or newer
  • pip for installing packages

Installation

  1. Clone the repository:

    git clone https://github.com/0xWhoknows/URLFetchX.git
    cd URLFetchX
  2. Install the required packages (make sure a requirements.txt file is present in the project; a minimal example is sketched after these steps):

    pip install -r requirements.txt
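
If the repository does not already contain a requirements.txt, a minimal one only needs the asynchronous HTTP client, since asyncio and concurrent.futures ship with Python. Something along these lines:

    aiohttp
    # add beautifulsoup4 here if you use the HTML-parsing customization below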

Usage

  1. Create your URL list: Create a text file (e.g., urls.txt) and populate it with the URLs you want to process, one URL per line.

    https://example.com
    https://www.python.org
    https://httpbin.org/status/404
    
  2. Run the script: Execute urlfetchx.py from your terminal.

    python urlfetchx.py
  3. Provide the input file: The script will prompt you to enter the name of your URL file.

    Enter input filename: urls.txt
    
  4. Check the results: The script will process all URLs and save the output in a new file named results_<your_input_file>. In this example, it would be results_urls.txt.

Customization

The power of URLFetchX comes from its customizable processing logic.

Modifying the Processing Logic

The main part to edit is the cpu_heavy_processing static method within the URLProcessor class. This is where you can add your own code to parse HTML, extract data, run computations, etc.

Example: To parse the HTML content and extract the page title using BeautifulSoup, you would:

  1. Install the new library:

    pip install beautifulsoup4

    (And remember to add beautifulsoup4 to your requirements.txt file!)

  2. Update the cpu_heavy_processing method:

    from bs4 import BeautifulSoup
    
    # Inside the URLProcessor class...
    @staticmethod
    def cpu_heavy_processing(content: str, url: str) -> str:
        """
        Parses the HTML content to extract the <title> tag.
        """
        try:
            soup = BeautifulSoup(content, 'html.parser')
            title = soup.title.string.strip() if soup.title and soup.title.string else "No Title Found"
            return f"Title: {title}"
        except Exception as e:
            return f"Error processing HTML at {url}: {e}"

Adjusting Configuration

You can tune the script's performance by changing the class variables at the top of the URLProcessor class in urlfetchx.py (a sketch of this block appears after the list):

  • MAX_CONCURRENT_REQUESTS: Number of URLs to fetch at the same time.
  • CPU_WORKERS: Number of CPU cores to use for processing. None auto-detects all available cores.
  • RETRY_LIMIT: How many times to retry a failed URL.
  • RETRY_DELAY: Seconds to wait before putting a failed URL back in the queue.
  • BACKPRESSURE_THRESHOLD: The script will pause fetching new URLs if the queue of unprocessed items grows larger than this number.
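
For reference, a minimal sketch of how such a configuration block might look at the top of the class (the default values are illustrative, not necessarily what urlfetchx.py ships with):

    class URLProcessor:
        # Tuning knobs for throughput and resource usage.
        MAX_CONCURRENT_REQUESTS = 100   # simultaneous in-flight HTTP requests
        CPU_WORKERS = None              # None = auto-detect and use all available cores
        RETRY_LIMIT = 3                 # attempts per URL before it is marked Dead
        RETRY_DELAY = 2.0               # seconds to wait before re-queuing a failed URL
        BACKPRESSURE_THRESHOLD = 500    # pause fetching when unprocessed items exceed this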
