WebCrawler

A powerful and extensible C# console web crawler that recursively visits URLs, supports filtering, and exports discovered links to a file. It's useful for site mapping, link analysis, and content discovery.

Features

  • Recursive link crawling, with relative links expanded against the starting domain
  • URL filtering (domain, file extensions, keywords to include/exclude)
  • Queue-based scheduling with concurrency control (see the sketch below)
  • Export results to crawled_links.txt
  • Interactive CLI for user-defined filters
  • Console output with colored highlights
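
The crawler's internals are not reproduced in this README, so the snippet below is only a minimal sketch of how queue-based scheduling with a concurrency limit can look in C#. The starting URL, the limit of four simultaneous requests, and every name in the snippet are illustrative; none of them are taken from QueueCrawlerService.cs.

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net.Http;
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Tasks;

// Minimal sketch (.NET 6, top-level statements); not the project's actual implementation.
var http = new HttpClient();
var start = new Uri("https://example.com/");            // placeholder starting URL
var seen = new HashSet<string> { start.AbsoluteUri };
var frontier = new Queue<Uri>();
frontier.Enqueue(start);
var gate = new SemaphoreSlim(4);                        // hypothetical limit: 4 requests in flight

while (frontier.Count > 0)
{
    // Take the current wave of queued URLs and fetch them concurrently.
    var wave = new List<Uri>();
    while (frontier.Count > 0) wave.Add(frontier.Dequeue());

    var pages = await Task.WhenAll(wave.Select(async url =>
    {
        await gate.WaitAsync();
        try { return (url, html: await http.GetStringAsync(url)); }
        catch (HttpRequestException) { return (url, html: ""); }   // skip unreachable pages
        finally { gate.Release(); }
    }));

    // Extract links on the main thread, expand relative URLs, and queue unseen ones.
    foreach (var (url, html) in pages)
        foreach (Match m in Regex.Matches(html, "href=\"([^\"]+)\""))
            if (Uri.TryCreate(url, m.Groups[1].Value, out var link)
                && link.Host == start.Host                          // stay on the starting domain
                && seen.Add(link.AbsoluteUri))
                frontier.Enqueue(link);
}

File.WriteAllLines("crawled_links.txt", seen.OrderBy(u => u));
Console.WriteLine($"Saved {seen.Count} links.");

A real crawler would also add depth or page-count limits and use an HTML parser instead of a regular expression, but the queue-plus-semaphore shape is the part the feature list refers to.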

Getting Started

Prerequisites

  • .NET 6 SDK or newer
  • Internet connection

Build and Run

cd src/WebCrawler
dotnet run
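
If you prefer a standalone build to running from source, the standard .NET CLI publish command also works from the same directory (this is generic dotnet tooling, not a project-specific script):

dotnet publish -c Release

The published binaries typically end up under bin/Release/<framework>/publish.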

Usage

  1. You will be prompted to enter a starting URL.

  2. Optionally, enter filtering criteria (a sketch of how they might be combined follows this list):

    • Allowed domain (e.g., example.com)
    • Allowed extensions (.html, .php, etc.)
    • Keywords to include or exclude in URLs
  3. The crawler will process the site and save all valid links to crawled_links.txt.
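
The README does not spell out how the filtering criteria are combined, so the following is only a rough sketch of such a check in C#; the class name, method name, and parameters are hypothetical and are not the actual API of UrlHelper.cs.

using System;
using System.Linq;

// Hypothetical filter predicate; the real UrlHelper.cs may combine the criteria differently.
static class UrlFilterSketch
{
    public static bool PassesFilters(
        Uri url,
        string allowedDomain,         // e.g. "example.com"; empty string means any domain
        string[] allowedExtensions,   // e.g. ".html", ".php"; empty means any extension
        string[] includeKeywords,     // the URL must contain at least one, if any are given
        string[] excludeKeywords)     // the URL must contain none of these
    {
        string u = url.AbsoluteUri;

        if (allowedDomain.Length > 0 &&
            !url.Host.EndsWith(allowedDomain, StringComparison.OrdinalIgnoreCase))
            return false;

        if (allowedExtensions.Length > 0 &&
            !allowedExtensions.Any(ext => url.AbsolutePath.EndsWith(ext, StringComparison.OrdinalIgnoreCase)))
            return false;

        if (includeKeywords.Length > 0 &&
            !includeKeywords.Any(k => u.Contains(k, StringComparison.OrdinalIgnoreCase)))
            return false;

        return !excludeKeywords.Any(k => u.Contains(k, StringComparison.OrdinalIgnoreCase));
    }
}

For example, with allowed domain example.com, allowed extension .html, and excluded keyword logout, https://example.com/docs/index.html would be kept, while https://example.com/account/logout.html and https://cdn.other.com/logo.png would be skipped.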

Customization

You can modify filters or concurrency settings inside:

  • QueueCrawlerService.cs — crawling logic
  • UrlHelper.cs — filtering logic

Screenshots

[Screenshot: WebCrawler]

License

MIT License — use freely, modify boldly.
