Use Cases

This page showcases how urlcheck-smith can be used to solve real-world problems.

Verifying a Large List of Unknown URLs

Scenario: You have a list of hundreds or thousands of URLs (e.g., from a web crawl, a legacy database, or a set of research papers). You need to know:

1. Which URLs are still reachable (HTTP 200)?
2. Which URLs point to reliable sources (government, educational institutions, major news outlets)?
3. Which URLs are potentially suspicious or just private blogs?

Solution: Use the scan command with reliability presets.

Step 1: Prepare your input

Create a text file (e.g., sources.txt) containing your URLs, or even arbitrary text containing URLs. urlcheck-smith will automatically extract them.
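To preview which URLs a block of free text might yield before running a scan, a rough Python sketch like the one below can help. It is only an illustration: the regular expression and the `extract_urls` helper are assumptions made for this example, and urlcheck-smith's own extraction rules may differ.

```python
import re

# Hypothetical pattern (not urlcheck-smith's): an http(s) URL runs until
# whitespace or a common closing delimiter.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_urls(text: str) -> list[str]:
    """Return the unique URLs found in text, in order of first appearance."""
    seen = set()
    urls = []
    for match in URL_PATTERN.finditer(text):
        # Drop sentence punctuation that often trails a URL in prose.
        url = match.group(0).rstrip(".,;:!?")
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


sample = (
    "See https://www.data.gov/ and (https://www.bbc.co.uk/news). "
    "Also https://www.data.gov/, again."
)
found = extract_urls(sample)
print(found)
```

Duplicates are collapsed and trailing punctuation is stripped, which are the two cleanup steps most often needed when URLs are embedded in running text.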

Step 2: Run the scan with a preset

To classify URLs based on global reliability standards (like .gov, .edu, and major news domains), use the --preset global flag:

urlcheck-smith scan sources.txt --preset global --output results.csv
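As a mental model of suffix- and domain-based classification, the Python sketch below maps a hostname to a category and trust tier. The rule tables and the `classify` function are hypothetical and are not the contents of the real global preset; in particular, the tier assigned to .edu is a guess, since this page only shows tiers for government, news, and private sites.

```python
from urllib.parse import urlparse

# Hypothetical rules for illustration only; the real preset may differ.
DOMAIN_RULES = {
    "bbc.co.uk": ("news", "TIER_2_RELIABLE"),
}
SUFFIX_RULES = {
    ".gov": ("government", "TIER_1_OFFICIAL"),
    ".edu": ("education", "TIER_1_OFFICIAL"),  # assumed tier
}
DEFAULT = ("private", "TIER_3_GENERAL")


def classify(url: str) -> tuple[str, str]:
    """Return (category, trust_tier) for a URL using the rule tables above."""
    host = (urlparse(url).hostname or "").lower()
    # Exact domains and their subdomains take priority over generic suffixes.
    for domain, result in DOMAIN_RULES.items():
        if host == domain or host.endswith("." + domain):
            return result
    for suffix, result in SUFFIX_RULES.items():
        if host.endswith(suffix):
            return result
    return DEFAULT


for url in (
    "https://www.data.gov/",
    "https://www.bbc.co.uk/news",
    "http://expired-link.com/old",
):
    print(url, classify(url))
```

Checking named domains before generic suffixes lets a specific entry override a broader rule.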

Step 3: Analyze the results

The resulting results.csv will contain columns such as:

url: The extracted URL.
status_code: The HTTP status (e.g., 200, 404).
category: The site category (e.g., government, education, news, private).
trust_tier: The reliability level (e.g., TIER_1_OFFICIAL, TIER_2_RELIABLE, TIER_3_GENERAL).
soft_404_detected: Whether a “Page Not Found” message was detected in a 200 OK response.
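Once the report exists, the columns above can be filtered with a few lines of Python. The sketch below assumes the CSV header uses exactly the column names listed here and that soft_404_detected is written as True or False; both are assumptions about the file layout, and the sample rows are copied from the simplified example output on this page.

```python
import csv
import io

TRUSTED_TIERS = {"TIER_1_OFFICIAL", "TIER_2_RELIABLE"}


def filter_trusted(rows):
    """Return URLs that are reachable, not soft 404s, and in a trusted tier."""
    keep = []
    for row in rows:
        reachable = row["status_code"].strip() == "200"
        soft_404 = row["soft_404_detected"].strip().lower() == "true"
        trusted = row["trust_tier"].strip() in TRUSTED_TIERS
        if reachable and not soft_404 and trusted:
            keep.append(row["url"])
    return keep


# Inline sample so the sketch runs without a real results.csv.
sample_csv = """url,status_code,category,trust_tier,soft_404_detected
https://www.data.gov/,200,government,TIER_1_OFFICIAL,False
https://www.bbc.co.uk/news,200,news,TIER_2_RELIABLE,False
http://expired-link.com/old,404,private,TIER_3_GENERAL,False
https://example.com/missing,200,private,TIER_3_GENERAL,True
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))
trusted_urls = filter_trusted(rows)
print(trusted_urls)
```

To run the same filter on a real report, replace the in-memory sample with `csv.DictReader(open("results.csv", newline=""))`.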

Why this is impressive: Instead of manually clicking each link and guessing the site’s nature, urlcheck-smith produces a structured, automated report. By combining reachability (HTTP check), reliability (suffix/domain-based classification), and soft 404 detection, you can immediately filter for high-quality, active, and valid references.

Example Output (simplified)

URL	Status	Category	Trust Tier	Soft 404
https://www.data.gov/	200	government	TIER_1_OFFICIAL	False
https://www.bbc.co.uk/news	200	news	TIER_2_RELIABLE	False
http://expired-link.com/old	404	private	TIER_3_GENERAL	False
https://example.com/missing	200	private	TIER_3_GENERAL	True
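The last row above is a soft 404: the server answered 200 OK, but the page itself says the content is missing. One simple way to picture such a check is a phrase match on the response body, sketched below in Python. The phrase list and the `looks_like_soft_404` helper are assumptions for illustration; this is not a description of how urlcheck-smith detects soft 404s.

```python
# Hypothetical phrases that often appear on "not found" pages served with 200.
NOT_FOUND_PHRASES = (
    "page not found",
    "404 not found",
    "no longer available",
)


def looks_like_soft_404(status_code: int, body: str) -> bool:
    """Flag a 200 response whose body reads like an error page."""
    if status_code != 200:
        # A real 404 (or any other status) is not a *soft* 404.
        return False
    text = body.lower()
    return any(phrase in text for phrase in NOT_FOUND_PHRASES)


print(looks_like_soft_404(200, "<h1>Page Not Found</h1>"))
print(looks_like_soft_404(200, "<h1>Open data catalog</h1>"))
print(looks_like_soft_404(404, "<h1>Page Not Found</h1>"))
```

A phrase match like this can misfire on pages that merely discuss 404 errors, so a flagged row is best treated as a candidate for manual review.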

This allows researchers and data scientists to quickly validate their datasets and focus on the most trustworthy sources.