urlcheck-smith documentation

Welcome to the documentation for urlcheck-smith.

A compact, fast URL analysis pipeline:

Extract URLs from arbitrary text files
Classify domains using suffix-based “site runner” rules (government, edu, private, etc.)
Trust tier classification (Official, Reliable, General)
Optional HTTP checks (status, redirect, CAPTCHA/human-check heuristic)
Output results as CSV or JSONL
Standalone URL classifier (classify-url)
Interactive HTTPS URL extractor (extract-https) with CSV export
Batch classification mode (classify)
Database management command (db) to enrich or add custom trusted domains
Supports custom YAML rules, an explain mode, and a quiet mode
Classification: Assigns categories (e.g., government, education) based on domain suffix rules from the built-in UC Smith database.
HTTP Verification: Checks reachability and captures status codes.
Soft 404 Detection: Identifies pages that return a 200 OK status but contain “Page Not Found” text.
Trust Tier Analysis: Automatically categorizes URLs into TIER_1_OFFICIAL, TIER_2_RELIABLE, or TIER_3_GENERAL using TrustManager.
Human-Check Detection: Flags URLs that likely lead to CAPTCHA or bot-detection screens.
Enrichment: Queries the Google Fact Check API for known misinformation flags and updates the credibility score accordingly.
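
The first two pipeline stages (URL extraction and suffix-based classification) can be illustrated with a minimal sketch. This is not the urlcheck-smith implementation: the function names `extract_urls` and `classify_host`, the regular expression, and the three-entry rule table are assumptions made for illustration only. The real rules come from the built-in database or from custom YAML files.

```python
import re
from urllib.parse import urlsplit

# Hypothetical rule table for illustration; the shipped rules are far larger.
SUFFIX_RULES = {
    ".gov": "government",
    ".go.jp": "government",
    ".edu": "education",
}

# Deliberately simple pattern: stop at whitespace, angle brackets, and quotes.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(text):
    """Return URLs in order of first appearance, without duplicates."""
    seen = {}
    for match in URL_PATTERN.findall(text):
        # Drop punctuation that usually belongs to the sentence, not the URL.
        url = match.rstrip(".,;:!?)]")
        seen.setdefault(url, None)
    return list(seen)


def classify_host(url, rules=SUFFIX_RULES, default="private"):
    """Map a URL to a category by matching its hostname against suffix rules."""
    host = (urlsplit(url).hostname or "").lower()
    # Try longer suffixes first so that more specific rules take precedence.
    for suffix in sorted(rules, key=len, reverse=True):
        if host.endswith(suffix):
            return rules[suffix]
    return default
```

Checking longer suffixes first means a specific rule such as `.go.jp` takes precedence over any shorter, more general rule that might also match.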

Features in Detail

Soft 404 Detection

Many websites are configured to return a standard 200 OK status even when a page is missing, often displaying a custom “not found” message to users. urlcheck-smith detects this by scanning the first 2000 characters of the response for common markers like:

“page not found”
“error 404”
“the page you requested cannot be found”

If a marker is found, the soft_404_detected field in the output is set to True, allowing you to filter out these “ghost” pages from your results.
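
The check described above can be sketched in a few lines. This is an illustrative approximation rather than the tool's actual code: the function name `looks_like_soft_404` is hypothetical, the marker list contains only the three examples given here, and case-insensitive matching is an assumption.

```python
# Markers taken from the examples above; the real list may be longer.
SOFT_404_MARKERS = (
    "page not found",
    "error 404",
    "the page you requested cannot be found",
)

# Only the beginning of the response body is scanned.
SCAN_LIMIT = 2000


def looks_like_soft_404(status_code, body, markers=SOFT_404_MARKERS, limit=SCAN_LIMIT):
    """Return True when a 200 response looks like a "not found" page."""
    if status_code != 200:
        # Non-200 responses are already reported through the status code.
        return False
    # Assumption: matching is case-insensitive.
    head = body[:limit].lower()
    return any(marker in head for marker in markers)
```

Limiting the scan to the start of the body keeps the check cheap on large pages, at the cost of missing markers that appear later in the document.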

Trust Tier Classification

To help prioritize analysis, urlcheck-smith assigns a trust tier to each URL:

TIER_1_OFFICIAL: Government (.gov, .go.jp, etc.), UN, and official international domains.
TIER_2_RELIABLE: Verified news organizations (Reuters, AP, BBC, etc.) and educational institutions.
TIER_3_GENERAL: All other domains.

This is available via the trust_tier field in CSV/JSONL outputs.
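
As a rough sketch of how such a tiering could work, the example below assigns a tier from the hostname and then filters JSONL records by the trust_tier field. It does not reproduce TrustManager: the function names, the domain lists, and the `url` key in the sample records are assumptions chosen for illustration, and the shipped lists may differ.

```python
import json
from urllib.parse import urlsplit

# Illustrative lists only; these are assumptions, not the tool's real data.
TIER_1_SUFFIXES = (".gov", ".go.jp")
TIER_1_DOMAINS = ("un.org",)
TIER_2_SUFFIXES = (".edu",)
TIER_2_DOMAINS = ("reuters.com", "apnews.com", "bbc.com", "bbc.co.uk")


def _matches_domain(host, domain):
    """True for the domain itself or any of its subdomains."""
    return host == domain or host.endswith("." + domain)


def trust_tier(url):
    """Assign a trust tier to a URL based on its hostname."""
    host = (urlsplit(url).hostname or "").lower()
    if host.endswith(TIER_1_SUFFIXES) or any(
        _matches_domain(host, d) for d in TIER_1_DOMAINS
    ):
        return "TIER_1_OFFICIAL"
    if host.endswith(TIER_2_SUFFIXES) or any(
        _matches_domain(host, d) for d in TIER_2_DOMAINS
    ):
        return "TIER_2_RELIABLE"
    return "TIER_3_GENERAL"


def filter_by_tier(jsonl_lines, tier):
    """Keep only the JSONL records whose trust_tier field equals `tier`."""
    records = (json.loads(line) for line in jsonl_lines if line.strip())
    return [record for record in records if record.get("trust_tier") == tier]
```

Matching on the whole domain or a dot-prefixed subdomain avoids treating look-alike hosts such as `notreuters.com` as trusted.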
