urlcheck-smith documentation
Welcome to the documentation for urlcheck-smith.
A compact, fast URL analysis pipeline:
Extract URLs from arbitrary text files
Classify domains using suffix-based “site runner” rules (government, edu, private, etc.)
Trust Tier classification (Official, News, General)
Optional HTTP checks (status, redirect, CAPTCHA/human-check heuristic)
Output results as CSV or JSONL
Standalone URL classifier (classify-url)
Batch classification mode (classify)
Database management command (db) to enrich or add custom trusted domains
Supports custom YAML rules, explain mode, quiet mode
Classification: Assigns categories (e.g., government, education) based on domain suffix rules from the built-in UC Smith database.
HTTP Verification: Checks reachability and captures status codes.
Soft 404 Detection: Identifies pages that return a 200 OK status but contain “Page Not Found” text.
Trust Tier Analysis: Automatically categorizes URLs into TIER_1_OFFICIAL, TIER_2_RELIABLE, or TIER_3_GENERAL using TrustManager.
Human-Check Detection: Flags URLs that likely lead to CAPTCHA or bot-detection screens.
Enrichment: Queries the Google Fact Check API for known misinformation flags and updates the credibility score.
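The first two pipeline stages (URL extraction and suffix-based classification) can be illustrated with a minimal, standalone sketch. This is not urlcheck-smith's own code: the regular expression, the `SUFFIX_RULES` table, and the function names `extract_urls` and `classify_domain` are assumptions made for illustration, and the real rule set ships with the tool.

```python
import re
from urllib.parse import urlsplit

# Hypothetical suffix rules for illustration; not the tool's built-in database.
SUFFIX_RULES = {
    ".gov": "government",
    ".go.jp": "government",
    ".edu": "education",
}

# Simple pattern: an http(s) scheme followed by anything that is not
# whitespace, an angle bracket, or a quote character.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(text):
    """Return URLs found in free text, in order of appearance, without duplicates."""
    found = []
    for match in URL_PATTERN.findall(text):
        url = match.rstrip(".,;:!?")  # drop trailing sentence punctuation
        if url not in found:
            found.append(url)
    return found


def classify_domain(url, rules=SUFFIX_RULES, default="private"):
    """Map a URL to a category by matching its hostname against suffix rules."""
    host = (urlsplit(url).hostname or "").lower()
    # Try longer suffixes first so a more specific rule wins.
    for suffix in sorted(rules, key=len, reverse=True):
        if host.endswith(suffix):
            return rules[suffix]
    return default


if __name__ == "__main__":
    sample = "See https://www.usa.gov/help, and http://example.com/page."
    for url in extract_urls(sample):
        print(url, classify_domain(url))
```

Matching on the parsed hostname rather than the raw URL string avoids false positives from paths or query strings that happen to contain a suffix such as `.gov`.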
Features in Detail
Soft 404 Detection
Many websites are configured to return a standard 200 OK status even when a page is missing, often displaying a custom “not found” message to users. urlcheck-smith detects this by scanning the first 2000 characters of the response for common markers like:
“page not found”
“error 404”
“the page you requested cannot be found”
If a marker is found, the soft_404_detected field in the output is set to True, allowing you to filter out these “ghost” pages from your results.
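The check described above can be sketched in a few lines. This is a minimal illustration rather than the tool's implementation: the function name `looks_like_soft_404`, the `scan_limit` parameter, and the case-insensitive comparison are assumptions, and the marker list contains only the three examples given here.

```python
# Example markers taken from the list above; the tool's full list may differ.
SOFT_404_MARKERS = (
    "page not found",
    "error 404",
    "the page you requested cannot be found",
)


def looks_like_soft_404(status_code, body, scan_limit=2000):
    """Return True when a 200 response appears to be a "not found" page.

    Only the first `scan_limit` characters of the body are inspected,
    and matching is case-insensitive (an assumption of this sketch).
    """
    if status_code != 200:
        return False  # real error statuses are not "soft" 404s
    head = body[:scan_limit].lower()
    return any(marker in head for marker in SOFT_404_MARKERS)


if __name__ == "__main__":
    print(looks_like_soft_404(200, "<h1>Page Not Found</h1>"))
    print(looks_like_soft_404(200, "<h1>Welcome</h1>"))
```

Limiting the scan to the start of the response keeps the check cheap on large pages, at the cost of missing a marker that appears only further down.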
Trust Tier Classification
To help prioritize analysis, urlcheck-smith assigns a trust tier to each URL:
TIER_1_OFFICIAL: Government (.gov, .go.jp, etc.), UN, and official international domains.
TIER_2_RELIABLE: Verified news organizations (Reuters, AP, BBC, etc.) and educational institutions.
TIER_3_GENERAL: All other domains.
This is available via the trust_tier field in CSV/JSONL outputs.
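The tier rules can be approximated with a short standalone sketch. It does not reproduce TrustManager: the function `trust_tier`, the helper `_matches`, and the domain and suffix lists below are hypothetical stand-ins chosen to mirror the examples in this section.

```python
from urllib.parse import urlsplit

# Illustrative lists only; the actual trusted-domain data belongs to the tool.
OFFICIAL_SUFFIXES = (".gov", ".go.jp")
OFFICIAL_DOMAINS = {"un.org"}
RELIABLE_SUFFIXES = (".edu",)
RELIABLE_DOMAINS = {"reuters.com", "apnews.com", "bbc.com", "bbc.co.uk"}


def _matches(host, domains):
    """True if host is one of the domains or a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def trust_tier(url):
    """Assign a trust tier to a URL, checking the most trusted tier first."""
    host = (urlsplit(url).hostname or "").lower()
    if host.endswith(OFFICIAL_SUFFIXES) or _matches(host, OFFICIAL_DOMAINS):
        return "TIER_1_OFFICIAL"
    if host.endswith(RELIABLE_SUFFIXES) or _matches(host, RELIABLE_DOMAINS):
        return "TIER_2_RELIABLE"
    return "TIER_3_GENERAL"


if __name__ == "__main__":
    for url in ("https://www.usa.gov/", "https://www.reuters.com/world/", "https://example.com/"):
        print(url, trust_tier(url))
```

Checking tiers from most to least trusted means a domain that could match several lists always receives the highest applicable tier.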