urlcheck_smith package¶

class urlcheck_smith.CrawledURL(url: 'str', anchor_text: 'str | None' = None)[source]¶

Bases: object

anchor_text: str | None = None¶

url: str¶

class urlcheck_smith.SiteClassifier(rules_path: Path | List[Path] | None = None, explain: bool = False, normalize_domain: bool = False, db_path: str | Path | None = None)[source]¶

Bases: object

classify(records: Iterable[UrlRecord]) → List[UrlRecord][source]¶

Classifies a list of URLs into categories based on their hostname patterns.

This function processes the provided collection of UrlRecord objects, determines the category for each record based on hostname pattern matches, and applies a trust tier classification. Optionally, an explanation of the classification process can be added for each record if explain is enabled.

Parameters:

records (Iterable[UrlRecord]) – A collection of UrlRecord objects representing URLs to classify.

Returns:

A list of UrlRecord objects with updated: attributes for base_url, category, trust_tier, and optionally an explanation (explain) of classification.

Return type:

List[UrlRecord]

class urlcheck_smith.TrustManager(override_rules=None, default_tier='TIER_3_GENERAL', db_path=None)[source]¶

Bases: object

A General Purpose URL Auditor for urlcheck-smith integration.

audit_list(url_list: list) → dict[source]¶: Processes a list of raw URLs into a categorized report.

classify_url(url: str) → str[source]¶: Classifies a single URL into a trust tier using rules then fallbacks.

urlcheck_smith.check_urls(records: Iterable[UrlRecord], timeout: float = 5.0, user_agent: str | None = None) → List[UrlRecord][source]¶

Checks the URLs provided in the records and returns updated UrlRecord instances with the results of HTTP requests.

This function performs HTTP GET requests for the URLs in the given UrlRecord objects. It populates details such as HTTP status, final redirected URL, snippet of the response text, and flags for suspected human check or soft 404 detection. In case of errors, it includes the error message in the updated UrlRecord.

Parameters:

records (Iterable[UrlRecord]) – An iterable of UrlRecord instances representing the URLs to be checked.
timeout (float) – The timeout in seconds for each HTTP request. Defaults to 5.0 seconds.
user_agent (Optional[str]) – The User-Agent string to be used for HTTP requests. If not provided, a default User-Agent is used.

Returns:

A list of updated UrlRecord instances containing the results of the: HTTP requests.

Return type:

List[UrlRecord]

urlcheck_smith.crawl_url_layers(src_uri: str, *, max_pages: int = 100, timeout: float = 5.0, depth: int = 0, request_interval: float = 0.0, include_assets: bool = False) → list[CrawledURL][source]¶

Crawl a source HTML page to a fixed depth and return discovered URLs.

The source page is layer 0’s input. URLs extracted from that page are layer 0 results, URLs extracted from layer 0 pages are layer 1 results, and URLs extracted from layer 1 pages are layer 2 results. The depth argument is the deepest layer to collect. Fetch failures and non-HTML responses are skipped. At most max_pages pages are fetched, including the source page. request_interval seconds are slept between page fetches. By default, output is limited to HTML-like pages plus PDF, TXT, CSV, DOCX, XLSX, and PPTX URLs. Static asset URLs and other file types are skipped unless include_assets is true.

urlcheck_smith.extract_urls_from_paths(paths: Iterable[Path]) → List[UrlRecord][source]¶: Aggregates URLs from multiple paths using the streaming logic.

urlcheck_smith.extract_urls_from_text(text: str) → List[UrlRecord][source]¶: Extract, clean, and deduplicate URLs from a block of text using urlextract.

urlcheck_smith.run_extract_https(args: Namespace) → int[source]¶

Extracts and processes HTTPS URLs from a given source file and writes them to a CSV file.

This function performs several tasks: 1. Validates the provided input file path or prompts the user to input it. 2. Checks the existence and type of the input path to ensure it is a file. 3. Extracts unique HTTPS URLs from the input file. 4. Prompts for an output path or uses a timestamped default if not provided. 5. Writes the extracted URLs to a specified CSV file. 6. Logs the process at various steps.

Parameters:: args (Namespace) – The namespace object containing the following attributes: - input (Optional[Path]): The path to the source file containing URLs. - output (Optional[Path]): The path to the output CSV file for saving extracted URLs.
Returns:: Status code. Returns 0 if the operation is successful, 1 otherwise.
Return type:: int

urlcheck_smith.stream_extract_from_file(path: Path) → Iterator[UrlRecord][source]¶: Generator that yields URLs line-by-line to handle large files efficiently.

Submodules¶

urlcheck_smith.models¶

class urlcheck_smith.models.UrlRecord(url: str, base_url: str | None = None, category: str = 'unknown', http_status: int | None = None, redirected_url: str | None = None, human_check_suspected: bool = False, soft_404_detected: bool = False, trust_tier: str = 'TIER_3_GENERAL', error: str | None = None, explain: str | None = None)[source]

Bases: object

Represents a URL record with associated metadata.

This class is used to store and manage information about a URL, including its category, HTTP status, redirection details, and other metadata. The class is immutable and can be utilized for analysis, categorization, or tracking of URL details.

url

The URL being recorded.

Type:: str

base_url

The base URL from which the url is derived, if any.

Type:: Optional[str]

category

The category assigned to the URL. Defaults to “unknown”.

Type:: str

http_status

The HTTP status code obtained from the URL, if available.

Type:: Optional[int]

redirected_url

The URL to which the original URL redirects, if any.

Type:: Optional[str]

human_check_suspected

Indicates whether a human check or challenge was suspected while accessing the URL. Defaults to False.

Type:: bool

soft_404_detected

Indicates whether a soft 404 error was detected for the URL. Defaults to False.

Type:: bool

trust_tier

The trust level assigned to the URL, defaulting to “TIER_3_GENERAL”.

Type:: str

error

Additional information about errors encountered while accessing or processing the URL, if any.

Type:: Optional[str]

explain

An explanation or metadata describing additional details about the URL, if available.

Type:: Optional[str]

base_url: str | None = None

category: str = 'unknown'

error: str | None = None

explain: str | None = None

http_status: int | None = None

human_check_suspected: bool = False

redirected_url: str | None = None

soft_404_detected: bool = False

trust_tier: str = 'TIER_3_GENERAL'

url: str

urlcheck_smith.core.check¶

urlcheck_smith.core.check.check_urls(records: Iterable[UrlRecord], timeout: float = 5.0, user_agent: str | None = None) → List[UrlRecord][source]¶

Checks the URLs provided in the records and returns updated UrlRecord instances with the results of HTTP requests.

Parameters:

records (Iterable[UrlRecord]) – An iterable of UrlRecord instances representing the URLs to be checked.
timeout (float) – The timeout in seconds for each HTTP request. Defaults to 5.0 seconds.
user_agent (Optional[str]) – The User-Agent string to be used for HTTP requests. If not provided, a default User-Agent is used.

Returns:

A list of updated UrlRecord instances containing the results of the: HTTP requests.

Return type:

List[UrlRecord]

urlcheck_smith.core.classify¶

class urlcheck_smith.core.classify.SiteClassifier(rules_path: Path | List[Path] | None = None, explain: bool = False, normalize_domain: bool = False, db_path: str | Path | None = None)[source]¶

Bases: object

classify(records: Iterable[UrlRecord]) → List[UrlRecord][source]¶

Classifies a list of URLs into categories based on their hostname patterns.

Parameters:

records (Iterable[UrlRecord]) – A collection of UrlRecord objects representing URLs to classify.

Returns:

A list of UrlRecord objects with updated: attributes for base_url, category, trust_tier, and optionally an explanation (explain) of classification.

Return type:

List[UrlRecord]

urlcheck_smith.core.crawl¶

class urlcheck_smith.core.crawl.CrawledURL(url: 'str', anchor_text: 'str | None' = None)[source]¶

Bases: object

anchor_text: str | None = None¶

url: str¶

urlcheck_smith.core.crawl.crawl_url_layers(src_uri: str, *, max_pages: int = 100, timeout: float = 5.0, depth: int = 0, request_interval: float = 0.0, include_assets: bool = False) → list[CrawledURL][source]¶

Crawl a source HTML page to a fixed depth and return discovered URLs.

urlcheck_smith.core.extract¶

urlcheck_smith.core.extract.extract_https_urls(path: Path) → list[str][source]¶

Extracts unique HTTPS URLs from a file. Insecure HTTP links are ignored.

This function reads the content of a file from the provided path, extracts all HTTP and HTTPS URLs using a predefined regular expression, cleans the URLs by removing trailing characters such as ‘.’, ‘,’, ‘)’, ‘]’, ‘>’, or quotation marks, removes duplicates, and returns a sorted list of unique URLs. This script is made CLI-apps-friendly; e.g., a manual input string will be converted to a Path object to avoid raising errors.

Parameters:: path (Path) – The path to the file to be processed.
Returns:: A sorted list of unique cleaned HTTP and HTTPS URLs extracted from the file.
Return type:: list[str]

urlcheck_smith.core.extract.extract_urls_from_paths(paths: Iterable[Path]) → List[UrlRecord][source]¶: Aggregates URLs from multiple paths using the streaming logic.

urlcheck_smith.core.extract.extract_urls_from_text(text: str) → List[UrlRecord][source]¶: Extract, clean, and deduplicate URLs from a block of text using urlextract.

urlcheck_smith.core.extract.normalize_url(url: str) → str[source]¶: Standardize the URL to prevent duplicate checks of the same resource.

urlcheck_smith.core.extract.sha256_hex(s: str) → str[source]¶

urlcheck_smith.core.extract.stream_extract_from_file(path: Path) → Iterator[UrlRecord][source]¶: Generator that yields URLs line-by-line to handle large files efficiently.

urlcheck_smith.core.extract.urls_to_csv(urls: list[str], output_path: Path) → None[source]¶

Save the URL list to a CSV file with columns: - URL - hashed_URL (SHA-256 hex)

Parameters:

urls (list[str]) – List of URLs
output_path (Path) – Output CSV file path

urlcheck_smith.core.trust_manager¶

class urlcheck_smith.core.trust_manager.TrustManager(override_rules=None, default_tier='TIER_3_GENERAL', db_path=None)[source]¶

Bases: object

A General Purpose URL Auditor for urlcheck-smith integration.

audit_list(url_list: list) → dict[source]¶: Processes a list of raw URLs into a categorized report.

classify_url(url: str) → str[source]¶: Classifies a single URL into a trust tier using rules then fallbacks.