urlcheck_smith package

class urlcheck_smith.SiteClassifier(rules_path: Path | List[Path] | None = None, explain: bool = False, normalize_domain: bool = False, db_path: str | Path | None = None)[source]

Bases: object

classify(records: Iterable[UrlRecord]) List[UrlRecord][source]

Classifies a list of URLs into categories based on their hostname patterns.

This function processes the provided collection of UrlRecord objects, determines the category for each record based on hostname pattern matches, and applies a trust tier classification. Optionally, an explanation of the classification process can be added for each record if explain is enabled.

Parameters:

records (Iterable[UrlRecord]) – A collection of UrlRecord objects representing URLs to classify.

Returns:

A list of UrlRecord objects with updated

attributes for base_url, category, trust_tier, and optionally an explanation (explain) of classification.

Return type:

List[UrlRecord]

class urlcheck_smith.TrustManager(override_rules=None, default_tier='TIER_3_GENERAL', db_path=None)[source]

Bases: object

A General Purpose URL Auditor for urlcheck-smith integration.

audit_list(url_list: list) dict[source]

Processes a list of raw URLs into a categorized report.

classify_url(url: str) str[source]

Classifies a single URL into a trust tier using rules then fallbacks.

class urlcheck_smith.UrlRecord(url: str, base_url: str | None = None, category: str = 'unknown', http_status: int | None = None, redirected_url: str | None = None, human_check_suspected: bool = False, soft_404_detected: bool = False, trust_tier: str = 'TIER_3_GENERAL', error: str | None = None, explain: str | None = None)[source]

Bases: object

Represents a URL record with associated metadata.

This class is used to store and manage information about a URL, including its category, HTTP status, redirection details, and other metadata. The class is immutable and can be utilized for analysis, categorization, or tracking of URL details.

url

The URL being recorded.

Type:

str

base_url

The base URL from which the url is derived, if any.

Type:

Optional[str]

category

The category assigned to the URL. Defaults to “unknown”.

Type:

str

http_status

The HTTP status code obtained from the URL, if available.

Type:

Optional[int]

redirected_url

The URL to which the original URL redirects, if any.

Type:

Optional[str]

human_check_suspected

Indicates whether a human check or challenge was suspected while accessing the URL. Defaults to False.

Type:

bool

soft_404_detected

Indicates whether a soft 404 error was detected for the URL. Defaults to False.

Type:

bool

trust_tier

The trust level assigned to the URL, defaulting to “TIER_3_GENERAL”.

Type:

str

error

Additional information about errors encountered while accessing or processing the URL, if any.

Type:

Optional[str]

explain

An explanation or metadata describing additional details about the URL, if available.

Type:

Optional[str]

base_url: str | None = None
category: str = 'unknown'
error: str | None = None
explain: str | None = None
http_status: int | None = None
human_check_suspected: bool = False
redirected_url: str | None = None
soft_404_detected: bool = False
trust_tier: str = 'TIER_3_GENERAL'
url: str
urlcheck_smith.check_urls(records: Iterable[UrlRecord], timeout: float = 5.0, user_agent: str | None = None) List[UrlRecord][source]

Checks the URLs provided in the records and returns updated UrlRecord instances with the results of HTTP requests.

This function performs HTTP GET requests for the URLs in the given UrlRecord objects. It populates details such as HTTP status, final redirected URL, snippet of the response text, and flags for suspected human check or soft 404 detection. In case of errors, it includes the error message in the updated UrlRecord.

Parameters:
  • records (Iterable[UrlRecord]) – An iterable of UrlRecord instances representing the URLs to be checked.

  • timeout (float) – The timeout in seconds for each HTTP request. Defaults to 5.0 seconds.

  • user_agent (Optional[str]) – The User-Agent string to be used for HTTP requests. If not provided, a default User-Agent is used.

Returns:

A list of updated UrlRecord instances containing the results of the

HTTP requests.

Return type:

List[UrlRecord]

urlcheck_smith.extract_urls_from_paths(paths: Iterable[Path]) List[UrlRecord][source]

Aggregates URLs from multiple paths using the streaming logic.

urlcheck_smith.extract_urls_from_text(text: str) List[UrlRecord][source]

Extract, clean, and deduplicate URLs from a block of text using urlextract.

urlcheck_smith.run_extract_https(args: Namespace) int[source]

Extracts and processes HTTPS URLs from a given source file and writes them to a CSV file.

This function performs several tasks: 1. Validates the provided input file path or prompts the user to input it. 2. Checks the existence and type of the input path to ensure it is a file. 3. Extracts unique HTTPS URLs from the input file. 4. Prompts for an output path or uses a timestamped default if not provided. 5. Writes the extracted URLs to a specified CSV file. 6. Logs the process at various steps.

Parameters:

args (Namespace) – The namespace object containing the following attributes: - input (Optional[Path]): The path to the source file containing URLs. - output (Optional[Path]): The path to the output CSV file for saving extracted URLs.

Returns:

Status code. Returns 0 if the operation is successful, 1 otherwise.

Return type:

int

urlcheck_smith.stream_extract_from_file(path: Path) Iterator[UrlRecord][source]

Generator that yields URLs line-by-line to handle large files efficiently.

Submodules

urlcheck_smith.models

class urlcheck_smith.models.UrlRecord(url: str, base_url: str | None = None, category: str = 'unknown', http_status: int | None = None, redirected_url: str | None = None, human_check_suspected: bool = False, soft_404_detected: bool = False, trust_tier: str = 'TIER_3_GENERAL', error: str | None = None, explain: str | None = None)[source]

Bases: object

Represents a URL record with associated metadata.

This class is used to store and manage information about a URL, including its category, HTTP status, redirection details, and other metadata. The class is immutable and can be utilized for analysis, categorization, or tracking of URL details.

url

The URL being recorded.

Type:

str

base_url

The base URL from which the url is derived, if any.

Type:

Optional[str]

category

The category assigned to the URL. Defaults to “unknown”.

Type:

str

http_status

The HTTP status code obtained from the URL, if available.

Type:

Optional[int]

redirected_url

The URL to which the original URL redirects, if any.

Type:

Optional[str]

human_check_suspected

Indicates whether a human check or challenge was suspected while accessing the URL. Defaults to False.

Type:

bool

soft_404_detected

Indicates whether a soft 404 error was detected for the URL. Defaults to False.

Type:

bool

trust_tier

The trust level assigned to the URL, defaulting to “TIER_3_GENERAL”.

Type:

str

error

Additional information about errors encountered while accessing or processing the URL, if any.

Type:

Optional[str]

explain

An explanation or metadata describing additional details about the URL, if available.

Type:

Optional[str]

base_url: str | None = None
category: str = 'unknown'
error: str | None = None
explain: str | None = None
http_status: int | None = None
human_check_suspected: bool = False
redirected_url: str | None = None
soft_404_detected: bool = False
trust_tier: str = 'TIER_3_GENERAL'
url: str

urlcheck_smith.core.check

urlcheck_smith.core.check.check_urls(records: Iterable[UrlRecord], timeout: float = 5.0, user_agent: str | None = None) List[UrlRecord][source]

Checks the URLs provided in the records and returns updated UrlRecord instances with the results of HTTP requests.

This function performs HTTP GET requests for the URLs in the given UrlRecord objects. It populates details such as HTTP status, final redirected URL, snippet of the response text, and flags for suspected human check or soft 404 detection. In case of errors, it includes the error message in the updated UrlRecord.

Parameters:
  • records (Iterable[UrlRecord]) – An iterable of UrlRecord instances representing the URLs to be checked.

  • timeout (float) – The timeout in seconds for each HTTP request. Defaults to 5.0 seconds.

  • user_agent (Optional[str]) – The User-Agent string to be used for HTTP requests. If not provided, a default User-Agent is used.

Returns:

A list of updated UrlRecord instances containing the results of the

HTTP requests.

Return type:

List[UrlRecord]

urlcheck_smith.core.classify

class urlcheck_smith.core.classify.SiteClassifier(rules_path: Path | List[Path] | None = None, explain: bool = False, normalize_domain: bool = False, db_path: str | Path | None = None)[source]

Bases: object

classify(records: Iterable[UrlRecord]) List[UrlRecord][source]

Classifies a list of URLs into categories based on their hostname patterns.

This function processes the provided collection of UrlRecord objects, determines the category for each record based on hostname pattern matches, and applies a trust tier classification. Optionally, an explanation of the classification process can be added for each record if explain is enabled.

Parameters:

records (Iterable[UrlRecord]) – A collection of UrlRecord objects representing URLs to classify.

Returns:

A list of UrlRecord objects with updated

attributes for base_url, category, trust_tier, and optionally an explanation (explain) of classification.

Return type:

List[UrlRecord]

urlcheck_smith.core.extract

urlcheck_smith.core.extract.extract_https_urls(path: Path) list[str][source]

Extracts unique HTTPS URLs from a file. Insecure HTTP links are ignored.

This function reads the content of a file from the provided path, extracts all HTTP and HTTPS URLs using a predefined regular expression, cleans the URLs by removing trailing characters such as ‘.’, ‘,’, ‘)’, ‘]’, ‘>’, or quotation marks, removes duplicates, and returns a sorted list of unique URLs. This script is made CLI-apps-friendly; e.g., a manual input string will be converted to a Path object to avoid raising errors.

Parameters:

path (Path) – The path to the file to be processed.

Returns:

A sorted list of unique cleaned HTTP and HTTPS URLs extracted from the file.

Return type:

list[str]

urlcheck_smith.core.extract.extract_urls_from_paths(paths: Iterable[Path]) List[UrlRecord][source]

Aggregates URLs from multiple paths using the streaming logic.

urlcheck_smith.core.extract.extract_urls_from_text(text: str) List[UrlRecord][source]

Extract, clean, and deduplicate URLs from a block of text using urlextract.

urlcheck_smith.core.extract.normalize_url(url: str) str[source]

Standardize the URL to prevent duplicate checks of the same resource.

urlcheck_smith.core.extract.sha256_hex(s: str) str[source]
urlcheck_smith.core.extract.stream_extract_from_file(path: Path) Iterator[UrlRecord][source]

Generator that yields URLs line-by-line to handle large files efficiently.

urlcheck_smith.core.extract.urls_to_csv(urls: list[str], output_path: Path) None[source]

Save the URL list to a CSV file with columns: - URL - hashed_URL (SHA-256 hex)

Parameters:
  • urls (list[str]) – List of URLs

  • output_path (Path) – Output CSV file path

urlcheck_smith.core.trust_manager

class urlcheck_smith.core.trust_manager.TrustManager(override_rules=None, default_tier='TIER_3_GENERAL', db_path=None)[source]

Bases: object

A General Purpose URL Auditor for urlcheck-smith integration.

audit_list(url_list: list) dict[source]

Processes a list of raw URLs into a categorized report.

classify_url(url: str) str[source]

Classifies a single URL into a trust tier using rules then fallbacks.