urlcheck_smith package
- class urlcheck_smith.SiteClassifier(rules_path: Path | List[Path] | None = None, explain: bool = False, normalize_domain: bool = False, db_path: str | Path | None = None)
Bases: object
- classify(records: Iterable[UrlRecord]) -> List[UrlRecord]
Classifies a list of URLs into categories based on their hostname patterns.
This method processes the provided collection of UrlRecord objects, determines the category of each record from hostname pattern matches, and assigns a trust tier. If explain is enabled, an explanation of the classification is attached to each record.
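The matching step described above can be illustrated with a minimal sketch. It is not the package's implementation: the rule table, the glob-style patterns, the tier names other than TIER_3_GENERAL, and the helper name classify_host are all assumptions made for illustration.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit

# Hypothetical rule table: (hostname glob, category, trust tier).
# Only TIER_3_GENERAL appears in the documentation; the other tier names are invented.
RULES = [
    ("*.gov", "government", "TIER_1_EXAMPLE"),
    ("*.example.org", "reference", "TIER_2_EXAMPLE"),
]
DEFAULT = ("unknown", "TIER_3_GENERAL")


def classify_host(url, explain=False):
    """Return (category, trust_tier, explanation) for a single URL."""
    host = (urlsplit(url).hostname or "").lower()
    for pattern, category, tier in RULES:
        if fnmatchcase(host, pattern):
            note = f"{host} matched {pattern}" if explain else None
            return category, tier, note
    # No rule matched: fall back to the documented defaults.
    note = f"{host} matched no rule" if explain else None
    return (*DEFAULT, note)
```

The first matching rule wins in this sketch, so rule order matters; whether the real classifier resolves overlapping patterns the same way is not stated in this reference.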
- class urlcheck_smith.TrustManager(override_rules=None, default_tier='TIER_3_GENERAL', db_path=None)
Bases: object
A general-purpose URL auditor for urlcheck-smith integration.
- class urlcheck_smith.UrlRecord(url: str, base_url: str | None = None, category: str = 'unknown', http_status: int | None = None, redirected_url: str | None = None, human_check_suspected: bool = False, soft_404_detected: bool = False, trust_tier: str = 'TIER_3_GENERAL', error: str | None = None, explain: str | None = None)
Bases: object
Represents a URL record with associated metadata.
This class is used to store and manage information about a URL, including its category, HTTP status, redirection details, and other metadata. The class is immutable and can be utilized for analysis, categorization, or tracking of URL details.
- url
The URL being recorded.
Type: str
- base_url
The base URL from which the url is derived, if any.
Type: Optional[str]
- category
The category assigned to the URL. Defaults to “unknown”.
Type: str
- http_status
The HTTP status code obtained from the URL, if available.
Type: Optional[int]
- redirected_url
The URL to which the original URL redirects, if any.
Type: Optional[str]
- human_check_suspected
Indicates whether a human check or challenge was suspected while accessing the URL. Defaults to False.
Type: bool
- soft_404_detected
Indicates whether a soft 404 error was detected for the URL. Defaults to False.
Type: bool
- trust_tier
The trust level assigned to the URL. Defaults to “TIER_3_GENERAL”.
Type: str
- error
Additional information about errors encountered while accessing or processing the URL, if any.
Type: Optional[str]
- explain
An explanation or metadata describing additional details about the URL, if available.
Type: Optional[str]
- base_url: str | None = None
- category: str = 'unknown'
- error: str | None = None
- explain: str | None = None
- http_status: int | None = None
- human_check_suspected: bool = False
- redirected_url: str | None = None
- soft_404_detected: bool = False
- trust_tier: str = 'TIER_3_GENERAL'
- url: str
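Because the record is described as immutable, callers presumably derive updated copies rather than mutating fields. The sketch below mirrors the documented fields and defaults in a stand-in class named UrlRecordSketch. Modelling it as a frozen dataclass is an assumption, since this reference does not say how immutability is enforced.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class UrlRecordSketch:
    """Illustrative stand-in mirroring the documented fields and defaults."""

    url: str
    base_url: Optional[str] = None
    category: str = "unknown"
    http_status: Optional[int] = None
    redirected_url: Optional[str] = None
    human_check_suspected: bool = False
    soft_404_detected: bool = False
    trust_tier: str = "TIER_3_GENERAL"
    error: Optional[str] = None
    explain: Optional[str] = None


record = UrlRecordSketch(url="https://example.com/page")

# A frozen instance cannot be modified in place; replace() builds a new copy
# with the given fields changed and leaves the original untouched.
checked = replace(record, http_status=200, category="reference")

try:
    record.category = "news"  # attribute assignment is rejected on frozen dataclasses
except FrozenInstanceError:
    mutation_blocked = True
else:
    mutation_blocked = False
```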
- urlcheck_smith.check_urls(records: Iterable[UrlRecord], timeout: float = 5.0, user_agent: str | None = None) -> List[UrlRecord]
Checks the URLs provided in the records and returns updated UrlRecord instances with the results of HTTP requests.
This function performs HTTP GET requests for the URLs in the given UrlRecord objects. It populates details such as HTTP status, final redirected URL, snippet of the response text, and flags for suspected human check or soft 404 detection. In case of errors, it includes the error message in the updated UrlRecord.
- Parameters:
records (Iterable[UrlRecord]) – An iterable of UrlRecord instances representing the URLs to be checked.
timeout (float) – The timeout in seconds for each HTTP request. Defaults to 5.0 seconds.
user_agent (Optional[str]) – The User-Agent string to be used for HTTP requests. If not provided, a default User-Agent is used.
- Returns:
A list of updated UrlRecord instances containing the results of the HTTP requests.
- Return type:
List[UrlRecord]
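The request-and-capture behaviour described above can be sketched with the standard library alone. This is a hedged illustration, not the package's code: the helper name check_one, the dictionary result, and the default User-Agent string are assumptions, and the soft 404 and human-check heuristics are left out because this reference does not describe them.

```python
import urllib.request
from urllib.error import HTTPError, URLError


def check_one(url, timeout=5.0, user_agent="urlcheck-sketch/0.1"):
    """Fetch one URL and return result fields; fetch errors are recorded, not raised."""
    result = {"url": url, "http_status": None, "redirected_url": None, "error": None}
    try:
        req = urllib.request.Request(url, headers={"User-Agent": user_agent})
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            result["http_status"] = resp.status
            # geturl() reports the final URL after any redirects were followed.
            if resp.geturl() != url:
                result["redirected_url"] = resp.geturl()
    except HTTPError as exc:
        # 4xx/5xx responses still carry a status code worth recording.
        result["http_status"] = exc.code
        result["error"] = str(exc)
    except (URLError, ValueError, OSError) as exc:
        # Covers DNS failures, timeouts, refused connections, and malformed URLs.
        result["error"] = str(exc)
    return result
```

Recording the error message instead of raising lets a batch run continue past unreachable hosts, which matches the documented behaviour of including the error message in the updated record.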
- urlcheck_smith.extract_urls_from_paths(paths: Iterable[Path]) -> List[UrlRecord]
Aggregates URLs from multiple paths using the streaming logic.
- urlcheck_smith.extract_urls_from_text(text: str) -> List[UrlRecord]
Extract, clean, and deduplicate URLs from a block of text using urlextract.
- urlcheck_smith.run_extract_https(args: Namespace) -> int
Extracts and processes HTTPS URLs from a given source file and writes them to a CSV file.
This function performs several tasks:
1. Validates the provided input file path or prompts the user to input it.
2. Checks the existence and type of the input path to ensure it is a file.
3. Extracts unique HTTPS URLs from the input file.
4. Prompts for an output path or uses a timestamped default if not provided.
5. Writes the extracted URLs to a specified CSV file.
6. Logs the process at various steps.
- Parameters:
args (Namespace) – The namespace object containing the following attributes:
- input (Optional[Path]): The path to the source file containing URLs.
- output (Optional[Path]): The path to the output CSV file for saving extracted URLs.
- Returns:
Status code. Returns 0 if the operation is successful, 1 otherwise.
- Return type:
int
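The output side of this workflow (a timestamped default path, a CSV write, and a 0/1 status code) might look roughly like the sketch below. The filename pattern, the single 'url' column header, and both helper names are assumptions; the reference does not specify the CSV layout.

```python
import csv
from datetime import datetime
from pathlib import Path


def default_output_path(now=None):
    """Build a timestamped CSV filename (hypothetical naming scheme)."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return Path(f"urls_{stamp}.csv")


def write_urls_csv(urls, output=None):
    """Write one URL per row under a 'url' header; return 0 on success, 1 on failure."""
    target = Path(output) if output is not None else default_output_path()
    try:
        with target.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["url"])  # assumed header, not confirmed by the docs
            writer.writerows([u] for u in urls)
    except OSError:
        # Missing directory, permission problem, and similar I/O failures.
        return 1
    return 0
```

Returning an integer rather than raising keeps the function usable directly as a CLI exit status, in line with the documented return value.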
- urlcheck_smith.stream_extract_from_file(path: Path) -> Iterator[UrlRecord]
Generator that yields URLs line-by-line to handle large files efficiently.
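A line-by-line generator of this kind can be sketched as follows. The regular expression and the name stream_urls are illustrative assumptions, and the sketch yields plain strings rather than UrlRecord instances to stay self-contained.

```python
import re
import tempfile
from pathlib import Path
from typing import Iterator

# Simplified pattern for illustration; the package's own matching may differ.
URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def stream_urls(path: Path) -> Iterator[str]:
    """Yield URLs one at a time, reading the file lazily."""
    with Path(path).open("r", encoding="utf-8", errors="replace") as handle:
        for line in handle:  # the file object yields one line at a time
            for match in URL_PATTERN.finditer(line):
                yield match.group(0)


# Usage: write a small sample file, then consume the generator.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "links.txt"
    sample.write_text(
        "first https://example.com/a\nno links here\nhttp://example.org/b end\n",
        encoding="utf-8",
    )
    found = list(stream_urls(sample))
```

Because only the current line is held in memory, the memory footprint stays roughly constant regardless of file size, which is the point of the streaming design.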
Submodules
urlcheck_smith.models
- class urlcheck_smith.models.UrlRecord(url: str, base_url: str | None = None, category: str = 'unknown', http_status: int | None = None, redirected_url: str | None = None, human_check_suspected: bool = False, soft_404_detected: bool = False, trust_tier: str = 'TIER_3_GENERAL', error: str | None = None, explain: str | None = None)
Bases: object
Represents a URL record with associated metadata.
This class is used to store and manage information about a URL, including its category, HTTP status, redirection details, and other metadata. The class is immutable and can be utilized for analysis, categorization, or tracking of URL details.
- url
The URL being recorded.
Type: str
- base_url
The base URL from which the url is derived, if any.
Type: Optional[str]
- category
The category assigned to the URL. Defaults to “unknown”.
Type: str
- http_status
The HTTP status code obtained from the URL, if available.
Type: Optional[int]
- redirected_url
The URL to which the original URL redirects, if any.
Type: Optional[str]
- human_check_suspected
Indicates whether a human check or challenge was suspected while accessing the URL. Defaults to False.
Type: bool
- soft_404_detected
Indicates whether a soft 404 error was detected for the URL. Defaults to False.
Type: bool
- trust_tier
The trust level assigned to the URL. Defaults to “TIER_3_GENERAL”.
Type: str
- error
Additional information about errors encountered while accessing or processing the URL, if any.
Type: Optional[str]
- explain
An explanation or metadata describing additional details about the URL, if available.
Type: Optional[str]
- base_url: str | None = None
- category: str = 'unknown'
- error: str | None = None
- explain: str | None = None
- http_status: int | None = None
- human_check_suspected: bool = False
- redirected_url: str | None = None
- soft_404_detected: bool = False
- trust_tier: str = 'TIER_3_GENERAL'
- url: str
urlcheck_smith.core.check
- urlcheck_smith.core.check.check_urls(records: Iterable[UrlRecord], timeout: float = 5.0, user_agent: str | None = None) -> List[UrlRecord]
Checks the URLs provided in the records and returns updated UrlRecord instances with the results of HTTP requests.
This function performs HTTP GET requests for the URLs in the given UrlRecord objects. It populates details such as HTTP status, final redirected URL, snippet of the response text, and flags for suspected human check or soft 404 detection. In case of errors, it includes the error message in the updated UrlRecord.
- Parameters:
records (Iterable[UrlRecord]) – An iterable of UrlRecord instances representing the URLs to be checked.
timeout (float) – The timeout in seconds for each HTTP request. Defaults to 5.0 seconds.
user_agent (Optional[str]) – The User-Agent string to be used for HTTP requests. If not provided, a default User-Agent is used.
- Returns:
A list of updated UrlRecord instances containing the results of the HTTP requests.
- Return type:
List[UrlRecord]
urlcheck_smith.core.classify
- class urlcheck_smith.core.classify.SiteClassifier(rules_path: Path | List[Path] | None = None, explain: bool = False, normalize_domain: bool = False, db_path: str | Path | None = None)
Bases: object
- classify(records: Iterable[UrlRecord]) -> List[UrlRecord]
Classifies a list of URLs into categories based on their hostname patterns.
This method processes the provided collection of UrlRecord objects, determines the category of each record from hostname pattern matches, and assigns a trust tier. If explain is enabled, an explanation of the classification is attached to each record.
urlcheck_smith.core.extract
- urlcheck_smith.core.extract.extract_https_urls(path: Path) -> list[str]
Extracts unique HTTPS URLs from a file. Insecure HTTP links are ignored.
This function reads the content of the file at the provided path, extracts HTTPS URLs using a predefined regular expression, cleans each URL by removing trailing characters such as ‘.’, ‘,’, ‘)’, ‘]’, ‘>’, or quotation marks, removes duplicates, and returns a sorted list of unique URLs. It is designed for use from CLI applications: for example, a path entered manually as a string is converted to a Path object before use.
- Parameters:
path (Path) – The path to the file to be processed.
- Returns:
A sorted list of unique, cleaned HTTPS URLs extracted from the file.
- Return type:
list[str]
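The extract, clean, deduplicate, and sort sequence can be sketched as below. The sketch assumes HTTPS-only matching, as the summary line states; the pattern, the trailing-character set, and the name https_urls_in_file are illustrative rather than the package's actual definitions.

```python
import re
import tempfile
from pathlib import Path

# Assumed pattern: matches https:// only, so plain http:// links are skipped.
HTTPS_PATTERN = re.compile(r"https://[^\s<>\"']+")
TRAILING = ".,)]>\"'"


def https_urls_in_file(path):
    """Return a sorted list of unique, cleaned https:// URLs found in a file."""
    # Path(...) accepts both str and Path, so a manually typed string works too.
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    cleaned = {m.group(0).rstrip(TRAILING) for m in HTTPS_PATTERN.finditer(text)}
    return sorted(cleaned)


# Usage: the duplicate and the insecure link are both dropped from the result.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "notes.txt"
    sample.write_text(
        "Docs: (https://b.example/x). Mirror: https://a.example/y, "
        "legacy http://insecure.example/z and again https://b.example/x\n",
        encoding="utf-8",
    )
    result = https_urls_in_file(str(sample))
```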
- urlcheck_smith.core.extract.extract_urls_from_paths(paths: Iterable[Path]) -> List[UrlRecord]
Aggregates URLs from multiple paths using the streaming logic.
- urlcheck_smith.core.extract.extract_urls_from_text(text: str) -> List[UrlRecord]
Extract, clean, and deduplicate URLs from a block of text using urlextract.
- urlcheck_smith.core.extract.normalize_url(url: str) -> str
Standardize the URL to prevent duplicate checks of the same resource.
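The reference does not list which normalization rules are applied. As one plausible sketch, the function below lowercases the scheme and host, drops default ports and fragments, and gives an empty path the value "/". Every one of these rules, and the name normalize, is an assumption chosen to show how equivalent URLs can collapse to a single key.

```python
from urllib.parse import urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalize(url):
    """Return a canonical form of url under the assumed rules described above."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it differs from the scheme's default.
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # The fragment is dropped because it never reaches the server.
    return urlunsplit((scheme, host, path, parts.query, ""))
```

With these rules, "HTTPS://Example.COM:443/a?b=1#frag" and "https://example.com/a?b=1" map to the same string, so only one request would be needed. The sketch ignores user-info components and IPv6 literals, which a production normalizer would need to handle.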