urlcheck_smith package¶
- class urlcheck_smith.SiteClassifier(rules_path: Path | List[Path] | None = None, explain: bool = False, normalize_domain: bool = False, db_path: str | Path | None = None)[source]¶
Bases:
object- classify(records: Iterable[UrlRecord]) List[UrlRecord][source]¶
Classifies a list of URLs into categories based on their hostname patterns.
This function processes the provided collection of UrlRecord objects, determines the category for each record based on hostname pattern matches, and applies a trust tier classification. Optionally, an explanation of the classification process can be added for each record if explain is enabled.
- Parameters:
records (Iterable[UrlRecord]) – A collection of UrlRecord objects representing URLs to classify.
- Returns:
- A list of UrlRecord objects with updated
attributes for base_url, category, trust_tier, and optionally an explanation (explain) of classification.
- Return type:
List[UrlRecord]
- class urlcheck_smith.TrustManager(override_rules=None, default_tier='TIER_3_GENERAL', db_path=None)[source]¶
Bases:
objectA General Purpose URL Auditor for urlcheck-smith integration.
- urlcheck_smith.check_urls(records: Iterable[UrlRecord], timeout: float = 5.0, user_agent: str | None = None) List[UrlRecord][source]¶
Checks the URLs provided in the records and returns updated UrlRecord instances with the results of HTTP requests.
This function performs HTTP GET requests for the URLs in the given UrlRecord objects. It populates details such as HTTP status, final redirected URL, snippet of the response text, and flags for suspected human check or soft 404 detection. In case of errors, it includes the error message in the updated UrlRecord.
- Parameters:
records (Iterable[UrlRecord]) – An iterable of UrlRecord instances representing the URLs to be checked.
timeout (float) – The timeout in seconds for each HTTP request. Defaults to 5.0 seconds.
user_agent (Optional[str]) – The User-Agent string to be used for HTTP requests. If not provided, a default User-Agent is used.
- Returns:
- A list of updated UrlRecord instances containing the results of the
HTTP requests.
- Return type:
List[UrlRecord]
- urlcheck_smith.extract_urls_from_paths(paths: Iterable[Path]) List[UrlRecord][source]¶
Aggregates URLs from multiple paths using the streaming logic.
- urlcheck_smith.extract_urls_from_text(text: str) List[UrlRecord][source]¶
Extract, clean, and deduplicate URLs from a block of text using urlextract.
- urlcheck_smith.run_extract_https(args: Namespace) int[source]¶
Extracts and processes HTTPS URLs from a given source file and writes them to a CSV file.
This function performs several tasks: 1. Validates the provided input file path or prompts the user to input it. 2. Checks the existence and type of the input path to ensure it is a file. 3. Extracts unique HTTPS URLs from the input file. 4. Prompts for an output path or uses a timestamped default if not provided. 5. Writes the extracted URLs to a specified CSV file. 6. Logs the process at various steps.
- Parameters:
args (Namespace) – The namespace object containing the following attributes: - input (Optional[Path]): The path to the source file containing URLs. - output (Optional[Path]): The path to the output CSV file for saving extracted URLs.
- Returns:
Status code. Returns 0 if the operation is successful, 1 otherwise.
- Return type:
int
- urlcheck_smith.stream_extract_from_file(path: Path) Iterator[UrlRecord][source]¶
Generator that yields URLs line-by-line to handle large files efficiently.
Submodules¶
urlcheck_smith.models¶
- class urlcheck_smith.models.UrlRecord(url: str, base_url: str | None = None, category: str = 'unknown', http_status: int | None = None, redirected_url: str | None = None, human_check_suspected: bool = False, soft_404_detected: bool = False, trust_tier: str = 'TIER_3_GENERAL', error: str | None = None, explain: str | None = None)[source]
Bases:
objectRepresents a URL record with associated metadata.
This class is used to store and manage information about a URL, including its category, HTTP status, redirection details, and other metadata. The class is immutable and can be utilized for analysis, categorization, or tracking of URL details.
- url
The URL being recorded.
- Type:
str
- base_url
The base URL from which the url is derived, if any.
- Type:
Optional[str]
- category
The category assigned to the URL. Defaults to “unknown”.
- Type:
str
- http_status
The HTTP status code obtained from the URL, if available.
- Type:
Optional[int]
- redirected_url
The URL to which the original URL redirects, if any.
- Type:
Optional[str]
- human_check_suspected
Indicates whether a human check or challenge was suspected while accessing the URL. Defaults to False.
- Type:
bool
- soft_404_detected
Indicates whether a soft 404 error was detected for the URL. Defaults to False.
- Type:
bool
- trust_tier
The trust level assigned to the URL, defaulting to “TIER_3_GENERAL”.
- Type:
str
- error
Additional information about errors encountered while accessing or processing the URL, if any.
- Type:
Optional[str]
- explain
An explanation or metadata describing additional details about the URL, if available.
- Type:
Optional[str]
- base_url: str | None = None
- category: str = 'unknown'
- error: str | None = None
- explain: str | None = None
- http_status: int | None = None
- human_check_suspected: bool = False
- redirected_url: str | None = None
- soft_404_detected: bool = False
- trust_tier: str = 'TIER_3_GENERAL'
- url: str
urlcheck_smith.core.check¶
- urlcheck_smith.core.check.check_urls(records: Iterable[UrlRecord], timeout: float = 5.0, user_agent: str | None = None) List[UrlRecord][source]¶
Checks the URLs provided in the records and returns updated UrlRecord instances with the results of HTTP requests.
This function performs HTTP GET requests for the URLs in the given UrlRecord objects. It populates details such as HTTP status, final redirected URL, snippet of the response text, and flags for suspected human check or soft 404 detection. In case of errors, it includes the error message in the updated UrlRecord.
- Parameters:
records (Iterable[UrlRecord]) – An iterable of UrlRecord instances representing the URLs to be checked.
timeout (float) – The timeout in seconds for each HTTP request. Defaults to 5.0 seconds.
user_agent (Optional[str]) – The User-Agent string to be used for HTTP requests. If not provided, a default User-Agent is used.
- Returns:
- A list of updated UrlRecord instances containing the results of the
HTTP requests.
- Return type:
List[UrlRecord]
urlcheck_smith.core.classify¶
- class urlcheck_smith.core.classify.SiteClassifier(rules_path: Path | List[Path] | None = None, explain: bool = False, normalize_domain: bool = False, db_path: str | Path | None = None)[source]¶
Bases:
object- classify(records: Iterable[UrlRecord]) List[UrlRecord][source]¶
Classifies a list of URLs into categories based on their hostname patterns.
This function processes the provided collection of UrlRecord objects, determines the category for each record based on hostname pattern matches, and applies a trust tier classification. Optionally, an explanation of the classification process can be added for each record if explain is enabled.
- Parameters:
records (Iterable[UrlRecord]) – A collection of UrlRecord objects representing URLs to classify.
- Returns:
- A list of UrlRecord objects with updated
attributes for base_url, category, trust_tier, and optionally an explanation (explain) of classification.
- Return type:
List[UrlRecord]
urlcheck_smith.core.extract¶
- urlcheck_smith.core.extract.extract_https_urls(path: Path) list[str][source]¶
Extracts unique HTTPS URLs from a file. Insecure HTTP links are ignored.
This function reads the content of a file from the provided path, extracts all HTTP and HTTPS URLs using a predefined regular expression, cleans the URLs by removing trailing characters such as ‘.’, ‘,’, ‘)’, ‘]’, ‘>’, or quotation marks, removes duplicates, and returns a sorted list of unique URLs. This script is made CLI-apps-friendly; e.g., a manual input string will be converted to a Path object to avoid raising errors.
- Parameters:
path (Path) – The path to the file to be processed.
- Returns:
A sorted list of unique cleaned HTTP and HTTPS URLs extracted from the file.
- Return type:
list[str]
- urlcheck_smith.core.extract.extract_urls_from_paths(paths: Iterable[Path]) List[UrlRecord][source]¶
Aggregates URLs from multiple paths using the streaming logic.
- urlcheck_smith.core.extract.extract_urls_from_text(text: str) List[UrlRecord][source]¶
Extract, clean, and deduplicate URLs from a block of text using urlextract.
- urlcheck_smith.core.extract.normalize_url(url: str) str[source]¶
Standardize the URL to prevent duplicate checks of the same resource.