metapub.findit package

Subpackages

metapub.findit.journals package

Submodules

metapub.findit.dances module

Dance functions organized by publisher.

This module provides organized dance functions in separate files. All dance functions have been extracted from the main dances.py file and organized by publisher for better maintainability.

metapub.findit.findit module

metapub.findit.findit.log = <Logger metapub.findit (INFO)>

findit/findit.py

Provides the FindIt object, a tidy object layer over the logic.get_pdf_from_pma function (see logic.py).

The FindIt class allows lookups of the PDF starting from only a DOI or a PMID, using the following instantiation approaches:

source = FindIt('1234567') # assumes argument is a pubmed ID

source = FindIt(pmid=1234567) # pmid can be an int or a string

source = FindIt(doi="10.xxxx/xxx.xxx") # doi instead of pmid.

See the FindIt docstring for more information.

* IMPORTANT NOTE *

In many cases, this code performs intermediary HTTP requests in order to scrape a PDF url out of a page, and sometimes tests the url to make sure that what’s being sent back is in fact a PDF.

If you would like these requests to go through a proxy (e.g. to avoid making repeated requests to the same servers, which can get your IP address blocked by PubMed Central), set the HTTP_PROXY environment variable in your code or on the command line before using any FindIt functionality.
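As a minimal sketch of setting the variable from code, the snippet below writes HTTP_PROXY into the process environment before any lookups run. The proxy address and the configure_proxy helper are hypothetical placeholders, not part of metapub.

```python
import os

# Hypothetical proxy address; substitute the address of your own proxy.
PROXY_URL = "http://localhost:3128"


def configure_proxy(proxy_url=PROXY_URL):
    """Set HTTP_PROXY for this process and return the value now in effect."""
    os.environ["HTTP_PROXY"] = proxy_url
    return os.environ["HTTP_PROXY"]


# Call this before creating any FindIt objects.
configure_proxy()
```

On the command line, the equivalent is to export HTTP_PROXY in the shell before starting Python.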

class metapub.findit.findit.FindIt

Bases: object

FindIt helps locate an article’s fulltext PDF based on its pubmed ID or doi, using the following instantiation approaches:

source = FindIt('1234567') # assumes argument is a pubmed ID

source = FindIt(pmid=1234567) # pmid can be an int or a string

source = FindIt(doi="10.xxxx/xxx.xxx") # doi instead of pmid.

The machinery in the FindIt object performs all necessary data lookups (e.g. looking up a missing DOI, or using a DOI to get a PubMedArticle) to end up with a url and reason, which are attached to the FindIt object in the following attributes:

source = FindIt(pmid=PMID)
source.url
source.reason
source.pmid
source.doi
source.doi_score

The “doi_score” is an indication of where the DOI for this PMID ended up coming from. If it was supplied by the user or by PubMed, doi_score will be 100.

If CrossRef came into play during the process to find a DOI that was missing for the PubMedArticle object, the doi_score will come from CrossRef (0 to 100).
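The helper below is one way to turn doi_score into a readable label. It is an illustrative sketch: describe_doi_source is a hypothetical name, the handling of None is an assumption, and the threshold of 60 mirrors the documented doi_min_score default. Because CrossRef scores also range up to 100, a score of 100 cannot by itself distinguish the two sources.

```python
def describe_doi_source(doi_score, doi_min_score=60):
    """Return a short label for a FindIt doi_score value (hypothetical helper)."""
    if doi_score is None:
        # Assumption: treat a missing score as "no DOI was resolved".
        return "no DOI score available"
    if doi_score == 100:
        # User- or PubMed-supplied DOIs score 100; CrossRef may in principle also report 100.
        return "supplied by user or PubMed, or a top-scoring CrossRef match"
    if doi_score >= doi_min_score:
        return "CrossRef match (score %d)" % doi_score
    return "CrossRef match below threshold (score %d)" % doi_score
```

For example, describe_doi_source(source.doi_score) could be logged alongside source.doi when auditing a batch of lookups.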

Network Timeout Configuration (v0.11+):

FindIt now includes timeout controls to prevent infinite stalling:

- request_timeout: HTTP request timeout in seconds (default: 10)
- max_redirects: Maximum redirects to follow (default: 3)

These parameters are applied consistently across all publisher-specific strategies to ensure reliable operation.
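A sketch of overriding both settings for a slow connection follows. The wrapper name slow_network_lookup, its findit_cls parameter (an injection point so the function can be exercised without network access), and the values 30 and 5 are illustrative assumptions; only the request_timeout and max_redirects keyword names come from the documentation above.

```python
def slow_network_lookup(pmid, findit_cls=None, request_timeout=30, max_redirects=5):
    """Create a FindIt object with more generous network limits.

    findit_cls is a hypothetical hook for substituting a stand-in class in tests;
    when omitted, metapub's FindIt is imported and used.
    """
    if findit_cls is None:
        from metapub import FindIt  # imported lazily; requires metapub to be installed
        findit_cls = FindIt
    return findit_cls(
        pmid=pmid,
        request_timeout=request_timeout,  # seconds allowed per HTTP request
        max_redirects=max_redirects,      # redirects followed before giving up
    )
```

With metapub installed, slow_network_lookup('1234567') would return a FindIt object whose url and reason attributes can be inspected as usual.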

__init__(pmid=None, cachedir='/home/docs/.cache', **kwargs)

Initialize FindIt to locate full-text PDFs for academic papers.

Parameters:

pmid (str or int, optional) – PubMed ID of the article to find.
cachedir (str, optional) – Directory for caching results. Defaults to system cache directory. Set to None to disable caching.
**kwargs –
Additional keyword arguments:

doi (str): DOI of the article (alternative to pmid).
url (str): Pre-existing URL (for testing/validation).
use_nih (bool): Use NIH access when available. Defaults to False.
use_crossref (bool): Enable CrossRef fallback for missing DOIs. Defaults to False.
doi_min_score (int): Minimum CrossRef confidence score for DOI matches. Defaults to 60.
verify (bool): Verify URLs by testing HTTP response. Defaults to True.
retry_errors (bool): Retry if cached result has error reasons like “PAYWALL”, “TODO”, “CANTDO”, or “TXERROR”. Note: “NOFORMAT” results are always retried. Defaults to False.
debug (bool): Enable debug logging. Defaults to False.
tmpdir (str): Temporary directory for downloads. Defaults to '/tmp'.
request_timeout (int): Timeout in seconds for HTTP requests. Defaults to 10.
max_redirects (int): Maximum number of redirects to follow. Defaults to 3.

Raises:

MetaPubError – If neither pmid nor doi is provided.

Note

After initialization, access results via the url and reason attributes. If url is None, check reason for an explanation of why the PDF wasn’t found.

load(verify=True)

Find full-text PDF URL for the loaded article.

This method performs the core FindIt logic using publisher-specific strategies to locate downloadable PDFs.

Parameters:

verify (bool, optional) – Test URLs by making HTTP requests to ensure files are downloadable. Setting to False speeds up processing significantly. Defaults to True.

Returns:

A tuple of (url, reason).

url: Direct link to PDF if found, None otherwise.
reason: Explanation if PDF not found (e.g., “PAYWALL”, “NOFORMAT”). May be None if URL was successfully found.

Return type:

Tuple[Optional[str], Optional[str]]

Note

If a ConnectionError occurs during lookup, returns (None, “TXERROR: <details>”).
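The (url, reason) tuple can be triaged with a small helper like the one below. This is a sketch rather than part of metapub: summarize_result is a hypothetical name, and the only reason format it relies on is the documented "TXERROR" prefix for connection errors.

```python
def summarize_result(url, reason):
    """Turn a (url, reason) pair into a one-line status message."""
    if url is not None:
        return "found: %s" % url
    if reason is None:
        return "not found: no reason given"
    if reason.startswith("TXERROR"):
        # Connection problems are usually transient, so flag them separately.
        return "network error, worth retrying: %s" % reason
    return "not found: %s" % reason
```

A caller would typically unpack the tuple directly, for example summarize_result(*source.load(verify=False)).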

load_from_cache(verify=True, retry_errors=False)

Load article URL from cache, with fallback to fresh lookup.

Checks cache for previously computed results using article identifiers. If not cached or retry_errors is True for error reasons, performs fresh lookup and caches the result.

Parameters:

verify (bool, optional) – Verify URLs by testing HTTP response. Defaults to True.
retry_errors (bool, optional) – Force fresh lookup if cached result has error reasons like “TODO”, “PAYWALL”, “CANTDO”, or “TXERROR”. Note: “NOFORMAT” results are always retried since new publisher support is frequently added. Defaults to False.

Returns:

A tuple of (url, reason).

url: Direct link to PDF if found, None otherwise.
reason: Explanation if PDF not found, None if successful.

Return type:

Tuple[Optional[str], Optional[str]]

Note

Connection errors are not cached to avoid persisting temporary network issues.
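The retry rules described above can be sketched as a small decision function. This illustrates the documented behavior and is not metapub's actual implementation: needs_fresh_lookup is a hypothetical name, and matching reasons by prefix is an assumption about how reason strings are formatted.

```python
# Reasons retried only when retry_errors is True (taken from the documentation above).
RETRY_ON_REQUEST = ("PAYWALL", "TODO", "CANTDO", "TXERROR")
# Reasons retried regardless of retry_errors.
ALWAYS_RETRIED = ("NOFORMAT",)


def needs_fresh_lookup(cached, retry_errors=False):
    """Decide whether to bypass the cache.

    cached is None on a cache miss, otherwise a (url, reason) tuple.
    """
    if cached is None:
        return True  # nothing cached yet
    url, reason = cached
    if not reason:
        return False  # cached success, reuse it
    if reason.startswith(ALWAYS_RETRIED):
        return True  # publisher support may have been added since
    # Assumption: reasons begin with their error keyword.
    return bool(retry_errors) and reason.startswith(RETRY_ON_REQUEST)
```

Under this sketch, a cached "PAYWALL" result is reused by default and only re-checked when retry_errors=True, while a cached "NOFORMAT" result always triggers a new lookup.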

to_dict()

Returns a dictionary containing the public attributes of this object.

metapub.findit.logic module

metapub.findit.logic.find_article_from_pma(pma, verify=True, use_nih=False, cachedir=None, request_timeout=10, max_redirects=3)

The real workhorse of FindIt.

Based on the contents of the supplied PubMedArticle object, this function returns the best possible download link for a Pubmed PDF.

This version uses the new registry-based lookup system for scalable journal handling.

Be aware that this function no longer performs doi lookups; if you want this handled for you, use the FindIt object (which will also record the doi score from the lookup for you).

Returns (url, reason) – url being self-explanatory, and “reason” containing any qualifying message about why the url came back the way it did.

Reasons may include (but are not limited to):

“DOI missing from PubMedArticle and CrossRef lookup failed.”
“pii missing from PubMedArticle XML”
“No URL format for Journal %s”

Optional params: use_nih – source PubMed Central articles from nih.gov (NOT recommended)

Parameters:

pma – (PubMedArticle object)
verify – (bool) default: True
use_nih – (bool) default: False
cachedir – (str) cache directory for registry database
request_timeout – (int) HTTP request timeout in seconds, default: 10
max_redirects – (int) maximum redirects to follow, default: 3

Returns:

(url, reason)
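For callers who want to skip the FindIt object, the sketch below fetches a PubMedArticle and hands it straight to find_article_from_pma. The wrapper name locate_pdf and its fetch_article and finder parameters are hypothetical injection points added so the flow can be tested offline. The default path assumes metapub is installed and uses PubMedFetcher().article_by_pmid to obtain the article.

```python
def locate_pdf(pmid, fetch_article=None, finder=None, **options):
    """Fetch the article for pmid and return the finder's (url, reason) tuple.

    fetch_article and finder are hypothetical hooks for testing; by default they
    resolve to metapub's PubMedFetcher lookup and find_article_from_pma.
    """
    if fetch_article is None:
        from metapub import PubMedFetcher  # imported lazily; requires metapub
        fetch_article = PubMedFetcher().article_by_pmid
    if finder is None:
        from metapub.findit.logic import find_article_from_pma
        finder = find_article_from_pma
    pma = fetch_article(pmid)
    # No DOI lookup happens on this path; the article is passed through as fetched.
    return finder(pma, **options)
```

Keyword options such as verify=False or request_timeout=20 are forwarded unchanged to the finder function.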

metapub.findit.logic.find_article_from_doi(doi, verify=True, use_nih=False, cachedir=None, request_timeout=10, max_redirects=3)

Pull a PubMedArticle based on CrossRef lookup (using doi2pmid), then run it through find_article_from_pma.

Parameters:

doi – (string)
verify – (bool) default: True
use_nih – (bool) default: False
cachedir – (str) cache directory for registry database
request_timeout – (int) HTTP request timeout in seconds, default: 10
max_redirects – (int) maximum redirects to follow, default: 3

Returns:

(url, reason)