metapub package

Subpackages

Submodules

metapub.base module

metapub.base.parse_elink_response(xmlstr)[source]

return all Ids from an elink XML response

Parameters:: xmlstr
Returns:: list of IDs, or None if XML response empty
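
As a rough illustration of what pulling IDs out of an elink response involves, the sketch below uses only the standard library. It is not metapub's implementation: the name `extract_link_ids` is hypothetical, the sample IDs are made up, and it assumes the linked IDs sit under `LinkSetDb/Link/Id`, as they do in typical elink output.

```python
import xml.etree.ElementTree as ET


def extract_link_ids(xmlstr):
    """Return the text of every LinkSetDb/Link/Id element, or None if there are none."""
    root = ET.fromstring(xmlstr)
    ids = [elem.text for elem in root.findall(".//LinkSetDb/Link/Id")]
    return ids or None


# Made-up sample shaped like an elink response (IDs are placeholders).
SAMPLE = """<eLinkResult>
  <LinkSet>
    <DbFrom>clinvar</DbFrom>
    <IdList><Id>65533</Id></IdList>
    <LinkSetDb>
      <DbTo>pubmed</DbTo>
      <Link><Id>11111111</Id></Link>
      <Link><Id>22222222</Id></Link>
    </LinkSetDb>
  </LinkSet>
</eLinkResult>"""

print(extract_link_ids(SAMPLE))
```

Returning None rather than an empty list mirrors the documented behaviour for an empty response, so callers should test the result before iterating.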

class metapub.base.MetaPubObject(xml, root=None, *args, **kwargs)[source]

Bases: object

Base class for XML parsing objects (e.g. PubMedArticle)

__init__(xml, root=None, *args, **kwargs)[source]

Instantiate with “xml” as string or bytes containing valid XML.

Supply name of root element (string) to set virtual top level. (optional).

static parse_xml(xml, root=None)[source]

Takes xml (str or bytes) and (optionally) a root element definition string.

If root element defined, DOM object returned is rebased with this element as root.

Parameters:

xml (str or bytes)
root (str) – (optional) name of root element

Returns:

lxml document object.
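
The "rebasing" described above can be pictured with a small standard-library sketch. metapub itself returns an lxml object; here `xml.etree.ElementTree` is swapped in so the example has no third-party dependency, and `parse_xml_sketch` is a hypothetical name, not part of the library.

```python
import xml.etree.ElementTree as ET


def parse_xml_sketch(xml, root=None):
    """Parse xml (str or bytes); if root is given, return that element as the new top level."""
    dom = ET.fromstring(xml)
    if root is None or dom.tag == root:
        return dom
    # Descend to the first element with the requested name (None if absent).
    return dom.find(f".//{root}")


doc = b"<Set><Article><Title>Example</Title></Article></Set>"
print(parse_xml_sketch(doc).tag)             # top-level element
print(parse_xml_sketch(doc, "Article").tag)  # rebased element
```

With a rebased root, later lookups such as `find("Title")` can be written relative to the element of interest instead of the enclosing set.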

class metapub.base.Borg[source]

Bases: object

singleton class backing cache engine objects.

__init__()[source]
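
For readers unfamiliar with the term, the conventional Borg pattern makes every instance share one attribute dictionary. The sketch below shows that general pattern only; `BorgSketch` is a hypothetical class, and metapub's own `Borg` may be implemented differently.

```python
class BorgSketch:
    """Every instance shares the same __dict__, so state set on one is visible on all."""

    _shared_state = {}

    def __init__(self):
        self.__dict__ = self._shared_state


first = BorgSketch()
second = BorgSketch()
first.cache_path = "/tmp/example-cache.db"  # placeholder value

print(second.cache_path)   # state is shared
print(first is second)     # but the objects are distinct
```

Unlike a strict singleton, this keeps object identity separate while sharing state, which is why creating a fetcher twice is cheap once the first instance has set things up.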

metapub.cache_utils module

Utilities for cache file creation and management.

metapub.cache_utils.datetime_to_timestamp(dt, epoch=datetime.datetime(1970, 1, 1, 0, 0))[source]

takes a python datetime object and converts it to a Unix timestamp.

This is a non-timezone-aware function.

Parameters:

dt – datetime to convert to timestamp
epoch – datetime, optional specification of start of epoch [default: 1/1/1970]

Returns:

timestamp
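
A naive (non-timezone-aware) conversion of this kind can be sketched as a subtraction from the epoch. The function name below is hypothetical, and the float return type is an assumption; the library function may return a different numeric type.

```python
from datetime import datetime


def to_timestamp_sketch(dt, epoch=datetime(1970, 1, 1)):
    """Seconds elapsed between a naive epoch and a naive datetime."""
    return (dt - epoch).total_seconds()


print(to_timestamp_sketch(datetime(1970, 1, 2)))
print(to_timestamp_sketch(datetime(2000, 1, 1)))
```

Because both values are naive, the result is only meaningful if `dt` and `epoch` are expressed in the same time zone.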

metapub.cache_utils.get_cache_path(cachedir='/home/docs/.cache', filename='metapub-cache.db')[source]

checks if cachedir exists; if not, tries to create it; raises MetaPubError if it can’t be created.

if cachedir is None, returns None.

Default: DEFAULT_CACHE_DIR set in config.py (~/.cache)

Supports expansion of user directory shortcut ‘~’ to full path.

Parameters:

cachedir – directory in which to store the cache file
filename – name of cache file

Returns:

path to SQLite DB file

Raises:

MetaPubError – if cachedir does not exist and cannot be created
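
The behaviour described for get_cache_path (expand `~`, create the directory if needed, pass None through) might look roughly like the following. This is a sketch under assumptions, not the library code: `cache_path_sketch` and `CacheDirError` are hypothetical names, the latter standing in for MetaPubError.

```python
import os
import tempfile


class CacheDirError(Exception):
    """Hypothetical stand-in for MetaPubError."""


def cache_path_sketch(cachedir, filename="metapub-cache.db"):
    """Return the full path of the cache file, creating cachedir when missing."""
    if cachedir is None:
        return None
    cachedir = os.path.expanduser(cachedir)  # supports the '~' shortcut
    try:
        os.makedirs(cachedir, exist_ok=True)
    except OSError as error:
        raise CacheDirError(f"could not create {cachedir}") from error
    return os.path.join(cachedir, filename)


with tempfile.TemporaryDirectory() as scratch:
    target = os.path.join(scratch, "nested", "cache")
    print(cache_path_sketch(target))
```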

metapub.cache_utils.cleanup_dir(cachedir)[source]

Remove all files from a cache directory and delete the directory itself.

This function is used for cache maintenance and cleanup operations. Silently handles errors if files cannot be removed.

Parameters:: cachedir (str) – path to directory to clean up
Returns:: None

metapub.cite module

Common functions for the formatting of academic reference citations.

metapub.cite.author_str(author_list_or_string, as_html=False)[source]

Helper function for constructing article citations.

Parameters:: author_list_or_string
Returns:: author(s) str suitable for printed citation

metapub.cite.citation(**kwargs)[source]

Returns a formatted citation string built from this article’s author(s), title, journal, year, volume, pages, and doi.

see cite.article and cite.book for more specific use cases.

Note that “authors” (as list) will be used preferentially over “author” (as str).

Keywords:

as_html (bool) – returns citation with light HTML formatting
author (str) – prints author as-is without modification
authors (list) – prints author1 (first in list) as “Lastname_FirstInitials, et al”
title (str)
journal (str)
year (str or int)
volume (str or int)
pages (str) – should be formatted “nn-mm”, e.g. “55-58”
doi (str)

Returns:: citation (str)

metapub.cite.article(**kwargs)[source]

Returns a formatted citation string built from this article’s author(s), title, journal, year, volume, pages, and doi.

This function uses the Article format citation template. For example:

McNally EM, et al. Genetic mutations and mechanisms in dilated cardiomyopathy. Journal of Clinical Investigation. 2013; 123:19-26. doi: 10.1172/JCI62862.

Keywords:

journal
title
doi
authors (str or list) – if str, prints authors without modification.

Returns:: citation (str)
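
To make the Article template concrete, here is a minimal sketch that reproduces the example string shown above. It is an illustration only: `article_citation_sketch` is a hypothetical function, its handling of author lists is an assumption based on the documented “et al” behaviour, and the second author in the usage is a labelled placeholder.

```python
def article_citation_sketch(authors, title, journal, year, volume, pages, doi=None):
    """Build 'Author, et al. Title. Journal. Year; Volume:Pages. doi: DOI.'"""
    if isinstance(authors, str):
        author_part = authors            # strings are used without modification
    elif len(authors) > 1:
        author_part = f"{authors[0]}, et al"
    else:
        author_part = authors[0]
    text = f"{author_part}. {title}. {journal}. {year}; {volume}:{pages}."
    if doi:
        text += f" doi: {doi}."
    return text


print(article_citation_sketch(
    authors=["McNally EM", "Placeholder AB"],  # second name is a placeholder
    title="Genetic mutations and mechanisms in dilated cardiomyopathy",
    journal="Journal of Clinical Investigation",
    year=2013,
    volume=123,
    pages="19-26",
    doi="10.1172/JCI62862",
))
```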

metapub.cite.book(book, **kwargs)[source]

Takes a PubMedArticle “book” and formats a citation string. This is a special type of citation built mostly for NCBI GeneReviews and not currently generalizable to other academic books (yet).

Returns a formatted citation string for a book. A “book” needs to contain the following attributes:

author
title
book_date_revised
book_contribution_date
editors
journal
book_publisher (may be a URL)

This function uses the Book format citation template:

book_cit_fmt = ‘{author}. {title}. {cdate} (Update {mdate}). In: {editors}, editors. {journal} (Internet). {book_publisher}’

For example:

Tranebjarg L, et al. Jervell and Lange-Nielsen syndrome. 2002 Jul 29 (Updated 2014 Nov 20). In: Pagon RA, et al., editors. GeneReviews (Internet). Seattle (WA): University of Washington, Seattle; 1993-2015. Available from: https://www.ncbi.nlm.nih.gov/books/NBK1405/.

Parameters:

book – PubMedArticle of type “book”
use_html – (bool) whether to return with light HTML formatting

Returns:

formatted citation string

Return type:

str

metapub.cite.bibtex(**kwargs)[source]

Returns a BibTeX formatted citation string built from the book or article author(s), title, journal, year, volume, pages, and doi if the fields exist

see cite.article and cite.book for more specific use cases.

see https://ctan.org/tex-archive/biblio/bibtex/contrib/doc/ for more on the BibTeX format

Note that “authors” (as list) will be used preferentially over “author” (as str).

Keywords:

isbook (bool) – returns citation with standard entry type as ‘book’
author (str) – prints author as-is without modification
authors (list) – prints author1 (first in list) as “Lastname_FirstInitials, et al”
title (str)
journal (str)
year (str or int)
volume (str or int)
pages (str) – should be formatted “nn-mm”, e.g. “55-58”
doi (str)

Returns:: bibtex citation (str)
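
The "if the fields exist" behaviour can be pictured with a small generic BibTeX builder. This is a sketch and not metapub's output format: `bibtex_sketch` and its `cite_key` parameter are hypothetical, and the exact field layout the library emits may differ.

```python
def bibtex_sketch(cite_key, isbook=False, **fields):
    """Emit a BibTeX entry, skipping any field whose value is missing or empty."""
    entry_type = "book" if isbook else "article"
    lines = [f"@{entry_type}{{{cite_key},"]
    for name, value in fields.items():
        if value:
            lines.append(f"  {name} = {{{value}}},")
    lines.append("}")
    return "\n".join(lines)


print(bibtex_sketch(
    "McNally2013",  # hypothetical citation key
    author="McNally EM, et al",
    title="Genetic mutations and mechanisms in dilated cardiomyopathy",
    journal="Journal of Clinical Investigation",
    year=2013,
    volume=123,
    pages="19-26",
    doi="10.1172/JCI62862",
))
```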

metapub.clinvarfetcher module

metapub.clinvarfetcher: tools for interacting with ClinVar data

class metapub.clinvarfetcher.ClinVarFetcher(a Borg singleton object)[source]

Bases: Borg

Toolkit for retrieval of ClinVar information.

Set optional ‘cachedir’ parameter to absolute path of preferred directory if desired; cachedir defaults to <current user directory> + /.cache

clinvar = ClinVarFetcher()

clinvar = ClinVarFetcher(cachedir='/path/to/cachedir')

Usage

Get ClinVar accession IDs for gene name (switch single_gene to True to filter out results containing more genes than the specified gene being searched, default False).

cv_ids = clinvar.ids_by_gene('FGFR3', single_gene=True)

Get ClinVar accession in python dictionary format for given ID:

cv_subm = clinvar.accession(65533) # can also submit ID as string

Get list of pubmed IDs (pmids) for given ClinVar accession ID:

pmids = clinvar.pmids_for_id(65533) # can also submit ID as string

Get list of pubmed IDs (pmids) for hgvs string:

pmids = clinvar.pmids_for_hgvs('NM_017547.3:c.1289A>G')

For more info, see the ClinVar eutils page: https://www.ncbi.nlm.nih.gov/clinvar/docs/maintenance_use/

__init__(method='eutils', cachedir='default')[source]

Initialize ClinVarFetcher for clinical variant data retrieval.

Parameters:

method (str, optional) – Service method to use. Currently only ‘eutils’ is supported. Defaults to ‘eutils’.
cachedir (str, optional) – Directory for caching responses. Use ‘default’ for system cache directory. Defaults to ‘default’.

Raises:

NotImplementedError – If an unsupported method is specified.

Note

This is a Borg singleton - all instances share the same state. Provides access to NCBI’s ClinVar database for clinical significance of genetic variants, gene-disease relationships, and variant literature.

metapub.clinvarvariant module

metapub.clinvarvariant – ClinVarVariant class instantiated by supplying ESummary XML string.

class metapub.clinvarvariant.PathogenicSummary(counts: dict[Literal['pathogenic', 'likely pathogenic', 'uncertain significance', 'likely benign', 'benign', 'conflicting interpretations', 'drug response', 'risk factor', 'association', 'protective', 'other', 'likely pathogenic, low penetrance', 'pathogenic, low penetrance', 'uncertain risk allele', 'likely risk allele', 'established risk allele', 'affects', 'conflicting data from submitters', 'not provided', 'vus-high', 'vus-mid', 'vus-low'], int], total_submitters: int, consensus: Literal['pathogenic', 'likely pathogenic', 'uncertain significance', 'likely benign', 'benign', 'conflicting interpretations', 'drug response', 'risk factor', 'association', 'protective', 'other', 'likely pathogenic, low penetrance', 'pathogenic, low penetrance', 'uncertain risk allele', 'likely risk allele', 'established risk allele', 'affects', 'conflicting data from submitters', 'not provided', 'vus-high', 'vus-mid', 'vus-low'] | None, conflicting: bool, review_status: str | None)[source]

Bases: object

counts: dict[Literal['pathogenic', 'likely pathogenic', 'uncertain significance', 'likely benign', 'benign', 'conflicting interpretations', 'drug response', 'risk factor', 'association', 'protective', 'other', 'likely pathogenic, low penetrance', 'pathogenic, low penetrance', 'uncertain risk allele', 'likely risk allele', 'established risk allele', 'affects', 'conflicting data from submitters', 'not provided', 'vus-high', 'vus-mid', 'vus-low'], int]

total_submitters: int

consensus: Literal['pathogenic', 'likely pathogenic', 'uncertain significance', 'likely benign', 'benign', 'conflicting interpretations', 'drug response', 'risk factor', 'association', 'protective', 'other', 'likely pathogenic, low penetrance', 'pathogenic, low penetrance', 'uncertain risk allele', 'likely risk allele', 'established risk allele', 'affects', 'conflicting data from submitters', 'not provided', 'vus-high', 'vus-mid', 'vus-low'] | None

conflicting: bool

review_status: str | None

__init__(counts, total_submitters, consensus, conflicting, review_status)

class metapub.clinvarvariant.ClinVarVariant(xmlstr, *args, **kwargs)[source]

Bases: MetaPubObject

__init__(xmlstr, *args, **kwargs)[source]

Instantiate with “xml” as string or bytes containing valid XML.

Supply name of root element (string) to set virtual top level. (optional).

to_dict()[source]: returns a dictionary composed of all extractable properties of this concept.

property hgvs_c: Returns a list of all coding HGVS strings from the Allele data.

property hgvs_g: Returns a list of all genomic HGVS strings from the Allele data.

property hgvs_p: Returns a list of all protein effect HGVS strings from the Allele data.

static parse_xml(xml, root=None)

Takes xml (str or bytes) and (optionally) a root element definition string.

If root element defined, DOM object returned is rebased with this element as root.

Parameters:

xml (str or bytes)
root (str) – (optional) name of root element

Returns:

lxml document object.

metapub.config module

metapub.config.get_process_log(filepath, loglevel=20, name='metapub.process')[source]: Sets up a file-based logger for process logging and returns its log object.

metapub.config.get_data_log(filepath, name='metapub.data')[source]: Sets up a file-based logger for data logging and returns its log object.

metapub.convert module

Convert.pmid2doi / Convert.doi2pmid / Convert.bookid2pmid

Usage:

convert -h
convert pmid2doi <pmid> [options]
convert doi2pmid <doi> [options]
convert bookid2pmid <book_id> [options]

Options:

-h, --help: Show this help page
-v, --version: Show this command’s version.
-q, --quiet: Shut up all that log garbage.
-d, --debug: No wait, give me ALL the log garbage! Superseded by --quiet.
-a, --article: Also print out the article information (from PubMedArticle) if possible.
-w, --work: Also print out info from the CrossRef entry, if possible.

metapub.convert.interpret_pmids_for_citation_results(pmids)[source]

metapub.convert.PubMedArticle2doi(pma)[source]

Starting with a PubMedArticle object, use CrossRef to find a DOI for given article.

Parameters:: pma (PubMedArticle)
Returns:: doi (str) or None

metapub.convert.pmid2doi(pmid)[source]

Starting with a PubMed ID, looks up the article in PubMed. If a DOI is found in the PubMedArticle object, returns it. Otherwise, uses CrossRef to find the DOI for the given article.

Parameters:

pmid (str or int)

Returns:

doi (str) or None

Raises:

InvalidPMID (if pmid is invalid) –
NCBIServiceError (if NCBI services are down) –

metapub.convert.doi2pmid(doi)[source]

Uses CrossRef and PubMed eutils to look up a PMID given a known DOI.

Warning: NO validation of input DOI is performed here. Use metapub.text_mining.find_doi_in_string beforehand if needed.

If a PMID can be found, return it. Otherwise return None.

In very rare cases, use of the CrossRef->pubmed citation method used here may result in more than one pubmed ID. In this case, this function will return instead the word ‘AMBIGUOUS’.

Parameters:: doi – (str)
Returns:: pmid (str) if found; ‘AMBIGUOUS’ if citation count > 1; None if no results.
Raises:: NCBIServiceError if NCBI services are down
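
Because doi2pmid has three distinct outcomes (a PMID, the string 'AMBIGUOUS', or None), callers usually need to branch on the result. The helper below is a hypothetical sketch of that branching and does not call metapub; the PMID in the usage line is a placeholder.

```python
def describe_doi2pmid_result(result):
    """Turn the three documented doi2pmid outcomes into a readable message."""
    if result is None:
        return "no PMID found"
    if result == "AMBIGUOUS":
        return "more than one candidate PMID; resolve manually"
    return f"PMID {result}"


for outcome in (None, "AMBIGUOUS", "12345678"):  # last value is a placeholder
    print(describe_doi2pmid_result(outcome))
```

Checking for None and 'AMBIGUOUS' before treating the value as an identifier avoids passing the sentinel string on to later lookups.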

metapub.convert.bookid2pmid(book_id)[source]: Convenience interface to PubMedFetcher.pmid_for_bookID

metapub.convert.main()[source]

metapub.crossref module

metapub.crossref.get_most_similar_work_from_crossref_results(qstring, qname, cr_results)[source]

Uses Levenshtein distance on result title to rank CrossRef results. Returns top candidate for a match from these items based on comparison title.

Parameters:

qstring – (str) original query string for search
qname – (str) name of query item (e.g. “title”)
cr_results – (dict) crossref results as returned by habanero

Returns:

{‘title_ld’: <score>, ‘work’: <CrossRefWork or None>}
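
The ranking idea can be sketched with a plain dynamic-programming Levenshtein distance. Everything here is illustrative: the function names are hypothetical, candidates are plain title strings rather than CrossRef records, and the normalisation (1 minus distance over the longer length) is an assumption; the library may compute its ratio differently.

```python
def levenshtein(a, b):
    """Classic edit distance using two rolling rows."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Normalise distance to a 0..1 score, where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def most_similar(query, titles):
    """Return the best-scoring title in the same shape as the documented result."""
    best, best_score = None, 0.0
    for title in titles:
        score = similarity(query.lower(), title.lower())
        if score > best_score:
            best, best_score = title, score
    return {"title_ld": best_score, "work": best}


print(most_similar("dilated cardiomyopathy", ["Dilated cardiomyopathy", "asthma"]))
```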

class metapub.crossref.CrossRefWork(**kwargs)[source]

Bases: object

Represents one ‘work’ from CrossRef search results.

__init__(**kwargs)[source]

property first_page: Returns first page (number) of article as string, or None if self.page is empty.

property citation: Returns a formal citation string for this work.

property pubyear

property pubmonth

property pubdate

property author1

property author1_last_fm

property authors_str_lastfirst: Returns this work’s authors as a semicolon-separated string – LASTNAME FIRSTInitial.

property author_list: Returns this work’s authors as a flat list (Firstname Lastname), retaining order given by Crossref.

property author_list_last_fm: Returns this work’s authors as a flat list (Lastname FirstInitial), retaining order given by Crossref.

to_citation()[source]: Describes this work as a dictionary suitable for citation lookups in PubMed.

to_dict()[source]: Describes this Work as a dictionary similar to the one returned by CrossRef.

class metapub.crossref.CrossRefFetcher(**kwargs)[source]

Bases: Borg

Valid field queries for this route are: affiliation, degree, event-acronym, bibliographic, container-title, publisher-name, author, event-theme, standards-body-acronym, chair, event-location, translator, funder-name, event-name, publisher-location, title, standards-body-name, contributor, editor, event-sponsor

__init__(**kwargs)[source]

article_by_doi(doi)[source]

Returns a CrossRefWork object loaded by querying the Crossref works/DOI REST endpoint.

Parameters:: doi – (str)
Return type:: CrossRefWork
Raises:: HTTPError (404) if DOI not found.
Raises:: Exception for network/service issues

article_by_pma(pma, ideal_ld=0.95, min_ld=0.8)[source]

From a PubMedArticle object, use as much info as needed to get as precise a match on CrossRef as is possible.

1st attempt: Title + Journal. Runs Levenshtein distance on results; if any results have a better similarity ratio than ideal_ld, the top of these results will be returned. Otherwise, the first item with a score better than min_ld will be kept and compared against 2nd attempt results.
2nd attempt: Title + First Author. Same process as 1st attempt but with any candidates found in 1st attempt submitted for comparison.

Finally: Return None or CrossRefWork from best candidate that exceeds min_ld requirement.

Parameters:

pma – PubMedArticle object
ideal_ld – (float) [default: set in global at top of crossref.py]
min_ld – (float) [default: set in global at top of crossref.py]

Return type:

CrossRefWork
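
One simplified reading of the two-threshold logic in article_by_pma is sketched below. It is not the library's code: `pick_candidate` is a hypothetical function, each attempt is represented as a list of already-scored (score, work) pairs, and the strict comparisons are an assumption drawn from the wording "better than".

```python
def pick_candidate(attempts, ideal_ld=0.95, min_ld=0.8):
    """Return the first work beating ideal_ld, else the best work beating min_ld, else None."""
    best = None
    for results in attempts:
        # Highest score first; sort on the score only so works need not be comparable.
        for score, work in sorted(results, key=lambda pair: pair[0], reverse=True):
            if score > ideal_ld:
                return work
            if score > min_ld and (best is None or score > best[0]):
                best = (score, work)
    return best[1] if best else None


title_journal = [(0.85, "work A"), (0.50, "work B")]   # 1st attempt
title_author = [(0.97, "work C")]                      # 2nd attempt
print(pick_candidate([title_journal, title_author]))
```

Keeping a fallback candidate from the first attempt means a merely acceptable match is still returned when the second search finds nothing better.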

article_by_title(title, **kwargs)[source]

Use CrossRef to find a work by its title. Returns first item in the list.

Keywords are passed unmodified to crossref.works() [habanero].

Parameters:: title – str
Return type:: CrossRefWork or None (if no results)

metapub.dx_doi module

class metapub.dx_doi.DxDOI(retries=1, **kwargs)[source]

Bases: Borg

Looks up DOIs in dx.doi.org and caches results in an SQLite cache. This is a Borg singleton object.

resolve(doi, *args)[source]: uses supplied doi to get link to publisher.

check_doi(doi, *args)[source]: returns doi if supplied DOI is good, raises BadDOI if not good.

__init__(retries=1, **kwargs)[source]

check_doi(doi, whitespace=False)[source]

Checks validity of supplied doi.

If whitespace is True (default False), allows supplied doi to contain whitespace.

Parameters:

doi – (str)
whitespace – (bool)

Returns:

doi (str) – verified DOI

Raises:

BadDOI – if supplied DOI fails regular expression check.
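
A regular-expression check of this kind could look like the sketch below. The patterns are an assumption (a common loose DOI shape: `10.`, a 4 to 9 digit registrant code, a slash, then a suffix), not the expression metapub actually uses, and `check_doi_sketch` and `BadDOISketch` are hypothetical names.

```python
import re


class BadDOISketch(ValueError):
    """Hypothetical stand-in for BadDOI."""


DOI_STRICT = re.compile(r"^10\.\d{4,9}/\S+$")    # no whitespace in the suffix
DOI_LOOSE = re.compile(r"^10\.\d{4,9}/.+$")      # whitespace tolerated


def check_doi_sketch(doi, whitespace=False):
    """Return the DOI if it matches the chosen pattern, else raise BadDOISketch."""
    candidate = doi.strip()
    pattern = DOI_LOOSE if whitespace else DOI_STRICT
    if not pattern.match(candidate):
        raise BadDOISketch(f"not a plausible DOI: {doi!r}")
    return candidate


print(check_doi_sketch("10.1172/JCI62862"))
```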

resolve(doi, check_doi=True, whitespace=False, skip_cache=False)[source]

Takes a doi (string), returns a url to article page on journal website.

if check_doi is True (default True), checks DOI before submitting query to dx.doi.org.

if whitespace is True (default False), allows prospective dois to contain whitespace when checked.

if skip_cache is True (default False), doesn’t check cache for pre-existing results (loads from remote dx.doi.org).

Parameters:

doi – (str)
check_doi – (bool)
whitespace – (bool)
skip_cache – (bool)

Returns:

url (str)

Raises:

BadDOI – if supplied DOI failed regular expression check
DxDOIError – if not-ok HTTP status code while loading url
ConnectionError – if problem making dx.doi.org connection

metapub.eutils_common module

metapub.eutils_common.get_eutils_client(cache_path, cache=None)[source]

Parameters:

cache_path – valid filesystem path to SQLite cache file
api_key – (optional) NCBI API Key obtainable from https://www.ncbi.nlm.nih.gov

Returns:

lightweight NCBI client object (drop-in replacement for eutils)

metapub.exceptions module

exception metapub.exceptions.MetaPubError[source]

Bases: Exception

Base Exception class from which all other exceptions in this library derive.

__init__(*args, **kwargs)

add_note(): Exception.add_note(note) – add a note to the exception

args

with_traceback(): Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

exception metapub.exceptions.BaseXMLError[source]

Bases: MetaPubError

Raised when XML needed to instantiate an object fails at the most basic level.

__init__(*args, **kwargs)

add_note(): Exception.add_note(note) – add a note to the exception

args

with_traceback(): Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

exception metapub.exceptions.InvalidPMID[source]

Bases: MetaPubError

Raised when NCBI efetch of a pubmed ID results in “invalid” response.

__init__(*args, **kwargs)

add_note(): Exception.add_note(note) – add a note to the exception

args

with_traceback(): Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

exception metapub.exceptions.InvalidBookID[source]

Bases: MetaPubError

Raised when attempting to lookup an NCBI book with something that doesn’t look like a Book ID.

__init__(*args, **kwargs)

add_note(): Exception.add_note(note) – add a note to the exception

args

with_traceback(): Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

exception metapub.exceptions.CrossRefConnectionError[source]

Bases: MetaPubError

Raised when a well-formed CrossRef query results in a server error.

__init__(*args, **kwargs)

add_note(): Exception.add_note(note) – add a note to the exception

args

with_traceback(): Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

exception metapub.exceptions.NoPDFLink(reason, *args, **kwargs)[source]

Bases: MetaPubError

Raised when a FindIt url lookup fails for some specific reason that is particular to the journal or publisher.

This Exception provides extended attributes:

reason : human-readable “reason” why URL lookup failed.
url : last url attempted
status_code : last HTTP code returned in attempt (if any)
missing : list of data items missing from last attempt (if any)

This Exception is mostly used internally in FindIt as flow control.

__init__(reason, *args, **kwargs)[source]

add_note(): Exception.add_note(note) – add a note to the exception

args

with_traceback(): Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

exception metapub.exceptions.AccessDenied(reason, *args, **kwargs)[source]

Bases: NoPDFLink

Raised when a FindIt url lookup fails for some specific reason that is particular to the journal or publisher.

__init__(reason, *args, **kwargs)

add_note(): Exception.add_note(note) – add a note to the exception

args

with_traceback(): Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

exception metapub.exceptions.BadDOI[source]

Bases: MetaPubError

Raised when DxDOI class tests validity of DOI and it fails to pass muster.

__init__(*args, **kwargs)

add_note(): Exception.add_note(note) – add a note to the exception

args

with_traceback(): Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

exception metapub.exceptions.DxDOIError[source]

Bases: MetaPubError

Raised when a bad status code comes from loading dx.doi.org

__init__(*args, **kwargs)

add_note(): Exception.add_note(note) – add a note to the exception

args

with_traceback(): Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

metapub.medgenconcept module

metapub.medgenconcept – MedGenConcept class instantiated by supplying ESummary XML string.

class metapub.medgenconcept.MedGenConcept(xmlstr, *args, **kwargs)[source]

Bases: MetaPubObject

__init__(xmlstr, *args, **kwargs)[source]

Instantiate with “xml” as string or bytes containing valid XML.

Supply name of root element (string) to set virtual top level. (optional).

to_dict()[source]: returns a dictionary composed of all extractable properties of this concept.

property synonyms: Returns a list of the ‘name’ values from self.names.

property medgen_uid: Synonym for “uid”. Sometimes when juggling concepts from multiple places, this helps.

static parse_xml(xml, root=None)

Takes xml (str or bytes) and (optionally) a root element definition string.

If root element defined, DOM object returned is rebased with this element as root.

Parameters:

xml (str or bytes)
root (str) – (optional) name of root element

Returns:

lxml document object.

metapub.medgenfetcher module

metapub.MedGenFetcher – tools to deal with NCBI’s E-utilities interface to the MedGen db

class metapub.medgenfetcher.MedGenFetcher(a Borg singleton object)[source]

Bases: Borg

An interaction layer for querying to return MedGenConcept objects.

Currently available methods: eutils

Basic Usage:

fetch = MedGenFetcher()

To specify a service method (more coming soon):

fetch = MedGenFetcher('eutils')

To return a MedGenConcept from a known UID:

concept = fetch.concept_by_uid(known_UID)

To return a list of UIDs relevant to a given term known in medgen:

uids = fetch.uids_by_term(some_term)

To get a medgen UID given a known Concept ID (cui):

uid = fetch.uid_for_cui(known_cui)

__init__(method='eutils', cachedir='default')[source]

Initialize MedGenFetcher for medical genetics concept retrieval.

Parameters:

method (str, optional) – Service method to use. Currently only ‘eutils’ is supported. Defaults to ‘eutils’.
cachedir (str, optional) – Directory for caching responses. Use ‘default’ for system cache directory. Defaults to ‘default’.

Raises:

NotImplementedError – If an unsupported method is specified.

Note

This is a Borg singleton - all instances share the same state. Provides access to NCBI’s MedGen database for medical genetics concepts, diseases, and gene-phenotype relationships.

metapub.ncbi_errors module

NCBI Service Error Detection and User-Friendly Error Messages

This module provides intelligent error detection for NCBI service outages and converts cryptic network/XML errors into clear, actionable user messages.

class metapub.ncbi_errors.ServiceStatus(is_available, error_type=None, error_message=None, response_time=None, status_code=None)[source]

Bases: object

Status information for NCBI services.

is_available: bool

error_type: str | None = None

error_message: str | None = None

response_time: float | None = None

status_code: int | None = None

__init__(is_available, error_type=None, error_message=None, response_time=None, status_code=None)

class metapub.ncbi_errors.NCBIErrorDetector[source]

Bases: object

Detects and categorizes NCBI service errors.

__init__()[source]

check_service_status(url, timeout=10)[source]

Check if NCBI service is available and responding properly.

diagnose_error(exception, url=None)[source]

Analyze an exception and provide user-friendly diagnosis.

metapub.ncbi_errors.check_ncbi_status(url='https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi')[source]

Quick function to check NCBI service status.

metapub.ncbi_errors.diagnose_ncbi_error(exception, url=None)[source]

Quick function to diagnose NCBI-related errors.

metapub.ncbi_errors.format_user_error(exception, url=None)[source]

Format a user-friendly error message with suggestions.

exception metapub.ncbi_errors.NCBIServiceError(message, error_type='unknown', suggestions=None)[source]

Bases: Exception

Custom exception for NCBI service issues.

__init__(message, error_type='unknown', suggestions=None)[source]

add_note(): Exception.add_note(note) – add a note to the exception

args

with_traceback(): Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

metapub.ncbi_errors.handle_ncbi_request_error(func)[source]: Decorator to wrap NCBI API calls with intelligent error handling.
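
The general shape of an error-translating decorator is sketched below. It only illustrates the technique: `friendly_errors` and `ServiceErrorSketch` are hypothetical names, and the set of exceptions caught and the wording of the message are assumptions rather than what handle_ncbi_request_error does.

```python
import functools


class ServiceErrorSketch(Exception):
    """Hypothetical stand-in shaped like NCBIServiceError."""

    def __init__(self, message, error_type="unknown", suggestions=None):
        super().__init__(message)
        self.error_type = error_type
        self.suggestions = suggestions or []


def friendly_errors(func):
    """Re-raise low-level network errors as a single descriptive exception."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except (ConnectionError, TimeoutError) as error:
            raise ServiceErrorSketch(
                f"service unreachable: {error}",
                error_type="network",
                suggestions=["retry later"],
            ) from error

    return wrapper


@friendly_errors
def flaky_fetch():
    raise ConnectionError("connection reset")


try:
    flaky_fetch()
except ServiceErrorSketch as error:
    print(error, error.error_type, error.suggestions)
```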

metapub.ncbi_health_check module

NCBI Service Health Check Utility

A command-line tool to check the status of various NCBI services used by metapub. Helps diagnose service outages and determine which endpoints are affected.

Usage:

python ncbi_health_check.py # Check all services
python ncbi_health_check.py --quick # Check only essential services
python ncbi_health_check.py --json # Output results as JSON

class metapub.ncbi_health_check.ServiceResult(name, url, status, response_time, status_code=None, error_message=None, details=None)[source]

Bases: object

Result of checking a single NCBI service.

name: str

url: str

status: str

response_time: float

status_code: int | None = None

error_message: str | None = None

details: str | None = None

__init__(name, url, status, response_time, status_code=None, error_message=None, details=None)
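
A result record with these fields lends itself to JSON output. The sketch below mirrors the documented field names in a hypothetical dataclass and serialises it with the standard library; the status value "up" and the timing are made-up sample data, and the real tool's JSON layout may differ.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class ServiceResultSketch:
    """Hypothetical mirror of the documented ServiceResult fields."""

    name: str
    url: str
    status: str
    response_time: float
    status_code: Optional[int] = None
    error_message: Optional[str] = None
    details: Optional[str] = None


result = ServiceResultSketch(
    name="efetch",
    url="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi",
    status="up",          # sample value; the real status vocabulary is not shown here
    response_time=0.42,   # sample value
    status_code=200,
)
payload = json.dumps(asdict(result), indent=2)
print(payload)
```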

class metapub.ncbi_health_check.NCBIHealthChecker(timeout=10)[source]

Bases: object

Health checker for NCBI services.

__init__(timeout=10)[source]

check_service(service_id, config)[source]

Check a single NCBI service.

check_all_services(quick=False)[source]

Check all services with conservative rate limiting.

metapub.ncbi_health_check.print_status_icon(status)[source]

Get emoji/icon for status.

metapub.ncbi_health_check.print_results(results, show_details=True)[source]

Print results in human-readable format.

metapub.ncbi_health_check.main()[source]: Main CLI function.

metapub.pubmed_clinicalqueries module

metapub.pubmedarticle module

metapub.pubmedarticle – PubMedArticle class instantiated by supplying ncbi XML string.

class metapub.pubmedarticle.PubMedArticle(xmlstr, *args, **kwargs)[source]

Bases: MetaPubObject

This PubMedArticle class receives an XML string as its required argument and parses it into its constituent parts, exposing them as attributes.

Usage:: paper = PubMedArticle(xml_string)

To query services to return an article by pmid, use PubMedFetcher, which returns PubMedArticle objects.

When xmlstr is parsed, the pubmed_type attribute will be set to one of ‘article’ or ‘book’, depending on whether PubmedBookArticle or PubmedArticle headings are found in the supplied xmlstr at instantiation.

Since this class needs to work seamlessly in production whether it’s a book or an article, the PubmedArticle attributes will always be available (set to None in many cases for PubmedBookArticle, e.g. volume, issue, journal), but PubmedBookArticle attributes will only be set when pubmed_type=’book’.

PubMedBook special handling of certain attributes:

abstract: a joined string from self.book_abstracts
title: comes from ArticleTitle

Special attributes for PubmedBookArticle (pubmed_type=’book’):

book_id (default: None) - string from IdType=”bookaccession”, e.g. “NBK1403”
book_title (default: None) - string with name of book (as differentiated from ArticleTitle)
book_publisher (default: None) - dict containing {‘name’: string, ‘location’: string}
book_sections (default: []) - dict with key->value pairs as section_name->SectionTitle
book_contribution_date (default: None) - python datetime date
book_date_revised (default: None) - python datetime date
book_history (default: []) - dictionary with key->value pairs as PubStatus -> python datetime
book_language (default: None) - string (e.g. “eng”)
book_editors (default: []) - list containing names from ‘editors’ AuthorList
book_abstracts (default: []) - dict with key->value pairs as Label->AbstractText.text
book_medium (default: None) - string (e.g. “Internet”)
book_synonyms (default: None) - list of disease synonyms (applicable to “gene” book)
book_publication_status (default: None) - string (e.g. “ppublish”)

__init__(xmlstr, *args, **kwargs)[source]

Initialize PubMedArticle from NCBI XML data.

Parameters:

xmlstr (str) – XML string from NCBI containing PubmedArticle or PubmedBookArticle data.
*args – Additional positional arguments passed to parent class.
**kwargs – Additional keyword arguments passed to parent class.

Note

The XML type is automatically detected to handle both regular articles and book chapters. The pubmed_type attribute will be set to ‘article’ or ‘book’ accordingly, and appropriate attributes will be populated.

to_dict()[source]

Convert PubMedArticle to dictionary representation.

Returns:

Dictionary containing all article attributes except internal XML content and processing attributes.

Return type:

Dict[str, Any]

Note

Excludes ‘content’, ‘xml’, and ‘_root’ attributes from the output to provide a clean data representation suitable for serialization.
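
Filtering internal attributes out of a dictionary view can be sketched in a few lines. `RecordSketch` is a hypothetical class used only to show the technique; it is not how PubMedArticle stores its data.

```python
class RecordSketch:
    """Hypothetical object holding parsed values alongside internal XML state."""

    EXCLUDED = {"content", "xml", "_root"}

    def __init__(self, xml, title):
        self.xml = xml
        self.content = object()   # stands in for a parsed document
        self._root = None
        self.title = title

    def to_dict(self):
        """Return public attributes only, leaving out the internal ones."""
        return {key: value for key, value in vars(self).items()
                if key not in self.EXCLUDED}


record = RecordSketch("<PubmedArticle/>", "Example title")
print(record.to_dict())
```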

property citation

Returns a formatted citation string built from this article’s author(s), title, journal, year, volume, pages, and doi.

Article Example:

McNally EM, et al. Genetic mutations and mechanisms in dilated cardiomyopathy. Journal of Clinical Investigation. 2013; 123:19-26. doi: 10.1172/JCI62862.

Book Example (GeneReviews):

Tranebjarg L, et al. Jervell and Lange-Nielsen syndrome. 2002 Jul 29 (Updated 2014 Nov 20). In: Pagon RA, et al., editors. GeneReviews (Internet). Seattle (WA): University of Washington, Seattle; 1993-2015. Available from: https://www.ncbi.nlm.nih.gov/books/NBK1405/.

property citation_html

Returns a formatted citation string built from this article’s author(s), title, journal, year, volume, and pages.

Article Example:

McNally EM, et al. Genetic mutations and mechanisms in dilated cardiomyopathy. Journal of Clinical Investigation. 2013; 123:19-26. doi: 10.1172/JCI62862.

GeneReviews Example: Tranebjarg L, et al. Jervell and Lange-Nielsen syndrome. 2002 Jul 29 (Updated 2014 Nov 20). In: Pagon RA, et al., editors. GeneReviews (Internet). Seattle (WA): University of Washington, Seattle; 1993-2015. Available from: https://www.ncbi.nlm.nih.gov/books/NBK1405/.

property citation_bibtex

property pubdate

Normalized publication date as datetime object.

Returns the best available publication date from PubMed XML in order of preference:

1. Article PubDate (Year/Month/Day or MedlineDate)
2. Book contribution date
3. History dates (pubmed, entrez, etc.)

Returns:: Publication date as datetime object, or None if no date found
Return type:: datetime or None

Example

    article = fetch.article_by_pmid('12345')
    if article.pubdate:
        print(f"Published: {article.pubdate.strftime('%Y-%m-%d')}")

static parse_xml(xml, root=None)

Takes xml (str or bytes) and (optionally) a root element definition string.

If root element defined, DOM object returned is rebased with this element as root.

Parameters:

xml (str or bytes)
root (str) – (optional) name of root element

Returns:

lxml document object.

metapub.pubmedarticle.square_voliss_data_for_pma(pma)[source]: Takes a PubMedArticle object, returns same object with corrected volume/issue information (if needed)

metapub.pubmedarticle.determine_pubmed_xml_type(xmlstr)[source]

Returns string “type” of pubmed article XML based on presence of expected strings.

Possible returns:: ‘article’, ‘book’, ‘unknown’

Parameters:: xmlstr – xml in any data type (str, bytes, unicode…)
Return type:: str
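A minimal sketch of string-presence detection follows. The function name guess_pubmed_xml_type is hypothetical, and the two marker strings are assumptions based on the element names used in PubMed XML (PubmedArticle and PubmedBookArticle); the library may check for different strings.

```python
from typing import Union


def guess_pubmed_xml_type(xmlstr: Union[str, bytes]) -> str:
    """Hypothetical sketch: classify PubMed XML by marker strings."""
    if isinstance(xmlstr, bytes):
        xmlstr = xmlstr.decode("utf-8", errors="replace")
    # Check the book marker first: a book record wrapped in
    # <PubmedArticleSet> would also match the article marker.
    if "<PubmedBookArticle" in xmlstr:
        return "book"
    if "<PubmedArticle" in xmlstr:
        return "article"
    return "unknown"
```

Checking substrings avoids a full XML parse, at the cost of misclassifying input that merely mentions one of the markers in text.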

metapub.pubmedauthor module

metapub.pubmedauthor – PubMedAuthor class instantiated from an NCBI Author XML element

class metapub.pubmedauthor.PubMedAuthor(xmlelem, *args, **kwargs)[source]

Bases: MetaPubObject

This PubMedAuthor class receives an XML element as its required argument and parses it into its parts, exposing them as attributes.

Usage:: author = PubMedAuthor(xml_elem)

To retrieve the standard representation of an author name, use the __str__ method.

(About unicode: metapub uses unicode_literals in both py3 and py2, so the str() function returns unicode, unless called by a py2k “str()” statement in which unicode_literals is off.)

__init__(xmlelem, *args, **kwargs)[source]

Instantiate with “xml” as string or bytes containing valid XML.

Supply name of root element (string) to set virtual top level. (optional).

to_dict()[source]

static parse_xml(xml, root=None)

Takes xml (str or bytes) and (optionally) a root element definition string.

If root element defined, DOM object returned is rebased with this element as root.

Parameters:

xml (str or bytes)
root (str) – (optional) name of root element

Returns:

lxml document object.

metapub.pubmedcentral module

An assortment of functions providing access to various web APIs.

The pubmedcentral.* functions abstract the submission of one of the following acceptable IDs to the Pubmed Central ID Conversion API as a lookup to get another ID mapping to the same pubmed article:

doi Digital Object Identifier

pmid Pubmed ID

pmcid Pubmed Central ID (includes Versioned Identifier)

Available functions:

get_pmid_for_otherid(string)

get_doi_for_otherid(string)

get_pmcid_for_otherid(string)

metapub.pubmedcentral.get_pmid_for_otherid(otherid)[source]

Use the PMC ID conversion API to attempt to convert either PMCID or DOI to a PMID. Returns PMID if successful, or None if there is no ‘pmid’ item in the response.

Parameters:: otherid – (str)
Return pmid:: (str)
Return type:: str

metapub.pubmedcentral.get_pmcid_for_otherid(otherid)[source]

Use the PMC ID conversion API to attempt to convert either PMID or DOI to a PMCID. Returns PMCID if successful, or None if there is no ‘pmcid’ item in the response.

Parameters:: otherid – (str)
Return pmcid:: (str)
Return type:: str

metapub.pubmedcentral.get_doi_for_otherid(otherid)[source]

Use the PMC ID conversion API to attempt to convert either PMID or PMCID to a DOI. Returns DOI if successful, or None if there is no ‘doi’ item in the response.

Note: this method has a very low success rate for retrieving DOIs. Consider the CrossRef object (from metapub import CrossRef), which excels at resolving citations into DOIs (and DOIs into citations).

Parameters:: otherid – (str)
Return doi:: (str)
Return type:: str

metapub.pubmedfetcher module

metapub.PubMedFetcher – tools to deal with NCBI’s E-utilities interface to PubMed

metapub.pubmedfetcher.get_uids_from_esearch_result(xmlstr)[source]

Extract unique identifiers from an ESearch XML result.

Parameters:: xmlstr (str) – XML string returned from NCBI ESearch query.
Returns:: List of PMID strings extracted from the XML.
Return type:: List[str]
Raises:: NCBIServiceError – If XML parsing fails due to NCBI service issues.
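As an illustration of the extraction step, here is a minimal sketch using only the standard library. The function name uids_from_esearch and the sample XML are hypothetical. The sketch assumes the usual ESearch layout of Id elements inside an IdList, and it does not reproduce the NCBIServiceError handling described above.

```python
import xml.etree.ElementTree as ET
from typing import List


def uids_from_esearch(xmlstr: str) -> List[str]:
    """Hypothetical sketch: collect the text of every IdList/Id element."""
    root = ET.fromstring(xmlstr)
    return [elem.text for elem in root.findall("./IdList/Id") if elem.text]


# Hand-written sample in the shape of an ESearch response.
sample = """<eSearchResult>
  <Count>2</Count>
  <IdList><Id>14873513</Id><Id>17108762</Id></IdList>
</eSearchResult>"""
pmids = uids_from_esearch(sample)
```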

metapub.pubmedfetcher.parse_related_pmids_result(xmlstr)[source]

Parse XML results from ELink query for related PMIDs.

Parameters:

xmlstr (str) – XML string returned from NCBI ELink query.

Returns:

Dictionary mapping relationship types to lists of PMIDs. Common keys include ‘pubmed’, ‘reviews’, ‘cited’, etc.

Return type:

Dict[str, List[str]]

Raises:

NCBIServiceError – If XML parsing fails due to NCBI service issues.
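The grouping by relationship type might look like the sketch below. It assumes ELink responses carry one LinkSetDb element per relationship, each with a LinkName such as pubmed_pubmed or pubmed_pubmed_citedin, and that the dictionary key is the last underscore-separated part of that name. Both the key rule and the function name related_ids_by_type are assumptions for illustration, not the library's confirmed behavior.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List


def related_ids_by_type(xmlstr: str) -> Dict[str, List[str]]:
    """Hypothetical sketch: group linked IDs by relationship name."""
    root = ET.fromstring(xmlstr)
    out = defaultdict(list)
    for linkset_db in root.iter("LinkSetDb"):
        name = linkset_db.findtext("LinkName", default="")
        # Assumed key rule: "pubmed_pubmed_citedin" -> "citedin",
        # "pubmed_pubmed" -> "pubmed", missing name -> "unknown".
        key = name.rsplit("_", 1)[-1] or "unknown"
        for id_elem in linkset_db.findall("./Link/Id"):
            if id_elem.text:
                out[key].append(id_elem.text)
    return dict(out)


# Hand-written sample in the shape of an ELink response.
sample = """<eLinkResult><LinkSet>
  <DbFrom>pubmed</DbFrom>
  <LinkSetDb><DbTo>pubmed</DbTo><LinkName>pubmed_pubmed</LinkName>
    <Link><Id>111</Id></Link><Link><Id>222</Id></Link></LinkSetDb>
  <LinkSetDb><DbTo>pubmed</DbTo><LinkName>pubmed_pubmed_citedin</LinkName>
    <Link><Id>333</Id></Link></LinkSetDb>
</LinkSet></eLinkResult>"""
related = related_ids_by_type(sample)
```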

class metapub.pubmedfetcher.PubMedFetcher(a Borg singleton object backed by an optional SQLite cache)[source]

Bases: Borg

An interaction layer for querying via specified method to return PubMedArticle objects.

Currently available methods: eutils

Basic Usage:

fetch = PubMedFetcher()

To specify a service method (more coming soon):

fetch = PubMedFetcher(‘eutils’)

To return an article by querying the service with a known PMID or NCBI Book ID:

paper = fetch.article_by_pmid('123456')
book = fetch.article_by_pmid('NBK1234')

Similar methods exist for returning papers by DOI and PM Central id:

paper = fetch.article_by_doi('10.1038/ng.379')
paper = fetch.article_by_pmcid('PMC3458974')

Finally, you can search for PMIDs via citation details by using the pmids_for_citation method, for which you usually only need 3 out of 5 details to triangulate on a good result.

pmids = fetch.pmids_for_citation(journal='Science', year='2008', volume='4',
                                 first_page='7', author_name='Grant')

__init__(method='eutils', **kwargs)[source]

Initialize PubMedFetcher with specified service method.

Parameters:

method (str, optional) – Service method to use. Currently only ‘eutils’ is supported. Defaults to ‘eutils’.
**kwargs – Additional keyword arguments:
cachedir (str, optional) – Custom directory for caching responses. If not provided, uses default cache directory.

Raises:

NotImplementedError – If an unsupported method is specified.

Note

This is a Borg singleton - all instances share the same state.
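The shared-state behavior can be seen in a minimal sketch of the general Borg pattern. This is an illustration of the pattern, not the library's own Borg class; the Fetcher subclass and the cache_hits attribute are hypothetical.

```python
class Borg:
    """Every instance shares one attribute dictionary."""

    _shared_state = {}

    def __init__(self):
        # Point this instance's attribute storage at the shared dict.
        self.__dict__ = self._shared_state


class Fetcher(Borg):
    """Hypothetical subclass standing in for a fetcher."""

    def __init__(self, method="eutils"):
        super().__init__()
        if method != "eutils":
            raise NotImplementedError("unsupported method: %s" % method)
        self.method = method


a = Fetcher()
a.cache_hits = 3   # set on one instance...
b = Fetcher()      # ...and visible on a separately created one
```

Unlike a classic singleton, a and b are distinct objects; only their attributes are shared, so state set through one instance is seen through every other.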

pmids_for_clinical_query(query, category, optimization='broad', since=None, until=None, retstart=0, retmax=250, pmc_only=False, **kwargs)[source]

Takes a query and a category (required, see below) and returns a list of pubmed IDs returned by NCBI for that query.

See also PubMedFetcher.pmids_for_query for other parameters.

Available categories:

therapy
diagnosis
etiology
prognosis
prediction

Available optimizations:

broad (default)
narrow

Param:: query (string)
Param:: category (string)
Param:: optimization (string) [default: broad]
Returns:: list of pubmed IDs

pmids_for_medical_genetics_query(query, category='all', since=None, until=None, retstart=0, retmax=250, pmc_only=False, **kwargs)[source]

Takes a query and a category (see below) and returns a list of pubmed IDs returned by NCBI for that query.

See also PubMedFetcher.pmids_for_query for other parameters.

Available categories:

all (default)
diagnosis
differential_diagnosis
clinical_description
management
genetic_counseling
genetic_testing

Param:: query (string)
Param:: category (string) [default: all]
Returns:: list of pubmed IDs

pmids_for_citation(**kwargs)[source]

Returns a list of PMIDs for the given citation. Requires at least 3 of these 5 keyword arguments:

jtitle or journal (journal title)
year or date
volume
spage or first_page (starting page / first page)
aulast (first author’s last name) or author1_first_lastfm (as produced by PubMedArticle class)

Strings submitted for journal/jtitle will be run through metapub.utils.remove_chars to deal with HTML-encoded characters and to remove punctuation.
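The "at least 3 of 5" requirement can be sketched as a small argument check. The helper name collect_citation_details, the normalized field names, and the choice of ValueError are all hypothetical; only the keyword aliases are taken from the description above.

```python
from typing import Any, Dict

# Each citation detail and the keyword aliases listed above.
ALIASES = {
    "journal": ("jtitle", "journal"),
    "year": ("year", "date"),
    "volume": ("volume",),
    "first_page": ("spage", "first_page"),
    "author": ("aulast", "author1_first_lastfm"),
}


def collect_citation_details(**kwargs: Any) -> Dict[str, Any]:
    """Hypothetical sketch: normalize aliases and enforce the 3-of-5 rule."""
    details = {}
    for field, names in ALIASES.items():
        for name in names:
            if kwargs.get(name):
                details[field] = kwargs[name]
                break  # first non-empty alias wins
    if len(details) < 3:
        raise ValueError(
            "need at least 3 of 5 citation details, got %d" % len(details)
        )
    return details
```

Counting normalized fields rather than raw keyword arguments means that passing both jtitle and journal still counts as one detail.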

related_pmids(pmid)[source]

For the supplied pmid, return the IDs of related pubmed articles, organized into a dictionary keyed by type of relation. The keys include:

pubmed (all related links)

citedin (papers that cited this paper)

five (the “five” that pubmed displays as the top related results)

reviews (review papers that cite this paper)

combined (?)

query example: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?retmode=xml&dbfrom=pubmed&id=14873513&cmd=neighbor

Raises:: NCBIServiceError if NCBI ELink service is down

pmid_for_bookID(book_id)[source]

For supplied NCBI Book ID, use the pubmed advanced query API to find its PMID.

Not all NCBI Books have PMIDs. If there is no associated PMID, this returns None.

Parameters:: book_id – (str) e.g. “NBK2020”
Returns:: (str) or None – PMID if found, None otherwise.

metapub.pubmedfetcher_cli module

pubmed_article: utility for fetching an article by PMID.

Usage:: pubmed_article <pmid>

Options:

-h, --help

Print this screen.

-v, --version

Print the version of this program.

-a, --abstract

Include the abstract.

-f, --full

Print the full article, if possible. (experimental)

metapub.pubmedfetcher_cli.print_pma(pmid)[source]: Takes a PMID and prints a stringified PubMedArticle to the command line.

metapub.pubmedfetcher_cli.main()[source]

metapub.text_mining module

metapub.text_mining.findall_ncbi_bookIDs(text)[source]

GeneReviews books look like this: NBK1210 (see https://www.ncbi.nlm.nih.gov/pubmed/?term=NBK1210 )

Parameters:: text
Returns book_ids:: list of IDs (possibly empty)

metapub.text_mining.is_ncbi_bookID(book_id)[source]: Returns whether supplied book_id appears to be an NCBI book ID (e.g. “NBK1010”).
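A pattern-based sketch of both book ID helpers follows. It assumes a book ID is the letters NBK followed by one or more digits, as in the examples above; the function names and the exact pattern are hypothetical and may differ from the library's.

```python
import re
from typing import List

# Assumed shape of an NCBI book ID: "NBK" followed by digits.
BOOK_ID = re.compile(r"NBK\d+")


def find_book_ids(text: str) -> List[str]:
    """Return every book-ID-shaped token in text (possibly empty)."""
    return BOOK_ID.findall(text)


def looks_like_book_id(book_id: str) -> bool:
    """True only if the whole string has the book ID shape."""
    return BOOK_ID.fullmatch(book_id.strip()) is not None
```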

metapub.text_mining.findall_pmcIDs(text)[source]

PubmedCentral IDs look like this: PMC123456

Parameters:: text
Returns pmc_ids:: list of IDs (possibly empty)

metapub.text_mining.is_pmcid(pmcid)[source]: Returns boolean on whether supplied pmcid looks like a PubMedCentral ID (e.g. “PMC31345”).

metapub.text_mining.pick_pmid(text)[source]

Returns the longest numerical string from text (string) as the pmid. If text is empty or there are no pmids, returns None.

Parameters:: text – (str)
Returns:: pmid (str) or None
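The "longest numerical string" rule can be sketched in a few lines. The function name pick_longest_number is hypothetical, and how the library breaks ties between equally long runs is not stated, so the first-wins behavior here is an assumption.

```python
import re
from typing import Optional


def pick_longest_number(text: str) -> Optional[str]:
    """Hypothetical sketch: return the longest run of digits, or None."""
    if not text:
        return None
    runs = re.findall(r"\d+", text)
    # max() keeps the first of several equally long runs.
    return max(runs, key=len) if runs else None
```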

metapub.text_mining.findall_dois_in_text(inp, whitespace=False)[source]

Returns all seen DOIs in submitted text.

if whitespace arg set to True, look for DOIs like the following:
10.1002 / pd.354

…but return with whitespace stripped:
10.1002/pd.354

Parameters:

inp – (str)
whitespace – (bool)

Returns:

list of DOIs found in inp
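The effect of the whitespace flag can be illustrated with a simplified sketch. The regular expressions below are assumptions chosen for illustration and are much looser than a production DOI matcher; the function name find_dois is hypothetical.

```python
import re
from typing import List

# "10.", a registrant code of 4-9 digits, a slash, then a suffix.
STRICT = re.compile(r'10\.\d{4,9}/[^\s"<>]+')
# Same shape, but tolerating whitespace around the slash.
LOOSE = re.compile(r'10\.\d{4,9}\s*/\s*[^\s"<>]+')


def find_dois(text: str, whitespace: bool = False) -> List[str]:
    """Hypothetical sketch: return DOI-shaped strings found in text."""
    pattern = LOOSE if whitespace else STRICT
    found = []
    for match in pattern.findall(text):
        doi = re.sub(r"\s+", "", match)   # strip any internal whitespace
        found.append(doi.rstrip(".,;"))   # drop trailing sentence punctuation
    return found
```

With whitespace left at False, a spaced-out DOI such as the one above is not matched at all, which keeps the default mode from joining unrelated tokens.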

metapub.text_mining.find_doi_in_string(inp, whitespace=False)[source]

Returns the first seen DOI in the input string.

Parameters:

inp – (str)
whitespace – (bool)

Returns:

string containing first found DOI, or None

metapub.text_mining.scrape_doi_from_article_page(url)[source]

Takes an article link (url), loads its page, and searches its content for DOIs, returning the first one it finds.

The first DOI found on the page being the correct one for the article at hand seems to be a reasonable and workable assumption in general.

Parameters:: url – (str)
Returns:: doi or None
Raises:: Exception for network/connection issues

metapub.text_mining.get_pmc_fulltext_filename_for_PubMedArticle(pma)[source]

metapub.utils module

metapub.utils.kpick(args, options, default=None)[source]

metapub.utils.remove_chars(inp, chars='[],.()<>\'/?;:"&', urldecode=False)[source]

Remove target characters from input string.

Parameters:

inp – (str)
chars – (str) characters to remove [default: utils.PUNCS_WE_DONT_LIKE]
urldecode – (bool) whether to first urldecode the input string [default: False]
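A minimal sketch of character removal with optional URL decoding, using only the standard library. The function name strip_chars is hypothetical; the default character set is copied from the signature above.

```python
from urllib.parse import unquote

# Default characters, copied from the signature above.
PUNCS = "[],.()<>'/?;:\"&"


def strip_chars(inp: str, chars: str = PUNCS, urldecode: bool = False) -> str:
    """Hypothetical sketch: drop every character in chars from inp."""
    if urldecode:
        inp = unquote(inp)  # e.g. "%2E" becomes "." before removal
    # A translation table with only deletions removes the characters.
    return inp.translate(str.maketrans("", "", chars))
```

Decoding before removal matters: an encoded period such as %2E would otherwise survive, because the percent sign and digits are not in the removal set.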

metapub.utils.hostname_of(url)[source]

Takes a url (may or may not contain protocol prefix) and returns the simplest base form of the hostname in the supplied URL.

If hostname starts with ‘www.’, this will be stripped out.

Examples

http://www.nature.com/pr/journal/v49/n1/full/pr20018a.html –> nature.com
https://webhome.weizmann.ac.il –> webhome.weizmann.ac.il
https://www.ncbi.nlm.nih.gov/pubmed/17108762 –> ncbi.nlm.nih.gov

Parameters:: url – (str)
Return hostname:: (str)
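The behavior shown in the examples can be approximated with urllib.parse. This is a sketch that reproduces the documented examples, not the library's implementation; the function name simple_hostname is hypothetical.

```python
from urllib.parse import urlparse


def simple_hostname(url: str) -> str:
    """Hypothetical sketch: hostname without a leading 'www.'."""
    # urlparse only fills in the host when a scheme separator is present.
    if "://" not in url:
        url = "http://" + url
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host
```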

metapub.utils.rootdomain_of(url)[source]

Returns the root domain of hostname of supplied URL.

Examples

http://blood.oxfordjournals.org –> oxfordjournals.org
https://webhome.weizmann.ac.il –> ac.il
https://regex101.com/ –> regex101.com
https://www.ncbi.nlm.nih.gov/pubmed/17108762 –> nih.gov

Parameters:: url – (str)
Return rootdomain:: (str)
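The examples above are all consistent with keeping the last two dot-separated labels of the hostname, which is why webhome.weizmann.ac.il yields ac.il. The sketch below implements that rule as an assumption; the function name root_domain is hypothetical and the library may use different logic.

```python
from urllib.parse import urlparse


def root_domain(url: str) -> str:
    """Hypothetical sketch: last two labels of the URL's hostname."""
    if "://" not in url:
        url = "http://" + url  # urlparse needs a scheme to find the host
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])
```

A two-label rule is simple but does not know about multi-part public suffixes such as ac.il, so the result is not always a registrable domain.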

metapub.utils.asciify(inp)[source]

Nuke all the unicode from orbit. It’s the only way to be sure.

WARNING: this function is mostly used for Python2 compatibility and other legacy stuff, and may be removed in upcoming versions of metapub.

Parameters:: inp – (str)
Returns:: string converted to pure, American ASCII

metapub.utils.squash_spaces(inp)[source]

Convert multiple ‘ ‘ chars to a single space.

Parameters:: inp – (str)
Returns:: same string with only one space where multiple spaces were.

metapub.utils.parameterize(inp, sep='+')[source]

Make strings suitable for submission to GET-based query service.

Strips out the characters named in metapub.utils.PUNCS_WE_DONT_LIKE

If inp is None, return empty string.

Parameters:

inp – (str or None): input to be parameterized
sep – (str): separator to use in place of spaces (default=’+’)

Returns:

“parameterized” str

metapub.utils.deparameterize(inp, sep='+')[source]

Somewhat-undo parameterization in string. Replace separators (sep) with spaces.

Parameters:

inp – (str)
sep – (str) default: ‘+’

Returns:

“deparameterized” string
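The pair of functions can be sketched as follows. The names to_param and from_param are hypothetical, the punctuation set is assumed to match the default shown for remove_chars, and collapsing runs of whitespace into a single separator is an assumption about the intended result.

```python
from typing import Optional

# Assumed to match the default character set shown for remove_chars.
PUNCS = "[],.()<>'/?;:\"&"


def to_param(inp: Optional[str], sep: str = "+") -> str:
    """Hypothetical sketch: strip punctuation, join words with sep."""
    if inp is None:
        return ""
    cleaned = inp.translate(str.maketrans("", "", PUNCS))
    # split() collapses runs of whitespace, so each gap becomes one sep.
    return sep.join(cleaned.split())


def from_param(inp: str, sep: str = "+") -> str:
    """Replace separators with spaces; removed punctuation is not restored."""
    return inp.replace(sep, " ")
```

The round trip is lossy by design: from_param(to_param("J. Clin. Invest.")) gives back the words but not the periods, which is what "somewhat-undo" refers to.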

metapub.utils.remove_html_markup(inp)[source]

Remove HTML and XML tags from text. Preserves HTML entities like &amp;

Parameters:: inp – (str)
Returns:: string with HTML and XML markup removed.
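A regex-based sketch of tag removal that leaves entities untouched. The function name strip_tags and the pattern are hypothetical; a regex of this kind is adequate for simple inline markup but is not a general HTML parser.

```python
import re

# Anything from "<" up to the next ">" is treated as a tag.
TAG = re.compile(r"<[^>]+>")


def strip_tags(inp: str) -> str:
    """Hypothetical sketch: delete tags, keep text and entities as-is."""
    return TAG.sub("", inp)
```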

metapub.utils.lowercase_keys(dct)[source]: Takes an input dictionary, returns dictionary with all keys lowercased.

metapub.validate module

metapub.validate.assert_is_good_doi(doi)[source]

metapub.validate.assert_is_good_pmid(pmid)[source]