metapub package
Subpackages
- metapub.findit package
- Subpackages
- metapub.findit.journals package
- Submodules
- metapub.findit.journals.aaas module
- metapub.findit.journals.biochemsoc module
- metapub.findit.journals.bmc module
- metapub.findit.journals.cantdo_list module
- metapub.findit.journals.cell module
- metapub.findit.journals.degruyter module
- metapub.findit.journals.dustri module
- metapub.findit.journals.endo module
- metapub.findit.journals.jama module
- metapub.findit.journals.jstage module
- metapub.findit.journals.karger module
- metapub.findit.journals.lancet module
- metapub.findit.journals.misc_doi module
- metapub.findit.journals.misc_pii module
- metapub.findit.journals.misc_vip module
- metapub.findit.journals.nature module
- metapub.findit.journals.scielo module
- metapub.findit.journals.sciencedirect module
- metapub.findit.journals.spandidos module
- metapub.findit.journals.springer module
- metapub.findit.journals.todo module
- metapub.findit.journals.wiley module
- metapub.findit.journals.wolterskluwer module
- Submodules
- metapub.findit.dances module
- metapub.findit.findit module
- metapub.findit.logic module
- Subpackages
- metapub.urlreverse package
- Submodules
- metapub.urlreverse.hostname2doiprefix module
- metapub.urlreverse.hostname2jrnl module
- metapub.urlreverse.methods module
- DXDOI()
- get_journal_name_from_url()
- get_pnas_doi_from_link()
- get_elifesciences_doi_from_link()
- get_bmj_doi_from_link()
- get_spandidos_doi_from_link()
- get_karger_doi_from_link()
- get_jstage_doi_from_link()
- get_sciencedirect_doi_from_link()
- get_cell_doi_from_link()
- get_nature_doi_from_link()
- get_biomedcentral_doi_from_link()
- get_jci_doi_from_link()
- get_ahajournals_doi_from_link()
- get_early_release_doi_from_link()
- get_generic_doi_from_link()
- get_plos_doi_from_link()
- try_doi_methods()
- try_vip_methods()
- try_pmid_methods()
- metapub.urlreverse.urlreverse module
Submodules
metapub.base module
- metapub.base.parse_elink_response(xmlstr)[source]
return all Ids from an elink XML response
- Parameters:
xmlstr
- Returns:
list of IDs, or None if XML response empty
- class metapub.base.MetaPubObject(xml, root=None, *args, **kwargs)[source]
Bases: object
Base class for XML parsing objects (e.g. PubMedArticle)
- __init__(xml, root=None, *args, **kwargs)[source]
Instantiate with “xml” as string or bytes containing valid XML.
Supply name of root element (string) to set virtual top level. (optional).
metapub.cache_utils module
Utilities for cache file creation and management.
- metapub.cache_utils.datetime_to_timestamp(dt, epoch=datetime.datetime(1970, 1, 1, 0, 0))[source]
takes a python datetime object and converts it to a Unix timestamp.
This is a non-timezone-aware function.
- Parameters:
dt – datetime to convert to timestamp
epoch – datetime, optional specification of start of epoch [default: 1/1/1970]
- Returns:
timestamp
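The conversion described above can be sketched with the standard library alone. This is a minimal illustration, not the module's actual source; the name naive_timestamp is hypothetical, and it assumes both datetimes are naive (no tzinfo), matching the "non-timezone-aware" note.

```python
from datetime import datetime


def naive_timestamp(dt, epoch=datetime(1970, 1, 1)):
    """Seconds elapsed between a naive `epoch` and a naive `dt`.

    Hypothetical stand-in for illustration; no timezone handling is done.
    """
    return (dt - epoch).total_seconds()


# One day after the default epoch is 24 * 60 * 60 seconds.
one_day = naive_timestamp(datetime(1970, 1, 2))
```

Passing a timezone-aware datetime together with the naive default epoch would raise a TypeError in this sketch, which is one reason to keep inputs naive.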
- metapub.cache_utils.get_cache_path(cachedir='/home/docs/.cache', filename='metapub-cache.db')[source]
checks if cachedir exists; if not, tries to create it; raises MetaPubError if it can’t be created.
if cachedir is None, returns None.
Default: DEFAULT_CACHE_DIR set in config.py (~/.cache)
Supports expansion of user directory shortcut ‘~’ to full path.
- Parameters:
cachedir – directory to store
filename – name of cache file
- Returns:
path to SQLite DB file
:raises MetaPubError
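The behaviour listed above (None passthrough, '~' expansion, directory creation, error on failure) might look roughly like the following. The function name cache_path_sketch and the exception CacheDirError are hypothetical stand-ins for get_cache_path and MetaPubError; the real implementation may differ.

```python
import os
import tempfile


class CacheDirError(Exception):
    """Stand-in for MetaPubError in this sketch."""


def cache_path_sketch(cachedir, filename="metapub-cache.db"):
    """Return the path to a cache file, creating `cachedir` if needed."""
    if cachedir is None:
        return None
    # Expand a leading '~' to the user's home directory.
    cachedir = os.path.expanduser(cachedir)
    try:
        os.makedirs(cachedir, exist_ok=True)
    except OSError as err:
        raise CacheDirError("could not create %s: %s" % (cachedir, err))
    return os.path.join(cachedir, filename)


# Demonstrate against a throwaway directory rather than the real ~/.cache.
demo_dir = os.path.join(tempfile.mkdtemp(), "cache")
demo_path = cache_path_sketch(demo_dir)
```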
- metapub.cache_utils.cleanup_dir(cachedir)[source]
Remove all files from a cache directory and delete the directory itself.
This function is used for cache maintenance and cleanup operations. Silently handles errors if files cannot be removed.
- Parameters:
cachedir (str) – path to directory to clean up
- Returns:
None
metapub.cite module
Common functions for the formatting of academic reference citations.
- metapub.cite.author_str(author_list_or_string, as_html=False)[source]
Helper function for constructing article citations.
- Parameters:
author_list_or_string
- Returns:
author(s) str suitable for printed citation
- metapub.cite.citation(**kwargs)[source]
Returns a formatted citation string built from this article’s author(s), title, journal, year, volume, pages, and doi.
see cite.article and cite.book for more specific use cases.
Note that “authors” (as list) will be used preferentially over “author” (as str).
- Keywords:
as_html: (bool) returns citation with light HTML formatting.
author: (str) – prints author as-is without modification
authors: (list) – prints as author1 (first in list) as “Lastname_FirstInitials, et al”
title: (str)
journal: (str)
year: (str or int)
volume: (str or int)
pages: (str) should be formatted “nn-mm”, e.g. “55-58”
doi: (str)
- Returns:
citation (str)
- metapub.cite.article(**kwargs)[source]
Returns a formatted citation string built from this article’s author(s), title, journal, year, volume, pages, and doi.
This function uses the Article format citation template. For example:
McNally EM, et al. Genetic mutations and mechanisms in dilated cardiomyopathy. Journal of Clinical Investigation. 2013; 123:19-26. doi: 10.1172/JCI62862.
- Keywords:
journal
title
doi
authors (str or list) – if str, prints authors without modification.
- Returns:
citation (str)
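To make the Article template concrete, here is a small self-contained sketch that reproduces the example string above. It is an assumption-laden illustration, not cite.article itself: the function name, argument order, and the "first author plus et al" rule are inferred from the documented example.

```python
def format_article_citation(authors, title, journal, year, volume, pages, doi=None):
    """Build an Article-style citation string (illustrative only)."""
    if isinstance(authors, (list, tuple)):
        # A list is reduced to the first author, with "et al" if there are more.
        author_part = authors[0] + (", et al" if len(authors) > 1 else "")
    else:
        # A string is used without modification.
        author_part = authors
    text = "%s. %s. %s. %s; %s:%s." % (author_part, title, journal, year, volume, pages)
    if doi:
        text += " doi: %s." % doi
    return text


example = format_article_citation(
    authors=["McNally EM", "Placeholder AB"],  # second name is a placeholder
    title="Genetic mutations and mechanisms in dilated cardiomyopathy",
    journal="Journal of Clinical Investigation",
    year=2013,
    volume=123,
    pages="19-26",
    doi="10.1172/JCI62862",
)
```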
- metapub.cite.book(book, **kwargs)[source]
Takes a PubMedArticle “book” and formats a citation string. This is a special type of citation built mostly for NCBI GeneReviews and not currently generalizable to other academic books (yet).
Returns a formatted citation string for a book. A “book” needs to contain the following attributes:
author
title
book_date_revised
book_contribution_date
editors
journal
book_publisher (may be a URL)
This function uses the Book format citation template:
book_cit_fmt = ‘{author}. {title}. {cdate} (Update {mdate}). In: {editors}, editors. {journal} (Internet). {book_publisher}’
For example:
Tranebjarg L, et al. Jervell and Lange-Nielsen syndrome. 2002 Jul 29 (Updated 2014 Nov 20). In: Pagon RA, et al., editors. GeneReviews (Internet). Seattle (WA): University of Washington, Seattle; 1993-2015. Available from: https://www.ncbi.nlm.nih.gov/books/NBK1405/.
- Parameters:
book – PubMedArticle of type “book”
use_html – (bool) whether to return with light HTML formatting
- Returns:
formatted citation string
- Return type:
- metapub.cite.bibtex(**kwargs)[source]
Returns a BibTeX formatted citation string built from the book or article author(s), title, journal, year, volume, pages, and doi if the fields exist
see cite.article and cite.book for more specific use cases.
see https://ctan.org/tex-archive/biblio/bibtex/contrib/doc/ for more on the BibTeX format
Note that “authors” (as list) will be used preferentially over “author” (as str).
- Keywords:
isbook: (bool) returns citation with standard entry type as ‘book’
author: (str) – prints author as-is without modification
authors: (list) – prints as author1 (first in list) as “Lastname_FirstInitials, et al”
title: (str)
journal: (str)
year: (str or int)
volume: (str or int)
pages: (str) should be formatted “nn-mm”, e.g. “55-58”
doi: (str)
- Returns:
bibtex citation (str)
metapub.clinvarfetcher module
metapub.clinvarfetcher: tools for interacting with ClinVar data
- class metapub.clinvarfetcher.ClinVarFetcher(a Borg singleton object)[source]
Bases: Borg
Toolkit for retrieval of ClinVar information.
Set optional ‘cachedir’ parameter to absolute path of preferred directory if desired; cachedir defaults to <current user directory> + /.cache
clinvar = ClinVarFetcher()
clinvar = ClinVarFetcher(cachedir='/path/to/cachedir')
Usage
Get ClinVar accession IDs for gene name (switch single_gene to True to filter out results containing more genes than the specified gene being searched, default False).
cv_ids = clinvar.ids_by_gene('FGFR3', single_gene=True)
Get ClinVar accession in python dictionary format for given ID:
cv_subm = clinvar.accession(65533) # can also submit ID as string
Get list of pubmed IDs (pmids) for given ClinVar accession ID:
pmids = clinvar.pmids_for_id(65533) # can also submit ID as string
Get list of pubmed IDs (pmids) for hgvs string:
pmids = clinvar.pmids_for_hgvs('NM_017547.3:c.1289A>G')
For more info, see the ClinVar eutils page: https://www.ncbi.nlm.nih.gov/clinvar/docs/maintenance_use/
- __init__(method='eutils', cachedir='default')[source]
Initialize ClinVarFetcher for clinical variant data retrieval.
- Parameters:
- Raises:
NotImplementedError – If an unsupported method is specified.
Note
This is a Borg singleton - all instances share the same state. Provides access to NCBI’s ClinVar database for clinical significance of genetic variants, gene-disease relationships, and variant literature.
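The "Borg" idea mentioned in the note can be illustrated with the classic shared-state pattern. This is a generic sketch of that pattern, not metapub's own Borg base class; FetcherSketch and its cachedir handling are hypothetical.

```python
class Borg:
    """Classic Borg pattern: every instance shares one attribute dictionary."""
    _shared_state = {}

    def __init__(self):
        self.__dict__ = self._shared_state


class FetcherSketch(Borg):
    """Hypothetical fetcher used only to demonstrate shared state."""

    def __init__(self, cachedir=None):
        super().__init__()
        if cachedir is not None:
            self.cachedir = cachedir


first = FetcherSketch(cachedir="/path/to/cachedir")
second = FetcherSketch()  # distinct object, same state as `first`
```

A practical consequence of this pattern is that configuration passed to the first instance remains visible to instances created later without arguments.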
metapub.clinvarvariant module
metapub.clinvarvariant – ClinVarVariant class instantiated by supplying ESummary XML string.
- class metapub.clinvarvariant.PathogenicSummary(counts: dict[Literal['pathogenic', 'likely pathogenic', 'uncertain significance', 'likely benign', 'benign', 'conflicting interpretations', 'drug response', 'risk factor', 'association', 'protective', 'other', 'likely pathogenic, low penetrance', 'pathogenic, low penetrance', 'uncertain risk allele', 'likely risk allele', 'established risk allele', 'affects', 'conflicting data from submitters', 'not provided', 'vus-high', 'vus-mid', 'vus-low'], int], total_submitters: int, consensus: Literal['pathogenic', 'likely pathogenic', 'uncertain significance', 'likely benign', 'benign', 'conflicting interpretations', 'drug response', 'risk factor', 'association', 'protective', 'other', 'likely pathogenic, low penetrance', 'pathogenic, low penetrance', 'uncertain risk allele', 'likely risk allele', 'established risk allele', 'affects', 'conflicting data from submitters', 'not provided', 'vus-high', 'vus-mid', 'vus-low'] | None, conflicting: bool, review_status: str | None)[source]
Bases: object
- counts: dict[Literal['pathogenic', 'likely pathogenic', 'uncertain significance', 'likely benign', 'benign', 'conflicting interpretations', 'drug response', 'risk factor', 'association', 'protective', 'other', 'likely pathogenic, low penetrance', 'pathogenic, low penetrance', 'uncertain risk allele', 'likely risk allele', 'established risk allele', 'affects', 'conflicting data from submitters', 'not provided', 'vus-high', 'vus-mid', 'vus-low'], int]
- consensus: Literal['pathogenic', 'likely pathogenic', 'uncertain significance', 'likely benign', 'benign', 'conflicting interpretations', 'drug response', 'risk factor', 'association', 'protective', 'other', 'likely pathogenic, low penetrance', 'pathogenic, low penetrance', 'uncertain risk allele', 'likely risk allele', 'established risk allele', 'affects', 'conflicting data from submitters', 'not provided', 'vus-high', 'vus-mid', 'vus-low'] | None
- __init__(counts, total_submitters, consensus, conflicting, review_status)
- class metapub.clinvarvariant.ClinVarVariant(xmlstr, *args, **kwargs)[source]
Bases: MetaPubObject
- __init__(xmlstr, *args, **kwargs)[source]
Instantiate with “xml” as string or bytes containing valid XML.
Supply name of root element (string) to set virtual top level. (optional).
- property hgvs_c
Returns a list of all coding HGVS strings from the Allele data.
- property hgvs_g
Returns a list of all genomic HGVS strings from the Allele data.
- property hgvs_p
Returns a list of all protein effect HGVS strings from the Allele data.
metapub.config module
metapub.convert module
Convert.pmid2doi / Convert.doi2pmid / Convert.bookid2pmid
- Usage:
convert -h
convert pmid2doi <pmid> [options]
convert doi2pmid <doi> [options]
convert bookid2pmid <book_id> [options]
- Options:
- -h, --help
Show this help page
- -v, --version
Show this command’s version.
- -q, --quiet
Shut up all that log garbage.
- -d, --debug
No wait, give me ALL the log garbage! Superseded by --quiet.
- -a, --article
Also print out the article information (from PubMedArticle) if possible.
- -w, --work
Also print out info from the CrossRef entry, if possible.
- metapub.convert.PubMedArticle2doi(pma)[source]
Starting with a PubMedArticle object, use CrossRef to find a DOI for given article.
- Parameters:
pma (PubMedArticle)
- Returns:
doi (str) or None
- metapub.convert.pmid2doi(pmid)[source]
- starting with a pubmed ID, lookup article in pubmed. If DOI found in PubMedArticle object,
return it. Otherwise, use CrossRef to find the DOI for given article.
- Parameters:
- Returns:
doi (str) or None
- Raises:
InvalidPMID (if pmid is invalid) –
NCBIServiceError (if NCBI services are down) –
- metapub.convert.doi2pmid(doi)[source]
uses CrossRef and PubMed eutils to lookup a PMID given a known doi.
- Warning: NO validation of input DOI performed here. Use
metapub.text_mining.find_doi_in_string beforehand if needed.
If a PMID can be found, return it. Otherwise return None.
In very rare cases, use of the CrossRef->pubmed citation method used here may result in more than one pubmed ID. In this case, this function will return instead the word ‘AMBIGUOUS’.
- Parameters:
doi – (str)
- Return pmid:
(str) if found; ‘AMBIGUOUS’ if citation count > 1; None if no results.
- Raises:
NCBIServiceError if NCBI services are down
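Because doi2pmid has three distinct outcomes (a PMID, the string 'AMBIGUOUS', or None), calling code usually needs to branch on all of them. The helper below is a hypothetical illustration of that branching; it does not call metapub and takes the already-returned value as input.

```python
def describe_doi2pmid_result(result):
    """Classify a value shaped like doi2pmid's documented return values."""
    if result is None:
        return "no PMID found"
    if result == "AMBIGUOUS":
        # More than one PubMed ID matched the CrossRef citation.
        return "ambiguous: manual review needed"
    return "PMID %s" % result


outcomes = [describe_doi2pmid_result(r) for r in (None, "AMBIGUOUS", "12345")]
```

Checking for 'AMBIGUOUS' before treating the value as an ID avoids passing that sentinel string on to later PubMed lookups.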
metapub.crossref module
- metapub.crossref.get_most_similar_work_from_crossref_results(qstring, qname, cr_results)[source]
Uses Levenshtein distance on result title to rank CrossRef results. Returns top candidate for a match from these items based on comparison title.
- Parameters:
qstring – (str) original query string for search
qname – (str) name of query item (e.g. “title”)
cr_results – (dict) crossref results as returned by habanero
- Returns:
{‘title_ld’: <score>, ‘work’: <CrossRefWork or None>}
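The ranking step can be approximated as follows. Note the substitution: this sketch uses difflib.SequenceMatcher from the standard library as a stand-in similarity measure, which is not a true Levenshtein ratio, and it assumes each candidate is a plain dict with a "title" string (the real CrossRef result structure is richer).

```python
from difflib import SequenceMatcher


def rank_by_title(query_title, candidates):
    """Return the candidate whose title is most similar to `query_title`.

    Uses difflib's ratio as a stand-in for a Levenshtein ratio.
    """
    best = {"title_ld": 0.0, "work": None}
    for work in candidates:
        score = SequenceMatcher(
            None, query_title.lower(), work["title"].lower()
        ).ratio()
        if score > best["title_ld"]:
            best = {"title_ld": score, "work": work}
    return best


works = [
    {"title": "An unrelated study of soil chemistry"},
    {"title": "Genetic mutations and mechanisms in dilated cardiomyopathy"},
]
top = rank_by_title("Genetic mutations and mechanisms in dilated cardiomyopathy", works)
```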
- class metapub.crossref.CrossRefWork(**kwargs)[source]
Bases: object
Represents one ‘work’ from CrossRef search results.
- property first_page
Returns first page (number) of article as string, or None if self.page is empty.
- property citation
Returns a formal citation string for this work.
- property pubyear
- property pubmonth
- property pubdate
- property author1
- property author1_last_fm
- property authors_str_lastfirst
Returns this work’s authors as a semicolon-separated string – LASTNAME FIRSTInitial.
- property author_list
Returns this work’s authors as a flat list (Firstname Lastname), retaining order given by Crossref.
- property author_list_last_fm
Returns this work’s authors as a flat list (Lastname FirstInitial), retaining order given by Crossref.
- class metapub.crossref.CrossRefFetcher(**kwargs)[source]
Bases: Borg
Valid field queries for this route are: affiliation, degree, event-acronym, bibliographic, container-title, publisher-name, author, event-theme, standards-body-acronym, chair, event-location, translator, funder-name, event-name, publisher-location, title, standards-body-name, contributor, editor, event-sponsor
- article_by_doi(doi)[source]
Returns a CrossRefWork object loaded by querying the Crossref works/DOI REST endpoint.
- Parameters:
doi – (str)
- Return type:
- Raises:
HTTPError (404) if DOI not found.
- Raises:
Exception for network/service issues
- article_by_pma(pma, ideal_ld=0.95, min_ld=0.8)[source]
From a PubMedArticle object, use as much info as needed to get as precise a match on CrossRef as is possible.
- 1st attempt: Title + Journal. Runs Levenshtein distance on results; if any results have
a better similarity ratio than ideal_ld, the top of these results will be returned. Otherwise, the first item with a score better than min_ld will be kept and compared against 2nd attempt results.
- 2nd attempt: Title + First Author. Same process as 1st attempt but with any candidates
found in 1st attempt submitted for comparison.
Finally: Return None or CrossRefWork from best candidate that exceeds min_ld requirement.
- Parameters:
pma – PubMedArticle object
ideal_ld – (float) [default: set in global at top of crossref.py]
min_ld – (float) [default: set in global at top of crossref.py]
- Return type:
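The two-attempt selection logic described above can be sketched in isolation. This is one reading of the prose, not the library's code: each attempt is assumed to yield at most one (score, work) pair, and pick_candidate is a hypothetical name.

```python
def pick_candidate(first, second, ideal_ld=0.95, min_ld=0.8):
    """Choose between the best results of two search attempts.

    `first` and `second` are (score, work) tuples, or None if an attempt
    produced nothing.
    """
    # A first-attempt result above the ideal threshold wins immediately.
    if first is not None and first[0] > ideal_ld:
        return first[1]
    # Otherwise keep whatever exceeds the minimum and take the best of those.
    kept = [c for c in (first, second) if c is not None and c[0] > min_ld]
    if not kept:
        return None
    return max(kept, key=lambda c: c[0])[1]


early_exit = pick_candidate((0.97, "work A"), (0.99, "work B"))
compared = pick_candidate((0.85, "work A"), (0.90, "work B"))
rejected = pick_candidate((0.50, "work A"), (0.60, "work B"))
```

In a real lookup the second search would be skipped entirely when the first already exceeds ideal_ld; the sketch only models the decision, not the network calls.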
- article_by_title(title, **kwargs)[source]
Use CrossRef to find a work by its title. Returns first item in the list.
Keywords are passed unmodified to crossref.works() [habanero].
- Parameters:
title – str
- Return type:
CrossRefWork or None (if no results)
metapub.dx_doi module
- class metapub.dx_doi.DxDOI(retries=1, **kwargs)[source]
Bases: Borg
Looks up DOIs in dx.doi.org and caches results in an SQLite cache. This is a Borg singleton object.
- check_doi(doi, whitespace=False)[source]
Checks validity of supplied doi.
If whitespace is True (default False), allows supplied doi to contain whitespace.
- Parameters:
doi – (str)
whitespace – (bool)
- Returns:
doi (str) – verified DOI
:raise BadDOI if supplied DOI fails regular expression check.
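A regular-expression check of this kind might look like the sketch below. The patterns are assumptions chosen for illustration (a "10." prefix, a 4 to 9 digit registrant code, a slash, then a suffix); the expression DxDOI actually uses is not shown in this reference, and BadDOISketch stands in for BadDOI.

```python
import re


class BadDOISketch(ValueError):
    """Stand-in for metapub.exceptions.BadDOI in this sketch."""


# Assumed patterns, not the library's own.
DOI_STRICT = re.compile(r"^10\.\d{4,9}/\S+$")
DOI_WITH_WHITESPACE = re.compile(r"^10\.\d{4,9}/.+$")


def check_doi_sketch(doi, whitespace=False):
    """Return the stripped DOI if it matches, else raise BadDOISketch."""
    doi = doi.strip()
    pattern = DOI_WITH_WHITESPACE if whitespace else DOI_STRICT
    if not pattern.match(doi):
        raise BadDOISketch("not a plausible DOI: %r" % doi)
    return doi


verified = check_doi_sketch("10.1038/ng.379")
```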
- resolve(doi, check_doi=True, whitespace=False, skip_cache=False)[source]
Takes a doi (string), returns a url to article page on journal website.
if check_doi is True (default True), checks DOI before submitting query to dx.doi.org.
if whitespace is True (default False), allows prospective dois to contain whitespace when checked.
if skip_cache is True (default False), doesn’t check cache for pre-existing results (loads from remote dx.doi.org).
- Parameters:
doi – (str)
check_doi – (bool)
whitespace – (bool)
skip_cache – (bool)
- Returns:
url (str)
- Raises:
BadDOI – if supplied DOI failed regular expression check
DxDOIError – if not-ok HTTP status code while loading url
ConnectionError – if problem making dx.doi.org connection
metapub.eutils_common module
- metapub.eutils_common.get_eutils_client(cache_path, cache=None)[source]
- Parameters:
cache_path – valid filesystem path to SQLite cache file
api_key – (optional) NCBI API Key obtainable from https://www.ncbi.nlm.nih.gov
- Returns:
lightweight NCBI client object (drop-in replacement for eutils)
metapub.exceptions module
- exception metapub.exceptions.MetaPubError[source]
Bases: Exception
Base Exception class from which all other exceptions in this library derive.
- __init__(*args, **kwargs)
- add_note()
Exception.add_note(note) – add a note to the exception
- args
- with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- exception metapub.exceptions.BaseXMLError[source]
Bases: MetaPubError
Raised when XML needed to instantiate an object fails at the most basic level.
- __init__(*args, **kwargs)
- add_note()
Exception.add_note(note) – add a note to the exception
- args
- with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- exception metapub.exceptions.InvalidPMID[source]
Bases: MetaPubError
Raised when NCBI efetch of a pubmed ID results in “invalid” response.
- __init__(*args, **kwargs)
- add_note()
Exception.add_note(note) – add a note to the exception
- args
- with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- exception metapub.exceptions.InvalidBookID[source]
Bases: MetaPubError
Raised when attempting to lookup an NCBI book with something that doesn’t look like a Book ID.
- __init__(*args, **kwargs)
- add_note()
Exception.add_note(note) – add a note to the exception
- args
- with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- exception metapub.exceptions.CrossRefConnectionError[source]
Bases: MetaPubError
Raised when a well-formed CrossRef query results in a server error.
- __init__(*args, **kwargs)
- add_note()
Exception.add_note(note) – add a note to the exception
- args
- with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- exception metapub.exceptions.NoPDFLink(reason, *args, **kwargs)[source]
Bases: MetaPubError
Raised when a FindIt url lookup fails for some specific reason that is particular to the journal or publisher.
This Exception provides extended attributes:
reason: human-readable “reason” why URL lookup failed.
url: last url attempted
status_code: last HTTP code returned in attempt (if any)
missing: list of data items missing from last attempt (if any)
This Exception is mostly used internally in FindIt as flow control.
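Using an exception with extra attributes as flow control can be sketched like this. NoPDFLinkSketch mirrors the attribute names listed above, but its constructor keywords, the try_lookup helper, and the example.org URL are all hypothetical.

```python
class NoPDFLinkSketch(Exception):
    """Stand-in exception carrying the extended attributes described above."""

    def __init__(self, reason, url=None, status_code=None, missing=None):
        super().__init__(reason)
        self.reason = reason
        self.url = url
        self.status_code = status_code
        self.missing = missing or []


def try_lookup(have_doi):
    """Pretend publisher-specific lookup that needs a DOI to proceed."""
    if not have_doi:
        raise NoPDFLinkSketch("MISSING: doi needed for lookup", missing=["doi"])
    return "https://example.org/article.pdf"


def lookup_or_reason(have_doi):
    """Return (url, reason); exactly one of the two is None."""
    try:
        return try_lookup(have_doi), None
    except NoPDFLinkSketch as err:
        return None, err.reason


found = lookup_or_reason(True)
failed = lookup_or_reason(False)
```

Catching the exception at one outer layer keeps each publisher-specific routine free to fail early with a descriptive reason.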
- add_note()
Exception.add_note(note) – add a note to the exception
- args
- with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- exception metapub.exceptions.AccessDenied(reason, *args, **kwargs)[source]
Bases: NoPDFLink
Raised when a FindIt url lookup fails for some specific reason that is particular to the journal or publisher.
- __init__(reason, *args, **kwargs)
- add_note()
Exception.add_note(note) – add a note to the exception
- args
- with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- exception metapub.exceptions.BadDOI[source]
Bases: MetaPubError
Raised when DxDOI class tests validity of DOI and it fails to pass muster.
- __init__(*args, **kwargs)
- add_note()
Exception.add_note(note) – add a note to the exception
- args
- with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
- exception metapub.exceptions.DxDOIError[source]
Bases: MetaPubError
Raised when a bad status code comes from loading dx.doi.org
- __init__(*args, **kwargs)
- add_note()
Exception.add_note(note) – add a note to the exception
- args
- with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
metapub.medgenconcept module
metapub.medgenconcept – MedGenConcept class instantiated by supplying ESummary XML string.
- class metapub.medgenconcept.MedGenConcept(xmlstr, *args, **kwargs)[source]
Bases: MetaPubObject
- __init__(xmlstr, *args, **kwargs)[source]
Instantiate with “xml” as string or bytes containing valid XML.
Supply name of root element (string) to set virtual top level. (optional).
- property synonyms
Returns a list of the ‘name’ values from self.names.
- property medgen_uid
Synonym for “uid”. Sometimes when juggling concepts from multiple places, this helps.
metapub.medgenfetcher module
metapub.MedGenFetcher – tools to deal with NCBI’s E-utilities interface to the MedGen db
- class metapub.medgenfetcher.MedGenFetcher(a Borg singleton object)[source]
Bases: Borg
An interaction layer for querying to return MedGenConcept objects.
Currently available methods: eutils
Basic Usage:
fetch = MedGenFetcher()
To specify a service method (more coming soon):
fetch = MedGenFetcher('eutils')
To return a MedGenConcept from a known UID:
concept = fetch.concept_by_uid(known_UID)
To return a list of UIDs relevant to a given term known in medgen:
uids = fetch.uids_by_term(some_term)
To get a medgen UID given a known Concept ID (cui):
uid = fetch.uid_for_cui(known_cui)
- __init__(method='eutils', cachedir='default')[source]
Initialize MedGenFetcher for medical genetics concept retrieval.
- Parameters:
- Raises:
NotImplementedError – If an unsupported method is specified.
Note
This is a Borg singleton - all instances share the same state. Provides access to NCBI’s MedGen database for medical genetics concepts, diseases, and gene-phenotype relationships.
metapub.ncbi_errors module
NCBI Service Error Detection and User-Friendly Error Messages
This module provides intelligent error detection for NCBI service outages and converts cryptic network/XML errors into clear, actionable user messages.
- class metapub.ncbi_errors.ServiceStatus(is_available, error_type=None, error_message=None, response_time=None, status_code=None)[source]
Bases: object
Status information for NCBI services.
- __init__(is_available, error_type=None, error_message=None, response_time=None, status_code=None)
- class metapub.ncbi_errors.NCBIErrorDetector[source]
Bases: object
Detects and categorizes NCBI service errors.
- metapub.ncbi_errors.check_ncbi_status(url='https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi')[source]
Quick function to check NCBI service status.
- metapub.ncbi_errors.diagnose_ncbi_error(exception, url=None)[source]
Quick function to diagnose NCBI-related errors.
- metapub.ncbi_errors.format_user_error(exception, url=None)[source]
Format a user-friendly error message with suggestions.
- exception metapub.ncbi_errors.NCBIServiceError(message, error_type='unknown', suggestions=None)[source]
Bases: Exception
Custom exception for NCBI service issues.
- add_note()
Exception.add_note(note) – add a note to the exception
- args
- with_traceback()
Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
metapub.ncbi_health_check module
NCBI Service Health Check Utility
A command-line tool to check the status of various NCBI services used by metapub. Helps diagnose service outages and determine which endpoints are affected.
- Usage:
python ncbi_health_check.py # Check all services
python ncbi_health_check.py --quick # Check only essential services
python ncbi_health_check.py --json # Output results as JSON
- class metapub.ncbi_health_check.ServiceResult(name, url, status, response_time, status_code=None, error_message=None, details=None)[source]
Bases: object
Result of checking a single NCBI service.
- __init__(name, url, status, response_time, status_code=None, error_message=None, details=None)
- class metapub.ncbi_health_check.NCBIHealthChecker(timeout=10)[source]
Bases: object
Health checker for NCBI services.
metapub.pubmed_clinicalqueries module
metapub.pubmedarticle module
metapub.pubmedarticle – PubMedArticle class instantiated by supplying ncbi XML string.
- class metapub.pubmedarticle.PubMedArticle(xmlstr, *args, **kwargs)[source]
Bases: MetaPubObject
This PubMedArticle class receives an XML string as its required argument and parses it into its constituent parts, exposing them as attributes.
- Usage:
paper = PubMedArticle(xml_string)
To query services to return an article by pmid, use PubMedFetcher, which returns PubMedArticle objects.
When xmlstr is parsed, the pubmed_type attribute will be set to one of ‘article’ or ‘book’, depending on whether PubmedBookArticle or PubmedArticle headings are found in the supplied xmlstr at instantiation.
Since this class needs to work seamlessly in production whether it’s a book or an article, the PubmedArticle attributes will always be available (set to None in many cases for PubmedBookArticle, e.g. volume, issue, journal), but PubmedBookArticle attributes will only be set when pubmed_type='book'.
- PubMedBook special handling of certain attributes:
abstract: a joined string from self.book_abstracts
title: comes from ArticleTitle
- Special attributes for PubmedBookArticle (pubmed_type='book'):
book_id (default: None) - string from IdType=”bookaccession”, e.g. “NBK1403”
book_title (default: None) - string with name of book (as differentiated from ArticleTitle)
book_publisher (default: None) - dict containing {‘name’: string, ‘location’: string}
book_sections (default: []) - dict with key->value pairs as section_name->SectionTitle
book_contribution_date (default: None) - python datetime date
book_date_revised (default: None) - python datetime date
book_history (default: []) - dictionary with key->value pairs as PubStatus -> python datetime
book_language (default: None) - string (e.g. “eng”)
book_editors (default: []) - list containing names from ‘editors’ AuthorList
book_abstracts (default: []) - dict with key->value pairs as Label->AbstractText.text
book_medium (default: None) - string (e.g. “Internet”)
book_synonyms (default: None) - list of disease synonyms (applicable to “gene” book)
book_publication_status (default: None) - string (e.g. “ppublish”)
- __init__(xmlstr, *args, **kwargs)[source]
Initialize PubMedArticle from NCBI XML data.
- Parameters:
xmlstr (str) – XML string from NCBI containing PubmedArticle or PubmedBookArticle data.
*args – Additional positional arguments passed to parent class.
**kwargs – Additional keyword arguments passed to parent class.
Note
The XML type is automatically detected to handle both regular articles and book chapters. The pubmed_type attribute will be set to ‘article’ or ‘book’ accordingly, and appropriate attributes will be populated.
- to_dict()[source]
Convert PubMedArticle to dictionary representation.
- Returns:
- Dictionary containing all article attributes except
internal XML content and processing attributes.
- Return type:
Dict[str, Any]
Note
Excludes ‘content’, ‘xml’, and ‘_root’ attributes from the output to provide a clean data representation suitable for serialization.
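The exclusion rule in the note can be illustrated with a small stand-in class. ArticleSketch and its attribute values are hypothetical; only the idea of filtering out 'content', 'xml', and '_root' comes from the description above.

```python
class ArticleSketch:
    """Hypothetical object with both data attributes and internal ones."""

    def __init__(self, xml):
        self.xml = xml                      # raw input, excluded from output
        self.content = "<parsed tree>"      # parsed tree placeholder, excluded
        self._root = "PubmedArticle"        # internal marker, excluded
        self.pmid = "12345"
        self.title = "Example title"

    def to_dict(self):
        excluded = {"content", "xml", "_root"}
        return {k: v for k, v in vars(self).items() if k not in excluded}


record = ArticleSketch("<PubmedArticle/>").to_dict()
```

Whether the result can be serialized directly (for example with json.dumps) depends on the attribute types; date objects would need converting first.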
- property citation
Returns a formatted citation string built from this article’s author(s), title, journal, year, volume, pages, and doi.
Article Example:
McNally EM, et al. Genetic mutations and mechanisms in dilated cardiomyopathy. Journal of Clinical Investigation. 2013; 123:19-26. doi: 10.1172/JCI62862.
Book Example (GeneReviews):
Tranebjarg L, et al. Jervell and Lange-Nielsen syndrome. 2002 Jul 29 (Updated 2014 Nov 20). In: Pagon RA, et al., editors. GeneReviews (Internet). Seattle (WA): University of Washington, Seattle; 1993-2015. Available from: https://www.ncbi.nlm.nih.gov/books/NBK1405/.
- property citation_html
Returns a formatted citation string built from this article’s author(s), title, journal, year, volume, and pages.
Article Example:
McNally EM, <i>et al</i>. Genetic mutations and mechanisms in dilated cardiomyopathy. <i>Journal of Clinical Investigation</i>. 2013; <b>123</b>:19-26. doi: 10.1172/JCI62862.
GeneReviews Example: Tranebjarg L, <i>et al</i>. <i>Jervell and Lange-Nielsen syndrome</i>. 2002 Jul 29 (Updated 2014 Nov 20). In: Pagon RA, <i>et al</i>., editors. GeneReviews (Internet). Seattle (WA): University of Washington, Seattle; 1993-2015. Available from: https://www.ncbi.nlm.nih.gov/books/NBK1405/.
- property citation_bibtex
- property pubdate
Normalized publication date as datetime object.
Returns the best available publication date from PubMed XML in order of preference: 1. Article PubDate (Year/Month/Day or MedlineDate) 2. Book contribution date 3. History dates (pubmed, entrez, etc.)
- Returns:
Publication date as datetime object, or None if no date found
- Return type:
datetime or None
Example
article = fetch.article_by_pmid('12345')
if article.pubdate:
    print(f"Published: {article.pubdate.strftime('%Y-%m-%d')}")
metapub.pubmedcentral module
An assortment of functions providing access to various web APIs.
The pubmedcentral.* functions abstract the submission of one of the following acceptable IDs to the Pubmed Central ID Conversion API as a lookup to get another ID mapping to the same pubmed article:
doi Digital Object Identifier
pmid Pubmed ID
pmcid Pubmed Central ID (includes Versioned Identifier)
Available functions:
get_pmid_for_otherid(string)
get_doi_for_otherid(string)
get_pmcid_for_otherid(string)
- metapub.pubmedcentral.get_pmid_for_otherid(otherid)[source]
Use the PMC ID conversion API to attempt to convert either PMCID or DOI to a PMID. Returns PMID if successful, or None if there is no ‘pmid’ item in the response.
- Parameters:
otherid – (str)
- Return pmid:
(str)
- Return type:
- metapub.pubmedcentral.get_pmcid_for_otherid(otherid)[source]
Use the PMC ID conversion API to attempt to convert either PMID or DOI to a PMCID. Returns PMCID if successful, or None if there is no ‘pmcid’ item in the response.
- Parameters:
otherid – (str)
- Return pmcid:
(str)
- Return type:
- metapub.pubmedcentral.get_doi_for_otherid(otherid)[source]
Use the PMC ID conversion API to attempt to convert either PMID or PMCID to a DOI. Returns DOI if successful, or None if there is no ‘doi’ item in the response.
Note: this method has a very low success rate for retrieving DOIs. Check out the CrossRef object, i.e. from metapub import CrossRef which excels at resolving citations into DOIs (and DOIs into citations).
- Parameters:
otherid – (str)
- Return doi:
(str)
- Return type:
metapub.pubmedfetcher module
metapub.PubMedFetcher – tools to deal with NCBI’s E-utilities interface to PubMed
- metapub.pubmedfetcher.get_uids_from_esearch_result(xmlstr)[source]
Extract unique identifiers from an ESearch XML result.
- Parameters:
xmlstr (str) – XML string returned from NCBI ESearch query.
- Returns:
List of PMID strings extracted from the XML.
- Return type:
List[str]
- Raises:
NCBIServiceError – If XML parsing fails due to NCBI service issues.
Parse XML results from ELink query for related PMIDs.
- Parameters:
xmlstr (str) – XML string returned from NCBI ELink query.
- Returns:
Dictionary mapping relationship types to lists of PMIDs. Common keys include ‘pubmed’, ‘reviews’, ‘cited’, etc.
- Return type:
Dict[str, List[str]]
- Raises:
NCBIServiceError – If XML parsing fails due to NCBI service issues.
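The mapping from relationship type to PMIDs described above might be built along these lines. This is a sketch under stated assumptions: the sample XML is hand-written (the second and third IDs are made up), the function name `related_ids_by_type` is hypothetical, and deriving each key from the last underscore-separated token of LinkName is a guess at how the keys are formed, not metapub's confirmed rule.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the shape of an ELink XML reply; IDs are illustrative
SAMPLE_ELINK_XML = """<eLinkResult>
  <LinkSet>
    <DbFrom>pubmed</DbFrom>
    <LinkSetDb>
      <DbTo>pubmed</DbTo>
      <LinkName>pubmed_pubmed</LinkName>
      <Link><Id>14873513</Id></Link>
      <Link><Id>11111111</Id></Link>
    </LinkSetDb>
    <LinkSetDb>
      <DbTo>pubmed</DbTo>
      <LinkName>pubmed_pubmed_reviews</LinkName>
      <Link><Id>22222222</Id></Link>
    </LinkSetDb>
  </LinkSet>
</eLinkResult>"""


def related_ids_by_type(xmlstr):
    """Group linked IDs by relationship type."""
    root = ET.fromstring(xmlstr)
    out = {}
    for linkset in root.iter("LinkSetDb"):
        name = linkset.findtext("LinkName", default="")
        # Assumption: the key is the last token of the link name,
        # e.g. "pubmed_pubmed_reviews" becomes "reviews"
        key = name.rsplit("_", 1)[-1]
        out[key] = [node.text for node in linkset.findall("./Link/Id")]
    return out
```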
- class metapub.pubmedfetcher.PubMedFetcher(a Borg singleton object backed by an optional SQLite cache)[source]
Bases: Borg
An interaction layer for querying via a specified service method to return PubMedArticle objects.
Currently available methods: eutils
Basic Usage:
fetch = PubMedFetcher()
To specify a service method (more coming soon):
fetch = PubMedFetcher('eutils')
To return an article by querying the service with a known PMID or NCBI Book ID:
paper = fetch.article_by_pmid('123456')
book = fetch.article_by_pmid('NBK1234')
Similar methods exist for returning papers by DOI and PubMed Central ID:
paper = fetch.article_by_doi('10.1038/ng.379')
paper = fetch.article_by_pmcid('PMC3458974')
Finally, you can search for PMIDs by citation details using the pmids_for_citation method, which usually needs only 3 of the 5 details to triangulate on a good result:
pmids = fetch.pmids_for_citation(journal='Science', year='2008', volume='4',
                                 first_page='7', author_name='Grant')
- __init__(method='eutils', **kwargs)[source]
Initialize PubMedFetcher with specified service method.
- Parameters:
method (str, optional) – Service method to use. Currently only ‘eutils’ is supported. Defaults to ‘eutils’.
**kwargs –
Additional keyword arguments. cachedir (str, optional): Custom directory for caching responses.
If not provided, uses default cache directory.
- Raises:
NotImplementedError – If an unsupported method is specified.
Note
This is a Borg singleton - all instances share the same state.
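For readers unfamiliar with the Borg pattern mentioned in the note, here is a generic sketch of it. The classes `Borg` and `Fetcher` below are illustrative stand-ins written for this example; metapub's own Borg base class may be implemented differently.

```python
class Borg:
    """Generic Borg pattern: separate instances, one shared attribute dict."""
    _shared_state = {}

    def __init__(self):
        # Every instance points its __dict__ at the same class-level dict
        self.__dict__ = self._shared_state


class Fetcher(Borg):
    """Hypothetical stand-in for a Borg-style fetcher."""

    def __init__(self, method="eutils"):
        super().__init__()
        self.method = method


a = Fetcher()
b = Fetcher()
a.cachedir = "/tmp/example-cache"  # set on one instance...
shared = b.cachedir                # ...and visible on the other
```

The practical consequence is that configuration applied through one instance is seen by every other instance, even though `a is b` is False.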
- pmids_for_clinical_query(query, category, optimization='broad', since=None, until=None, retstart=0, retmax=250, pmc_only=False, **kwargs)[source]
Takes a query and a category (required, see below) and returns a list of pubmed IDs returned by NCBI for that query.
See also PubMedFetcher.pmids_for_query for other parameters.
Available categories: therapy, diagnosis, etiology, prognosis, prediction
Available optimizations: broad (default), narrow
- Param:
query (string)
- Param:
category (string)
- Param:
optimization (string) [default: broad]
- Returns:
list of pubmed IDs
- pmids_for_medical_genetics_query(query, category='all', since=None, until=None, retstart=0, retmax=250, pmc_only=False, **kwargs)[source]
Takes a query and a category (see below) and returns a list of PubMed IDs returned by NCBI for that query.
See also PubMedFetcher.pmids_for_query for other parameters.
Available categories: all (default), diagnosis, differential_diagnosis, clinical_description, management, genetic_counseling, genetic_testing
- Param:
query (string)
- Param:
category (string) [default: all]
- Returns:
list of pubmed IDs
- pmids_for_citation(**kwargs)[source]
Returns a list of PMIDs for the given citation. Requires at least 3 of these 5 keyword arguments:
- jtitle or journal (journal title)
- year or date
- volume
- spage or first_page (starting page / first page)
- aulast (first author’s last name) or author1_first_lastfm (as produced by the PubMedArticle class)
Strings submitted for journal/jtitle will be run through metapub.utils.remove_chars to handle HTML-encoded characters and to remove punctuation.
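The "at least 3 of 5" rule can be checked before making a request. The sketch below counts how many of the five citation details are present, treating each pair of keyword aliases as one detail. The helper names are hypothetical and this is not how metapub itself validates its arguments.

```python
# Each tuple lists the keyword aliases that count as one citation detail
CITATION_ALIASES = [
    ("jtitle", "journal"),
    ("year", "date"),
    ("volume",),
    ("spage", "first_page"),
    ("aulast", "author1_first_lastfm"),
]


def count_citation_details(**kwargs):
    """Count how many of the five details were supplied with a non-empty value."""
    return sum(
        1
        for aliases in CITATION_ALIASES
        if any(kwargs.get(name) for name in aliases)
    )


def has_enough_details(**kwargs):
    """True when at least 3 of the 5 details are present."""
    return count_citation_details(**kwargs) >= 3
```

Supplying both aliases of the same detail (for example jtitle and journal) still counts once.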
For supplied pmid, return related ids of related pubmed articles, organized into a dictionary keyed by type of relation. The keys include:
pubmed (all related links)
citedin (papers that cited this paper)
five (the “five” that pubmed displays as the top related results)
reviews (review papers that cite this paper)
combined (?)
query example: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?retmode=xml&dbfrom=pubmed&id=14873513&cmd=neighbor
- Raises:
NCBIServiceError if NCBI ELink service is down
metapub.pubmedfetcher_cli module
pubmed_article: utility for fetching an article by PMID.
- Usage:
pubmed_article <pmid>
Options:
- -h, --help
Print this screen.
- -v, --version
Print the version of this program.
- -a, --abstract
Include the abstract.
- -f, --full
Print the full article, if possible. (experimental)
metapub.text_mining module
- metapub.text_mining.findall_ncbi_bookIDs(text)[source]
GeneReviews books look like this: NBK1210 (see https://www.ncbi.nlm.nih.gov/pubmed/?term=NBK1210 )
- Parameters:
text
- Returns book_ids:
list of IDs (possibly empty)
- metapub.text_mining.is_ncbi_bookID(book_id)[source]
Returns whether supplied book_id appears to be an NCBI book ID (e.g. “NBK1010”).
- metapub.text_mining.findall_pmcIDs(text)[source]
PubmedCentral IDs look like this: PMC123456
- Parameters:
text
- Returns pmc_ids:
list of IDs (possibly empty)
- metapub.text_mining.is_pmcid(pmcid)[source]
Returns boolean on whether supplied pmcid looks like a PubMedCentral ID (e.g. “PMC31345”).
- metapub.text_mining.pick_pmid(text)[source]
Returns the longest numerical string in text (string) as the PMID. If text is empty or contains no PMIDs, returns None.
- Parameters:
text – (str)
- Returns:
pmid (str) or None
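The "longest run of digits" heuristic described above can be sketched with a regular expression. The function name `pick_longest_number` is an assumption for illustration; metapub's implementation may handle ties or edge cases differently.

```python
import re


def pick_longest_number(text):
    """Return the longest run of digits in text, or None if there is none."""
    if not text:
        return None
    runs = re.findall(r"\d+", text)
    if not runs:
        return None
    # max() keeps the first of several equally long runs
    return max(runs, key=len)
```

For a string such as "PMID: 17108762 (2006)", the eight-digit run wins over the four-digit year.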
- metapub.text_mining.findall_dois_in_text(inp, whitespace=False)[source]
Returns all seen DOIs in submitted text.
If the whitespace argument is set to True, also look for DOIs written with spaces around the slash, like the following:
10.1002 / pd.354
and return them with the whitespace stripped:
10.1002/pd.354
- Parameters:
inp – (str)
whitespace – (bool)
- Returns:
list of DOIs found in inp
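The effect of the whitespace flag can be illustrated with a deliberately simple DOI pattern. Both regular expressions and the name `find_dois` are assumptions made for this sketch; the patterns metapub actually uses are likely more thorough.

```python
import re

# Simplified DOI patterns, assumed for illustration only
DOI_STRICT = re.compile(r'10\.\d{4,9}/[^\s"<>]+')
DOI_SPACED = re.compile(r'10\.\d{4,9}\s*/\s*[^\s"<>]+')


def find_dois(inp, whitespace=False):
    """Return DOI-like strings found in inp, in order of appearance."""
    pattern = DOI_SPACED if whitespace else DOI_STRICT
    found = []
    for match in pattern.findall(inp):
        doi = re.sub(r"\s+", "", match)  # drop spaces around the slash
        found.append(doi.rstrip(".,;"))  # drop trailing sentence punctuation
    return found
```

With whitespace=False, a spaced DOI such as "10.1002 / pd.354" is not matched at all; with whitespace=True it is matched and returned in its compact form.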
- metapub.text_mining.find_doi_in_string(inp, whitespace=False)[source]
Returns the first seen DOI in the input string.
- Parameters:
inp – (str)
whitespace – (bool)
- Returns:
string containing first found DOI, or None
- metapub.text_mining.scrape_doi_from_article_page(url)[source]
Takes an article link (url), loads its page, and searches its content for DOIs, returning the first one it finds.
This relies on the assumption that the first DOI found on the page is the correct one for the article at hand, which seems to be reasonable and workable in general.
- Parameters:
url – (str)
- Returns:
doi or None
- Raises:
Exception for network/connection issues
metapub.utils module
- metapub.utils.remove_chars(inp, chars='[],.()<>\'/?;:"&', urldecode=False)[source]
Remove target characters from input string.
- Parameters:
inp – (str)
chars – (str) characters to remove [default: utils.PUNCS_WE_DONT_LIKE]
urldecode – (bool) whether to first urldecode the input string [default: False]
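A behavior matching the description above can be sketched with `str.translate`. The function name `strip_chars` is hypothetical, the default character set is copied from the signature shown above, and the real function may differ in details.

```python
from urllib.parse import unquote

# Default character set copied from the remove_chars signature above
PUNCS = '[],.()<>\'/?;:"&'


def strip_chars(inp, chars=PUNCS, urldecode=False):
    """Delete every character in chars from inp, optionally URL-decoding first."""
    if urldecode:
        inp = unquote(inp)
    return inp.translate(str.maketrans("", "", chars))
```

URL-decoding first matters because an encoded character such as %28 would otherwise survive the removal step.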
- metapub.utils.hostname_of(url)[source]
Takes a url (may or may not contain protocol prefix) and returns the simplest base form of the hostname in the supplied URL.
If hostname starts with ‘www.’, this will be stripped out.
Examples
http://www.nature.com/pr/journal/v49/n1/full/pr20018a.html --> nature.com
https://webhome.weizmann.ac.il --> webhome.weizmann.ac.il
https://www.ncbi.nlm.nih.gov/pubmed/17108762 --> ncbi.nlm.nih.gov
- Parameters:
url – (str)
- Return hostname:
(str)
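The documented examples can be reproduced with `urllib.parse`. This is a sketch of equivalent behavior under the name `simple_hostname`, which is an assumption; it is not metapub's code.

```python
from urllib.parse import urlparse


def simple_hostname(url):
    """Return the hostname of url without a leading 'www.'."""
    if "://" not in url:
        # Without a scheme, urlparse needs '//' to recognize the host part
        url = "//" + url
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host
```

Note that `urlparse(...).hostname` lowercases the result, so mixed-case input comes back in lowercase.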
- metapub.utils.rootdomain_of(url)[source]
Returns the root domain of hostname of supplied URL.
Examples
http://blood.oxfordjournals.org --> oxfordjournals.org
https://webhome.weizmann.ac.il --> ac.il
https://regex101.com/ --> regex101.com
https://www.ncbi.nlm.nih.gov/pubmed/17108762 --> nih.gov
- Parameters:
url – (str)
- Return rootdomain:
(str)
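All four documented examples are consistent with simply keeping the last two dot-separated labels of the hostname, which is what the sketch below does. The name `last_two_labels` is an assumption, and whether metapub uses exactly this rule is not confirmed by the text above.

```python
from urllib.parse import urlparse


def last_two_labels(url):
    """Return the last two dot-separated labels of the URL's hostname."""
    if "://" not in url:
        url = "//" + url  # lets urlparse find the host when no scheme is given
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])
```

Because no public-suffix list is consulted, a host under a two-part suffix such as webhome.weizmann.ac.il reduces to ac.il, matching the example above.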
- metapub.utils.asciify(inp)[source]
Nuke all the unicode from orbit. It’s the only way to be sure.
WARNING: this function is mostly used for Python2 compatibility and other legacy stuff, and may be removed in upcoming versions of metapub.
- Parameters:
inp – (str)
- Returns:
string converted to pure, American ASCII
- metapub.utils.squash_spaces(inp)[source]
Convert multiple ‘ ‘ chars to a single space.
- Parameters:
inp – (str)
- Returns:
same string with only one space where multiple spaces were.
- metapub.utils.parameterize(inp, sep='+')[source]
Make strings suitable for submission to GET-based query service.
Strips out the characters named in metapub.utils.PUNCS_WE_DONT_LIKE
If inp is None, return empty string.
- Parameters:
inp – (str or None): input to be parameterized
sep – (str): separator to use in place of spaces (default=’+’)
- Returns:
“parameterized” str
- metapub.utils.deparameterize(inp, sep='+')[source]
Somewhat-undo parameterization in string. Replace separators (sep) with spaces.
- Parameters:
inp – (str)
sep – (str) default: ‘+’
- Returns:
“deparameterized” string
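The pair of functions above can be pictured with the following sketch, which also shows why the reverse step only "somewhat" undoes the forward one. The names `to_param` and `from_param` are hypothetical, the punctuation set is copied from the remove_chars signature, and collapsing repeated whitespace is a choice made for this sketch.

```python
# Character set copied from the remove_chars signature above
PUNCS = '[],.()<>\'/?;:"&'


def to_param(inp, sep="+"):
    """Strip punctuation and join the remaining words with sep."""
    if inp is None:
        return ""
    cleaned = inp.translate(str.maketrans("", "", PUNCS))
    # split() with no arguments also collapses runs of whitespace
    return sep.join(cleaned.split())


def from_param(inp, sep="+"):
    """Replace every separator with a space."""
    return inp.replace(sep, " ")
```

Round-tripping "J. Clin. Invest." gives "J Clin Invest": the spaces return, but the stripped punctuation does not.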