Data Fetcher Classes

The core of Metapub consists of several fetcher classes that provide access to different biomedical databases. All fetchers use the Borg singleton pattern and include comprehensive caching.

🔄 Borg Singleton Pattern

Metapub fetchers use the Borg pattern, which means all instances of the same fetcher class share the same state (cache, configuration, etc.). This provides several benefits:

  • Shared cache: Multiple PubMedFetcher() instances automatically share cached data

  • Consistent configuration: API keys and settings apply across all instances

  • Memory efficiency: No duplicate caches or redundant API calls

  • Consistency: Safe to use across different parts of your application

from metapub import PubMedFetcher

# These two fetchers share the same cache and configuration
fetch1 = PubMedFetcher()
fetch2 = PubMedFetcher()

# Article cached by fetch1 is immediately available to fetch2
article = fetch1.article_by_pmid('12345678')
same_article = fetch2.article_by_pmid('12345678')  # Uses cache, no API call
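
For readers unfamiliar with the pattern, here is a minimal, generic sketch of a Borg class in Python. It is not Metapub's actual implementation: the names `Borg`, `DemoFetcher`, and `cache` are illustrative assumptions chosen only to show how separate instances end up sharing one state dictionary.

```python
class Borg:
    """Generic Borg base: every instance shares one attribute dictionary."""

    _shared_state = {}

    def __init__(self):
        # Rebinding __dict__ makes all instances read and write the same state.
        self.__dict__ = self._shared_state


class DemoFetcher(Borg):
    """Hypothetical fetcher used only to illustrate shared state."""

    def __init__(self):
        super().__init__()
        # Create the cache once; later instances reuse the existing one.
        if "cache" not in self.__dict__:
            self.cache = {}


first = DemoFetcher()
second = DemoFetcher()
first.cache["12345678"] = "cached article"

print(first is second)           # False: two distinct objects
print(second.cache["12345678"])  # cached article: the state is shared
```

Unlike a classic singleton, the objects themselves stay distinct; only their attributes are shared, which is why configuration set through one instance is visible through every other.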

PubMedFetcher

class metapub.PubMedFetcher(a Borg singleton object backed by an optional SQLite cache)[source]

Bases: Borg

An interaction layer that queries a specified service method and returns PubMedArticle objects.

Currently available methods: eutils

Basic Usage:

fetch = PubMedFetcher()

To specify a service method (more coming soon):

fetch = PubMedFetcher('eutils')

To return an article by querying the service with a known PMID or NCBI Book ID:

paper = fetch.article_by_pmid('123456')
book = fetch.article_by_pmid('NBK1234')

Similar methods exist for returning papers by DOI and PubMed Central ID:

paper = fetch.article_by_doi('10.1038/ng.379')
paper = fetch.article_by_pmcid('PMC3458974')

Finally, you can search for PMIDs via citation details by using the pmids_for_citation method, for which you usually only need 3 out of 5 details to triangulate on a good result.

pmids = fetch.pmids_for_citation(journal='Science', year='2008', volume='4',
                                 first_page='7', author_name='Grant')

__init__(method='eutils', **kwargs)[source]

Initialize PubMedFetcher with specified service method.

Parameters:
  • method (str, optional) – Service method to use. Currently only ‘eutils’ is supported. Defaults to ‘eutils’.

  • **kwargs – Additional keyword arguments.

    cachedir (str, optional): Custom directory for caching responses. If not provided, uses default cache directory.

Raises:

NotImplementedError – If an unsupported method is specified.

Note

This is a Borg singleton - all instances share the same state.

pmids_for_clinical_query(query, category, optimization='broad', since=None, until=None, retstart=0, retmax=250, pmc_only=False, **kwargs)[source]

Takes a query and a category (required, see below) and returns a list of pubmed IDs returned by NCBI for that query.

See also PubMedFetcher.pmids_for_query for other parameters.

Available categories:

  • therapy

  • diagnosis

  • etiology

  • prognosis

  • prediction

Available optimizations:

  • broad (default)

  • narrow

Parameters:
  • query (string)

  • category (string)

  • optimization (string) [default: broad]

Returns:

list of pubmed IDs
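
The documented categories and optimizations can be checked locally before a request is sent. The sketch below is one way to do that. The helper `clinical_pmids` and the two constant sets are hypothetical names introduced here, not part of Metapub; the only library call it relies on is `pmids_for_clinical_query` with the signature shown above.

```python
CLINICAL_CATEGORIES = {"therapy", "diagnosis", "etiology", "prognosis", "prediction"}
CLINICAL_OPTIMIZATIONS = {"broad", "narrow"}


def clinical_pmids(fetcher, query, category, optimization="broad", **kwargs):
    """Validate arguments locally, then delegate to the fetcher.

    `fetcher` is expected to be a PubMedFetcher, or any object exposing
    pmids_for_clinical_query with the documented signature.
    """
    if category not in CLINICAL_CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    if optimization not in CLINICAL_OPTIMIZATIONS:
        raise ValueError(f"unknown optimization: {optimization!r}")
    return fetcher.pmids_for_clinical_query(
        query, category, optimization=optimization, **kwargs
    )
```

With a real fetcher this would be called as `clinical_pmids(PubMedFetcher(), 'Brugada syndrome', 'therapy', optimization='narrow')`. Extra keyword arguments such as `since` or `retmax` pass through unchanged.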

pmids_for_medical_genetics_query(query, category='all', since=None, until=None, retstart=0, retmax=250, pmc_only=False, **kwargs)[source]

Takes a query and a category (see below) and returns a list of pubmed IDs returned by NCBI for that query.

See also PubMedFetcher.pmids_for_query for other parameters.

Available categories:

  • all (default)

  • diagnosis

  • differential_diagnosis

  • clinical_description

  • management

  • genetic_counseling

  • genetic_testing

Parameters:
  • query (string)

  • category (string) [default: all]

Returns:

list of pubmed IDs

pmids_for_citation(**kwargs)[source]
Returns a list of PMIDs for the given citation. Requires at least 3 of these 5 keyword arguments:

  • jtitle or journal (journal title)

  • year or date

  • volume

  • spage or first_page (starting page / first page)

  • aulast (first author's last name) or author1_first_lastfm (as produced by PubMedArticle class)

Strings submitted for journal/jtitle will be run through metapub.utils.remove_chars to deal with HTML-encoded characters and to remove punctuation.
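
To check the "3 out of 5" requirement before making a call, the keyword names can be grouped by the citation detail they describe. This is a sketch only: `CITATION_DETAIL_GROUPS` and `count_citation_details` are hypothetical helpers built from the keyword list above, and they do not reflect how Metapub validates its arguments internally.

```python
# Each group lists the interchangeable keyword names for one citation detail.
CITATION_DETAIL_GROUPS = [
    ("jtitle", "journal"),
    ("year", "date"),
    ("volume",),
    ("spage", "first_page"),
    ("aulast", "author1_first_lastfm"),
]


def count_citation_details(**kwargs):
    """Return how many of the five citation details have a non-empty value."""
    return sum(
        1
        for names in CITATION_DETAIL_GROUPS
        if any(kwargs.get(name) not in (None, "") for name in names)
    )


citation = {"journal": "Science", "year": "2008", "first_page": "7"}
print(count_citation_details(**citation))  # 3
```

Supplying both names from one group (for example `journal` and `jtitle`) still counts as a single detail.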

related_pmids(pmid)[source]

For the supplied pmid, return the IDs of related PubMed articles, organized into a dictionary keyed by type of relation. The keys include:

  • pubmed (all related links)

  • citedin (papers that cited this paper)

  • five (the “five” that pubmed displays as the top related results)

  • reviews (review papers that cite this paper)

  • combined (?)

query example: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?retmode=xml&dbfrom=pubmed&id=14873513&cmd=neighbor

Raises:

NCBIServiceError if NCBI ELink service is down
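
The example query above can be rebuilt with the standard library, which is handy when inspecting the raw ELink response in a browser. The function name `elink_neighbor_url` is an assumption made for this sketch. It only constructs the URL, sends no request, and makes no claim about how `related_pmids` builds its requests internally.

```python
from urllib.parse import urlencode

ELINK_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def elink_neighbor_url(pmid):
    """Build the ELink 'neighbor' URL for a PubMed ID (no request is sent)."""
    params = {
        "retmode": "xml",
        "dbfrom": "pubmed",
        "id": str(pmid),
        "cmd": "neighbor",
    }
    return f"{ELINK_BASE}?{urlencode(params)}"


print(elink_neighbor_url(14873513))
```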

pmid_for_bookID(book_id)[source]

For supplied NCBI Book ID, use the pubmed advanced query API to find its PMID.

Not all NCBI Books have PMIDs. If there is no associated PMID, this returns None.

Parameters:

book_id – (str) e.g. “NBK2020”

Returns:

(str) or None – PMID if found, None otherwise.
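
Because the method returns None for books without a PMID, callers should handle both outcomes. The following sketch separates the two cases for a list of Book IDs. `resolve_book_pmids` is a hypothetical helper, and `fetcher` is assumed to be a PubMedFetcher or any object with a compatible `pmid_for_bookID` method.

```python
def resolve_book_pmids(fetcher, book_ids):
    """Map each NCBI Book ID to its PMID, separating books without one."""
    found, missing = {}, []
    for book_id in book_ids:
        pmid = fetcher.pmid_for_bookID(book_id)
        if pmid is None:
            missing.append(book_id)
        else:
            found[book_id] = pmid
    return found, missing
```

The `missing` list can then be logged or retried through another lookup path instead of raising on the first book that has no PMID.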

The PubMedFetcher is the primary interface for accessing PubMed literature via NCBI’s E-utilities API. It provides methods for:

  • Article retrieval by PMID, DOI, or PMC ID

  • Literature searches with complex query support

  • Citation-based lookups for bibliographic matching

  • Related article discovery using NCBI’s eLink service

NCBI E-utilities Documentation: PubMed E-utilities | PubMed Search Field Descriptions

Example Usage

from metapub import PubMedFetcher

# Initialize fetcher
fetch = PubMedFetcher()

# Get specific article
article = fetch.article_by_pmid('33157158')
print(f"Title: {article.title}")
print(f"Journal: {article.journal}")
print(f"DOI: {article.doi}")

# Search for articles
pmids = fetch.pmids_for_query(
    query='CRISPR gene editing',
    since='2020/01/01',
    retmax=100
)

# Citation-based lookup
citation_pmids = fetch.pmids_for_citation(
    journal='Nature',
    year=2023,
    volume=615,
    first_page=123,
    aulast='Smith'
)

MedGenFetcher

class metapub.MedGenFetcher(a Borg singleton object)[source]

Bases: Borg

An interaction layer for querying to return MedGenConcept objects.

Currently available methods: eutils

Basic Usage:

fetch = MedGenFetcher()

To specify a service method (more coming soon):

fetch = MedGenFetcher('eutils')

To return a MedGenConcept from a known UID:

concept = fetch.concept_by_uid(known_UID)

To return a list of UIDs relevant to a given term known in medgen:

uids = fetch.uids_by_term(some_term)

To get a medgen UID given a known Concept ID (cui):

uid = fetch.uid_for_cui(known_cui)

__init__(method='eutils', cachedir='default')[source]

Initialize MedGenFetcher for medical genetics concept retrieval.

Parameters:
  • method (str, optional) – Service method to use. Currently only ‘eutils’ is supported. Defaults to ‘eutils’.

  • cachedir (str, optional) – Directory for caching responses. Use ‘default’ for system cache directory. Defaults to ‘default’.

Raises:

NotImplementedError – If an unsupported method is specified.

Note

This is a Borg singleton - all instances share the same state. Provides access to NCBI’s MedGen database for medical genetics concepts, diseases, and gene-phenotype relationships.

The MedGenFetcher provides access to NCBI’s MedGen database for medical genetics concepts and disease-gene relationships.

NCBI MedGen Documentation: MedGen Database | MedGen Help

Example Usage

from metapub import MedGenFetcher

# Initialize fetcher
mg = MedGenFetcher()

# Search for genetic condition
uids = mg.uids_by_term('Brugada syndrome')

# Get detailed concept information
for uid in uids[:3]:  # First 3 results
    concept = mg.concept_by_uid(uid)
    print(f"Name: {concept.name}")
    print(f"CUI: {concept.cui}")
    print(f"Definition: {concept.definition}")

    # Get related literature
    pmids = mg.pubmeds_for_cui(concept.cui)
    print(f"Related papers: {len(pmids)}")

ClinVarFetcher

class metapub.ClinVarFetcher(a Borg singleton object)[source]

Bases: Borg

Toolkit for retrieval of ClinVar information.

Set optional ‘cachedir’ parameter to absolute path of preferred directory if desired; cachedir defaults to <current user directory> + /.cache

clinvar = ClinVarFetcher()

clinvar = ClinVarFetcher(cachedir='/path/to/cachedir')

Usage

Get ClinVar accession IDs for a gene name. Set single_gene to True to filter out results that involve more genes than the one being searched (default: False).

cv_ids = clinvar.ids_by_gene('FGFR3', single_gene=True)

Get ClinVar accession in python dictionary format for given ID:

cv_subm = clinvar.accession(65533) # can also submit ID as string

Get list of pubmed IDs (pmids) for given ClinVar accession ID:

pmids = clinvar.pmids_for_id(65533) # can also submit ID as string

Get list of pubmed IDs (pmids) for hgvs string:

pmids = clinvar.pmids_for_hgvs('NM_017547.3:c.1289A>G')

For more info, see the ClinVar eutils page: https://www.ncbi.nlm.nih.gov/clinvar/docs/maintenance_use/

__init__(method='eutils', cachedir='default')[source]

Initialize ClinVarFetcher for clinical variant data retrieval.

Parameters:
  • method (str, optional) – Service method to use. Currently only ‘eutils’ is supported. Defaults to ‘eutils’.

  • cachedir (str, optional) – Directory for caching responses. Use ‘default’ for system cache directory. Defaults to ‘default’.

Raises:

NotImplementedError – If an unsupported method is specified.

Note

This is a Borg singleton - all instances share the same state. Provides access to NCBI’s ClinVar database for clinical significance of genetic variants, gene-disease relationships, and variant literature.

The ClinVarFetcher provides access to NCBI’s ClinVar database for clinical significance of genetic variants.

NCBI ClinVar Documentation: ClinVar Database | ClinVar API Guide

Note: unlike the ClinVar clinical significance classes, clinical_significance values are in all lowercase. This was a conscious decision documented further here
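
When comparing against labels copied from the ClinVar website, a case-insensitive comparison avoids mismatches caused by the lowercase convention. This is a small sketch under the assumption that the value is a plain string. `has_significance` is a hypothetical helper, not a Metapub function.

```python
def has_significance(variant_significance, label):
    """Case-insensitive check of a clinical significance value against a label."""
    return variant_significance.strip().lower() == label.strip().lower()


print(has_significance("pathogenic", "Pathogenic"))            # True
print(has_significance("likely benign", "Likely pathogenic"))  # False
```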

Example Usage

from metapub import ClinVarFetcher

# Initialize fetcher
cv = ClinVarFetcher()

# Find variants for a gene
variant_ids = cv.ids_by_gene('BRCA1', single_gene=True)

# Get detailed variant information
for var_id in variant_ids[:5]:  # First 5 variants
    variant = cv.variant(var_id)
    print(f"Accession: {variant.accession}")
    print(f"HGVS: {variant.hgvs_c}")
    print(f"Clinical significance: {variant.clinical_significance}")
    print(f"Molecular consequences: {variant.molecular_consequences}")

    # Get supporting literature
    pmids = cv.pmids_for_id(var_id)
    print(f"Supporting papers: {len(pmids)}")

CrossRefFetcher

class metapub.CrossRefFetcher(**kwargs)[source]

Bases: Borg

Valid field queries for this route are: affiliation, degree, event-acronym, bibliographic, container-title, publisher-name, author, event-theme, standards-body-acronym, chair, event-location, translator, funder-name, event-name, publisher-location, title, standards-body-name, contributor, editor, event-sponsor

__init__(**kwargs)[source]

article_by_doi(doi)[source]

Returns a CrossRefWork object loaded by querying the Crossref works/DOI REST endpoint.

Parameters:

doi – (str)

Return type:

CrossRefWork

Raises:
  • HTTPError (404) – If the DOI is not found.

  • Exception – For network/service issues.

article_by_pma(pma, ideal_ld=0.95, min_ld=0.8)[source]

From a PubMedArticle object, use as much info as needed to get as precise a match on CrossRef as is possible.

1st attempt: Title + Journal. Runs Levenshtein distance on results; if any results have a better similarity ratio than ideal_ld, the top of these results will be returned. Otherwise, the first item with a score better than min_ld will be kept and compared against 2nd attempt results.

2nd attempt: Title + First Author. Same process as 1st attempt but with any candidates found in 1st attempt submitted for comparison.

Finally: Return None or CrossRefWork from best candidate that exceeds min_ld requirement.

Parameters:
  • pma – PubMedArticle object

  • ideal_ld – (float) [default: 0.95, set in a global at the top of crossref.py]

  • min_ld – (float) [default: 0.8, set in a global at the top of crossref.py]

Return type:

CrossRefWork or None (if no candidate exceeds min_ld)
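
The two-threshold selection described for a single attempt can be sketched as follows. Note the substitution: this example uses `difflib.SequenceMatcher` from the standard library as a stand-in for a Levenshtein-based ratio, so scores will differ from Metapub's. The names `similarity` and `pick_candidate` are hypothetical, and the code is not taken from crossref.py.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity ratio in [0, 1]; stands in for a Levenshtein-based ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pick_candidate(target_title, candidate_titles, ideal_ld=0.95, min_ld=0.8):
    """Return (title, score, is_ideal) for an acceptable candidate, or None."""
    scored = [(title, similarity(target_title, title)) for title in candidate_titles]

    # Any candidate above ideal_ld wins outright; take the best of them.
    ideal = [item for item in scored if item[1] > ideal_ld]
    if ideal:
        title, score = max(ideal, key=lambda item: item[1])
        return title, score, True

    # Otherwise keep the first candidate above min_ld as a fallback that
    # a later attempt could still beat.
    for title, score in scored:
        if score > min_ld:
            return title, score, False
    return None
```

A caller would run this once per attempt (Title + Journal, then Title + First Author) and carry any fallback from the first attempt into the comparison for the second.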

article_by_title(title, **kwargs)[source]

Use CrossRef to find a work by its title. Returns first item in the list.

Keywords are passed unmodified to crossref.works() [habanero].

Parameters:

title – str

Return type:

CrossRefWork or None (if no results)

The CrossRefFetcher provides access to CrossRef’s API for DOI resolution and publication metadata when PubMed data is incomplete.

CrossRef API Documentation: CrossRef REST API | Works API Reference

Example Usage

from metapub import CrossRefFetcher, PubMedFetcher

# Initialize fetchers
fetch = PubMedFetcher()
cr = CrossRefFetcher()

# Get article that might be missing DOI in PubMed
article = fetch.article_by_pmid('12345678')

if not article.doi:
    # Try CrossRef as fallback
    work = cr.article_by_pma(article)
    if work and work.score > 80:  # High confidence match
        print(f"Found DOI via CrossRef: {work.doi}")
        print(f"Match score: {work.score}")

Advanced Configuration

Custom Cache Directory

import os

# Set custom cache directory
os.environ['METAPUB_CACHE_DIR'] = '/path/to/large/cache'

# Or specify per-fetcher
fetch = PubMedFetcher(cachedir='/custom/cache/path')

NCBI API Key Setup

📈 Why Use an API Key?

NCBI provides free API keys that increase your rate limits from 3 to 10 requests per second, essential for production applications and large-scale data collection.

🔑 Getting Your API Key

  1. Apply for a key: NCBI API Key Registration

  2. No approval needed - keys are issued immediately

  3. Free for academic and commercial use

⚙️ Configuration Options

import os

# Method 1: Environment variable (recommended)
os.environ['NCBI_API_KEY'] = 'your_api_key_here'

# Method 2: Direct parameter
fetch = PubMedFetcher(api_key='your_api_key_here')

# Method 3: Config file
# Create ~/.metapub/config with:
# [DEFAULT]
# ncbi_api_key = your_api_key_here

🚀 Rate Limit Benefits

  • Without API key: 3 requests/second

  • With API key: 10 requests/second

  • Large datasets: up to roughly 3x faster processing for uncached requests

  • Production reliability: Reduced throttling errors
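
As a back-of-the-envelope illustration of these limits, the sketch below computes the minimum wall-clock time for a batch of uncached requests at each rate. The function `min_seconds` and the batch size of 30,000 are illustrative assumptions. Real runs also include network latency, and cached responses need no request at all.

```python
def min_seconds(n_requests, requests_per_second):
    """Lower bound on wall-clock time when limited to a fixed request rate."""
    return n_requests / requests_per_second


n = 30_000
without_key = min_seconds(n, 3)   # 10000.0 seconds
with_key = min_seconds(n, 10)     # 3000.0 seconds
print(f"speed-up: {without_key / with_key:.2f}x")  # speed-up: 3.33x
```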

Error Handling Patterns

from metapub.exceptions import MetaPubError, InvalidPMID, NCBIServiceError

try:
    article = fetch.article_by_pmid('12345678')
except InvalidPMID:
    print("Invalid PMID provided")
except NCBIServiceError as e:
    print(f"NCBI service issue: {e.user_message}")
    print(f"Suggested actions: {e.suggested_actions}")
except MetaPubError as e:
    print(f"General MetaPub error: {e}")

Performance Considerations

Batch Processing

# Process large lists efficiently
pmids = ['12345678', '23456789', '34567890']  # ... many more

for i, pmid in enumerate(pmids):
    if i % 100 == 0:
        print(f"Progress: {i}/{len(pmids)}")

    try:
        article = fetch.article_by_pmid(pmid)
        # Process article...
    except Exception as e:
        print(f"Error with {pmid}: {e}")
        continue

Cache Warming

# Pre-warm cache for known PMIDs
def warm_cache(pmid_list):
    for pmid in pmid_list:
        try:
            # Just accessing loads into cache
            article = fetch.article_by_pmid(pmid)
        except Exception:
            continue