Data Fetcher Classes
The core of Metapub consists of several fetcher classes that provide access to different biomedical databases. All fetchers use the Borg singleton pattern and include comprehensive caching.
🔄 Borg Singleton Pattern
Metapub fetchers use the Borg pattern, which means all instances of the same fetcher class share the same state (cache, configuration, etc.). This provides several benefits:
Shared cache: Multiple PubMedFetcher() instances automatically share cached data
Consistent configuration: API keys and settings apply across all instances
Memory efficiency: No duplicate caches or redundant API calls
Consistency: Safe to use across different parts of your application
# These two fetchers share the same cache and configuration
fetch1 = PubMedFetcher()
fetch2 = PubMedFetcher()
# Article cached by fetch1 is immediately available to fetch2
article = fetch1.article_by_pmid('12345678')
same_article = fetch2.article_by_pmid('12345678') # Uses cache, no API call
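The shared-state behavior can be illustrated with a generic Borg sketch. This is not Metapub's own implementation; the `Borg` base class shown here and the `DemoFetcher` class are hypothetical stand-ins, and the "network call" is a placeholder string.

```python
class Borg:
    """Generic Borg: instances of the same class share one attribute dict."""
    _shared_states = {}

    def __init__(self):
        # Look up (or create) the state dict for this concrete class and
        # use it as the instance's __dict__.
        self.__dict__ = Borg._shared_states.setdefault(type(self), {})


class DemoFetcher(Borg):  # hypothetical stand-in for a fetcher class
    def __init__(self):
        super().__init__()
        # Create the cache only once; later instances reuse it.
        if not hasattr(self, "cache"):
            self.cache = {}

    def get(self, key):
        if key not in self.cache:
            self.cache[key] = f"record-{key}"  # placeholder for a network call
        return self.cache[key]


a = DemoFetcher()
b = DemoFetcher()
a.get("12345678")
print("12345678" in b.cache)  # True: b sees what a cached
print(a is b)                 # False: distinct objects, shared state
```

Unlike a classic singleton, each call to the constructor returns a new object; only the attribute dictionary is shared.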
PubMedFetcher
- class metapub.PubMedFetcher(a Borg singleton object backed by an optional SQLite cache)
Bases: Borg
An interaction layer for querying via a specified method to return PubMedArticle objects.
Currently available methods: eutils
Basic Usage:
fetch = PubMedFetcher()
To specify a service method (more coming soon):
fetch = PubMedFetcher('eutils')
To return an article by querying the service with a known PMID or NCBI Book ID:
paper = fetch.article_by_pmid('123456')
book = fetch.article_by_pmid('NBK1234')
Similar methods exist for returning papers by DOI and PubMed Central ID:
paper = fetch.article_by_doi('10.1038/ng.379')
paper = fetch.article_by_pmcid('PMC3458974')
Finally, you can search for PMIDs by citation details using the pmids_for_citation method, which usually needs only 3 of the 5 details to triangulate on a good result:
pmids = fetch.pmids_for_citation(journal='Science', year='2008', volume='4',
                                 first_page='7', author_name='Grant')
- __init__(method='eutils', **kwargs)
Initialize PubMedFetcher with specified service method.
- Parameters:
method (str, optional) – Service method to use. Currently only 'eutils' is supported. Defaults to 'eutils'.
**kwargs – Additional keyword arguments.
cachedir (str, optional) – Custom directory for caching responses. If not provided, uses the default cache directory.
- Raises:
NotImplementedError – If an unsupported method is specified.
Note
This is a Borg singleton - all instances share the same state.
- pmids_for_clinical_query(query, category, optimization='broad', since=None, until=None, retstart=0, retmax=250, pmc_only=False, **kwargs)
Takes a query and a category (required, see below) and returns a list of PubMed IDs returned by NCBI for that query.
See also PubMedFetcher.pmids_for_query for other parameters.
Available categories: therapy, diagnosis, etiology, prognosis, prediction
Available optimizations: broad (default), narrow
- Parameters:
query (string)
category (string)
optimization (string) [default: broad]
- Returns:
list of PubMed IDs
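Because the category and optimization values are fixed lists, it can help to validate them locally before any request is sent. The following is a minimal sketch: `clinical_pmids` is a hypothetical wrapper, and the only Metapub call it assumes is `pmids_for_clinical_query` with the signature documented above.

```python
# Allowed values, copied from the method documentation above.
CLINICAL_CATEGORIES = {"therapy", "diagnosis", "etiology", "prognosis", "prediction"}
CLINICAL_OPTIMIZATIONS = {"broad", "narrow"}


def clinical_pmids(fetch, query, category, optimization="broad", **kwargs):
    """Validate arguments locally, then delegate to the fetcher.

    `fetch` is any object exposing pmids_for_clinical_query(), for example
    a PubMedFetcher instance. Extra keyword arguments are passed through.
    """
    if category not in CLINICAL_CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    if optimization not in CLINICAL_OPTIMIZATIONS:
        raise ValueError(f"unknown optimization: {optimization!r}")
    return fetch.pmids_for_clinical_query(
        query, category, optimization=optimization, **kwargs
    )
```

With a real fetcher this might be called as `clinical_pmids(PubMedFetcher(), 'Brugada syndrome', 'therapy', optimization='narrow', retmax=20)`.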
- pmids_for_medical_genetics_query(query, category='all', since=None, until=None, retstart=0, retmax=250, pmc_only=False, **kwargs)
Takes a query and a category (see below) and returns a list of PubMed IDs returned by NCBI for that query.
See also PubMedFetcher.pmids_for_query for other parameters.
Available categories: all (default), diagnosis, differential_diagnosis, clinical_description, management, genetic_counseling, genetic_testing
- Parameters:
query (string)
category (string) [default: all]
- Returns:
list of PubMed IDs
- pmids_for_citation(**kwargs)
Returns a list of PMIDs for the given citation. Requires at least 3 of these 5 keyword arguments:
jtitle or journal (journal title)
year or date
volume
spage or first_page (starting page / first page)
aulast (first author's last name) or author1_first_lastfm (as produced by PubMedArticle class)
Strings submitted for journal/jtitle will be run through metapub.utils.remove_chars to deal with HTML-encoded characters and to remove punctuation.
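The three-of-five requirement can be checked before calling the service. This sketch uses only the keyword names listed above; the helper names `count_citation_details` and `has_enough_details` are hypothetical, and the real method may count details differently.

```python
# Each tuple lists the interchangeable keyword names for one citation detail.
DETAIL_GROUPS = [
    ("jtitle", "journal"),
    ("year", "date"),
    ("volume",),
    ("spage", "first_page"),
    ("aulast", "author1_first_lastfm"),
]


def count_citation_details(**kwargs):
    """Return how many of the five detail groups have a non-empty value."""
    return sum(
        1
        for group in DETAIL_GROUPS
        if any(kwargs.get(name) not in (None, "") for name in group)
    )


def has_enough_details(minimum=3, **kwargs):
    """True when at least `minimum` distinct details are supplied."""
    return count_citation_details(**kwargs) >= minimum


print(has_enough_details(journal="Science", year="2008", volume="4"))        # True
print(has_enough_details(journal="Science", jtitle="Science", year="2008"))  # False
```

The second call supplies three keywords but only two distinct details, since journal and jtitle name the same thing.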
The PubMedFetcher is the primary interface for accessing PubMed literature via NCBI’s E-utilities API. It provides methods for:
Article retrieval by PMID, DOI, or PMC ID
Literature searches with complex query support
Citation-based lookups for bibliographic matching
Related article discovery using NCBI’s eLink service
NCBI E-utilities Documentation: PubMed E-utilities | PubMed Search Field Descriptions
Key Methods
- PubMedFetcher.related_pmids(pmid)
For the supplied PMID, returns the IDs of related PubMed articles, organized into a dictionary keyed by type of relation. The keys include:
pubmed (all related links)
citedin (papers that cited this paper)
five (the "five" that PubMed displays as the top related results)
reviews (review papers that cite this paper)
combined (?)
Query example: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?retmode=xml&dbfrom=pubmed&id=14873513&cmd=neighbor
- Raises:
NCBIServiceError – If the NCBI ELink service is down.
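A relation-keyed dictionary like the one described above is easy to post-process. The sketch below assumes each value is a list of PMID strings, which the documentation does not state explicitly; both helper names and the sample data are hypothetical.

```python
def summarize_related(related):
    """Map each relation type to the number of distinct PMIDs listed under it."""
    return {relation: len(set(pmids)) for relation, pmids in related.items()}


def novel_citers(related, already_seen):
    """Return PMIDs under 'citedin' that are not in already_seen, order preserved."""
    seen = set(already_seen)
    return [pmid for pmid in related.get("citedin", []) if pmid not in seen]


# Placeholder data shaped like the documented return value.
sample = {
    "pubmed": ["111", "222", "333"],
    "citedin": ["222", "444"],
    "five": ["111"],
}
print(summarize_related(sample))
print(novel_citers(sample, already_seen=["222"]))
```

With a real fetcher, `sample` would be replaced by the result of `fetch.related_pmids(pmid)`.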
Example Usage
from metapub import PubMedFetcher
# Initialize fetcher
fetch = PubMedFetcher()
# Get specific article
article = fetch.article_by_pmid('33157158')
print(f"Title: {article.title}")
print(f"Journal: {article.journal}")
print(f"DOI: {article.doi}")
# Search for articles
pmids = fetch.pmids_for_query(
    query='CRISPR gene editing',
    since='2020/01/01',
    retmax=100
)
# Citation-based lookup
citation_pmids = fetch.pmids_for_citation(
    journal='Nature',
    year=2023,
    volume=615,
    first_page=123,
    aulast='Smith'
)
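When a search matches more records than one call returns, results can be paged. This is a sketch under an assumption: that pmids_for_query accepts the same retstart and retmax parameters documented for pmids_for_clinical_query. The generator name `iter_pmids` and its `max_pages` safety cap are hypothetical.

```python
def iter_pmids(fetch, query, page_size=250, max_pages=40, **kwargs):
    """Yield PMIDs page by page until a short or empty page is returned.

    `fetch` is any object exposing pmids_for_query(); retstart/retmax
    support is assumed, not confirmed by this page.
    """
    for page in range(max_pages):
        batch = fetch.pmids_for_query(
            query=query, retstart=page * page_size, retmax=page_size, **kwargs
        )
        yield from batch
        # A page shorter than page_size means there is nothing left to fetch.
        if len(batch) < page_size:
            break
```

For example, `list(iter_pmids(PubMedFetcher(), 'CRISPR gene editing', since='2020/01/01'))` would collect every page into one list.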
MedGenFetcher
- class metapub.MedGenFetcher(a Borg singleton object)
Bases: Borg
An interaction layer for querying to return MedGenConcept objects.
Currently available methods: eutils
Basic Usage:
fetch = MedGenFetcher()
To specify a service method (more coming soon):
fetch = MedGenFetcher('eutils')
To return a MedGenConcept from a known UID:
concept = fetch.concept_by_uid(known_UID)
To return a list of UIDs relevant to a given term known in medgen:
uids = fetch.uids_by_term(some_term)
To get a medgen UID given a known Concept ID (cui):
uid = fetch.uid_for_cui(known_cui)
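- __init__(method='eutils', cachedir='default')
Initialize MedGenFetcher for medical genetics concept retrieval.
- Parameters:
method (str, optional) – Service method to use. Defaults to 'eutils'.
cachedir (str, optional) – Cache directory. Defaults to 'default'.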
- Raises:
NotImplementedError – If an unsupported method is specified.
Note
This is a Borg singleton - all instances share the same state. Provides access to NCBI’s MedGen database for medical genetics concepts, diseases, and gene-phenotype relationships.
The MedGenFetcher provides access to NCBI’s MedGen database for medical genetics concepts and disease-gene relationships.
NCBI MedGen Documentation: MedGen Database | MedGen Help
Example Usage
from metapub import MedGenFetcher
# Initialize fetcher
mg = MedGenFetcher()
# Search for genetic condition
uids = mg.uids_by_term('Brugada syndrome')
# Get detailed concept information
for uid in uids[:3]:  # First 3 results
    concept = mg.concept_by_uid(uid)
    print(f"Name: {concept.name}")
    print(f"CUI: {concept.cui}")
    print(f"Definition: {concept.definition}")
    # Get related literature
    pmids = mg.pubmeds_for_cui(concept.cui)
    print(f"Related papers: {len(pmids)}")
ClinVarFetcher
- class metapub.ClinVarFetcher(a Borg singleton object)
Bases: Borg
Toolkit for retrieval of ClinVar information.
Set optional ‘cachedir’ parameter to absolute path of preferred directory if desired; cachedir defaults to <current user directory> + /.cache
clinvar = ClinVarFetcher()
clinvar = ClinVarFetcher(cachedir='/path/to/cachedir')
Usage
Get ClinVar accession IDs for a gene name (set single_gene to True to filter out results containing more genes than the one being searched; default False):
cv_ids = clinvar.ids_by_gene('FGFR3', single_gene=True)
Get ClinVar accession in Python dictionary format for a given ID:
cv_subm = clinvar.accession(65533) # can also submit ID as string
Get list of PubMed IDs (PMIDs) for a given ClinVar accession ID:
pmids = clinvar.pmids_for_id(65533) # can also submit ID as string
Get list of PubMed IDs (PMIDs) for an HGVS string:
pmids = clinvar.pmids_for_hgvs('NM_017547.3:c.1289A>G')
For more info, see the ClinVar eutils page: https://www.ncbi.nlm.nih.gov/clinvar/docs/maintenance_use/
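- __init__(method='eutils', cachedir='default')
Initialize ClinVarFetcher for clinical variant data retrieval.
- Parameters:
method (str, optional) – Service method to use. Defaults to 'eutils'.
cachedir (str, optional) – Cache directory. Defaults to 'default'.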
- Raises:
NotImplementedError – If an unsupported method is specified.
Note
This is a Borg singleton - all instances share the same state. Provides access to NCBI’s ClinVar database for clinical significance of genetic variants, gene-disease relationships, and variant literature.
The ClinVarFetcher provides access to NCBI’s ClinVar database for clinical significance of genetic variants.
NCBI ClinVar Documentation: ClinVar Database | ClinVar API Guide
Note: unlike ClinVar's own clinical significance classes, clinical_significance values are all lowercase. This was a conscious design decision.
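Since the values are lowercase, comparisons against labels copied from the ClinVar website need normalizing first. A minimal sketch follows; both helper names are hypothetical, the set of labels treated as pathogenic is illustrative only, and combined labels such as "Pathogenic/Likely pathogenic" are not handled.

```python
def normalize_significance(label):
    """Lowercase and collapse whitespace so labels compare consistently."""
    return " ".join(label.split()).lower()


def is_pathogenic(label):
    """True for 'pathogenic' and 'likely pathogenic', in any capitalization."""
    return normalize_significance(label) in {"pathogenic", "likely pathogenic"}


print(is_pathogenic("Likely pathogenic"))  # True
print(is_pathogenic("Benign"))             # False
```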
Example Usage
from metapub import ClinVarFetcher
# Initialize fetcher
cv = ClinVarFetcher()
# Find variants for a gene
variant_ids = cv.ids_by_gene('BRCA1', single_gene=True)
# Get detailed variant information
for var_id in variant_ids[:5]:  # First 5 variants
    variant = cv.variant(var_id)
    print(f"Accession: {variant.accession}")
    print(f"HGVS: {variant.hgvs_c}")
    print(f"Clinical significance: {variant.clinical_significance}")
    print(f"Molecular consequences: {variant.molecular_consequences}")
    # Get supporting literature
    pmids = cv.pmids_for_id(var_id)
    print(f"Supporting papers: {len(pmids)}")
CrossRefFetcher
- class metapub.CrossRefFetcher(**kwargs)
Bases: Borg
Valid field queries for this route are: affiliation, degree, event-acronym, bibliographic, container-title, publisher-name, author, event-theme, standards-body-acronym, chair, event-location, translator, funder-name, event-name, publisher-location, title, standards-body-name, contributor, editor, event-sponsor
- article_by_doi(doi)
Returns a CrossRefWork object loaded by querying the Crossref works/DOI REST endpoint.
- Parameters:
doi – (str)
- Return type:
CrossRefWork
- Raises:
HTTPError (404) – If the DOI is not found.
Exception – For network/service issues.
- article_by_pma(pma, ideal_ld=0.95, min_ld=0.8)
From a PubMedArticle object, use as much info as needed to get as precise a match on CrossRef as possible.
- 1st attempt: Title + Journal. Runs Levenshtein distance on results; if any results have a better similarity ratio than ideal_ld, the top of these results will be returned. Otherwise, the first item with a score better than min_ld will be kept and compared against 2nd attempt results.
- 2nd attempt: Title + First Author. Same process as 1st attempt but with any candidates found in 1st attempt submitted for comparison.
Finally: Return None or CrossRefWork from best candidate that exceeds min_ld requirement.
- Parameters:
pma – PubMedArticle object
ideal_ld – (float) [default: 0.95, set in global at top of crossref.py]
min_ld – (float) [default: 0.8, set in global at top of crossref.py]
- Return type:
CrossRefWork or None
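The two-threshold selection described for article_by_pma can be sketched in isolation. This is not the library's code: it swaps Levenshtein distance for the standard library's difflib.SequenceMatcher ratio, works on plain title strings instead of CrossRef results, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio in [0, 1] (difflib, not Levenshtein)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_above(target, titles, threshold):
    """Return (score, title) for the best title scoring above threshold, else None."""
    scored = [(similarity(target, title), title) for title in titles]
    passing = [pair for pair in scored if pair[0] > threshold]
    return max(passing, key=lambda pair: pair[0]) if passing else None


def pick_match(target, first_attempt, second_attempt, ideal=0.95, minimum=0.8):
    """Choose a title using an ideal threshold first, then a minimum fallback."""
    # 1st attempt: return right away on an ideal-quality match.
    hit = best_above(target, first_attempt, ideal)
    if hit:
        return hit[1]
    # Otherwise keep the first acceptable candidate for later comparison.
    kept = next((t for t in first_attempt if similarity(target, t) > minimum), None)
    # 2nd attempt: same process, with the kept candidate included in the pool.
    pool = list(second_attempt) + ([kept] if kept else [])
    hit = best_above(target, pool, ideal) or best_above(target, pool, minimum)
    return hit[1] if hit else None


title = "Deep learning for protein folding"
print(pick_match(title, [title, "An unrelated paper"], []))
```

Keeping a fallback from the first pass means a merely acceptable match is returned only when the second pass finds nothing better.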
- article_by_title(title, **kwargs)
Use CrossRef to find a work by its title. Returns first item in the list.
Keywords are passed unmodified to crossref.works() [habanero].
- Parameters:
title – str
- Return type:
CrossRefWork or None (if no results)
The CrossRefFetcher provides access to CrossRef’s API for DOI resolution and publication metadata when PubMed data is incomplete.
CrossRef API Documentation: CrossRef REST API | Works API Reference
Example Usage
from metapub import CrossRefFetcher, PubMedFetcher
# Initialize fetchers
fetch = PubMedFetcher()
cr = CrossRefFetcher()
# Get article that might be missing DOI in PubMed
article = fetch.article_by_pmid('12345678')
if not article.doi:
    # Try CrossRef as fallback
    work = cr.article_by_pma(article)
    if work and work.score > 80:  # High confidence match
        print(f"Found DOI via CrossRef: {work.doi}")
        print(f"Match score: {work.score}")
Advanced Configuration
Custom Cache Directory
import os
# Set custom cache directory
os.environ['METAPUB_CACHE_DIR'] = '/path/to/large/cache'
# Or specify per-fetcher
fetch = PubMedFetcher(cachedir='/custom/cache/path')
NCBI API Key Setup
📈 Why Use an API Key?
NCBI provides free API keys that increase your rate limits from 3 to 10 requests per second, essential for production applications and large-scale data collection.
🔑 Getting Your API Key
Apply for a key: NCBI API Key Registration
No approval needed - keys are issued immediately
Free for academic and commercial use
⚙️ Configuration Options
import os
# Method 1: Environment variable (recommended)
os.environ['NCBI_API_KEY'] = 'your_api_key_here'
# Method 2: Direct parameter
fetch = PubMedFetcher(api_key='your_api_key_here')
# Method 3: Config file
# Create ~/.metapub/config with:
# [DEFAULT]
# ncbi_api_key = your_api_key_here
🚀 Rate Limit Benefits
Without API key: 3 requests/second
With API key: 10 requests/second
Large datasets: roughly 3.3x faster processing (10 vs. 3 requests/second)
Production reliability: Reduced throttling errors
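The effect of the two rate limits on a large job is simple arithmetic. The sketch below gives a lower bound only: it ignores network latency and cache hits, and the request count of 30,000 is an arbitrary example.

```python
def min_seconds(n_requests, requests_per_second):
    """Lower bound on wall-clock time at a fixed request rate."""
    return n_requests / requests_per_second


n = 30_000
without_key = min_seconds(n, 3)   # 10000.0 seconds, about 2.8 hours
with_key = min_seconds(n, 10)     # 3000.0 seconds, 50 minutes
print(round(without_key / with_key, 2))  # 3.33
```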
Error Handling Patterns
from metapub.exceptions import MetaPubError, InvalidPMID, NCBIServiceError
try:
    article = fetch.article_by_pmid('12345678')
except InvalidPMID:
    print("Invalid PMID provided")
except NCBIServiceError as e:
    print(f"NCBI service issue: {e.user_message}")
    print(f"Suggested actions: {e.suggested_actions}")
except MetaPubError as e:
    print(f"General MetaPub error: {e}")
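Transient service errors are often worth retrying after a pause. The following is a generic retry-with-backoff sketch, not something Metapub provides; `call_with_retry` is a hypothetical helper, and which exceptions count as retryable is left to the caller.

```python
import time


def call_with_retry(func, *args, attempts=3, base_delay=1.0,
                    retry_on=(Exception,), sleep=time.sleep):
    """Call func(*args), retrying on the given exceptions.

    Waits base_delay * 2**n seconds after the n-th failure (n starts at 0)
    and re-raises the last error once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return func(*args)
        except retry_on:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

With the exceptions imported above, a call might look like `call_with_retry(fetch.article_by_pmid, '12345678', retry_on=(NCBIServiceError,))`. Retrying only NCBIServiceError is an assumption about which failures are transient; an InvalidPMID would not benefit from a retry.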
Performance Considerations
Batch Processing
# Process large lists efficiently
pmids = ['12345678', '23456789', '34567890'] # ... many more
for i, pmid in enumerate(pmids):
    if i % 100 == 0:
        print(f"Progress: {i}/{len(pmids)}")
    try:
        article = fetch.article_by_pmid(pmid)
        # Process article...
    except Exception as e:
        print(f"Error with {pmid}: {e}")
        continue
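For long runs it can be more useful to collect failures than to print them, so the failed PMIDs can be retried later. A sketch of that variation on the loop above; `fetch_many` is a hypothetical helper that takes the per-item function as an argument.

```python
def fetch_many(fetch_one, pmids, progress_every=100):
    """Apply fetch_one to each PMID and return (results, failures).

    results maps PMID -> fetched object; failures maps PMID -> exception.
    """
    results, failures = {}, {}
    for i, pmid in enumerate(pmids):
        if progress_every and i % progress_every == 0:
            print(f"Progress: {i}/{len(pmids)}")
        try:
            results[pmid] = fetch_one(pmid)
        except Exception as exc:  # broad on purpose, as in the loop above
            failures[pmid] = exc
    return results, failures
```

A typical call would be `results, failures = fetch_many(fetch.article_by_pmid, pmids)`, followed by a second pass over `list(failures)`.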
Cache Warming
# Pre-warm cache for known PMIDs
def warm_cache(pmid_list):
    for pmid in pmid_list:
        try:
            # Just accessing loads into cache
            article = fetch.article_by_pmid(pmid)
        except Exception:
            continue