Data Fetcher Classes
====================

The core of Metapub consists of several fetcher classes that provide access to different biomedical databases. All fetchers use the Borg singleton pattern and include comprehensive caching.

**🔄 Borg Singleton Pattern**

Metapub fetchers use the Borg pattern, which means all instances of the same fetcher class share the same state (cache, configuration, etc.). This provides several benefits:

- **Shared cache:** Multiple ``PubMedFetcher()`` instances automatically share cached data
- **Consistent configuration:** API keys and settings apply across all instances
- **Memory efficiency:** No duplicate caches or redundant API calls
- **Consistency:** Safe to use across different parts of your application

.. code-block:: python

   from metapub import PubMedFetcher

   # These two fetchers share the same cache and configuration
   fetch1 = PubMedFetcher()
   fetch2 = PubMedFetcher()

   # Article cached by fetch1 is immediately available to fetch2
   article = fetch1.article_by_pmid('12345678')
   same_article = fetch2.article_by_pmid('12345678')  # Uses cache, no API call

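
For readers unfamiliar with the Borg pattern, the following is a minimal sketch of the general technique in plain Python. It is not Metapub's actual implementation; ``DemoFetcher`` and its ``get`` method are hypothetical names used only for illustration, and the "API call" is simulated with a string.

.. code-block:: python

   class DemoFetcher:
       """Hypothetical fetcher: every instance shares one attribute dict."""

       _shared_state = {}

       def __init__(self):
           # Point this instance's attribute dict at the class-level dict.
           self.__dict__ = self._shared_state
           # Create the cache once; later instances reuse the same object.
           self.__dict__.setdefault("cache", {})

       def get(self, key):
           if key not in self.cache:
               # Stands in for a real API call.
               self.cache[key] = f"record-{key}"
           return self.cache[key]


   a = DemoFetcher()
   b = DemoFetcher()
   a.get("12345678")

   print(a is b)                  # False: two distinct objects
   print("12345678" in b.cache)   # True: the cache is shared

Unlike a classic singleton, each call to the constructor returns a new object; only the state is shared.
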
PubMedFetcher
-------------

.. currentmodule:: metapub

.. autoclass:: PubMedFetcher
   :members:
   :show-inheritance:

The PubMedFetcher is the primary interface for accessing PubMed literature via NCBI's E-utilities API. It provides methods for:

* **Article retrieval** by PMID, DOI, or PMC ID
* **Literature searches** with complex query support
* **Citation-based lookups** for bibliographic matching
* **Related article discovery** using NCBI's eLink service

**NCBI E-utilities Documentation:** `PubMed E-utilities `_ | `PubMed Search Field Descriptions `_

Key Methods
~~~~~~~~~~~

.. automethod:: PubMedFetcher.__init__
.. automethod:: PubMedFetcher.article_by_pmid
.. automethod:: PubMedFetcher.article_by_doi
.. automethod:: PubMedFetcher.article_by_pmcid
.. automethod:: PubMedFetcher.pmids_for_query
.. automethod:: PubMedFetcher.pmids_for_citation
.. automethod:: PubMedFetcher.related_pmids

Example Usage
~~~~~~~~~~~~~

.. code-block:: python

   from metapub import PubMedFetcher

   # Initialize fetcher
   fetch = PubMedFetcher()

   # Get specific article
   article = fetch.article_by_pmid('33157158')
   print(f"Title: {article.title}")
   print(f"Journal: {article.journal}")
   print(f"DOI: {article.doi}")

   # Search for articles
   pmids = fetch.pmids_for_query(
       query='CRISPR gene editing',
       since='2020/01/01',
       retmax=100
   )

   # Citation-based lookup
   citation_pmids = fetch.pmids_for_citation(
       journal='Nature',
       year=2023,
       volume=615,
       first_page=123,
       aulast='Smith'
   )

MedGenFetcher
-------------

.. autoclass:: MedGenFetcher
   :members:
   :show-inheritance:

The MedGenFetcher provides access to NCBI's MedGen database for medical genetics concepts and disease-gene relationships.

**NCBI MedGen Documentation:** `MedGen Database `_ | `MedGen Help `_

Key Methods
~~~~~~~~~~~

.. automethod:: MedGenFetcher.__init__
.. automethod:: MedGenFetcher.uids_by_term
.. automethod:: MedGenFetcher.concept_by_uid
.. automethod:: MedGenFetcher.concept_by_cui
.. automethod:: MedGenFetcher.uid_for_cui
.. automethod:: MedGenFetcher.pubmeds_for_cui

Example Usage
~~~~~~~~~~~~~

.. code-block:: python

   from metapub import MedGenFetcher

   # Initialize fetcher
   mg = MedGenFetcher()

   # Search for genetic condition
   uids = mg.uids_by_term('Brugada syndrome')

   # Get detailed concept information
   for uid in uids[:3]:  # First 3 results
       concept = mg.concept_by_uid(uid)
       print(f"Name: {concept.name}")
       print(f"CUI: {concept.cui}")
       print(f"Definition: {concept.definition}")

       # Get related literature
       pmids = mg.pubmeds_for_cui(concept.cui)
       print(f"Related papers: {len(pmids)}")

ClinVarFetcher
--------------

.. autoclass:: ClinVarFetcher
   :members:
   :show-inheritance:

The ClinVarFetcher provides access to NCBI's ClinVar database for clinical significance of genetic variants.

**NCBI ClinVar Documentation:** `ClinVar Database `_ | `ClinVar API Guide `_

**Note:** Unlike ClinVar's own clinical significance classes, ``clinical_significance`` values are all lowercase. This was a deliberate decision, documented further `here `_

Key Methods
~~~~~~~~~~~

.. automethod:: ClinVarFetcher.__init__
.. automethod:: ClinVarFetcher.ids_by_gene
.. automethod:: ClinVarFetcher.variant
.. automethod:: ClinVarFetcher.pmids_for_id
.. automethod:: ClinVarFetcher.pmids_for_hgvs

Example Usage
~~~~~~~~~~~~~

.. code-block:: python

   from metapub import ClinVarFetcher

   # Initialize fetcher
   cv = ClinVarFetcher()

   # Find variants for a gene
   variant_ids = cv.ids_by_gene('BRCA1', single_gene=True)

   # Get detailed variant information
   for var_id in variant_ids[:5]:  # First 5 variants
       variant = cv.variant(var_id)
       print(f"Accession: {variant.accession}")
       print(f"HGVS: {variant.hgvs_c}")
       print(f"Clinical significance: {variant.clinical_significance}")
       print(f"Molecular consequences: {variant.molecular_consequences}")

       # Get supporting literature
       pmids = cv.pmids_for_id(var_id)
       print(f"Supporting papers: {len(pmids)}")

CrossRefFetcher
---------------

.. autoclass:: CrossRefFetcher
   :members:
   :show-inheritance:

The CrossRefFetcher provides access to CrossRef's API for DOI resolution and publication metadata when PubMed data is incomplete.

**CrossRef API Documentation:** `CrossRef REST API `_ | `Works API Reference `_

Example Usage
~~~~~~~~~~~~~

.. code-block:: python

   from metapub import CrossRefFetcher, PubMedFetcher

   # Initialize fetchers
   fetch = PubMedFetcher()
   cr = CrossRefFetcher()

   # Get article that might be missing DOI in PubMed
   article = fetch.article_by_pmid('12345678')

   if not article.doi:
       # Try CrossRef as fallback
       work = cr.article_by_pma(article)
       if work and work.score > 80:  # High confidence match
           print(f"Found DOI via CrossRef: {work.doi}")
           print(f"Match score: {work.score}")

Advanced Configuration
----------------------

Custom Cache Directory
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   import os

   from metapub import PubMedFetcher

   # Set custom cache directory
   os.environ['METAPUB_CACHE_DIR'] = '/path/to/large/cache'

   # Or specify per-fetcher
   fetch = PubMedFetcher(cachedir='/custom/cache/path')

NCBI API Key Setup
~~~~~~~~~~~~~~~~~~

**📈 Why Use an API Key?**

NCBI provides free API keys that increase your rate limits from 3 to 10 requests per second, essential for production applications and large-scale data collection.

**🔑 Getting Your API Key**

1. **Apply for a key:** `NCBI API Key Registration `_
2. **No approval needed** - keys are issued immediately
3. **Free for academic and commercial use**

**⚙️ Configuration Options**

.. code-block:: python

   import os

   from metapub import PubMedFetcher

   # Method 1: Environment variable (recommended)
   os.environ['NCBI_API_KEY'] = 'your_api_key_here'

   # Method 2: Direct parameter
   fetch = PubMedFetcher(api_key='your_api_key_here')

   # Method 3: Config file
   # Create ~/.metapub/config with:
   # [DEFAULT]
   # ncbi_api_key = your_api_key_here

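
The config file layout in Method 3 is standard INI syntax. As a quick sanity check that a file in this layout is well-formed, it can be parsed with Python's standard ``configparser`` module. This sketch only demonstrates the file format; how Metapub itself reads the file is not shown here and is not assumed.

.. code-block:: python

   import configparser

   # Same layout as the config file described above (placeholder key value).
   CONFIG_TEXT = "[DEFAULT]\nncbi_api_key = your_api_key_here\n"

   parser = configparser.ConfigParser()
   parser.read_string(CONFIG_TEXT)

   # fallback=None returns None instead of raising if the option is missing.
   api_key = parser.get("DEFAULT", "ncbi_api_key", fallback=None)
   print(api_key)
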

**🚀 Rate Limit Benefits**

- **Without API key:** 3 requests/second
- **With API key:** 10 requests/second
- **Large datasets:** roughly 3x higher request throughput (10 vs. 3 requests/second)
- **Production reliability:** Reduced throttling errors

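
To make the rate limits concrete, the following back-of-the-envelope calculation estimates the minimum wall-clock time for a job. It assumes one record per request, no cache hits, and that the rate limit is the only bottleneck; ``min_seconds`` is a hypothetical helper, not part of Metapub.

.. code-block:: python

   def min_seconds(n_requests, requests_per_second):
       """Lower bound on run time when the rate limit is the only bottleneck."""
       return n_requests / requests_per_second


   n = 30_000  # example job size

   without_key = min_seconds(n, 3)    # 10000.0 seconds, about 2.8 hours
   with_key = min_seconds(n, 10)      # 3000.0 seconds, 50 minutes
   speedup = without_key / with_key   # about 3.33

   print(f"Without key: {without_key / 60:.0f} min")
   print(f"With key:    {with_key / 60:.0f} min")
   print(f"Speedup:     {speedup:.2f}x")
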
Error Handling Patterns
~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from metapub import PubMedFetcher
   from metapub.exceptions import MetaPubError, InvalidPMID, NCBIServiceError

   fetch = PubMedFetcher()

   try:
       article = fetch.article_by_pmid('12345678')
   except InvalidPMID:
       print("Invalid PMID provided")
   except NCBIServiceError as e:
       print(f"NCBI service issue: {e.user_message}")
       print(f"Suggested actions: {e.suggested_actions}")
   except MetaPubError as e:
       print(f"General MetaPub error: {e}")

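
Service errors are often temporary, so one common companion to the pattern above is retrying with exponential backoff. The sketch below is generic and self-contained: ``with_retries``, ``TransientError``, and ``flaky_lookup`` are hypothetical names, and treating a given Metapub exception as retryable is an assumption you would need to verify for your use case.

.. code-block:: python

   import time


   class TransientError(Exception):
       """Stand-in for a temporary service failure (hypothetical)."""


   def with_retries(func, *args, attempts=3, base_delay=1.0,
                    retry_on=(TransientError,), sleep=time.sleep):
       """Call func(*args), retrying on the given exceptions.

       The delay doubles after each failed attempt: 1s, 2s, 4s, ...
       The final failure is re-raised to the caller.
       """
       for attempt in range(attempts):
           try:
               return func(*args)
           except retry_on:
               if attempt == attempts - 1:
                   raise
               sleep(base_delay * 2 ** attempt)


   # Demo: a lookup that fails twice, then succeeds.
   calls = []

   def flaky_lookup(pmid):
       calls.append(pmid)
       if len(calls) < 3:
           raise TransientError("temporarily unavailable")
       return f"article-{pmid}"

   delays = []  # record delays instead of actually sleeping
   result = with_retries(flaky_lookup, "12345678", sleep=delays.append)
   print(result, delays)

In real use, ``retry_on`` would name whichever exception types you consider transient, and the default ``sleep`` would pause between attempts.
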
Performance Considerations
--------------------------

Batch Processing
~~~~~~~~~~~~~~~~

.. code-block:: python

   from metapub import PubMedFetcher

   fetch = PubMedFetcher()

   # Process large lists efficiently
   pmids = ['12345678', '23456789', '34567890']  # ... many more

   for i, pmid in enumerate(pmids):
       if i % 100 == 0:
           print(f"Progress: {i}/{len(pmids)}")

       try:
           article = fetch.article_by_pmid(pmid)
           # Process article...
       except Exception as e:
           print(f"Error with {pmid}: {e}")
           continue

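
For long runs it can help to collect failures rather than only printing them, so failed PMIDs can be retried or reported afterwards. The following is a sketch under that assumption: ``fetch_all`` and ``fake_fetch`` are hypothetical names, and the fetch function is passed in as a parameter so the example runs without network access.

.. code-block:: python

   def fetch_all(pmids, fetch_one):
       """Fetch each unique PMID once; return (results, failures) dicts."""
       results, failures = {}, {}
       for pmid in dict.fromkeys(pmids):  # drop duplicates, keep order
           try:
               results[pmid] = fetch_one(pmid)
           except Exception as exc:
               failures[pmid] = str(exc)
       return results, failures


   # Demo with a stand-in fetch function instead of a real fetcher method.
   def fake_fetch(pmid):
       if not pmid.isdigit():
           raise ValueError(f"not a numeric PMID: {pmid}")
       return {"pmid": pmid}

   results, failures = fetch_all(["12345678", "bad-id", "12345678"], fake_fetch)
   print(f"{len(results)} fetched, {len(failures)} failed")

With a real fetcher, ``fetch_one`` would be a bound method such as ``fetch.article_by_pmid``.
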

Cache Warming
~~~~~~~~~~~~~

.. code-block:: python

   from metapub import PubMedFetcher

   fetch = PubMedFetcher()

   # Pre-warm cache for known PMIDs
   def warm_cache(pmid_list):
       for pmid in pmid_list:
           try:
               # Just accessing loads into cache
               fetch.article_by_pmid(pmid)
           except Exception:
               continue