API Overview
Metapub provides a Python API for accessing biomedical literature and databases. The library is organized into several core modules, each covering a distinct area of functionality.
Core Modules
Data Retrieval Classes
These are the primary classes for fetching data from NCBI databases and CrossRef:
- PubMedFetcher
Primary interface for PubMed/NCBI literature searches. Supports article retrieval by PMID, DOI, PMC ID, and complex query searches.
- MedGenFetcher
Access to NCBI’s MedGen database for medical genetics concepts, disease-gene relationships, and clinical phenotypes.
- ClinVarFetcher
Interface to the ClinVar database for the clinical significance of genetic variants and variant-literature associations.
- CrossRefFetcher
CrossRef API integration for DOI resolution and publication metadata when PubMed data is incomplete.
Data Model Classes
These classes represent structured data returned by the fetcher classes:
- PubMedArticle
Rich representation of a scientific article with automatic parsing of titles, authors, abstracts, MeSH terms, and bibliographic details.
- MedGenConcept
Medical genetics concept with CUI identifiers, definitions, synonyms, and related literature.
- ClinVarVariant
Clinical variant with HGVS notation, clinical significance, molecular consequences, and supporting evidence.
Full-Text Discovery
- FindIt
System for locating full-text PDFs using publisher-specific strategies. Supports 68+ major publishers (97.1% coverage) with embargo detection, CrossRef API integration, and legal access verification. Includes a pre-populated journal registry so it works out of the box.
Utility Functions
Text Mining and Validation
Conversion and Citation
Error Handling
Common Usage Patterns
Basic Article Retrieval
from metapub import PubMedFetcher
# Initialize fetcher (singleton pattern)
fetch = PubMedFetcher()
# Get article by PMID
article = fetch.article_by_pmid('12345678')
print(f"{article.title} - {article.journal} ({article.year})")
Literature Search
# Search for articles
pmids = fetch.pmids_for_query('machine learning genomics', retmax=50)
# Process results
for pmid in pmids:
    article = fetch.article_by_pmid(pmid)
    print(f"PMID {pmid}: {article.title}")
Full-Text Discovery
from metapub import FindIt
# Find PDF for an article
src = FindIt('12345678') # PMID
if src.url:
    print(f"PDF available: {src.url}")
else:
    print(f"No access: {src.reason}")
Medical Genetics Research
from metapub import MedGenFetcher, ClinVarFetcher
# Research genetic condition
mg = MedGenFetcher()
concepts = mg.concepts_for_term('cystic fibrosis')
# Find clinical variants
cv = ClinVarFetcher()
variants = cv.variants_for_gene('CFTR')
Architecture Notes
Singleton Pattern
Most fetcher classes use the Borg pattern, a shared-state variant of the singleton: instances are distinct objects, but all of them share the same state and cache. This avoids redundant setup and keeps caching consistent across your application.
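To make the shared-state idea concrete, here is a minimal sketch of the Borg idiom in plain Python. It is not Metapub's actual implementation; the class names `Borg` and `DemoFetcher` are hypothetical and exist only for illustration.

```python
class Borg:
    """All instances share one attribute dictionary (the Borg idiom)."""

    _shared_state = {}

    def __init__(self):
        # Point this instance's __dict__ at the class-level shared dict.
        self.__dict__ = self._shared_state


class DemoFetcher(Borg):
    """Hypothetical fetcher, used only to illustrate shared state."""

    def __init__(self):
        super().__init__()
        # Create the cache once; later instances reuse the same dict.
        if "cache" not in self.__dict__:
            self.cache = {}


a = DemoFetcher()
b = DemoFetcher()
a.cache["12345678"] = "cached article"
print(a is b)               # distinct objects
print(b.cache["12345678"])  # but the cache entry is visible from both
```

In this sketch every subclass of `Borg` shares the same dictionary unless it defines its own `_shared_state`, which is one design choice a real library may handle differently.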
Caching Strategy
- SQLite-based caching for all API responses
- Configurable cache directories via environment variables
- TTL-based cache expiration to ensure data freshness
- Cache warming capabilities for batch processing
Error Handling
- Intelligent error diagnosis distinguishes between service outages and code issues
- Automatic retry logic for transient network failures
- Comprehensive exception hierarchy for specific error handling
- Graceful degradation when services are unavailable
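Retrying only transient failures, while letting other errors surface immediately, might look like the following sketch. `TransientError` and `with_retries` are hypothetical names chosen for this example; Metapub's own exception classes and retry policy may differ.

```python
import time


class TransientError(Exception):
    """Hypothetical: a failure worth retrying (timeout, 5xx response)."""


def with_retries(func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(); retry on TransientError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * 2 ** attempt)


calls = []


def flaky():
    # Fails twice, then succeeds, to simulate a brief outage.
    calls.append(1)
    if len(calls) < 3:
        raise TransientError("timeout")
    return "ok"


print(with_retries(flaky, sleep=lambda seconds: None))
```

Any exception other than `TransientError` propagates on the first attempt, which keeps genuine code bugs from being hidden behind retries.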
API Keys and Rate Limiting
- NCBI API key support via environment variables for higher rate limits
- Built-in rate limiting respects NCBI guidelines
- Request batching for efficient bulk operations
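As a rough illustration of how an API key can raise the allowed request rate, the sketch below picks a rate from the environment and enforces a minimum interval between calls. The variable name `NCBI_API_KEY` and the `RateLimiter` class are assumptions made for this example; the 3 and 10 requests-per-second figures follow NCBI's published E-utilities guidance.

```python
import os
import time


class RateLimiter:
    """Enforce a minimum interval between calls (illustrative only)."""

    def __init__(self, per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / per_second
        self.clock = clock
        self.sleep = sleep
        self._last = None

    def wait(self):
        now = self.clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self._last = now


def requests_per_second(env=os.environ):
    # Assumed variable name; 3/s without a key, 10/s with one.
    return 10 if env.get("NCBI_API_KEY") else 3


limiter = RateLimiter(requests_per_second())
for _ in range(3):
    limiter.wait()  # call before each outgoing request
```

The clock and sleep functions are injectable so the limiter can be exercised in tests without real delays.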
See Also
- Quick Start Guide - Getting started with basic usage
- Advanced Usage - Advanced patterns and publisher-specific features
- Tutorials - Complete workflows for research tasks
- Examples - Practical code examples and patterns