Advanced Usage
==============

This section covers advanced usage patterns and features demonstrated in the demo scripts.

FindIt: Publisher-Specific PDF Access
-------------------------------------

FindIt resolves publisher-specific PDF URLs for academic papers.

Basic FindIt Usage
~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from metapub import FindIt
   
   # Basic usage
   src = FindIt('25575644')  # PMID
   
   if src.url:
       print(f"PDF available: {src.url}")
       print(f"Journal: {src.pma.journal}")
   else:
       print(f"No access: {src.reason}")
       if src.backup_url:
           print(f"Backup URL: {src.backup_url}")

Publisher Registry
~~~~~~~~~~~~~~~~~~

FindIt ships with a pre-populated journal registry covering 68+ publishers (97.1% coverage), so it works out of the box with no setup or database initialization:

.. code-block:: python

   # Registry is automatically available - no setup needed
   from metapub.findit.registry import JournalRegistry
   
   registry = JournalRegistry()  # Uses shipped database
   stats = registry.get_stats()
   print(f"Publishers: {stats['publishers']}")
   print(f"Journals: {stats['journals']}")

Advanced FindIt Options
~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # With error retry
   src = FindIt(pmid='12345678', retry_errors=True)
   
   # NIH access mode
   src = FindIt(pmid='12345678', use_nih=True)
   
   # Debug mode for troubleshooting
   src = FindIt(pmid='12345678', debug=True)
   
   # Skip verification for speed
   src = FindIt(pmid='12345678', verify=False)

Embargo Detection
~~~~~~~~~~~~~~~~~

.. code-block:: python

   from metapub import FindIt
   
   src = FindIt('25575644')
   
   # Check embargo status
   embargo_date = src.pma.history.get('pmc-release', None)
   is_embargoed = False
   
   # src.reason may be empty when a URL was found, so guard before testing it
   if src.reason and src.reason.startswith("PAYWALL") and "embargo" in src.reason:
       is_embargoed = True
       print(f"Article is embargoed until: {embargo_date}")

Publisher Coverage Examples
~~~~~~~~~~~~~~~~~~~~~~~~~~~

FindIt handles many publisher-specific patterns:

.. code-block:: python

   # Test PMIDs for different publishers
   test_pmids = {
       'Nature': ['16419642', '18830250', '12187393'],
       'BMC': ['25943194', '20170543', '25927199'], 
       'ScienceDirect': ['20000000', '25735572', '24565554'],
       'Wiley': ['14981756', '10474162', '10470409'],
       'JAMA': ['25742465', '23754022', '25739104']
   }
   
   for publisher, pmids in test_pmids.items():
       print(f"\n{publisher} results:")
       for pmid in pmids:
           src = FindIt(pmid)
           status = "✓" if src.url else "✗"
           print(f"  {status} {pmid}: {src.pma.journal}")

Clinical and Medical Genetics Queries
-------------------------------------

Specialized Search Types
~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from metapub import PubMedFetcher
   
   fetch = PubMedFetcher()
   
   # Clinical queries with categories
   pmids = fetch.pmids_for_clinical_query(
       'Global developmental delay', 
       'etiology', 
       'broad'  # or 'narrow'
   )
   
   # Medical genetics queries
   pmids = fetch.pmids_for_medical_genetics_query(
       'Brugada Syndrome',
       'diagnosis'  # or 'genetic_counseling', 'prognosis'
   )

Advanced Citation Lookup
~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Find article by detailed citation
   params = {
       'jtitle': 'Genetics in Medicine',
       'year': 2017,
       'volume': 19, 
       'first_page': 1105,
       'aulast': 'Nykamp'
   }
   
   pmids = fetch.pmids_for_citation(**params)
   
   # Alternative parameter names
   params2 = {
       'journal': 'Nature',
       'year': 2023,
       'volume': 615,
       'spage': 123,  # start page
       'authors': 'Smith; Jones; Brown'
   }
   
   pmids2 = fetch.pmids_for_citation(**params2)

MedGen and ClinVar Integration
------------------------------

Disease-Gene Mapping
~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from metapub import MedGenFetcher
   
   mg = MedGenFetcher()
   
   # Disease to gene mapping
   term = "diabetes"
   uids = mg.uids_by_term(term)
   
   for uid in uids[:5]:  # First 5 results
       concept = mg.concept_by_uid(uid)
       print(f"CUI: {concept.cui}")
       print(f"Name: {concept.name}")
       print(f"Definition: {concept.definition}")
       
       # Get related PMIDs
       pmids = mg.pubmeds_for_cui(concept.cui)
       print(f"Related articles: {len(pmids)}")

Gene-Condition Mapping
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Gene to condition mapping
   gene = "CFTR"
   uids = mg.uids_by_term(f"{gene}[gene]")
   
   for uid in uids:
       concept = mg.concept_by_uid(uid)
       if concept.cui:
           print(f"Gene {gene} associated with: {concept.name}")

ClinVar Variant Analysis
~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from metapub import ClinVarFetcher
   
   cv = ClinVarFetcher()
   
   # Get variant by its ClinVar ID. This is the ID shown under "Variation ID"
   # in the ClinVar browser: https://www.ncbi.nlm.nih.gov/clinvar/variation/810732/
   # Specifying id_from='entrez' queries by the Entrez ID instead.
   variant = cv.variant('810732', id_from='clinvar')
   
   print(f"Variation name: {variant.variation_name}")
   print(f"HGVS notation: {variant.hgvs_c}")
   print(f"Clinical significance: {variant.clinical_significance}")
   print(f"Molecular consequences: {variant.molecular_consequences}")

CrossRef Integration
--------------------

DOI Resolution with Fallbacks
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from metapub import PubMedFetcher, CrossRefFetcher
   
   fetch = PubMedFetcher()
   CR = CrossRefFetcher()
   
   def get_doi_with_fallback(pmid):
       # Try PubMed first
       pma = fetch.article_by_pmid(pmid)
       if pma.doi:
           return pma.doi
       
       # Fallback to CrossRef
       work = CR.article_by_pma(pma)
       if work and work.score > 80:  # High confidence match
           return work.doi
       
       return None

Batch Processing with CrossRef
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   import csv
   from metapub.exceptions import InvalidPMID
   
   pmids = ['12345678', '23456789', '34567890']
   
   with open('pmid_doi_mapping.csv', 'w', newline='') as csvfile:
       writer = csv.writer(csvfile)
       writer.writerow(['PMID', 'DOI', 'Title', 'Status'])
       
       for pmid in pmids:
           try:
               pma = fetch.article_by_pmid(pmid)
               doi = get_doi_with_fallback(pmid)
               writer.writerow([pmid, doi or '', pma.title, 'SUCCESS'])
           except InvalidPMID:
               writer.writerow([pmid, '', '', 'INVALID_PMID'])
           except Exception as e:
               writer.writerow([pmid, '', '', f'ERROR: {e}'])

Error Handling Patterns
-----------------------

Robust Error Handling
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from metapub.exceptions import MetaPubError, InvalidPMID
   import logging
   
   # Configure logging for debugging
   logging.getLogger('metapub').setLevel(logging.DEBUG)
   logging.getLogger('requests').setLevel(logging.WARNING)
   
   def safe_article_fetch(pmid):
       try:
           article = fetch.article_by_pmid(pmid)
           return article
       except InvalidPMID:
           print(f"Invalid PMID: {pmid}")
           return None
       except MetaPubError as e:
           print(f"MetaPub error for {pmid}: {e}")
           return None
       except Exception as e:
           print(f"Unexpected error for {pmid}: {e}")
           return None

Network Error Recovery
~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   import time
   from requests.exceptions import RequestException
   
   def fetch_with_retry(pmid, max_retries=3):
       for attempt in range(max_retries):
           try:
               return fetch.article_by_pmid(pmid)
           except RequestException as e:
               if attempt < max_retries - 1:
                   print(f"Network error, retrying in 5 seconds... ({attempt + 1}/{max_retries})")
                   time.sleep(5)
               else:
                   raise  # Re-raise with the original traceback
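
A fixed five-second delay is the simplest policy. As a minimal sketch of an alternative, the helper below retries any callable with exponentially growing delays. The names ``retry_with_backoff``, ``base_delay``, and ``sleep`` are illustrative and not part of metapub; the ``sleep`` parameter exists only so the delay can be swapped out in tests.

.. code-block:: python

   import time


   def retry_with_backoff(func, max_retries=3, base_delay=1.0,
                          exceptions=(Exception,), sleep=time.sleep):
       """Call func(), retrying on the given exceptions with doubling delays.

       Hypothetical helper for illustration; not part of metapub.
       """
       for attempt in range(max_retries):
           try:
               return func()
           except exceptions:
               if attempt == max_retries - 1:
                   raise  # Out of retries: propagate the last error
               sleep(base_delay * 2 ** attempt)  # base, 2x base, 4x base, ...

With the ``fetch`` object from the earlier examples, a call might look like ``retry_with_backoff(lambda: fetch.article_by_pmid(pmid), exceptions=(RequestException,))``.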

Performance Optimization
------------------------

Caching System Overview
~~~~~~~~~~~~~~~~~~~~~~~

Metapub includes a caching system designed to minimize API requests and improve performance. It uses SQLite-based persistent storage with thread-safe operations.

**Key Features:**

- **Persistent Storage**: SQLite database for responses that survive process restarts
- **Thread Safety**: All cache operations are thread-safe using locks
- **NCBI Compliance**: Automatic rate limiting respects NCBI guidelines (3 req/sec without API key, 10 req/sec with)
- **Response Validation**: Only valid XML responses are cached; HTML error pages are rejected
- **Legacy Compatibility**: Works with existing cache files from previous versions
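
The response-validation rule can be sketched as follows. This is a minimal illustration using only the standard library, not metapub's actual check; the function name ``is_cacheable_xml`` is hypothetical.

.. code-block:: python

   import xml.etree.ElementTree as ET


   def is_cacheable_xml(text):
       """Return True if text parses as XML and is not an HTML page."""
       stripped = text.lstrip()
       # HTML error pages usually start with a doctype or an <html> tag
       if stripped[:15].lower().startswith(('<!doctype html', '<html')):
           return False
       try:
           root = ET.fromstring(stripped)
       except ET.ParseError:
           return False  # Not well-formed XML (includes empty responses)
       return root.tag.lower() != 'html'

A cache layer could call a check like this before storing a response, so that a temporary error page is never served from the cache later.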

Cache Configuration
~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   import os
   from metapub import PubMedFetcher
   from metapub.ncbi_client import NCBIClient
   
   # Method 1: Environment variables (traditional)
   os.environ['METAPUB_CACHE_DIR'] = '/path/to/large/cache'
   os.environ['NCBI_API_KEY'] = 'your_api_key_here'
   
   fetch = PubMedFetcher()
   
   # Method 2: Direct NCBIClient usage (new system)
   client = NCBIClient(
       api_key='your_api_key_here',
       cache_path='/path/to/cache/ncbi_cache.db',
       requests_per_second=10,  # Will be capped to NCBI limits
       tool='my_research_tool',
       email='researcher@university.edu'
   )

Understanding Cache Behavior
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from metapub.ncbi_client import SimpleCache
   
   # Direct cache manipulation
   cache = SimpleCache('/path/to/cache.db')
   
   # Cache uses URL + parameters as keys
   url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
   params = {'db': 'pubmed', 'id': '12345678', 'retmode': 'xml'}
   
   # Check if response is cached
   cached_response = cache.get(url, params)
   if cached_response:
       print("Response found in cache")
   else:
       print("Fresh API request needed")
   
   # Manual cache storage (normally done automatically);
   # xml_response_string stands for the XML text returned by the API
   cache.set(url, params, xml_response_string)

Rate Limiting and Performance
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from metapub.ncbi_client import RateLimiter
   import time
   
   # Understanding rate limits
   rate_limiter = RateLimiter(requests_per_second=3)  # Without API key
   
   start_time = time.time()
   for i in range(5):
       rate_limiter.wait_if_needed()
       print(f"Request {i+1} at {time.time() - start_time:.2f}s")
       # Your API request here
   
   # Output shows requests spaced by ~0.33 seconds (3 per second)
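
To make the spacing behaviour concrete, here is a minimal sketch of an interval-based limiter. It assumes a simple fixed-interval policy and is not the implementation of metapub's ``RateLimiter``; the class name ``IntervalLimiter`` and its ``clock`` and ``sleep`` parameters are hypothetical.

.. code-block:: python

   import threading
   import time


   class IntervalLimiter:
       """Allow at most `requests_per_second` calls by spacing them evenly."""

       def __init__(self, requests_per_second, clock=time.monotonic, sleep=time.sleep):
           self.min_interval = 1.0 / requests_per_second
           self._clock = clock
           self._sleep = sleep
           self._lock = threading.Lock()  # Serializes callers across threads
           self._last_call = None

       def wait_if_needed(self):
           """Block until at least min_interval has passed since the last call."""
           with self._lock:
               now = self._clock()
               if self._last_call is not None:
                   remaining = self.min_interval - (now - self._last_call)
                   if remaining > 0:
                       self._sleep(remaining)
                       now = self._clock()
               self._last_call = now

Holding the lock while sleeping keeps the sketch short and guarantees that concurrent callers are released one interval apart, at the cost of blocking other threads for the duration of the wait.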

Cache Database Schema
~~~~~~~~~~~~~~~~~~~~~

The cache uses a simple SQLite schema compatible with existing cache files:

.. code-block:: sql
   
   CREATE TABLE cache (
       key BLOB PRIMARY KEY,      -- URL + sorted parameters
       value BLOB,                -- Cached response data
       created INTEGER,           -- Unix timestamp
       value_compressed BOOL DEFAULT 0  -- Legacy compression flag
   );
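
As a rough illustration of how this schema can be used, the sketch below creates the table in an in-memory database, stores one entry, and reads it back. The key format and the helper names ``make_key``, ``put``, and ``get`` are assumptions made for this example; metapub's actual key encoding may differ.

.. code-block:: python

   import sqlite3
   import time
   from urllib.parse import urlencode

   SCHEMA = """
   CREATE TABLE IF NOT EXISTS cache (
       key BLOB PRIMARY KEY,
       value BLOB,
       created INTEGER,
       value_compressed BOOL DEFAULT 0
   )
   """


   def make_key(url, params):
       """Build a key from the URL plus parameters sorted by name (assumed format)."""
       return f"{url}?{urlencode(sorted(params.items()))}".encode("utf-8")


   def put(conn, url, params, value):
       """Store a response string under the key for this URL and parameters."""
       conn.execute(
           "INSERT OR REPLACE INTO cache (key, value, created) VALUES (?, ?, ?)",
           (make_key(url, params), value.encode("utf-8"), int(time.time())),
       )


   def get(conn, url, params):
       """Return the cached response string, or None on a cache miss."""
       row = conn.execute(
           "SELECT value FROM cache WHERE key = ?", (make_key(url, params),)
       ).fetchone()
       return row[0].decode("utf-8") if row else None


   conn = sqlite3.connect(":memory:")
   conn.execute(SCHEMA)
   url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
   put(conn, url, {"id": "12345678", "db": "pubmed"}, "<PubmedArticleSet/>")
   # Parameter order does not matter because the key sorts the parameters
   print(get(conn, url, {"db": "pubmed", "id": "12345678"}))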

Advanced Cache Management
~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   import sqlite3
   import os
   from metapub.cache_utils import get_cache_path, cleanup_dir
   
   # Inspect cache contents
   cache_path = get_cache_path()
   if cache_path and os.path.exists(cache_path):
       with sqlite3.connect(cache_path) as conn:
           # Count cached entries
           count = conn.execute("SELECT COUNT(*) FROM cache").fetchone()[0]
           print(f"Cache contains {count} entries")
           
           # Find oldest entries
           oldest = conn.execute(
               "SELECT created FROM cache ORDER BY created LIMIT 1"
           ).fetchone()
           if oldest:
               import datetime
               oldest_date = datetime.datetime.fromtimestamp(oldest[0])
               print(f"Oldest entry: {oldest_date}")
   
   # Clear entire cache directory
   if cache_path:
       cache_dir = os.path.dirname(cache_path)
       cleanup_dir(cache_dir)
       print("Cache cleared")
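
Clearing the whole directory is not always necessary. As a hedged sketch, stale rows can instead be pruned by age, working directly on the ``cache`` table and its ``created`` column. The helper name ``prune_older_than`` is hypothetical and not part of metapub, and the ``now`` parameter exists only to make the function testable.

.. code-block:: python

   import sqlite3
   import time


   def prune_older_than(conn, max_age_days, now=None):
       """Delete cache rows whose `created` timestamp is older than max_age_days.

       Returns the number of rows removed.
       """
       now = time.time() if now is None else now
       cutoff = int(now - max_age_days * 86400)  # 86400 seconds per day
       with conn:  # Commit on success, roll back on error
           cursor = conn.execute("DELETE FROM cache WHERE created < ?", (cutoff,))
       return cursor.rowcount

For example, ``prune_older_than(sqlite3.connect(cache_path), 30)`` would drop entries older than 30 days, assuming ``cache_path`` points at a database with the ``cache`` table.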

Traditional vs Modern Caching System
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Traditional System:**

- Dictionary-style access with pickle serialization
- Backward compatible with existing cache files
- Used by PubMedFetcher and other high-level classes

**Modern System (NCBIClient):**

- URL-based caching with parameter normalization
- JSON serialization for complex objects
- Better thread safety and error handling
- Validation prevents caching of error responses

.. code-block:: python

   # Traditional style (still supported)
   from metapub import PubMedFetcher
   fetch = PubMedFetcher()  # Uses traditional caching
   
   # Modern style (recommended for new code)
   from metapub.ncbi_client import NCBIClient
   client = NCBIClient(cache_path='/path/to/cache.db')
   response = client.efetch(db='pubmed', id='12345678')


Batch Processing Optimization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   import time
   
   # Process PMIDs in batches
   def process_pmids_batch(pmids, batch_size=100):
       results = []
       
       for i in range(0, len(pmids), batch_size):
           batch = pmids[i:i + batch_size]
           print(f"Processing batch {i//batch_size + 1}...")
           
           for pmid in batch:
               try:
                   article = fetch.article_by_pmid(pmid)
                   results.append((pmid, article))
               except Exception as e:
                   print(f"Error with {pmid}: {e}")
           
           # Rate limiting between batches
           time.sleep(1)
       
       return results

Preloading and Cache Warming
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from metapub import FindIt
   
   # Preload FindIt cache for a list of PMIDs
   def preload_findit_cache(pmid_file):
       with open(pmid_file, 'r') as f:
           pmids = [line.strip() for line in f if line.strip()]
       
       print(f"Preloading FindIt cache for {len(pmids)} PMIDs...")
       
       for i, pmid in enumerate(pmids):
           if i % 100 == 0:
               print(f"Progress: {i}/{len(pmids)}")
           
           try:
               src = FindIt(pmid)
               # Just accessing it loads into cache
           except Exception as e:
               print(f"Error preloading {pmid}: {e}")
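
Input files often contain blank lines, duplicates, or stray text. A minimal sketch of a loader that filters these out before preloading is shown below. ``read_pmids`` is a hypothetical helper, not a metapub function, and it assumes one PMID per line.

.. code-block:: python

   import io


   def read_pmids(lines):
       """Return unique, all-digit PMIDs from an iterable of lines, in input order."""
       seen = set()
       pmids = []
       for line in lines:
           pmid = line.strip()
           # Keep only ASCII digit strings that have not been seen before
           if pmid.isascii() and pmid.isdigit() and pmid not in seen:
               seen.add(pmid)
               pmids.append(pmid)
       return pmids


   # Any iterable of lines works, including an open file object
   sample = io.StringIO("25575644\n\n16419642\nnot-a-pmid\n25575644\n")
   print(read_pmids(sample))

Removing duplicates up front avoids repeated lookups for the same article, and rejecting non-numeric lines keeps malformed input out of the preload loop.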

URL Reverse Engineering
-----------------------

Extract Identifiers from URLs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from metapub.urlreverse import UrlReverse
   
   # Extract DOI and PMID from URLs
   urls = [
       'https://doi.org/10.1038/nature12373',
       'https://pubmed.ncbi.nlm.nih.gov/12345678/',
       'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3458974/'
   ]
   
   for url in urls:
       urlrev = UrlReverse(url)
       print(f"URL: {url}")
       print(f"DOI: {urlrev.doi}")
       print(f"PMID: {urlrev.pmid}")
       print(f"PMC: {urlrev.pmcid}")
       print("Steps taken:")
       for step in urlrev.steps:
           print(f"  * {step}")
       print()
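
For URLs that embed the identifier directly, as the three above do, the extraction can be approximated with regular expressions. The sketch below is a simplified stand-in written for illustration; it is not how ``UrlReverse`` works internally, and the names ``extract_ids`` and ``PATTERNS`` are hypothetical. It performs no network lookups, so it cannot map a DOI to a PMID.

.. code-block:: python

   import re

   # One pattern per identifier type, keyed by the name used in the result
   PATTERNS = {
       "doi": re.compile(r"doi\.org/(10\.\d{4,9}/[^\s?#]+)"),
       "pmid": re.compile(r"pubmed\.ncbi\.nlm\.nih\.gov/(\d+)"),
       "pmcid": re.compile(r"/pmc/articles/(PMC\d+)"),
   }


   def extract_ids(url):
       """Return a dict of identifiers found in the URL (None when absent)."""
       found = {}
       for name, pattern in PATTERNS.items():
           match = pattern.search(url)
           found[name] = match.group(1) if match else None
       return found

Publisher URLs that carry no recognizable identifier need the multi-step resolution that ``UrlReverse`` records in ``steps``; pattern matching alone returns ``None`` for them.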

Troubleshooting and Debugging
-----------------------------

Common Issues and Solutions
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Enable detailed logging
   import logging
   logging.basicConfig(level=logging.DEBUG)
   
   # Check NCBI service health
   from metapub.ncbi_health_check import main as health_check
   health_check()  # Run health check
   
   # Validate PMIDs before processing
   import re
   pmid_pattern = re.compile(r'^\d+$')
   
   def is_valid_pmid(pmid):
       return pmid_pattern.match(str(pmid)) is not None
   
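   # Clear cache if having issues
   import os
   import shutil
   from metapub.cache_utils import get_cache_path
   
   # As in the cache management example, get_cache_path() is treated as the
   # path to the cache database file, so its parent directory is removed
   cache_path = get_cache_path()
   if cache_path and os.path.exists(cache_path):
       cache_dir = os.path.dirname(cache_path)
       shutil.rmtree(cache_dir)
       print(f"Cleared cache directory: {cache_dir}")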