FindIt: Full-Text PDF Discovery
=================================

.. currentmodule:: metapub.findit

The FindIt module locates full-text PDFs of academic papers using publisher-specific strategies and verifies that each link can be legally accessed.

FindIt Class
------------

.. autoclass:: findit.FindIt
   :members:
   :show-inheritance:

The FindIt class is the primary interface for PDF discovery. It employs publisher-specific "dances" (custom algorithms) to locate downloadable PDFs while respecting publisher policies and embargo restrictions.

**Network Timeout Configuration** (v0.11+)
   The FindIt class accepts two parameters that keep PDF discovery from stalling indefinitely:

   - **request_timeout**: Maximum time in seconds to wait for an HTTP response (default: 10)
   - **max_redirects**: Maximum number of HTTP redirects to follow (default: 3)

   Both values are passed to every network request made during PDF discovery.

Key Features
~~~~~~~~~~~~

**Publisher Coverage**
   - 68+ major academic publishers supported (97.1% coverage)
   - Publisher-specific URL patterns and access methods
   - Dynamic strategy selection based on journal/publisher
   
**Access Verification**
   - HTTP verification of PDF availability
   - Embargo detection and date checking
   - Legal access validation
   - CrossRef API integration for blocked access workarounds

**Caching System**
   - SQLite-based result caching
   - Configurable cache directories
   - Smart cache invalidation for error cases

**Intelligent Error Reporting**
   - Structured error categories with actionable information
   - Always includes attempted URLs for debugging
   - Distinguishes between paywall, technical, and data issues
   - Developer-friendly reason codes for automated handling
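
The caching behaviour listed above can be pictured with a small SQLite sketch. This is an illustration only, not metapub's actual cache schema: the table layout, the ``open_cache``, ``store`` and ``lookup`` helpers, and the rule that only ``TXERROR`` results are retried are all assumptions made for the example.

.. code-block:: python

   import sqlite3

   def open_cache(path=":memory:"):
       """Open (or create) a hypothetical result cache keyed by PMID."""
       conn = sqlite3.connect(path)
       conn.execute(
           "CREATE TABLE IF NOT EXISTS results "
           "(pmid TEXT PRIMARY KEY, url TEXT, reason TEXT)"
       )
       return conn

   def store(conn, pmid, url=None, reason=None):
       """Save the outcome of one lookup, replacing any earlier entry."""
       conn.execute(
           "INSERT OR REPLACE INTO results (pmid, url, reason) VALUES (?, ?, ?)",
           (pmid, url, reason),
       )
       conn.commit()

   def lookup(conn, pmid, retry_errors=False):
       """Return the cached result, or None when a fresh lookup is needed."""
       row = conn.execute(
           "SELECT url, reason FROM results WHERE pmid = ?", (pmid,)
       ).fetchone()
       if row is None:
           return None
       url, reason = row
       # Assumption: with retry_errors set, cached technical failures count as a miss.
       if retry_errors and url is None and (reason or "").startswith("TXERROR"):
           return None
       return {"url": url, "reason": reason}

Treating a cached technical failure as a miss means a transient outage does not permanently hide a PDF that later becomes reachable.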

Core Methods
~~~~~~~~~~~~

.. automethod:: findit.FindIt.__init__

.. automethod:: findit.FindIt.load

.. automethod:: findit.FindIt.load_from_cache

Result Attributes
~~~~~~~~~~~~~~~~~

After initialization, FindIt objects provide these key attributes:

**url** (str or None)
   Direct link to downloadable PDF if found

**reason** (str or None)  
   Explanation of why no PDF is available, beginning with one of these prefixes:
   
   - ``"MISSING: ..."`` - Required data not available (DOI, volume/issue, etc.)
   - ``"PAYWALL: ..."`` - Requires subscription/payment  
   - ``"DENIED: ..."`` - Access forbidden or login required
   - ``"TXERROR: ..."`` - Technical/network/server error
   - ``"NOFORMAT: ..."`` - Publisher doesn't provide expected format
   - ``"NOTFOUND: ..."`` - Content not found at expected location
   
   All reason messages include ``- attempted: {URL}`` for debugging.

**backup_url** (str or None)
   Alternative URL when primary fails

**pma** (PubMedArticle)
   Associated article metadata

**doi_score** (int)
   Confidence score for DOI match (0-100)

FindIt Error Handling Philosophy
--------------------------------

FindIt reports why a PDF link could not be obtained in a form developers can act on. Each failure is classified by type, and every message includes the attempted URL for debugging.

Error Categories and Usage
~~~~~~~~~~~~~~~~~~~~~~~~~~

FindIt reports failures in three ways:

**NoPDFLink Exception - Expected Operational Failures**

Used when the system cannot produce a PDF link due to expected operational conditions:

.. code-block:: python

   from metapub import FindIt
   from metapub.exceptions import NoPDFLink
   
   try:
       src = FindIt('12345678')  # Article without DOI
   except NoPDFLink as e:
       print(str(e))
       # "MISSING: DOI required for SAGE journals - attempted: none"

**AccessDenied Exception - Publisher Restrictions**  

Used when publishers explicitly deny access due to paywall or subscription requirements:

.. code-block:: python

   from metapub import FindIt
   from metapub.exceptions import AccessDenied
   
   try:
       src = FindIt('16419642')  # Nature paywall article
   except AccessDenied as e:
       print(str(e))
       # "PAYWALL: Nature requires subscription - attempted: https://nature.com/articles/..."

**TXERROR Prefix - Technical Failures**

Used within NoPDFLink messages when technical issues prevent accessing content:

.. code-block:: python

   from metapub import FindIt
   from metapub.exceptions import NoPDFLink

   try:
       src = FindIt('12345678')  # Server timeout
   except NoPDFLink as e:
       print(str(e))
       # "TXERROR: Connection timeout after 10s - attempted: https://publisher.com/..."

Error Message Format
~~~~~~~~~~~~~~~~~~~

All error messages follow a consistent structure:

.. code-block:: text

   {ERROR_TYPE}: {Description} - attempted: {URL}

**Error Type Prefixes:**

- ``MISSING:`` - Required data not available (DOI, volume/issue, etc.)
- ``NOFORMAT:`` - Publisher doesn't provide expected format
- ``PAYWALL:`` - Subscription or payment required  
- ``DENIED:`` - Access forbidden or login required
- ``TXERROR:`` - Technical, network, or server error
- ``NOTFOUND:`` - Content not found at expected location

**Always Includes Attempted URL:**

Every error message includes the URL(s) that were attempted, allowing developers to:

- Debug access issues manually
- Understand what URLs the system tried  
- Implement alternative access methods
- Report publisher-specific problems
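
Because the format is regular, a reason string can be split into its parts with a small helper. The following is a minimal sketch, not part of metapub: ``parse_reason`` and ``ParsedReason`` are hypothetical names, and the parser assumes messages follow the ``{ERROR_TYPE}: {Description} - attempted: {URL}`` layout shown above.

.. code-block:: python

   import re
   from typing import NamedTuple, Optional

   # Prefixes documented above; anything else is reported as "UNKNOWN".
   KNOWN_PREFIXES = ("MISSING", "NOFORMAT", "PAYWALL", "DENIED", "TXERROR", "NOTFOUND")

   _REASON_RE = re.compile(
       r"^(?P<prefix>[A-Z]+):\s*(?P<description>.*?)"
       r"(?:\s+-\s+attempted:\s*(?P<attempted>.*))?$",
       re.DOTALL,
   )

   class ParsedReason(NamedTuple):
       prefix: str
       description: str
       attempted: Optional[str]

   def parse_reason(reason: str) -> ParsedReason:
       """Split a reason string into prefix, description, and attempted URL text."""
       text = reason.strip()
       match = _REASON_RE.match(text)
       if not match or match.group("prefix") not in KNOWN_PREFIXES:
           return ParsedReason("UNKNOWN", text, None)
       attempted = match.group("attempted")
       if attempted is not None:
           attempted = attempted.strip()
           # The literal "none" means no URL could be constructed.
           if attempted.lower() == "none":
               attempted = None
       return ParsedReason(match.group("prefix"), match.group("description"), attempted)

Matching on the parsed ``prefix`` rather than testing ``'TXERROR' in error_msg`` avoids false positives when a prefix word happens to appear inside a description or URL.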

Developer Usage Patterns
~~~~~~~~~~~~~~~~~~~~~~~~

The structured error information enables sophisticated error handling:

**Basic Error Categorization**

.. code-block:: python

   from metapub import FindIt
   from metapub.exceptions import NoPDFLink, AccessDenied
   
   def handle_findit_result(pmid):
       try:
           src = FindIt(pmid)
           if src.url:
               return {'status': 'success', 'url': src.url}
           else:
               return {'status': 'no_pdf', 'reason': src.reason}
               
       except AccessDenied as e:
           # Publisher paywall/subscription required
           return {
               'status': 'paywall', 
               'reason': str(e),
               'action': 'purchase_required'
           }
           
       except NoPDFLink as e:
           error_msg = str(e)
           if 'TXERROR' in error_msg:
               # Technical issue - retry later
               return {
                   'status': 'technical_error',
                   'reason': error_msg,
                   'action': 'retry_later'
               }
           elif 'MISSING' in error_msg:
               # Data issue - try alternative approach
               return {
                   'status': 'data_missing',
                   'reason': error_msg,
                   'action': 'try_alternative'
               }
           else:
               return {'status': 'unknown_error', 'reason': error_msg}

**Automated Response to Error Types**

.. code-block:: python

   import time
   from collections import defaultdict

   from metapub import FindIt
   from metapub.exceptions import AccessDenied, NoPDFLink
   
   def batch_findit_with_smart_retry(pmids, max_retries=3):
       """Process PMIDs with intelligent error handling and retry logic."""
       results = []
       retry_queue = defaultdict(list)  # Group by error type for batch retry
       
       for pmid in pmids:
           try:
               src = FindIt(pmid)
               if src.url:
                   results.append({'pmid': pmid, 'url': src.url, 'status': 'success'})
               elif src.reason:
                   # Store reason for analysis
                   results.append({'pmid': pmid, 'reason': src.reason, 'status': 'failed'})
               
           except AccessDenied as e:
               # Paywall - annotate for purchase consideration
               results.append({
                   'pmid': pmid, 
                   'status': 'paywall',
                   'reason': str(e),
                   'needs_purchase': True
               })
               
           except NoPDFLink as e:
               error_msg = str(e)
               if 'TXERROR' in error_msg:
                   # Technical errors - queue for retry
                   retry_queue['technical'].append(pmid)
                   results.append({
                       'pmid': pmid,
                       'status': 'technical_error', 
                       'reason': error_msg,
                       'will_retry': max_retries > 0  # False once retries are exhausted
                   })
               else:
                   # Data/format errors - unlikely to succeed on retry
                   results.append({
                       'pmid': pmid,
                       'status': 'permanent_failure',
                       'reason': error_msg
                   })
       
       # Retry technical errors after delay
       if retry_queue['technical'] and max_retries > 0:
           print(f"Retrying {len(retry_queue['technical'])} technical failures...")
           time.sleep(5)  # Wait for transient issues to resolve
           
           retry_results = batch_findit_with_smart_retry(
               retry_queue['technical'], 
               max_retries - 1
           )
           
           # Update original results with retry outcomes
           for retry_result in retry_results:
               # Find and update the original failed result
               for i, result in enumerate(results):
                   if result['pmid'] == retry_result['pmid']:
                       results[i] = retry_result
                       break
       
       return results

**Error Analysis and Reporting**

.. code-block:: python

   from collections import defaultdict

   def analyze_findit_errors(results):
       """Analyze FindIt results to identify patterns and actionable insights."""
       error_stats = defaultdict(int)
       paywall_publishers = defaultdict(int)
       technical_issues = []
       
       for result in results:
           if result['status'] == 'paywall':
               # Extract publisher from error message for purchase planning
               reason = result['reason']
               if 'Nature' in reason:
                   paywall_publishers['Nature Publishing'] += 1
               elif 'Elsevier' in reason:
                   paywall_publishers['Elsevier/ScienceDirect'] += 1
               # Add more publisher patterns as needed
               
           elif result['status'] == 'technical_error':
               technical_issues.append(result['reason'])
               
           error_stats[result['status']] += 1
       
       total = len(results)
       # Guard against an empty result list to avoid dividing by zero
       success_rate = error_stats['success'] / total * 100 if total else 0.0

       print("=== FindIt Error Analysis ===")
       print(f"Success rate: {error_stats['success']}/{total} ({success_rate:.1f}%)")
       print(f"Paywall articles: {error_stats['paywall']}")
       print(f"Technical errors: {error_stats['technical_error']}")
       
       if paywall_publishers:
           print("\n=== Publishers Requiring Subscription ===")
           for publisher, count in paywall_publishers.items():
               print(f"  {publisher}: {count} articles")
               
       if technical_issues:
           print("\n=== Technical Issues (consider retrying) ===")
           for issue in set(technical_issues):  # Unique issues only
               count = technical_issues.count(issue)
               print(f"  {issue} (×{count})")
       
       return {
           'error_stats': dict(error_stats),
           'paywall_publishers': dict(paywall_publishers),
           'technical_issues': technical_issues
       }
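
Matching publisher names in the message text requires one pattern per publisher. Since every message carries the attempted URL, an alternative is to group failures by the URL's hostname. The sketch below assumes the result dictionaries produced by the examples above; ``attempted_host`` and ``count_hosts`` are hypothetical helper names, not metapub functions.

.. code-block:: python

   import re
   from collections import Counter
   from urllib.parse import urlparse

   _ATTEMPTED_URL_RE = re.compile(r"attempted:\s*(https?://\S+)")

   def attempted_host(reason):
       """Return the hostname of the first attempted URL in a reason, or None."""
       match = _ATTEMPTED_URL_RE.search(reason or "")
       if not match:
           return None
       # Drop a trailing comma left over when several URLs are listed.
       host = urlparse(match.group(1).rstrip(",")).hostname
       if host and host.startswith("www."):
           host = host[4:]
       return host

   def count_hosts(results, status="paywall"):
       """Count results with the given status by attempted hostname."""
       hosts = (
           attempted_host(r.get("reason"))
           for r in results
           if r.get("status") == status
       )
       return Counter(h for h in hosts if h)

Calling ``count_hosts(results)`` gives a per-domain tally of paywalled articles without maintaining a list of publisher names.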

Error Message Examples
~~~~~~~~~~~~~~~~~~~~~

**Missing Data Errors:**

.. code-block:: text

   MISSING: DOI required for SAGE journals - attempted: none
   MISSING: pii needed for ScienceDirect lookup - attempted: https://sciencedirect.com/...
   MISSING: volume/issue/pii data - cannot construct Nature URL - attempted: none

**Access Denied Errors:**

.. code-block:: text

   PAYWALL: Nature requires subscription - attempted: https://nature.com/articles/s41586-020-2936-y.pdf
   DENIED: JAMA requires login - attempted: https://jamanetwork.com/journals/jama/fullarticle/...
   PAYWALL: Elsevier paywall detected - attempted: https://sciencedirect.com/science/article/pii/...

**Technical Errors:**

.. code-block:: text

   TXERROR: Server returned 503 Service Unavailable - attempted: https://publisher.com/...
   TXERROR: Connection timeout after 10s - attempted: https://journals.sagepub.com/...
   TXERROR: dx.doi.org lookup failed (Network error) - attempted: http://dx.doi.org/10.1038/...
   TXERROR: Too many redirects (>3) - attempted: https://publisher.com/...

**Publisher Format Issues:**

.. code-block:: text

   NOFORMAT: BMC article has no PDF version - attempted: https://bmcgenomics.biomedcentral.com/...
   NOTFOUND: Article not found on Nature platform - attempted: https://nature.com/..., traditional URL

Benefits for Developers
~~~~~~~~~~~~~~~~~~~~~~

This comprehensive error handling system provides:

1. **Clear Action Path** - Developers know exactly what went wrong and why
2. **Debugging Information** - Attempted URLs allow manual verification  
3. **Automated Categorization** - Error types enable programmatic responses
4. **Publisher Intelligence** - Identify which publishers require subscriptions
5. **Technical Issue Detection** - Distinguish between transient and permanent failures
6. **Batch Processing Optimization** - Group similar errors for efficient handling

The goal is to make FindIt failures informative and actionable rather than opaque, enabling developers to build robust applications that handle PDF access gracefully.

Usage Patterns
--------------

Basic PDF Discovery
~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from metapub import FindIt
   
   # Find PDF by PMID
   src = FindIt('33157158')
   
   if src.url:
       print(f"✓ PDF available: {src.url}")
       print(f"Journal: {src.pma.journal}")
   else:
       print(f"✗ No access: {src.reason}")
       if src.backup_url:
           print(f"Backup URL: {src.backup_url}")

Advanced Configuration
~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Advanced options
   src = FindIt(
       pmid='12345678',
       use_nih=True,           # Use NIH access when available
       verify=False,           # Skip URL verification for speed
       retry_errors=True,      # Retry cached error results
       debug=True,             # Enable debug logging
       cachedir='/custom/cache',  # Custom cache location
       request_timeout=15,     # Custom request timeout (seconds)
       max_redirects=5         # Custom redirect limit
   )

Network Timeout Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Configure network behavior
   src = FindIt(
       pmid='12345678',
       request_timeout=20,     # Wait up to 20 seconds for responses
       max_redirects=2         # Follow max 2 redirects
   )
   
   # For faster processing with tighter limits
   src = FindIt(
       pmid='12345678', 
       request_timeout=5,      # Quick timeout for batch processing
       max_redirects=1         # Minimal redirects
   )
   
   # Default values (recommended for most use cases)
   src = FindIt('12345678')   # Uses timeout=10s, redirects=3

DOI-Based Discovery
~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Find PDF by DOI instead of PMID
   src = FindIt(doi='10.1038/nature12373')
   
   if src.url:
       print(f"Found via DOI: {src.url}")
   else:
       print(f"DOI lookup failed: {src.reason}")

Publisher-Specific Examples
---------------------------

Nature Publishing Group
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Nature articles - often available through institutional access
   nature_pmids = ['16419642', '18830250', '12187393']
   
   for pmid in nature_pmids:
       src = FindIt(pmid)
       print(f"PMID {pmid}: {src.pma.journal}")
       if src.url:
           print(f"  ✓ Available: {src.url}")
       else:
           print(f"  ✗ {src.reason}")

BMC and Open Access
~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # BMC journals - typically open access
   bmc_pmids = ['25943194', '20170543', '25927199']
   
   for pmid in bmc_pmids:
       src = FindIt(pmid)
       print(f"PMID {pmid}: {src.pma.journal}")
       if src.url:
           print(f"  ✓ Open access: {src.url}")
       else:
           print(f"  ✗ Unexpected: {src.reason}")

Embargo Detection
~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Check for embargoed content
   src = FindIt('25575644')  # Example embargoed article
   
   embargo_date = src.pma.history.get('pmc-release', None)
   
   if src.reason and 'embargo' in src.reason.lower():
       print("Article is embargoed")
       if embargo_date:
           print(f"Available after: {embargo_date}")
   elif src.url:
       print(f"Immediate access: {src.url}")

Batch Processing
----------------

Processing Multiple Articles
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   import csv
   import time
   from metapub import FindIt
   
   def batch_findit_analysis(pmids, output_file='findit_results.csv'):
       """Analyze PDF availability for a list of PMIDs."""
       results = []
       
       with open(output_file, 'w', newline='') as csvfile:
           fieldnames = ['pmid', 'journal', 'title', 'url_available', 
                        'url', 'reason', 'embargo_status']
           writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
           writer.writeheader()
           
           for i, pmid in enumerate(pmids):
               print(f"Processing {pmid} ({i+1}/{len(pmids)})")
               
               try:
                   src = FindIt(pmid, retry_errors=True)
                   
                   # Check embargo status (case-insensitive match on the reason text)
                   is_embargoed = bool(
                       src.reason and
                       src.reason.startswith("PAYWALL") and
                       "embargo" in src.reason.lower()
                   )
                   
                   result = {
                       'pmid': pmid,
                       'journal': src.pma.journal,
                       'title': src.pma.title,
                       'url_available': bool(src.url),
                       'url': src.url or '',
                       'reason': src.reason or '',
                       'embargo_status': 'embargoed' if is_embargoed else 'not_embargoed'
                   }
                   
                   writer.writerow(result)
                   results.append(result)
                   
               except Exception as e:
                   print(f"Error processing {pmid}: {e}")
               
               # Rate limiting
               time.sleep(0.5)
       
       return results

   # Usage
   pmids = ['12345678', '23456789', '34567890']
   results = batch_findit_analysis(pmids)

Result Analysis
~~~~~~~~~~~~~~~

.. code-block:: python

   import pandas as pd
   
   def analyze_findit_results(results):
       """Analyze FindIt batch processing results."""
       df = pd.DataFrame(results)
       
       print("=== PDF Access Analysis ===")
       print(f"Total articles: {len(df)}")
       print(f"PDFs available: {df['url_available'].sum()} ({df['url_available'].mean()*100:.1f}%)")
       print(f"Embargoed: {(df['embargo_status'] == 'embargoed').sum()}")
       
       print("\n=== Access by Journal ===")
       journal_stats = df.groupby('journal').agg({
           'url_available': ['count', 'sum', 'mean']
       }).round(3)
       journal_stats.columns = ['total', 'available', 'access_rate']
       print(journal_stats.sort_values('access_rate', ascending=False))
       
       print("\n=== Failure Reasons ===")
       failed = df[~df['url_available']]
       if len(failed) > 0:
           reason_counts = failed['reason'].value_counts()
           print(reason_counts)
   
   # Analyze results
   analyze_findit_results(results)
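
Because each reason embeds the attempted URL, counting full reason strings tends to give one row per article. Grouping by prefix gives a more useful summary. The sketch below uses only the standard library and assumes the result dictionaries written by ``batch_findit_analysis``; ``count_reason_prefixes`` is a hypothetical helper name.

.. code-block:: python

   from collections import Counter

   def count_reason_prefixes(results):
       """Count failed lookups by reason prefix (the text before the first colon)."""
       counts = Counter()
       for result in results:
           reason = result.get("reason") or ""
           # Skip successes and rows that carry no reason at all.
           if result.get("url_available") or not reason:
               continue
           prefix, sep, _ = reason.partition(":")
           counts[prefix.strip() if sep else "UNKNOWN"] += 1
       return counts

The returned ``Counter`` can be printed with ``most_common()`` to list failure categories from most to least frequent.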

Advanced Features
-----------------

Network Timeout and Reliability Improvements
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

**Version 0.11+ Timeout System**

FindIt applies network timeout controls so that PDF discovery cannot stall indefinitely when a publisher server stops responding or a connection hangs.

**Key Improvements:**

- **Request Timeouts**: All HTTP requests have configurable timeouts (default: 10 seconds)
- **Redirect Limits**: Maximum redirects are enforced to prevent infinite redirect loops (default: 3)
- **Consistent Application**: Timeout controls apply to all publisher-specific dance functions
- **Backward Compatibility**: All timeout parameters are optional with sensible defaults

**Configuration Examples:**

.. code-block:: python

   # Default behavior (recommended)
   src = FindIt('12345678')  # 10s timeout, 3 redirects max
   
   # Conservative settings for unreliable networks
   src = FindIt('12345678', request_timeout=20, max_redirects=5)
   
   # Aggressive settings for fast batch processing
   src = FindIt('12345678', request_timeout=5, max_redirects=1)
   
   # Disable redirects entirely
   src = FindIt('12345678', max_redirects=0)

**Error Handling:**

Timeouts and redirect-limit failures are reported with the ``TXERROR`` prefix:

.. code-block:: text

   TXERROR: Connection timeout after 10s - attempted: https://publisher.com/article
   TXERROR: Too many redirects (>3) - attempted: https://journals.example.com/...

**Publisher-Specific Behavior:**

Some publishers (e.g., IOP, JAMA) use CrossRef API fallbacks when direct access is blocked. The timeout parameters apply to both primary and fallback access methods, ensuring reliable operation across all publishers.

**Performance Impact:**

- **Faster Failure Detection**: With the default settings, a stalled request fails after 10 seconds instead of hanging indefinitely
- **Batch Processing**: Timeout controls make batch operations more predictable and reliable
- **Resource Management**: Prevents accumulation of hanging network connections

Cache Management
~~~~~~~~~~~~~~~~

.. code-block:: python

   # Custom cache configuration
   import os

   from metapub import FindIt
   from metapub.cache_utils import get_cache_path

   # Set global cache directory
   os.environ['METAPUB_CACHE_DIR'] = '/large/cache/partition'

   # Or disable caching entirely
   src = FindIt('12345678', cachedir=None)

   # Clear cache for fresh results (the cache is a single SQLite file)
   cache_file = get_cache_path('default', 'findit.db')
   if os.path.exists(cache_file):
       os.remove(cache_file)
       print("FindIt cache cleared")

Error Recovery
~~~~~~~~~~~~~~

.. code-block:: python

   import time

   from metapub import FindIt
   from requests.exceptions import ConnectionError, Timeout
   
   def robust_findit(pmid, max_retries=3):
       """FindIt with automatic retry on network errors."""
       for attempt in range(max_retries):
           try:
               # Use longer timeout on retries
               timeout = 10 + (5 * attempt)  # 10s, 15s, 20s
               src = FindIt(pmid, request_timeout=timeout)
               return src
           except (ConnectionError, Timeout) as e:
               if attempt < max_retries - 1:
                   print(f"Network error, retrying... ({attempt + 1}/{max_retries})")
                   time.sleep(2 ** attempt)  # Exponential backoff
               else:
                   print(f"Failed after {max_retries} attempts: {e}")
                   return None
   
   # Usage
   src = robust_findit('12345678')
   if src and src.url:
       print(f"Success: {src.url}")

Publisher Strategy Information
------------------------------

Supported Publishers
~~~~~~~~~~~~~~~~~~

The FindIt system includes specialized strategies for:

**Open Access Publishers**
   - BMC (BioMed Central)
   - PLOS (Public Library of Science)
   - PMC (PubMed Central)

**Commercial Publishers**
   - Nature Publishing Group
   - Elsevier (ScienceDirect)
   - Wiley
   - Springer
   - American Chemical Society

**Society Publishers**
   - American Association for the Advancement of Science (AAAS)
   - JAMA Network
   - Biochemical Society

**Regional Publishers**
   - J-STAGE (Japan)
   - Karger (Switzerland)
   - Dustri (Germany)

Strategy Selection
~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # The system automatically selects strategies based on:
   # 1. Journal title patterns
   # 2. Publisher information
   # 3. DOI prefixes
   # 4. URL patterns in metadata
   
   src = FindIt('12345678', debug=True)
   # Debug mode shows strategy selection process
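
The selection order described in the comments can be pictured as a lookup with fallbacks. The following is a simplified sketch under stated assumptions, not metapub's actual registry: the strategy functions, the two lookup tables, and ``select_strategy`` are hypothetical, and only journal name and DOI prefix are consulted.

.. code-block:: python

   # Hypothetical strategy functions; a real one would build a PDF URL.
   def nature_strategy(article):
       return "nature"

   def plos_strategy(article):
       return "plos"

   def doi_redirect_strategy(article):
       return "generic-doi"

   # Illustrative tables only: journal names are matched in lowercase.
   JOURNAL_STRATEGIES = {
       "nature": nature_strategy,
       "plos one": plos_strategy,
   }
   DOI_PREFIX_STRATEGIES = {
       "10.1038": nature_strategy,
       "10.1371": plos_strategy,
   }

   def select_strategy(article):
       """Pick a strategy by journal name first, then DOI prefix, then a fallback."""
       journal = (article.get("journal") or "").strip().lower()
       if journal in JOURNAL_STRATEGIES:
           return JOURNAL_STRATEGIES[journal]
       doi = article.get("doi") or ""
       prefix = doi.split("/", 1)[0]
       return DOI_PREFIX_STRATEGIES.get(prefix, doi_redirect_strategy)

Checking the journal name before the DOI prefix lets a journal-specific rule override a publisher-wide one, which matters when one publisher hosts journals on different platforms.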

Integration with Other Modules
------------------------------

Combined Workflows
~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from metapub import PubMedFetcher, FindIt
   
   # Literature review with full-text access checking
   fetch = PubMedFetcher()
   
   # Search for articles
   pmids = fetch.pmids_for_query('CRISPR therapeutics', retmax=50)
   
   accessible_articles = []
   
   for pmid in pmids:
       # Get article metadata
       article = fetch.article_by_pmid(pmid)
       
       # Check PDF availability
       src = FindIt(pmid)
       
       if src.url:
           accessible_articles.append({
               'pmid': pmid,
               'title': article.title,
               'journal': article.journal,
               'year': article.year,
               'pdf_url': src.url
           })
   
   print(f"Found {len(accessible_articles)} articles with PDFs out of {len(pmids)} total")

See Also
--------

- :doc:`advanced` - Advanced FindIt patterns and publisher-specific examples
- :doc:`tutorials` - Complete workflows using FindIt for batch processing
- :doc:`api_fetchers` - Integration with other fetcher classes
- :doc:`examples` - Practical FindIt usage examples