FindIt: Full-Text PDF Discovery
The FindIt module provides sophisticated capabilities for locating full-text PDFs of academic papers using publisher-specific strategies and legal access verification.
FindIt Class
The FindIt class is the primary interface for PDF discovery. It employs publisher-specific “dances” (custom algorithms) to locate downloadable PDFs while respecting publisher policies and embargo restrictions.
- Network Timeout Configuration (v0.11+)
The FindIt class now includes configurable timeout parameters to prevent infinite stalling during PDF discovery:
request_timeout: Maximum time (seconds) to wait for HTTP responses (default: 10)
max_redirects: Maximum number of HTTP redirects to follow (default: 3)
These parameters are passed to all network requests throughout the FindIt system to ensure reliable operation.
Key Features
- Publisher Coverage
68+ major academic publishers supported (97.1% coverage)
Publisher-specific URL patterns and access methods
Dynamic strategy selection based on journal/publisher
- Access Verification
HTTP verification of PDF availability
Embargo detection and date checking
Legal access validation
CrossRef API integration for blocked access workarounds
- Caching System
SQLite-based result caching
Configurable cache directories
Smart cache invalidation for error cases
- Intelligent Error Reporting
Structured error categories with actionable information
Always includes attempted URLs for debugging
Distinguishes between paywall, technical, and data issues
Developer-friendly reason codes for automated handling
Core Methods
Result Attributes
After initialization, FindIt objects provide these key attributes:
- url (str or None)
Direct link to downloadable PDF if found
- reason (str or None)
Detailed explanation when PDF is not available, always includes attempted URL:
"MISSING: ..."- Required data not available (DOI, volume/issue, etc.)"PAYWALL: ..."- Requires subscription/payment"DENIED: ..."- Access forbidden or login required"TXERROR: ..."- Technical/network/server error"NOFORMAT: ..."- Publisher doesn’t provide expected format"NOTFOUND: ..."- Content not found at expected location
All reason messages include
- attempted: {URL}for debugging.- backup_url (str or None)
Alternative URL when primary fails
- pma (PubMedArticle)
Associated article metadata
- doi_score (int)
Confidence score for DOI match (0-100)
FindIt Error Handling Philosophy
FindIt employs a sophisticated error reporting system that provides meaningful, actionable information to developers about why a PDF link could not be obtained. This system distinguishes between different types of failures and always includes the attempted URL for debugging purposes.
Error Categories and Usage
FindIt uses three main error classification approaches:
NoPDFLink Exception - Expected Operational Failures
Used when the system cannot produce a PDF link due to expected operational conditions:
from metapub import FindIt
from metapub.exceptions import NoPDFLink
try:
src = FindIt('12345678') # Article without DOI
except NoPDFLink as e:
print(str(e))
# "MISSING: DOI required for SAGE journals - attempted: none"
AccessDenied Exception - Publisher Restrictions
Used when publishers explicitly deny access due to paywall or subscription requirements:
from metapub.exceptions import AccessDenied
try:
src = FindIt('16419642') # Nature paywall article
except AccessDenied as e:
print(str(e))
# "PAYWALL: Nature requires subscription - attempted: https://nature.com/articles/..."
TXERROR Prefix - Technical Failures
Used within NoPDFLink messages when technical issues prevent accessing content:
try:
src = FindIt('12345678') # Server timeout
except NoPDFLink as e:
print(str(e))
# "TXERROR: Connection timeout after 30s - attempted: https://publisher.com/..."
Error Message Format
All error messages follow a consistent structure:
{ERROR_TYPE}: {Description} - attempted: {URL}
Error Type Prefixes:
MISSING:- Required data not available (DOI, volume/issue, etc.)NOFORMAT:- Publisher doesn’t provide expected formatPAYWALL:- Subscription or payment requiredDENIED:- Access forbidden or login requiredTXERROR:- Technical, network, or server errorNOTFOUND:- Content not found at expected location
Always Includes Attempted URL:
Every error message includes the URL(s) that were attempted, allowing developers to:
Debug access issues manually
Understand what URLs the system tried
Implement alternative access methods
Report publisher-specific problems
Developer Usage Patterns
The structured error information enables sophisticated error handling:
Basic Error Categorization
from metapub import FindIt
from metapub.exceptions import NoPDFLink, AccessDenied
def handle_findit_result(pmid):
try:
src = FindIt(pmid)
if src.url:
return {'status': 'success', 'url': src.url}
else:
return {'status': 'no_pdf', 'reason': src.reason}
except AccessDenied as e:
# Publisher paywall/subscription required
return {
'status': 'paywall',
'reason': str(e),
'action': 'purchase_required'
}
except NoPDFLink as e:
error_msg = str(e)
if 'TXERROR' in error_msg:
# Technical issue - retry later
return {
'status': 'technical_error',
'reason': error_msg,
'action': 'retry_later'
}
elif 'MISSING' in error_msg:
# Data issue - try alternative approach
return {
'status': 'data_missing',
'reason': error_msg,
'action': 'try_alternative'
}
else:
return {'status': 'unknown_error', 'reason': error_msg}
Automated Response to Error Types
import time
from collections import defaultdict
def batch_findit_with_smart_retry(pmids, max_retries=3):
"""Process PMIDs with intelligent error handling and retry logic."""
results = []
retry_queue = defaultdict(list) # Group by error type for batch retry
for pmid in pmids:
try:
src = FindIt(pmid)
if src.url:
results.append({'pmid': pmid, 'url': src.url, 'status': 'success'})
elif src.reason:
# Store reason for analysis
results.append({'pmid': pmid, 'reason': src.reason, 'status': 'failed'})
except AccessDenied as e:
# Paywall - annotate for purchase consideration
results.append({
'pmid': pmid,
'status': 'paywall',
'reason': str(e),
'needs_purchase': True
})
except NoPDFLink as e:
error_msg = str(e)
if 'TXERROR' in error_msg:
# Technical errors - queue for retry
retry_queue['technical'].append(pmid)
results.append({
'pmid': pmid,
'status': 'technical_error',
'reason': error_msg,
'will_retry': True
})
else:
# Data/format errors - unlikely to succeed on retry
results.append({
'pmid': pmid,
'status': 'permanent_failure',
'reason': error_msg
})
# Retry technical errors after delay
if retry_queue['technical'] and max_retries > 0:
print(f"Retrying {len(retry_queue['technical'])} technical failures...")
time.sleep(5) # Wait for transient issues to resolve
retry_results = batch_findit_with_smart_retry(
retry_queue['technical'],
max_retries - 1
)
# Update original results with retry outcomes
for retry_result in retry_results:
# Find and update the original failed result
for i, result in enumerate(results):
if result['pmid'] == retry_result['pmid']:
results[i] = retry_result
break
return results
Error Analysis and Reporting
def analyze_findit_errors(results):
"""Analyze FindIt results to identify patterns and actionable insights."""
error_stats = defaultdict(int)
paywall_publishers = defaultdict(int)
technical_issues = []
for result in results:
if result['status'] == 'paywall':
# Extract publisher from error message for purchase planning
reason = result['reason']
if 'Nature' in reason:
paywall_publishers['Nature Publishing'] += 1
elif 'Elsevier' in reason:
paywall_publishers['Elsevier/ScienceDirect'] += 1
# Add more publisher patterns as needed
elif result['status'] == 'technical_error':
technical_issues.append(result['reason'])
error_stats[result['status']] += 1
print("=== FindIt Error Analysis ===")
print(f"Success rate: {error_stats['success']}/{len(results)} ({error_stats['success']/len(results)*100:.1f}%)")
print(f"Paywall articles: {error_stats['paywall']}")
print(f"Technical errors: {error_stats['technical_error']}")
if paywall_publishers:
print("\n=== Publishers Requiring Subscription ===")
for publisher, count in paywall_publishers.items():
print(f" {publisher}: {count} articles")
if technical_issues:
print(f"\n=== Technical Issues (consider retrying) ===")
for issue in set(technical_issues): # Unique issues only
count = technical_issues.count(issue)
print(f" {issue} (×{count})")
return {
'error_stats': dict(error_stats),
'paywall_publishers': dict(paywall_publishers),
'technical_issues': technical_issues
}
Error Message Examples
Missing Data Errors:
MISSING: DOI required for SAGE journals - attempted: none
MISSING: pii needed for ScienceDirect lookup - attempted: https://sciencedirect.com/...
MISSING: volume/issue/pii data - cannot construct Nature URL - attempted: none
Access Denied Errors:
PAYWALL: Nature requires subscription - attempted: https://nature.com/articles/s41586-020-2936-y.pdf
DENIED: JAMA requires login - attempted: https://jamanetwork.com/journals/jama/fullarticle/...
PAYWALL: Elsevier paywall detected - attempted: https://sciencedirect.com/science/article/pii/...
Technical Errors:
TXERROR: Server returned 503 Service Unavailable - attempted: https://publisher.com/...
TXERROR: Connection timeout after 10s - attempted: https://journals.sagepub.com/...
TXERROR: dx.doi.org lookup failed (Network error) - attempted: http://dx.doi.org/10.1038/...
TXERROR: Too many redirects (>3) - attempted: https://publisher.com/...
Publisher Format Issues:
NOFORMAT: BMC article has no PDF version - attempted: https://bmcgenomics.biomedcentral.com/...
NOTFOUND: Article not found on Nature platform - attempted: https://nature.com/..., traditional URL
Benefits for Developers
This comprehensive error handling system provides:
Clear Action Path - Developers know exactly what went wrong and why
Debugging Information - Attempted URLs allow manual verification
Automated Categorization - Error types enable programmatic responses
Publisher Intelligence - Identify which publishers require subscriptions
Technical Issue Detection - Distinguish between transient and permanent failures
Batch Processing Optimization - Group similar errors for efficient handling
The goal is to make FindIt failures informative and actionable rather than opaque, enabling developers to build robust applications that handle PDF access gracefully.
Usage Patterns
Basic PDF Discovery
from metapub import FindIt
# Find PDF by PMID
src = FindIt('33157158')
if src.url:
print(f"✓ PDF available: {src.url}")
print(f"Journal: {src.pma.journal}")
else:
print(f"✗ No access: {src.reason}")
if src.backup_url:
print(f"Backup URL: {src.backup_url}")
Advanced Configuration
# Advanced options
src = FindIt(
pmid='12345678',
use_nih=True, # Use NIH access when available
verify=False, # Skip URL verification for speed
retry_errors=True, # Retry cached error results
debug=True, # Enable debug logging
cachedir='/custom/cache', # Custom cache location
request_timeout=15, # Custom request timeout (seconds)
max_redirects=5 # Custom redirect limit
)
Network Timeout Configuration
# Configure network behavior
src = FindIt(
pmid='12345678',
request_timeout=20, # Wait up to 20 seconds for responses
max_redirects=2 # Follow max 2 redirects
)
# For faster processing with tighter limits
src = FindIt(
pmid='12345678',
request_timeout=5, # Quick timeout for batch processing
max_redirects=1 # Minimal redirects
)
# Default values (recommended for most use cases)
src = FindIt('12345678') # Uses timeout=10s, redirects=3
DOI-Based Discovery
# Find PDF by DOI instead of PMID
src = FindIt(doi='10.1038/nature12373')
if src.url:
print(f"Found via DOI: {src.url}")
else:
print(f"DOI lookup failed: {src.reason}")
Publisher-Specific Examples
Nature Publishing Group
# Nature articles - often available through institutional access
nature_pmids = ['16419642', '18830250', '12187393']
for pmid in nature_pmids:
src = FindIt(pmid)
print(f"PMID {pmid}: {src.pma.journal}")
if src.url:
print(f" ✓ Available: {src.url}")
else:
print(f" ✗ {src.reason}")
BMC and Open Access
# BMC journals - typically open access
bmc_pmids = ['25943194', '20170543', '25927199']
for pmid in bmc_pmids:
src = FindIt(pmid)
print(f"PMID {pmid}: {src.pma.journal}")
if src.url:
print(f" ✓ Open access: {src.url}")
else:
print(f" ✗ Unexpected: {src.reason}")
Embargo Detection
# Check for embargoed content
src = FindIt('25575644') # Example embargoed article
embargo_date = src.pma.history.get('pmc-release', None)
if src.reason and 'embargo' in src.reason.lower():
print(f"Article is embargoed")
if embargo_date:
print(f"Available after: {embargo_date}")
elif src.url:
print(f"Immediate access: {src.url}")
Batch Processing
Processing Multiple Articles
import csv
import time
from metapub import FindIt
def batch_findit_analysis(pmids, output_file='findit_results.csv'):
"""Analyze PDF availability for a list of PMIDs."""
results = []
with open(output_file, 'w', newline='') as csvfile:
fieldnames = ['pmid', 'journal', 'title', 'url_available',
'url', 'reason', 'embargo_status']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
for i, pmid in enumerate(pmids):
print(f"Processing {pmid} ({i+1}/{len(pmids)})")
try:
src = FindIt(pmid, retry_errors=True)
# Check embargo status
embargo_date = src.pma.history.get('pmc-release', None)
is_embargoed = (
src.reason and
src.reason.startswith("PAYWALL") and
"embargo" in src.reason
)
result = {
'pmid': pmid,
'journal': src.pma.journal,
'title': src.pma.title,
'url_available': bool(src.url),
'url': src.url or '',
'reason': src.reason or '',
'embargo_status': 'embargoed' if is_embargoed else 'not_embargoed'
}
writer.writerow(result)
results.append(result)
except Exception as e:
print(f"Error processing {pmid}: {e}")
# Rate limiting
time.sleep(0.5)
return results
# Usage
pmids = ['12345678', '23456789', '34567890']
results = batch_findit_analysis(pmids)
Result Analysis
import pandas as pd
def analyze_findit_results(results):
"""Analyze FindIt batch processing results."""
df = pd.DataFrame(results)
print("=== PDF Access Analysis ===")
print(f"Total articles: {len(df)}")
print(f"PDFs available: {df['url_available'].sum()} ({df['url_available'].mean()*100:.1f}%)")
print(f"Embargoed: {(df['embargo_status'] == 'embargoed').sum()}")
print("\n=== Access by Journal ===")
journal_stats = df.groupby('journal').agg({
'url_available': ['count', 'sum', 'mean']
}).round(3)
journal_stats.columns = ['total', 'available', 'access_rate']
print(journal_stats.sort_values('access_rate', ascending=False))
print("\n=== Failure Reasons ===")
failed = df[~df['url_available']]
if len(failed) > 0:
reason_counts = failed['reason'].value_counts()
print(reason_counts)
# Analyze results
analyze_findit_results(results)
Advanced Features
Network Timeout and Reliability Improvements
Version 0.11+ Timeout System
The FindIt system now includes comprehensive network timeout controls to prevent infinite stalling during PDF discovery. This addresses cases where publisher servers might become unresponsive or network connections hang indefinitely.
Key Improvements:
Request Timeouts: All HTTP requests have configurable timeouts (default: 10 seconds)
Redirect Limits: Maximum redirects are enforced to prevent infinite redirect loops (default: 3)
Consistent Application: Timeout controls apply to all publisher-specific dance functions
Backward Compatibility: All timeout parameters are optional with sensible defaults
Configuration Examples:
# Default behavior (recommended)
src = FindIt('12345678') # 10s timeout, 3 redirects max
# Conservative settings for unreliable networks
src = FindIt('12345678', request_timeout=20, max_redirects=5)
# Aggressive settings for fast batch processing
src = FindIt('12345678', request_timeout=5, max_redirects=1)
# Disable redirects entirely
src = FindIt('12345678', max_redirects=0)
Error Handling:
Network timeout issues are now reported clearly in error messages:
TXERROR: Connection timeout after 10s - attempted: https://publisher.com/article
TXERROR: Too many redirects (>3) - attempted: https://journals.example.com/...
Publisher-Specific Behavior:
Some publishers (e.g., IOP, JAMA) use CrossRef API fallbacks when direct access is blocked. The timeout parameters apply to both primary and fallback access methods, ensuring reliable operation across all publishers.
Performance Impact:
Faster Failure Detection: Network issues are detected within 10 seconds instead of hanging indefinitely
Batch Processing: Timeout controls make batch operations more predictable and reliable
Resource Management: Prevents accumulation of hanging network connections
Cache Management
# Custom cache configuration
import os
from metapub.cache_utils import get_cache_path
# Set global cache directory
os.environ['METAPUB_CACHE_DIR'] = '/large/cache/partition'
# Or disable caching entirely
src = FindIt('12345678', cachedir=None)
# Clear cache for fresh results
import shutil
cache_dir = get_cache_path('default', 'findit.db')
if os.path.exists(cache_dir):
os.remove(cache_dir)
print("FindIt cache cleared")
Error Recovery
from requests.exceptions import ConnectionError, Timeout
def robust_findit(pmid, max_retries=3):
"""FindIt with automatic retry on network errors."""
for attempt in range(max_retries):
try:
# Use longer timeout on retries
timeout = 10 + (5 * attempt) # 10s, 15s, 20s
src = FindIt(pmid, request_timeout=timeout)
return src
except (ConnectionError, Timeout) as e:
if attempt < max_retries - 1:
print(f"Network error, retrying... ({attempt + 1}/{max_retries})")
time.sleep(2 ** attempt) # Exponential backoff
else:
print(f"Failed after {max_retries} attempts: {e}")
return None
# Usage
src = robust_findit('12345678')
if src and src.url:
print(f"Success: {src.url}")
Publisher Strategy Information
Supported Publishers
The FindIt system includes specialized strategies for:
- Open Access Publishers
BMC (BioMed Central)
PLOS (Public Library of Science)
PMC (PubMed Central)
- Commercial Publishers
Nature Publishing Group
Elsevier (ScienceDirect)
Wiley
Springer
American Chemical Society
- Society Publishers
American Association for the Advancement of Science (AAAS)
JAMA Network
Biochemical Society
- Regional Publishers
J-STAGE (Japan)
Karger (Switzerland)
Dustri (Germany)
Strategy Selection
# The system automatically selects strategies based on:
# 1. Journal title patterns
# 2. Publisher information
# 3. DOI prefixes
# 4. URL patterns in metadata
src = FindIt('12345678', debug=True)
# Debug mode shows strategy selection process
Integration with Other Modules
Combined Workflows
from metapub import PubMedFetcher, FindIt
# Literature review with full-text access checking
fetch = PubMedFetcher()
# Search for articles
pmids = fetch.pmids_for_query('CRISPR therapeutics', retmax=50)
accessible_articles = []
for pmid in pmids:
# Get article metadata
article = fetch.article_by_pmid(pmid)
# Check PDF availability
src = FindIt(pmid)
if src.url:
accessible_articles.append({
'pmid': pmid,
'title': article.title,
'journal': article.journal,
'year': article.year,
'pdf_url': src.url
})
print(f"Found {len(accessible_articles)} articles with PDFs out of {len(pmids)} total")
See Also
Advanced Usage - Advanced FindIt patterns and publisher-specific examples
Tutorials - Complete workflows using FindIt for batch processing
Data Fetcher Classes - Integration with other fetcher classes
Examples - Practical FindIt usage examples