Tutorials
=========

Real-World Workflows and Use Cases
----------------------------------

This section provides step-by-step tutorials for common research workflows using Metapub.

Tutorial 1: Building a Literature Review Dataset
------------------------------------------------

This tutorial shows how to systematically collect and analyze papers for a literature review.

Step 1: Define Your Search Strategy
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from metapub import PubMedFetcher
   import pandas as pd
   from datetime import datetime
   
   fetch = PubMedFetcher()
   
   # Define search parameters
   search_terms = [
       'machine learning AND genomics',
       'artificial intelligence AND genetics',
       'deep learning AND biomarker'
   ]
   
   date_range = {
       'since': '2020/01/01',
       'until': '2024/12/31'
   }

Step 2: Collect PMIDs
~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   all_pmids = set()  # Use set to avoid duplicates
   
   for term in search_terms:
       print(f"Searching for: {term}")
       pmids = fetch.pmids_for_query(
           query=term,
           retmax=500,  # Adjust based on needs
           **date_range
       )
       all_pmids.update(pmids)
       print(f"Found {len(pmids)} papers")
   
   print(f"Total unique papers: {len(all_pmids)}")
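
Because each query above returns at most ``retmax`` PMIDs, a broad search term may silently lose results. One possible workaround, sketched here, is to split the date range into one-year windows and run the query once per window. The helper name ``yearly_windows`` is hypothetical (it is not part of Metapub); it only builds ``since``/``until`` dictionaries in the same ``YYYY/MM/DD`` format used in Step 1.

.. code-block:: python

   def yearly_windows(start_year, end_year):
       """Build one since/until dict per calendar year, inclusive."""
       if start_year > end_year:
           raise ValueError("start_year must not exceed end_year")
       return [
           {'since': f'{year}/01/01', 'until': f'{year}/12/31'}
           for year in range(start_year, end_year + 1)
       ]


   windows = yearly_windows(2020, 2024)
   print(len(windows))
   print(windows[0])

Each dictionary can then be unpacked into ``fetch.pmids_for_query(query=term, retmax=500, **window)`` in place of ``**date_range``, with the results merged into the same ``all_pmids`` set.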

Step 3: Extract Article Metadata
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from metapub.exceptions import InvalidPMID
   
   articles_data = []
   
   for i, pmid in enumerate(all_pmids):
       if i % 50 == 0:
           print(f"Processed {i}/{len(all_pmids)} articles")
       
       try:
           article = fetch.article_by_pmid(pmid)
           
           # Extract key information
           data = {
               'pmid': pmid,
               'title': article.title,
               'journal': article.journal,
               'year': article.year,
               'doi': article.doi,
               'authors': '; '.join([str(author) for author in article.authors]),
               'abstract': article.abstract,
               'mesh_terms': '; '.join(article.mesh_headings) if article.mesh_headings else '',
               'publication_types': '; '.join(article.publication_types) if article.publication_types else ''
           }
           articles_data.append(data)
           
       except InvalidPMID:
           print(f"Invalid PMID: {pmid}")
       except Exception as e:
           print(f"Error processing {pmid}: {e}")
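
Metadata fields are not always plain lists: depending on the article and the Metapub version, a field may be ``None``, a string, a list, or a dictionary. A small normalizing helper can keep the extraction dictionary readable. The sketch below is only an illustration; ``join_field`` is a hypothetical helper, and the sample values are made up.

.. code-block:: python

   def join_field(value, sep='; '):
       """Flatten None, a string, a list, or a dict into one string."""
       if not value:
           return ''
       if isinstance(value, str):
           return value
       if isinstance(value, dict):
           # Join the dict's values rather than its keys
           value = value.values()
       return sep.join(str(item) for item in value)


   print(join_field(['Smith J', 'Lee K']))
   print(join_field({'D016428': 'Journal Article'}))

With this in place, an entry such as ``'authors': join_field(article.authors)`` no longer needs its own ``if ... else ''`` guard.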

Step 4: Analyze and Export
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Create DataFrame
   df = pd.DataFrame(articles_data)
   
   # Basic analysis
   print(f"Total articles collected: {len(df)}")
   print(f"Year range: {df['year'].min()} - {df['year'].max()}")
   print("Top 10 journals:")
   print(df['journal'].value_counts().head(10))
   
   # Export results
   df.to_csv(f'literature_review_{datetime.now().strftime("%Y%m%d")}.csv', index=False)
   print("Results exported to CSV")
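
For a quick sanity check before opening the CSV, the per-year distribution can be tallied directly from ``articles_data``. This is a minimal sketch using only the standard library; the records below are invented placeholders standing in for the dictionaries built in Step 3, and ``papers_per_year`` is a hypothetical helper.

.. code-block:: python

   from collections import Counter

   # Placeholder records with the same 'year' key as in Step 3
   sample_records = [
       {'pmid': '1', 'year': '2021'},
       {'pmid': '2', 'year': '2021'},
       {'pmid': '3', 'year': '2023'},
       {'pmid': '4', 'year': None},  # Year missing
   ]


   def papers_per_year(records):
       """Count records per publication year, skipping missing years."""
       return Counter(str(r['year']) for r in records if r.get('year'))


   year_counts = papers_per_year(sample_records)
   for year in sorted(year_counts):
       print(year, year_counts[year])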

Tutorial 2: FindIt Batch Processing for Full-Text Access
--------------------------------------------------------

This tutorial demonstrates how to systematically check full-text availability for a collection of papers.

Step 1: Prepare PMID List
~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from metapub import FindIt
   import csv
   import time
   
   # Load PMIDs from a text file (one PMID per line)
   def load_pmids_from_file(filename):
       pmids = []
       with open(filename, 'r') as f:
           for line in f:
               pmid = line.strip()
               if pmid.isdigit():
                   pmids.append(pmid)
       return pmids
   
   # Or from previous search
   pmids = ['25575644', '25700512', '25554792']  # Example PMIDs

Step 2: Batch FindIt Processing
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   def process_findit_batch(pmids, output_file='findit_results.csv'):
       results = []
       
       with open(output_file, 'w', newline='') as csvfile:
           fieldnames = ['pmid', 'journal', 'title', 'url_available', 'url', 'reason', 'backup_url', 'embargo_status']
           writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
           writer.writeheader()
           
           for i, pmid in enumerate(pmids):
               print(f"Processing {pmid} ({i+1}/{len(pmids)})")
               
               try:
                   src = FindIt(pmid, retry_errors=True)
                   
                   # Check embargo status. src.reason may be None when a URL
                   # was found, so fall back to an empty string before matching.
                   reason = src.reason or ''
                   embargo_status = 'embargoed' if (
                       reason.startswith("PAYWALL") and "embargo" in reason
                   ) else 'not_embargoed'
                   
                   result = {
                       'pmid': pmid,
                       'journal': src.pma.journal,
                       'title': src.pma.title,
                       'url_available': bool(src.url),
                       'url': src.url or '',
                       'reason': src.reason,
                       'backup_url': src.backup_url or '',
                       'embargo_status': embargo_status
                   }
                   
                   writer.writerow(result)
                   results.append(result)
                   
               except Exception as e:
                   print(f"Error processing {pmid}: {e}")
                   
               # Rate limiting
               time.sleep(0.5)
       
       return results
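
Long batches are sometimes interrupted. One way to resume, sketched below, is to read the PMIDs already present in the output CSV and skip them on the next run. ``pending_pmids`` is a hypothetical helper, and the in-memory CSV stands in for a real ``findit_results.csv``. Note that ``process_findit_batch`` as written opens its output in ``'w'`` mode, so it would need to append instead for this approach to be useful.

.. code-block:: python

   import csv
   import io


   def pending_pmids(pmids, existing_csv):
       """Return the PMIDs not yet listed in the 'pmid' column of a results file."""
       done = {row['pmid'] for row in csv.DictReader(existing_csv)}
       return [pmid for pmid in pmids if pmid not in done]


   # Simulated partial output from an earlier, interrupted run
   partial = io.StringIO("pmid,journal\n25575644,Example Journal\n")

   todo = pending_pmids(['25575644', '25700512', '25554792'], partial)
   print(todo)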

Step 3: Analyze Access Patterns
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   import pandas as pd

   def analyze_access_results(results):
       df = pd.DataFrame(results)
       
       print("=== Full-Text Access Analysis ===")
       print(f"Total articles: {len(df)}")
       print(f"URL available: {df['url_available'].sum()} ({df['url_available'].mean()*100:.1f}%)")
       print(f"Embargoed articles: {(df['embargo_status'] == 'embargoed').sum()}")
       
       print("\n=== Access by Journal ===")
       journal_access = df.groupby('journal')['url_available'].agg(['count', 'sum', 'mean'])
       journal_access.columns = ['total', 'available', 'access_rate']
       journal_access['access_rate'] = journal_access['access_rate'] * 100
       print(journal_access.sort_values('access_rate', ascending=False))
       
       print("\n=== Common Failure Reasons ===")
       failed = df[~df['url_available']]
       print(failed['reason'].value_counts().head(10))
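
Raw reason strings tend to be too specific to group well. Step 2 already relies on reasons beginning with a prefix such as ``PAYWALL``; assuming other reasons follow the same ``PREFIX: detail`` shape (an assumption, not something this tutorial confirms), the prefix can serve as a coarse category. The sample strings below are invented for illustration, and ``reason_category`` is a hypothetical helper.

.. code-block:: python

   from collections import Counter


   def reason_category(reason):
       """Return the text before the first colon, or 'UNKNOWN' if empty."""
       if not reason:
           return 'UNKNOWN'
       return reason.split(':', 1)[0].strip()


   # Invented examples, not real FindIt output
   sample_reasons = [
       'PAYWALL: embargo until next year',
       'PAYWALL: subscription required',
       'NOFORMAT: no known URL pattern',
       None,
   ]

   category_counts = Counter(reason_category(r) for r in sample_reasons)
   print(category_counts.most_common())

Applied to the DataFrame, the same function could be used as ``failed['reason'].map(reason_category).value_counts()``.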

Tutorial 3: Clinical Genetics Research Workflow
-----------------------------------------------

This tutorial shows how to research genetic conditions using MedGen and ClinVar integration.

Step 1: Condition to Gene Discovery
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from metapub import MedGenFetcher, ClinVarFetcher, PubMedFetcher
   
   mg = MedGenFetcher()
   cv = ClinVarFetcher()
   fetch = PubMedFetcher()
   
   def research_condition(condition_name):
       print(f"=== Researching: {condition_name} ===")
       
       # Step 1: Find MedGen concepts
       concepts = mg.concepts_for_term(condition_name)
       
       if not concepts:
           print("No MedGen concepts found")
           return
       
       main_concept = concepts[0]  # Use primary concept
       print(f"Main concept: {main_concept.name} (CUI: {main_concept.cui})")
       print(f"Definition: {main_concept.definition}")
       
       return main_concept

Step 2: Find Associated Genes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   def find_associated_genes(concept):
       # Get related PMIDs from MedGen
       pmids = mg.pubmeds_for_cui(concept.cui)
       
       print(f"Found {len(pmids)} related articles")
       
       import re

       # Crude heuristic: any all-caps token of 3+ characters. This also
       # matches non-gene acronyms such as DNA or ECG, so filter as needed.
       gene_pattern = re.compile(r'\b[A-Z][A-Z0-9]{2,}\b')

       # Count candidate gene mentions across abstracts
       gene_mentions = {}

       for pmid in pmids[:20]:  # Limit for demo
           try:
               article = fetch.article_by_pmid(pmid)
           except Exception:
               continue  # Skip articles that fail to load

           if article.abstract:
               for gene in gene_pattern.findall(article.abstract):
                   gene_mentions[gene] = gene_mentions.get(gene, 0) + 1
       
       # Sort by frequency
       top_genes = sorted(gene_mentions.items(), key=lambda x: x[1], reverse=True)
       print(f"Top mentioned genes: {top_genes[:10]}")
       
       return top_genes
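
The all-caps pattern above also captures many tokens that are not genes. A minimal refinement, assuming a hand-maintained exclusion list, is sketched below. The ``NON_GENE_TOKENS`` set, the ``count_gene_candidates`` helper, and the sample abstract are illustrative only; a real workflow would more likely validate candidates against an official gene symbol list.

.. code-block:: python

   import re
   from collections import Counter

   GENE_PATTERN = re.compile(r'\b[A-Z][A-Z0-9]{2,}\b')

   # Illustrative exclusion list; extend it for your own corpus
   NON_GENE_TOKENS = {'DNA', 'RNA', 'ECG', 'PCR', 'USA'}


   def count_gene_candidates(abstracts):
       """Count all-caps tokens across abstracts, minus known non-gene acronyms."""
       counts = Counter()
       for text in abstracts:
           for token in GENE_PATTERN.findall(text or ''):
               if token not in NON_GENE_TOKENS:
                   counts[token] += 1
       return counts


   sample = ["ECG changes were linked to SCN5A variants; SCN5A DNA was sequenced."]
   gene_counts = count_gene_candidates(sample)
   print(gene_counts.most_common(5))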

Step 3: ClinVar Variant Analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   def analyze_clinvar_variants(gene_list):
       for gene, _mention_count in gene_list[:5]:  # Top 5 genes
           print(f"\n=== ClinVar variants for {gene} ===")
           
           try:
               # Search for variants in this gene
               variants = cv.variants_for_gene(gene)
               
               if variants:
                   print(f"Found {len(variants)} variants")
                   
                   # Analyze clinical significance
                   significance_counts = {}
                   for variant in variants[:10]:  # Limit for demo
                       sig = variant.clinical_significance
                       if sig:
                           significance_counts[sig] = significance_counts.get(sig, 0) + 1
                   
                   print("Clinical significance distribution:")
                   for sig, n_variants in significance_counts.items():
                       print(f"  {sig}: {n_variants}")
               
           except Exception as e:
               print(f"Error analyzing {gene}: {e}")

Step 4: Generate Research Summary
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   def generate_research_summary(condition_name):
       # Run the full workflow
       concept = research_condition(condition_name)
       if not concept:
           return
       
       genes = find_associated_genes(concept)
       analyze_clinvar_variants(genes)
       
       # Generate bibliography
       pmids = mg.pubmeds_for_cui(concept.cui)
       
       print(f"\n=== Key References for {condition_name} ===")
       for pmid in pmids[:5]:  # Top 5 references
           try:
               article = fetch.article_by_pmid(pmid)
               print(f"PMID {pmid}: {article.title}")
               print(f"  {article.journal} ({article.year})")
               print(f"  DOI: {article.doi}")
               print()
           except Exception:
               continue
   
   # Run the analysis
   generate_research_summary("Brugada syndrome")

Tutorial 4: Journal Analysis and Metrics
----------------------------------------

This tutorial shows how to analyze publication patterns and journal metrics.

Step 1: Collect Journal Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from datetime import datetime

   from metapub import PubMedFetcher

   fetch = PubMedFetcher()  # Reuse one fetcher for all queries below

   def analyze_journal_publication_patterns(journal_name, years_back=5):
       
       current_year = datetime.now().year
       start_year = current_year - years_back
       
       yearly_data = []
       
       for year in range(start_year, current_year + 1):
           print(f"Analyzing {journal_name} for {year}")
           
           pmids = fetch.pmids_for_query(
               journal=journal_name,
               year=year,
               retmax=1000  # Adjust as needed
           )
           
           # Sample articles for analysis
           sample_size = min(50, len(pmids))
           sample_pmids = pmids[:sample_size]
           
           articles = []
           for pmid in sample_pmids:
               try:
                   article = fetch.article_by_pmid(pmid)
                   articles.append(article)
               except Exception:
                   continue
           
           yearly_data.append({
               'year': year,
               'total_articles': len(pmids),  # Capped at retmax, so a lower bound
               'analyzed_articles': articles
           })
       
       return yearly_data

Step 2: Analyze Publication Trends
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   def analyze_publication_trends(yearly_data):
       import matplotlib.pyplot as plt
       
       years = [data['year'] for data in yearly_data]
       counts = [data['total_articles'] for data in yearly_data]
       
       # Publication volume trend
       plt.figure(figsize=(10, 6))
       plt.plot(years, counts, marker='o')
       plt.title('Publication Volume Over Time')
       plt.xlabel('Year')
       plt.ylabel('Number of Articles')
       plt.grid(True)
       plt.show()
       
       # Analyze author patterns
       all_authors = []
       for data in yearly_data:
           for article in data['analyzed_articles']:
               if article.authors:
                   all_authors.extend([str(author) for author in article.authors])
       
       from collections import Counter
       author_counts = Counter(all_authors)
       print("Most prolific authors:")
       for author, count in author_counts.most_common(10):
           print(f"  {author}: {count} papers")
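
A plot shows the overall shape, but the year-over-year change is often easier to quote. The sketch below computes it from ``yearly_data``-style dictionaries; the counts are made-up numbers, and ``year_over_year`` is a hypothetical helper.

.. code-block:: python

   def year_over_year(yearly_data):
       """Return (year, percent change vs. previous year) pairs."""
       ordered = sorted(yearly_data, key=lambda d: d['year'])
       changes = []
       for prev, curr in zip(ordered, ordered[1:]):
           if prev['total_articles'] == 0:
               changes.append((curr['year'], None))  # Undefined growth from zero
           else:
               delta = curr['total_articles'] - prev['total_articles']
               changes.append((curr['year'], 100.0 * delta / prev['total_articles']))
       return changes


   # Made-up counts for illustration
   sample = [
       {'year': 2021, 'total_articles': 200},
       {'year': 2022, 'total_articles': 250},
       {'year': 2023, 'total_articles': 225},
   ]
   for year, pct in year_over_year(sample):
       print(year, 'n/a' if pct is None else f'{pct:+.1f}%')

Keep in mind that the current year is incomplete and that each count is capped by ``retmax``, so the most recent change in particular should be read with caution.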

Tutorial 5: Enriching PubMed Results with CrossRef Data
-------------------------------------------------------

This tutorial demonstrates how to use CrossRef to enrich PubMed articles with
additional metadata such as citation counts and licensing information.

Step 1: Search PubMed and Collect Articles
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from metapub import PubMedFetcher, CrossRefFetcher

   fetch = PubMedFetcher()
   cr = CrossRefFetcher()

   # Search PubMed for your topic
   pmids = fetch.pmids_for_query('CRISPR gene therapy', retmax=20)
   print(f"Found {len(pmids)} PubMed results")

Step 2: Enrich with CrossRef Metadata
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

``CrossRefFetcher`` can look up articles by DOI, by title, or directly from a
``PubMedArticle`` object. The ``article_by_pma`` method uses title similarity
matching to find the best CrossRef match.

.. code-block:: python

   enriched = []

   for pmid in pmids:
       article = fetch.article_by_pmid(pmid)

       # Look up on CrossRef using the PubMedArticle directly
       cr_work = cr.article_by_pma(article)

       result = {
           'pmid': pmid,
           'title': article.title,
           'journal': article.journal,
           'year': article.year,
           'doi': article.doi,
       }

       if cr_work:
           result['citation_count'] = cr_work.cited_by_count
           result['cr_publisher'] = cr_work.publisher

       enriched.append(result)
       print(f"PMID {pmid}: {article.title[:60]}...")
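
To make the idea of title similarity matching concrete, here is a minimal sketch using the standard library's ``difflib``. It is not Metapub's implementation, which may normalize and score titles differently; ``best_title_match``, the 0.8 threshold, and the candidate titles are assumptions chosen for illustration.

.. code-block:: python

   from difflib import SequenceMatcher


   def normalize(title):
       """Lowercase and collapse whitespace so trivial differences are ignored."""
       return ' '.join(title.lower().split())


   def best_title_match(query, candidates, threshold=0.8):
       """Return (candidate, score) for the closest title, or None below threshold."""
       best = None
       for candidate in candidates:
           score = SequenceMatcher(None, normalize(query), normalize(candidate)).ratio()
           if best is None or score > best[1]:
               best = (candidate, score)
       if best is None or best[1] < threshold:
           return None
       return best


   # Invented titles for illustration
   candidates = [
       'CRISPR gene therapy for inherited blindness',
       'Deep learning in genomics',
   ]
   match = best_title_match('CRISPR  Gene Therapy for Inherited Blindness', candidates)
   print(match)

A threshold guards against accepting an unrelated record when no close title exists, which is why the lookup in the loop above is followed by an ``if cr_work:`` check.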

Step 3: Look Up a Single Article by DOI or Title
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

You can also query CrossRef directly when you have a DOI or title:

.. code-block:: python

   # By DOI
   work = cr.article_by_doi('10.1038/s41586-020-2649-2')
   print(f"{work.title} — cited {work.cited_by_count} times")

   # By title (returns best match)
   work = cr.article_by_title('CRISPR-Cas9 gene editing for sickle cell disease')
   if work:
       print(f"Found: {work.title}")
       print(f"DOI: {work.doi}")
       print(f"Publisher: {work.publisher}")