Data Model Classes
Metapub provides rich data model classes that represent structured information from biomedical databases. These classes automatically parse complex XML responses into convenient Python objects.
PubMedArticle
- class metapub.PubMedArticle(xmlstr, *args, **kwargs)[source]
Bases: MetaPubObject
This PubMedArticle class receives an XML string as its required argument and parses it into its constituent parts, exposing them as attributes.
- Usage:
paper = PubMedArticle(xml_string)
To query services to return an article by pmid, use PubMedFetcher, which returns PubMedArticle objects.
When xmlstr is parsed, the pubmed_type attribute will be set to one of ‘article’ or ‘book’, depending on whether PubmedBookArticle or PubmedArticle headings are found in the supplied xmlstr at instantiation.
Because this class must work whether the record is a book or an article, the PubmedArticle attributes are always available (often set to None for a PubmedBookArticle, e.g. volume, issue, journal), whereas the PubmedBookArticle attributes are set only when pubmed_type='book'.
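The type detection described above can be sketched with the standard library alone. This is only an illustration of the idea, not metapub's implementation: the function name detect_pubmed_type and the two tiny XML strings are hypothetical.

```python
import xml.etree.ElementTree as ET


def detect_pubmed_type(xmlstr):
    """Return 'book', 'article', or None depending on which element is present.

    Hypothetical helper for illustration; metapub performs its own detection.
    """
    root = ET.fromstring(xmlstr)
    # root.iter() yields the root element itself followed by all descendants.
    tags = {element.tag for element in root.iter()}
    if "PubmedBookArticle" in tags:
        return "book"
    if "PubmedArticle" in tags:
        return "article"
    return None


# Minimal made-up documents that only carry the distinguishing element.
article_xml = "<PubmedArticleSet><PubmedArticle><MedlineCitation/></PubmedArticle></PubmedArticleSet>"
book_xml = "<PubmedArticleSet><PubmedBookArticle><BookDocument/></PubmedBookArticle></PubmedArticleSet>"

print(detect_pubmed_type(article_xml))
print(detect_pubmed_type(book_xml))
```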
- PubMedBook special handling of certain attributes:
abstract: a joined string from self.book_abstracts
title: comes from ArticleTitle
- Special attributes for PubmedBookArticle (pubmed_type='book'):
book_id (default: None) - string from IdType="bookaccession", e.g. "NBK1403"
book_title (default: None) - string with name of book (as differentiated from ArticleTitle)
book_publisher (default: None) - dict containing {‘name’: string, ‘location’: string}
book_sections (default: []) - dict with key->value pairs as section_name->SectionTitle
book_contribution_date (default: None) - python datetime date
book_date_revised (default: None) - python datetime date
book_history (default: []) - dictionary with key->value pairs as PubStatus -> python datetime
book_language (default: None) - string (e.g. “eng”)
book_editors (default: []) - list containing names from ‘editors’ AuthorList
book_abstracts (default: []) - dict with key->value pairs as Label->AbstractText.text
book_medium (default: None) - string (e.g. “Internet”)
book_synonyms (default: None) - list of disease synonyms (applicable to “gene” book)
book_publication_status (default: None) - string (e.g. “ppublish”)
- __init__(xmlstr, *args, **kwargs)[source]
Initialize PubMedArticle from NCBI XML data.
- Parameters:
xmlstr (str) – XML string from NCBI containing PubmedArticle or PubmedBookArticle data.
*args – Additional positional arguments passed to parent class.
**kwargs – Additional keyword arguments passed to parent class.
Note
The XML type is automatically detected to handle both regular articles and book chapters. The pubmed_type attribute will be set to ‘article’ or ‘book’ accordingly, and appropriate attributes will be populated.
- to_dict()[source]
Convert PubMedArticle to dictionary representation.
- Returns:
Dictionary containing all article attributes except internal XML content and processing attributes.
- Return type:
Dict[str, Any]
Note
Excludes ‘content’, ‘xml’, and ‘_root’ attributes from the output to provide a clean data representation suitable for serialization.
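A minimal sketch of the exclusion behaviour described in the note, assuming the attributes live in the instance dictionary. The Record class below is a hypothetical stand-in, not metapub's code.

```python
class Record:
    """Hypothetical stand-in for a parsed record; not metapub's class."""

    # Attribute names left out of the exported dictionary, as listed in the note.
    EXCLUDED = ("content", "xml", "_root")

    def __init__(self, xml, pmid, title):
        self.xml = xml
        self.content = xml
        self._root = "ArticleRoot"
        self.pmid = pmid
        self.title = title

    def to_dict(self):
        # Copy every instance attribute except the internal ones.
        return {key: value for key, value in vars(self).items()
                if key not in self.EXCLUDED}


record = Record("<xml/>", "12345", "Example title")
print(record.to_dict())
```

Filtering by name keeps the export cheap and leaves the raw XML out of anything that gets serialized.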
- property citation
Returns a formatted citation string built from this article’s author(s), title, journal, year, volume, pages, and doi.
Article Example:
McNally EM, et al. Genetic mutations and mechanisms in dilated cardiomyopathy. Journal of Clinical Investigation. 2013; 123:19-26. doi: 10.1172/JCI62862.
Book Example (GeneReviews):
Tranebjarg L, et al. Jervell and Lange-Nielsen syndrome. 2002 Jul 29 (Updated 2014 Nov 20). In: Pagon RA, et al., editors. GeneReviews (Internet). Seattle (WA): University of Washington, Seattle; 1993-2015. Available from: https://www.ncbi.nlm.nih.gov/books/NBK1405/.
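To make the article citation layout concrete, here is a rough sketch that assembles a string in the same shape as the article example. The function format_citation and its "et al." rule are assumptions for illustration, and the coauthor name is a placeholder; metapub's own formatter may differ in details.

```python
def format_citation(authors, title, journal, year, volume, pages, doi=None):
    """Build an article citation string (hypothetical helper, not metapub's API)."""
    if not authors:
        lead = ""
    elif len(authors) == 1:
        lead = f"{authors[0]}. "
    else:
        # Assumption: more than one author collapses to "First Author, et al."
        lead = f"{authors[0]}, et al. "
    text = f"{lead}{title}. {journal}. {year}; {volume}:{pages}."
    if doi:
        text += f" doi: {doi}."
    return text


citation = format_citation(
    authors=["McNally EM", "Coauthor A"],  # second name is a placeholder
    title="Genetic mutations and mechanisms in dilated cardiomyopathy",
    journal="Journal of Clinical Investigation",
    year=2013,
    volume="123",
    pages="19-26",
    doi="10.1172/JCI62862",
)
print(citation)
```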
- property citation_html
Returns an HTML-formatted citation string (with <i> and <b> markup) built from this article’s author(s), title, journal, year, volume, pages, and doi.
Article Example:
McNally EM, <i>et al</i>. Genetic mutations and mechanisms in dilated cardiomyopathy. <i>Journal of Clinical Investigation</i>. 2013; <b>123</b>:19-26. doi: 10.1172/JCI62862.
GeneReviews Example:
Tranebjarg L, <i>et al</i>. <i>Jervell and Lange-Nielsen syndrome</i>. 2002 Jul 29 (Updated 2014 Nov 20). In: Pagon RA, <i>et al</i>., editors. GeneReviews (Internet). Seattle (WA): University of Washington, Seattle; 1993-2015. Available from: https://www.ncbi.nlm.nih.gov/books/NBK1405/.
- property citation_bibtex
- property pubdate
Normalized publication date as datetime object.
Returns the best available publication date from PubMed XML, in order of preference:
1. Article PubDate (Year/Month/Day or MedlineDate)
2. Book contribution date
3. History dates (pubmed, entrez, etc.)
- Returns:
Publication date as datetime object, or None if no date found
- Return type:
datetime or None
Example
article = fetch.article_by_pmid('12345')
if article.pubdate:
    print(f"Published: {article.pubdate.strftime('%Y-%m-%d')}")
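The order of preference for pubdate can be pictured as a simple fallback chain. The sketch below is a hypothetical illustration (best_pubdate and its arguments are not part of metapub), assuming the history is a dictionary keyed by status name.

```python
from datetime import datetime


def best_pubdate(article_date=None, book_contribution_date=None, history=None):
    """Pick the first available date in the documented order of preference.

    Hypothetical helper; metapub computes this internally from the XML.
    """
    if article_date is not None:
        return article_date
    if book_contribution_date is not None:
        return book_contribution_date
    history = history or {}
    # Try the named statuses first, then any other history entry.
    for status in ("pubmed", "entrez"):
        if history.get(status) is not None:
            return history[status]
    for value in history.values():
        if value is not None:
            return value
    return None


only_history = best_pubdate(history={"entrez": datetime(2014, 11, 20)})
with_article = best_pubdate(article_date=datetime(2013, 1, 2),
                            history={"pubmed": datetime(2013, 1, 5)})
print(only_history, with_article)
```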
The PubMedArticle class is the core data model for representing scientific articles from PubMed. It automatically parses NCBI XML into structured attributes.
Key Attributes
- Identifiers
pmid - PubMed ID
doi - Digital Object Identifier
pmc - PubMed Central ID
pii - Publisher Item Identifier
- Bibliographic Information
title - Article title
journal - Journal name
year - Publication year
volume - Journal volume
issue - Journal issue
pages - Page range
first_page - Starting page
last_page - Ending page
- Authors and Content
authors - List of PubMedAuthor objects
author_list - Simple list of author name strings
abstract - Article abstract text
keywords - Author-supplied keywords
- Classification
mesh_headings - Medical Subject Headings (MeSH) terms
publication_types - Type classifications (e.g., “Clinical Trial”)
chemicals - Chemical substances mentioned
- Dates and History
history - Publication timeline with key dates
received_date - When manuscript was received
accepted_date - When manuscript was accepted
Example Usage
from metapub import PubMedFetcher
fetch = PubMedFetcher()
article = fetch.article_by_pmid('33157158')
# Basic information
print(f"Title: {article.title}")
print(f"Journal: {article.journal} ({article.year})")
print(f"Volume {article.volume}, Issue {article.issue}, Pages {article.pages}")
# Authors
print(f"Authors: {len(article.authors)}")
for author in article.authors:
    print(f" {author.lastname}, {author.firstname}")
# Content
print(f"Abstract: {article.abstract[:200]}...")
print(f"Keywords: {', '.join(article.keywords) if article.keywords else 'None'}")
# Classification
print(f"MeSH terms: {', '.join(article.mesh_headings[:5])}") # First 5
print(f"Publication types: {', '.join(article.publication_types)}")
# Generate citation
print(f"Citation: {article.citation}")
# Export to dictionary
data = article.to_dict()
Book Articles
PubMedArticle also handles NCBI book chapters with additional attributes:
# When pubmed_type == 'book'
if article.pubmed_type == 'book':
    print(f"Book ID: {article.book_id}")
    print(f"Book title: {article.book_title}")
    print(f"Publisher: {article.book_publisher}")
    print(f"Editors: {', '.join(article.book_editors)}")
MedGenConcept
- class metapub.MedGenConcept(xmlstr, *args, **kwargs)[source]
Bases: MetaPubObject
- __init__(xmlstr, *args, **kwargs)[source]
Instantiate with “xml” as string or bytes containing valid XML.
Optionally supply the name of a root element (string) to set a virtual top level.
- property synonyms
Returns a list of the ‘name’ values from self.names.
- property medgen_uid
Synonym for “uid”. Sometimes when juggling concepts from multiple places, this helps.
Represents medical genetics concepts from NCBI’s MedGen database.
Key Attributes
- Identifiers
cui - Concept Unique Identifier
uid - MedGen UID
name - Primary concept name
- Content
definition - Concept definition
synonyms - Alternative names
semantic_types - Semantic classifications
- Relationships
related_concepts - Related MedGen concepts
sources - Source vocabularies
Example Usage
from metapub import MedGenFetcher
mg = MedGenFetcher()
concepts = mg.concepts_for_term('cystic fibrosis')
for concept in concepts[:3]: # First 3 results
    print(f"Name: {concept.name}")
    print(f"CUI: {concept.cui}")
    print(f"Definition: {concept.definition}")
    if concept.synonyms:
        print(f"Synonyms: {', '.join(concept.synonyms[:3])}") # First 3
    print(f"Semantic types: {', '.join(concept.semantic_types)}")
ClinVarVariant
- class metapub.ClinVarVariant(xmlstr, *args, **kwargs)[source]
Bases: MetaPubObject
- __init__(xmlstr, *args, **kwargs)[source]
Instantiate with “xml” as string or bytes containing valid XML.
Optionally supply the name of a root element (string) to set a virtual top level.
- property hgvs_c
Returns a list of all coding HGVS strings from the Allele data.
- property hgvs_g
Returns a list of all genomic HGVS strings from the Allele data.
- property hgvs_p
Returns a list of all protein effect HGVS strings from the Allele data.
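The three properties separate HGVS strings by kind. As a rough illustration of that distinction, the sketch below groups strings by the notation prefix that follows the colon (c. for coding, g. for genomic, p. for protein). This is a different technique from reading the ClinVar XML, which is what the class itself does; group_hgvs is hypothetical and the accession strings are made-up placeholders.

```python
def group_hgvs(strings):
    """Group HGVS strings into coding (c), genomic (g), and protein (p) lists.

    Hypothetical helper that inspects the notation prefix only.
    """
    groups = {"c": [], "g": [], "p": []}
    for text in strings:
        # The description follows the reference sequence and a colon,
        # e.g. "REFSEQ:c.100A>G".
        _, _, description = text.partition(":")
        kind = description[:1]
        if description[1:2] == "." and kind in groups:
            groups[kind].append(text)
    return groups


# Placeholder strings, not real variants.
examples = [
    "NM_000000.0:c.100A>G",
    "NC_000000.0:g.12345A>G",
    "NP_000000.0:p.Lys34Glu",
    "not-an-hgvs-string",
]
grouped = group_hgvs(examples)
print(grouped["c"], grouped["g"], grouped["p"])
```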
Represents genetic variants from NCBI’s ClinVar database with clinical significance information.
Key Attributes
- Identifiers
accession - ClinVar accession number
variation_id - Variation ID
allele_id - Allele ID
- Genomic Information
hgvs_c - HGVS coding sequence notation
hgvs_p - HGVS protein sequence notation
hgvs_g - HGVS genomic notation
gene_symbol - Associated gene symbol
molecular_consequences - Predicted effects
- Clinical Data
clinical_significance - Clinical interpretation (in lowercase)
review_status - Review/evidence level
last_evaluated - Date of last evaluation
- Supporting Data
submitters - Contributing organizations
conditions - Associated conditions/diseases
citations - Supporting literature
Example Usage
from metapub import ClinVarFetcher
cv = ClinVarFetcher()
variant = cv.variant('12345') # ClinVar ID
print(f"Accession: {variant.accession}")
print(f"Gene: {variant.gene_symbol}")
print(f"HGVS notation: {variant.hgvs_c}")
print(f"Clinical significance: {variant.clinical_significance}")
print(f"Review status: {variant.review_status}")
if variant.molecular_consequences:
    print(f"Molecular consequences: {', '.join(variant.molecular_consequences)}")
if variant.conditions:
    print(f"Associated conditions: {', '.join(variant.conditions[:3])}") # First 3
Data Model Utilities
Validation Functions
Example usage:
from metapub.validate import is_valid_pmid, is_valid_doi
# Validate PMIDs before processing
pmids = ['12345678', 'invalid', '23456789', '']
valid_pmids = [pmid for pmid in pmids if is_valid_pmid(pmid)]
# Validate DOIs
dois = ['10.1038/nature12373', 'invalid-doi', '10.1126/science.1234567']
valid_dois = [doi for doi in dois if is_valid_doi(doi)]
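For readers curious what such checks might involve, here is a rough standalone sketch. It does not reproduce metapub.validate: the names looks_like_pmid and looks_like_doi are hypothetical, and the rules (ASCII digits for PMIDs, a "10.<registrant>/<suffix>" shape for DOIs) are simplifying assumptions.

```python
import re

# Assumed DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def looks_like_pmid(value):
    """Return True if value is a non-empty string of ASCII digits (hypothetical rule)."""
    text = str(value).strip()
    return text.isascii() and text.isdigit()


def looks_like_doi(value):
    """Return True if value matches the assumed DOI shape (hypothetical rule)."""
    return bool(DOI_PATTERN.match(str(value).strip()))


pmid_candidates = ['12345678', 'invalid', '23456789', '']
doi_candidates = ['10.1038/nature12373', 'invalid-doi', '10.1126/science.1234567']

plausible_pmids = [p for p in pmid_candidates if looks_like_pmid(p)]
plausible_dois = [d for d in doi_candidates if looks_like_doi(d)]
print(plausible_pmids)
print(plausible_dois)
```

A shape check like this only filters obviously malformed input; it cannot tell whether an identifier actually exists.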
Conversion Functions
- metapub.convert.pmid2doi(pmid)[source]
Starting with a PubMed ID, look up the article in PubMed. If a DOI is found in the PubMedArticle object, return it. Otherwise, use CrossRef to find the DOI for the given article.
- Parameters:
pmid – PubMed ID of the article to look up
- Returns:
doi (str) or None
- Raises:
InvalidPMID – if pmid is invalid
NCBIServiceError – if NCBI services are down
- metapub.convert.doi2pmid(doi)[source]
Uses CrossRef and PubMed eutils to look up a PMID given a known DOI.
- Warning: NO validation of the input DOI is performed here. Use metapub.text_mining.find_doi_in_string beforehand if needed.
If a PMID can be found, return it. Otherwise return None.
In very rare cases, the CrossRef->PubMed citation method used here may yield more than one PubMed ID. In that case, this function returns the string ‘AMBIGUOUS’ instead.
- Parameters:
doi – (str)
- Returns:
pmid (str) if found; ‘AMBIGUOUS’ if citation count > 1; None if no results.
- Raises:
NCBIServiceError – if NCBI services are down
Example usage:
from metapub.convert import pmid2doi, doi2pmid
# Convert PMID to DOI
doi = pmid2doi('33157158')
if doi:
    print(f"DOI: {doi}")
# Convert DOI to PMID
pmid = doi2pmid('10.1038/nature12373')
if pmid:
    print(f"PMID: {pmid}")
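Because doi2pmid is documented as returning a PMID, the string 'AMBIGUOUS', or None, a plain truthiness check treats 'AMBIGUOUS' as if it were a PMID. The sketch below shows one way to separate the three cases; describe_lookup is a hypothetical helper and works on the return value only, so it needs no network access.

```python
def describe_lookup(result):
    """Turn a doi2pmid-style return value into a readable message.

    Hypothetical helper; `result` is whatever the lookup returned.
    """
    if result is None:
        return "no PMID found"
    if result == "AMBIGUOUS":
        return "more than one candidate PMID"
    return f"PMID {result}"


for value in ("23456789", "AMBIGUOUS", None):
    print(describe_lookup(value))
```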
Working with Multiple Data Types
Integration Example
from metapub import PubMedFetcher, MedGenFetcher, ClinVarFetcher
# Initialize fetchers
fetch = PubMedFetcher()
mg = MedGenFetcher()
cv = ClinVarFetcher()
# Research workflow: gene -> variants -> literature
gene = 'BRCA1'
# 1. Find genetic variants
variant_ids = cv.ids_by_gene(gene, single_gene=True)
# 2. Get MedGen concepts for the gene
concepts = mg.concepts_for_term(f"{gene}[gene]")
# 3. Collect all related literature
all_pmids = set()
# From variants
for var_id in variant_ids[:5]: # Limit for demo
    pmids = cv.pmids_for_id(var_id)
    all_pmids.update(pmids)
# From MedGen concepts
for concept in concepts[:3]: # Limit for demo
    pmids = mg.pubmeds_for_cui(concept.cui)
    all_pmids.update(pmids)
# 4. Analyze the literature
print(f"Found {len(all_pmids)} unique articles for {gene}")
# Sample the articles
for pmid in list(all_pmids)[:10]: # First 10
    try:
        article = fetch.article_by_pmid(pmid)
        print(f"PMID {pmid}: {article.title}")
        print(f" Journal: {article.journal} ({article.year})")
    except Exception as e:
        print(f" Error with PMID {pmid}: {e}")
Serialization and Export
import json
import csv
# Export article data to JSON
# (assumes `fetch` is a PubMedFetcher and `pmids` is an iterable of PMIDs defined earlier)
articles_data = []
for pmid in pmids:
    article = fetch.article_by_pmid(pmid)
    articles_data.append(article.to_dict())
with open('articles.json', 'w') as f:
    json.dump(articles_data, f, indent=2, default=str)
# Export to CSV
with open('articles.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['PMID', 'Title', 'Journal', 'Year', 'DOI', 'Authors'])
    for article_data in articles_data:
        writer.writerow([
            article_data['pmid'],
            article_data['title'],
            article_data['journal'],
            article_data['year'],
            article_data['doi'],
            '; '.join([str(author) for author in article_data['authors']])
        ])
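The default=str argument in the JSON export matters because a dictionary produced from an article may hold date objects (the book date attributes are documented as Python dates), which the json module cannot encode on its own. A small standalone sketch with a made-up record:

```python
import json
from datetime import date

# Made-up record standing in for one entry of articles_data.
record = {"pmid": "12345", "pubdate": date(2013, 1, 2)}

try:
    json.dumps(record)
    failed_without_default = False
except TypeError:
    # date objects are not JSON serializable by default.
    failed_without_default = True

# default=str is called for any value json cannot encode natively.
encoded = json.dumps(record, default=str)
print(failed_without_default)
print(encoded)
```

Dates come back as plain strings when the file is loaded again, so parse them explicitly if date arithmetic is needed later.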