Data Model Classes

Metapub provides rich data model classes that represent structured information from biomedical databases. These classes automatically parse complex XML responses into convenient Python objects.

PubMedArticle

class metapub.PubMedArticle(xmlstr, *args, **kwargs)

Bases: MetaPubObject

This PubMedArticle class receives an XML string as its required argument and parses it into its constituent parts, exposing them as attributes.

Usage: paper = PubMedArticle(xml_string)

To query services to return an article by pmid, use PubMedFetcher, which returns PubMedArticle objects.

When xmlstr is parsed, the pubmed_type attribute will be set to one of ‘article’ or ‘book’, depending on whether PubmedBookArticle or PubmedArticle headings are found in the supplied xmlstr at instantiation.

Since this class needs to work seamlessly in production whether it’s a book or an article, the PubmedArticle attributes will always be available (set to None in many cases for PubmedBookArticle, e.g. volume, issue, journal), but PubmedBookArticle attributes will only be set when pubmed_type=’book’.
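The type detection described above can be illustrated with a minimal sketch. It uses the standard library's ElementTree and a hypothetical helper named detect_pubmed_type; it is not metapub's implementation, only one way to tell the two headings apart.

```python
import xml.etree.ElementTree as ET


def detect_pubmed_type(xmlstr):
    """Return 'book' or 'article' depending on which heading is present.

    Hypothetical helper for illustration; returns None if neither is found.
    """
    root = ET.fromstring(xmlstr)
    # iter() visits the root element itself plus every descendant.
    tags = {elem.tag for elem in root.iter()}
    if "PubmedBookArticle" in tags:
        return "book"
    if "PubmedArticle" in tags:
        return "article"
    return None


# Tiny hand-written documents, not real NCBI responses.
article_xml = "<PubmedArticleSet><PubmedArticle/></PubmedArticleSet>"
book_xml = "<PubmedArticleSet><PubmedBookArticle/></PubmedArticleSet>"

print(detect_pubmed_type(article_xml))
print(detect_pubmed_type(book_xml))
```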

For PubmedBookArticle records, certain attributes are handled specially:

abstract: a joined string from self.book_abstracts
title: comes from ArticleTitle

Special attributes for PubmedBookArticle (pubmed_type=’book’):

book_id (default: None) - string from IdType=”bookaccession”, e.g. “NBK1403”
book_title (default: None) - string with name of book (as differentiated from ArticleTitle)
book_publisher (default: None) - dict containing {‘name’: string, ‘location’: string}
book_sections (default: []) - dict with key->value pairs as section_name->SectionTitle
book_contribution_date (default: None) - python datetime date
book_date_revised (default: None) - python datetime date
book_history (default: []) - dictionary with key->value pairs as PubStatus -> python datetime
book_language (default: None) - string (e.g. “eng”)
book_editors (default: []) - list containing names from ‘editors’ AuthorList
book_abstracts (default: []) - dict with key->value pairs as Label->AbstractText.text
book_medium (default: None) - string (e.g. “Internet”)
book_synonyms (default: None) - list of disease synonyms (applicable to “gene” book)
book_publication_status (default: None) - string (e.g. “ppublish”)

__init__(xmlstr, *args, **kwargs)

Initialize PubMedArticle from NCBI XML data.

Parameters:

xmlstr (str) – XML string from NCBI containing PubmedArticle or PubmedBookArticle data.
*args – Additional positional arguments passed to parent class.
**kwargs – Additional keyword arguments passed to parent class.

Note

The XML type is automatically detected to handle both regular articles and book chapters. The pubmed_type attribute will be set to ‘article’ or ‘book’ accordingly, and appropriate attributes will be populated.

to_dict()

Convert PubMedArticle to dictionary representation.

Returns:

Dictionary containing all article attributes except the internal XML content and processing attributes.

Return type:

Dict[str, Any]

Note

Excludes ‘content’, ‘xml’, and ‘_root’ attributes from the output to provide a clean data representation suitable for serialization.
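The exclusion rule in the note above can be sketched with a small stand-in class. Record and its attribute values are hypothetical and exist only for this illustration; the real class may build its dictionary differently.

```python
class Record:
    """Hypothetical stand-in for a parsed object; not a metapub class."""

    # Attribute names left out of the exported dictionary, per the note above.
    EXCLUDED = ("content", "xml", "_root")

    def __init__(self, xml):
        self.xml = xml                # raw XML string
        self.content = object()       # placeholder for a parsed document
        self._root = "PubmedArticle"  # placeholder for the root element name
        self.pmid = "12345"
        self.title = "An example title"

    def to_dict(self):
        # Copy every instance attribute except the internal ones.
        return {k: v for k, v in vars(self).items() if k not in self.EXCLUDED}


record_dict = Record("<PubmedArticle/>").to_dict()
print(record_dict)
```

Because the internal parser objects are dropped, the result contains only plain values in this sketch and can be handed to a serializer.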

property citation

Returns a formatted citation string built from this article’s author(s), title, journal, year, volume, pages, and doi.

Article Example:

McNally EM, et al. Genetic mutations and mechanisms in dilated cardiomyopathy. Journal of Clinical Investigation. 2013; 123:19-26. doi: 10.1172/JCI62862.

Book Example (GeneReviews):

Tranebjarg L, et al. Jervell and Lange-Nielsen syndrome. 2002 Jul 29 (Updated 2014 Nov 20). In: Pagon RA, et al., editors. GeneReviews (Internet). Seattle (WA): University of Washington, Seattle; 1993-2015. Available from: https://www.ncbi.nlm.nih.gov/books/NBK1405/.
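As a rough illustration of how the article citation above is assembled from its parts, here is a minimal sketch. The function format_citation is hypothetical and its layout is inferred only from the article example; the handling of a single author and the second author name are assumptions.

```python
def format_citation(authors, title, journal, year, volume, pages, doi=None):
    """Build an article citation string (illustrative, not metapub's code)."""
    if not authors:
        lead = ""
    elif len(authors) == 1:
        lead = f"{authors[0]}. "          # assumption: no 'et al.' for one author
    else:
        lead = f"{authors[0]}, et al. "
    text = f"{lead}{title.rstrip('.')}. {journal}. {year}; {volume}:{pages}."
    if doi:
        text += f" doi: {doi}."
    return text


citation = format_citation(
    authors=["McNally EM", "Coauthor A"],  # second name is a placeholder
    title="Genetic mutations and mechanisms in dilated cardiomyopathy",
    journal="Journal of Clinical Investigation",
    year=2013,
    volume="123",
    pages="19-26",
    doi="10.1172/JCI62862",
)
print(citation)
```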

property citation_html

Returns an HTML-formatted citation string built from this article’s author(s), title, journal, year, volume, pages, and doi.

Article Example:

McNally EM, et al. Genetic mutations and mechanisms in dilated cardiomyopathy. Journal of Clinical Investigation. 2013; 123:19-26. doi: 10.1172/JCI62862.

GeneReviews Example: Tranebjarg L, et al. Jervell and Lange-Nielsen syndrome. 2002 Jul 29 (Updated 2014 Nov 20). In: Pagon RA, et al., editors. GeneReviews (Internet). Seattle (WA): University of Washington, Seattle; 1993-2015. Available from: https://www.ncbi.nlm.nih.gov/books/NBK1405/.

property citation_bibtex

property pubdate

Normalized publication date as datetime object.

Returns the best available publication date from PubMed XML, in order of preference:

1. Article PubDate (Year/Month/Day or MedlineDate)
2. Book contribution date
3. History dates (pubmed, entrez, etc.)

Returns: Publication date as datetime object, or None if no date found
Return type: datetime or None

Example

article = fetch.article_by_pmid('12345')
if article.pubdate:
    print(f"Published: {article.pubdate.strftime('%Y-%m-%d')}")
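The order of preference listed for pubdate amounts to a fallback chain. The sketch below shows that idea with a hypothetical function pick_pubdate; the argument names and the choice to check only the "pubmed" and "entrez" history keys are assumptions made for the example.

```python
from datetime import datetime


def pick_pubdate(article_date=None, book_contribution_date=None, history=None):
    """Return the first available date in the documented order of preference."""
    if article_date is not None:
        return article_date
    if book_contribution_date is not None:
        return book_contribution_date
    history = history or {}
    # Assumption: history maps a PubStatus name to a datetime.
    for status in ("pubmed", "entrez"):
        if history.get(status) is not None:
            return history[status]
    return None


from_history = pick_pubdate(history={"entrez": datetime(2014, 11, 20)})
from_article = pick_pubdate(
    article_date=datetime(2013, 1, 2),
    history={"pubmed": datetime(2013, 1, 5)},
)
print(from_history, from_article)
```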

static parse_xml(xml, root=None)

Takes xml (str or bytes) and (optionally) a root element definition string.

If root element defined, DOM object returned is rebased with this element as root.

Parameters:

xml (str or bytes)
root (str) – (optional) name of root element

Returns:

lxml document object.
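To make the "rebased" behavior concrete, the following sketch parses a document and, when a root name is given, returns the first matching element as the new top level. It uses the standard library's ElementTree as a stand-in for lxml, and parse_xml_sketch is a hypothetical name, so treat it as an approximation of the idea rather than the library's code.

```python
import xml.etree.ElementTree as ET


def parse_xml_sketch(xml, root=None):
    """Parse xml (str or bytes); optionally rebase on the element named root."""
    dom = ET.fromstring(xml)
    if root is None or dom.tag == root:
        return dom
    # Search all descendants for the first element with the requested name.
    # Returns None when no such element exists.
    return dom.find(f".//{root}")


doc = "<Set><Wrapper><Item><Name>example</Name></Item></Wrapper></Set>"
whole = parse_xml_sketch(doc)
rebased = parse_xml_sketch(doc, root="Item")

print(whole.tag)
print(rebased.tag, rebased.find("Name").text)
```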

The PubMedArticle class is the core data model for representing scientific articles from PubMed. It automatically parses NCBI XML into structured attributes.

Key Attributes

Identifiers

pmid - PubMed ID
doi - Digital Object Identifier
pmc - PubMed Central ID
pii - Publisher Item Identifier

Bibliographic Information

title - Article title
journal - Journal name
year - Publication year
volume - Journal volume
issue - Journal issue
pages - Page range
first_page - Starting page
last_page - Ending page

Authors and Content

authors - List of PubMedAuthor objects
author_list - Simple list of author name strings
abstract - Article abstract text
keywords - Author-supplied keywords

Classification

mesh_headings - Medical Subject Headings (MeSH) terms
publication_types - Type classifications (e.g., “Clinical Trial”)
chemicals - Chemical substances mentioned

Dates and History

history - Publication timeline with key dates
received_date - When manuscript was received
accepted_date - When manuscript was accepted

Example Usage

from metapub import PubMedFetcher

fetch = PubMedFetcher()
article = fetch.article_by_pmid('33157158')

# Basic information
print(f"Title: {article.title}")
print(f"Journal: {article.journal} ({article.year})")
print(f"Volume {article.volume}, Issue {article.issue}, Pages {article.pages}")

# Authors
print(f"Authors: {len(article.authors)}")
for author in article.authors:
    print(f"  {author.lastname}, {author.firstname}")

# Content
print(f"Abstract: {(article.abstract or '')[:200]}...")  # abstract may be None
print(f"Keywords: {', '.join(article.keywords) if article.keywords else 'None'}")

# Classification
print(f"MeSH terms: {', '.join(article.mesh_headings[:5])}")  # First 5
print(f"Publication types: {', '.join(article.publication_types)}")

# Generate citation
print(f"Citation: {article.citation}")

# Export to dictionary
data = article.to_dict()

Book Articles

PubMedArticle also handles NCBI book chapters with additional attributes:

# When pubmed_type == 'book'
if article.pubmed_type == 'book':
    print(f"Book ID: {article.book_id}")
    print(f"Book title: {article.book_title}")
    print(f"Publisher: {article.book_publisher}")
    print(f"Editors: {', '.join(article.book_editors)}")

PubMedAuthor

class metapub.PubMedAuthor(xmlelem, *args, **kwargs)

Bases: MetaPubObject

This PubMedAuthor class receives an XML element as its required argument and parses it into its parts, exposing them as attributes.

Usage: author = PubMedAuthor(xml_elem)

To retrieve the standard representation of an author name, use the __str__ method.

(About unicode: metapub uses unicode_literals in both py3 and py2, so the str() function returns unicode, unless called by a py2k “str()” statement in which unicode_literals is off.)

__init__(xmlelem, *args, **kwargs)

Instantiate with “xml” as string or bytes containing valid XML.

Supply name of root element (string) to set virtual top level. (optional).

to_dict()

static parse_xml(xml, root=None)

Takes xml (str or bytes) and (optionally) a root element definition string.

If root element defined, DOM object returned is rebased with this element as root.

Parameters:

xml (str or bytes)
root (str) – (optional) name of root element

Returns:

lxml document object.

Represents individual authors with detailed name parsing and affiliation information.

Key Attributes

Name Components

lastname - Author’s last name
firstname - Author’s first name
initials - First/middle initials
suffix - Name suffix (Jr., Sr., etc.)

Affiliation

affiliation - Institutional affiliation

Example Usage

# Access authors from an article
for author in article.authors:
    print(f"Name: {author.lastname}, {author.firstname}")
    print(f"Initials: {author.initials}")
    if author.affiliation:
        print(f"Affiliation: {author.affiliation}")
    print(f"Full name: {str(author)}")  # Uses __str__ method

MedGenConcept

class metapub.MedGenConcept(xmlstr, *args, **kwargs)

Bases: MetaPubObject

__init__(xmlstr, *args, **kwargs)

Instantiate with “xml” as string or bytes containing valid XML.

Supply name of root element (string) to set virtual top level. (optional).

to_dict()[source]: returns a dictionary composed of all extractable properties of this concept.

property synonyms: Returns a list of the ‘name’ values from self.names.

property medgen_uid: Synonym for “uid”. Sometimes when juggling concepts from multiple places, this helps.
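The two properties above are thin conveniences over other attributes. A minimal sketch follows, assuming names is a list of dictionaries that each carry a 'name' key; the Concept class and the uid value are placeholders for illustration, not the library's class or real MedGen data.

```python
class Concept:
    """Hypothetical stand-in showing derived properties; not MedGenConcept."""

    def __init__(self, uid, names):
        self.uid = uid
        self.names = names  # assumed shape: [{'name': ...}, ...]

    @property
    def synonyms(self):
        # Collect the 'name' value from every entry in self.names.
        return [item["name"] for item in self.names]

    @property
    def medgen_uid(self):
        # Alias for uid, handy when mixing identifiers from several sources.
        return self.uid


concept = Concept(
    uid="0000",  # placeholder identifier
    names=[{"name": "Cystic fibrosis"}, {"name": "Mucoviscidosis"}],
)
print(concept.synonyms, concept.medgen_uid)
```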

static parse_xml(xml, root=None)

Takes xml (str or bytes) and (optionally) a root element definition string.

If root element defined, DOM object returned is rebased with this element as root.

Parameters:

xml (str or bytes)
root (str) – (optional) name of root element

Returns:

lxml document object.

Represents medical genetics concepts from NCBI’s MedGen database.

Key Attributes

Identifiers

cui - Concept Unique Identifier
uid - MedGen UID
name - Primary concept name

Content

definition - Concept definition
synonyms - Alternative names
semantic_types - Semantic classifications

Relationships

related_concepts - Related MedGen concepts
sources - Source vocabularies

Example Usage

from metapub import MedGenFetcher

mg = MedGenFetcher()
concepts = mg.concepts_for_term('cystic fibrosis')

for concept in concepts[:3]:  # First 3 results
    print(f"Name: {concept.name}")
    print(f"CUI: {concept.cui}")
    print(f"Definition: {concept.definition}")

    if concept.synonyms:
        print(f"Synonyms: {', '.join(concept.synonyms[:3])}")  # First 3

    print(f"Semantic types: {', '.join(concept.semantic_types)}")

ClinVarVariant

class metapub.ClinVarVariant(xmlstr, *args, **kwargs)

Bases: MetaPubObject

__init__(xmlstr, *args, **kwargs)

Instantiate with “xml” as string or bytes containing valid XML.

Supply name of root element (string) to set virtual top level. (optional).

to_dict()[source]: returns a dictionary composed of all extractable properties of this concept.

property hgvs_c: Returns a list of all coding HGVS strings from the Allele data.

property hgvs_g: Returns a list of all genomic HGVS strings from the Allele data.

property hgvs_p: Returns a list of all protein effect HGVS strings from the Allele data.
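The three properties above separate HGVS strings by kind. One way to picture that split is to look at the type prefix that follows the colon ("c.", "g.", or "p."). The helper split_hgvs and the accession strings below are placeholders invented for this sketch; how the library actually reads the Allele data is not shown here.

```python
def split_hgvs(hgvs_strings):
    """Group HGVS strings by type prefix: coding, genomic, protein."""
    groups = {"c": [], "g": [], "p": []}
    for text in hgvs_strings:
        # Everything after the first ':' is the variant description.
        _, _, description = text.partition(":")
        kind = description[:2]
        if kind in ("c.", "g.", "p."):
            groups[kind[0]].append(text)
    return groups


# Placeholder strings in HGVS-like form, not real ClinVar records.
examples = [
    "NM_000000.0:c.1A>G",
    "NC_000000.0:g.100A>G",
    "NP_000000.0:p.Met1Val",
]
grouped = split_hgvs(examples)
print(grouped["c"], grouped["g"], grouped["p"])
```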

static parse_xml(xml, root=None)

Takes xml (str or bytes) and (optionally) a root element definition string.

If root element defined, DOM object returned is rebased with this element as root.

Parameters:

xml (str or bytes)
root (str) – (optional) name of root element

Returns:

lxml document object.

Represents genetic variants from NCBI’s ClinVar database with clinical significance information.

Key Attributes

Identifiers

accession - ClinVar accession number
variation_id - Variation ID
allele_id - Allele ID

Genomic Information

hgvs_c - HGVS coding sequence notation
hgvs_p - HGVS protein sequence notation
hgvs_g - HGVS genomic notation
gene_symbol - Associated gene symbol
molecular_consequences - Predicted effects

Clinical Data

clinical_significance - Clinical interpretation (in lowercase)
review_status - Review/evidence level
last_evaluated - Date of last evaluation

Supporting Data

submitters - Contributing organizations
conditions - Associated conditions/diseases
citations - Supporting literature

Example Usage

from metapub import ClinVarFetcher

cv = ClinVarFetcher()
variant = cv.variant('12345')  # ClinVar ID

print(f"Accession: {variant.accession}")
print(f"Gene: {variant.gene_symbol}")
print(f"HGVS notation: {variant.hgvs_c}")
print(f"Clinical significance: {variant.clinical_significance}")
print(f"Review status: {variant.review_status}")

if variant.molecular_consequences:
    print(f"Molecular consequences: {', '.join(variant.molecular_consequences)}")

if variant.conditions:
    print(f"Associated conditions: {', '.join(variant.conditions[:3])}")  # First 3

Data Model Utilities

Validation Functions

Example usage:

from metapub.validate import is_valid_pmid, is_valid_doi

# Validate PMIDs before processing
pmids = ['12345678', 'invalid', '23456789', '']
valid_pmids = [pmid for pmid in pmids if is_valid_pmid(pmid)]

# Validate DOIs
dois = ['10.1038/nature12373', 'invalid-doi', '10.1126/science.1234567']
valid_dois = [doi for doi in dois if is_valid_doi(doi)]
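For readers who want to see what such filtering involves, here is a standalone sketch of simple format checks using regular expressions. The helpers looks_like_pmid and looks_like_doi are hypothetical and deliberately loose; they are not the library's validators and only check the shape of the string, not whether the record exists.

```python
import re

# A PMID is treated here as a positive integer without leading zeros.
PMID_RE = re.compile(r"[1-9]\d*")
# A DOI is treated here as '10.', a 4-9 digit registrant code, '/', and a suffix.
DOI_RE = re.compile(r"10\.\d{4,9}/\S+")


def looks_like_pmid(value):
    return bool(PMID_RE.fullmatch(str(value).strip()))


def looks_like_doi(value):
    return bool(DOI_RE.fullmatch(str(value).strip()))


pmid_candidates = ["12345678", "invalid", "23456789", ""]
doi_candidates = ["10.1038/nature12373", "invalid-doi", "10.1126/science.1234567"]

plausible_pmids = [p for p in pmid_candidates if looks_like_pmid(p)]
plausible_dois = [d for d in doi_candidates if looks_like_doi(d)]
print(plausible_pmids, plausible_dois)
```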

Conversion Functions

metapub.convert.pmid2doi(pmid)

Starting with a PubMed ID, look up the article in PubMed. If a DOI is found in the PubMedArticle object, return it. Otherwise, use CrossRef to find the DOI for the given article.

Parameters:

pmid (str or int)

Returns:

doi (str) or None

Raises:

InvalidPMID – if pmid is invalid
NCBIServiceError – if NCBI services are down

metapub.convert.doi2pmid(doi)

Uses CrossRef and PubMed eutils to look up a PMID given a known DOI.

Warning: NO validation of the input DOI is performed here. Use metapub.text_mining.find_doi_in_string beforehand if needed.

If a PMID can be found, return it. Otherwise return None.

In very rare cases, use of the CrossRef->pubmed citation method used here may result in more than one pubmed ID. In this case, this function will return instead the word ‘AMBIGUOUS’.

Parameters: doi (str)
Returns: pmid (str) if found; ‘AMBIGUOUS’ if citation count > 1; None if no results.
Raises: NCBIServiceError if NCBI services are down

Example usage:

from metapub.convert import pmid2doi, doi2pmid

# Convert PMID to DOI
doi = pmid2doi('33157158')
if doi:
    print(f"DOI: {doi}")

# Convert DOI to PMID
pmid = doi2pmid('10.1038/nature12373')
if pmid and pmid != 'AMBIGUOUS':  # 'AMBIGUOUS' means more than one match
    print(f"PMID: {pmid}")
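Since doi2pmid is documented to return a PMID string, the word ‘AMBIGUOUS’, or None, callers may want to branch on all three outcomes. The sketch below does that with a hypothetical helper, describe_lookup, that takes an already-obtained result so it runs without any network access.

```python
def describe_lookup(result):
    """Turn a doi2pmid-style return value into a short status message."""
    if result is None:
        return "not found"
    if result == "AMBIGUOUS":
        # More than one PubMed ID matched the citation.
        return "ambiguous"
    return f"PMID {result}"


# Example values standing in for real lookup results.
for value in ("12345678", "AMBIGUOUS", None):
    print(describe_lookup(value))
```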

Working with Multiple Data Types

Integration Example

from metapub import PubMedFetcher, MedGenFetcher, ClinVarFetcher

# Initialize fetchers
fetch = PubMedFetcher()
mg = MedGenFetcher()
cv = ClinVarFetcher()

# Research workflow: gene -> variants -> literature
gene = 'BRCA1'

# 1. Find genetic variants
variant_ids = cv.ids_by_gene(gene, single_gene=True)

# 2. Get MedGen concepts for the gene
concepts = mg.concepts_for_term(f"{gene}[gene]")

# 3. Collect all related literature
all_pmids = set()

# From variants
for var_id in variant_ids[:5]:  # Limit for demo
    pmids = cv.pmids_for_id(var_id)
    all_pmids.update(pmids)

# From MedGen concepts
for concept in concepts[:3]:  # Limit for demo
    pmids = mg.pubmeds_for_cui(concept.cui)
    all_pmids.update(pmids)

# 4. Analyze the literature
print(f"Found {len(all_pmids)} unique articles for {gene}")

# Sample the articles
for pmid in list(all_pmids)[:10]:  # First 10
    try:
        article = fetch.article_by_pmid(pmid)
        print(f"PMID {pmid}: {article.title}")
        print(f"  Journal: {article.journal} ({article.year})")
    except Exception as e:
        print(f"  Error with PMID {pmid}: {e}")

Serialization and Export

import json
import csv

from metapub import PubMedFetcher

fetch = PubMedFetcher()
pmids = ['33157158']  # PMIDs to export

# Export article data to JSON
articles_data = []
for pmid in pmids:
    article = fetch.article_by_pmid(pmid)
    articles_data.append(article.to_dict())

with open('articles.json', 'w') as f:
    json.dump(articles_data, f, indent=2, default=str)

# Export to CSV
with open('articles.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['PMID', 'Title', 'Journal', 'Year', 'DOI', 'Authors'])

    for article_data in articles_data:
        writer.writerow([
            article_data['pmid'],
            article_data['title'],
            article_data['journal'],
            article_data['year'],
            article_data['doi'],
            '; '.join([str(author) for author in article_data['authors']])
        ])
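The default=str argument in the JSON export above matters because values such as datetime objects are not JSON-serializable on their own. A small standalone sketch with a made-up record shows the difference:

```python
import json
from datetime import datetime

# Made-up record containing a datetime, similar in spirit to a pubdate value.
record = {"pmid": "12345", "pubdate": datetime(2020, 1, 15)}

try:
    json.dumps(record)
    plain_dump_failed = False
except TypeError:
    # The standard encoder does not know how to handle datetime objects.
    plain_dump_failed = True

# With default=str, unsupported values are converted using str().
encoded = json.dumps(record, default=str)
print(plain_dump_failed, encoded)
```

If a specific date format is needed on the way back in, converting dates explicitly (for example with isoformat()) before dumping may be easier to parse than the str() form.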