Data Model Classes

Metapub provides rich data model classes that represent structured information from biomedical databases. These classes automatically parse complex XML responses into convenient Python objects.

PubMedArticle

class metapub.PubMedArticle(xmlstr, *args, **kwargs)[source]

Bases: MetaPubObject

This PubMedArticle class receives an XML string as its required argument and parses it into its constituent parts, exposing them as attributes.

Usage:

paper = PubMedArticle(xml_string)

To query services to return an article by pmid, use PubMedFetcher, which returns PubMedArticle objects.

When xmlstr is parsed, the pubmed_type attribute will be set to one of ‘article’ or ‘book’, depending on whether PubmedBookArticle or PubmedArticle headings are found in the supplied xmlstr at instantiation.

Since this class needs to work seamlessly in production whether it’s a book or an article, the PubmedArticle attributes will always be available (set to None in many cases for PubmedBookArticle, e.g. volume, issue, journal), but PubmedBookArticle attributes will only be set when pubmed_type=’book’.
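
The detection rule described above can be illustrated with a minimal sketch. It uses the standard library's xml.etree.ElementTree rather than metapub's own parser, and detect_pubmed_type is a hypothetical helper name, not part of the metapub API.

```python
import xml.etree.ElementTree as ET


def detect_pubmed_type(xmlstr):
    """Return 'book', 'article', or None (hypothetical helper, not metapub's code)."""
    root = ET.fromstring(xmlstr)
    # iter() yields the root and every descendant, so the record element
    # is found whether or not it is wrapped in a PubmedArticleSet.
    tags = {elem.tag for elem in root.iter()}
    if "PubmedBookArticle" in tags:
        return "book"
    if "PubmedArticle" in tags:
        return "article"
    return None


# Tiny hand-written documents, far smaller than real NCBI responses.
article_xml = "<PubmedArticleSet><PubmedArticle><MedlineCitation/></PubmedArticle></PubmedArticleSet>"
book_xml = "<PubmedArticleSet><PubmedBookArticle><BookDocument/></PubmedBookArticle></PubmedArticleSet>"

print(detect_pubmed_type(article_xml))
print(detect_pubmed_type(book_xml))
```

In real use the same check is done for you: read pubmed_type on the PubMedArticle instance instead of inspecting the XML yourself.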

Special handling of certain attributes for PubmedBookArticle:
  • abstract: a joined string from self.book_abstracts

  • title: comes from ArticleTitle

Special attributes for PubmedBookArticle (pubmed_type=’book’):
  • book_id (default: None) - string from IdType=”bookaccession”, e.g. “NBK1403”

  • book_title (default: None) - string with name of book (as differentiated from ArticleTitle)

  • book_publisher (default: None) - dict containing {‘name’: string, ‘location’: string}

  • book_sections (default: []) - dict with key->value pairs as section_name->SectionTitle

  • book_contribution_date (default: None) - python datetime date

  • book_date_revised (default: None) - python datetime date

  • book_history (default: []) - dictionary with key->value pairs as PubStatus -> python datetime

  • book_language (default: None) - string (e.g. “eng”)

  • book_editors (default: []) - list containing names from ‘editors’ AuthorList

  • book_abstracts (default: []) - dict with key->value pairs as Label->AbstractText.text

  • book_medium (default: None) - string (e.g. “Internet”)

  • book_synonyms (default: None) - list of disease synonyms (applicable to “gene” book)

  • book_publication_status (default: None) - string (e.g. “ppublish”)

__init__(xmlstr, *args, **kwargs)[source]

Initialize PubMedArticle from NCBI XML data.

Parameters:
  • xmlstr (str) – XML string from NCBI containing PubmedArticle or PubmedBookArticle data.

  • *args – Additional positional arguments passed to parent class.

  • **kwargs – Additional keyword arguments passed to parent class.

Note

The XML type is automatically detected to handle both regular articles and book chapters. The pubmed_type attribute will be set to ‘article’ or ‘book’ accordingly, and appropriate attributes will be populated.

to_dict()[source]

Convert PubMedArticle to dictionary representation.

Returns:

Dictionary containing all article attributes except internal XML content and processing attributes.

Return type:

Dict[str, Any]

Note

Excludes ‘content’, ‘xml’, and ‘_root’ attributes from the output to provide a clean data representation suitable for serialization.
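
As a rough illustration of the exclusion rule in the note above, the sketch below filters instance attributes by name. The Record class and its fields are hypothetical stand-ins; this is not metapub's implementation, only one plausible way such filtering could work.

```python
class Record:
    """Hypothetical stand-in for a parsed object; not a metapub class."""

    # Attribute names the note above lists as excluded from the output.
    EXCLUDED = ("content", "xml", "_root")

    def __init__(self, xml):
        self.xml = xml
        self.content = object()  # placeholder for a parsed XML tree
        self._root = None
        self.pmid = "12345"      # placeholder value
        self.title = "An example title"

    def to_dict(self):
        # Copy the instance attributes, dropping the parsing internals.
        return {k: v for k, v in vars(self).items() if k not in self.EXCLUDED}


data = Record("<PubmedArticle/>").to_dict()
print(sorted(data))
```

Dropping the raw XML and the parsed tree is what makes the result safe to pass to a serializer such as json.dump.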

property citation

Returns a formatted citation string built from this article’s author(s), title, journal, year, volume, pages, and doi.

Article Example:

McNally EM, et al. Genetic mutations and mechanisms in dilated cardiomyopathy. Journal of Clinical Investigation. 2013; 123:19-26. doi: 10.1172/JCI62862.

Book Example (GeneReviews):

Tranebjarg L, et al. Jervell and Lange-Nielsen syndrome. 2002 Jul 29 (Updated 2014 Nov 20). In: Pagon RA, et al., editors. GeneReviews (Internet). Seattle (WA): University of Washington, Seattle; 1993-2015. Available from: https://www.ncbi.nlm.nih.gov/books/NBK1405/.
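
To make the article layout above concrete, here is a small sketch that assembles the same string from separate fields. format_citation is a hypothetical function, the single-author branch is an assumption, and the second author name is a placeholder; metapub's actual citation code may differ.

```python
def format_citation(authors, title, journal, year, volume, pages, doi=None):
    """Assemble a citation in the article layout shown above (illustrative only)."""
    if not authors:
        author_part = ""
    elif len(authors) == 1:
        # Assumption: a lone author is listed without "et al."
        author_part = f"{authors[0]}. "
    else:
        author_part = f"{authors[0]}, et al. "

    # Strip a trailing period from the title so it is not doubled.
    citation = f"{author_part}{title.rstrip('.')}. {journal}. {year}; {volume}:{pages}."
    if doi:
        citation += f" doi: {doi}."
    return citation


example = format_citation(
    authors=["McNally EM", "Placeholder AB"],  # second name is a placeholder
    title="Genetic mutations and mechanisms in dilated cardiomyopathy",
    journal="Journal of Clinical Investigation",
    year="2013",
    volume="123",
    pages="19-26",
    doi="10.1172/JCI62862",
)
print(example)
```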

property citation_html

Returns an HTML-formatted citation string built from this article’s author(s), title, journal, year, volume, pages, and doi.

Article Example:

McNally EM, <i>et al</i>. Genetic mutations and mechanisms in dilated cardiomyopathy. <i>Journal of Clinical Investigation</i>. 2013; <b>123</b>:19-26. doi: 10.1172/JCI62862.

GeneReviews Example:

Tranebjarg L, <i>et al</i>. <i>Jervell and Lange-Nielsen syndrome</i>. 2002 Jul 29 (Updated 2014 Nov 20). In: Pagon RA, <i>et al</i>., editors. GeneReviews (Internet). Seattle (WA): University of Washington, Seattle; 1993-2015. Available from: https://www.ncbi.nlm.nih.gov/books/NBK1405/.

property citation_bibtex

property pubdate

Normalized publication date as datetime object.

Returns the best available publication date from PubMed XML, in order of preference:
  1. Article PubDate (Year/Month/Day or MedlineDate)

  2. Book contribution date

  3. History dates (pubmed, entrez, etc.)

Returns:

Publication date as datetime object, or None if no date found

Return type:

datetime or None

Example

from metapub import PubMedFetcher

fetch = PubMedFetcher()
article = fetch.article_by_pmid('12345')
if article.pubdate:
    print(f"Published: {article.pubdate.strftime('%Y-%m-%d')}")
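
The order of preference described for pubdate amounts to a fallback chain, sketched below. best_pubdate is a hypothetical helper; it assumes the history is a mapping from PubStatus name to datetime, and it checks only the two statuses named above.

```python
from datetime import datetime


def best_pubdate(article_date=None, book_contribution_date=None, history=None):
    """Return the first available date in the documented order (hypothetical helper)."""
    if article_date is not None:
        return article_date
    if book_contribution_date is not None:
        return book_contribution_date
    # Assumption: history maps PubStatus names to datetime objects.
    for status in ("pubmed", "entrez"):
        if history and history.get(status) is not None:
            return history[status]
    return None


# No article date or book date, so the 'pubmed' history entry is used.
fallback = best_pubdate(history={"pubmed": datetime(2013, 1, 2)})
print(fallback)
```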

static parse_xml(xml, root=None)

Takes xml (str or bytes) and (optionally) a root element definition string.

If root element defined, DOM object returned is rebased with this element as root.

Parameters:
  • xml (str or bytes)

  • root (str) – (optional) name of root element

Returns:

lxml document object.
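
The rebasing behaviour of parse_xml can be approximated as follows. This sketch deliberately swaps in the standard library's xml.etree.ElementTree for lxml, and parse_xml_sketch is a hypothetical name, so treat it as an illustration of the idea rather than the library's code.

```python
import xml.etree.ElementTree as ET


def parse_xml_sketch(xml, root=None):
    """Parse str or bytes; optionally rebase on the first element named `root`."""
    if isinstance(xml, str):
        xml = xml.encode("utf-8")
    doc = ET.fromstring(xml)
    if root is None or doc.tag == root:
        return doc
    # Search all descendants; returns None when no such element exists.
    return doc.find(f".//{root}")


wrapped = "<PubmedArticleSet><PubmedArticle><PMID>1</PMID></PubmedArticle></PubmedArticleSet>"
dom = parse_xml_sketch(wrapped, root="PubmedArticle")
print(dom.tag, dom.findtext("PMID"))
```

Rebasing lets later lookups use paths relative to the record element, regardless of how the response was wrapped.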

The PubMedArticle class is the core data model for representing scientific articles from PubMed. It automatically parses NCBI XML into structured attributes.

Key Attributes

Identifiers
  • pmid - PubMed ID

  • doi - Digital Object Identifier

  • pmc - PubMed Central ID

  • pii - Publisher Item Identifier

Bibliographic Information
  • title - Article title

  • journal - Journal name

  • year - Publication year

  • volume - Journal volume

  • issue - Journal issue

  • pages - Page range

  • first_page - Starting page

  • last_page - Ending page

Authors and Content
  • authors - List of PubMedAuthor objects

  • author_list - Simple list of author name strings

  • abstract - Article abstract text

  • keywords - Author-supplied keywords

Classification
  • mesh_headings - Medical Subject Headings (MeSH) terms

  • publication_types - Type classifications (e.g., “Clinical Trial”)

  • chemicals - Chemical substances mentioned

Dates and History
  • history - Publication timeline with key dates

  • received_date - When manuscript was received

  • accepted_date - When manuscript was accepted

Example Usage

from metapub import PubMedFetcher

fetch = PubMedFetcher()
article = fetch.article_by_pmid('33157158')

# Basic information
print(f"Title: {article.title}")
print(f"Journal: {article.journal} ({article.year})")
print(f"Volume {article.volume}, Issue {article.issue}, Pages {article.pages}")

# Authors
print(f"Authors: {len(article.authors)}")
for author in article.authors:
    print(f"  {author.lastname}, {author.firstname}")

# Content
print(f"Abstract: {(article.abstract or '')[:200]}...")  # abstract may be None
print(f"Keywords: {', '.join(article.keywords) if article.keywords else 'None'}")

# Classification
print(f"MeSH terms: {', '.join(article.mesh_headings[:5])}")  # First 5
print(f"Publication types: {', '.join(article.publication_types)}")

# Generate citation
print(f"Citation: {article.citation}")

# Export to dictionary
data = article.to_dict()

Book Articles

PubMedArticle also handles NCBI book chapters with additional attributes:

# When pubmed_type == 'book'
if article.pubmed_type == 'book':
    print(f"Book ID: {article.book_id}")
    print(f"Book title: {article.book_title}")
    print(f"Publisher: {article.book_publisher}")
    print(f"Editors: {', '.join(article.book_editors)}")

PubMedAuthor

class metapub.PubMedAuthor(xmlelem, *args, **kwargs)[source]

Bases: MetaPubObject

This PubMedAuthor class receives an XML element as its required argument and parses it into its parts, exposing them as attributes.

Usage:

author = PubMedAuthor(xml_elem)

To retrieve the standard representation of an author name, use the __str__ method.

(About unicode: metapub uses unicode_literals in both py3 and py2, so the str() function returns unicode, unless called by a py2k “str()” statement in which unicode_literals is off.)

__init__(xmlelem, *args, **kwargs)[source]

Instantiate with “xml” as string or bytes containing valid XML.

Supply name of root element (string) to set virtual top level. (optional).

to_dict()[source]

static parse_xml(xml, root=None)

Takes xml (str or bytes) and (optionally) a root element definition string.

If root element defined, DOM object returned is rebased with this element as root.

Parameters:
  • xml (str or bytes)

  • root (str) – (optional) name of root element

Returns:

lxml document object.

Represents individual authors with detailed name parsing and affiliation information.

Key Attributes

Name Components
  • lastname - Author’s last name

  • firstname - Author’s first name

  • initials - First/middle initials

  • suffix - Name suffix (Jr., Sr., etc.)

Affiliation
  • affiliation - Institutional affiliation

Example Usage

# Access authors from an article
for author in article.authors:
    print(f"Name: {author.lastname}, {author.firstname}")
    print(f"Initials: {author.initials}")
    if author.affiliation:
        print(f"Affiliation: {author.affiliation}")
    print(f"Full name: {str(author)}")  # Uses __str__ method

MedGenConcept

class metapub.MedGenConcept(xmlstr, *args, **kwargs)[source]

Bases: MetaPubObject

__init__(xmlstr, *args, **kwargs)[source]

Instantiate with “xml” as string or bytes containing valid XML.

Supply name of root element (string) to set virtual top level. (optional).

to_dict()[source]

Returns a dictionary composed of all extractable properties of this concept.

property synonyms

Returns a list of the ‘name’ values from self.names.

property medgen_uid

Synonym for “uid”. Sometimes when juggling concepts from multiple places, this helps.
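
The two properties above are thin views over existing data, which the following sketch tries to show. ConceptSketch is a hypothetical class, the assumed shape of names (a list of dicts with a 'name' key) is inferred from the description, and the uid value is a placeholder.

```python
class ConceptSketch:
    """Hypothetical stand-in for a concept object; not metapub's implementation."""

    def __init__(self, uid, names):
        self.uid = uid
        self.names = names  # assumed shape: list of dicts, each with a 'name' key

    @property
    def synonyms(self):
        # Collect only the 'name' value from each entry.
        return [item["name"] for item in self.names]

    @property
    def medgen_uid(self):
        # Alias for uid, handy when mixing identifiers from several sources.
        return self.uid


concept = ConceptSketch("0000", [{"name": "Cystic fibrosis"}, {"name": "CF"}])
print(concept.synonyms, concept.medgen_uid)
```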

static parse_xml(xml, root=None)

Takes xml (str or bytes) and (optionally) a root element definition string.

If root element defined, DOM object returned is rebased with this element as root.

Parameters:
  • xml (str or bytes)

  • root (str) – (optional) name of root element

Returns:

lxml document object.

Represents medical genetics concepts from NCBI’s MedGen database.

Key Attributes

Identifiers
  • cui - Concept Unique Identifier

  • uid - MedGen UID

  • name - Primary concept name

Content
  • definition - Concept definition

  • synonyms - Alternative names

  • semantic_types - Semantic classifications

Relationships
  • related_concepts - Related MedGen concepts

  • sources - Source vocabularies

Example Usage

from metapub import MedGenFetcher

mg = MedGenFetcher()
concepts = mg.concepts_for_term('cystic fibrosis')

for concept in concepts[:3]:  # First 3 results
    print(f"Name: {concept.name}")
    print(f"CUI: {concept.cui}")
    print(f"Definition: {concept.definition}")

    if concept.synonyms:
        print(f"Synonyms: {', '.join(concept.synonyms[:3])}")  # First 3

    print(f"Semantic types: {', '.join(concept.semantic_types)}")

ClinVarVariant

class metapub.ClinVarVariant(xmlstr, *args, **kwargs)[source]

Bases: MetaPubObject

__init__(xmlstr, *args, **kwargs)[source]

Instantiate with “xml” as string or bytes containing valid XML.

Supply name of root element (string) to set virtual top level. (optional).

to_dict()[source]

Returns a dictionary composed of all extractable properties of this variant.

property hgvs_c

Returns a list of all coding HGVS strings from the Allele data.

property hgvs_g

Returns a list of all genomic HGVS strings from the Allele data.

property hgvs_p

Returns a list of all protein effect HGVS strings from the Allele data.
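
HGVS descriptions carry their type in a prefix after the reference sequence: c. for coding, g. for genomic, and p. for protein. The sketch below groups strings on that prefix to show what the three properties separate. group_hgvs is a hypothetical helper and the accession numbers are placeholders; how metapub extracts these from the XML is not shown here.

```python
def group_hgvs(hgvs_strings):
    """Group HGVS strings by coordinate type prefix (hypothetical helper)."""
    groups = {"c": [], "g": [], "p": []}
    for text in hgvs_strings:
        # The variant description follows the reference sequence and a colon.
        description = text.split(":", 1)[-1]
        kind = description[:1]
        if description[1:2] == "." and kind in groups:
            groups[kind].append(text)
    return groups


# Placeholder accessions, not real ClinVar records.
grouped = group_hgvs([
    "NM_000000.0:c.100A>G",
    "NC_000000.0:g.12345A>G",
    "NP_000000.0:p.Thr34Ala",
])
print(grouped["c"], grouped["g"], grouped["p"])
```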

static parse_xml(xml, root=None)

Takes xml (str or bytes) and (optionally) a root element definition string.

If root element defined, DOM object returned is rebased with this element as root.

Parameters:
  • xml (str or bytes)

  • root (str) – (optional) name of root element

Returns:

lxml document object.

Represents genetic variants from NCBI’s ClinVar database with clinical significance information.

Key Attributes

Identifiers
  • accession - ClinVar accession number

  • variation_id - Variation ID

  • allele_id - Allele ID

Genomic Information
  • hgvs_c - HGVS coding sequence notation

  • hgvs_p - HGVS protein sequence notation

  • hgvs_g - HGVS genomic notation

  • gene_symbol - Associated gene symbol

  • molecular_consequences - Predicted effects

Clinical Data
  • clinical_significance - Clinical interpretation (in lowercase)

  • review_status - Review/evidence level

  • last_evaluated - Date of last evaluation

Supporting Data
  • submitters - Contributing organizations

  • conditions - Associated conditions/diseases

  • citations - Supporting literature

Example Usage

from metapub import ClinVarFetcher

cv = ClinVarFetcher()
variant = cv.variant('12345')  # ClinVar ID

print(f"Accession: {variant.accession}")
print(f"Gene: {variant.gene_symbol}")
print(f"HGVS notation: {variant.hgvs_c}")
print(f"Clinical significance: {variant.clinical_significance}")
print(f"Review status: {variant.review_status}")

if variant.molecular_consequences:
    print(f"Molecular consequences: {', '.join(variant.molecular_consequences)}")

if variant.conditions:
    print(f"Associated conditions: {', '.join(variant.conditions[:3])}")  # First 3

Data Model Utilities

Validation Functions

Example usage:

from metapub.validate import is_valid_pmid, is_valid_doi

# Validate PMIDs before processing
pmids = ['12345678', 'invalid', '23456789', '']
valid_pmids = [pmid for pmid in pmids if is_valid_pmid(pmid)]

# Validate DOIs
dois = ['10.1038/nature12373', 'invalid-doi', '10.1126/science.1234567']
valid_dois = [doi for doi in dois if is_valid_doi(doi)]

Conversion Functions

metapub.convert.pmid2doi(pmid)[source]
Starting with a PubMed ID, look up the article in PubMed. If a DOI is found in the PubMedArticle object, return it. Otherwise, use CrossRef to find the DOI for the given article.

Parameters:

pmid (str or int)

Returns:

doi (str) or None

Raises:

metapub.convert.doi2pmid(doi)[source]

Uses CrossRef and PubMed eutils to look up a PMID given a known DOI.

Warning: NO validation of the input DOI is performed here. Use metapub.text_mining.find_doi_in_string beforehand if needed.

If a PMID can be found, return it. Otherwise return None.

In very rare cases, the CrossRef->pubmed citation method used here may match more than one PubMed ID. In this case, the function returns the word ‘AMBIGUOUS’ instead.

Parameters:

doi (str)

Returns:

pmid (str) if found; ‘AMBIGUOUS’ if citation count > 1; None if no results.

Raises:

NCBIServiceError if NCBI services are down
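
Because doi2pmid has three possible outcomes (a PMID, the string 'AMBIGUOUS', or None), callers may find it convenient to normalize the result before acting on it. The helper below is a hypothetical sketch that works on the return value only and makes no network calls.

```python
def classify_doi2pmid_result(result):
    """Map a doi2pmid-style return value to a (status, pmid) pair (hypothetical helper)."""
    if result is None:
        return ("not_found", None)
    if result == "AMBIGUOUS":
        return ("ambiguous", None)
    return ("found", str(result))


print(classify_doi2pmid_result("33157158"))
print(classify_doi2pmid_result("AMBIGUOUS"))
print(classify_doi2pmid_result(None))
```

Checking for 'AMBIGUOUS' explicitly matters because a plain truthiness test would treat that string as a valid PMID.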

Example usage:

from metapub.convert import pmid2doi, doi2pmid

# Convert PMID to DOI
doi = pmid2doi('33157158')
if doi:
    print(f"DOI: {doi}")

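# Convert DOI to PMID (may return 'AMBIGUOUS' or None)
pmid = doi2pmid('10.1038/nature12373')
if pmid == 'AMBIGUOUS':
    print("More than one PMID matched this DOI")
elif pmid:
    print(f"PMID: {pmid}")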

Working with Multiple Data Types

Integration Example

from metapub import PubMedFetcher, MedGenFetcher, ClinVarFetcher

# Initialize fetchers
fetch = PubMedFetcher()
mg = MedGenFetcher()
cv = ClinVarFetcher()

# Research workflow: gene -> variants -> literature
gene = 'BRCA1'

# 1. Find genetic variants
variant_ids = cv.ids_by_gene(gene, single_gene=True)

# 2. Get MedGen concepts for the gene
concepts = mg.concepts_for_term(f"{gene}[gene]")

# 3. Collect all related literature
all_pmids = set()

# From variants
for var_id in variant_ids[:5]:  # Limit for demo
    pmids = cv.pmids_for_id(var_id)
    all_pmids.update(pmids)

# From MedGen concepts
for concept in concepts[:3]:  # Limit for demo
    pmids = mg.pubmeds_for_cui(concept.cui)
    all_pmids.update(pmids)

# 4. Analyze the literature
print(f"Found {len(all_pmids)} unique articles for {gene}")

# Sample the articles
for pmid in list(all_pmids)[:10]:  # First 10
    try:
        article = fetch.article_by_pmid(pmid)
        print(f"PMID {pmid}: {article.title}")
        print(f"  Journal: {article.journal} ({article.year})")
    except Exception as e:
        print(f"  Error with PMID {pmid}: {e}")

Serialization and Export

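import json
import csv

from metapub import PubMedFetcher

fetch = PubMedFetcher()
pmids = ['33157158']  # example PMID list; substitute your own

# Export article data to JSON
articles_data = []
for pmid in pmids:
    article = fetch.article_by_pmid(pmid)
    articles_data.append(article.to_dict())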

with open('articles.json', 'w') as f:
    json.dump(articles_data, f, indent=2, default=str)

# Export to CSV
with open('articles.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(['PMID', 'Title', 'Journal', 'Year', 'DOI', 'Authors'])

    for article_data in articles_data:
        writer.writerow([
            article_data['pmid'],
            article_data['title'],
            article_data['journal'],
            article_data['year'],
            article_data['doi'],
            '; '.join([str(author) for author in article_data['authors']])
        ])