metapub.urlreverse package

Submodules

metapub.urlreverse.hostname2doiprefix module

metapub.urlreverse.hostname2jrnl module

metapub.urlreverse.methods module

metapub.urlreverse.methods.DXDOI()

metapub.urlreverse.methods.get_journal_name_from_url(url)

metapub.urlreverse.methods.get_pnas_doi_from_link(url)

PNAS (Proceedings of the National Academy of Sciences of the USA)

Examples

http://www.pnas.org/content/suppl/2013/07/08/1305207110.DCSupplemental/sapp.pdf –> 10.1073/pnas.1305207110

Parameters:: url – (str)
Returns:: doi (str) or None

metapub.urlreverse.methods.get_elifesciences_doi_from_link(url)

eLife / http://elifesciences.org

Examples

http://elifesciences.org/content/5/e12203 –> 10.7554/eLife.12203
http://elifesciences.org/content/4/e11205 –> 10.7554/eLife.11205
http://elifesciences.org/content/4/e11205-download.pdf
http://cdn.elifesciences.org/elife-articles/11205/figures-pdf/elife11205-figures.pdf?xxxx

Parameters:: url – (str)
Returns:: doi (str) or None

metapub.urlreverse.methods.get_bmj_doi_from_link(url)

BMJ and its subsidiary journals use a VIP-like (volume-issue-page) URL format that can sometimes be mapped to the article’s real DOI. If this mapping fails, the VIP->citation routines should still work.

List of BMJ Journals: http://journals.bmj.com/

Examples

http://jmg.bmj.com/content/39/6/e31.full –> 10.1136/jmg.39.6.e31
http://www.bmj.com/content/353/bmj.i2195 –> 10.1136/bmj.i2195
http://www.bmj.com/content/353/bmj.i2139 –> 10.1136/bmj.i2139

Returns None (should be caught by find_doi_in_string):: http://bmjopengastro.bmj.com/doi/full/10.1136/bmjgast-2015-000075 –> 10.1136/bmjgast-2015-000075
Returns None (must use VIP->citation routines):: http://gut.bmj.com/content/65/5/767.abstract –> 10.1136/gutjnl-2015-311246

Parameters:: url – (str)
Returns:: doi (str) or None

metapub.urlreverse.methods.get_spandidos_doi_from_link(url)

Spandidos URLs follow several different conventions, and the website appears to be changing. For now, we simply scrape the page for the first available DOI.

Examples

http://www.spandidos-publications.com/or/30/2/553 –> 10.3892/or.2013.2535
http://www.spandidos-publications.com/10.3892/or.2016.4700 –> 10.3892/or.2013.2535
http://www.spandidos-publications.com/10.3892/or.2013.2535/abstract –> 10.3892/or.2013.2535

Parameters:: url – (str)
Returns:: doi (str) or None

metapub.urlreverse.methods.get_karger_doi_from_link(url)

Karger IDs can be found in the URL after the “PDF” or “Abstract” segment and used to compose a DOI by left-padding the ID with zeroes to make a 9-digit number. The Karger publisher prefix is 10.1159.

e.g.:
http://www.karger.com/Article/Abstract/329047 –> 10.1159/000329047
http://www.karger.com/Article/Abstract/83388 –> 10.1159/000083388

Parameters:: url – (str)
Returns:: doi (str) or None
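
The zero-padding rule described above can be sketched as follows. This is an illustration only, not metapub’s implementation: the function name karger_doi_sketch and the regular expression are assumptions, and only the “Abstract” and “PDF” segments named in the text are recognized.

```python
import re

# Hypothetical pattern: the numeric ID follows the "Abstract" or "PDF" segment.
KARGER_ID = re.compile(r"karger\.com/Article/(?:Abstract|PDF)/(\d+)")


def karger_doi_sketch(url):
    """Return a Karger DOI composed from the article ID in the URL, or None."""
    match = KARGER_ID.search(url)
    if match is None:
        return None
    # Left-pad the ID with zeroes to 9 digits, then prepend the 10.1159 prefix.
    return "10.1159/" + match.group(1).zfill(9)


print(karger_doi_sketch("http://www.karger.com/Article/Abstract/329047"))
print(karger_doi_sketch("http://www.karger.com/Article/Abstract/83388"))
```

Using str.zfill keeps the padding independent of how many digits the ID happens to have.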

metapub.urlreverse.methods.get_jstage_doi_from_link(url)

Since J-STAGE URLs are somewhat unpredictable in the segment that ought to contain the first_page element, we have to load the _article page (if we can) and try to extract the DOI from it.

Parameters:: url – (str)
Returns:: doi or None

metapub.urlreverse.methods.get_sciencedirect_doi_from_link(url)

We can extract the PII from most ScienceDirect links. To get a DOI, we may be able to simply append the PII to the publisher prefix “10.1016/”, or we may have to insert the special separator characters into the PII digits.

Example

http://www.sciencedirect.com/science/article/pii/S0094576599000673

PII = S0094576599000673
DOI = 10.1016/S0094-5765(99)00067-3

Parameters:: url – (str)
Returns:: doi or None
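
A minimal sketch of the separator-insertion case, based only on the example above. The names format_pii and sciencedirect_doi_sketch are hypothetical, and the sketch assumes a PII made of an optional leading letter followed by 16 characters. It does not decide between the plain-append and separator variants, which the real function may need to do.

```python
import re

# Assumed URL shape: .../pii/<optional letter><15 digits><check character>
PII_IN_URL = re.compile(r"/pii/([A-Z]?\d{15}[\dX])")


def format_pii(pii):
    """Insert separators: S0094576599000673 -> S0094-5765(99)00067-3."""
    prefix = pii[0] if pii[0].isalpha() else ""
    body = pii[len(prefix):]
    if len(body) != 16:
        return None
    # Layout: 4 chars, hyphen, 4 chars, 2 chars in parentheses, 5 chars, hyphen, check char.
    return "{}{}-{}({}){}-{}".format(
        prefix, body[0:4], body[4:8], body[8:10], body[10:15], body[15]
    )


def sciencedirect_doi_sketch(url):
    """Return a separator-style DOI built from the PII in the URL, or None."""
    match = PII_IN_URL.search(url)
    if match is None:
        return None
    formatted = format_pii(match.group(1))
    return "10.1016/" + formatted if formatted else None


print(sciencedirect_doi_sketch(
    "http://www.sciencedirect.com/science/article/pii/S0094576599000673"
))
```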

metapub.urlreverse.methods.get_cell_doi_from_link(url)

Cell and ScienceDirect links have similar properties, but there are several different url types for Cell abstracts and PDFs (much like biomedcentral).

Examples

http://www.cell.com/pdf/0092867480906212.pdf –> 10.1016/0092-8674(80)90621-2
http://www.cell.com/cancer-cell/pdf/S1535610806002844.pdf –> 10.1016/j.ccr.2006.09.010
http://www.cell.com/molecular-cell/abstract/S1097-2765(00)80321-4 –> 10.1016/S1097-2765(00)80321-4
http://www.cell.com/current-biology/fulltext/S0960-9822%2816%2930170-1 –> 10.1016/j.cub.2016.03.002
http://www.cell.com/cell-reports/pdfExtended/S2211-1247(15)01030-X –> 10.1016/j.celrep.2015.09.019
http://www.cell.com/ajhg/pdfExtended/S0002-9297(16)30051-9 –> 10.1016/j.ajhg.2016.03.016
http://www.cell.com/ajhg/pdf/S0002-9297(16)00050-1.pdf –> 10.1016/j.ajhg.2016.03.016

Unsolved cases::
http://www.cell.com/cms/attachment/2020150130/2039963519/mmc1.pdf –> 10.1016/j.neuron.2014.09.027
http://www.cell.com/cms/attachment/2024895080/2044576473/mmc1.pdf –> 10.1016/j.ajhg.2009.01.009
http://www.cell.com/cms/attachment/2030360419/2047969851/mmc1.xlsx –> ?
http://www.cell.com/cms/attachment/2030360419/2047969852/mmc2.xlsx –> ?

Parameters:: url – (str)
Returns:: doi or None

metapub.urlreverse.methods.get_nature_doi_from_link(link)

Custom method to get a DOI from a nature.com URL

Examples

http://www.nature.com/modpathol/journal/vaop/ncurrent/extref/modpathol2014160x3.xlsx –>
http://www.nature.com/onc/journal/v26/n57/full/1210594a.html –> 10.1038/sj.onc.1210594
http://www.nature.com/pr/journal/v79/n5/full/pr201635a.html –> 10.1038/pr.2016.35

Older articles may have very different DOIs, so at the tail end of this process we do a lookup in dx.doi.org. If the DOI is invalid, we should use scrape_doi_from_article_page and return that instead.

Example of older-style DOI from Pediatric Research journal (‘pr’):: http://www.nature.com/pr/journal/v49/n1/full/pr20018a.html –> 10.1203/00006450-200101000-00008

Parameters:: link – the URL
Returns:: a string containing a DOI, if one was resolved, or None

metapub.urlreverse.methods.get_biomedcentral_doi_from_link(link)

Custom method to get a DOI from a biomedcentral.com URL

Parameters:: link – (str) the URL
Returns:: doi (str) or None

metapub.urlreverse.methods.get_jci_doi_from_link(url)

Journal of Clinical Investigation (JCI) links have a numerical ID that can be used to reconstruct the article’s DOI.

Example

http://www.jci.org/articles/view/32496 –> 10.1172/JCI32496
http://www.jci.org/articles/view/8154/version/1/pdf/render –> 10.1172/JCI8154

Parameters:: url – (str)
Returns:: doi or None
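
As a rough illustration of the reconstruction described above, the numerical ID can be captured with a regular expression and appended to the “10.1172/JCI” stem seen in the examples. The function name and pattern below are assumptions for this sketch, not metapub’s code.

```python
import re

# Assumed URL shape, taken from the two examples: /articles/view/<numeric id>
JCI_ID = re.compile(r"jci\.org/articles/view/(\d+)")


def jci_doi_sketch(url):
    """Return a JCI DOI rebuilt from the numerical article ID, or None."""
    match = JCI_ID.search(url)
    return "10.1172/JCI" + match.group(1) if match else None


print(jci_doi_sketch("http://www.jci.org/articles/view/32496"))
print(jci_doi_sketch("http://www.jci.org/articles/view/8154/version/1/pdf/render"))
```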

metapub.urlreverse.methods.get_ahajournals_doi_from_link(url)

If this is an ahajournals.org journal, we might be able to compose a DOI using the publisher base of 10.1161 and pieces of the URL identifying the article.

Example

http://circimaging.ahajournals.org/content/suppl/2013/04/02/CIRCIMAGING.112.000333.DC1/000333_Supplemental_Material.pdf –> 10.1161/CIRCIMAGING.112.000333

http://jaha.ahajournals.org/content/4/12/e002395.full.pdf –> 10.1161/JAHA.115.002395

Parameters:: url – (str)
Returns:: doi or None

metapub.urlreverse.methods.get_early_release_doi_from_link(url)

Examples

http://cancerres.aacrjournals.org/content/early/2015/12/30/0008-5472.CAN-15-0295.full.pdf –> 10.1158/0008-5472.CAN-15-0295
http://ajcn.nutrition.org/content/early/2016/04/20/ajcn.115.123752.abstract –> 10.3945/ajcn.115.123752
http://www.mcponline.org/content/early/2016/04/25/mcp.O115.055467.full.pdf+html –> 10.1074/mcp.O115.055467
http://nar.oxfordjournals.org/content/early/2013/11/21/nar.gkt1163.full.pdf –> 10.1093/nar/gkt1163
http://jmg.bmj.com/content/early/2008/07/08/jmg.2008.058297 –> 10.1136/jmg.2008.058297

Parameters:: url – (str)
Returns:: doi or None

metapub.urlreverse.methods.get_generic_doi_from_link(url)

Covers many publisher URLs, such as those from Wiley and Springer.

Examples

http://onlinelibrary.wiley.com/doi/10.1111/j.1582-4934.2011.01476.x/full –> 10.1111/j.1582-4934.2011.01476.x
link.springer.com/article/10.1186/1471-2164-7-243 –> 10.1186/1471-2164-7-243
http://link.springer.com/article/10.1007/s004399900122 –> 10.1007/s004399900122

Parameters:: url – (str)
Returns:: doi or None
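
When the DOI is embedded in the URL path, as in the examples above, a single pattern can often pull it out. The sketch below is an assumption-laden illustration, not the library’s method: the DOI pattern, the list of trailing view segments, and the name generic_doi_sketch are all hypothetical.

```python
import re

# Loose DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s?#]+")
# Trailing view segments that publishers append after the DOI (assumed list).
TRAILING_VIEW = re.compile(r"/(?:full|abstract|pdf|epdf)$")


def generic_doi_sketch(url):
    """Return the first DOI-like substring of the URL, or None."""
    match = DOI_PATTERN.search(url)
    if match is None:
        return None
    return TRAILING_VIEW.sub("", match.group(0))


print(generic_doi_sketch(
    "http://onlinelibrary.wiley.com/doi/10.1111/j.1582-4934.2011.01476.x/full"
))
print(generic_doi_sketch("link.springer.com/article/10.1186/1471-2164-7-243"))
```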

metapub.urlreverse.methods.get_plos_doi_from_link(url)

PLOS ONE (almost?) always has the DOI in the link, with a twist: some of the links we run across contain DOIs pointing straight to article supplements.

For example:

Supplement doi: 10.1371/journal.pone.0094554.s002
Article doi: 10.1371/journal.pone.0094554

Since we always want the article DOI for PMID gathering purposes, the DOI returned from this function should be the one pointing to the parent article.

Examples

http://journals.plos.org/plosone/article?id=10.1371%2Fjournal.pone.0154075 –> 10.1371/journal.pone.0154075
http://journals.plos.org/plosone/article?id=info%3Adoi%2F10.1371%2Fjournal.pone.0153994 –> 10.1371/journal.pone.0153994
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152441#pone-0152441-t002 –> 10.1371/journal.pone.0152441
http://journals.plos.org/plosone/article/asset?unique&id=info:doi/10.1371/journal.pone.0094554.s002 –> 10.1371/journal.pone.0094554

Parameters:: url – (str)
Returns:: doi (str) or None
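
One way to reduce a supplement DOI to its parent article DOI is to percent-decode the URL and match only the article-level part of the DOI. The sketch below assumes the “10.1371/journal.<code>.<digits>” shape seen in the examples; the pattern and function name are hypothetical, not taken from metapub.

```python
import re
from urllib.parse import unquote

# Matches the article-level DOI and stops before any ".s002"-style suffix.
PLOS_ARTICLE_DOI = re.compile(r"10\.1371/journal\.[a-z]+\.\d+")


def plos_article_doi_sketch(url):
    """Return the parent-article DOI found in a PLOS URL, or None."""
    # Decode %2F and %3A so encoded and plain links are handled alike.
    match = PLOS_ARTICLE_DOI.search(unquote(url))
    return match.group(0) if match else None


print(plos_article_doi_sketch(
    "http://journals.plos.org/plosone/article/asset"
    "?unique&id=info:doi/10.1371/journal.pone.0094554.s002"
))
```

Because the pattern ends at the numeric article ID, fragments such as “#pone-0152441-t002” and supplement suffixes are dropped without extra handling.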

metapub.urlreverse.methods.try_doi_methods(url)

Tries every “get_*_doi_from_link” method registered in DOI_METHODS and returns a doi when/if it finds one. As a last resort, uses find_doi_in_string(url), which may work in cases where the DOI can be parsed directly out of the URL.

Parameters:: url – (str)
Returns:: {‘doi’: <doi>, ‘method’: <method>} or None
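
The try-each-method-then-fall-back pattern described above might look roughly like this. Everything here is a stand-in: DOI_METHODS_SKETCH holds a single hypothetical extractor rather than the real DOI_METHODS registry, and doi_in_string_sketch only imitates the role of find_doi_in_string.

```python
import re


def jci_sketch(url):
    """Hypothetical publisher-specific extractor (stand-in for a real one)."""
    match = re.search(r"jci\.org/articles/view/(\d+)", url)
    return "10.1172/JCI" + match.group(1) if match else None


def doi_in_string_sketch(url):
    """Hypothetical last resort: look for a DOI-shaped substring in the URL."""
    match = re.search(r"10\.\d{4,9}/[^\s?#]+", url)
    return match.group(0) if match else None


# Stand-in registry; the real list of methods is not reproduced here.
DOI_METHODS_SKETCH = [jci_sketch]


def try_doi_methods_sketch(url):
    """Return {'doi': ..., 'method': ...} from the first method that succeeds."""
    for method in DOI_METHODS_SKETCH + [doi_in_string_sketch]:
        doi = method(url)
        if doi:
            return {"doi": doi, "method": method}
    return None


result = try_doi_methods_sketch("http://www.jci.org/articles/view/32496")
print(result["doi"], result["method"].__name__)
```

Recording which method succeeded makes it possible to audit later how a DOI was obtained.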

metapub.urlreverse.methods.try_vip_methods(url)

Many URLs follow the “volume-issue-page” format. If this URL is one of them, this function will return a dictionary containing at least the volume, issue, and first_page aspects of this article. The ‘jtitle’ key may or may not be filled in depending on whether metapub is aware of this journal’s domain name.

See metapub/urlreverse/hostname2jrnl.py for the list of supported journals (and please consider contributing to the list if you can).

Parameters:: url – (str)
Returns:: dict or None
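
A minimal sketch of volume-issue-page parsing, assuming the “/content/<volume>/<issue>/<page>” path layout seen in the BMJ examples elsewhere in this document. The hostname table is a one-entry illustration and is not the library’s journal list; the names VIP_PATH, HOSTNAME_TO_JOURNAL_SKETCH, and vip_sketch are hypothetical.

```python
import re
from urllib.parse import urlparse

# Assumed path layout: /content/<volume>/<issue>/<first_page>[.suffix]
VIP_PATH = re.compile(
    r"^/content/(?P<volume>\d+)/(?P<issue>\d+)/(?P<first_page>[A-Za-z]?\d+)"
)

# Illustrative one-entry table; the real mapping is maintained by metapub.
HOSTNAME_TO_JOURNAL_SKETCH = {"jmg.bmj.com": "J Med Genet"}


def vip_sketch(url):
    """Return a dict with volume, issue, first_page and jtitle, or None."""
    parsed = urlparse(url)
    match = VIP_PATH.match(parsed.path)
    if match is None:
        return None
    result = match.groupdict()
    # jtitle stays None when the hostname is not in the table.
    result["jtitle"] = HOSTNAME_TO_JOURNAL_SKETCH.get(parsed.hostname)
    return result


print(vip_sketch("http://jmg.bmj.com/content/43/2/97.full.pdf"))
```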

metapub.urlreverse.methods.try_pmid_methods(url)

Attempts to get the PMID directly out of the URL.

Examples

https://www.ncbi.nlm.nih.gov/pubmed/22253870 –> 22253870
http://aac.asm.org/cgi/pmidlookup?view=long&pmid=7689822 –> 7689822

Parameters:: url – (str)
Returns:: pmid or None
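
The two example URLs carry the PMID in different places: one in the path and one in the query string. A sketch covering just those two shapes follows; the function name and the patterns are assumptions, and the real function may recognize more formats.

```python
import re
from urllib.parse import parse_qs, urlparse


def pmid_sketch(url):
    """Return the PMID as a string if the URL exposes one, else None."""
    parsed = urlparse(url)
    # Case 1: ...?pmid=<digits> in the query string.
    values = parse_qs(parsed.query).get("pmid")
    if values and values[0].isdigit():
        return values[0]
    # Case 2: /pubmed/<digits> in the path.
    match = re.search(r"/pubmed/(\d+)", parsed.path)
    return match.group(1) if match else None


print(pmid_sketch("https://www.ncbi.nlm.nih.gov/pubmed/22253870"))
print(pmid_sketch("http://aac.asm.org/cgi/pmidlookup?view=long&pmid=7689822"))
```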

metapub.urlreverse.urlreverse module

metapub.urlreverse.urlreverse.get_article_info_from_url(url)

Using regular expressions, attempt to determine the “format” of the submitted URL, and if possible, extract useful information from the URL for article lookup by ID or citation.

Possible results::
‘vip’: volume-issue-page –> {‘format’: ‘vip’, ‘volume’: <V>, ‘issue’: <I>, ‘first_page’: <P>, ‘jtitle’: <jrnl>}
‘doi’: has doi in the url –> {‘format’: ‘doi’, ‘doi’: <DOI>, ‘method’: <get_doi_function>}
‘pmid’: has pmid in the url –> {‘format’: ‘pmid’, ‘pmid’: <PMID>}
‘pmcid’: has PMC id in the url –> {‘format’: ‘pmcid’, ‘pmcid’: <PMCID>}

If none of the available methods work to parse the URL, the result dictionary will be:: {‘format’: ‘unknown’}

Parameters:: url
Returns:: result dictionary (see above)
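
Callers will typically branch on the ‘format’ key of the result dictionary. The sketch below works on hand-built dictionaries shaped like the results listed above, so it runs without metapub installed; describe_lookup is a hypothetical helper and not part of the library.

```python
def describe_lookup(result):
    """Summarize which lookup a result dictionary makes possible."""
    fmt = result.get("format", "unknown")
    if fmt in ("doi", "pmid", "pmcid"):
        # For ID formats, the identifier is stored under the format's own name.
        return "lookup by {}: {}".format(fmt, result[fmt])
    if fmt == "vip":
        # Citation-style lookup from volume, issue and first page.
        return "citation lookup: {jtitle} {volume}({issue}):{first_page}".format(**result)
    return "no lookup possible"


print(describe_lookup({"format": "pmid", "pmid": "22253870"}))
print(describe_lookup({"format": "vip", "volume": "43", "issue": "2",
                       "first_page": "97", "jtitle": "J Med Genet"}))
print(describe_lookup({"format": "unknown"}))
```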

class metapub.urlreverse.urlreverse.UrlReverse(url, skip_cache=False, **kwargs)

Bases: object

UrlReverse takes a url and performs the switchboard operations that hopefully lead to the successful “reversal” of an article url into its origination DOI and/or PMID.

Whether the object is able to discover either or both of these identifiers depends heavily on the information available in the URL and on what can be inferred from what is known about the publisher or website where the article was found.

Example

from metapub.urlreverse.urlreverse import UrlReverse

urlrev = UrlReverse('http://jmg.bmj.com/content/43/2/97.full.pdf')
print(urlrev.doi)   # 10.1136/jmg.2005.030833
print(urlrev.pmid)  # 15879500

Human inspection can quickly verify that the above PDF definitely maps to this PubMed entry:

https://www.ncbi.nlm.nih.gov/pubmed/15879500

(Adding a machine-verification step might be a further development of UrlReverse; however, it would add significant page-loading and processing time. Might be better off as an external “wrapper” around the UrlReverse operations.)

The “steps” attribute will be of most interest if you want to know how UrlReverse arrived at its ID conclusions.

In the case of the above BMJ article URL, the URL would typically have been “reversible” to a DOI from its constituent information. However, using DxDOI to verify the resulting DOI (“10.1136/bmj.43.2.97”) raised a DxDOIError, indicating that we did not have the Real McCoy.

Using print(urlrev.steps), we get the following:

[u'FOUND PMID via PubmedFetcher.pmids_for_citation',
 u'FOUND DOI via pmid2doi',
 u'VERIFY dx.doi.org: http://jmg.bmj.com/content/43/2/97']

So UrlReverse had to use a fallback method, the pmids_for_citation approach, which is relatively slow but in this case got the job done. This approach relies on knowing the volume, first_page, and journal name, and on (hopefully) receiving a single unambiguous result from the query.

When ambiguous results are received, UrlReverse considers this a failure (see steps).

Parameters:

skip_cache – (default: False) whether to load results afresh, regardless of cache contents.

Keyword Arguments:

expiry_date – (default: None) forces cache to reload results older than given date.
cachedir – (default: ~/.cache) allows change of cachedir; set to None to disable cache.
debug – (default: False) raises log level of ‘metapub.UrlReverse’ logger to logging.DEBUG

__init__(url, skip_cache=False, **kwargs)

to_dict()[source]: Returns a dictionary containing all public object attributes (i.e. not starting with an underscore). Function objects are converted to their names for JSON serialization.