metapub.urlreverse package
Submodules
metapub.urlreverse.hostname2doiprefix module
metapub.urlreverse.hostname2jrnl module
metapub.urlreverse.methods module
- metapub.urlreverse.methods.get_pnas_doi_from_link(url)[source]
PNAS (Proceedings of the National Academy of Sciences of the USA)
Examples
http://www.pnas.org/content/suppl/2013/07/08/1305207110.DCSupplemental/sapp.pdf –> 10.1073/pnas.1305207110
- Parameters:
url – (str)
- Returns:
doi (str) or None
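The supplement example above suggests a simple pattern: the article number sits directly before ".DCSupplemental" and is appended to the "10.1073/pnas." prefix. The following is a minimal sketch of that idea, not the library's implementation; the function name and the regular expression are assumptions made for illustration.

```python
import re

# Assumed pattern: a run of digits immediately before ".DCSupplemental".
PNAS_SUPPL_RE = re.compile(r"/(\d+)\.DCSupplemental")


def pnas_doi_sketch(url):
    """Compose a PNAS DOI from a supplement URL, or return None."""
    match = PNAS_SUPPL_RE.search(url)
    if match is None:
        return None
    return "10.1073/pnas." + match.group(1)
```

URLs that do not contain the ".DCSupplemental" segment yield None in this sketch.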
- metapub.urlreverse.methods.get_elifesciences_doi_from_link(url)[source]
eLife / http://elifesciences.org
Examples
http://elifesciences.org/content/5/e12203 –> 10.7554/eLife.12203
http://elifesciences.org/content/4/e11205 –> 10.7554/eLife.11205
http://cdn.elifesciences.org/elife-articles/11205/figures-pdf/elife11205-figures.pdf?xxxx
- Parameters:
url – (str)
- Returns:
doi (str) or None
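For the two "/content/<volume>/e<id>" examples above, the DOI is the prefix "10.7554/eLife." followed by the numeric ID. A minimal sketch under that assumption (the function name and pattern are hypothetical, and the cdn.elifesciences.org form is deliberately not handled):

```python
import re

# Assumed pattern: /content/<volume>/e<article id>
ELIFE_CONTENT_RE = re.compile(r"/content/\d+/e(\d+)")


def elife_doi_sketch(url):
    """Compose an eLife DOI from a /content/ URL, or return None."""
    match = ELIFE_CONTENT_RE.search(url)
    if match is None:
        return None
    return "10.7554/eLife." + match.group(1)
```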
- metapub.urlreverse.methods.get_bmj_doi_from_link(url)[source]
BMJ and its subsidiaries use a VIP-like (volume-issue-page) URL format that can sometimes be mapped to the real DOI. If this mapping fails, the VIP->citation routines should still work.
List of BMJ Journals: http://journals.bmj.com/
Examples
http://jmg.bmj.com/content/39/6/e31.full –> 10.1136/jmg.39.6.e31
http://www.bmj.com/content/353/bmj.i2195 –> 10.1136/bmj.i2195
http://www.bmj.com/content/353/bmj.i2139 –> 10.1136/bmj.i2139
- Returns None (should be caught by find_doi_in_string):
http://bmjopengastro.bmj.com/doi/full/10.1136/bmjgast-2015-000075 –> 10.1136/bmjgast-2015-000075
- Returns None (must use VIP->citation routines):
http://gut.bmj.com/content/65/5/767.abstract –> 10.1136/gutjnl-2015-311246
- Parameters:
url – (str)
- Returns:
doi (str) or None
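The mappable examples above follow two shapes: "/content/<volume>/bmj.<id>" on www.bmj.com, and "/content/<volume>/<issue>/<page>" on a journal subdomain. The sketch below only composes a candidate DOI from those shapes. As the gut.bmj.com example shows, a composed candidate is not necessarily a real DOI, so some verification step would still be needed; that step is omitted here. The function name and patterns are assumptions.

```python
import re
from urllib.parse import urlparse


def bmj_candidate_doi(url):
    """Compose an unverified candidate DOI from a BMJ-style URL, or return None."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    if not host.endswith("bmj.com"):
        return None
    # Shape 1: /content/<volume>/bmj.<id>
    match = re.search(r"/content/\d+/(bmj\.[a-z]\d+)", parsed.path)
    if match:
        return "10.1136/" + match.group(1)
    # Shape 2: <journal>.bmj.com/content/<volume>/<issue>/<page>
    match = re.search(r"^/content/(\d+)/(\d+)/(\w+)", parsed.path)
    if match:
        journal = host.split(".")[0]
        return "10.1136/{}.{}.{}.{}".format(journal, *match.groups())
    return None
```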
- metapub.urlreverse.methods.get_spandidos_doi_from_link(url)[source]
Spandidos URLs follow several different conventions, and the website appears to be changing. For now, this method simply scrapes the page for the first available DOI.
Examples
http://www.spandidos-publications.com/or/30/2/553 –> 10.3892/or.2013.2535
http://www.spandidos-publications.com/10.3892/or.2016.4700 –> 10.3892/or.2013.2535
http://www.spandidos-publications.com/10.3892/or.2013.2535/abstract –> 10.3892/or.2013.2535
- Parameters:
url – (str)
- Returns:
doi (str) or None
- metapub.urlreverse.methods.get_karger_doi_from_link(url)[source]
Karger IDs can be found in the URL after the “PDF” or “Abstract” segment and are used to compose a DOI by prepending enough zeroes to make a 9-digit number. The Karger publisher prefix is 10.1159.
- e.g.
http://www.karger.com/Article/Abstract/329047 –> 10.1159/000329047
http://www.karger.com/Article/Abstract/83388 –> 10.1159/000083388
- Parameters:
url – (str)
- Returns:
doi (str) or None
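The zero-padding rule described above can be sketched in a few lines. This is an illustration only; the function name and the regular expression are assumptions, not the library's code.

```python
import re

# Assumed pattern: the numeric ID directly follows "/Abstract/" or "/PDF/".
KARGER_ID_RE = re.compile(r"/(?:Abstract|PDF)/(\d+)")


def karger_doi_sketch(url):
    """Left-pad the Karger ID with zeroes to 9 digits and prepend 10.1159/."""
    match = KARGER_ID_RE.search(url)
    if match is None:
        return None
    return "10.1159/" + match.group(1).zfill(9)
```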
- metapub.urlreverse.methods.get_jstage_doi_from_link(url)[source]
Because J-STAGE URLs are composed somewhat unpredictably with respect to what appears in the segment that ought to contain the first_page element, we have to load the _article page (if we can) and try to extract the DOI.
- Parameters:
url – (str)
- Returns:
doi or None
- metapub.urlreverse.methods.get_sciencedirect_doi_from_link(url)[source]
We can extract the PII from most ScienceDirect links. To get a DOI, we may be able to simply append the PII to the publisher prefix “10.1016/”, or we may have to insert the special separator characters into the PII digits.
Example
http://www.sciencedirect.com/science/article/pii/S0094576599000673
PII = S0094576599000673
DOI = 10.1016/S0094-5765(99)00067-3
- Parameters:
url – (str)
- Returns:
doi or None
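The separator-insertion case can be illustrated with the example above. The sketch assumes the PII layout seen in that example: an optional "S", an 8-character serial number split 4-4, a 2-digit year in parentheses, a 5-digit item number, and a final check character. The function names are hypothetical, and the sketch does not decide whether the plain or the punctuated form is the valid DOI for a given article.

```python
import re

PII_IN_URL_RE = re.compile(r"/pii/(\w+)")
# Assumed layout: S + 4 digits + 4 chars + 2-digit year + 5 digits + check char
PII_PARTS_RE = re.compile(r"(S?)(\d{4})(\d{3}[\dX])(\d{2})(\d{5})([\dX])")


def pii_from_url(url):
    """Return the raw PII found after '/pii/' in the URL, or None."""
    match = PII_IN_URL_RE.search(url)
    return match.group(1) if match else None


def punctuated_doi_from_pii(pii):
    """Insert '-', '(' and ')' into a bare PII and prepend 10.1016/."""
    match = PII_PARTS_RE.fullmatch(pii)
    if match is None:
        return None
    prefix, serial_a, serial_b, year, item, check = match.groups()
    return "10.1016/{}{}-{}({}){}-{}".format(
        prefix, serial_a, serial_b, year, item, check
    )
```

The plain-append variant would simply be `"10.1016/" + pii`.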
- metapub.urlreverse.methods.get_cell_doi_from_link(url)[source]
Cell and ScienceDirect links have similar properties, but Cell abstracts and PDFs come in several different URL types (much like biomedcentral).
Examples
http://www.cell.com/pdf/0092867480906212.pdf –> 10.1016/0092-8674(80)90621-2
http://www.cell.com/cancer-cell/pdf/S1535610806002844.pdf –> 10.1016/j.ccr.2006.09.010
http://www.cell.com/molecular-cell/abstract/S1097-2765(00)80321-4 –> 10.1016/S1097-2765(00)80321-4
http://www.cell.com/current-biology/fulltext/S0960-9822%2816%2930170-1 –> 10.1016/j.cub.2016.03.002
http://www.cell.com/cell-reports/pdfExtended/S2211-1247(15)01030-X –> 10.1016/j.celrep.2015.09.019
http://www.cell.com/ajhg/pdfExtended/S0002-9297(16)30051-9 –> 10.1016/j.ajhg.2016.03.016
http://www.cell.com/ajhg/pdf/S0002-9297(16)00050-1.pdf –> 10.1016/j.ajhg.2016.03.016
- Unsolved cases:
http://www.cell.com/cms/attachment/2020150130/2039963519/mmc1.pdf –> 10.1016/j.neuron.2014.09.027
http://www.cell.com/cms/attachment/2024895080/2044576473/mmc1.pdf –> 10.1016/j.ajhg.2009.01.009
http://www.cell.com/cms/attachment/2030360419/2047969851/mmc1.xlsx –> ?
http://www.cell.com/cms/attachment/2030360419/2047969852/mmc2.xlsx –> ?
- Parameters:
url – (str)
- Returns:
doi or None
- metapub.urlreverse.methods.get_nature_doi_from_link(link)[source]
Custom method to get a DOI from a nature.com URL
Examples
http://www.nature.com/modpathol/journal/vaop/ncurrent/extref/modpathol2014160x3.xlsx –>
http://www.nature.com/onc/journal/v26/n57/full/1210594a.html –> 10.1038/sj.onc.1210594
http://www.nature.com/pr/journal/v79/n5/full/pr201635a.html –> 10.1038/pr.2016.35
Older articles may have very different DOIs, so at the tail end of this process we do a lookup in dx.doi.org. If the DOI is invalid, we should use scrape_doi_from_article_page and return that instead.
- Example of older-style DOI from Pediatric Research journal (‘pr’):
http://www.nature.com/pr/journal/v49/n1/full/pr20018a.html –> 10.1203/00006450-200101000-00008
- Parameters:
link – the URL
- Returns:
a string containing a DOI, if one was resolved, or None
- metapub.urlreverse.methods.get_biomedcentral_doi_from_link(link)[source]
Custom method to get a DOI from a biomedcentral.com URL
- Parameters:
link – (str) the URL
- Returns:
doi (str) or None
- metapub.urlreverse.methods.get_jci_doi_from_link(url)[source]
Journal of Clinical Investigation (JCI) links have a numerical ID that can be used to reconstruct the article’s DOI.
Example
http://www.jci.org/articles/view/32496 –> 10.1172/JCI32496
http://www.jci.org/articles/view/8154/version/1/pdf/render –> 10.1172/JCI8154
- Parameters:
url – (str)
- Returns:
doi or None
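Both examples above place the numeric ID right after "/articles/view/", so the reconstruction can be sketched as follows. The function name and pattern are assumptions made for illustration.

```python
import re

# Assumed pattern: the numeric article ID follows "/articles/view/".
JCI_ID_RE = re.compile(r"/articles/view/(\d+)")


def jci_doi_sketch(url):
    """Compose a JCI DOI (10.1172/JCI<id>) from an article URL, or return None."""
    match = JCI_ID_RE.search(url)
    if match is None:
        return None
    return "10.1172/JCI" + match.group(1)
```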
- metapub.urlreverse.methods.get_ahajournals_doi_from_link(url)[source]
If this is an ahajournals.org journal, we might be able to compose a DOI using the publisher base of 10.1161 and pieces of the URL identifying the article.
Example
http://circimaging.ahajournals.org/content/suppl/2013/04/02/CIRCIMAGING.112.000333.DC1/000333_Supplemental_Material.pdf –> 10.1161/CIRCIMAGING.112.000333
http://jaha.ahajournals.org/content/4/12/e002395.full.pdf –> 10.1161/JAHA.115.002395
- Parameters:
url – (str)
- Returns:
doi or None
- metapub.urlreverse.methods.get_early_release_doi_from_link(url)[source]
Examples
http://cancerres.aacrjournals.org/content/early/2015/12/30/0008-5472.CAN-15-0295.full.pdf –> 10.1158/0008-5472.CAN-15-0295
http://ajcn.nutrition.org/content/early/2016/04/20/ajcn.115.123752.abstract –> 10.3945/ajcn.115.123752
http://www.mcponline.org/content/early/2016/04/25/mcp.O115.055467.full.pdf+html –> 10.1074/mcp.O115.055467
http://nar.oxfordjournals.org/content/early/2013/11/21/nar.gkt1163.full.pdf –> 10.1093/nar/gkt1163
http://jmg.bmj.com/content/early/2008/07/08/jmg.2008.058297 –> 10.1136/jmg.2008.058297
- Parameters:
url – (str)
- Returns:
doi or None
- metapub.urlreverse.methods.get_generic_doi_from_link(url)[source]
Covers many publisher URLs such as wiley and springer.
Examples
http://onlinelibrary.wiley.com/doi/10.1111/j.1582-4934.2011.01476.x/full –> 10.1111/j.1582-4934.2011.01476.x
link.springer.com/article/10.1186/1471-2164-7-243 –> 10.1186/1471-2164-7-243
http://link.springer.com/article/10.1007/s004399900122 –> 10.1007/s004399900122
- Parameters:
url – (str)
- Returns:
doi or None
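In the examples above the DOI appears verbatim in the URL path, sometimes followed by a view suffix such as "/full". One way to sketch this is a DOI-shaped regular expression plus suffix trimming. The pattern, the suffix list, and the function name are all assumptions, not the library's code.

```python
import re
from urllib.parse import unquote

# Assumed DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s?#]+")
# Hypothetical list of view suffixes that are not part of the DOI.
VIEW_SUFFIXES = ("/full", "/abstract", "/pdf")


def generic_doi_sketch(url):
    """Return the first DOI-shaped substring of the URL, or None."""
    match = DOI_RE.search(unquote(url))
    if match is None:
        return None
    doi = match.group(0)
    for suffix in VIEW_SUFFIXES:
        if doi.endswith(suffix):
            doi = doi[: -len(suffix)]
            break
    return doi
```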
- metapub.urlreverse.methods.get_plos_doi_from_link(url)[source]
PLOS ONE (almost?) always has the DOI in the link, with a twist – some of the links we run across are DOIs pointing straight to article supplements.
For example:
Supplement doi: 10.1371/journal.pone.0094554.s002
Article doi: 10.1371/journal.pone.0094554
Since we always want the article DOI for PMID gathering purposes, the DOI returned from this function should be the one pointing to the parent article.
Examples
http://journals.plos.org/plosone/article?id=10.1371%2Fjournal.pone.0154075 –> 10.1371/journal.pone.0154075
http://journals.plos.org/plosone/article?id=info%3Adoi%2F10.1371%2Fjournal.pone.0153994 –> 10.1371/journal.pone.0153994
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0152441#pone-0152441-t002 –> 10.1371/journal.pone.0152441
http://journals.plos.org/plosone/article/asset?unique&id=info:doi/10.1371/journal.pone.0094554.s002 –> 10.1371/journal.pone.0094554
- Parameters:
url – (str)
- Returns:
doi (str) or None
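The parent-article rule can be sketched by percent-decoding the URL and matching only up to the numeric article ID, so that a trailing supplement suffix such as ".s002" or a "#" fragment is never captured. The pattern and function name are assumptions for illustration.

```python
import re
from urllib.parse import unquote

# Assumed shape: 10.1371/journal.<journal code>.<digits>; matching stops at
# the digits, so ".s002" and "#..." are left out.
PLOS_ARTICLE_RE = re.compile(r"10\.1371/journal\.[a-z]+\.\d+")


def plos_article_doi_sketch(url):
    """Return the parent-article DOI found in a PLOS URL, or None."""
    match = PLOS_ARTICLE_RE.search(unquote(url))
    return match.group(0) if match else None
```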
- metapub.urlreverse.methods.try_doi_methods(url)[source]
Tries every “get_*_doi_from_link” method registered in DOI_METHODS and returns a DOI as soon as one is found. As a last resort, uses find_doi_in_string(url), which may work in cases where the DOI can be parsed directly out of the URL.
- Parameters:
url – (str)
- Returns:
{‘doi’: <doi>, ‘method’: <method>} or None
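The try-each-method-then-fall-back pattern described above can be sketched as a loop over a list of callables. Everything here is a stand-in: the two toy methods, the list name, and the fallback regular expression are assumptions, and the sketch stores the method's name under 'method', which may differ from what the library stores.

```python
import re


def jci_sketch(url):
    """Toy publisher method: JCI article ID to DOI."""
    match = re.search(r"jci\.org/articles/view/(\d+)", url)
    return "10.1172/JCI" + match.group(1) if match else None


def karger_sketch(url):
    """Toy publisher method: Karger ID, zero-padded to 9 digits."""
    match = re.search(r"karger\.com/Article/(?:Abstract|PDF)/(\d+)", url)
    return "10.1159/" + match.group(1).zfill(9) if match else None


def find_doi_in_string_sketch(text):
    """Toy last-resort method: first DOI-shaped substring."""
    match = re.search(r"10\.\d{4,9}/[^\s?#]+", text)
    return match.group(0) if match else None


# Hypothetical stand-in for the registry of publisher-specific methods.
SKETCH_DOI_METHODS = [jci_sketch, karger_sketch]


def try_doi_methods_sketch(url):
    """Return {'doi': ..., 'method': ...} from the first method that succeeds."""
    for method in SKETCH_DOI_METHODS + [find_doi_in_string_sketch]:
        doi = method(url)
        if doi:
            return {"doi": doi, "method": method.__name__}
    return None
```

Placing the generic string search last keeps publisher-specific rules from being shadowed by a looser match.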
- metapub.urlreverse.methods.try_vip_methods(url)[source]
Many URLs follow the “volume-issue-page” format. If this URL is one of them, this function will return a dictionary containing at least the volume, issue, and first_page aspects of this article. The ‘jtitle’ key may or may not be filled in depending on whether metapub is aware of this journal’s domain name.
See metapub/urlreverse/hostname2jrnl.py for the list of supported journals (and please consider contributing to the list if you can).
- Parameters:
url – (str)
- Returns:
dict or None
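A volume-issue-page URL of the "/content/<volume>/<issue>/<page>" kind can be parsed as sketched below. The pattern covers only that one URL shape, and the hostname-to-journal mapping is a one-entry hypothetical stand-in for the library's list; 'jtitle' is None when the hostname is not in the mapping.

```python
import re
from urllib.parse import urlparse

# Hypothetical one-entry stand-in for the hostname-to-journal list.
SKETCH_HOSTNAME_TO_JOURNAL = {"jmg.bmj.com": "J Med Genet"}

# Assumed shape: /content/<volume>/<issue>/<first page>
VIP_RE = re.compile(r"/content/(\d+)/(\d+)/(\w+)")


def vip_sketch(url):
    """Return a dict with volume, issue, first_page and jtitle, or None."""
    parsed = urlparse(url)
    match = VIP_RE.search(parsed.path)
    if match is None:
        return None
    volume, issue, first_page = match.groups()
    return {
        "volume": volume,
        "issue": issue,
        "first_page": first_page,
        "jtitle": SKETCH_HOSTNAME_TO_JOURNAL.get(parsed.hostname),
    }
```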
- metapub.urlreverse.methods.try_pmid_methods(url)[source]
Attempts to get the PMID directly out of the URL.
Examples
https://www.ncbi.nlm.nih.gov/pubmed/22253870 –> 22253870
http://aac.asm.org/cgi/pmidlookup?view=long&pmid=7689822 –> 7689822
- Parameters:
url – (str)
- Returns:
pmid or None
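The two examples above carry the PMID either in a "pmid" query parameter or in a "/pubmed/<id>" path segment. A minimal sketch covering just those two shapes (the function name is hypothetical, and the PMID is returned as a string):

```python
import re
from urllib.parse import parse_qs, urlparse


def pmid_from_url_sketch(url):
    """Return the PMID found in the query string or path, or None."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    if "pmid" in query:
        candidate = query["pmid"][0]
        return candidate if candidate.isdigit() else None
    match = re.search(r"/pubmed/(\d+)", parsed.path)
    return match.group(1) if match else None
```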
metapub.urlreverse.urlreverse module
- metapub.urlreverse.urlreverse.get_article_info_from_url(url)[source]
Using regular expressions, attempt to determine the “format” of the submitted URL, and if possible, extract useful information from the URL for article lookup by ID or citation.
- Possible results:
‘vip’: volume-issue-page –> {‘format’: ‘vip’, ‘volume’: <V>, ‘issue’: <I>, ‘first_page’: <P>, ‘jtitle’: <jrnl>}
‘doi’: has doi in the url –> {‘format’: ‘doi’, ‘doi’: <DOI>, ‘method’: <get_doi_function>}
‘pmid’: has pmid in the url –> {‘format’: ‘pmid’, ‘pmid’: <PMID>}
‘pmcid’: has PMC id in the url –> {‘format’: ‘pmcid’, ‘pmcid’: <PMCID>}
- If none of the available methods work to parse the URL, the result dictionary will be:
{‘format’: ‘unknown’}
- Parameters:
url
- Returns:
result dictionary (see above)
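A caller would typically branch on the 'format' key of the result dictionary. The helper below is a hypothetical consumer, not part of the library; it works on plain dictionaries shaped like the ones listed above and returns a short description of the next lookup step.

```python
def lookup_hint(result):
    """Describe the lookup a result dictionary makes possible."""
    fmt = result.get("format", "unknown")
    if fmt == "doi":
        return "look up by DOI " + result["doi"]
    if fmt == "pmid":
        return "look up by PMID " + str(result["pmid"])
    if fmt == "pmcid":
        return "look up by PMCID " + str(result["pmcid"])
    if fmt == "vip":
        # 'jtitle' may be missing when the journal's hostname is unknown.
        return "look up by citation: {} {}({}):{}".format(
            result.get("jtitle"),
            result["volume"],
            result["issue"],
            result["first_page"],
        )
    return "no identifier found"
```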
- class metapub.urlreverse.urlreverse.UrlReverse(url, skip_cache=False, **kwargs)[source]
Bases: object
UrlReverse takes a URL and performs the switchboard operations that hopefully lead to the successful “reversal” of an article URL into its origination DOI and/or PMID.
Whether the object can discover either or both of these identifiers depends heavily on the information available in the URL and on what can be inferred about the publisher or website hosting the article.
Example
from metapub.urlreverse.urlreverse import UrlReverse
urlrev = UrlReverse('http://jmg.bmj.com/content/43/2/97.full.pdf')
print(urlrev.doi)   # 10.1136/jmg.2005.030833
print(urlrev.pmid)  # 15879500
Human inspection can quickly verify that the above PDF maps to the PubMed entry with PMID 15879500.
(Adding a machine-verification step might be a further development of UrlReverse; however, it would add significant page-loading and processing time. Might be better off as an external “wrapper” around the UrlReverse operations.)
The “steps” attribute will be of most interest if you want to know how UrlReverse arrived at its ID conclusions.
In the case of the above BMJ article URL, the URL would typically have been “reversible” to a DOI from its constituent information. However, using DxDOI to verify the resulting DOI, “10.1136/bmj.43.2.97”, raised a DxDOIError, indicating that we did not have the Real McCoy.
Using print(urlrev.steps), we get the following:
[u'FOUND PMID via PubmedFetcher.pmids_for_citation',
 u'FOUND DOI via pmid2doi',
 u'VERIFY dx.doi.org: http://jmg.bmj.com/content/43/2/97']
So, UrlReverse had to use a fallback method: the pmids_for_citation approach, which is relatively slow but in this case got the job done. This approach relies on knowing the volume, first_page, and journal name, and on (hopefully) receiving a single unambiguous result from the query.
When ambiguous results are received, UrlReverse considers this a failure (see steps).
- Parameters:
skip_cache – (default: False) whether to load results afresh, regardless of cache contents.
- Keyword Arguments:
expiry_date – (default: None) forces cache to reload results older than given date.
cachedir – (default: ~/.cache) allows change of cachedir; set to None to disable cache.
debug – (default: False) raises log level of ‘metapub.UrlReverse’ logger to logging.DEBUG