How Scholarly Identifiers Are Defined

Introduction

This vignette explains how common scholarly identifiers are formally defined, what their structural components are, and what it means for them to be valid in a programmatic context.

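When working with identifiers in R, it is essential to distinguish structural validity (the string has the form that the identifier's definition requires) from stronger notions of validity, such as a check character that actually matches.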

The functions in scholid operate at the structural level. The regexes shown below describe the structural form that an identifier must match.


DOI (Digital Object Identifier)

Governing body: International DOI Foundation
Standard: ISO 26324

Structure

A DOI has two parts:

prefix/suffix

Prefix

  • Always begins with 10.
  • Followed by a registrant code (4–9 digits)

Examples:

10.1000
10.1038

Suffix

  • Assigned by the registrant
  • May contain almost any printable character
  • Has no globally fixed grammar
  • Case-insensitive: DOIs that differ only in ASCII letter case identify the same object

Examples (full DOIs; the suffix is the part after the first slash):

10.1000/182
10.1038/s41586-020-2649-2

Important Properties

Structural Regex

A commonly accepted structural regex:

^10\.\d{4,9}/\S+$

This checks:

  • Prefix starts with 10.
  • 4–9 digits
  • A slash
  • Non-whitespace suffix
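The check described above can be sketched in a few lines. This is an illustrative Python sketch under stated assumptions, not scholid's R implementation; the helper name `split_doi` is hypothetical.

```python
import re

# Structural DOI pattern quoted in the text above.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def split_doi(candidate):
    """Return (prefix, suffix) if candidate matches the pattern, else None.

    Hypothetical helper for illustration only.
    """
    # fullmatch also rejects a trailing newline, which a bare `$` would allow.
    if DOI_PATTERN.fullmatch(candidate) is None:
        return None
    # Split on the first slash only: the suffix may itself contain slashes.
    prefix, suffix = candidate.split("/", 1)
    return prefix, suffix


print(split_doi("10.1038/s41586-020-2649-2"))
print(split_doi("11.1038/abc"))
```

A match only says the string has the shape of a DOI. It does not show that the DOI is registered or resolves.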


ORCID

Governing body: ORCID, Inc.
Standard basis: ISO 7064 (checksum algorithm)

Structure

An ORCID iD consists of 16 characters, conventionally displayed in four hyphen-separated groups:

0000-0002-1825-0097

Components

  • 16 characters total: 15 digits plus a final check character
  • Grouped as 4-4-4-4
  • Final character is the check character
  • Check character is a digit or X (X stands for the value 10)

Internally (without hyphens):

0000000218250097

Checksum

Uses the ISO 7064 MOD 11-2 algorithm.
A structurally correct ORCID iD may still be invalid if the check character does not match.
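To make the difference between structural and checksum validity concrete, here is a minimal Python sketch of the MOD 11-2 computation. The function names are hypothetical, and this is not presented as scholid's implementation.

```python
import re


def orcid_check_character(base_digits):
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    # A result of 10 is written as the letter X.
    return "X" if result == 10 else str(result)


def has_valid_orcid_checksum(orcid):
    """Accept the hyphenated or compact form and verify the final character."""
    compact = orcid.replace("-", "")
    # Structural check first: 15 digits followed by a digit or X.
    if re.fullmatch(r"\d{15}[\dX]", compact) is None:
        return False
    return orcid_check_character(compact[:15]) == compact[15]


print(has_valid_orcid_checksum("0000-0002-1825-0097"))  # True
print(has_valid_orcid_checksum("0000-0002-1825-0098"))  # False
```

The second call shows a string that matches the structural pattern but fails the checksum.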

Structural Regex

Hyphenated form:

^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$

Unhyphenated internal form:

^\d{15}[\dX]$

ISBN (International Standard Book Number)

Governing body: International ISBN Agency
Standard: ISO 2108

Two Forms

ISBN-10

  • 9 digits + checksum digit
  • Check digit may be X

Examples (both match the structural pattern; only the first has a correct checksum):

0306406152
030640615X

ISBN-13

  • 13 digits
  • Begins with the prefix 978 or 979
  • EAN-13 checksum algorithm

Example:

9780306406157

Structural Regex

ISBN-10:

^\d{9}[\dX]$

ISBN-13:

^\d{13}$
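The two patterns above say nothing about the check digit. The following Python sketch adds that step. The EAN-13 rule for ISBN-13 is named in the text; the weighted modulo-11 rule for ISBN-10 is the standard one but is not spelled out in this vignette. Function names are hypothetical.

```python
import re


def is_valid_isbn10(isbn):
    """Structural check plus the weighted modulo-11 checksum."""
    if re.fullmatch(r"\d{9}[\dX]", isbn) is None:
        return False
    # Weights run from 10 down to 1; a trailing X stands for the value 10.
    values = [10 if ch == "X" else int(ch) for ch in isbn]
    total = sum(w * v for w, v in zip(range(10, 0, -1), values))
    return total % 11 == 0


def is_valid_isbn13(isbn):
    """Structural check plus the EAN-13 checksum."""
    if re.fullmatch(r"\d{13}", isbn) is None:
        return False
    # Weights alternate 1, 3 across all 13 digits; the total must end in 0.
    total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(isbn))
    return total % 10 == 0


print(is_valid_isbn10("0306406152"))     # True
print(is_valid_isbn10("030640615X"))     # False: right shape, wrong check digit
print(is_valid_isbn13("9780306406157"))  # True
```

These functions expect the compact form; hyphens or spaces would need to be stripped first.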

ISSN (International Standard Serial Number)

Governing body: ISSN International Centre
Standard: ISO 3297

Structure

An ISSN has 8 characters, displayed as two groups of four separated by a hyphen:

1234-567X

Components

  • 7 digits
  • 1 checksum digit (0–9 or X)
  • Canonical display includes a hyphen after 4 digits

Compact form (without the hyphen):

1234567X

Structural Regex

Hyphenated:

^\d{4}-\d{3}[\dX]$

Compact form:

^\d{7}[\dX]$
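As with the other checksummed identifiers, a structural match can be followed by a check-character test. The sketch below assumes the usual ISSN rule: weights 8 down to 2 on the first seven digits, with the check character completing a multiple of 11. This vignette does not spell that rule out, so treat the sketch as illustrative. The function name is hypothetical.

```python
import re


def is_valid_issn(issn):
    """Accept the hyphenated or compact form, then verify the check character."""
    if re.fullmatch(r"\d{4}-\d{3}[\dX]", issn):
        compact = issn.replace("-", "")
    elif re.fullmatch(r"\d{7}[\dX]", issn):
        compact = issn
    else:
        return False
    # Weights run from 8 down to 1; a trailing X stands for the value 10.
    values = [10 if ch == "X" else int(ch) for ch in compact]
    total = sum(w * v for w, v in zip(range(8, 0, -1), values))
    return total % 11 == 0


print(is_valid_issn("0378-5955"))  # True
print(is_valid_issn("1234-567X"))  # False
```

Under this rule the placeholder 1234-567X shown earlier matches the structural pattern but does not carry a correct check character; 1234-5679 would.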

arXiv Identifier

Authority: arXiv (Cornell University)

Two Formats

Modern (since April 2007)

YYMM.NNNN
YYMM.NNNNN

Optional version suffix:

YYMM.NNNNvN

Components:

  • 4-digit year/month (YYMM)
  • Dot
  • 4–5 digit submission number
  • Optional version vN

Structural regex:

^\d{4}\.\d{4,5}(v\d+)?$

Legacy (before April 2007)

archive/YYMMNNN

Example:

hep-th/9901001

Structural regex:

^[a-z\-]+/\d{7}(v\d+)?$
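The two arXiv patterns can be combined into one small parser that reports the scheme and separates the version suffix. This is a Python sketch for illustration; `parse_arxiv` is a hypothetical name and not part of scholid.

```python
import re

# Patterns quoted from the text above.
MODERN = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")
LEGACY = re.compile(r"^[a-z\-]+/\d{7}(v\d+)?$")


def parse_arxiv(identifier):
    """Return (scheme, base_id, version) or None if neither pattern matches."""
    if MODERN.fullmatch(identifier):
        scheme = "modern"
    elif LEGACY.fullmatch(identifier):
        scheme = "legacy"
    else:
        return None
    # Separate a trailing vN suffix, if present.
    version_match = re.search(r"v(\d+)$", identifier)
    if version_match is None:
        return scheme, identifier, None
    base = identifier[: version_match.start()]
    return scheme, base, int(version_match.group(1))


print(parse_arxiv("1501.00001v3"))
print(parse_arxiv("hep-th/9901001"))
print(parse_arxiv("math.GT/0309136"))
```

Two limits of the quoted patterns are worth knowing. The modern pattern does not check that MM is a real month, and the legacy pattern as written rejects identifiers that carry a subject class, such as math.GT/0309136, because it allows only lowercase letters and hyphens before the slash.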

PMID (PubMed Identifier)

Authority: U.S. National Library of Medicine

Structure

A PMID is a plain sequence of digits, with no prefix or separators.

Example:

12345678

Structural regex:

^\d+$

PMCID (PubMed Central Identifier)

Authority: PubMed Central

Structure

PMC1234567

Components:

  • Literal prefix PMC
  • One or more digits

Structural regex:

^PMC\d+$
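Because every pattern in this vignette is purely structural, a single string can match more than one of them. The Python sketch below collects the regexes quoted above and reports every match. The dictionary keys and the function name are hypothetical labels chosen for this example.

```python
import re

# Structural patterns quoted in this vignette; the keys are illustrative labels.
PATTERNS = {
    "doi": r"^10\.\d{4,9}/\S+$",
    "orcid": r"^\d{4}-\d{4}-\d{4}-\d{3}[\dX]$",
    "isbn10": r"^\d{9}[\dX]$",
    "isbn13": r"^\d{13}$",
    "issn": r"^\d{4}-\d{3}[\dX]$",
    "issn_compact": r"^\d{7}[\dX]$",
    "arxiv": r"^\d{4}\.\d{4,5}(v\d+)?$",
    "arxiv_legacy": r"^[a-z\-]+/\d{7}(v\d+)?$",
    "pmid": r"^\d+$",
    "pmcid": r"^PMC\d+$",
}


def matching_types(candidate):
    """Return the label of every pattern that the candidate matches."""
    return [
        name
        for name, pattern in PATTERNS.items()
        if re.fullmatch(pattern, candidate)
    ]


print(matching_types("PMC1234567"))  # ['pmcid']
print(matching_types("12345678"))    # ['issn_compact', 'pmid']
```

The second call shows the ambiguity: an eight-digit string is structurally both a compact ISSN and a PMID, so the intended identifier type has to come from context rather than from the string alone.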