
An example of DOI extraction from a PDF file using pdftotext, grep and sed:

pdftotext -f 1 -l 1 my.pdf - 2>/dev/null | grep -o -a -m 1 -E '\b10\.[0-9]{3,}([.][0-9]+)*/([[:graph:]])+\b' | sed -e 's/[^A-Za-z0-9]*$//' -e 's/ﬁ/fi/g' -e 's/ﬂ/fl/g'

pdftotext -f 1 -l 1: converts only the first page, on the assumption that the DOI appears there;
grep: uses a simplification of Alix Axel's regexp, but still relies on \b to delimit the match (-o prints only the matched text, -a treats the input as text, -m 1 stops after the first matching line);
sed: some additional cleaning of the DOI:
1. Remove non-alphanumeric characters at the end of the DOI, in particular punctuation in the regexp sense, i.e. ][!"#$%&'()*+,./:;<=>?@^_`{|}~- and non-ASCII characters (I once had a Chinese ideogram as the last character);
2. Replace the ligatures (ﬁ and ﬂ) with plain "fi" and "fl": these generally come from the typesetting of the PDF and not from the DOI itself.
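To try the grep and sed stages without a PDF at hand, the pipeline can be split into two small shell functions. This is only a sketch: the function names extract_doi_from_text and extract_doi_from_pdf are made up for illustration, GNU grep is assumed (for -o and \b), and 10.1000/xyz123 is a placeholder DOI. The trailing-character class is written as [^A-Za-z0-9] so that a trailing [ would be stripped as well.

```sh
#!/bin/sh
# Hypothetical helpers wrapping the one-liner; names are illustrative only.

extract_doi_from_text() {
    # Reads text on stdin, prints the DOI candidate found on the first matching line,
    # then strips trailing non-alphanumerics and replaces the fi/fl ligatures.
    grep -o -a -m 1 -E '\b10\.[0-9]{3,}([.][0-9]+)*/([[:graph:]])+\b' |
        sed -e 's/[^A-Za-z0-9]*$//' -e 's/ﬁ/fi/g' -e 's/ﬂ/fl/g'
}

extract_doi_from_pdf() {
    # $1: path to the PDF; only the first page is converted to text.
    pdftotext -f 1 -l 1 "$1" - 2>/dev/null | extract_doi_from_text
}

# Text-only check with a placeholder DOI followed by a full stop:
printf 'See doi: 10.1000/xyz123.\n' | extract_doi_from_text
```

With a real file, the call would be extract_doi_from_pdf my.pdf. If no DOI-like string is found, nothing is printed.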
