An example of DOI extraction from a PDF file using pdftotext, grep and sed:
pdftotext -f 1 -l 1 my.pdf - 2>/dev/null | grep -o -a -m 1 -E '\b10\.[0-9]{3,}([.][0-9]+)*/([[:graph:]])+\b' | sed -e 's/[^A-Za-z0-9]*$//' -e 's/ﬁ/fi/g' -e 's/ﬂ/fl/g'
- pdftotext -f 1 -l 1: assumes that the DOI is present on the first page of the PDF;
- grep: uses a simplified version of Alix Axel's regexp, but still relies on \b (word boundaries) to delimit the match;
- sed: some additional cleaning of the DOI:
  - Remove non-alphanumeric characters at the end of the DOI, in particular punctuation in the regexp sense, i.e. ][!"#$%&'()*+,./:;<=>?@^_`{|}~- and non-ASCII characters (I once had a Chinese ideogram as the last character);
  - Replace the ligatures (ﬁ and ﬂ): these generally come from the typesetting of the PDF and not from the DOI itself.
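
For reuse, the same pipeline can be split into small shell functions. This is only a sketch: the function names (find_doi, clean_doi, pdf_doi) are made up for illustration, and it assumes GNU grep (for \b with -E) and a sed that handles the UTF-8 ligatures literally. Separating the text stage from pdftotext also makes it possible to try the regexp on plain text first.

```shell
# Hypothetical helper: read text on stdin and print the DOI-like
# string(s) found on the first matching line (same regexp as above).
find_doi() {
    grep -o -a -m 1 -E '\b10\.[0-9]{3,}([.][0-9]+)*/([[:graph:]])+\b'
}

# Hypothetical helper: strip trailing non-alphanumeric characters,
# then expand the "fi" and "fl" ligatures left over from the PDF.
clean_doi() {
    sed -e 's/[^A-Za-z0-9]*$//' -e 's/ﬁ/fi/g' -e 's/ﬂ/fl/g'
}

# Hypothetical wrapper: look for a DOI on the first page of the PDF
# given as $1 (still assumes the DOI is printed on that page).
pdf_doi() {
    pdftotext -f 1 -l 1 "$1" - 2>/dev/null | find_doi | clean_doi
}

# Quick check on plain text, no PDF needed:
printf 'See https://doi.org/10.1000/xyz123.\n' | find_doi | clean_doi
# prints 10.1000/xyz123
```

With these defined, `pdf_doi my.pdf` behaves like the one-liner. Note that -m 1 stops after the first matching line, not the first match, so a line carrying two DOIs still yields two results.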