More fun publisher surveillance: Elsevier embeds a hash in the PDF metadata that is *unique for each time a PDF is downloaded*. This is a diff between the metadata from two downloads of the same paper. Combined with access timestamps, they can uniquely identify the source of any shared PDF.
@SchmiegSophie sure, I'll extract and post a bunch. also agree this doesn't look random, and the whitespace that's present in the original strings gets squashed in the JSON version in the screenshot. one sec
@json_dirs @SchmiegSophie elsevier’s open access journals have this too so you can obtain an infinite amount of samples that way. just confirmed the ID changes on every download for them too
@json_dirs @SchmiegSophie oh, looks like those IDs contain periods too in addition to dashes and underscores, they’re self-closing XMP tags. exiftool doesn’t seem to be a good tool for this kind of non-standard tag
@json_dirs @SchmiegSophie 1) there are no actual spaces, exiftool inserts them at lowercase-uppercase borders 2) exiftool strips those periods in the tag so looks like `exiftool -b -xmp PDFFILE | grep -oP '<[\w.-]{40,}/>'` is the only way to get the intact tag out
@json_dirs @SchmiegSophie 3) this is bad news because now there are three non-alphanumeric characters so it may not be base64
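For anyone who wants to pull the tags out in Python rather than with grep, here is a minimal sketch of the same extraction. It assumes exiftool is on the PATH and reuses the `[\w.-]{40,}` pattern from the command above; the function names `read_xmp` and `extract_id_tags` are made up for illustration and are not from the gist.

```python
import re
import subprocess

# Same character class as the grep pattern: word characters, periods and
# dashes, at least 40 characters long, inside a self-closing tag.
ID_TAG = re.compile(r"<([\w.-]{40,})/>", re.ASCII)


def read_xmp(pdf_path):
    """Return the raw XMP packet using `exiftool -b -xmp` (hypothetical helper)."""
    result = subprocess.run(
        ["exiftool", "-b", "-xmp", pdf_path],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def extract_id_tags(xmp):
    """Return the names of long self-closing tags, without '<' and '/>'."""
    return ID_TAG.findall(xmp)
```

Calling `extract_id_tags(read_xmp("paper.pdf"))` on a downloaded file (the filename is a placeholder) should give back the intact tag with its periods, since the regex runs on the raw XMP rather than on exiftool's parsed tag names.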
@json_dirs @SchmiegSophie though the "NN" and "Tma" parts are now perfectly aligning in my samples. downloaded 7 different versions of one OA article using three different browsers, one being tor browser; characters {16, 23..24, 29, 36, 53..61, 68..70} are aligning (23..24: "NN", 68..70: "Tma")
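The column comparison described above can be sketched like this: given several extracted IDs, list the positions where every sample has the same character, then collapse them into ranges. This is only an illustration, not the code from the gist. The function names are hypothetical, and positions are counted from 0 here, which may or may not match the numbering used in the tweet.

```python
def aligned_columns(samples):
    """Return 0-based positions where all samples share the same character."""
    if not samples:
        return []
    # Only compare up to the shortest sample so indexing never fails.
    width = min(len(s) for s in samples)
    return [i for i in range(width) if len({s[i] for s in samples}) == 1]


def to_ranges(positions):
    """Collapse sorted positions into inclusive (start, end) runs."""
    ranges = []
    for pos in positions:
        if ranges and pos == ranges[-1][1] + 1:
            # Extend the current run by one position.
            ranges[-1] = (ranges[-1][0], pos)
        else:
            ranges.append((pos, pos))
    return ranges
```

With more downloads of the same article, columns that only match by chance should drop out, leaving the positions that are genuinely fixed.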
@horsemankukka @SchmiegSophie yep seeing the same thing. updated the gist with the samples and some python code to extract, mine i think still needs a lil regex tweaking but definitely more pattern to be found now. Tma appears to be a suffix. also seeing columns of "lt"/"lw", "o9e" and "G"
@horsemankukka @SchmiegSophie sorry my eyes are glazing over, that's almost certainly what you were flagging with your column numbers :p. will return tomorrow.
@json_dirs @horsemankukka @SchmiegSophie fwiw, may be helpful: