#!/usr/bin/env bash
# --------------------------------------------------------------------
# Recursively find PDFs in the directory given as the first argument,
# otherwise search the current directory.
# Use exiftool and qpdf (both must be installed and locatable on $PATH)
# to strip all top-level metadata from PDFs.
#
# Note - This only removes file-level metadata, not any metadata
# in embedded images, etc.
#
# Code is provided as-is, I take no responsibility for its use,
# and I make no guarantee that this code works
# or makes your PDFs "safe," whatever that means to you.
#
# You may need to enable execution of this script before using,
# eg. chmod +x clean_pdf.sh
#
# example:
# clean current directory:
# >>> ./clean_pdf.sh
#
# clean specific directory:
# >>> ./clean_pdf.sh some/other/directory
# --------------------------------------------------------------------

# Color codes so that warnings/errors stick out
GREEN="\e[32m"
RED="\e[31m"
CLEAR="\e[0m"

# loop through all PDFs in first argument ($1),
# or use '.' (this directory) if not given
DIR="${1:-.}"
echo "Cleaning PDFs in directory $DIR"

# use find to locate files, pipe to while read to get the
# whole line instead of space delimited
# (IFS= keeps leading/trailing whitespace in file names intact)
# Note -- this will find pdfs recursively!!
find "$DIR" -type f -name "*.pdf" | while IFS= read -r i
do
    # output file as original filename with suffix _clean.pdf
    TMP="${i%.*}_clean.pdf"

    # remove the temporary file if it already exists
    if [ -f "$TMP" ]; then
        rm "$TMP"
    fi

    exiftool -q -q -all:all= "$i" -o "$TMP"
    qpdf --linearize --deterministic-id --replace-input "$TMP"
    # pass file names as printf arguments so % or \ in a name is printed literally
    printf "${GREEN}Processed ${RED}%s ${CLEAR}as ${GREEN}%s${CLEAR}\n" "$i" "$TMP"
done
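The header says exiftool and qpdf must both be locatable on $PATH, but the script does not check for them. A minimal sketch of such a check follows; the function name require_cmds is hypothetical and not part of the original gist.

```shell
# require_cmds NAME...: hypothetical helper, not in the original script.
# Returns non-zero and names the first tool that is not found on $PATH.
require_cmds() {
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "missing required command: $cmd" >&2
            return 1
        fi
    done
}

# near the top of clean_pdf.sh one might then write:
#   require_cmds exiftool qpdf || exit 1
```

Failing early like this avoids a long run of "command not found" messages, one per PDF.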
@muddynat you could probably just do something like the following one-liner for this:
for f in ./*.pdf; do exiftool -q -q -all:all= "$i" && qpdf --linearize --replace-input; done
that^^ would work, just need to add "$i" to the qpdf part, i believe. most of this script is just to add comments and tell the person running it what's going on. I have never gotten the hang of writing arguments for shell scripts, but it would be nice to have a --suffix flag (that you could just give "").
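A rough sketch of what a --suffix flag could look like. Everything here is an assumption: the flag itself, the parse_args and output_name helpers, and the "_clean" default, which is only meant to mirror the script's current naming.

```shell
# Hypothetical argument parsing for clean_pdf.sh; not part of the original gist.
SUFFIX="_clean"   # default mirrors the script's current _clean.pdf naming
DIR="."           # default: current directory

parse_args() {
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --suffix)
                # the next word is the suffix; an empty string is allowed
                if [ "$#" -lt 2 ]; then
                    echo "--suffix needs a value" >&2
                    return 1
                fi
                SUFFIX="$2"
                shift 2
                ;;
            --suffix=*)
                SUFFIX="${1#--suffix=}"
                shift
                ;;
            *)
                # anything else is treated as the directory to clean
                DIR="$1"
                shift
                ;;
        esac
    done
}

# output_name FILE: build the output path for one input file
output_name() {
    printf '%s%s.pdf\n' "${1%.*}" "$SUFFIX"
}
```

In the script this would be called as parse_args "$@" and the TMP line would become TMP="$(output_name "$i")". One caveat: with an empty suffix the output name equals the input name, so the script's "remove the temporary file" step would delete the input itself; that case would need separate in-place handling.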
@RooneyMcNibNug & @sneakers-the-rat thanks! I don't know much about bash scripting - where would this "$i" go in the qpdf part?
@muddynat that's variable substitution: the shell replaces "$i" with the current value of whatever you are iterating over in the for or while loop. taking a second look at the code in the above comment, it also needs its variable renamed (the loop sets f but the body uses "$i") and it should use the while pattern, so it would be like this:
find "$DIR" -type f -name "*.pdf" | while IFS= read -r i
do
    exiftool -q -q -all:all= "$i"
    qpdf --linearize --replace-input "$i"
done
@sneakers-the-rat just saw this user on reddit recommending the use of the --deterministic-id option from QPDF to achieve cleaner results: https://reddit.com/r/Piracy/comments/12ai3so/how_to_remove_all_metadata_identifiers_when/. From what I understood, this way each cleaned-up file generated from a given source PDF would have the exact same ID.
End result in line 52 would be simply
qpdf --linearize --deterministic-id --replace-input "$TMP"
the documentation related to --deterministic-id on QPDF here and a thread explaining it more clearly. the same article from Elsvr downloaded from multiple institutional accesses will generate byte-for-byte identical outputs from ExifTool+QPDF when using this method.
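To check the byte-for-byte claim on your own files, a small sketch using cmp is below. The helper name same_bytes and the file names in the usage comment are made up for illustration.

```shell
# same_bytes FILE_A FILE_B: hypothetical helper.
# Succeeds only if both files are byte-for-byte identical.
same_bytes() {
    cmp -s "$1" "$2"
}

# usage, after cleaning two downloads of the same article:
#   same_bytes copy1_clean.pdf copy2_clean.pdf && echo "identical"
```

cmp -s prints nothing and only sets the exit status, so it works well inside an if or before &&.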
@bigfakelaugh yes good addition, edited!
How would one change this to replace the existing file, rather than creating a new one with the _clean.pdf suffix?
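One possible approach, as a sketch rather than a tested replacement for the script: drop the -o target and let exiftool rewrite the file itself with its -overwrite_original option, then let qpdf's --replace-input rewrite that same file. The function name clean_in_place is hypothetical. No backup is kept, so try it on copies first.

```shell
# clean_in_place FILE: hypothetical helper, not part of the original gist.
# Strips metadata from FILE itself instead of writing a _clean copy.
clean_in_place() {
    # -overwrite_original stops exiftool from leaving a FILE_original backup;
    # qpdf only runs if exiftool succeeded, and --replace-input makes it
    # write its output back over the input file
    exiftool -q -q -overwrite_original -all:all= "$1" &&
        qpdf --linearize --deterministic-id --replace-input "$1"
}

# usage inside the script's loop, replacing the TMP handling:
#   find "$DIR" -type f -name "*.pdf" | while IFS= read -r i
#   do
#       clean_in_place "$i"
#   done
```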