Last active
February 12, 2025 14:46
A JavaScript for Automation (JXA) script for automating SiteSucker for Mac with opinionated defaults.
/**
CHANGELOG
2-10-25 - Ask about whether to use webviews.
*/
/**
 * This script is a JavaScript for Automation (JXA) script that uses the SiteSucker app to download a website.
 *
 * Background: JXA is dogshit. I know there is no need to hold back on the cursing, because if you are using JXA,
 * you have already explored every expletive available. The official JXA documentation is Apple's changelog, which
 * itself is simply a log describing exactly when they themselves completely gave up on this bullshit. Also,
 * AppleScript is dogshit. And that is the MUCH better documented of the two automation languages.
 *
 * So here you are trying to use JXA for a program that doesn't even technically support it, trying to
 * translate an already godforsaken AppleScript descriptor into an esoteric and poorly documented,
 * deceptively evil JavaScript implementation. This insidious spawn of satan has the veneer of a friendly
 * scripting language (ECMA compliant, even!) but instead of any sort of sane standard library based on
 * idiomatic web traditions or even an event loop, you are getting the grinding four-on-the-floor, piss-poor
 * tone-deaf implementation of the Objective-C runtime. Welcome to hell.
 *
 * Needless to say, this script is the result of over 3 months of intense pain and suffering. I hope you enjoy it.
 *
 * Oh what does this script do? It automates SiteSucker with some sane defaults. Your own frustrations with this
 * beautiful program with soul-crushing configuration and defaults have driven you to look for automated solutions.
 * Your desire for pain and despair has further guided you to consider JXA for this. It's also your first time
 * using JXA (BECAUSE THERE IS NEVER A SECOND TIME). My god I hope no one is unfortunate enough to search for
 * and find this script. The GitHub Copilot auto-complete suggestion for this comment includes all of
 * Dante's Inferno I kid you not.
 *
 * So how to use this script. For fuck's sake you aren't just looking for this solution because you are trying
 * to copy-paste something. If this is your first time seeing "app.includeStandardAdditions" my god, turn away
 * and run now. But more likely you are knee-deep in the shit right now and just need a lifeline. I don't need to
 * tell you what color underwear is best to shit yourself in. Enjoy.
 *
 * @param {string|string[]} input - The URL to download; Automator passes its input as an array.
 * @param {any} parameters - The parameters for the `run` function (unused here).
 * @returns {void}
 */
function run(input, parameters) {
  // --- BEGIN app setup ---
  // this section needed before function declarations
  ObjC.import("Foundation")
  ObjC.import("AppKit")
  const app = Application("SiteSucker")
  app.includeStandardAdditions = true
  app.strictPropertyScope = false
  app.strictCommandScope = false
  app.strictParameterType = false
  // --- END app setup ---
  // --- BEGIN pre IIFE section ---
  /**
   * Edit a limited subset of SiteSucker settings here.
   * All keys are MANDATORY here. If you don't know what to do, leave them as is.
   *
   * CAVEAT ONE
   * One thing to note: "identity" may become outdated when SiteSucker updates
   * (as of writing, this version is 5.4.5).
   * You have to use an EXACT string from the "Settings->Request->Identity" dropdown.
   * This string includes a special dash character from hell which is NOT two dash characters i.e. --
   * but is a unicode piece of shit which I have pasted here: —
   *
   * If this is too hard for you, just use the evergreen string "SiteSucker"
   * like a chump and out yourself to every site you scrape.
   *
   * Alternatively, create your own "Identity" in "SiteSucker-Preferences->Identities"
   * and use the "Name" string you create there, but if you are that advanced,
   * then use that big brain of yours to figure out how to copy/paste that
   * unicode character (here it is again! right here: —) because it is a lot easier.
   *
   * CAVEAT TWO
   * I can almost guarantee you that you cannot come up with a better excludeRegexString
   * than I have here. Your regex skills pale in comparison to mine. I'll have you know
   * I graduated top of my class in the Navy Seals, I am trained in gorilla warfare and
   * I'm the top coder in the entire US armed forces, have been involved in numerous
   * secret raids, and I have over 300 confirmed kills parsing HTML alone.
   * Furthermore it is likely you cannot grasp (nor should you want to understand) the
   * intricacies of the escaping idiosyncrasies I have successfully navigated for you here.
   * But if you are dying for a world of pain and want to jackknife this volatile 18-wheeler
   * past the off-ramp I have constructed for you, let me give you some hints:
   * 1) The AppleScript implementation of the settings object is busted. It says an array of objects
   *    in the spec, but nope, when they say array of objects, they mean just ONE object literal.
   *    I just saved you about 20 hours of pain. Therefore, you must pile on all your regex into
   *    one string, which is why I only let you set a single excludeRegexString string here.
   * 2) The regex string is a string, it can't be a regex. To minimize escape character jank, I have
   *    used a JS template string. However, a template string silently drops the backslash from any
   *    escape it does not recognize (\. becomes a bare dot), meaning that every backslash you want
   *    in your regex has to be escaped itself, i.e. doubled (\\. and \\?).
   * 3) Finally, the ending dollar sign is escaped (ONCE) because this of course is a template string.
   *    Strictly, only a dollar sign followed by an opening brace starts a substitution, so the escape
   *    is just insurance; a backtick is the other character that needs escaping. Or you could leave
   *    the end of line thing off, but then you are probably someone who just leaves things
   *    unfinished and has no hope of succeeding anyway.
   * */
  const scriptSetup = {
    folder: "",
    urlToSuck: input,
    // Every backslash is doubled so it survives the template string, including the one before "?action".
    excludeRegexString: `^https?:\\/\\/(?:(?:(?:[a-z1-9-_]{1,10}\\.){0,2}(?:cpanel|hp|github|discogs|bandcamp|angelfire|webcitation|wordpress|jetpack|facebook|googletagmanager|googleapis|google|list-manage|linkedin|instagram|mixcloud|x|twitter|gstatic|thisamericanlife|nytimes|imdb)\\.(?:com|net|org).*?)|(?:.*?\\.php\\?action=(?:unread|stats|recent|profile|search|login|help|register|printpage|print).*?)|(?:.*?(?:login|logout|auth|authorize)\\/?(?:\\.php|\\.asp|\\.html).*?))\$`,
    customSettings: {
      downloadAttempts: 1,
      downloadDelay: 0.5,
      downloadTimeout: 15,
      saveDelay: 5.0,
      treatAmbiguousURLsAsFolders: true,
      //identity: "SiteSucker",
      identity: "Firefox 126.0 — Macintosh",
      ignoreRelEqualsNofollow: true,
      ignoreRobotExclusions: true
    }
  }
  /*
   * getSettingsObject - returns the default settings object for SiteSucker
   * @returns {object} - the settings object
   * */
  function getSettingsObject() {
    // Fallback exclude pattern. Backslashes are doubled so they survive the template string.
    const excludeRegexString = `^https?:\\/\\/(?:(?:(?:[a-z1-9-_]{1,20}\\.){0,2}(?:cpanel|hp|github|discogs|bandcamp|angelfire|webcitation|wordpress|jetpack|globalprivacycontrol|facebook|googletagmanager|googleapis|google|list-manage|linkedin|adstransparency|instagram|withgoogle|mixcloud|digitaladvertisingalliance|x|twitter|gstatic|thisamericanlife|nytimes|imdb)\\.(?:com|google|net|org|eu|co\\.uk|).*?)|(?:.*?\\.php\\?action=(?:unread|stats|recent|profile|search|login|help|register|printpage|print).*?)|(?:.*?(?:login|logout|auth|authorize)\\/?(?:\\.php|\\.asp|\\.html).*?))$`
    return {
      alwaysDownloadHtmlAndCss: false,
      askForDestination: false,
      checkAllLinks: false,
      connections: 30,
      createPDF: false,
      customDataAttributes: [],
      customTypes: [],
      destination: {},
      downloadAttempts: 2,
      downloadDelay: 2,
      downloadErrorPages: false,
      downloadLinksInPDFs: false,
      downloadTimeout: 60,
      downloadUsingWebViews: true,
      fileModification: "localize",
      fileReplacement: "never replace",
      fileTypesOption: "allow all file types",
      filterArchives: false,
      filterAudioFiles: false,
      filterCustomTypes: false,
      filterImages: false,
      htmlTypes: [],
      identity: "Firefox 126.0 — Macintosh",
      ignoreFilenameInHeaders: false,
      ignoreRelEqualsNofollow: true,
      ignoreRobotExclusions: true,
      includeSupportingFiles: true,
      javascript: "",
      limitFiles: false,
      limitLevels: false,
      limitMaxFileSize: false,
      limitMinFileSize: false,
      limitMinImageSize: false,
      logErrors: true,
      logFinalStatus: false,
      logHistory: true,
      loginDialog: "display when necessary",
      logMediaTypes: false,
      logWarnings: false,
      maxFiles: 0,
      maxFileSize: 1000,
      maxLevels: 4,
      // "mediaTypeReplacement": [],
      minFileSize: 0,
      minImageSize: 25,
      //"patterns": [],
      postanalysisScript: "",
      preanalysisScript: "",
      replaceSpecialCharactersWithUnderscore: true,
      saveDelay: 3,
      scanCommentsForUrls: false,
      textEncoding: "Default",
      treatAmbiguousURLsAsFolders: false,
      urlConstraint: "host",
      urlsToExclude: {
        regex: true,
        urlOrPattern: excludeRegexString
      },
      // "urlsToInclude": [],
      webViewSize: {
        height: 1080,
        width: 1920
      }
    }
  }
  // --- END pre IIFE section ---
  // --- BEGIN giant IIFE ---
  (({ folder, urlToSuck, customSettings, excludeRegexString }) => {
    debugger
    // -- BEGIN function declarations ---
    /*
     * slugify - slugifies a string for use in a URL or file name
     * @param {string} str - the string to be slugified
     * @returns {string} - the slugified string
     * @throws {Error} - if the string is undefined or null
     * */
    function slugify(str) {
      if (!str) {
        throw new Error("String to be slugified is undefined or null")
      }
      str = str.replace(/^\s+|\s+$/g, "")
      str = str.toLowerCase()
      str = str
        // the hyphen goes last in the class so it is a literal, not a space-to-dot range
        .replace(/[^a-z0-9 .-]/g, "")
        .replace(/\s+/g, "-")
        .replace(/-+/g, "-")
      return str
    }
    /*
     * parseURL - parses a URL into its parts using NSURL
     * @param {string} urlString - the URL to be parsed
     * @returns {URLParts} - the URL parts
     * @throws {Error} - if the URL is invalid
     *
     * @typedef {object} URLParts
     * @property {string} scheme - the URL scheme
     * @property {string} host - the URL host
     * @property {string} port - the URL port
     * @property {string} path - the URL path
     * @property {string} query - the URL query
     * @property {string} fragment - the URL fragment
     *
     * */
    function parseURL(urlString) {
      const nsURL = $.NSURL.URLWithString(urlString)
      // the bridge can hand back a wrapper around nil, which is still truthy in JS
      if (!nsURL || nsURL.isNil()) {
        throw new Error(`Invalid URL: ${urlString}`)
      }
      return {
        scheme: nsURL.scheme ? ObjC.unwrap(nsURL.scheme) : null,
        host: nsURL.host ? ObjC.unwrap(nsURL.host) : null,
        port: nsURL.port ? ObjC.unwrap(nsURL.port) : null,
        path: nsURL.path ? ObjC.unwrap(nsURL.path) : null,
        query: nsURL.query ? ObjC.unwrap(nsURL.query) : null,
        fragment: nsURL.fragment ? ObjC.unwrap(nsURL.fragment) : null
      }
    }
    /*
     * createFolder - creates a folder at the specified path
     * what is this $() bullshit? Don't worry about it. Just thank god
     * arthurdapaz wrote this code and you don't have to.
     *
     * @param {string} path - the path to the folder to be created
     * @param {number} createIntermediatesFlag - flag to create intermediate folders
     * @returns {boolean} - true if the folder was created
     * @throws {Error} - if the folder could not be created
     * @see {@link https://gist.github.com/arthurdapaz/cd3dca57ed9412b01a41e348e1f608c8}
     * */
    function createFolder(path, createIntermediatesFlag = 1) {
      let p = $(path).stringByStandardizingPath
      let i = createIntermediatesFlag ? 1 : 0
      let a = $()
      let e = $()
      let r = $.NSFileManager.defaultManager.createDirectoryAtPathWithIntermediateDirectoriesAttributesError(
        p, i, a, e
      )
      if (!e.isNil()) {
        let s1 = "mkdir(): "
        let s2 = e.localizedDescription.js
        let s3 = e.localizedRecoverySuggestion.js || ""
        // throw a real Error (as the doc comment promises) instead of a bare string
        throw new Error(s1 + s2 + s3)
      }
      return r
    }
    /*
     * createExternalPattern - creates a regular expression pattern to match external URLs
     * @param {string} host - the host to be used in the pattern
     * @returns {string} - the regular expression pattern
     * */
    function createExternalPattern(host) {
      let pat = ""
      const subs = host.toLowerCase().split(".")
      if (subs[0] === "www") {
        subs.shift()
      }
      // backslashes are doubled so that \d and \. actually reach the regex engine
      pat += "(?:www?\\d?\\.)?"
      pat += subs.join("\\.")
      const fullReg = `^(?!${pat})(.*)`
      return fullReg
    }
    /*
     * ensureSomeScheme - ensures that a URL has https:// or http://. If it doesn't, it adds https://
     * @param {string} url - the URL to be checked
     * @returns {string} - the URL with https:// if it doesn't have it
     * */
    function ensureSomeScheme(url) {
      if (!/^https?:\/\//i.test(url)) {
        return 'https://' + url
      }
      return url
    }
    /*
     * buildSettings - builds the settings object for SiteSucker based on the URL and custom settings
     * @param {URLParts} parsedURL - the parsed URL
     * @param {string} destination - the destination folder
     * @param {object} customSettings - the custom settings object
     * @param {string} [excludeRegexString] - exclude pattern; when empty, the default from getSettingsObject is kept
     * @returns {object} - the settings object
     * */
    function buildSettings(parsedURL, destination, customSettings, excludeRegexString = '') {
      const fullReg = createExternalPattern(parsedURL.host)
      let output = getSettingsObject()
      output.pathsToReplace = {
        filePathPattern: fullReg,
        template: "__external/$1"
      }
      output.destination = destination
      if (excludeRegexString) {
        output.urlsToExclude = {
          regex: true,
          urlOrPattern: excludeRegexString
        }
      }
      // keys copied straight from customSettings, overriding the defaults
      const direct = [
        "downloadAttempts",
        "downloadDelay",
        "downloadTimeout",
        "saveDelay",
        "treatAmbiguousURLsAsFolders",
        "identity",
        "ignoreRelEqualsNofollow",
        "ignoreRobotExclusions"
      ]
      for (const f of direct) {
        output[f] = customSettings[f]
      }
      return output
    }
    /*
     * isEmpty - checks if a value is empty or not
     * @param {any} value - the value to be checked
     * @returns {boolean} - true if the value is empty
     * */
    function isEmpty(value) {
      if (value === "" || typeof value === "undefined" || value === null || value === false) {
        return true
      }
      if (Array.isArray(value) && value.length < 1) {
        return true
      }
      for (let prop in value) {
        if (value.hasOwnProperty(prop)) return false
      }
      return true
    }
    /*
     * askUrl - asks the user for a URL to download if one is not provided
     * @returns {object} - the dialog result
     * */
    function askUrl() {
      let urldef = app.theClipboard() || ''
      const result = app.displayDialog("Please enter the URL you wish to download.", {
        defaultAnswer: urldef,
        buttons: ['Ok', 'Cancel'],
        defaultButton: 'Ok',
        cancelButton: 'Cancel'
      })
      return result
    }
    // --- END function declarations ---
    // --- BEGIN main ---
    // ignore the following lines, because guess what, they don't work.
    //const prefs = Prefs.make({ allowed: ['baseFolder'], appIdentifier: "ltd.janky.ssalt" })
    //const defaultfolder = prefs.baseFolder
    let defaultfolder = ""
    let url
    if (Array.isArray(urlToSuck) && !isEmpty(urlToSuck)) {
      // array comes from automator input
      url = urlToSuck[0]
    }
    else if (!Array.isArray(urlToSuck) && !isEmpty(urlToSuck)) {
      // urlToSuck is passed as a string
      url = urlToSuck
    }
    else {
      // nothing usable was passed in, so ask
      const url2 = askUrl()
      url = url2.textReturned
    }
    url = ensureSomeScheme(url)
    const parsedURL = parseURL(url)
    if (isEmpty(folder)) {
      folder = app.chooseFolder({ withPrompt: "Choose the root folder where you want all archives to go. This script will create subfolders for each site automatically, so choose the same folder each time.", default: isEmpty(defaultfolder) ? null : defaultfolder })
    }
    const webview = app.displayDialog("Do you want to use web views (slow) or headless (fast)? Webviews allow the page to run for a few seconds in the browser before capturing the resulting html.", {
      buttons: ['Static', 'Webviews', "Cancel"],
      defaultButton: 'Static',
      cancelButton: 'Cancel'
    })
    // save basefolder ... jk doesnt work
    //prefs.baseFolder = folder.toString()
    const slugName = `${slugify(parsedURL.host)}`
    const newFolderPath = `${folder.toString()}/${slugName}`
    createFolder(newFolderPath)
    // pass excludeRegexString along, otherwise the one set in scriptSetup is silently ignored
    const sets = buildSettings(
      parsedURL,
      newFolderPath.toString(),
      customSettings,
      excludeRegexString
    )
    const doc = app.Document({ name: parsedURL.host }).make({ new: 'Document' })
    // const resetable = doc.getProperty('settings')
    // here's the million dollar line
    for (const key in sets) {
      doc.getProperty('settings').setProperty(key, sets[key])
    }
    // displayDialog returns an object; the clicked button is in buttonReturned
    doc.getProperty('settings').setProperty("downloadUsingWebViews", webview.buttonReturned === "Webviews");
    app.download(url, { in: doc })
    // --- END main ---
  })(scriptSetup)
  // --- END giant IIFE ---
}
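
The escaping hints in the settings comment (double your backslashes, mind the dollar sign) are easier to trust once you watch them run. What follows is a minimal sketch in plain JavaScript with no JXA calls, so it should also run under Node. `buildHostPattern` and the sample hosts are made-up names for illustration and are not part of the script; the strings tested against the pattern are guesses at what a file path might look like, not something SiteSucker documents.

```javascript
// Sketch only: plain JavaScript, no JXA or SiteSucker APIs involved.

// In an ordinary string literal an unrecognized escape such as "\d" or "\."
// silently loses its backslash, so the regex engine never sees it.
const lossy = "www\d?\."
const kept = "www\\d?\\."
console.log(lossy) // wwwd?.
console.log(kept)  // www\d?\.

// Template literals drop backslashes the same way. Only "${" starts a
// substitution, so a lone "$" right before the closing backtick is safe.
const tail = `\\.php\\?action=print$`
console.log(tail) // \.php\?action=print$

// String.raw keeps every backslash exactly as typed, no doubling needed.
const raw = String.raw`\.php\?action=print$`
console.log(raw === tail) // true

// Hypothetical helper: builds a "not this host" pattern with the
// backslashes doubled so they reach the regex engine intact.
function buildHostPattern(host) {
  const labels = host.toLowerCase().split(".")
  if (labels[0] === "www") labels.shift()
  return `^(?!(?:www?\\d?\\.)?${labels.join("\\.")})(.*)`
}

const pattern = buildHostPattern("www.example.com")
console.log(pattern) // ^(?!(?:www?\d?\.)?example\.com)(.*)
console.log(new RegExp(pattern).test("cdn.other.net/a.png"))    // true
console.log(new RegExp(pattern).test("example.com/index.html")) // false
```

If doubling backslashes by hand gets old, `String.raw` is the off-ramp: the pattern can then be typed exactly as the regex engine should see it.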