Skip to content

Instantly share code, notes, and snippets.

@jgusta
Last active February 12, 2025 14:46
Show Gist options
  • Save jgusta/0c819cc0f5b680df0e50067f4db77595 to your computer and use it in GitHub Desktop.
Save jgusta/0c819cc0f5b680df0e50067f4db77595 to your computer and use it in GitHub Desktop.
A Javascript for Automation (JXA) script for automating SiteSucker for Mac with opinionated defaults.
/**
CHANGELOG
2-10-25 - Ask about whether to use webviews.
*/
/**
* This script is a JavaScript for Automation (JXA) script that uses the SiteSucker app to download a website.
*
* Background: JXA is dogshit. I know there is no need to hold back on the cursing, because if you are using JXA,
* you have already explored every explitive available. The official JXA documentation is apple's changelog which
* itself is simply a log describing exactly when they themselves completely gave up on this bullshit. Also,
* AppleScript is dogshit. And that is the MUCH better documented of the two automation languages.
*
* So here you are trying to use JXA for a program that doesn't even technically support it, trying to
* translate an already god forsaken applescript descriptor into an esoteric and poorly documented,
* deceptively evil javascript implementation. This insideous spawn of satan has the veneer of a friendly
* scripting language (ECMA compliant, even!) but instead of any sort of sane standard library based on
* idiomatic web traditions or even an event loop, you are getting the grinding four-on-the-floor, piss-poor
* tone-deaf implementation of the Objective-C runtime. Welcome to hell.
*
* Needless to say, this script is the result of over 3 months of intense pain and suffering. I hope you enjoy it.
*
* Oh what does this script do? It automates SiteSucker with some sane defaults. Your own frustrations with this
* beautiful program with soul-crushing configuration and defaults has driven you to look for automated solutions.
* Your desire for pain and despair has further guided you to consider JXA for this. Its also your first time
* using JXA (BECAUSE THERE IS NEVER A SECOND TIME). My god I hope no one is unfortunate nough to search for
* and find this script. The github copilot auto-complete suggestion for this comment includes all of
* Dante's inferno I kid you not.
*
* So how to use this script. For fuck's sake you aren't just looking for this solution because you are trying
* to copy-paste something. If this is your first time seeing "app.includeStandardAdditions" my god, turn away
* and run now. But more likely You are knee deep in the shit right now and just need a lifeline. I don't need to
* tell you what color underwear is best to shit yourself in. Enjoy.
*
* @param {string} input - The input parameter for the `run` function.
* @param {any} parameters - The parameters for the `run` function.
* @returns {void}
*/
function run(input, parameters) {
// --- BEGIN app setup ---
// this section needed before function declarations
ObjC.import("Foundation")
ObjC.import("AppKit")
const app = Application("SiteSucker")
app.includeStandardAdditions = true
app.strictPropertyScope = false
app.strictCommandScope = false
app.strictParameterType = false
// --- END app setup ---
// --- BEGIN pre IIFE section ---
/**
* Edit a limited subset of sitesucker settings here.
* All keys are MANDATORY here. If you don't know what to do, leave them as is.
*
* CAVEAT ONE
* One thing to note: "identity" may become outdated when SiteSucker updates
* (as of writing this version is 5.4.5).
* You have to use an EXACT string from the "Settings->Request->Identity" dropdown.
* This string includes a special dash character from hell which is NOT two dash characters i.e. --
* but is a unicode piece of shit which I have pasted here: —
*
* If this is too hard for you, just use the evergreen string "SiteSucker"
* like a chump and out yourself to every site you scrape.
*
* Alternatively, create your own "Identity" in "SiteSucker-Preferences->Identities"
* and use the "Name" string you create there, but if you are that advanced,
* then use that big brain of yours to figure out how to copy/paste that
* unicode character (here it is again! right here: —) because it is a lot easier.
*
* CAVEAT TWO
* I can almost guarantee you that you cannot come up with a better excludeRegexString
* than I have here. Your regex skills pale in comparison to mine. I'll have you know
* I graduated top of my class in the Navy Seals, I am trained in gorilla warfare and
* I'm the top coder in the entire US armed forces, have been involved in numerous
* secret raids, and I have over 300 confirmed kills parsing HTML alone.
* Furthermore it is likely you cannot grasp (nor should you want to understand) the
* intricacies of the escaping idiocyncasies I have successfully navigated for you here.
* But if you are dying for a world of pain and want to jackknife this volatile 18-wheeler
* past the off-ramp I have constructed for you, let me give you some hints:
* 1) The applescript implementation of the settings object is busted. It says an array objects
* in the spec, but nope, when they say array of objects, they mean just ONE object literal.
* I just saved you about 20 hours of pain. Therefore, you must pile on all your regex into
* one string, which is why I only let your set a single excludeRegexString string here.
* 2) The regex string is a string, it can't be a regex. To minimize escape character jank, I have
* used a js template string. However, this implementation will graciously allow you to
* arbitrarily escape any character with a backslash meaning that if you want a backslash in
* your regex, you need to escape it.
* 3) Finally, don't forget to escape the ending dollar sign (ONCE) because this of course is a
* template string and that is the only character you need to escape other than a backtick. Or
* you could leave the end of line thing off, but then you are probably someone who just
* leaves things unfinished anyway and have no hope of succeeding anyway.
* */
const scriptSetup = {
folder: "",
urlToSuck: input,
excludeRegexString: `^https?:\\/\\/(?:(?:(?:[a-z1-9-_]{1,10}\\.){0,2}(?:cpanel|hp|github|discogs|bandcamp|angelfire|webcitation|wordpress|jetpack|facebook|googletagmanager|googleapis|google|list-manage|linkedin|instagram|mixcloud|x|twitter|gstatic|thisamericanlife|nytimes|imdb)\\.(?:com|net|org).*?)|(?:.*?\\.php\?action=(?:unread|stats|recent|profile|search|login|help|register|printpage|print).*?)|(?:.*?(?:login|logout|auth|authorize)\\/?(?:\\.php|\\.asp|\\.html).*?))\$`,
customSettings: {
downloadAttempts: 1,
downloadDelay: 0.5,
downloadTimeout: 15,
saveDelay: 5.0,
treatAmbiguousURLsAsFolders: true,
//identity: "SiteSucker",
identity: "Firefox 126.0 — Macintosh",
ignoreRelEqualsNofollow: true,
ignoreRobotExclusions: true
}
}
/*
* getSettingsObject - returns the default settings object for SiteSucker
* @returns {object} - the settings object
* */
function getSettingsObject() {
const excludeRegexString = `^https?:\/\/(?:(?:(?:[a-z1-9-_]{1,20}\.){0,2}(?:cpanel|hp|github|discogs|bandcamp|angelfire|webcitation|wordpress|jetpack|globalprivacycontrol|facebook|googletagmanager|googleapis|google|list-manage|linkedin|adstransparency|instagram|withgoogle|mixcloud|digitaladvertisingalliance|x|twitter|gstatic|thisamericanlife|nytimes|imdb)\.(?:com|google|net|org|eu|co\.uk|).*?)|(?:.*?\.php?action=(?:unread|stats|recent|profile|search|login|help|register|printpage|print).*?)|(?:.*?(?:login|logout|auth|authorize)\/?(?:\.php|\.asp|\.html).*?))$`
return {
alwaysDownloadHtmlAndCss: false,
askForDestination: false,
checkAllLinks: false,
connections: 30,
createPDF: false,
customDataAttributes: [],
customTypes: [],
destination: {},
downloadAttempts: 2,
downloadDelay: 2,
downloadErrorPages: false,
downloadLinksInPDFs: false,
downloadTimeout: 60,
downloadUsingWebViews: true,
fileModification: "localize",
fileReplacement: "never replace",
fileTypesOption: "allow all file types",
filterArchives: false,
filterAudioFiles: false,
filterCustomTypes: false,
filterImages: false,
htmlTypes: [],
identity: "Firefox 126.0 — Macintosh",
ignoreFilenameInHeaders: false,
ignoreRelEqualsNofollow: true,
ignoreRobotExclusions: true,
includeSupportingFiles: true,
javascript: "",
limitFiles: false,
limitLevels: false,
limitMaxFileSize: false,
limitMinFileSize: false,
limitMinImageSize: false,
logErrors: true,
logFinalStatus: false,
logHistory: true,
loginDialog: "display when necessary",
logMediaTypes: false,
logWarnings: false,
maxFiles: 0,
maxFileSize: 1000,
maxLevels: 4,
// "mediaTypeReplacement": [],
minFileSize: 0,
minImageSize: 25,
//"patterns": [],
postanalysisScript: "",
preanalysisScript: "",
replaceSpecialCharactersWithUnderscore: true,
saveDelay: 3,
scanCommentsForUrls: false,
textEncoding: "Default",
treatAmbiguousURLsAsFolders: false,
urlConstraint: "host",
urlsToExclude: {
regex: true,
urlOrPattern: excludeRegexString
},
// "urlsToInclude": [],
webViewSize: {
height: 1080,
width: 1920
}
}
}
// --- END pre IIFE section ---
// --- BEGIN giant IIFE ---
(({ folder, urlToSuck, customSettings, excludeRegexString }) => {
debugger
// -- BEGIN function declarations ---
/*
* slugify - slugifies a string for use in a URL or file name
* @param {string} str - the string to be slugified
* @returns {string} - the slugified string
* @throws {Error} - if the string is undefined or null
* */
function slugify(str) {
if (!str) {
throw new Error("String to be slugified is undefined or null")
}
str = str.replace(/^\s+|\s+$/g, "")
str = str.toLowerCase()
str = str
.replace(/[^a-z0-9 -.]/g, "")
.replace(/\s+/g, "-")
.replace(/-+/g, "-")
return str
}
/*
* parseURL - parses a URL into its parts using NSURL
* @param {string} urlString - the URL to be parsed
* @returns {URLParts} - the URL parts
* @throws {Error} - if the URL is invalid
*
* @typedef {object} URLParts
* @property {string} scheme - the URL scheme
* @property {string} host - the URL host
* @property {string} port - the URL port
* @property {string} path - the URL path
* @property {string} query - the URL query
* @property {string} fragment - the URL fragment
*
* */
function parseURL(urlString) {
const nsURL = $.NSURL.URLWithString(urlString)
if (!nsURL) {
throw new Error(`Invalid URL: ${urlString}`)
}
return {
scheme: nsURL.scheme ? ObjC.unwrap(nsURL.scheme) : null,
host: nsURL.host ? ObjC.unwrap(nsURL.host) : null,
port: nsURL.port ? ObjC.unwrap(nsURL.port) : null,
path: nsURL.path ? ObjC.unwrap(nsURL.path) : null,
query: nsURL.query ? ObjC.unwrap(nsURL.query) : null,
fragment: nsURL.fragment ? ObjC.unwrap(nsURL.fragment) : null
}
}
/*
* createFolder - creates a folder at the specified path
* what is this $() bullshit? Don't worry about it. Just thank god
* arthurdapaz wrote this code and you don't have to.
*
* @param {string} path - the path to the folder to be created
* @param {number} createIntermediatesFlag - flag to create intermediate folders
* @returns {boolean} - true if the folder was created
* @throws {Error} - if the folder could not be created
* @see {@link https://gist.github.com/arthurdapaz/cd3dca57ed9412b01a41e348e1f608c8}
* */
function createFolder(path, createIntermediatesFlag = 1) {
let p = $(path).stringByStandardizingPath
let i = createIntermediatesFlag ? 1 : 0
let a = $()
let e = $()
let r = $.NSFileManager.defaultManager.createDirectoryAtPathWithIntermediateDirectoriesAttributesError(
p, i, a, e
)
if (!e.isNil()) {
let s1 = "mkdir(): "
let s2 = e.localizedDescription.js
let s3 = e.localizedRecoverySuggestion.js || ""
throw s1 + s2 + s3
}
return r
}
/*
* createExternalPattern - creates a regular expression pattern to match external URLs
* @param {string} host - the host to be used in the pattern
* @returns {string} - the regular expression pattern
* */
function createExternalPattern(host) {
let pat = ""
const subs = host.toLowerCase().split(".")
if (subs[0] === "www") {
subs.shift()
}
pat += "(?:www?\d?\.)?"
pat += subs.join("\.")
const fullReg = `^(?!${pat})(.*)`
return fullReg
}
/*
* ensureSomeScheme - ensures that a URL has https:// or http://. If it doesn't, it adds https://
* @param {string} url - the URL to be checked
* @returns {string} - the URL with https:// if it doesn't have it
* */
function ensureSomeScheme(url) {
if (!/^https?:\/\//i.test(url)) {
return 'https://' + url
}
return url
}
/*
* buildSettings - builds the settings object for SiteSucker based on the URL and custom settings
* @param {URLParts} parsedURL - the parsed URL
* @param {string} destination - the destination folder
* @param {object} customSettings - the custom settings object
* @returns {object} - the settings object
* */
function buildSettings(parsedURL, destination, customSettings, excludeRegexString = '') {
const fullReg = createExternalPattern(parsedURL.host)
let output = getSettingsObject()
output.pathsToReplace = {
filePathPattern: fullReg,
template: "__external/$1"
}
output.destination = destination
if (excludeRegexString) {
output.urlsToExclude = {
regex: true,
urlOrPattern: excludeRegexString
}
}
const direct = ["downloadAttempts",
"downloadDelay",
"downloadTimeout",
"saveDelay",
"treatAmbiguousURLsAsFolders", "identity", "ignoreRelEqualsNofollow", "ignoreRobotExclusions"]
for (const f of direct) {
output[f] = customSettings[f]
}
return output
}
/*
* isEmpty - checks if a value is empty or not
* @param {any} value - the value to be checked
* @returns {boolean} - true if the value is empty
* */
function isEmpty(value) {
if (value === "" || typeof value === "undefined" || value === null || value === false) {
return true
}
if (Array.isArray(value) && value.length < 1) {
return true
}
for (let prop in value) {
if (value.hasOwnProperty(prop)) return false
}
return true
}
/*
* askUrl - asks the user for a URL to download if one is not provided
* @returns {object} - the dialog result
* */
function askUrl() {
let urldef = app.theClipboard() || ''
const result = app.displayDialog("Please enter the URL you wish to download.", {
defaultAnswer: urldef,
buttons: ['Ok', 'Cancel'],
defaultButton: 'Ok',
cancelButton: 'Cancel'
})
return result
}
// --- END function declarations ---
// --- BEGIN main ---
// ignore the following lines, because guess what, they don't work.
//const prefs = Prefs.make({ allowed: ['baseFolder'], appIdentifier: "ltd.janky.ssalt" })
//const defaultfolder = prefs.baseFolder
let defaultfolder = ""
let url
// array comes from automator input
if (Array.isArray(urlToSuck) && !isEmpty(urlToSuck)) {
url = urlToSuck[0]
}
// if urlToSuck is passed as a string
if (!isEmpty(urlToSuck)) {
url = urlToSuck
}
else {
const url2 = askUrl()
url = url2.textReturned
}
url = ensureSomeScheme(url)
const parsedURL = parseURL(url)
if (isEmpty(folder)) {
folder = app.chooseFolder({ withPrompt: "Choose the root folder where you want all archives to go. This script will create subfolders for each site automatically, so choose the same folder each time.", default: isEmpty(defaultfolder) ? null : defaultfolder })
}
const webview = app.displayDialog("Do you want to use web views (slow) or headless (fast)? Webviews allow the page to run for a few seconds in the browser before capturing the resulting html.", {
buttons: [ 'Static', 'Webviews', "Cancel"],
defaultButton: 'Static',
cancelButton: 'Cancel'
})
// save basefolder ... jk doesnt work
//prefs.baseFolder = folder.toString()
const slugName = `${slugify(parsedURL.host)}`
const newFolderPath = `${folder.toString()}/${slugName}`
createFolder(newFolderPath)
const sets = buildSettings(
parsedURL,
newFolderPath.toString(),
customSettings
)
const doc = app.Document({ name: parsedURL.host }).make({ new: 'Document' })
// const resetable = doc.getProperty('settings')
// here's the million dollar line
for (const key in sets) {
doc.getProperty('settings').setProperty(key, sets[key])
}
doc.getProperty('settings').setProperty("downloadUsingWebViews", webview==="Webviews"?true:false);
app.download(url, { in: doc })
// --- END main ---
})(scriptSetup)
// --- END giant IIFE ---
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment