githubfoam · August 1, 2025 10:52
diff --git a/wayback maschine cheat sheet b/wayback maschine cheat sheet
 #====================================================================
 Nginx config 

 Why search: 

 Nginx server configuration.
 Nginx is a popular web server, and its configuration files (nginx.conf) control 
 how the server processes requests, handles proxies, and serves content. An exposed 
 config file can reveal server secrets and misconfigurations.

 Sensitive data:

 Server names and aliases: Reveals all domain names and subdomains associated with the server.
 Sensitive file paths: It could reveal the location of log files, SSL certificates, or other critical server files.
 Server block configurations
 Upstream backend server addresses
 SSL certificate paths
 Access log locations
 Proxy information: If the server acts as a reverse proxy, the configuration might expose 
 the IP addresses or internal hostnames of backend servers.
 Proxy settings revealing internal network structure
 Rate limiting and security configurations
 credentials
 internal IPs
 access rules

 Example: proxy_pass http://internal-api.example.local;

 Files to look for: nginx.conf or related files
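
A minimal sketch of how an archived Nginx config could be triaged once it has been downloaded. It assumes simple one-directive-per-line files; the `DIRECTIVES` list, the function name, and the sample config are illustrative, not part of any Nginx tooling.

```python
import re

# Directives that map to the items listed above; extend as needed.
DIRECTIVES = ("server_name", "proxy_pass", "ssl_certificate",
              "ssl_certificate_key", "access_log", "allow", "deny")

def extract_directives(conf_text):
    """Return {directive: [values]} for the directives of interest."""
    found = {}
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()          # drop trailing comments
        match = re.match(r"(\S+)\s+(.+?);\s*$", line)  # "name value;"
        if match and match.group(1) in DIRECTIVES:
            found.setdefault(match.group(1), []).append(match.group(2))
    return found

# Hypothetical archived config used only for illustration.
sample = """
server {
    server_name www.example.com admin.example.com;
    ssl_certificate /etc/ssl/certs/example.pem;
    location /api/ {
        proxy_pass http://internal-api.example.local;  # backend
    }
}
"""
for name, values in extract_directives(sample).items():
    print(name, values)
```

Splitting on `#` is a simplification; a value that legitimately contains `#` would be truncated.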

 #====================================================================
 WEB-INF/web.xml

 Why search: 

 Java web application configuration
 This is the deployment descriptor for Java web applications. The entire WEB-INF directory 
 is meant to be inaccessible to the public. If it's exposed, it's a significant security flaw.

 Sensitive data:

 Internal file paths: The web.xml file specifies the location of servlets, filters, 
 and other Java components, revealing the internal structure of the application.
 Database connection parameters
 Security constraints: It may contain information about which URLs require authentication or specific user roles, 
 providing an attacker with a blueprint for a privilege escalation attack.
 Session configuration
 Error page mappings
 debug options
 Servlet mappings: It can map URLs to specific Java classes, which can help an attacker understand the application's backend logic.
 internal class paths
 Servlet mappings exposing unreleased features.

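 Database resource references in <resource-ref> (the credentials themselves usually live in the container configuration), or secrets hard-coded in <env-entry>: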

 <env-entry>
  <env-entry-name>DBPassword</env-entry-name>
  <env-entry-value>secret123</env-entry-value>
 </env-entry>
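
As a rough illustration, the `<env-entry>` pairs in an archived web.xml can be pulled out with the standard library. The helper names and the sample document are assumptions made for this sketch; namespaces are stripped because real deployment descriptors usually declare one.

```python
import xml.etree.ElementTree as ET

def local(tag):
    """Strip an XML namespace prefix such as '{http://...}env-entry'."""
    return tag.rsplit("}", 1)[-1]

def env_entries(web_xml_text):
    """Return (name, value) pairs from every <env-entry> element."""
    root = ET.fromstring(web_xml_text)
    pairs = []
    for elem in root.iter():
        if local(elem.tag) != "env-entry":
            continue
        fields = {local(child.tag): (child.text or "").strip() for child in elem}
        pairs.append((fields.get("env-entry-name"), fields.get("env-entry-value")))
    return pairs

# Hypothetical archived web.xml used only for illustration.
sample = """<web-app>
  <env-entry>
    <env-entry-name>DBPassword</env-entry-name>
    <env-entry-value>secret123</env-entry-value>
  </env-entry>
</web-app>"""
print(env_entries(sample))
```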

 #====================================================================
 web.config files

 Why search: 

 IIS/Windows web server configuration.
 IIS/ASP.NET configuration
 This is the primary configuration file for ASP.NET applications running on IIS servers. Like other config files, 
 it is not intended for public viewing.

 Sensitive data:


 Debugging information: The file might be configured to display detailed error messages, 
 which can expose stack traces, server paths, and other valuable debugging information to an attacker.
 Debug mode enabled in historical versions.
 Database connection strings: This is the ASP.NET equivalent of a database credential, 
 often including the server, database name, and login credentials.
 Connection strings with database credentials
 Custom application settings: Any sensitive application settings or keys for services are stored here.
 Application keys and machine keys
 Application secrets in <appSettings>.
 Custom error pages with system information
 Authentication provider settings
 Session state configuration

 <connectionStrings>
  <add name="DefaultConnection" connectionString="Server=sql;Database=mydb;User Id=admin;Password=Pass123!;" />
 </connectionStrings>
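
A small sketch, under the assumption that the archived web.config is well-formed XML, of splitting each connection string into its key/value parts. The function name and sample are hypothetical.

```python
import xml.etree.ElementTree as ET

def connection_strings(web_config_text):
    """Return {name: {key: value}} for each <add> carrying a connectionString."""
    root = ET.fromstring(web_config_text)
    result = {}
    for add in root.iter("add"):
        raw = add.get("connectionString")
        if raw is None:
            continue  # <add> elements in <appSettings> use key/value instead
        parts = {}
        for piece in raw.split(";"):
            if "=" in piece:
                key, value = piece.split("=", 1)
                parts[key.strip()] = value.strip()
        result[add.get("name")] = parts
    return result

# Hypothetical archived web.config used only for illustration.
sample = """<configuration>
  <connectionStrings>
    <add name="DefaultConnection" connectionString="Server=sql;Database=mydb;User Id=admin;Password=Pass123!;" />
  </connectionStrings>
</configuration>"""
info = connection_strings(sample)
print(info["DefaultConnection"])
```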

 #====================================================================
 .htaccess files

 Why search: 

 Apache configuration files
 Apache config for access control
 The .htaccess file controls server behavior, including URL rewriting, password protection, 
 and access control. An exposed .htaccess file can reveal details about a server's security configurations.

 Sensitive data:

 disabled directories.
 IP-based access controls: It might reveal IP addresses that are whitelisted for administrative access.
 security rules (e.g., IP whitelisting)
 Password-protected directories: The file can list directories that are protected with basic authentication, 
 providing an attacker with a clear target for brute-force attacks.
 Location of the password file (AuthUserFile), which stores usernames and password hashes
 Access restrictions indicating sensitive areas
 Internal redirects and rewrites: Information on how the server 
 handles specific URLs, which could reveal hidden file names or internal directory structures.
 Redirect rules revealing internal structure
 Rewrite rules exposing backend services
 Error page configurations showing system paths



 AuthUserFile /var/www/html/.htpasswd
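
The items above can be sketched as a line scanner over an archived .htaccess file. The three patterns cover only `AuthUserFile`, `Allow from` / `Require ip`, and `RewriteRule`; the dictionary keys and the sample rules are made up for illustration.

```python
import re

# Directive names are matched case-insensitively.
PATTERNS = {
    "password_file": re.compile(r"^\s*AuthUserFile\s+(\S+)", re.I),
    "allowed_ip": re.compile(r"^\s*(?:Allow\s+from|Require\s+ip)\s+(.+)$", re.I),
    "rewrite_target": re.compile(r"^\s*RewriteRule\s+\S+\s+(\S+)", re.I),
}

def scan_htaccess(text):
    """Return {finding_type: [values]} for the patterns above."""
    findings = {name: [] for name in PATTERNS}
    for line in text.splitlines():
        for name, pattern in PATTERNS.items():
            match = pattern.match(line)
            if match:
                findings[name].append(match.group(1).strip())
    return findings

# Hypothetical archived .htaccess used only for illustration.
sample = """AuthType Basic
AuthUserFile /var/www/html/.htpasswd
Require ip 203.0.113.10
RewriteRule ^old-admin/(.*)$ /internal/admin.php?page=$1 [L]
"""
print(scan_htaccess(sample))
```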

 #====================================================================
 sitemap.xml


 Why search: 

 Sitemaps list all accessible URLs on a site for SEO.
 While sitemaps are intended to help search engines index a site, they can inadvertently list URLs 
 that are not linked from the main website and should not be public.
 Lists all site URLs, including unlinked/hidden pages.

 Sensitive data:

 Hidden or unlinked directories: The sitemap might contain links to development directories, test pages, staging, admin panels,
 or internal administrative pages that are not publicly visible on the site itself.
 User-specific data: A misconfigured sitemap could point to URLs containing user-specific data, such as /user-profile/123456, 
 potentially allowing an attacker to enumerate user IDs.
 Can reveal structure of the site, including sensitive or deprecated endpoints
 Hidden admin pages
 Internal tool URLs
 API endpoints, internal tools, or deprecated APIs with vulnerabilities
 Development/staging URLs and forgotten development/staging paths (e.g., /dev/login.php)
 File upload directories
 User-generated content paths

 https://example.com/admin-dashboard
 https://example.com/internal-api
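
One way to triage an archived sitemap is to list every `<loc>` entry whose path contains a keyword from the list above. This is a sketch: the `KEYWORDS` tuple is an assumption, and plain substring matching will produce false positives (for example `dev` inside `devices`).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Illustrative keyword list; tune it for the target.
KEYWORDS = ("admin", "internal", "dev", "staging", "test", "upload", "api")

def flagged_urls(sitemap_text):
    """Return the <loc> URLs whose path contains one of KEYWORDS."""
    root = ET.fromstring(sitemap_text)
    hits = []
    for elem in root.iter():
        # Tags carry the sitemap namespace, so compare the local name only.
        if elem.tag.rsplit("}", 1)[-1] != "loc" or not elem.text:
            continue
        url = elem.text.strip()
        path = urlparse(url).path.lower()
        if any(word in path for word in KEYWORDS):
            hits.append(url)
    return hits

# Hypothetical archived sitemap used only for illustration.
sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/admin-dashboard</loc></url>
  <url><loc>https://example.com/internal-api</loc></url>
</urlset>"""
print(flagged_urls(sample))
```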


 #====================================================================

 config.*
 Config files 

 Why search: 

 Configuration files store application settings
 Configuration files are the central nervous system of an application, storing critical settings and credentials. Exposing a config file is a critical security vulnerability.
 Application or system configuration

 Sensitive data:

 Database credentials: This is one of the most common and damaging exposures. 
 It could include database hostname, port, username, and plaintext password, giving an attacker full access to the site's database.
 Database connection strings, database usernames, passwords
 SMTP credentials
 API keys and secrets: Keys for third-party services like payment gateways, 
 social media integrations, or cloud services.
 Third-party service API keys (AWS, Google Cloud, payment gateways)
 Secrets
 Environment variables: Information about the server environment, file paths, and other system-level details.
 Custom encryption keys: Keys used to encrypt sensitive user data (e.g., sessions, cookies).
 Encryption keys and salts
 Debug settings revealing system paths
 Feature flags or feature toggles showing internal functionality
 Internal endpoints

 Files to look for: e.g., config.php, settings.py, config.json

 Examples of what a snapshot may contain:

 $db_password = "SuperSecret123!";
 database_password="s3cr3t"
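
A hedged sketch of scanning downloaded config text for quoted secret assignments like the ones above. The regular expression is an assumption that covers PHP, INI-style, and JSON-style assignments; it is not exhaustive and will miss unquoted values.

```python
import re

SECRET_ASSIGNMENT = re.compile(
    r"([\w$]*(?:password|passwd|secret|api_?key|token)\w*)"  # key or variable name
    r"[\"']?\s*(?:=>|=|:)\s*"                                # assignment operator
    r"[\"']([^\"']+)[\"']",                                  # quoted value
    re.IGNORECASE,
)

def find_secrets(text):
    """Return (name, value) pairs for assignments that look like secrets."""
    return [(m.group(1), m.group(2)) for m in SECRET_ASSIGNMENT.finditer(text)]

# Hypothetical snapshot content used only for illustration.
sample = '''$db_password = "SuperSecret123!";
database_password="s3cr3t"
{"api_key": "abcdef12345", "debug": "true"}
'''
for name, value in find_secrets(sample):
    print(name, "->", value)
```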

 #====================================================================
 .git files 

 Why search: 

 Git version control metadata

 Git repositories contain complete source code history
 Full source code: The entire source code of the application, including comments, logic, and internal directory structures
 If the .git directory is exposed, attackers can download the entire source code.

 This may include API keys, database credentials, or logic flaws.
 Hardcoded credentials: API keys, database usernames and passwords, or other secrets that were accidentally committed to the repository.

 When developers misconfigure a web server, they can accidentally expose the entire .git directory, 
 which contains the complete source code history of the website. An attacker can use 
 the Wayback Machine to look for archived copies of files from this directory; only files that 
 a crawler actually fetched will be available.

 Security vulnerabilities: The commit history may reveal past vulnerabilities that have since been fixed, 
 allowing an attacker to find older, unpatched versions of the code on other servers.
 Internal developer notes

 Sensitive data:

 Source code and application logic
 Database credentials in config files
 API keys and secrets
 Commit history
 Developer comments with passwords
 Email addresses and internal usernames
 Deployment scripts and infrastructure details

 Example sensitive data:

    .git/config → Repository URL (can lead to internal repos)

    .git/HEAD, .git/objects/ → Full code history.
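
To illustrate the .git/config point, the sketch below reads remote URLs out of the text of an archived config file. It is a deliberately small line parser rather than a full implementation of Git's config syntax, and the hostname in the sample is hypothetical.

```python
import re

def remote_urls(git_config_text):
    """Return {remote_name: url} parsed from the text of a .git/config file."""
    remotes = {}
    current = None
    for line in git_config_text.splitlines():
        line = line.strip()
        header = re.match(r'\[remote "(.+)"\]$', line)
        if header:
            current = header.group(1)
        elif line.startswith("["):
            current = None            # some other section, e.g. [core]
        elif current and "=" in line:
            key, value = (part.strip() for part in line.split("=", 1))
            if key == "url":
                remotes[current] = value
    return remotes

# Hypothetical archived .git/config used only for illustration.
sample = """[core]
\trepositoryformatversion = 0
[remote "origin"]
\turl = git@git.internal.example.com:team/webapp.git
\tfetch = +refs/heads/*:refs/remotes/origin/*
[branch "main"]
\tremote = origin
"""
print(remote_urls(sample))
```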
 #====================================================================
 robots.txt 

 Why search: 

 Often reveals hidden directories, sensitive paths, and admin panels.
 Paths to administrative panels, log files, or backup directories may still be accessible in archives.
 The robots.txt file tells search engine crawlers which parts of a website they may and may not crawl. 
 Attackers can use this file to quickly identify paths and directories that a website's administrator considers sensitive 
 and wants to keep private.

 Sensitive data:

 Admin panels (/admin, /dashboard, /management)
 API endpoints (/api/v1/, /rest/)
 Development directories (/dev/, /test/, /staging/)
 Backup locations (/backups/, /old/)
 Database interfaces (/phpmyadmin/, /mysql/)
 /private

 Disallow: /backup/
 Disallow: /old-site/

 Example: Disallow: /secret/ → Wayback URL: https://web.archive.org/web/20200101/https://www.harvard.edu/secret/
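
The mapping from a Disallow rule to a Wayback URL shown in the example can be sketched as follows. The function names and the default timestamp are assumptions; the URL layout follows the example line above.

```python
def disallowed_paths(robots_text):
    """Return the non-empty paths listed in Disallow rules."""
    paths = []
    for line in robots_text.splitlines():
        line = line.split("#", 1)[0].strip()      # drop comments
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths

def wayback_url(site, path, timestamp="20200101"):
    """Build a Wayback Machine URL for site + path near the given timestamp."""
    return f"https://web.archive.org/web/{timestamp}/{site.rstrip('/')}{path}"

# Hypothetical archived robots.txt used only for illustration.
sample = """User-agent: *
Disallow: /backup/
Disallow: /old-site/
Disallow:
"""
for path in disallowed_paths(sample):
    print(wayback_url("https://www.harvard.edu", path))
```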

 #====================================================================
 Understanding Web.archive.org and URL Exposure

 The Wayback Machine (Internet Archive) captures snapshots of websites over time, including the URLs.
 If a website previously exposed sensitive data in its URLs (e.g., in query strings), these URLs might be archived and accessible.
 #====================================================================
 The filter parameter in the Wayback Machine's CDX (Capture Index) API narrows search results by matching a regular expression against a field of each capture record. It does not directly reveal sensitive data embedded in the URL string itself, but it does help you find URLs that point to resources that are more likely to contain sensitive data.

 The general format for filtering is &filter=<field>:<value_regex>. You can also negate a filter with &filter=!<field>:<value_regex>.

 Common fields you can filter by:

    statuscode: HTTP status code (e.g., 200, 301, 404, 500)

    mimetype: MIME type of the captured content (e.g., text/html, application/pdf, image/jpeg, application/json)

    original: The original URL (useful for more complex regex filtering on the URL itself, beyond what the main url parameter can do)

    digest: SHA1 hash of the content

    length: Size of the content in bytes

    offset: Byte offset of the content in the WARC file

    urlkey: A normalized form of the URL used internally by the Archive

    timestamp: The capture timestamp (YYYYMMDDhhmmss)
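
A sketch of assembling a CDX query with several filters, assuming each condition is sent as its own `filter=` parameter. The helper name `cdx_query` and its arguments are made up for this example; only the URL is built, no request is sent.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query(url_pattern, filters=(), fields=None, collapse=None):
    """Return a CDX API URL; filters use the <field>:<regex> form, '!' negates."""
    params = [("url", url_pattern), ("output", "json")]
    params += [("filter", f) for f in filters]   # one filter= per condition
    if fields:
        params.append(("fl", ",".join(fields)))
    if collapse:
        params.append(("collapse", collapse))
    return CDX_ENDPOINT + "?" + urlencode(params)

# Captures that returned 200 and are not HTML pages.
example = cdx_query("www.harvard.edu/*",
                    filters=["statuscode:200", "!mimetype:text/html"],
                    fields=["timestamp", "original"])
print(example)
```

`urlencode` percent-encodes characters such as `*`, `:` and `!`, so the printed URL can be passed to curl without extra shell quoting concerns.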
    
 #====================================================================
 Sensitive data in URLs can include:

    Personally Identifiable Information (PII): Names, email addresses, phone numbers, addresses, social security numbers, dates of birth.

    Authentication/Session Information: Session IDs, authentication tokens, API keys, temporary passwords.

    Financial Information: Credit card numbers, bank account details.

    Confidential Business Information: Internal document IDs, project names, proprietary data.

 Common sensitive data indicators in URLs:

    password= or pwd=

    token= or auth=

    sessionid= or jsessionid=

    user= or username=

    email=

    ssn=

    account= or acc=

    File extensions that often contain sensitive data: .pdf, .xls, .xlsx, .doc, .docx, .sql, .bak, .zip, .tar, .gz, .log, .config, .csv, .yaml, .md, .key, .pem, .crt.

    URLs containing numerical sequences that could be IDs or financial data.
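
The indicator list above can be applied mechanically to each archived URL. This sketch checks query parameter names and the file extension; the two sets simply restate the list, and the function name is hypothetical.

```python
import posixpath
from urllib.parse import urlsplit, parse_qsl

SENSITIVE_PARAMS = {"password", "pwd", "token", "auth", "sessionid", "jsessionid",
                    "user", "username", "email", "ssn", "account", "acc"}
SENSITIVE_EXTENSIONS = {".pdf", ".xls", ".xlsx", ".doc", ".docx", ".sql", ".bak",
                        ".zip", ".tar", ".gz", ".log", ".config", ".csv", ".yaml",
                        ".md", ".key", ".pem", ".crt"}

def indicators(url):
    """Return the sensitive parameter names and file extension found in url."""
    parts = urlsplit(url)
    hits = [name for name, _ in parse_qsl(parts.query, keep_blank_values=True)
            if name.lower() in SENSITIVE_PARAMS]
    ext = posixpath.splitext(parts.path)[1].lower()
    if ext in SENSITIVE_EXTENSIONS:
        hits.append(ext)
    return hits

print(indicators("https://example.com/login?user=john&password=secret"))
print(indicators("https://example.com/backup/db.sql"))
```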
    
 Analyze the results: The CDX API will return a list of URLs. You will need to manually or programmatically review these URLs to identify any that appear to contain sensitive information. Look for:

    Clear text parameters: Check whether parameters in the URL query string directly expose sensitive data (e.g., ?username=john.doe&password=plaintextpassword).

    Base64 encoded strings: Sometimes sensitive data is base64 encoded within a URL parameter. While not plain text, it's easily decoded.

    Long, complex strings: These might be session tokens or API keys.

    File paths: Look for paths that might lead to sensitive files, such as /admin/, /backup/, /private/, or specific file names like config.php, database.sql.    
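
The base64 and long-string checks above could be automated roughly as follows. The 8-character minimum, the 32-character token threshold, and the function names are arbitrary assumptions; short values that happen to decode cleanly will show up as false positives.

```python
import base64
import binascii
import re
from urllib.parse import urlsplit, parse_qsl

BASE64_RE = re.compile(r"^[A-Za-z0-9+/_-]{8,}={0,2}$")

def try_base64(value):
    """Return decoded text if value looks like base64 of printable text, else None."""
    if not BASE64_RE.match(value):
        return None
    padded = value + "=" * (-len(value) % 4)     # restore stripped padding
    try:
        decoded = base64.urlsafe_b64decode(padded).decode("utf-8")
    except (binascii.Error, UnicodeDecodeError):
        return None
    return decoded if decoded.isprintable() else None

def suspicious_values(url):
    """Return (param, kind, detail) for base64-looking or long opaque values."""
    findings = []
    for name, value in parse_qsl(urlsplit(url).query):
        decoded = try_base64(value)
        if decoded is not None:
            findings.append((name, "base64", decoded))
        elif len(value) >= 32 and value.isalnum():
            findings.append((name, "long token", value))
    return findings

# Hypothetical archived URL used only for illustration.
encoded = base64.b64encode(b"user:password").decode()
url = "https://example.com/cb?data=" + encoded + "&sid=" + "0123456789abcdef" * 2 + "&page=2"
print(suspicious_values(url))
```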

 Sensitive data in URLs typically appears as:

    Query parameters:
    ?token=abcd1234
    ?ssn=123-45-6789
    ?api_key=abcdef12345

    Session IDs:
    ?sessionid=xyz

    User credentials or emails:
    ?user=john&password=secret

    File paths exposing private data:
    /internal/config.php

 Use wildcards to find specific file types: harvard.edu/*.sql, harvard.edu/*.txt, harvard.edu/*.log
 (the CDX API itself only accepts a trailing * for prefix matches or a leading *. for subdomains, so there
 express the extension as a filter, e.g. url=harvard.edu/*&filter=original:.*\.sql$)

 Check for common sensitive paths like harvard.edu/admin/*, harvard.edu/config/*, harvard.edu/.env

 Look for error pages that might reveal system information, e.g. captures with a 5xx status: url=harvard.edu/*&filter=statuscode:5..

 Search for backup files: harvard.edu/*.bak, harvard.edu/*.old, harvard.edu/*.backup

 Check development/staging subdomains: dev.harvard.edu/*, staging.harvard.edu/*, test.harvard.edu/*
 #====================================================================
 Time-Based Analysis:

 Focus on older snapshots (2000s-2010s) when security practices were less mature
 Look for periods around major site updates or migrations
 Check snapshots from different years to see what changed
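
Comparing years can be done offline once CDX rows have been fetched. The sketch assumes rows of `(timestamp, original)` pairs, as returned with `fl=timestamp,original`; the function names and sample rows are illustrative.

```python
from collections import defaultdict

def urls_by_year(rows):
    """Group original URLs by capture year; timestamp is YYYYMMDDhhmmss."""
    years = defaultdict(set)
    for timestamp, original in rows:
        years[timestamp[:4]].add(original)
    return years

def removed_between(rows, earlier, later):
    """URLs captured in year `earlier` that have no capture in year `later`."""
    years = urls_by_year(rows)
    return sorted(years[earlier] - years[later])

# Hypothetical CDX rows used only for illustration.
rows = [
    ("20090314120000", "http://example.com/"),
    ("20090314120500", "http://example.com/old-admin/"),
    ("20150601000000", "http://example.com/"),
]
print(removed_between(rows, "2009", "2015"))
```

URLs that disappear between two years are often the pages removed during a migration, which makes them a reasonable first place to look.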
 #====================================================================
 Common Sensitive Data Patterns:

 Database connection strings
 API keys and tokens
 Directory listings showing internal structure
 Configuration files (.htaccess, web.config, etc.)
 Debug pages with system information
 Employee directories or internal documents

 #====================================================================
 Manual Search via Web Interface

    Visit https://web.archive.org.

    Enter https://www.harvard.edu in the search bar.

    Browse archived snapshots (timeline/calendar view).

    Check URLs in the browser's address bar when viewing a snapshot. Look for:

        Parameters like ?id=..., ?token=..., etc.

        Unusual path segments (e.g., /admin/, /debug/).

    Inspect links on the archived page for sensitive parameters.
    
 #====================================================================
 Use Wayback Machine's CDX API

 # Example API request for all captures of harvard.edu that returned HTTP 200
 curl "http://web.archive.org/cdx/search/cdx?url=www.harvard.edu/*&output=json&filter=statuscode:200"

 # Example API request for the deduplicated list of original URLs
 curl "http://web.archive.org/cdx/search/cdx?url=www.harvard.edu/*&output=json&fl=original&collapse=urlkey"

    url=www.harvard.edu/* → Targets all subpaths

    output=json → JSON format

    fl=original → Only original URLs

    collapse=urlkey → Remove duplicates

 This returns a list of URLs archived over time.


 Look for common sensitive parameters:

    password, token, key, auth, session, email, etc.


 Command-line filtering (deduplicated URL list):

 curl "http://web.archive.org/cdx/search/cdx?url=www.harvard.edu/*&output=json&fl=original&collapse=urlkey" \
 | grep -Ei "token|key|auth|session|password|email"

 Command-line filtering (captures that returned HTTP 200):

 curl "http://web.archive.org/cdx/search/cdx?url=www.harvard.edu/*&output=json&filter=statuscode:200" \
 | grep -Ei "token|key|auth|session|password|email"
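
The same keyword filter can be applied to a saved response instead of piping through grep. This sketch assumes the `output=json` layout in which the first row holds the field names; the function name and sample response are hypothetical.

```python
import json
import re

KEYWORDS = re.compile(r"token|key|auth|session|password|email", re.IGNORECASE)

def matching_urls(cdx_json_text):
    """Return original URLs from a CDX JSON response that match KEYWORDS."""
    rows = json.loads(cdx_json_text)
    if not rows:
        return []
    header, captures = rows[0], rows[1:]      # first row lists the field names
    column = header.index("original")
    return [row[column] for row in captures if KEYWORDS.search(row[column])]

# Hypothetical saved response used only for illustration.
sample = """[["original"],
["http://www.example.com/"],
["http://www.example.com/reset?token=abcd1234"],
["http://www.example.com/login?email=a%40example.com"]]"""
print(matching_urls(sample))
```

Looking up the `original` column by name keeps the function usable when more fields are requested with `fl=`.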
 #====================================================================
 https://web.archive.org/web/*/https://www.harvard.edu/*?token=*
 https://web.archive.org/web/*/https://www.harvard.edu/*?session=*
 #====================================================================