Last active
August 1, 2025 10:52
Wayback Machine cheat sheet
#====================================================================
Nginx config
Why search:
Nginx server configuration.
Nginx is a popular web server, and its configuration files (nginx.conf) control
how the server processes requests, handles proxies, and serves content. An exposed
config file can reveal server secrets and misconfigurations.
Sensitive data:
Server names and aliases: Reveal the domain names and subdomains associated with the server.
Sensitive file paths: May reveal the location of log files, SSL certificates, or other critical server files.
Server block configurations
Upstream backend server addresses
SSL certificate paths
Access log locations
Proxy information: If the server acts as a reverse proxy, the configuration might expose
the IP addresses or internal hostnames of backend servers.
Proxy settings revealing internal network structure
Rate limiting and security configurations
Credentials
Internal IPs
Access rules
Example:
proxy_pass http://internal-api.example.local;
Files: nginx.conf or related
#====================================================================
WEB-INF/web.xml
Why search:
Java web application configuration
This is the deployment descriptor for Java web applications. The entire WEB-INF directory
is meant to be inaccessible to the public. If it is exposed, that is a significant security flaw.
Sensitive data:
Internal file paths: The web.xml file declares servlets, filters,
and other Java components, revealing the internal structure of the application.
Database connection parameters
Security constraints: It may contain information about which URLs require authentication or specific user roles,
providing an attacker with a blueprint for a privilege escalation attack.
Session configuration
Error page mappings
Debug options
Servlet mappings: It can map URLs to specific Java classes, which can help an attacker understand the application's backend logic.
Internal class paths
Servlet mappings exposing unreleased features
Database resource names in <resource-ref>; credentials sometimes appear in <env-entry> or <context-param>
Example:
<env-entry>
<env-entry-name>DBPassword</env-entry-name>
<env-entry-value>secret123</env-entry-value>
</env-entry>
#====================================================================
web.config files
Why search:
IIS/Windows web server configuration.
IIS/ASP.NET configuration
This is the primary configuration file for ASP.NET applications running on IIS servers. Like other config files,
it is not intended for public viewing.
Sensitive data:
Debugging information: The file might be configured to display detailed error messages,
which can expose stack traces, server paths, and other valuable debugging information to an attacker.
Debug mode enabled in historical versions
Database connection strings: The ASP.NET form of a database credential,
often including the server, database name, and login credentials.
Connection strings with database credentials
Custom application settings: Sensitive application settings or keys for external services may be stored here.
Application keys and machine keys
Application secrets in <appSettings>
Custom error pages with system information
Authentication provider settings
Session state configuration
Example:
<connectionStrings>
<add name="DefaultConnection" connectionString="Server=sql;Database=mydb;User Id=admin;Password=Pass123!;" />
</connectionStrings>
#====================================================================
.htaccess files
Why search:
Apache configuration files
Apache config for access control
The .htaccess file controls server behavior, including URL rewriting, password protection,
and access control. An exposed .htaccess file can reveal details about a server's security configuration.
Sensitive data:
Disabled directories
IP-based access controls: It might reveal IP addresses that are whitelisted for administrative access.
Security rules (e.g., IP whitelisting)
Password-protected directories: The file can list directories that are protected with basic authentication,
providing an attacker with a clear target for brute-force attacks.
Directory passwords and usernames
Access restrictions indicating sensitive areas
Internal redirects and rewrites: Information on how the server
handles specific URLs, which could reveal hidden file names or internal directory structures.
Redirect rules revealing internal structure
Rewrite rules exposing backend services
Error page configurations showing system paths
Example:
AuthUserFile /var/www/html/.htpasswd
#====================================================================
sitemap.xml
Why search:
Sitemaps list a site's URLs for SEO.
While sitemaps are intended to help search engines index a site, they can inadvertently list URLs
that are not linked from the main website and should not be public.
Lists site URLs, including unlinked/hidden pages.
Sensitive data:
Hidden or unlinked directories: The sitemap might contain links to development directories, test pages, staging, admin panels,
or internal administrative pages that are not publicly visible on the site itself.
User-specific data: A misconfigured sitemap could point to URLs containing user-specific data, such as /user-profile/123456,
potentially allowing an attacker to enumerate user IDs.
Can reveal the structure of the site, including sensitive or deprecated endpoints
Hidden admin pages
Internal tool URLs
API endpoints, internal tools, or deprecated APIs with vulnerabilities
Development/staging URLs, forgotten development/staging paths (e.g., /dev/login.php)
File upload directories
User-generated content paths
Example:
https://example.com/admin-dashboard
https://example.com/internal-api
#====================================================================
config.*
Config files
Why search:
Configuration files store application settings
Configuration files are the central nervous system of an application, storing critical settings and credentials. Exposing a config file is a critical security vulnerability.
Application or system configuration
Sensitive data:
Database credentials: This is one of the most common and damaging exposures.
It could include the database hostname, port, username, and plaintext password, giving an attacker full access to the site's database.
Database connection strings, database usernames, passwords
SMTP credentials
API keys and secrets: Keys for third-party services like payment gateways,
social media integrations, or cloud services.
Third-party service API keys (AWS, Google Cloud, payment gateways)
Secrets
Environment variables: Information about the server environment, file paths, and other system-level details.
Custom encryption keys: Keys used to encrypt sensitive user data (e.g., sessions, cookies).
Encryption keys and salts
Debug settings revealing system paths
Feature flags or feature toggles showing internal functionality
Internal endpoints
Files: e.g., config.php, settings.py, config.json
Example:
$db_password = "SuperSecret123!";
database_password="s3cr3t" in a snapshot.
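The assignments above can be spotted mechanically once a snapshot has been saved locally. The following is a minimal sketch, not a complete secret scanner: the function name find_secrets and the keyword list are assumptions chosen for illustration, and the pattern will produce false positives (e.g., a variable named tokenizer).

```shell
# Hypothetical helper: print numbered lines that look like hardcoded secrets.
# Reads the files given as arguments, or stdin when none are given.
# The keyword list is an assumption; extend it for the target stack.
find_secrets() {
  grep -nEi "(password|passwd|pwd|secret|api_?key|token)[a-z0-9_]*[\"']?[[:space:]]*[=:]" "$@"
}
```

Usage, assuming a snapshot was saved as config.php: find_secrets config.php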
#====================================================================
.git files
Why search:
Git version control metadata
Git repositories contain the complete source code history
Full source code: The entire source code of the application, including comments, logic, and internal directory structures.
If the .git directory is exposed, attackers can download the entire source code.
This may include API keys, database credentials, or logic flaws.
Hardcoded credentials: API keys, database usernames and passwords, or other secrets that were accidentally committed to the repository.
When developers misconfigure a web server, they can accidentally expose the entire .git directory,
which contains the complete source code history of the website. If the crawler captured those files,
an attacker can find the snapshots in the Wayback Machine and download them.
Security vulnerabilities: The commit history may reveal past vulnerabilities that have since been fixed,
allowing an attacker to find older, unpatched versions of the code on other servers.
Internal developer notes
Sensitive data:
Source code and application logic
Database credentials in config files
API keys and secrets
Commit history
Developer comments with passwords
Email addresses and internal usernames
Deployment scripts and infrastructure details
Example sensitive data:
.git/config → Repository URL (can lead to internal repos)
.git/HEAD, .git/objects/ → Full code history.
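An archived .git/HEAD is only useful if the capture holds real Git metadata rather than an error page served with status 200. The sketch below checks the first line of a downloaded file against the two forms a HEAD file normally takes (a symbolic ref or a 40-character commit hash). The function name looks_like_git_head is hypothetical.

```shell
# Hypothetical helper: succeed (exit 0) when stdin looks like a Git HEAD file.
# Accepts "ref: refs/..." or a bare 40-hex-digit commit hash on the first line.
looks_like_git_head() {
  head -n 1 | tr -d '\r' | grep -Eq '^(ref: refs/[^[:space:]]+|[0-9a-f]{40})$'
}
```

Usage, assuming the capture was saved as HEAD.snapshot: looks_like_git_head < HEAD.snapshot && echo "real Git metadata"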
#====================================================================
robots.txt
Why search:
Often reveals hidden directories, sensitive paths, and admin panels
Paths to administrative panels, log files, or backup directories that may be accessible in archives
The robots.txt file is designed to tell search engine crawlers which parts of a website they should and should not crawl.
Attackers can use this file to quickly identify paths and directories that a website's administrator considers sensitive
and wants to keep private.
Sensitive data:
Admin panels (/admin, /dashboard, /management)
API endpoints (/api/v1/, /rest/)
Development directories (/dev/, /test/, /staging/)
Backup locations (/backups/, /old/)
Database interfaces (/phpmyadmin/, /mysql/)
/private
Disallow: /backup/
Disallow: /old-site/
Example: Disallow: /secret/ → Wayback URL: https://web.archive.org/web/20200101/https://www.harvard.edu/secret/
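The Disallow-to-Wayback mapping shown in the example can be applied to a whole robots.txt at once. This is a minimal sketch: the function name robots_to_wayback is hypothetical, example.com is a placeholder host, and it only handles lines written as "Disallow: /path" with a space after the colon.

```shell
# Hypothetical helper: read robots.txt on stdin and print one Wayback URL
# per non-empty Disallow path.
#   $1 = host name (placeholder, e.g. example.com)
#   $2 = timestamp or timestamp prefix (defaults to 2020)
robots_to_wayback() {
  awk -v site="$1" -v ts="${2:-2020}" '
    { sub(/\r$/, "") }   # tolerate CRLF line endings
    tolower($1) == "disallow:" && $2 != "" && $2 != "/" {
      printf "https://web.archive.org/web/%s/https://%s%s\n", ts, site, $2
    }'
}
```

Usage, assuming robots.txt was saved locally: robots_to_wayback example.com 20200101 < robots.txt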
#====================================================================
Understanding Web.archive.org and URL Exposure
The Wayback Machine (Internet Archive) captures snapshots of websites over time, including the URLs.
If a website previously exposed sensitive data in its URLs (e.g., in query strings), these URLs might be archived and accessible.
#====================================================================
The filter parameter in the Wayback Machine's CDX (Capture Index) API narrows down search results. It does not search for sensitive data embedded in the URL string itself, but it helps you find URLs that point to resources more likely to contain sensitive data.
The general format for filtering is &filter=<field>:<value_regex>. You can also negate a filter with &filter=!<field>:<value_regex>.
Common fields you can filter by:
statuscode: HTTP status code (e.g., 200, 301, 404, 500)
mimetype: MIME type of the captured content (e.g., text/html, application/pdf, image/jpeg, application/json)
original: The original URL (useful for more complex regex filtering on the URL itself, beyond what the main url parameter can do)
digest: SHA-1 hash of the content (Base32-encoded)
length: Size of the stored record in bytes
offset: Byte offset of the record in the WARC file (part of the underlying CDX format; may not be returned by the public API)
urlkey: A normalized form of the URL used internally by the Archive
timestamp: The capture timestamp (YYYYMMDDhhmmss)
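The filter format above can be combined into a single query, since the filter parameter may be repeated. The following sketch only assembles the URL; the function name cdx_url is hypothetical and example.com is a placeholder. Regex characters with special meaning in URLs may need percent-encoding before the request is sent.

```shell
# Hypothetical helper: build a CDX query URL from a target and filters.
#   $1      = url pattern (placeholder, e.g. 'example.com/*')
#   $2...   = filters in field:regex form; prefix with ! to negate
cdx_url() {
  target="$1"; shift
  url="https://web.archive.org/cdx/search/cdx?url=${target}&output=json"
  for f in "$@"; do
    url="${url}&filter=${f}"
  done
  printf '%s\n' "$url"
}
```

Usage: curl "$(cdx_url 'example.com/*' 'statuscode:200' '!mimetype:text/html')"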
#====================================================================
Sensitive data in URLs can include:
Personally Identifiable Information (PII): Names, email addresses, phone numbers, addresses, social security numbers, dates of birth.
Authentication/Session Information: Session IDs, authentication tokens, API keys, temporary passwords.
Financial Information: Credit card numbers, bank account details.
Confidential Business Information: Internal document IDs, project names, proprietary data.
Common sensitive data indicators in URLs:
password= or pwd=
token= or auth=
sessionid= or jsessionid=
user= or username=
email=
ssn=
account= or acc=
File extensions that often contain sensitive data: .pdf, .xls, .xlsx, .doc, .docx, .sql, .bak, .zip, .tar, .gz, .log, .config, .csv, .yaml, .md, .key, .pem, .crt.
URLs containing numerical sequences that could be IDs or financial data.
Analyze the results: The CDX API returns a list of URLs. Review them manually or programmatically to identify any that appear to contain sensitive information. Look for:
Clear-text parameters: Query-string parameters that directly expose sensitive data (e.g., ?username=john.doe&password=plaintextpassword).
Base64-encoded strings: Sensitive data is sometimes Base64-encoded within a URL parameter. It is not plain text, but it is trivially decoded.
Long, complex strings: These might be session tokens or API keys.
File paths: Paths that might lead to sensitive files, such as /admin/, /backup/, /private/, or specific file names like config.php or database.sql.
Sensitive data in URLs typically appears as:
Query parameters:
?token=abcd1234
?ssn=123-45-6789
?api_key=abcdef12345
Session IDs:
?sessionid=xyz
User credentials or emails:
?user=john&password=secret
File paths exposing private data:
/internal/config.php
Use wildcards to find specific file types: harvard.edu/*.sql, harvard.edu/*.txt, harvard.edu/*.log
Check for common sensitive paths like harvard.edu/admin/*, harvard.edu/config/*, harvard.edu/.env
Look for error pages that might reveal system information: captures of harvard.edu/* with status code 500, or URLs containing "error"
Search for backup files: harvard.edu/*.bak, harvard.edu/*.old, harvard.edu/*.backup
Check development/staging subdomains: dev.harvard.edu/*, staging.harvard.edu/*, test.harvard.edu/*
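The indicators listed in this section can be folded into one pass over a URL list. This is a rough sketch rather than an exhaustive detector: the function name flag_sensitive_urls is hypothetical, and the parameter and extension lists are a subset chosen for illustration.

```shell
# Hypothetical helper: read URLs on stdin, print those with a sensitive-looking
# query parameter or a file extension that often holds sensitive data.
flag_sensitive_urls() {
  grep -Ei '[?&;](password|pwd|token|auth|sessionid|jsessionid|user|username|email|ssn|account|acc|api_key)=|\.(sql|bak|log|pem|key|env|config)([?#]|$)'
}
```

Usage, assuming urls.txt holds one archived URL per line: flag_sensitive_urls < urls.txt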
#====================================================================
Time-Based Analysis:
Focus on older snapshots (2000s-2010s), when security practices were less mature
Look for periods around major site updates or migrations
Check snapshots from different years to see what changed
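A time window can be expressed directly in a CDX query with the from and to parameters, and collapse=timestamp:4 keeps roughly one capture per URL per year, which makes year-over-year comparison easier. The sketch below only builds the query URL; the function name cdx_window_url is hypothetical and example.com is a placeholder.

```shell
# Hypothetical helper: build a CDX query limited to a date range.
#   $1 = url pattern (placeholder, e.g. 'example.com/*')
#   $2 = start timestamp prefix (e.g. 2000)
#   $3 = end timestamp prefix (e.g. 2010)
cdx_window_url() {
  printf 'https://web.archive.org/cdx/search/cdx?url=%s&from=%s&to=%s&fl=timestamp,original&collapse=timestamp:4\n' "$1" "$2" "$3"
}
```

Usage: curl "$(cdx_window_url 'example.com/*' 2000 2010)"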
#====================================================================
Common Sensitive Data Patterns:
Database connection strings
API keys and tokens
Directory listings showing internal structure
Configuration files (.htaccess, web.config, etc.)
Debug pages with system information
Employee directories or internal documents
#====================================================================
Manual Search via Web Interface
Visit https://web.archive.org.
Enter https://www.harvard.edu in the search bar.
Browse archived snapshots (timeline/calendar view).
Check URLs in the browser's address bar when viewing a snapshot. Look for:
Parameters like ?id=..., ?token=..., etc.
Unusual path segments (e.g., /admin/, /debug/).
Inspect links on the archived page for sensitive parameters.
#====================================================================
Use Wayback Machine's CDX API
# Example API requests for all captures of harvard.edu
curl "https://web.archive.org/cdx/search/cdx?url=www.harvard.edu/*&output=json&filter=statuscode:200"
curl "https://web.archive.org/cdx/search/cdx?url=www.harvard.edu/*&output=json&fl=original&collapse=urlkey"
url=www.harvard.edu/* → Targets all subpaths
output=json → JSON format
filter=statuscode:200 → Only captures that returned HTTP 200
fl=original → Only original URLs
collapse=urlkey → Remove duplicates
This returns a list of URLs archived over time.
Look for common sensitive parameters:
password, token, key, auth, session, email, etc.
Command-line filtering (unique URLs only):
curl "https://web.archive.org/cdx/search/cdx?url=www.harvard.edu/*&output=json&fl=original&collapse=urlkey" \
| grep -Ei "token|key|auth|session|password|email"
Command-line filtering (full rows for captures with status 200):
curl "https://web.archive.org/cdx/search/cdx?url=www.harvard.edu/*&output=json&filter=statuscode:200" \
| grep -Ei "token|key|auth|session|password|email"
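Once interesting rows are found, each one can be turned into a link to the archived copy. The sketch below assumes the query was run without output=json and with fl=timestamp,original, so every row is "timestamp original" separated by a space. The id_ suffix after the timestamp asks the Wayback Machine for the capture as originally stored, without the replay toolbar. The function name cdx_to_replay is hypothetical.

```shell
# Hypothetical helper: convert "timestamp original" rows on stdin into
# Wayback replay URLs that return the raw archived content (id_ modifier).
cdx_to_replay() {
  awk 'NF >= 2 { printf "https://web.archive.org/web/%sid_/%s\n", $1, $2 }'
}
```

Usage, assuming rows.txt holds the CDX text output: cdx_to_replay < rows.txt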
#====================================================================
https://web.archive.org/web/*/https://www.harvard.edu/*?token=*
https://web.archive.org/web/*/https://www.harvard.edu/*?session=*
#====================================================================