Wayback Machine cheat sheet
#====================================================================
Nginx config
Why search:
Nginx server configuration.
Nginx is a popular web server, and its configuration files (nginx.conf) control
how the server processes requests, handles proxies, and serves content. An exposed
config file can reveal server secrets and misconfigurations.
Sensitive data:
Server names and aliases: Reveals all domain names and subdomains associated with the server.
Sensitive file paths: It could reveal the location of log files, SSL certificates, or other critical server files.
Server block configurations
Upstream backend server addresses
SSL certificate paths
Access log locations
Proxy information: If the server acts as a reverse proxy, the configuration might expose
the IP addresses or internal hostnames of backend servers.
Proxy settings revealing internal network structure
Rate limiting and security configurations
Credentials, internal IPs, and access rules
proxy_pass http://internal-api.example.local;
Example files: nginx.conf or related include files
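Example CDX query for archived Nginx configs (a sketch; www.harvard.edu is a placeholder target and the grep pattern is only a guess at likely file names):
curl -s "http://web.archive.org/cdx/search/cdx?url=www.harvard.edu/*&output=json&fl=original&collapse=urlkey" \
| grep -Ei "nginx\.conf|/nginx/"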
#====================================================================
WEB-INF/web.xml
Why search:
Java web application configuration
This is the deployment descriptor for Java web applications. The entire WEB-INF directory
is meant to be inaccessible to the public. If it's exposed, it's a significant security flaw.
Sensitive data:
Internal file paths: The web.xml file specifies the location of servlets, filters,
and other Java components, revealing the internal structure of the application.
Database connection parameters
Security constraints: It may contain information about which URLs require authentication or specific user roles,
providing an attacker with a blueprint for a privilege escalation attack.
Session configuration
Error page mappings
debug options
Servlet mappings: It can map URLs to specific Java classes, which can help an attacker understand the application's backend logic.
internal class paths
Servlet mappings exposing unreleased features.
Database credentials in <resource-ref>.
<env-entry>
<env-entry-name>DBPassword</env-entry-name>
<env-entry-value>secret123</env-entry-value>
</env-entry>
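Example CDX query for exposed WEB-INF paths (a sketch; uses the filter field on the original URL, target domain is a placeholder):
curl "http://web.archive.org/cdx/search/cdx?url=www.harvard.edu/*&filter=original:.*WEB-INF.*&fl=original,timestamp&collapse=urlkey"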
#====================================================================
web.config files
Why search:
IIS/ASP.NET configuration for Windows web servers.
This is the primary configuration file for ASP.NET applications running on IIS servers. Like other config files,
it is not intended for public viewing.
Sensitive data:
Debugging information: The file might be configured to display detailed error messages,
which can expose stack traces, server paths, and other valuable debugging information to an attacker.
Debug mode enabled in historical versions.
Database connection strings: This is the ASP.NET equivalent of a database credential,
often including the server, database name, and login credentials.
Connection strings with database credentials
Custom application settings: Any sensitive application settings or keys for services are stored here.
Application keys and machine keys
Application secrets in <appSettings>.
Custom error pages with system information
Authentication provider settings
Session state configuration
<connectionStrings>
<add name="DefaultConnection" connectionString="Server=sql;Database=mydb;User Id=admin;Password=Pass123!;" />
</connectionStrings>
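Example lookup for archived web.config files (a sketch; restricted to successful captures, domain is a placeholder):
curl -s "http://web.archive.org/cdx/search/cdx?url=www.harvard.edu/*&output=json&filter=statuscode:200&fl=original&collapse=urlkey" \
| grep -i "web\.config"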
#====================================================================
.htaccess files
Why search:
Apache per-directory configuration, commonly used for access control
The .htaccess file controls server behavior, including URL rewriting, password protection,
and access control. An exposed .htaccess file can reveal details about a server's security configurations.
Sensitive data:
Directories with access disabled or restricted.
IP-based access controls: It might reveal IP addresses that are whitelisted for administrative access.
security rules (e.g., IP whitelisting)
Password-protected directories: The file can list directories that are protected with basic authentication,
providing an attacker with a clear target for brute-force attacks.
Directory passwords and usernames
Access restrictions indicating sensitive areas
Internal redirects and rewrites: Information on how the server
handles specific URLs, which could reveal hidden file names or internal directory structures.
Redirect rules revealing internal structure
Rewrite rules exposing backend services
Error page configurations showing system paths
AuthUserFile /var/www/html/.htpasswd
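Example of finding and pulling a raw archived .htaccess (a sketch; TIMESTAMP is a placeholder for a capture timestamp returned by the first command, and the id_ suffix requests the original content without the archive banner):
curl "http://web.archive.org/cdx/search/cdx?url=www.harvard.edu/.htaccess&output=json&fl=timestamp,original"
curl -s "https://web.archive.org/web/TIMESTAMPid_/https://www.harvard.edu/.htaccess"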
#====================================================================
sitemap.xml
Why search:
Sitemaps list all accessible URLs on a site for SEO.
While sitemaps are intended to help search engines index a site, they can inadvertently list URLs
that are not linked from the main website and should not be public.
Lists all site URLs, including unlinked/hidden pages.
Sensitive data:
Hidden or unlinked directories: The sitemap might contain links to development directories, test pages, staging, admin panels,
or internal administrative pages that are not publicly visible on the site itself.
User-specific data: A misconfigured sitemap could point to URLs containing user-specific data, such as /user-profile/123456,
potentially allowing an attacker to enumerate user IDs.
Can reveal structure of the site, including sensitive or deprecated endpoints
Hidden admin pages
Internal tool URLs
API endpoints: internal tools or deprecated APIs with vulnerabilities.
Development/staging URLs: forgotten development/staging paths (e.g., /dev/login.php).
File upload directories
User-generated content paths
https://example.com/admin-dashboard
https://example.com/internal-api
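Example of listing and grepping archived sitemaps (a sketch; collapse=digest keeps one entry per distinct content, TIMESTAMP is a placeholder, and the grep terms are assumptions about interesting paths):
curl "http://web.archive.org/cdx/search/cdx?url=www.harvard.edu/sitemap.xml&output=json&fl=timestamp,original&collapse=digest"
curl -s "https://web.archive.org/web/TIMESTAMPid_/https://www.harvard.edu/sitemap.xml" \
| grep -Ei "admin|internal|staging|test|dev"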
#====================================================================
config.*
Config files
Why search:
Configuration files store application settings
Configuration files are the central nervous system of an application, storing critical settings and credentials. Exposing a config file is a critical security vulnerability.
Application or system configuration
Sensitive data:
Database credentials: This is one of the most common and damaging exposures.
It could include database hostname, port, username, and plaintext password, giving an attacker full access to the site's database.
Database connection strings, database usernames, passwords
SMTP credentials
API keys and secrets: Keys for third-party services like payment gateways,
social media integrations, or cloud services.
Third-party service API keys (AWS, Google Cloud, payment gateways)
Secrets
Environment variables: Information about the server environment, file paths, and other system-level details.
Custom encryption keys: Keys used to encrypt sensitive user data (e.g., sessions, cookies).
Encryption keys and salts
Debug settings revealing system paths
Feature flags or feature toggles showing internal functionality
internal endpoints
e.g., config.php, settings.py, config.json
$db_password = "SuperSecret123!";
database_password="s3cr3t" in a snapshot.
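Example CDX sweep for common config file names (a sketch; the name list is an assumption, extend as needed):
curl -s "http://web.archive.org/cdx/search/cdx?url=www.harvard.edu/*&output=json&fl=original&collapse=urlkey" \
| grep -Ei "config\.(php|json|yml|yaml)|settings\.py|\.env"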
#====================================================================
.git files
Why search:
Git version control metadata
Git repositories contain complete source code history
Full source code: The entire source code of the application, including comments, logic, and internal directory structures
If .git directory is exposed, attackers can download the entire source code.
This may include API keys, database credentials, or logic flaws.
Hardcoded credentials: API keys, database usernames and passwords, or other secrets that were accidentally committed to the repository.
When developers misconfigure a web server, they can accidentally expose the entire .git directory,
which contains the complete source code history of the website. An attacker can use
the Wayback Machine to find a snapshot of this directory and download it.
Security vulnerabilities: The commit history may reveal past vulnerabilities that have since been fixed,
allowing an attacker to find older, unpatched versions of the code on other servers.
Internal developer notes
Sensitive data:
Source code and application logic
Database credentials in config files
API keys and secrets
Commit history
Developer comments with passwords
Email addresses and internal usernames
Deployment scripts and infrastructure details
Example sensitive data:
.git/config → Repository URL (can lead to internal repos)
.git/HEAD, .git/objects/ → Full code history.
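Example CDX prefix query for an exposed .git directory (a sketch; the trailing * turns the query into a prefix match under /.git/, domain is a placeholder):
curl "http://web.archive.org/cdx/search/cdx?url=www.harvard.edu/.git/*&output=json&fl=original,timestamp&collapse=urlkey"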
#====================================================================
robots.txt
Why search:
Often reveals hidden directories, sensitive paths, and admin panels
Paths to administrative panels, log files, or backup directories accessible in archives
The robots.txt file is designed to tell search engine crawlers which parts of a website they should and should not index.
Attackers can use this file to quickly identify paths and directories that a website's administrator considers sensitive
and wants to keep private.
Sensitive data:
Admin panels (/admin, /dashboard, /management)
API endpoints (/api/v1/, /rest/)
Development directories (/dev/, /test/, /staging/)
Backup locations (/backups/, /old/)
Database interfaces (/phpmyadmin/, /mysql/)
/private
Disallow: /backup/
Disallow: /old-site/
Example: Disallow: /secret/ → Wayback URL: https://web.archive.org/web/20200101/https://www.harvard.edu/secret/
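Example of tracking robots.txt over time and fetching a raw historical copy (a sketch; the timestamp is illustrative and the id_ suffix returns the unmodified archived content):
curl "http://web.archive.org/cdx/search/cdx?url=www.harvard.edu/robots.txt&output=json&fl=timestamp,original&collapse=digest"
curl -s "https://web.archive.org/web/20200101id_/https://www.harvard.edu/robots.txt"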
#====================================================================
Understanding Web.archive.org and URL Exposure
The Wayback Machine (Internet Archive) captures snapshots of websites over time, including the URLs.
If a website previously exposed sensitive data in its URLs (e.g., in query strings), these URLs might be archived and accessible.
#====================================================================
The filter parameter in the Wayback Machine's CDX (Capture Index) API is powerful for narrowing down your search results. While, as noted, it doesn't directly help you find sensitive data embedded within the URL string itself, it does help you find URLs that point to resources that are more likely to contain sensitive data.
The general format for filtering is &filter=<field>:<value_regex>. You can also negate a filter with &filter=!<field>:<value_regex>.
Common fields you can filter by:
statuscode: HTTP status code (e.g., 200, 301, 404, 500)
mimetype: MIME type of the captured content (e.g., text/html, application/pdf, image/jpeg, application/json)
original: The original URL (useful for more complex regex filtering on the URL itself, beyond what the main url parameter can do)
digest: SHA1 hash of the content
length: Size of the content in bytes
offset: Byte offset of the content in the WARC file
urlkey: A normalized form of the URL used internally by the Archive
timestamp: The capture timestamp (YYYYMMDDhhmmss)
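Example combining these filters (a sketch; matchType=domain includes subdomains, and the chosen filters are just one plausible combination):
curl "http://web.archive.org/cdx/search/cdx?url=harvard.edu&matchType=domain&filter=statuscode:200&filter=mimetype:application/json&fl=timestamp,original&collapse=urlkey"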
#====================================================================
Sensitive data in URLs can include:
Personally Identifiable Information (PII): Names, email addresses, phone numbers, addresses, social security numbers, dates of birth.
Authentication/Session Information: Session IDs, authentication tokens, API keys, temporary passwords.
Financial Information: Credit card numbers, bank account details.
Confidential Business Information: Internal document IDs, project names, proprietary data.
Common sensitive data indicators in URLs:
password= or pwd=
token= or auth=
sessionid= or jsessionid=
user= or username=
email=
ssn=
account= or acc=
File extensions that often contain sensitive data: .pdf, .xls, .xlsx, .doc, .docx, .sql, .bak, .zip, .tar, .gz, .log, .config, .csv, .yaml, .md, .key, .pem, .crt.
URLs containing numerical sequences that could be IDs or financial data.
Analyze the results: The CDX API will return a list of URLs. You will need to manually or programmatically review these URLs to identify any that appear to contain sensitive information (see the example commands after this list). Look for:
Clear text parameters: parameters in the URL query string that directly expose sensitive data (e.g., ?username=john.doe&password=plaintextpassword).
Base64 encoded strings: Sometimes sensitive data is base64 encoded within a URL parameter. While not plain text, it's easily decoded.
Long, complex strings: These might be session tokens or API keys.
File paths: Look for paths that might lead to sensitive files, such as /admin/, /backup/, /private/, or specific file names like config.php, database.sql.
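Example command-line review for the indicators above (a sketch; urls.txt is a local working file, and the parameter and extension lists mirror the ones given here and can be extended):
curl -s "http://web.archive.org/cdx/search/cdx?url=www.harvard.edu/*&fl=original&collapse=urlkey" -o urls.txt
grep -Ei "password=|pwd=|token=|auth=|sessionid=|user=|email=|ssn=|account=" urls.txt
grep -Ei "\.(sql|bak|zip|tar|gz|log|config|csv|yaml|key|pem|crt)" urls.txt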
Sensitive data in URLs typically appears as:
Query parameters:
?token=abcd1234
?ssn=123-45-6789
?api_key=abcdef12345
Session IDs:
?sessionid=xyz
User credentials or emails:
?user=john&password=secret
File paths exposing private data:
/internal/config.php
Use wildcards to find specific file types: harvard.edu/*.sql, harvard.edu/*.txt, harvard.edu/*.log
Check for common sensitive paths like harvard.edu/admin/*, harvard.edu/config/*, harvard.edu/.env
Look for error pages that might reveal system information: filter harvard.edu/* results for terms like "500" or "error"
Search for backup files: harvard.edu/*.bak, harvard.edu/*.old, harvard.edu/*.backup
Check development/staging subdomains: dev.harvard.edu/*, staging.harvard.edu/*, test.harvard.edu/*
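Equivalent Wayback web UI wildcard URLs for the searches above (a sketch following the same URL pattern used elsewhere in this sheet):
https://web.archive.org/web/*/harvard.edu/*.sql
https://web.archive.org/web/*/harvard.edu/*.bak
https://web.archive.org/web/*/dev.harvard.edu/*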
#====================================================================
Time-Based Analysis:
Focus on older snapshots (2000s-2010s) when security practices were less mature
Look for periods around major site updates or migrations
Check snapshots from different years to see what changed
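Example time-bounded CDX query for older snapshots (a sketch; from/to accept partial timestamps such as years):
curl "http://web.archive.org/cdx/search/cdx?url=www.harvard.edu/*&from=2005&to=2012&fl=timestamp,original&collapse=urlkey"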
#====================================================================
Common Sensitive Data Patterns:
Database connection strings
API keys and tokens
Directory listings showing internal structure
Configuration files (.htaccess, web.config, etc.)
Debug pages with system information
Employee directories or internal documents
#====================================================================
Manual Search via Web Interface
Visit https://web.archive.org.
Enter https://www.harvard.edu in the search bar.
Browse archived snapshots (timeline/calendar view).
Check URLs in the browser's address bar when viewing a snapshot. Look for:
Parameters like ?id=..., ?token=..., etc.
Unusual path segments (e.g., /admin/, /debug/).
Inspect links on the archived page for sensitive parameters.
#====================================================================
Use Wayback Machine's CDX API
# Example API request for all captures of harvard.edu
curl "http://web.archive.org/cdx/search/cdx?url=www.harvard.edu/*&output=json&filter=statuscode:200"
curl "http://web.archive.org/cdx/search/cdx?url=www.harvard.edu/*&output=json&fl=original&collapse=urlkey"
url=www.harvard.edu/* → Targets all subpaths
output=json → JSON format
fl=original → Only original URLs
collapse=urlkey → Remove duplicates
This returns a list of URLs archived over time.
Look for common sensitive parameters:
password, token, key, auth, session, email, etc.
Command-line filtering (deduplicated original URLs):
curl "http://web.archive.org/cdx/search/cdx?url=www.harvard.edu/*&output=json&fl=original&collapse=urlkey" \
| grep -Ei "token|key|auth|session|password|email"
Command-line filtering (only HTTP 200 captures):
curl "http://web.archive.org/cdx/search/cdx?url=www.harvard.edu/*&output=json&filter=statuscode:200" \
| grep -Ei "token|key|auth|session|password|email"
#====================================================================
https://web.archive.org/web/*/https://www.harvard.edu/*?token=*
https://web.archive.org/web/*/https://www.harvard.edu/*?session=*
#====================================================================