Ethical Hacker Tip: Scrape Wayback Machine from CLI Using curl

In this guide, we’ll explore how to scrape URLs from the Wayback Machine (archive.org) using command-line tools like curl, sed, and cut. This technique is useful for penetration testers, bug bounty hunters, and cybersecurity researchers who need historical website data.

Step-by-Step Commands

1. Set the Target URL

First, store the target URL in a Bash variable:

encoded_url="https://example.com"  # Replace with your target URL
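The variable is named `encoded_url`, but the value shown is a plain URL. If the target contains characters such as `?`, `&`, or spaces, it may need percent-encoding before it is placed in the query string. Here is a minimal sketch, assuming a pure-Bash helper called `urlencode` (a hypothetical name, not part of the original commands) and a made-up target path:

```shell
#!/usr/bin/env bash
# Hypothetical helper: percent-encode a string for use as a query parameter value.
urlencode() {
  local LC_ALL=C s="$1" out="" c hex i
  for (( i = 0; i < ${#s}; i++ )); do
    c="${s:i:1}"
    case "$c" in
      # Unreserved characters pass through unchanged.
      [a-zA-Z0-9.~_-]) out+="$c" ;;
      # Everything else becomes %XX (uppercase hex of the character code).
      *) printf -v hex '%%%02X' "'$c"; out+="$hex" ;;
    esac
  done
  printf '%s\n' "$out"
}

# Example target (an assumption, used only for illustration).
encoded_url="$(urlencode "https://example.com/path?x=1")"
echo "$encoded_url"
```

For a bare domain such as `example.com`, the encoded and unencoded values are identical, so this step matters only for targets with paths or query strings.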

2. Fetch Wayback Machine Data with curl

Use `curl` to retrieve archived URLs in JSON format:

curl -s "https://web.archive.org/web/timemap/json?url=$encoded_url&matchType=prefix&collapse=urlkey&output=json&fl=original,mimetype,timestamp,endtimestamp,groupcount,uniqcount&limit=10000&_=$(date +%s)" -o output.json

– `-s` silences unnecessary output.
– `-o output.json` saves results to a file.
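The request URL is long, so it can help to assemble it from named parts. The sketch below assumes a helper called `wayback_query_url` (a hypothetical name) and leaves out the trailing `_=$(date +%s)` cache-busting parameter for brevity. The comments reflect my reading of the CDX-style parameters and should be treated as assumptions rather than official documentation:

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the timemap request URL from its parts.
wayback_query_url() {
  local target="$1"            # domain or URL prefix to look up
  local limit="${2:-10000}"    # maximum number of rows to request
  local base="https://web.archive.org/web/timemap/json"
  # Fields requested per row, as in the original command.
  local fields="original,mimetype,timestamp,endtimestamp,groupcount,uniqcount"
  # matchType=prefix : include every URL that starts with the target
  # collapse=urlkey  : one row per distinct URL rather than per capture
  printf '%s?url=%s&matchType=prefix&collapse=urlkey&output=json&fl=%s&limit=%s\n' \
    "$base" "$target" "$fields" "$limit"
}

# Print the URL that would be requested for a small test run.
wayback_query_url "example.com" 50

# To actually fetch it (network call, left commented out here):
# curl -s "$(wayback_query_url "example.com")" -o output.json
```

Starting with a small `limit` makes it easier to inspect the response before pulling the full 10000 rows.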

3. Extract Clean URLs with sed & cut

Add this handy `Xurl` alias to filter URLs:

alias Xurl='sed "s/http/\nhttp/g" | grep ^http | sed "s/\(^http[^ <]*\)\(.*\)/\1/g" | tr "\"" " " | tr ">" " " | tr "," " " | tr ")" " "'

Now, pipe the `curl` output to extract clean URLs:

curl -s "https://web.archive.org/web/timemap/json?url=$encoded_url&matchType=prefix&collapse=urlkey&output=json&fl=original,mimetype,timestamp,endtimestamp,groupcount,uniqcount&limit=10000&_=$(date +%s)" | Xurl | cut -d " " -f 1

– `cut -d " " -f 1` splits each line on spaces and keeps the first field (the URL).
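Bash does not expand aliases in non-interactive scripts by default, so a shell function may be more convenient when the extraction step moves into a script. The sketch below is an alternative rather than the original alias: it assumes a function named `xurl` (hypothetical) that uses `grep -o` to print each URL directly, and it runs on made-up sample rows shaped like a JSON array of arrays instead of a live response:

```shell
#!/usr/bin/env bash
# Hypothetical function: print every http(s) URL found on stdin, one per line.
# A URL ends at the first space, quote, bracket, angle bracket, comma, or ")".
xurl() {
  grep -oE 'https?://[^][ "<>,)]+'
}

# Made-up sample rows for illustration only; real responses will differ.
sample='[["original","mimetype","timestamp"],
["http://example.com/","text/html","20020120142510"],
["https://example.com/old-api?id=1","application/json","20190304050607"]]'

printf '%s\n' "$sample" | xurl
```

Because `grep -o` emits only the matched text, the separate `cut` step is not needed with this variant.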

4. Save Output to a File

To store results for later analysis:

curl -s "https://web.archive.org/web/timemap/json?url=$encoded_url&matchType=prefix&collapse=urlkey&output=json&fl=original,mimetype,timestamp,endtimestamp,groupcount,uniqcount&limit=10000&_=$(date +%s)" | Xurl | cut -d " " -f 1 > wayback_urls.txt

You Should Know:

  • Alternative Tools:
    • Use `waybackurls` (by tomnomnom) for faster scraping.
      echo "example.com" | waybackurls

  • Filtering Results:
    • Extract only JavaScript files (the dot is escaped so it matches a literal "."):
      grep '\.js$' wayback_urls.txt

    • Find URLs that carry query parameters, for XSS testing:
      grep '?.*=' wayback_urls.txt

  • Automation with Bash:
      # Read one target per line from targets.txt
      while IFS= read -r url; do
        echo "$url" | waybackurls >> all_wayback.txt
      done < targets.txt

  • Check for Sensitive Data:
      grep -E "api|admin|config|token|key" wayback_urls.txt
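When hunting for parameters to test, a list of distinct parameter names is often more useful than the full URLs. Here is a minimal sketch, assuming a helper named `param_names` (hypothetical) and a few made-up URLs in place of `wayback_urls.txt`:

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the distinct query parameter names seen on stdin.
param_names() {
  # Match "?name=" or "&name=", strip the delimiters, then de-duplicate.
  grep -oE '[?&][^=&#]+=' | tr -d '?&=' | sort -u
}

# Made-up sample URLs for illustration only.
urls='https://example.com/search?q=test&page=2
https://example.com/item?id=5
https://example.com/static/app.js'

printf '%s\n' "$urls" | param_names
```

With a real results file, the same helper would be fed as `param_names < wayback_urls.txt`.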
    

What Undercode Say:

The Wayback Machine is a goldmine for cybersecurity professionals. By automating URL extraction, you can uncover hidden endpoints, deprecated APIs, and forgotten files that may expose vulnerabilities. Combine this with tools like `grep`, `ffuf`, and `nuclei` for deeper reconnaissance.

Expected Output:

https://example.com/page1 
https://example.com/page2 
https://example.com/old-api 
... 

References:

Reported By: Activity 7316216822880456704 – Hackers Feeds