ConfoundingVariables

I identified ten leading self-hosted web-scraping and browser-automation frameworks, spanning headless-browser drivers, high-level crawlers, and API-first services. Only Firecrawl offers built-in Markdown output, converting scraped pages directly into clean Markdown; all others require either custom pipelines or external libraries (e.g. Turndown) to transform HTML/text into Markdown. Here’s a quick rundown:

  • Firecrawl (Node, Python, Go SDKs): API-first scraper with native Markdown output and AI-powered extraction
  • Playwright (TS/JS, Python, C#, Java): Cross-browser headless automation; no native Markdown conversion
  • Puppeteer (JS/TS): Headless Chrome/Firefox control; requires manual Markdown transformation
  • Scrapy (Python): Asynchronous HTTP crawler; extensible pipelines but no built-in Markdown
  • Apify SDK (JS/TS): Scalable crawler on Puppeteer; rich API but no Markdown