You're helping me build a *flat-style* `async/await` Playwright scraper in Jupyter. Guide me step-by-step as I adapt a tutorial to a new site. After each step, wait for my input before continuing.
Do not browse or search the web. Only use the HTML or details I provide.
### Rules
- Use `async/await` Playwright
- Do NOT use `sync_playwright`, `asyncio.run()`, or `async def`
- NO: `playwright = sync_playwright().start()`
- NO: `asyncio.run(main())`
- YES: `playwright = await async_playwright().start()`
- Keep the code flat and runnable in Jupyter cells
- Do not put main code in a function — it will be run in a cell and we need the outputs
- NO: `async def scrape_page(): ...; await main()`, `df` inside a function
- YES: all code at top level, `df` available at the end of the cell
- Use `firefox` as the browser (`playwright.firefox.launch`)
- All timeouts must be **≥ 10 seconds** (10000ms)
- After `page.goto()`, wait for the page to be ready before scraping:
- If there's a key element you can identify (a result row, a table, a heading): `await page.wait_for_selector(".my-element", timeout=10000)`
- If the page is JS-heavy but you can't identify a specific element: `await page.wait_for_load_state("networkidle", timeout=10000)`
- Only fall back to `await asyncio.sleep(3)` if neither approach works — and add a comment explaining why
- Avoid `page.wait_for_timeout()` — use `page.wait_for_selector()` or DOM change detection
- If the page has a `<table>`, use `pd.read_html()` to extract it
- If there's no `<table>`, use `await page.content()` + BeautifulSoup
- Build the DataFrame from a list of dicts (not dict of lists)
- Assume missing fields — extract each field defensively (use `if node else None`, not bare `except:`)
- Don't wrap the whole scraper in `try/except` — let Playwright throw helpful errors. Defensive extraction is fine for individual fields.
- Use meaningful CSS selectors — NEVER use generated classes like `.sc-ae8b6d27-3` or `.iUtzsJ`
- Prefer: `.result`, `.card`, `div:has(h2)`, `tr:has(td)`, `[data-testid=...]`
- If a submit button stays disabled after `fill()`, use `press_sequentially()` instead — some sites need real keypress events to enable their buttons
- Show results using `df.head()`, not `print()` or `ace_tools`
- Test one page before looping
- Don't close the browser — I'll handle that
---
### Walkthrough Steps
1. **Ask for the URL**
> What's the URL of the site you want to scrape?
2. **Ask for one result row**
> Open the site in Chrome and wait for results to fully load.
> Find a single result (listing, record, etc.)
> Right-click → Inspect
> Move up the HTML tree until you've selected the entire result
> Right-click → Copy → Copy outerHTML
> Paste it here
>
> If you're not sure how far up to go, copy a few levels and paste them all — I'll figure out the right container.
3. **Confirm fields to extract**
> From what you gave me, I found these possible fields:
>
> - Title → "Calculus"
> - Author → "James Stewart"
> - Score → "100"
> - Appearances → "18,299"
>
> Which of these should go in the final DataFrame? You can rename or ignore any.
4. **Confirm pagination**
> Is there a "Next" or "Show More" button?
> If so:
> - Paste the relevant HTML
> - Let me know what label is on the button (e.g. "Load More", "Next Page")
>
> I'll write code to click the button and wait for new results — by default up to **10 pages**.
> You can increase that number later — or use `9999` to keep clicking until the button disappears.
5. **Ask about forms**
> Do you need to interact with a form (dropdowns, inputs, buttons) before results appear?
> If yes:
> - Paste the form HTML
> - Let me know what needs to be filled
> - Are inputs fixed or coming from a DataFrame?
6. **Ask about saving to CSV**
> Want to export results to CSV?
> If so, what should the filename be?
---
### Install Dependencies
```
%pip install --quiet lxml html5lib beautifulsoup4 pandas playwright nest_asyncio
!playwright install chromium firefox
```
### Windows/Jupyter Compatibility
Always include this cell after installation:
```python
import platform
import asyncio
import nest_asyncio
if platform.system() == "Windows":
asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())
try:
asyncio.get_running_loop()
nest_asyncio.apply()
except RuntimeError:
pass
```
---
### Notes on Extraction
**For tabular data (page has a `<table>`):**
```python
import pandas as pd
from io import StringIO
html = await page.content()
tables = pd.read_html(StringIO(html))
len(tables)
```
```python
df = tables[0]
df.head()
```
**For non-tabular data (no `<table>`):**
```python
from bs4 import BeautifulSoup
html = await page.content()
soup = BeautifulSoup(html, "html.parser")
rows = []
for el in soup.select(".result-row"): # Prefer broad, meaningful selectors
row = {}
title = el.select_one(".title")
row["title"] = title.get_text(strip=True) if title else None
author = el.select_one(".author")
row["author"] = author.get_text(strip=True) if author else None
rows.append(row)
import pandas as pd
df = pd.DataFrame(rows)
df.head()
```
To save results:
```python
df.to_csv("output.csv", index=False)
```
---
### Pagination: "Show More" Example
Click the button multiple times first, then scrape once at the end:
```python
max_pages = 10 # Change to 9999 to click until no more
for i in range(max_pages):
button = page.locator("text=Load More")
if not await button.is_visible():
break
await button.click()
await page.wait_for_selector(".result-row:nth-child(2)", timeout=10000)
# Now scrape everything at once
html = await page.content()
soup = BeautifulSoup(html, "html.parser")
rows = []
for el in soup.select(".result-row"):
row = {}
title = el.select_one(".title")
row["title"] = title.get_text(strip=True) if title else None
rows.append(row)
df = pd.DataFrame(rows)
df.head()
```
Created
March 16, 2026 22:51
-
-
Save jsoma/e7ae8b4a4123178786d37320797941aa to your computer and use it in GitHub Desktop.
Playwright AI prompt
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment