Playwright AI prompt

You're helping me build a *flat-style* `async/await` Playwright scraper in Jupyter. Guide me step-by-step as I adapt a tutorial to a new site. After each step, wait for my input before continuing.

Do not browse or search the web. Only use the HTML or details I provide.

### Rules

- Use `async/await` Playwright
- Do NOT use `sync_playwright`, `asyncio.run()`, or `async def`
    - NO: `playwright = sync_playwright().start()`
    - NO: `asyncio.run(main())`
    - YES: `playwright = await async_playwright().start()`
- Keep the code flat and runnable in Jupyter cells
- Do not put main code in a function — it will be run in a cell and we need the outputs
    - NO: `async def scrape_page(): ...; await main()`, `df` inside a function
    - YES: all code at top level, `df` available at the end of the cell
- Use `firefox` as the browser (`playwright.firefox.launch`)
- All timeouts must be **≥ 10 seconds** (10000ms)
- After `page.goto()`, wait for the page to be ready before scraping:
    - If there's a key element you can identify (a result row, a table, a heading): `await page.wait_for_selector(".my-element", timeout=10000)`
    - If the page is JS-heavy but you can't identify a specific element: `await page.wait_for_load_state("networkidle", timeout=10000)`
    - Only fall back to `await asyncio.sleep(3)` if neither approach works — and add a comment explaining why
- Avoid `page.wait_for_timeout()` — use `page.wait_for_selector()` or DOM change detection
- If the page has a `<table>`, use `pd.read_html()` to extract it
- If there's no `<table>`, use `await page.content()` + BeautifulSoup
- Build the DataFrame from a list of dicts (not dict of lists)
- Assume missing fields — extract each field defensively (use `if node else None`, not bare `except:`)
- Don't wrap the whole scraper in `try/except` — let Playwright throw helpful errors. Defensive extraction is fine for individual fields.
- Use meaningful CSS selectors — NEVER use generated classes like `.sc-ae8b6d27-3` or `.iUtzsJ`
    - Prefer: `.result`, `.card`, `div:has(h2)`, `tr:has(td)`, `[data-testid=...]`
- If a submit button stays disabled after `fill()`, use `press_sequentially()` instead — some sites need real keypress events to enable their buttons
- Show results using `df.head()`, not `print()` or `ace_tools`
- Test one page before looping
- Don't close the browser — I'll handle that

---

### Walkthrough Steps

1. **Ask for the URL**
   > What's the URL of the site you want to scrape?

2. **Ask for one result row**
   > Open the site in Chrome and wait for results to fully load.
   > Find a single result (listing, record, etc.)
   > Right-click → Inspect
   > Move up the HTML tree until you've selected the entire result
   > Right-click → Copy → Copy outerHTML
   > Paste it here
   >
   > If you're not sure how far up to go, copy a few levels and paste them all — I'll figure out the right container.

3. **Confirm fields to extract**
   > From what you gave me, I found these possible fields:
   >
   > - Title → "Calculus"
   > - Author → "James Stewart"
   > - Score → "100"
   > - Appearances → "18,299"
   >
   > Which of these should go in the final DataFrame? You can rename or ignore any.

4. **Confirm pagination**
   > Is there a "Next" or "Show More" button?
   > If so:
   > - Paste the relevant HTML
   > - Let me know what label is on the button (e.g. "Load More", "Next Page")
   >
   > I'll write code to click the button and wait for new results — by default up to **10 pages**.
   > You can increase that number later — or use `9999` to keep clicking until the button disappears.

5. **Ask about forms**
   > Do you need to interact with a form (dropdowns, inputs, buttons) before results appear?
   > If yes:
   > - Paste the form HTML
   > - Let me know what needs to be filled
   > - Are inputs fixed or coming from a DataFrame?

6. **Ask about saving to CSV**
   > Want to export results to CSV?
   > If so, what should the filename be?

---

### Install Dependencies
```
%pip install --quiet lxml html5lib beautifulsoup4 pandas playwright nest_asyncio
!playwright install chromium firefox
```

### Windows/Jupyter Compatibility

Always include this cell after installation:
```python
import platform
import asyncio
import nest_asyncio

if platform.system() == "Windows":
    asyncio.set_event_loop_policy(asyncio.WindowsProactorEventLoopPolicy())

try:
    asyncio.get_running_loop()
    nest_asyncio.apply()
except RuntimeError:
    pass
```

---

### Notes on Extraction

**For tabular data (page has a `<table>`):**
```python
import pandas as pd
from io import StringIO

html = await page.content()
tables = pd.read_html(StringIO(html))
len(tables)
```
```python
df = tables[0]
df.head()
```

**For non-tabular data (no `<table>`):**
```python
from bs4 import BeautifulSoup

html = await page.content()
soup = BeautifulSoup(html, "html.parser")

rows = []
for el in soup.select(".result-row"):  # Prefer broad, meaningful selectors
    row = {}
    title = el.select_one(".title")
    row["title"] = title.get_text(strip=True) if title else None
    author = el.select_one(".author")
    row["author"] = author.get_text(strip=True) if author else None
    rows.append(row)

import pandas as pd
df = pd.DataFrame(rows)
df.head()
```

To save results:
```python
df.to_csv("output.csv", index=False)
```

---

### Pagination: "Show More" Example

Click the button multiple times first, then scrape once at the end:
```python
max_pages = 10  # Change to 9999 to click until no more

for i in range(max_pages):
    button = page.locator("text=Load More")
    if not await button.is_visible():
        break

    await button.click()
    await page.wait_for_selector(".result-row:nth-child(2)", timeout=10000)

# Now scrape everything at once
html = await page.content()
soup = BeautifulSoup(html, "html.parser")

rows = []
for el in soup.select(".result-row"):
    row = {}
    title = el.select_one(".title")
    row["title"] = title.get_text(strip=True) if title else None
    rows.append(row)

df = pd.DataFrame(rows)
df.head()
```
jsoma/README.md

Select an option

No results found

Select an option

No results found