How to use Llamaindex to scrape multiple pages and parse data with LLMs

What are you using to run your LLM? A framework like LlamaIndex handles multiple documents surprisingly well, especially when they're in markdown.

For example, here's a script I tried out with Scrapfly. It scrapes multiple pages as markdown, then loads them all into an index you can query (with OpenAI in this case):

import os
from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.readers.web import ScrapflyReader

# 1. Add your OpenAI API key
os.environ["OPENAI_API_KEY"] = "YOUR OPENAI API KEY"

# 2. Set up ScrapflyReader with your ScrapFly API key
scrapfly_reader = ScrapflyReader(
    api_key="YOUR SCRAPFLY KEY",
)

# 3. Scrape the web pages as markdown
documents = scrapfly_reader.load_data(
    urls=[
        "https://web-scraping.dev/products?page=1",
        "https://web-scraping.dev/products?page=2",
    ],
    scrape_format="markdown",
)


# 4. Create index and query engine
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(
    llm=OpenAI(model="gpt-3.5-turbo-0125"),
)

# 5. Run your query, and be explicit that it covers multiple pages
response = query_engine.query(
"""
Given the data fetched from the specified product paging URLs,
extract title, description and price of each product preview from each page in JSON format
"""
)
print(response)
# Example output:
"""
{
    "Page 1": [
        {
            "title": "Orange Chocolate Box",
            "description": "Medium size chocolate box with orange flavor",
            "price": "$15.99"
        },
        {
            "title": "Dark Red Potion",
            "description": "Mysterious dark red potion with magical properties",
            "price": "$9.99"
        },
        {
            "title": "Teal Potion",
            "description": "Refreshing teal-colored potion for vitality",
            "price": "$12.49"
        },
        {
            "title": "Red Potion",
            "description": "Classic red potion for healing and energy",
            "price": "$7.99"
        },
        {
            "title": "Blue Potion",
            "description": "Cool blue potion for enhancing abilities",
            "price": "$10.99"
        }
    ],
    "Page 2": [
        {
            "title": "Dragon Potion",
            "description": "Powerful potion with dragon essence",
            "price": "$19.99"
        },
        {
            "title": "Hiking Boots",
            "description": "Sturdy hiking boots for outdoor adventures",
            "price": "$49.99"
        },
        {
            "title": "Women's Sandals",
            "description": "Elegant beige sandals for summer style",
            "price": "$29.99"
        },
        {
            "title": "Men's Running Shoes",
            "description": "Sleek running shoes for fitness enthusiasts",
            "price": "$39.99"
        },
        {
            "title": "Kids Light-Up Sneakers",
            "description": "Fun light-up sneakers for kids",
            "price": "$24.99"
        }
    ]
}
"""
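
Since the query asks for JSON, you'll probably want to turn the reply into Python data. Here's a minimal sketch, assuming the model returns a single JSON object, possibly wrapped in extra text. `parse_json_response` is a hypothetical helper, not part of LlamaIndex, and the sample string is a shortened stand-in for the output above:

```python
import json


def parse_json_response(text: str) -> dict:
    """Decode the JSON object found between the first "{" and the last "}"."""
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in response")
    return json.loads(text[start : end + 1])


# Shortened stand-in for the model reply shown above (one product per page)
sample = """Here is the extracted data:
{
    "Page 1": [{"title": "Orange Chocolate Box", "price": "$15.99"}],
    "Page 2": [{"title": "Dragon Potion", "price": "$19.99"}]
}
"""

products = parse_json_response(sample)
for page, items in products.items():
    for item in items:
        print(f"{page}: {item['title']} ({item['price']})")
```

With the script above you would call it as `parse_json_response(str(response))`. Cheap models don't always return valid JSON, so it's worth catching `ValueError` (which `json.JSONDecodeError` subclasses) and retrying the query.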

Note that with LlamaIndex you often need to be specific in your prompt about how much of the index the answer should cover, since cheap models like gpt-3.5-turbo are naturally quite lazy. For even better results you can apply more data transformations (see the LlamaIndex docs for more on that), but plain markdown with LlamaIndex will get you surprisingly far!
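
One way to be specific is to spell out every page in the prompt itself. This is a rough sketch, not something from the LlamaIndex docs: `build_extraction_prompt` is a made-up helper and the wording of the prompt is just an assumption about what nudges the model to cover every page:

```python
def build_extraction_prompt(urls: list[str], fields: list[str]) -> str:
    """Build a query that names every page and every field explicitly."""
    page_lines = "\n".join(
        f"- Page {i}: {url}" for i, url in enumerate(urls, start=1)
    )
    field_list = ", ".join(fields)
    return (
        f"The index contains {len(urls)} scraped product listing pages:\n"
        f"{page_lines}\n"
        f"For every one of these {len(urls)} pages, extract the {field_list} "
        "of each product preview. Do not skip any page. "
        'Return JSON with one key per page ("Page 1", "Page 2", ...).'
    )


urls = [
    "https://web-scraping.dev/products?page=1",
    "https://web-scraping.dev/products?page=2",
]
prompt = build_extraction_prompt(urls, ["title", "description", "price"])
print(prompt)
# Then, with the query_engine from the script above:
# response = query_engine.query(prompt)
```

If pages still get skipped, the retriever's `similarity_top_k` setting (passed to `as_query_engine`) controls how many chunks reach the model and may be worth raising. Check the LlamaIndex docs for the current default.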
