What are you using for your LLM execution? If you're using something like LlamaIndex, it can work with multiple documents surprisingly well, especially in markdown.
For example, here's a script I tried out with Scrapfly. It scrapes multiple pages as markdown and then loads them all into an index you can query (with OpenAI in this case):
import os
from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.readers.web import ScrapflyReader
# 1. Add OpenAI API key
os.environ["OPENAI_API_KEY"] = "YOUR OPENAI API KEY"
# 2. Set up ScrapflyReader with your ScrapFly API key
scrapfly_reader = ScrapflyReader(
    api_key="YOUR SCRAPFLY KEY",
)
# 3. Scrape web pages as markdown
documents = scrapfly_reader.load_data(
    urls=[
        "https://web-scraping.dev/products?page=1",
        "https://web-scraping.dev/products?page=2",
    ],
    scrape_format="markdown",
)
# 4. Create index and query engine
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(
    llm=OpenAI(model="gpt-3.5-turbo-0125"),
)
# 5. Run your query but be specific about multiple pages
response = query_engine.query(
    """
    Given the data fetched from the specified product paging URLs,
    extract title, description and price of each product preview from each page in JSON format
    """
)
print(response)
"""
{
    "Page 1": [
        {
            "title": "Orange Chocolate Box",
            "description": "Medium size chocolate box with orange flavor",
            "price": "$15.99"
        },
        {
            "title": "Dark Red Potion",
            "description": "Mysterious dark red potion with magical properties",
            "price": "$9.99"
        },
        {
            "title": "Teal Potion",
            "description": "Refreshing teal-colored potion for vitality",
            "price": "$12.49"
        },
        {
            "title": "Red Potion",
            "description": "Classic red potion for healing and energy",
            "price": "$7.99"
        },
        {
            "title": "Blue Potion",
            "description": "Cool blue potion for enhancing abilities",
            "price": "$10.99"
        }
    ],
    "Page 2": [
        {
            "title": "Dragon Potion",
            "description": "Powerful potion with dragon essence",
            "price": "$19.99"
        },
        {
            "title": "Hiking Boots",
            "description": "Sturdy hiking boots for outdoor adventures",
            "price": "$49.99"
        },
        {
            "title": "Women's Sandals",
            "description": "Elegant beige sandals for summer style",
            "price": "$29.99"
        },
        {
            "title": "Men's Running Shoes",
            "description": "Sleek running shoes for fitness enthusiasts",
            "price": "$39.99"
        },
        {
            "title": "Kids Light-Up Sneakers",
            "description": "Fun light-up sneakers for kids",
            "price": "$24.99"
        }
    ]
}
"""
Note that with LlamaIndex you often need to be specific in your prompt about how much of the index the query should cover, since cheap models like gpt-3.5-turbo tend to be lazy and may stop after the first page. For even better results you can apply more data transformations (see the LlamaIndex docs for more on that), but just markdown with LlamaIndex will get you surprisingly far!
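If you want to use the result in code rather than just print it, something like the following might work. It's only a rough sketch: `parse_products` is a made-up helper (not part of LlamaIndex or Scrapfly), and it assumes the model really does reply with JSON shaped like the output above, which a cheap model won't always do.

```python
import json
import re


def parse_products(response_text: str) -> list[dict]:
    """Flatten a {"Page N": [product, ...]} JSON reply into one list of products.

    Hypothetical helper: assumes the reply has the same shape as the sample output.
    """
    # Grab everything from the first "{" to the last "}" so that surrounding
    # chatter or a markdown code fence around the JSON is ignored.
    match = re.search(r"\{.*\}", response_text, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in the response")
    pages = json.loads(match.group(0))

    products = []
    for page_name, items in pages.items():
        for item in items:
            # Record which page each product came from.
            products.append({"page": page_name, **item})
    return products


# Shortened copy of the sample output, used here instead of a live query.
sample = """
{
    "Page 1": [
        {"title": "Orange Chocolate Box", "description": "Medium size chocolate box with orange flavor", "price": "$15.99"}
    ],
    "Page 2": [
        {"title": "Dragon Potion", "description": "Powerful potion with dragon essence", "price": "$19.99"}
    ]
}
"""

for product in parse_products(sample):
    print(product["page"], product["title"], product["price"])
```

With the real script you'd pass `str(response)` instead of `sample`. If the model returns something that isn't valid JSON, `json.loads` will raise, so it's worth wrapping the call in a try/except and retrying the query with a stricter prompt.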