Web Scraping Using LLMs

Juveria Dalvi
6 min read · Aug 1, 2024


Now that we can scrape websites with Python and libraries like BeautifulSoup, Requests, and Pandas (see Scrape Amazon Product Reviews With Python), let's take a step further and learn how an LLM can simplify the process. Before we get to the scraping itself, let's define the terminology, starting with what an LLM is. If you are unfamiliar with LangChain, AI, or NLP, you are in the right place.

What is an LLM?

LLM stands for Large Language Model. It is a machine learning model trained on a large body of text, referred to as a corpus. "Large" refers to the sheer volume of training data: while a typical file on your computer is measured in gigabytes, an LLM may have been trained on terabytes of text. This extensive training is what allows LLMs to answer questions about such a wide range of topics, and, used wisely, they can be applied to a variety of tasks, including summarization, question answering, and translation. And just as Python has libraries and frameworks built around it, so do LLMs.

The Term ScrapeGraphAI

ScrapeGraphAI is an advanced Python package that transforms web scraping by combining Large Language Models (LLMs) with configurable, graph-based pipelines. It makes structured data extraction from web pages and local documents straightforward: users can pull out the information they need with a single natural-language prompt. Because its language models can comprehend intricate page structures, manual parsing and convoluted rule-based extraction systems become a thing of the past.

The library provides a variety of specialized graph classes, namely SmartScraperGraph for single-page scraping, SearchGraph for multi-page extraction from search results, and ScriptCreatorGraph for generating customized scraping scripts. With ScrapeGraphAI you can select the AI backend that best fits your scraping requirements from a range of LLM providers, such as OpenAI, Groq, and Azure, in addition to local models via Ollama, as sketched below.
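For instance, switching to a locally hosted model is mostly a matter of configuration. Below is a minimal sketch based on ScrapeGraphAI's documented config format; it assumes you have Ollama running on its default port and have already pulled a chat model and an embedding model (the llama3 and nomic-embed-text names are illustrative).

from scrapegraphai.graphs import SearchGraph

# Point the pipeline at a local Ollama backend instead of a hosted API.
graph_config = {
    "llm": {
        "model": "ollama/llama3",              # chat model pulled into Ollama (assumed)
        "temperature": 0,
        "base_url": "http://localhost:11434",  # Ollama's default endpoint
    },
    "embeddings": {
        "model": "ollama/nomic-embed-text",    # embedding model (assumed)
        "base_url": "http://localhost:11434",
    },
}

# SearchGraph runs a web search and extracts from several result pages.
search_graph = SearchGraph(
    prompt="List popular Python web scraping libraries with a one-line summary of each.",
    config=graph_config,
)

print(search_graph.run())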

Let us begin by extracting data from a website using an LLM.

ScrapeGraphAI is an open-source Python toolkit that modernizes web scraping through its integration of Large Language Models (LLMs) and modular graph-based pipelines. This tutorial will walk you through the fundamentals of using the library, and along the way we will look at a few companion frameworks too.

Installation

First, let’s install ScrapeGraphAI. Run the following command in your terminal:

pip install langchain pydantic python-dotenv scrapegraphai

Note: I will be using the Google Colab environment.

The setup cell sketched below does the following:

  1. Install Packages: it updates and installs the tools needed for web scraping and for working with OpenAI.
  2. Import Modules: it brings in helpers for securely reading your API key and managing settings.
  3. Set API Key: you enter your OpenAI API key safely, and it gets stored for use in your project.
  4. Allow Nested Async Operations: it adjusts the event loop so multiple asynchronous tasks can run inside the notebook, which ScrapeGraphAI relies on.

Learn how to generate an OpenAI API key so you can plug it into the code below.
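Here is a minimal sketch of such a setup cell. It assumes a Colab notebook, uses getpass so the key never appears in the notebook output, and uses the nest_asyncio package for step 4; adapt it to your own environment as needed.

import os
from getpass import getpass

import nest_asyncio

# Step 1 happens in its own cell: run the pip command above, prefixed
# with "!" in Colab.

# Steps 2 and 3: prompt for the OpenAI API key without echoing it and
# store it in an environment variable for the rest of the session.
os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")

# Step 4: Colab already runs an event loop, so patch it to allow
# ScrapeGraphAI's asynchronous calls to nest inside it.
nest_asyncio.apply()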

Here's a straightforward explanation of what the code below does:

  1. Import Modules:
    os: used for interacting with the operating system, for example reading environment variables.
    dotenv: helps load environment variables from a .env file.
    SmartScraperGraph from scrapegraphai.graphs: the AI-powered scraping pipeline.
  2. Load Environment Variables:
    load_dotenv(): reads a .env file in your project directory and loads environment variables such as your API key.
  3. Get API Key:
    os.getenv("OPENAI_API_KEY"): retrieves your OpenAI API key from the environment variables.
  4. Set Up Configuration:
    graph_config: sets up the configuration for SmartScraperGraph, including the API key and the model to use (here, gpt-4o).
  5. Create and Run SmartScraperGraph:
    SmartScraperGraph(…): initializes the web scraper with a prompt, a webpage URL to scrape, and the configuration.
    prompt: instructions for what you want to scrape from the page.
    source: the URL of the webpage to scrape.
    config: the setup for the AI model and API key.
    smart_scraper_graph.run(): executes the scraping task and collects the results.
  6. Print Results:
    print(result): displays the results of the scraping task.

In essence, this script scrapes data from a specific webpage using an AI model, with the API key loaded from a secure file.

import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph

# Load environment variables (make sure you have a .env file with your OPENAI_API_KEY)
load_dotenv()

openai_key = os.getenv("OPENAI_API_KEY")

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-4o",
    },
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the services on this page with their descriptions.",
    source="https://understandingdata.com/",
    config=graph_config,
)

result = smart_scraper_graph.run()
print(result)
Output:

{'services': [
    {'name': 'Data Engineering', 'description': 'NA'},
    {'name': 'React Development', 'description': 'NA'},
    {'name': 'Python Programming Development', 'description': 'NA'},
    {'name': 'Prompt Engineering', 'description': 'NA'},
    {'name': 'Web Scraping', 'description': 'NA'},
    {'name': 'SaaS Applications', 'description': 'NA'}
]}

Web Scraping Using LangChain and Pydantic

The Term LangChain

LangChain is a framework for building applications on top of Large Language Models (LLMs). Take ChatGPT, for example: it uses one of OpenAI's large language models to generate responses, but ChatGPT itself isn't an LLM; it's an application built with one. If you want to build and manage your own LLM-powered applications, LangChain is a great tool. It lets you combine your own data with LLMs and simplifies much of the plumbing, so developers can focus on the critical tasks while LangChain handles the rest.
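To make that concrete, here is a minimal sketch of "an application built with an LLM" in LangChain. It assumes the langchain-openai package is installed and that OPENAI_API_KEY is set in your environment; the prompt text is just an example.

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

# A prompt template piped into a chat model forms a tiny LangChain app.
prompt = ChatPromptTemplate.from_template(
    "Explain {topic} in one short paragraph for a beginner."
)
llm = ChatOpenAI(model="gpt-4o")
chain = prompt | llm

# invoke() fills the template, calls the model, and returns a message.
print(chain.invoke({"topic": "web scraping"}).content)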

Now, let’s take a look at another method using LangChain and Pydantic.

Prerequisites

Before we dive into the implementation, make sure you have the following installed:

  • Python 3.7+
  • LangChain
  • Pydantic

Setting Up Pydantic Models

Pydantic is a data validation and settings management library that uses Python type annotations. It is ideal for defining and validating structured data. Here we will define the models that represent the services we wish to extract.

from langchain_core.pydantic_v1 import BaseModel
from typing import List

class ServiceSchema(BaseModel):
    name: str
    description: str

class Services(BaseModel):
    services: List[ServiceSchema]

In this code, `ServiceSchema` represents a single service with a name and description. `Services` is a container for a list of `ServiceSchema` objects.
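To see that validation in action before wiring up the scraper, you can feed the models some hand-made data. This is purely illustrative: the dictionaries below are invented, and it assumes ValidationError is re-exported from pydantic v1 through the same langchain_core.pydantic_v1 module.

from langchain_core.pydantic_v1 import ValidationError

# A well-formed payload parses into typed objects.
good = {"services": [{"name": "Web Scraping", "description": "Extract data from the web"}]}
print(Services(**good))

# A payload missing a required field raises a ValidationError that
# names the offending field.
bad = {"services": [{"name": "Web Scraping"}]}
try:
    Services(**bad)
except ValidationError as e:
    print(e)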

Configuring SmartScraperGraph

`SmartScraperGraph` is ScrapeGraphAI's workhorse pipeline for building and running scraping tasks. Here, we'll configure it to scrape the services from a webpage, this time passing our Pydantic schema so the output is structured.

from scrapegraphai.graphs import SmartScraperGraph

smart_scraper_graph = SmartScraperGraph(
    prompt="Extract all of the services that are offered on this page.",
    source="https://understandingdata.com/",
    config=graph_config,  # the same config defined earlier
    schema=Services,      # the Pydantic model that shapes the output
)

result = smart_scraper_graph.run()
print(result)

With this setup:

  • prompt instructs the scraper on what to extract.
  • source is the URL of the webpage to be scraped.
  • config carries the model details and the OpenAI API key.
  • schema is the Pydantic model that specifies the structure of the extracted data.

Complete Example

Here’s the complete example in one script:

from langchain_core.pydantic_v1 import BaseModel
from typing import List
import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph

class ServiceSchema(BaseModel):
    name: str
    description: str

class Services(BaseModel):
    services: List[ServiceSchema]

# Load environment variables
load_dotenv()
openai_key = os.getenv("OPENAI_API_KEY")

graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "gpt-4o",
    },
}

smart_scraper_graph = SmartScraperGraph(
    prompt="List me all the services on this page with their descriptions.",
    source="https://understandingdata.com/",
    config=graph_config,
    schema=Services,
)

result = smart_scraper_graph.run()
print(result)

# Validate the raw result against the Pydantic schema.
try:
    model = Services(**result)
    print(model)
except Exception as e:
    print(e)
Output:

services=[
    ServiceSchema(name='Data Engineering', description='NA'),
    ServiceSchema(name='React Development', description='NA'),
    ServiceSchema(name='Python Programming Development', description='NA'),
    ServiceSchema(name='Prompt Engineering', description='NA'),
    ServiceSchema(name='Web Scraping', description='NA'),
    ServiceSchema(name='SaaS Applications', description='NA')
]
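Once validation succeeds you are holding typed objects rather than raw dictionaries, which makes the data pleasant to work with downstream. A short illustrative follow-up:

# Iterate over typed objects instead of indexing into raw dicts.
for service in model.services:
    print(f"{service.name}: {service.description}")

# Serialize back to JSON for storage (pydantic v1 API).
print(model.json(indent=2))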

Conclusion
In this blog article we've shown how to build a smart scraper that extracts the services offered on a webpage using Pydantic and LangChain. By combining SmartScraperGraph's capabilities with well-defined data models, you can easily scrape and organize data from almost any webpage, and because the scraped data must conform to a predetermined structure, it is easier to consume and evaluate. For those looking for professional assistance, consider reaching out to the web scraping experts at xByte for web scraping services.
