Data Extraction

PyBA can extract structured data from web pages during automation. This page covers how to define extraction formats and use them effectively.

Overview

PyBA extracts data in two ways:

  1. Pydantic Models: Define exact structure with type hints

  2. Natural Language: Describe what you want in the prompt

The Pydantic approach is more precise and recommended for production use.

Using Pydantic Models

Define a Pydantic BaseModel to specify exactly what data you want:

from typing import Optional
from pydantic import BaseModel
from pyba import Engine, Database

class HackerNewsPost(BaseModel):
    title: str
    num_upvotes: Optional[int] = None   # default lets the field be omitted
    num_comments: Optional[int] = None
    url: Optional[str] = None

database = Database(engine="sqlite", name="/tmp/pyba.db")
engine = Engine(
    openai_api_key="sk-...",
    database=database,
    enable_tracing=True
)

result = engine.sync_run(
    prompt="Go to news.ycombinator.com and extract all posts from the front page",
    extraction_format=HackerNewsPost
)

Key points:

  • Use Optional for fields that might not exist

  • The extraction happens during navigation, not just at the end

  • Data is stored in the database if configured

How Extraction Works

When you provide an extraction_format:

  1. The Playwright Agent decides if the current page contains relevant data

  2. If yes, it sets an internal extract_info flag

  3. The Extraction Agent runs in a separate thread to extract data

  4. Extraction happens without blocking the main automation

This means you get data progressively as PyBA navigates, not just from the final page.
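The steps above can be pictured with a small standard-library sketch. It is not PyBA's implementation: `navigate`, `extract`, and the `(text, has_relevant_data)` page tuples are hypothetical stand-ins, and the only point illustrated is that extraction work is handed to a thread so the main loop keeps moving.

```python
import queue
import threading


def extract(page_text, results):
    """Hypothetical stand-in for the Extraction Agent."""
    results.put({"title": page_text.strip()})


def navigate(pages):
    """Hypothetical stand-in for the main automation loop."""
    results = queue.Queue()
    workers = []
    for page_text, has_relevant_data in pages:
        # has_relevant_data plays the role of the extract_info flag
        if has_relevant_data:
            worker = threading.Thread(target=extract, args=(page_text, results))
            worker.start()  # the loop moves on without waiting for the result
            workers.append(worker)
    # Wait for outstanding extractions before collecting the records
    for worker in workers:
        worker.join()
    records = []
    while not results.empty():
        records.append(results.get())
    return records


if __name__ == "__main__":
    pages = [("Front page", True), ("Login form", False), ("Page two", True)]
    print(navigate(pages))
```

Records arrive in whatever order the threads finish, which is why progressive results should not be assumed to follow navigation order.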

Extraction Agent

The Extraction Agent:

  • Receives the visible text from the page

  • Uses the LLM to extract data matching your Pydantic model

  • Stores results in the database

Example: E-commerce Scraping

from typing import Optional
from pydantic import BaseModel
from pyba import Engine, Database

class Product(BaseModel):
    name: str
    price: str
    rating: Optional[str] = None
    num_reviews: Optional[str] = None

db = Database(engine="sqlite", name="/tmp/products.db")
engine = Engine(openai_api_key="sk-...", database=db)

result = engine.sync_run(
    prompt="Go to Amazon, search for 'wireless mouse', and extract the first 5 products",
    extraction_format=Product
)

Example: News Articles

from typing import Optional
from pydantic import BaseModel
from pyba import Engine

class Article(BaseModel):
    headline: str
    author: Optional[str] = None
    date: Optional[str] = None
    summary: Optional[str] = None

engine = Engine(openai_api_key="sk-...")

result = engine.sync_run(
    prompt="Go to BBC News and extract the top 3 headlines with their summaries",
    extraction_format=Article
)

Natural Language Extraction

For quick tasks, you can describe the extraction in your prompt:

from pyba import Engine

engine = Engine(openai_api_key="sk-...")

result = engine.sync_run(
    prompt="""
    Go to news.ycombinator.com.
    Extract all posts from the front page.
    For each post, get: title, number of upvotes, number of comments.
    """
)

This is less precise than Pydantic models but works for simple cases.

Best Practices

Use Optional Fields

Web pages are unpredictable. Use Optional for fields that might be missing:

class Product(BaseModel):
    name: str                          # Required - should always exist
    price: Optional[str] = None        # Optional - the page may not show a price
    rating: Optional[float] = None     # Optional - new products might not have ratings
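A short check using only Pydantic (no PyBA calls) shows what the `= None` default buys you. The `is_valid` helper is a hypothetical name used for illustration: a record with missing optional fields still validates, while a record without the required `name` is rejected.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class Product(BaseModel):
    name: str
    price: Optional[str] = None
    rating: Optional[float] = None


def is_valid(data):
    """Hypothetical helper: True if data satisfies the Product model."""
    try:
        Product(**data)
        return True
    except ValidationError:
        return False


# Optional fields that were not found on the page fall back to None
partial = Product(name="Wireless Mouse")
print(partial.price, partial.rating)

# The required field has no default, so leaving it out fails validation
print(is_valid({"price": "$10"}))
```

Note that `Optional[...]` on its own only allows `None` as a value. It is the explicit default that lets a field be left out entirely.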

Be Specific in Your Prompt

The extraction quality depends on your prompt. Be specific about:

  • What page to visit

  • How many items to extract

  • What fields matter most

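from typing import Optional
from pydantic import BaseModel

# Illustrative model; the fields mirror the ones named in the prompt below
class Repository(BaseModel):
    name: str
    description: Optional[str] = None
    stars: Optional[int] = None
    language: Optional[str] = None

# Good prompt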
result = engine.sync_run(
    prompt="Go to GitHub trending. Extract the top 10 repositories. For each, get: name, description, stars, and programming language.",
    extraction_format=Repository
)

# Vague prompt: no page, item count, or fields specified
result = engine.sync_run(
    prompt="Get GitHub stuff",
    extraction_format=Repository
)

Use Database for Persistence

Extracted data is stored in the database if configured:

from pyba import Engine, Database

db = Database(engine="sqlite", name="/tmp/extractions.db")
engine = Engine(openai_api_key="sk-...", database=db)

# Data is automatically persisted
result = engine.sync_run(
    prompt="Extract product data...",
    extraction_format=Product
)
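This page does not describe the tables PyBA creates, so a sensible first step after a run is to inspect the SQLite file with Python's built-in `sqlite3` module. The sketch below assumes nothing about PyBA's schema. `list_tables` is a hypothetical helper that simply asks SQLite which tables exist.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables visible on a SQLite connection."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


if __name__ == "__main__":
    # Path taken from the example above; connecting creates the file if it is missing
    conn = sqlite3.connect("/tmp/extractions.db")
    try:
        print(list_tables(conn))
    finally:
        conn.close()
```

Once you know the table names, `SELECT * FROM <table> LIMIT 5` is usually enough to see how the extracted records are laid out.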

Limitations

  • One model per run: You can specify only one extraction_format per call to run() or sync_run()

  • Text-based: Extraction works on visible text, not images or PDFs

  • LLM quality: Extraction accuracy depends on the LLM’s understanding
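One way to work within the one-model-per-run limit is to issue a separate run per model. The sketch below is an assumption-laden illustration: `run_per_model` is a hypothetical helper, it relies only on the `sync_run(prompt=..., extraction_format=...)` call shown on this page, and it assumes an engine can be reused across runs, which this page does not confirm.

```python
def run_per_model(engine, prompts_by_model):
    """Call engine.sync_run once per model and collect results by model name.

    prompts_by_model maps a Pydantic model class to the prompt for that run.
    """
    results = {}
    for model, prompt in prompts_by_model.items():
        results[model.__name__] = engine.sync_run(
            prompt=prompt,
            extraction_format=model,
        )
    return results
```

Each run navigates from scratch, so this trades extra browsing time for keeping every model small and focused.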

YouTube Extraction

PyBA has special handling for YouTube pages. When on YouTube, it uses a custom DOM extraction script optimized for video metadata:

result = engine.sync_run(
    prompt="Go to this YouTube video and extract: title, views, likes, channel name"
)

The YouTube extractor pulls:

  • Video title and description

  • View count and likes

  • Channel information

  • Comments (if visible)