Data Extraction

PyBA can extract structured data from web pages during automation. This page covers how to define extraction formats and use them effectively.

Overview 

PyBA extracts data in two ways:

Pydantic Models: Define exact structure with type hints
Natural Language: Describe what you want in the prompt

The Pydantic approach is more precise and recommended for production use.

Using Pydantic Models 

Define a Pydantic BaseModel to specify exactly what data you want:

from typing import List, Optional
from pydantic import BaseModel
from pyba import Engine, Database

class HackerNewsPost(BaseModel):
    title: str
    num_upvotes: Optional[int]
    num_comments: Optional[int]
    url: Optional[str]

database = Database(engine="sqlite", name="/tmp/pyba.db")
engine = Engine(
    openai_api_key="sk-...",
    database=database,
    enable_tracing=True
)

result = engine.sync_run(
    prompt="Go to news.ycombinator.com and extract all posts from the front page",
    extraction_format=HackerNewsPost
)

Key points:

Use Optional for fields that might not exist
The extraction happens during navigation, not just at the end
Data is stored in the database if configured

How Extraction Works 

When you provide an extraction_format:

The Playwright Agent decides if the current page contains relevant data
If yes, it sets an internal extract_info flag
The Extraction Agent runs in a separate thread to extract data
Extraction happens without blocking the main automation

This means you get data progressively as PyBA navigates, not just from the final page.

Extraction Agent 

The Extraction Agent:

Receives the visible text from the page
Uses the LLM to extract data matching your Pydantic model
Stores results in the database

Example: E-commerce Scraping 

from typing import List, Optional
from pydantic import BaseModel
from pyba import Engine, Database

class Product(BaseModel):
    name: str
    price: str
    rating: Optional[str]
    num_reviews: Optional[str]

db = Database(engine="sqlite", name="/tmp/products.db")
engine = Engine(openai_api_key="sk-...", database=db)

result = engine.sync_run(
    prompt="Go to Amazon, search for 'wireless mouse', and extract the first 5 products",
    extraction_format=Product
)

Example: News Articles 

from typing import Optional
from pydantic import BaseModel
from pyba import Engine

class Article(BaseModel):
    headline: str
    author: Optional[str]
    date: Optional[str]
    summary: Optional[str]

engine = Engine(openai_api_key="sk-...")

result = engine.sync_run(
    prompt="Go to BBC News and extract the top 3 headlines with their summaries",
    extraction_format=Article
)

Natural Language Extraction 

For quick tasks, you can describe the extraction in your prompt:

engine = Engine(openai_api_key="sk-...")

result = engine.sync_run(
    prompt="""
    Go to news.ycombinator.com.
    Extract all posts from the front page.
    For each post, get: title, number of upvotes, number of comments.
    """
)

This is less precise than Pydantic models but works for simple cases.

Best Practices 

Use Optional Fields 

Web pages are unpredictable. Use Optional for fields that might be missing:

class Product(BaseModel):
    name: str                          # Required - should always exist
    price: Optional[str] = None        # Optional - might be "Price unavailable"
    rating: Optional[float] = None     # Optional - new products might not have ratings

Be Specific in Your Prompt 

The extraction quality depends on your prompt. Be specific about:

What page to visit
How many items to extract
What fields matter most

# Good prompt
result = engine.sync_run(
    prompt="Go to GitHub trending. Extract the top 10 repositories. For each, get: name, description, stars, and programming language.",
    extraction_format=Repository
)

# Vague prompt
result = engine.sync_run(
    prompt="Get GitHub stuff",
    extraction_format=Repository
)

Use Database for Persistence 

Extracted data is stored in the database if configured:

from pyba import Engine, Database

db = Database(engine="sqlite", name="/tmp/extractions.db")
engine = Engine(openai_api_key="sk-...", database=db)

# Data is automatically persisted
result = engine.sync_run(
    prompt="Extract product data...",
    extraction_format=Product
)

Limitations 

One model per run: You can only specify one extraction_format per run() call
Text-based: Extraction works on visible text, not images or PDFs
LLM quality: Extraction accuracy depends on the LLM’s understanding

YouTube Extraction 

PyBA has special handling for YouTube pages. When on YouTube, it uses a custom DOM extraction script optimized for video metadata:

result = engine.sync_run(
    prompt="Go to this YouTube video and extract: title, views, likes, channel name"
)

The YouTube extractor pulls:

Video title and description
View count and likes
Channel information
Comments (if visible)