Data Extraction
PyBA can extract structured data from web pages during automation. This page covers how to define extraction formats and use them effectively.
Overview
PyBA extracts data in two ways:
Pydantic Models: Define exact structure with type hints
Natural Language: Describe what you want in the prompt
The Pydantic approach is more precise and recommended for production use.
Using Pydantic Models
Define a Pydantic BaseModel to specify exactly what data you want:
from typing import List, Optional
from pydantic import BaseModel
from pyba import Engine, Database
class HackerNewsPost(BaseModel):
title: str
num_upvotes: Optional[int]
num_comments: Optional[int]
url: Optional[str]
database = Database(engine="sqlite", name="/tmp/pyba.db")
engine = Engine(
openai_api_key="sk-...",
database=database,
enable_tracing=True
)
result = engine.sync_run(
prompt="Go to news.ycombinator.com and extract all posts from the front page",
extraction_format=HackerNewsPost
)
Key points:
Use
Optionalfor fields that might not existThe extraction happens during navigation, not just at the end
Data is stored in the database if configured
How Extraction Works
When you provide an extraction_format:
The Playwright Agent decides if the current page contains relevant data
If yes, it sets an internal
extract_infoflagThe Extraction Agent runs in a separate thread to extract data
Extraction happens without blocking the main automation
This means you get data progressively as PyBA navigates, not just from the final page.
Extraction Agent
The Extraction Agent:
Receives the visible text from the page
Uses the LLM to extract data matching your Pydantic model
Stores results in the database
Example: E-commerce Scraping
from typing import List, Optional
from pydantic import BaseModel
from pyba import Engine, Database
class Product(BaseModel):
name: str
price: str
rating: Optional[str]
num_reviews: Optional[str]
db = Database(engine="sqlite", name="/tmp/products.db")
engine = Engine(openai_api_key="sk-...", database=db)
result = engine.sync_run(
prompt="Go to Amazon, search for 'wireless mouse', and extract the first 5 products",
extraction_format=Product
)
Example: News Articles
from typing import Optional
from pydantic import BaseModel
from pyba import Engine
class Article(BaseModel):
headline: str
author: Optional[str]
date: Optional[str]
summary: Optional[str]
engine = Engine(openai_api_key="sk-...")
result = engine.sync_run(
prompt="Go to BBC News and extract the top 3 headlines with their summaries",
extraction_format=Article
)
Natural Language Extraction
For quick tasks, you can describe the extraction in your prompt:
engine = Engine(openai_api_key="sk-...")
result = engine.sync_run(
prompt="""
Go to news.ycombinator.com.
Extract all posts from the front page.
For each post, get: title, number of upvotes, number of comments.
"""
)
This is less precise than Pydantic models but works for simple cases.
Best Practices
Use Optional Fields
Web pages are unpredictable. Use Optional for fields that might be missing:
class Product(BaseModel):
name: str # Required - should always exist
price: Optional[str] = None # Optional - might be "Price unavailable"
rating: Optional[float] = None # Optional - new products might not have ratings
Be Specific in Your Prompt
The extraction quality depends on your prompt. Be specific about:
What page to visit
How many items to extract
What fields matter most
# Good prompt
result = engine.sync_run(
prompt="Go to GitHub trending. Extract the top 10 repositories. For each, get: name, description, stars, and programming language.",
extraction_format=Repository
)
# Vague prompt
result = engine.sync_run(
prompt="Get GitHub stuff",
extraction_format=Repository
)
Use Database for Persistence
Extracted data is stored in the database if configured:
from pyba import Engine, Database
db = Database(engine="sqlite", name="/tmp/extractions.db")
engine = Engine(openai_api_key="sk-...", database=db)
# Data is automatically persisted
result = engine.sync_run(
prompt="Extract product data...",
extraction_format=Product
)
Limitations
One model per run: You can only specify one
extraction_formatperrun()callText-based: Extraction works on visible text, not images or PDFs
LLM quality: Extraction accuracy depends on the LLM’s understanding
YouTube Extraction
PyBA has special handling for YouTube pages. When on YouTube, it uses a custom DOM extraction script optimized for video metadata:
result = engine.sync_run(
prompt="Go to this YouTube video and extract: title, views, likes, channel name"
)
The YouTube extractor pulls:
Video title and description
View count and likes
Channel information
Comments (if visible)