Data Extraction =============== PyBA can extract structured data from web pages during automation. This page covers how to define extraction formats and use them effectively. .. contents:: :local: :depth: 2 Overview -------- PyBA extracts data in two ways: 1. **Pydantic Models**: Define exact structure with type hints 2. **Natural Language**: Describe what you want in the prompt The Pydantic approach is more precise and recommended for production use. Using Pydantic Models --------------------- Define a Pydantic ``BaseModel`` to specify exactly what data you want: .. code-block:: python from typing import List, Optional from pydantic import BaseModel from pyba import Engine, Database class HackerNewsPost(BaseModel): title: str num_upvotes: Optional[int] num_comments: Optional[int] url: Optional[str] database = Database(engine="sqlite", name="/tmp/pyba.db") engine = Engine( openai_api_key="sk-...", database=database, enable_tracing=True ) result = engine.sync_run( prompt="Go to news.ycombinator.com and extract all posts from the front page", extraction_format=HackerNewsPost ) **Key points:** - Use ``Optional`` for fields that might not exist - The extraction happens **during** navigation, not just at the end - Data is stored in the database if configured How Extraction Works -------------------- When you provide an ``extraction_format``: 1. The **Playwright Agent** decides if the current page contains relevant data 2. If yes, it sets an internal ``extract_info`` flag 3. The **Extraction Agent** runs in a separate thread to extract data 4. Extraction happens without blocking the main automation This means you get data progressively as PyBA navigates, not just from the final page. Extraction Agent ^^^^^^^^^^^^^^^^ The Extraction Agent: - Receives the visible text from the page - Uses the LLM to extract data matching your Pydantic model - Stores results in the database Example: E-commerce Scraping ---------------------------- .. code-block:: python from typing import List, Optional from pydantic import BaseModel from pyba import Engine, Database class Product(BaseModel): name: str price: str rating: Optional[str] num_reviews: Optional[str] db = Database(engine="sqlite", name="/tmp/products.db") engine = Engine(openai_api_key="sk-...", database=db) result = engine.sync_run( prompt="Go to Amazon, search for 'wireless mouse', and extract the first 5 products", extraction_format=Product ) Example: News Articles ---------------------- .. code-block:: python from typing import Optional from pydantic import BaseModel from pyba import Engine class Article(BaseModel): headline: str author: Optional[str] date: Optional[str] summary: Optional[str] engine = Engine(openai_api_key="sk-...") result = engine.sync_run( prompt="Go to BBC News and extract the top 3 headlines with their summaries", extraction_format=Article ) Natural Language Extraction --------------------------- For quick tasks, you can describe the extraction in your prompt: .. code-block:: python engine = Engine(openai_api_key="sk-...") result = engine.sync_run( prompt=""" Go to news.ycombinator.com. Extract all posts from the front page. For each post, get: title, number of upvotes, number of comments. """ ) This is less precise than Pydantic models but works for simple cases. Best Practices -------------- Use Optional Fields ^^^^^^^^^^^^^^^^^^^ Web pages are unpredictable. Use ``Optional`` for fields that might be missing: .. code-block:: python class Product(BaseModel): name: str # Required - should always exist price: Optional[str] = None # Optional - might be "Price unavailable" rating: Optional[float] = None # Optional - new products might not have ratings Be Specific in Your Prompt ^^^^^^^^^^^^^^^^^^^^^^^^^^ The extraction quality depends on your prompt. Be specific about: - What page to visit - How many items to extract - What fields matter most .. code-block:: python # Good prompt result = engine.sync_run( prompt="Go to GitHub trending. Extract the top 10 repositories. For each, get: name, description, stars, and programming language.", extraction_format=Repository ) # Vague prompt result = engine.sync_run( prompt="Get GitHub stuff", extraction_format=Repository ) Use Database for Persistence ^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Extracted data is stored in the database if configured: .. code-block:: python from pyba import Engine, Database db = Database(engine="sqlite", name="/tmp/extractions.db") engine = Engine(openai_api_key="sk-...", database=db) # Data is automatically persisted result = engine.sync_run( prompt="Extract product data...", extraction_format=Product ) Limitations ----------- - **One model per run**: You can only specify one ``extraction_format`` per ``run()`` call - **Text-based**: Extraction works on visible text, not images or PDFs - **LLM quality**: Extraction accuracy depends on the LLM's understanding YouTube Extraction ------------------ PyBA has special handling for YouTube pages. When on YouTube, it uses a custom DOM extraction script optimized for video metadata: .. code-block:: python result = engine.sync_run( prompt="Go to this YouTube video and extract: title, views, likes, channel name" ) The YouTube extractor pulls: - Video title and description - View count and likes - Channel information - Comments (if visible)