Performance & Benchmarking

Performance Tips

  1. Use headless mode for faster execution (headless=True)

  2. Enable low memory mode on resource-constrained environments (low_memory=True)

  3. Enable database logging only when you need code generation

  4. Set appropriate max_depth — higher isn’t always better

  5. Use extraction_format when you need structured data

Benchmarking Memory

You can measure PyBA’s memory usage on your own machine using this script:

import asyncio, os, glob, gc

def chromium_rss():
    total = 0
    for p in glob.glob("/proc/[0-9]*/stat"):
        try:
            pid = int(p.split("/")[2])
            if b"chromium" in open(f"/proc/{pid}/cmdline", "rb").read().lower():
                with open(f"/proc/{pid}/statm") as f:
                    total += int(f.read().split()[1])
        except Exception:
            pass
    return total * os.sysconf("SC_PAGE_SIZE") // 1048576

def py_rss():
    return int(open(f"/proc/{os.getpid()}/statm").read().split()[1]) * os.sysconf("SC_PAGE_SIZE") // 1048576

async def bench(low_memory):
    from pyba import Step
    print(f"--- low_memory={low_memory} ---")
    print(f"Before: python={py_rss()} MB")
    step = Step(gemini_api_key="...", headless=True, low_memory=low_memory, enable_tracing=False)
    await step.start()
    await step.page.goto("https://www.amazon.com", wait_until="domcontentloaded")
    await asyncio.sleep(5)
    try:
        await step.step("Search for headphones")
    except Exception:
        pass
    print(f"Peak: chromium={chromium_rss()} MB, python={py_rss()} MB")
    await step.stop()
    del step
    gc.collect()
    print(f"After: python={py_rss()} MB")

asyncio.run(bench(low_memory=True))

Measured Results

Tested on Amazon.com, headless mode, Gemini provider:

Metric

low_memory=True

low_memory=False

Idle (before any session)

~60 MB

~180 MB

Peak (during session)

~940 MB

~940 MB

After session cleanup

~130 MB

~130 MB

Where the Savings Are

The ~120MB saving is at idle — before any browser session is launched. This comes from lazy-loading heavy Python dependencies:

  • oxymouse (numpy/scipy): ~46MB

  • google-genai: ~64MB (skipped when using OpenAI)

  • openai: ~9MB (skipped when using Gemini)

Peak memory during a session is dominated by Chromium (~800MB) which is unaffected by low_memory. After the first session, lazy-loaded modules remain cached in sys.modules for the process lifetime.

Note

The Chromium flags in low memory mode (--disable-gpu, --disable-dev-shm-usage, etc.) do not measurably reduce browser RSS. They improve stability in containerized environments, especially --disable-dev-shm-usage which prevents OOM when /dev/shm is too small.

Memory Lifecycle in a Server

If you run PyBA as a long-lived server (e.g. with aiohttp), the memory lifecycle is:

  1. Server starts: ~60MB (with low_memory=True) or ~180MB (without)

  2. First session opens: Jumps to ~600-1000MB (Chromium + lazy-loaded modules)

  3. First session closes: Drops to ~130-220MB (modules stay loaded, Chromium exits)

  4. Subsequent sessions: Same peak, returns to ~130-220MB each time

To ensure memory is returned to the OS after sessions close, set these environment variables in your container:

MALLOC_TRIM_THRESHOLD_=65536
MALLOC_MMAP_THRESHOLD_=65536

These tell glibc’s allocator to return freed pages to the OS more aggressively.