Benchmarks & Datasets
Transparent, methodology-first benchmarks and small open datasets for reproducible evaluation.
Available Benchmarks
UI Regression Stability
Measuring resilience of UI tests under DOM and style changes across browsers.
- Dataset: 1,200 DOM variants
- Metrics: Flake rate, recovery time
- Runners: Playwright, Selenium
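As a rough sketch of how a flake-rate metric could be computed from repeated runs: the `flake_rate` helper and the majority-vote definition below are assumptions for illustration, not the benchmark's actual scoring code.

```python
from collections import Counter

def flake_rate(outcomes: list[bool]) -> float:
    """Fraction of runs that disagree with the majority outcome.

    `outcomes` holds pass/fail results (True = pass) from repeated
    runs of the same test against the same build; any disagreement
    with the majority is counted as a flake.
    """
    if not outcomes:
        return 0.0
    _, majority_count = Counter(outcomes).most_common(1)[0]
    return (len(outcomes) - majority_count) / len(outcomes)
```

For example, a test that fails once in four identical runs would score a flake rate of 0.25 under this definition.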
RAG Retrieval Accuracy
Evaluating retriever precision/recall, grounding quality, and answer relevance.
- Dataset: 5k Q/A pairs
- Metrics: P@k, nDCG
- Tools: LangSmith, LlamaIndex
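A minimal sketch of the two listed retrieval metrics with binary relevance labels; the function names are illustrative, and the benchmark's own scorer may use graded relevance instead.

```python
import math

def precision_at_k(relevant: set, retrieved: list, k: int) -> float:
    """P@k: fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def ndcg_at_k(relevant: set, retrieved: list, k: int) -> float:
    """nDCG@k with binary relevance: DCG of the ranking divided by
    the DCG of an ideal ranking that places all relevant docs first."""
    dcg = sum(1 / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1 / math.log2(i + 2)
                for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0
```

Unlike P@k, nDCG rewards placing relevant documents earlier in the ranking, which is why both are reported together.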
Test Generation Quality
Comparing LLM-based test generation across domains and input modalities.
- Inputs: Jira, Figma, APIs
- Metrics: Coverage, brittleness
- Frameworks: Dobby, GPT, Claude
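One plausible way to operationalize the coverage metric is requirement coverage via traceability links between generated tests and the requirements they exercise. The helper below is an assumption for illustration only, not the benchmark's actual metric.

```python
def requirement_coverage(requirements: set, tests: dict) -> float:
    """Fraction of requirements exercised by at least one generated test.

    `tests` maps a test id to the set of requirement ids it asserts
    against (the mapping is assumed to come from traceability
    annotations on the generated tests).
    """
    if not requirements:
        return 0.0
    covered = set().union(*tests.values()) if tests else set()
    return len(covered & requirements) / len(requirements)
```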
Methodology
Reproducibility
All experiments include seed control, environment pinning, and result variance reporting.
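The practices above can be sketched in Python; the helper names and the particular seed sources are assumptions, not the actual harness.

```python
import os
import random
import statistics

def pin_environment(seed: int = 1234) -> None:
    """Fix known sources of nondeterminism before a benchmark run."""
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # If NumPy or PyTorch are in use, seed them here as well.

def summarize_runs(scores: list[float]) -> dict:
    """Report mean and sample standard deviation across repeated runs,
    so result variance is published alongside the headline number."""
    return {
        "runs": len(scores),
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
    }
```

Environment pinning itself (exact package and browser versions) would typically live in a lockfile rather than in code.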
Transparency
Clear documentation on dataset composition, annotation, and known limitations.