Invaris

Benchmarks & Datasets

Transparent, methodology-first benchmarks and small open datasets for reproducibility.

Available Benchmarks

UI Regression Stability

Measuring resilience of UI tests under DOM and style changes across browsers.

  • Dataset: 1,200 DOM variants
  • Metrics: Flake rate, recovery time (sketched below)
  • Runners: Playwright, Selenium
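
As a concrete reference, here is a minimal sketch of how the two metrics could be computed from repeated runs. The result format, and the definition of recovery time as runs-until-next-pass, are illustrative assumptions, not the benchmark's published harness.

```python
# Minimal sketch of the flake-rate and recovery-time metrics, assuming
# results are stored as {test_id: [pass/fail per repeated run]}; this
# format and the recovery-time definition are illustrative assumptions.

def flake_rate(results: dict[str, list[bool]]) -> float:
    """Fraction of tests whose outcome is not constant across repeated runs."""
    flaky = [tid for tid, runs in results.items() if len(set(runs)) > 1]
    return len(flaky) / len(results)

def recovery_time(runs: list[bool]) -> int | None:
    """Runs elapsed from the first failure until the next pass (None if never)."""
    if False not in runs:
        return 0  # never failed, nothing to recover from
    first_fail = runs.index(False)
    for offset, passed in enumerate(runs[first_fail:]):
        if passed:
            return offset
    return None

results = {"login_test": [True, False, True], "cart_test": [True, True, True]}
print(flake_rate(results))                   # 0.5
print(recovery_time(results["login_test"]))  # 1
```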

RAG Retrieval Accuracy

Evaluating retriever precision/recall, grounding quality, and answer relevance.

  • Dataset: 5k Q/A pairs
  • Metrics: P@k, nDCG (sketched below)
  • Tools: LangSmith, LlamaIndex
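
Both metrics are standard IR measures; the sketch below computes binary-relevance P@k and nDCG@k over a ranked list of document ids. The input format (ranked ids plus a gold set) is an assumed schema, not the benchmark's actual one.

```python
# Hedged sketch of the two retrieval metrics named above; `retrieved` is a
# ranked list of doc ids and `relevant` a set of gold ids — an assumed
# format, not the benchmark's actual schema.
import math

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """P@k: fraction of the top-k retrieved documents that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def ndcg_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """nDCG@k with binary relevance: DCG of the ranking over the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(retrieved[:k]) if doc in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(len(relevant), k)))
    return dcg / ideal if ideal else 0.0

ranked = ["d3", "d1", "d7", "d2"]
gold = {"d1", "d2"}
print(precision_at_k(ranked, gold, k=3))  # 0.333...
print(ndcg_at_k(ranked, gold, k=3))       # ~0.39
```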

Test Generation Quality

Comparing LLM-based test generation across domains and input modalities.

  • Inputs: Jira, Figma, APIs
  • Metrics: Coverage, brittleness (see the sketch below)
  • Frameworks: Dobby, GPT, Claude
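
Brittleness has no single standard definition; one plausible scoring, sketched below, counts how often a generated test that passes on the baseline target breaks on a semantics-preserving variant. The `run_test` hook and the variant list are hypothetical placeholders, not part of the benchmark harness.

```python
# Illustrative sketch of one way a brittleness score could be computed:
# run each generated test against the baseline target and a set of
# semantics-preserving variants; `run_test` and `variants` are hypothetical.
from typing import Callable

def brittleness(tests: list[str],
                variants: list[str],
                run_test: Callable[[str, str], bool]) -> float:
    """Fraction of (test, variant) pairs where a baseline-passing test
    breaks on an equivalent variant of the target."""
    breaks, total = 0, 0
    baseline = variants[0]
    for test in tests:
        if not run_test(test, baseline):
            continue  # tests failing on the baseline are scored separately
        for variant in variants[1:]:
            total += 1
            if not run_test(test, variant):
                breaks += 1
    return breaks / total if total else 0.0
```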

Methodology

Reproducibility

All experiments include seed control, environment pinning, and result variance reporting.
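
As a minimal sketch of what this looks like in practice: seed each run explicitly and report the score distribution rather than a single number. `run_benchmark` is a hypothetical entry point, and environment pinning itself lives outside the code (lock files, container digests).

```python
# Sketch of seed control plus variance reporting across repeated runs;
# `run_benchmark` is a hypothetical entry point, and environment pinning
# (lock files, container digests) happens outside this code.
import random
import statistics

def evaluate(seeds: list[int], run_benchmark) -> dict[str, float]:
    """Run the benchmark once per seed (needs >= 2 seeds for a stdev)."""
    scores = []
    for seed in seeds:
        random.seed(seed)  # also seed numpy/torch etc. if the benchmark uses them
        scores.append(run_benchmark(seed=seed))
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),
        "n_runs": len(scores),
    }
```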

Transparency

Clear documentation on dataset composition, annotation, and known limitations.
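
One possible shape for such documentation is a small machine-readable dataset card; the field names and placeholder values below are illustrative, not the project's published schema.

```python
# Hypothetical dataset-card structure covering the three documentation
# points above: composition, annotation, and known limitations.
from dataclasses import dataclass, field

@dataclass
class DatasetCard:
    name: str
    composition: str          # what the examples are and where they come from
    annotation: str           # who labeled the data and how
    known_limitations: list[str] = field(default_factory=list)

card = DatasetCard(
    name="<benchmark dataset id>",
    composition="<source, size, and sampling of the examples>",
    annotation="<who labeled the data and under what guidelines>",
    known_limitations=["<coverage gaps>", "<domain skew>"],
)
```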