
Real-World Benchmarks for AI Agents: The xbench Evaluation Framework

As AI models and agents move beyond static capabilities and into real-world workflows, the way we evaluate them must evolve as well. Traditional benchmarks, while useful, often fail to capture whether an AI system can perform reliably in dynamic environments, deliver measurable business value, or operate as part of a broader production process. Our goal with xbench is to close this gap by building evaluations that reflect how AI is actually used.

xbench is a collaborative, continuously evolving evaluation platform designed to measure both foundational intelligence breakthroughs and real-world professional performance. Through open participation from model developers, agent builders, industry practitioners, and researchers, we create benchmarks that are challenging, credible, and grounded in practical outcomes. Our evaluation framework combines live, real-world testing with professionally curated datasets to ensure results remain relevant as models, tools, and use cases evolve.

By introducing a dual-track evaluation system spanning AGI Tracking and Profession-Aligned benchmarks, xbench provides a clear path from early capability validation to production-level assessment. This structure allows us to identify when new intelligence emerges, understand how it translates into real workflows, and anticipate when AI applications are approaching true tech-market fit.

Dynamic Real-World Evaluations (Live Evaluations for Agents)

The Dual-Track Path for Agent Evaluation: We have introduced a dual-track series of evaluation sets in xbench: xbench-AGI Tracking and xbench-Profession Aligned. We consider AGI Tracking evaluations to be the fundamental stepping stone for Agent applications, while Profession-Aligned evaluations represent higher-level practice tied directly to real production scenarios.

  • AGI Tracking Evaluations – These aim to verify whether a model has achieved a “0 to 1” intelligence breakthrough in a specific capability dimension. Such evaluations need to be sufficiently difficult and cleverly designed, with enough discrimination to probe the boundaries of “intelligence” rather than just the accumulation of training data or knowledge. Only when an AI’s key capability achieves a true breakthrough in an AGI Tracking test can it unlock more complex professional workflows and move into the realm of Profession-Aligned evaluation.
  • Profession-Aligned Evaluations – These focus on realistic production scenarios, effectively treating an AI Agent as a digital employee embedded in a concrete business workflow. Here, the evaluation’s core is not whether the AI exhibits intelligence in the abstract, but whether it can deliver results and business value in a real-world setting. In Profession-Aligned tasks, we do not limit the solution approach or model; we only evaluate the outcomes. This approach starts from actual productivity needs: it defines tasks from vertical applications and seeks AI solutions for domain-specific problems – even if a full product for that scenario does not yet exist.

For example, let’s revisit the Marketing and Human Resources (Recruiting) scenarios mentioned earlier. 

By tracking the metrics in our xbench-DeepSearch (AGI Tracking) evaluations, we observed that AI search capability was maturing rapidly. Tasks such as scanning resumes and analyzing candidate fit, or finding and assessing KOL (Key Opinion Leader) influencers for marketing, were coming within reach of AI. We therefore began constructing the xbench-Profession-Recruitment and xbench-Profession-Marketing evaluation suites, which align with real business processes in those fields and aim to predict when each application might reach Tech-Market Fit (TMF).

Beyond AI search, as we foresee AI’s key abilities expanding to multimodal understanding and generation, tasks like producing and placing marketing creatives will come within range of TMF and fall under Profession-Aligned evaluation. Likewise, in recruiting, a senior recruiter’s workflow is not limited to sourcing and evaluating candidates: the harder steps include long-term candidate relationship management, compensation negotiation, and closing hires, which demand long-term memory, competitive strategy, and decision-making from the AI. These capabilities – long-term memory, multi-agent collaboration and negotiation, problem discovery – represent the next critical intelligence breakthroughs we will monitor, and we will enrich our Profession-Aligned evaluation sets as they arrive.

 xbench connects core AI capabilities to professional applications and business outcomes.

Evaluation Centered on Core AI Capabilities (AGI Tracking)

During 2023–2024, large models made significant breakthroughs in knowledge acquisition, multimodal processing, memory, instruction-following, and reasoning. These advances collectively led to an explosion in what Agent applications could do. However, there are still clear shortcomings in areas like long-term memory, reliability/truthfulness, problem discovery, multi-agent collaboration, and game-theoretic decision-making. We aim to target these unsolved core capabilities by building and continuously maintaining corresponding evaluation sets.

We believe that for each of these key capabilities, academia has proposed many outstanding evaluation methodologies. Yet due to resource and time constraints, few have been maintained as continuously evolving, dynamic benchmarks. Our goal with xbench is to carry on the spirit of these open evaluations and provide third-party, live evaluations (incorporating both black-box and white-box testing) that are regularly updated.

We decompose an Agent’s capabilities into tiers: fundamental intelligence, professional proficiency, creative ability, and organizational ability. Within each tier, we identify the critical elements required to achieve AGI. Importantly, AI development may not progress strictly from basic to advanced in order; for example, even after an AI attains high-level organizational ability, it might still suffer from fundamental reliability issues. This informs how we design our evaluations to monitor different facets in parallel.

In our first public release, our xbench-ScienceQA and xbench-DeepSearch evaluations correspond to the “Knowledge” and “Tool Use” subcategories of capabilities, respectively. They test an Agent’s abilities in these two areas through specific subtasks. Moving forward, we will continue to release new evaluations targeting these critical capability areas and track how current AI products perform.

Table organizing AI capabilities into four tiers with detailed descriptions
AI Agent capabilities are organized into hierarchical tiers, from foundational intelligence to organizational ability.

xbench-ScienceQA – Evaluating Fundamental Intelligence (Knowledge).

This evaluation set is designed to test graduate-level academic knowledge and reasoning ability. We curate high-quality questions that are reliable, span multiple disciplines, are at advanced (higher-education) difficulty, have unambiguous answers, and are not readily answerable via a search engine. Existing related benchmarks like GPQA and SuperGPQA have received significant recognition, but they were one-off releases and lack a mechanism for periodic updates, making it hard to ensure questions remain unseen by models over time. In xbench-ScienceQA, our aim is to build a question set that is updated quarterly, with monthly reporting on the performance of the latest models. We invite PhD students from top universities and seasoned industry experts to contribute questions, and we ensure the questions’ fairness, discriminative power, and correctness through LLM difficulty testing, search-engine checks, and peer review.
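To make that quality-control loop concrete, here is a minimal sketch of how a candidate question might be screened. This is our own illustration, not xbench’s actual tooling: `ask_model`, `search_hit_count`, and the acceptance thresholds are placeholders you would wire to real model APIs, search checks, and review policy.

```python
# Illustrative sketch of a question-screening pipeline for a ScienceQA-style benchmark.
# Not xbench's actual tooling: `ask_model` and `search_hit_count` are placeholder
# callables; the thresholds are hypothetical.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Question:
    prompt: str
    reference_answer: str

def screen_question(
    q: Question,
    ask_model: Callable[[str, str], str],      # (model_name, prompt) -> model's answer
    search_hit_count: Callable[[str], int],    # query -> number of direct web hits
    frontier_models: list[str],
    reviewer_votes: list[bool],                # independent expert judgments of correctness
) -> dict:
    """Apply the three checks described above: LLM difficulty, searchability, peer review."""
    # 1. Difficulty / discrimination: reject questions that every frontier model already solves.
    solved = [ask_model(m, q.prompt).strip() == q.reference_answer for m in frontier_models]
    too_easy = all(solved)

    # 2. Leakage / searchability: the answer should not be one search query away.
    searchable = search_hit_count(q.prompt) > 0

    # 3. Correctness: require near-unanimous reviewer agreement.
    correct = sum(reviewer_votes) >= max(2, len(reviewer_votes) - 1)

    return {
        "accept": (not too_easy) and (not searchable) and correct,
        "solved_by": sum(solved),
        "searchable": searchable,
        "reviewer_ok": correct,
    }
```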

xbench-DeepSearch – Evaluating Professional Productivity (Tool Use).

The deep-search capability of AI Agents – involving planning (autonomously devising a strategy), information gathering (search), reasoning and analysis, and summarization – is one of the core skills on the path toward AGI, and it is also considerably harder to evaluate. Simple fact-based benchmarks (e.g., SimpleQA, Chinese SimpleQA) can evaluate information retrieval skills, but they fail to test an agent’s ability in autonomous planning and complex reasoning. Conversely, cutting-edge reasoning benchmarks like HLE or AIME are good at testing a model’s reasoning abilities, but are weaker at measuring planning and information-gathering skills. To better evaluate an Agent’s deep-search ability, we have developed and open-sourced the xbench-DeepSearch benchmark with the following characteristics:

  • It is tailored to the Chinese internet environment to minimize the influence of particular search sources on the results.
  • It is highly challenging, requiring the Agent to have an integrated, end-to-end ability to plan + search + reason + summarize.
  • All questions are manually written and cross-validated, ensuring the problems are novel and the answers are correct and unique, which also facilitates automated evaluation (see the checker sketch after this list).
  • It is continuously updated — we report the latest model performances on a monthly basis and update the evaluation set once every quarter.
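Because the answers are unique short strings, automated scoring can stay simple. The sketch below is our own illustration rather than the official xbench scorer; the normalization rules and the optional judge hook are assumptions for the sake of the example.

```python
# Minimal sketch of an automated checker for DeepSearch-style questions whose
# answers are unique short strings. Not the official xbench scorer.
import re
import unicodedata

def normalize(text: str) -> str:
    """Lowercase, unify unicode forms, and strip whitespace/punctuation so that
    superficially different but equivalent short answers compare equal."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"[\s\W_]+", "", text)

def score_answer(agent_answer: str, reference: str, judge=None) -> float:
    """Return 1.0 for a match, 0.0 otherwise. `judge` is an optional callable
    (e.g., an LLM asked whether two answers are equivalent) used when exact
    matching is too strict."""
    if normalize(agent_answer) == normalize(reference):
        return 1.0
    if judge is not None:
        return 1.0 if judge(agent_answer, reference) else 0.0
    return 0.0

# Example: both spellings of the same entity normalize to the same key.
assert score_answer("  Deep-Seek V2 ", "deepseek v2") == 1.0
```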

Looking ahead through 2025, we anticipate seeing further breakthroughs in both fundamental intelligence and professional productivity for AI. In our upcoming evaluations this year, we will pay special attention to questions such as:

  1. Multimodal reasoning & tool use: Can multimodal models (with images/video) that employ chain-of-thought reasoning generate commercial-grade videos? (Capabilities: multimodal understanding, reasoning, tool use)
  2. Reliability of tool-using Agents: Does the widespread use of MCP (Model Context Protocol) tools introduce reliability or trustworthiness issues? (Capabilities: tool use, reliability)
  3. Adaptability of GUI Agents: Can Agents with GUI-based tools effectively operate software applications that are dynamically updated or that they were never trained on? (Capabilities: tool use, test-time learning)

Evaluation Centered on Professional Work (Profession-Aligned)

Aligning evaluations with real-world tasks is a core imperative for AI assessment today. Here, we propose a methodology that centers on professional work to construct such evaluations.

Existing real-world benchmarks tend to be capability-centric – they try to cover a wide array of scenarios and domains to guide the development of general-purpose models. Such broad evaluations are very valuable for general AI progress. However, when deploying Agent applications, these systems usually need to solve tasks in specific vertical domains and often require a custom design for those verticals. In those cases, the relevance of generic evaluation results diminishes.

We have already seen high-quality domain-specific evaluations emerge in areas like coding, customer service, and medicine, which in turn have driven rapid progress and productization of AI Agents specialized in those professions. We believe profession-centered evaluations will quickly expand to many more fields and make up a rapidly growing share of mainstream AI evaluation as each domain recognizes the need for tailored assessments.

Designing evaluations centered on professional tasks means starting from the perspective of a human expert in that profession. We analyze the expert’s own workflow and thought process, and then construct tasks, execution environments, and validation methods that align with expert behavior. The process can be visualized as follows:

Multi-panel diagram showing profession selection, benchmark creation, and tracking goals

xbench’s Profession-Aligned benchmarks follow three core principles:

  1. Demand-defined evaluations: For a given profession, we construct the evaluation set by first mapping out its business workflow and task categories, then focusing on those tasks that are objectively evaluable. For aspects of the job that are not yet easily evaluated, we simulate or transform them into an evaluable format.
  2. Live task collection over time: The evaluation tasks are not created in one go as static “exam questions,” but are gradually accumulated from the day-to-day tasks of experts in the field. For tasks that evolve dynamically, we continuously source new evaluation content from real business workflows to ensure the benchmark stays as relevant as possible to current real-world demands.
  3. Value-driven targets: For each task, we record the time a human expert needs to complete it and use industry salary benchmarks to estimate its economic value. Each task is given a preset TMF target – once an AI Agent meets that performance/cost threshold, we stop adding difficulty to that task. In other words, the difficulty of Profession-Aligned evaluations is meant to match real-world requirements rather than escalate endlessly: the goal is to define the point at which an Agent is good enough to be commercially viable, not to keep chasing perfection (a sketch of this value calculation follows the list).
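As a back-of-the-envelope illustration of the value-driven target, the sketch below derives a task’s economic value and TMF threshold from expert time and salary. All figures and field names are hypothetical, not xbench data.

```python
# Illustrative sketch of a value-driven TMF target. All numbers are hypothetical;
# real targets would come from surveyed expert hours and salary benchmarks.
from dataclasses import dataclass

@dataclass
class TaskProfile:
    name: str
    expert_hours: float          # time a human expert needs per task instance
    expert_hourly_cost: float    # loaded hourly cost from salary benchmarks (USD)
    min_quality: float           # minimum acceptable benchmark score (0-1)

    @property
    def economic_value(self) -> float:
        """Value of one completed task instance, priced at expert labor cost."""
        return self.expert_hours * self.expert_hourly_cost

    def meets_tmf(self, agent_score: float, agent_cost: float) -> bool:
        """TMF target: quality at or above the bar, at a cost below the human baseline."""
        return agent_score >= self.min_quality and agent_cost < self.economic_value

# Hypothetical example: screening one batch of candidate resumes.
task = TaskProfile("resume screening", expert_hours=2.0, expert_hourly_cost=60.0, min_quality=0.85)
print(task.economic_value)                               # 120.0 USD per batch
print(task.meets_tmf(agent_score=0.9, agent_cost=4.0))   # True -> stop raising difficulty
```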

As an example, we designed xbench-Profession-Recruitment by focusing on the work of recruitment experts. We partnered with several top headhunting firms to outline how a recruiter’s weekly working hours are distributed across different tasks, and had experts assess the importance of each task. This yielded a structured breakdown of the domain’s tasks. We then aligned each task with its economic value (based on time and salary) and analyzed which tasks are feasible and measurable with current AI technology.

Table detailing recruiting tasks with feasibility, scorability, and economic value columns

For each individual task, we analyze its evaluability and technical feasibility. The first edition of xbench-Profession-Recruitment incorporates several categories of tasks, including parsing job description requirements, identifying candidate personas, filling in missing details of a candidate’s experience, understanding social and professional connections, and searching publicly available talent profiles.

Evergreen Evaluation (Continuous Benchmarking)

Every evaluation task and product has a life cycle. As mentioned, static benchmark sets face the problem of question leakage. The emergence of initiatives like LiveBench and LiveCodeBench – which use dynamically updated pools of questions – has helped alleviate leakage and overfitting issues. However, evaluating Agent applications raises new challenges.

First, Agent products themselves have life cycles: they evolve rapidly, continuously integrating new functions and improvements, and older versions may be taken offline. While we can compare different Agent products at the same point in time, we cannot directly compare capabilities across different points in time if the products or test sets have changed.

Second, the external environment an Agent interacts with is also changing dynamically. Even for the same evaluation question, if solving it requires using internet resources or tools with frequently updated content, the outcomes can differ over time.

Matrix showing evaluation rounds over time with product versions, partially filled cells

The table above shows the kind of results we get from live evaluations of Agents. From such results, we can obtain a ranking of different products tested in the same round. But due to adjustments in the evaluation environment and tasks over time, we are not capturing how an individual product’s capability grows between rounds. This leads us to an important question: How can we design metrics to track an Agent’s continuous capability growth, given that both the evaluation sets and models are constantly evolving?

Statistically, we can treat the incomplete score matrix accumulated across rounds as a matrix to be factorized, estimating each Agent version’s underlying capability. We apply Item Response Theory (IRT) to estimate an Agent’s capability in a way that is robust to changing test items. In IRT, an agent’s ability θ, a question’s difficulty b, and a question’s discrimination factor a are modeled such that the probability of a correct response is:

p = 1 / (1 + e^(−(θ − b) / a))

In this model, the probability p of solving a question falls between 0 and 1. A higher difficulty b lowers the probability of success, while a higher ability θ raises it. Questions with a larger discrimination factor a have a more gradual score curve as ability θ increases, meaning they differentiate performance across a wider range of ability levels.
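To make the estimation procedure concrete, here is a minimal sketch (our own illustration on synthetic data, not the xbench pipeline) that fits abilities and difficulties from a partially observed score matrix by gradient ascent on the Bernoulli log-likelihood over observed cells, with the discrimination factor fixed to a = 1 for brevity.

```python
# Minimal sketch of IRT-style estimation from a partially observed
# agent-by-question score matrix, using the parameterization above with a = 1.
# All data here is synthetic and for illustration only.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic ground truth: 6 agent versions, 40 questions, ~40% of cells observed.
true_theta = rng.normal(size=6)                   # agent abilities
true_b = rng.normal(size=40)                      # question difficulties
mask = rng.random((6, 40)) < 0.4                  # which (agent, question) pairs were actually run

def prob(theta, b):
    """p = sigmoid(theta - b): probability of a correct response with a = 1."""
    return 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))

scores = (rng.random((6, 40)) < prob(true_theta, true_b)).astype(float)

# Fit theta and b by gradient ascent on the log-likelihood over observed cells only.
theta = np.zeros(6)
b = np.zeros(40)
lr = 0.05
for _ in range(2000):
    p = prob(theta, b)
    grad = (scores - p) * mask                    # d(log-likelihood)/d(theta_i - b_j) per observed cell
    theta += lr * grad.sum(axis=1)
    b -= lr * grad.sum(axis=0)
    shift = b.mean()                              # abilities are only identified up to a shift,
    b -= shift                                    # so pin the mean difficulty to zero
    theta -= shift

# The recovered ordering of agent abilities should broadly track the true one.
print("recovered ranking:", np.argsort(-theta))
print("true ranking:     ", np.argsort(-true_theta))
```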

We used dynamically updated results from the OpenCompass project to validate the IRT approach. OpenCompass has a leaderboard that, since February 2024, has updated its question set every 1–3 months and published evaluation results. 

In the left figure below, we plot the scores of various models at each evaluation time; models from the same family are connected by lines of the same color. The raw leaderboard is excellent for showing models’ rankings at each evaluation point, but because the questions change, a model’s scores at different times are not directly comparable. However, when we use IRT to estimate an ability score for each model at each point in time, the continuous improvement trends become clear. For example, we observed a rapid jump in ability for Google’s Gemini models after October 2024, and two noticeable boosts corresponding to the DeepSeek R1 and V2 releases.

Two line charts comparing raw scores versus IRT capability scores for AI models over time
IRT methodology reveals continuous capability trends and breakthrough moments across evolving evaluations.

In future Agent evaluations, we will continue to report IRT-based capability scores for various products on each xbench evaluation set. This will let us observe over time not just who is ahead, but also each product’s rate of improvement and any signals of a capability breakthrough, beyond the rankings alone.

Evaluating an Agent’s Tech-Market Fit (TMF)

Besides capability, cost is a decisive factor for the real-world deployment of Agent applications. In practice, Inference Scaling — using more computation at inference time — can often boost a model or an Agent’s performance. This could mean running a model with a longer chain-of-thought (e.g., via reinforcement learning to extend reasoning steps) or performing more reasoning and summarization iterations to refine answers. These approaches can yield better results, but at the expense of greater computational cost and latency.

In real-world tasks, we must consider the return on investment of inference scaling. There is an optimal balance to be struck between cost, latency, and performance. Inspired by the ARC-AGI evaluation framework, for each xbench evaluation set, we plan to plot performance-cost curves: specifically, a demand curve on an effectiveness vs. cost chart, a human performance curve for reference, and an optimal supply curve representing the best current products.

On a benchmark’s score vs. cost plot, the upper-left region represents the market-acceptable zone (high performance at low cost), while the lower-right region represents the technically feasible zone (the performance current technology can actually deliver, often only at high cost). The cost of human labor defines one boundary of the market-acceptable zone. In the illustrative figure, the left diagram shows the situation before the technology is deployable – with no overlap between what is technically possible and what the market will pay for – while the middle diagram shows the situation after achieving TMF, where the two regions overlap. The overlapping area represents the incremental value created by AI.
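One rough way to formalize this overlap test is sketched below. It is not xbench’s published methodology: the supply curve, quality bar, and human cost are assumed stand-ins for the demand curve, optimal supply curve, and human-performance reference described above.

```python
# Sketch of a TMF overlap check on a score-vs-cost chart, with hypothetical curves.
import numpy as np

costs = np.linspace(1, 200, 400)                 # cost per task (arbitrary currency units)

def supply_score(cost):
    """Best score current agents achieve at a given cost (assumed curve: rises with inference spend)."""
    return 0.95 * (1 - np.exp(-cost / 60.0))

min_acceptable_score = 0.80                      # demand curve, simplified to a flat quality bar
human_cost = 120.0                               # expert cost per task; price ceiling for the market zone

# TMF holds where the technically feasible score meets the quality bar
# at a cost below the human baseline.
feasible = supply_score(costs) >= min_acceptable_score
affordable = costs <= human_cost
overlap = feasible & affordable

if overlap.any():
    c = costs[overlap]
    print(f"TMF reached: viable cost range {c.min():.0f}-{c.max():.0f}, "
          f"incremental value up to {human_cost - c.min():.0f} per task")
else:
    print("No TMF yet: the feasible and market-acceptable zones do not overlap")
```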

In AI scenarios that achieve TMF, human resources should be refocused on frontier tasks and those that remain unevaluable or unsolvable by AI, while AI handles more routine tasks. As AI and human labor have different scarcity and cost structures, the market will also recalibrate how it prices human contributions.

We believe each professional domain will experience three stages in this evolution:

Three diagrams showing stages of AI-human collaboration plotted on score versus cost axes
 Professional domains evolve through three stages: pre-TMF, human-AI collaboration, and expert-led AI specialization.
  1. Before TMF (No Fit): The technically feasible zone and the market-acceptable zone do not overlap. At this stage, an Agent application is merely a tool or a concept; it cannot truly deliver results or generate scalable value, and its impact on human work is minimal.
  2. Agent + Human Collaboration: The technically feasible and market-acceptable regions begin to overlap. In this overlapping zone, AI brings incremental value – for example, (1) delivering a service at lower cost than the cheapest human alternative, and (2) handling work that is repetitive or has only moderate quality requirements, thus boosting overall productivity. High-complexity, high-skill tasks (whether due to data scarcity or inherent difficulty) still need to be performed by humans; because such skills are scarce, the profits or savings gained from AI may be redirected to compensate this high-end human work.
  3. Specialized AI Agents: Domain experts actively build evaluation systems and guide Agent iteration. The role of the human expert shifts from delivering end results to constructing professional evaluations and training vertical Agents, which then provide services at scale. In this stage, the AI Agent has become a true specialist for the domain.

The transition from Stage 1 to Stage 2 is driven by AI technology breakthroughs and the scaling of compute and data. The transition from Stage 2 to Stage 3 depends on experts who deeply understand the vertical domain’s needs, standards, and accumulated experience. It’s worth noting that in some fields, AI might introduce entirely new ways of meeting demand, altering existing workflows and the structure of production and labor.

Ultimately, AI is likely to bring about shifts in value and changes in the labor structure. We believe that thanks to more efficient productivity and new business models, overall societal welfare will increase – even as roles and tasks evolve.