⏱️OS-Marathon

Benchmarking Computer-Use Agents on Long-Horizon Repetitive Tasks

Jing Wu1, Daphne Barretto2*, Yiye Chen3, Nicholas Gydé2*, Yanan Jian2, Yuhang He2‡, Vibhav Vineet2
1University of Oxford   2Microsoft   3Georgia Institute of Technology
*Work completed during an internship at Microsoft
‡Work completed during employment at Microsoft
The code and data are under internal review and will be released soon. Stay tuned! 😊
Motivation
Motivation figure: a traveler returns with a pile of receipts (food, hotel, flight, taxi, toll, parking, fuel, car rental) that must be entered, one by one, into an expense system.

🤔 Can computer-use agents help with these long, repetitive tasks?

OS-Marathon targets long-horizon, repetitive tasks.

✨ News & Updates


  • 1/29/2026: Paper released on arXiv.
  • 1/28/2026: Website released.

💡 Key Takeaways


We identify three key challenges for computer-use agents on long-horizon, repetitive tasks:

Findings

  1. Logical Incoherence: Agents fail to comprehend the underlying logic of sub-workflows, often executing tasks in an incorrect sequence.
  2. Hallucination: Agents frequently hallucinate when attempting to populate system fields.
  3. Long-horizon Inconsistency: Agents fail to plan the full iterative trajectory required to complete the overall workflow.

📑 Abstract


Long-horizon, repetitive workflows are common in professional settings, such as processing expense reports from receipts and entering student grades from exam papers. These tasks are often tedious for humans, since their length grows with the amount of data to process. However, they are well suited to Computer-Use Agents (CUAs) due to their structured, recurring sub-workflows, whose logic can be systematically learned. Identifying the absence of an evaluation benchmark as a primary bottleneck, we establish OS-Marathon, comprising 242 long-horizon, repetitive tasks across 2 domains to evaluate state-of-the-art (SOTA) agents. We then introduce a cost-effective method that constructs a condensed demonstration using only few-shot examples to teach agents the underlying workflow logic, enabling them to execute similar workflows effectively on larger, unseen data collections. Extensive experiments demonstrate both the inherent challenges of these tasks and the effectiveness of our proposed method.

🚀 Task Workflow Visualization


We visualize the execution process of two tasks from the expense report and transcript domains to illustrate the complete workflow. Use the controls to switch between examples.

📊 Statistics of the Benchmark


  • 2 domains of daily long-horizon, repetitive tasks
  • 7 difficulty levels
  • 242 long-horizon, repetitive tasks
  • 7 fully functional execution environments

Detailed Statistics

Expense Reporting (5 execution environments; difficulty scales with the number of receipts and whether documents span multiple pages):
  • L1: 5 receipts, 30 tasks
  • L2: 5 receipts with multi-page documents, 30 tasks
  • L3: ~15 receipts, 50 tasks
  • L4: ~30 receipts, 50 tasks

Transcript Recording (2 execution environments; difficulty scales with the number of document columns and pages):
  • L1: single-column, single-page documents, 18 tasks
  • L2: two-column, single-page documents, 30 tasks
  • L3: multi-page documents, 34 tasks
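To make the statistics above concrete, the snippet below sketches how a single benchmark task instance might be parameterized. The schema is hypothetical: the class and field names are illustrative assumptions that only mirror the difficulty criteria listed above, not the benchmark's actual task format.

# Hypothetical task specification mirroring the difficulty criteria above.
# The benchmark's real task format may differ; this is for illustration only.
from dataclasses import dataclass

@dataclass
class MarathonTask:
    domain: str            # "expense_reporting" or "transcript_recording"
    level: str             # "L1" ... "L4"
    environment: str       # e.g. a website expense system or a spreadsheet
    num_items: int         # receipts to file, or document entries to record
    multi_page_docs: bool  # whether source documents span multiple pages

# Example: a Level-3 expense-reporting task with ~15 receipts to enter.
example_task = MarathonTask(
    domain="expense_reporting",
    level="L3",
    environment="website_expense_system",
    num_items=15,
    multi_page_docs=False,  # illustrative value only
)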

🖥️ Execution Environment Visualization


We visualize the execution environment for the tasks. Note that this illustration is for conceptual purposes only, and the data shown may differ from the data used in practice.

🎯 Synthetic Data Generation Visualization


We visualize the synthetic data generation process.

Synthetic data pipeline.

🗂️ Data Sample Visualization


We visualize samples of the data produced by our synthetic data generation pipeline.

🔍 Method


We propose Few-shot Condensed Workflow Demonstration (FCWD), which constructs a condensed human demonstration from few-shot data by abstracting the workflow into its key steps, thereby guiding the CUA's reasoning.

Teaching a CUA the long-horizon, repetitive workflow with a condensed demonstration built from few-shot examples.
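To make the idea concrete, here is a minimal Python sketch (not the released implementation) of how a condensed demonstration could be assembled from a few annotated example trajectories and prepended to the agent's instruction. All class, function, and field names below are illustrative assumptions.

# Hypothetical sketch of Few-shot Condensed Workflow Demonstration (FCWD).
# Names and structure are illustrative assumptions, not the released code.
from dataclasses import dataclass
from typing import List

@dataclass
class DemoStep:
    description: str   # high-level intent, e.g. "Open the next receipt PDF"
    action: str        # concrete UI action, e.g. "click(button='New Expense')"

def condense_workflow(examples: List[List[DemoStep]]) -> List[str]:
    """Abstract a few demonstrated sub-workflow trajectories into key steps,
    keeping the high-level intent and dropping instance-specific values."""
    template = examples[0]  # stand-in for the real abstraction step
    return [step.description for step in template]

def build_prompt(task_instruction: str, key_steps: List[str]) -> str:
    """Prepend the condensed demonstration so the agent repeats the same
    sub-workflow for every remaining item instead of re-deriving the logic."""
    demo = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(key_steps))
    return (
        f"{task_instruction}\n\n"
        "Repeat this sub-workflow once per remaining item:\n"
        f"{demo}\n"
        "Continue until every item has been processed."
    )

The design point this sketch illustrates is that the demonstration describes a single sub-workflow iteration, so it stays short even when the task loops over dozens of receipts or grade rows.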

✏️ Qualitative Results


We provide qualitative results for several baseline CUAs on our benchmark. The agent's predicted action is shown in the top-right corner of each video.

📐 Quantitative Results


We compare the performance of the agents and a human operator on our benchmark, evaluating sub-workflow accuracy (SWA) and success rate (SR) on Level 1 and 2 tasks within the expense report and transcript domains.
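For intuition, here is a minimal Python sketch of one plausible way to compute these two metrics. The benchmark's official checker may define them differently, so treat the definitions below (SWA as the fraction of sub-workflows, e.g. individual receipts or grade rows, handled correctly; SR as requiring every sub-workflow in a task to be correct) as assumptions.

# Hedged sketch of sub-workflow accuracy (SWA) and success rate (SR).
# The benchmark's actual scoring rules may differ.
from typing import List

def sub_workflow_accuracy(completed_ok: List[bool]) -> float:
    """Fraction of sub-workflows (e.g. one receipt, one grade row) done correctly."""
    return sum(completed_ok) / len(completed_ok) if completed_ok else 0.0

def success_rate(per_task_results: List[List[bool]]) -> float:
    """A task counts as a success only if all of its sub-workflows are correct."""
    successes = sum(1 for task in per_task_results if task and all(task))
    return successes / len(per_task_results) if per_task_results else 0.0

# Toy example: 4 tasks with 5 receipts each.
results = [
    [True, True, True, True, True],       # fully correct -> counts toward SR
    [True, True, False, True, True],
    [False, False, False, False, False],
    [True, True, True, True, False],
]
print(success_rate(results))                                           # 0.25
print(sub_workflow_accuracy([ok for task in results for ok in task]))  # 0.65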

Quantitative comparison of sub-workflow accuracy (SWA). We compare a human operator, baseline agents and our agent on Level 1 and 2 tasks within the expense report domain. Human success rates in the website environment (at steps 50, 100, 150, and 200) are 0.00%, 12.50%, 75.00%, and 75.00%. In the spreadsheet environment, human success rates are 0.00%, 0.00%, 62.50%, and 75.00%. All agents (baselines and ours) achieve a 0% success rate across both environments.
Agents | Step 50 | Step 100 | Step 150 | Step 200
Website System Environment
Human | 30.00% | 70.00% | 95.00% | 95.00%
OpenCUA-7B | 0.00% | 0.00% | 0.00% | 0.00%
UI-TARS-1.5-7B | 0.00% | 0.00% | 0.00% | 0.00%
AgentS2.5 + GPT5 | 0.00% | 5.00% | 5.00% | 5.00%
AgentS2.5 + GPT5 w/ FCWD | 12.50% | 30.00% | 37.50% | 37.50%
Spreadsheet Environment
Human | 35.00% | 77.50% | 97.50% | 97.50%
OpenCUA-7B | 0.00% | 0.00% | 0.00% | 0.00%
UI-TARS-1.5-7B | 0.00% | 0.00% | 0.00% | 0.00%
AgentS2.5 + GPT5 | 0.00% | 5.00% | 12.50% | 12.50%
AgentS2.5 + GPT5 w/ FCWD | 5.00% | 20.00% | 25.00% | 25.00%
Quantitative comparison of sub-workflow accuracy (SWA) and success rate (SR). A human operator, baseline agents and our agent are compared on Level 1 and 2 tasks within the transcript domain.
Agents | Step | Sub-Workflow Accuracy | SR
Difficulty Level 1
Human | 50 | 75.00% | 50.00%
Human | 100 | 100.00% | 100.00%
OpenCUA-7B | 50 | 0.00% | 0.00%
OpenCUA-7B | 100 | 0.00% | 0.00%
UI-TARS-1.5-7B | 50 | 0.00% | 0.00%
UI-TARS-1.5-7B | 100 | 0.00% | 0.00%
AgentS2.5 + GPT5 | 50 | 19.94% | 0.00%
AgentS2.5 + GPT5 | 100 | 27.08% | 25.00%
AgentS2.5 + GPT5 w/ FCWD | 50 | 66.29% | 25.00%
AgentS2.5 + GPT5 w/ FCWD | 100 | 91.74% | 50.00%
Difficulty Level 2
Human | 50 | 27.38% | 0.00%
Human | 100 | 53.04% | 25.00%
Human | 150 | 69.18% | 25.00%
Human | 200 | 86.10% | 50.00%
OpenCUA-7B | 50 | 0.00% | 0.00%
OpenCUA-7B | 100 | 0.00% | 0.00%
OpenCUA-7B | 150 | 0.00% | 0.00%
OpenCUA-7B | 200 | 0.00% | 0.00%
UI-TARS-1.5-7B | 50 | 0.00% | 0.00%
UI-TARS-1.5-7B | 100 | 0.00% | 0.00%
UI-TARS-1.5-7B | 150 | 0.00% | 0.00%
UI-TARS-1.5-7B | 200 | 0.00% | 0.00%
AgentS2.5 + GPT5 | 50 | 5.88% | 0.00%
AgentS2.5 + GPT5 | 100 | 17.65% | 0.00%
AgentS2.5 + GPT5 | 150 | 22.06% | 0.00%
AgentS2.5 + GPT5 | 200 | 23.53% | 0.00%
AgentS2.5 + GPT5 w/ FCWD | 50 | 10.61% | 0.00%
AgentS2.5 + GPT5 w/ FCWD | 100 | 25.08% | 0.00%
AgentS2.5 + GPT5 w/ FCWD | 150 | 38.78% | 0.00%
AgentS2.5 + GPT5 w/ FCWD | 200 | 42.05% | 0.00%

🙏 Acknowledgements


This work was conducted during an internship at Microsoft; we thank Microsoft Research (MSR) and Windows Cloud Experience (WCX) for their support. We also thank the authors of the OSWorld benchmark for their open-source infrastructure, which served as a foundation for this project.

📚 Citation


@article{wu2026OS-Marathon,
  title={OS-Marathon: Benchmarking Computer-Use Agents on Long-Horizon Repetitive Tasks},
  author={Wu, Jing and Barretto, Daphne and Chen, Yiye and Gydé, Nicholas and Jian, Yanan and He, Yuhang and Vineet, Vibhav},
  journal={arXiv},
  year={2026}
}

OS-Marathon © 2025 All rights reserved.