Publications | Jiuzhou Han

2025

ICLR
Agent S: An Open Agentic Framework that Uses Computers Like a Human

Saaket Agashe*, Jiuzhou Han*, Shuyu Gan, and 3 more authors

In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025

We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI), aimed at transforming human-computer interaction by automating complex, multi-step tasks. Agent S aims to address three key challenges in automating computer tasks: acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic, non-uniform interfaces. To this end, Agent S introduces experience-augmented hierarchical planning, which learns from external knowledge search and internal experience retrieval at multiple levels, facilitating efficient task planning and subtask execution. In addition, it employs an Agent-Computer Interface (ACI) to better elicit the reasoning and control capabilities of GUI agents based on Multimodal Large Language Models (MLLMs). Evaluation on the OSWorld benchmark shows that Agent S outperforms the baseline by 9.37 percentage points in success rate (an 83.6% relative improvement) and achieves a new state of the art. Comprehensive analysis highlights the effectiveness of individual components and provides insights for future improvements. Furthermore, Agent S demonstrates broad generalizability to different operating systems on the newly released WindowsAgentArena benchmark.
@inproceedings{agashe2024agentsopenagentic,
  title = {Agent S: An Open Agentic Framework that Uses Computers Like a Human},
  author = {Agashe*, Saaket and Han*, Jiuzhou and Gan, Shuyu and Yang, Jiachen and Li, Ang and Wang, Xin Eric},
  booktitle = {The Thirteenth International Conference on Learning Representations, {ICLR} 2025, Singapore, April 24-28, 2025},
  year = {2025},
  month = apr,
}
EMNLP
VerifiAgent: a Unified Verification Agent in Language Model Reasoning

Jiuzhou Han, Wray Buntine, and Ehsan Shareghi

In Findings of the Association for Computational Linguistics: EMNLP 2025, Nov 2025

Large language models demonstrate remarkable reasoning capabilities but often produce unreliable or incorrect responses. Existing verification methods are typically model-specific or domain-restricted, requiring significant computational resources and lacking scalability across diverse reasoning tasks. To address these limitations, we propose VerifiAgent, a unified verification agent that integrates two levels of verification: meta-verification, which assesses completeness and consistency in model responses, and tool-based adaptive verification, where VerifiAgent autonomously selects appropriate verification tools based on the reasoning type, including mathematical, logical, or commonsense reasoning. This adaptive approach ensures both efficiency and robustness across different verification scenarios. Experimental results show that VerifiAgent outperforms baseline verification methods (e.g., deductive verifier, backward verifier) across all reasoning tasks. Additionally, it can further enhance reasoning accuracy by leveraging feedback from verification results. VerifiAgent can also be effectively applied to inference scaling, achieving better results with fewer generated samples and lower costs than existing process reward models in the mathematical reasoning domain.
@inproceedings{han-etal-2025-verifiagent,
  title = {{V}erifi{A}gent: a Unified Verification Agent in Language Model Reasoning},
  author = {Han, Jiuzhou and Buntine, Wray and Shareghi, Ehsan},
  editor = {Christodoulopoulos, Christos and Chakraborty, Tanmoy and Rose, Carolyn and Peng, Violet},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2025},
  month = nov,
  year = {2025},
  address = {Suzhou, China},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2025.findings-emnlp.891/},
  pages = {16410--16431},
  isbn = {979-8-89176-335-7}
}
AAAI
Uncertainty-Based Methods for Automated Process Reward Data Construction and Output Aggregation in Mathematical Reasoning

Jiuzhou Han, Wray Buntine, and Ehsan Shareghi

Aug 2025

Large language models have demonstrated remarkable capabilities in complex mathematical reasoning tasks, but they inevitably generate errors throughout multi-step solutions. Process-level Reward Models (PRMs) have shown great promise by providing supervision and evaluation at each intermediate step, thereby effectively improving the models’ reasoning abilities. However, training effective PRMs requires high-quality process reward data, yet existing methods for constructing such data are often labour-intensive or inefficient. In this paper, we propose an uncertainty-driven framework for automated process reward data construction, encompassing both data generation and annotation processes for PRMs. Additionally, we identify the limitations of both majority vote and PRMs, and introduce two generic uncertainty-aware output aggregation methods: Hybrid Majority Reward Vote and Weighted Reward Frequency Vote, which combine the strengths of majority vote with PRMs. Extensive experiments on ProcessBench, MATH, and GSMPlus show the effectiveness and efficiency of the proposed PRM data construction framework, and demonstrate that the two output aggregation methods further improve the mathematical reasoning abilities across diverse PRMs.
@misc{han2025uncertaintybasedmethods,
  title = {Uncertainty-Based Methods for Automated Process Reward Data Construction and Output Aggregation in Mathematical Reasoning},
  author = {Han, Jiuzhou and Buntine, Wray and Shareghi, Ehsan},
  year = {2025},
  month = aug,
  eprint = {2508.01773},
  archiveprefix = {arXiv},
  primaryclass = {cs.AI},
}

2024

EACL
Reward Engineering for Generating Semi-structured Explanation

Jiuzhou Han, Wray L. Buntine, and Ehsan Shareghi

In Findings of the Association for Computational Linguistics: EACL 2024, St. Julian’s, Malta, March 17-22, 2024

Semi-structured explanation depicts the implicit process of a reasoner with an explicit representation. This explanation highlights how available information in a specific query is utilised and supplemented with information a reasoner produces from its internal weights towards generating an answer. Despite the recent improvements in generative capabilities of language models, producing structured explanations to verify a model’s true reasoning capabilities remains a challenge. This issue is particularly pronounced for not-so-large LMs (e.g., FLAN-T5-XXL). In this work, we first underscore the limitations of supervised fine-tuning (SFT) in tackling this challenge, and then introduce a carefully crafted reward engineering method in reinforcement learning (RL) to better address this problem. We investigate multiple reward aggregation methods and provide a detailed discussion which sheds light on the promising potential of RL for future research. Our proposed method on two semi-structured explanation generation benchmarks (ExplaGraph and COPA-SSE) achieves new state-of-the-art results.
@inproceedings{DBLP:conf/eacl/HanBS24,
  author = {Han, Jiuzhou and Buntine, Wray L. and Shareghi, Ehsan},
  editor = {Graham, Yvette and Purver, Matthew},
  title = {Reward Engineering for Generating Semi-structured Explanation},
  booktitle = {Findings of the Association for Computational Linguistics: {EACL} 2024, St. Julian's, Malta, March 17-22, 2024},
  pages = {589--602},
  publisher = {Association for Computational Linguistics},
  year = {2024},
  timestamp = {Tue, 02 Apr 2024 16:32:10 +0200},
  biburl = {https://dblp.org/rec/conf/eacl/HanBS24.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org},
}
ACL
PiVe: Prompting with Iterative Verification Improving Graph-based Generative Capability of LLMs

Jiuzhou Han, Nigel Collier, Wray Buntine, and 1 more author

In Findings of the Association for Computational Linguistics ACL 2024, Aug 2024

Large language models (LLMs) have shown great ability in solving various natural language tasks in different domains. Due to the training objective of LLMs and their pre-training data, LLMs are not very well equipped for tasks involving structured data generation. We propose a framework, Prompting with Iterative Verification (PiVe), to improve the graph-based generative capability of LLMs. We show how a small language model could be trained to act as a verifier module for the output of an LLM (i.e., ChatGPT, GPT-4), and to iteratively improve its performance via fine-grained corrective instructions. We also show how the verifier module could apply iterative corrections offline for a more cost-effective solution to the text-to-graph generation task. Experiments on three graph-based datasets show consistent improvement gained via PiVe. Additionally, we create GenWiki-HIQ and highlight that the verifier module can be used as a data augmentation tool to help improve the quality of automatically generated parallel text-graph datasets.
@inproceedings{han-etal-2024-pive,
  title = {PiVe: Prompting with Iterative Verification Improving Graph-based Generative Capability of LLMs},
  author = {Han, Jiuzhou and Collier, Nigel and Buntine, Wray and Shareghi, Ehsan},
  editor = {Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek},
  booktitle = {Findings of the Association for Computational Linguistics ACL 2024},
  month = aug,
  year = {2024},
  address = {Bangkok, Thailand and virtual meeting},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2024.findings-acl.400},
  pages = {6702--6718},
}
ACL
Towards Uncertainty-Aware Language Agent

Jiuzhou Han, Wray Buntine, and Ehsan Shareghi

In Findings of the Association for Computational Linguistics ACL 2024, Aug 2024

While Language Agents have achieved promising success by placing Large Language Models at the core of a more versatile design that dynamically interacts with the external world, the existing approaches neglect the notion of uncertainty during these interactions. We present the Uncertainty-Aware Language Agent (UALA), a framework that orchestrates the interaction between the agent and the external world using uncertainty quantification. Compared with other well-known counterparts like ReAct, our extensive experiments across 3 representative tasks (HotpotQA, StrategyQA, MMLU) and various LLM sizes demonstrate that UALA brings a significant improvement of performance, while having a substantially lower reliance on the external world (i.e., reduced number of tool calls and tokens). Our analyses provide various insights including the great potential of UALA compared with agent fine-tuning, and underscore the unreliability of verbalised confidence of LLMs as a proxy for uncertainty.
@inproceedings{han-etal-2024-towards,
  title = {Towards Uncertainty-Aware Language Agent},
  author = {Han, Jiuzhou and Buntine, Wray and Shareghi, Ehsan},
  editor = {Ku, Lun-Wei and Martins, Andre and Srikumar, Vivek},
  booktitle = {Findings of the Association for Computational Linguistics ACL 2024},
  month = aug,
  year = {2024},
  address = {Bangkok, Thailand and virtual meeting},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2024.findings-acl.398},
  pages = {6662--6685},
}
arXiv
Strategies for Improving NL-to-FOL Translation with LLMs: Data Generation, Incremental Fine-Tuning, and Verification

Ramya Keerthy Thatikonda, Jiuzhou Han, Wray Buntine, and 1 more author

Sep 2024

Logical reasoning is a fundamental task in natural language processing that presents significant challenges to Large Language Models (LLMs). The inherent characteristics of logical reasoning make it well-suited for symbolic representations such as first-order logic (FOL). Research in symbolic logical reasoning has explored FOL generation using state-of-the-art LLMs (i.e., GPT-4) to produce FOL translations of natural language (NL) statements, but errors in translation are usually not the focus. We address this by categorizing the translation errors in FOL statements generated by LLMs, specifically for deductive logical reasoning tasks. In order to make progress towards improving the quality of FOL translations for smaller language models such as LLaMA-2 13B and Mistral 7B, we create PROOFFOL, a high-quality FOL-annotated subset of the ProofWriter dataset using GPT-4o. The models finetuned on this silver standard data achieve a significant gain in performance when compared to larger language models such as LLaMA-2 70B. In addition to improving the model using large data, we also tackle the issue of data scarcity and introduce an incremental framework encompassing data augmentation and verification steps. In the augmentation process, a single pair of (premises, conclusion) is split into multiple new instances based on the predicates and FOLs. This data is used for fine-tuning, and inference with the resulting model generates FOLs with fewer errors than the model trained on the original data. Our investigation of the translation errors leads to the generation of a perturbation dataset consisting of simulated NL-to-FOL translation errors and their corresponding corrections. This data is used to train a verifier, which corrects potential syntactic and semantic FOL translation errors. We demonstrate an efficient method for making the most of a limited existing human-annotated dataset. Our results show state-of-the-art performance for ProofWriter and ProntoQA datasets using PROOFFOL on LLaMA-2 and Mistral models.
@misc{thatikonda2024strategiesimprovingnltofoltranslation,
  title = {Strategies for Improving NL-to-FOL Translation with LLMs: Data Generation, Incremental Fine-Tuning, and Verification},
  author = {Thatikonda, Ramya Keerthy and Han, Jiuzhou and Buntine, Wray and Shareghi, Ehsan},
  year = {2024},
  month = sep,
  eprint = {2409.16461},
  archiveprefix = {arXiv},
  primaryclass = {cs.CL},
}

2023

EMNLP
POSQA: Probe the World Models of LLMs with Size Comparisons

Chang Shu*, Jiuzhou Han*, Fangyu Liu, and 2 more authors

In Findings of the Association for Computational Linguistics: EMNLP 2023, Dec 2023

Embodied language comprehension emphasizes that language understanding is not solely a matter of mental processing in the brain but also involves interactions with the physical and social environment. With the explosive growth of Large Language Models (LLMs) and their already ubiquitous presence in our daily lives, it is becoming increasingly necessary to verify their real-world understanding. Inspired by cognitive theories, we propose POSQA: a Physical Object Size Question Answering dataset with simple size comparison questions to examine the extremity and analyze the potential mechanisms of the embodied comprehension of the latest LLMs. We show that even the largest LLMs today perform poorly under the zero-shot setting. We then push their limits with advanced prompting techniques and external knowledge augmentation. Furthermore, we investigate whether their real-world comprehension primarily derives from contextual information or internal weights, and analyse the impact of prompt formats and the reporting bias of different objects. Our results show that the real-world understanding that LLMs have shaped from textual data can be vulnerable to deception and confusion by the surface form of prompts, which makes it less aligned with human behaviours.
@inproceedings{shu-etal-2023-posqa,
  title = {POSQA: Probe the World Models of {LLM}s with Size Comparisons},
  author = {Shu*, Chang and Han*, Jiuzhou and Liu, Fangyu and Shareghi, Ehsan and Collier, Nigel},
  editor = {Bouamor, Houda and Pino, Juan and Bali, Kalika},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2023},
  month = dec,
  year = {2023},
  address = {Singapore},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2023.findings-emnlp.504},
  pages = {7518--7531},
}

2022

EMNLP
Self-supervised Graph Masking Pre-training for Graph-to-Text Generation

Jiuzhou Han and Ehsan Shareghi

In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7-11, 2022

Large-scale pre-trained language models (PLMs) have advanced Graph-to-Text (G2T) generation by processing the linearised version of a graph. However, the linearisation is known to ignore the structural information. Additionally, PLMs are typically pre-trained on free text, which introduces a domain mismatch between pre-training and downstream G2T generation tasks. To address these shortcomings, we propose graph masking pre-training strategies that neither require supervision signals nor adjust the architecture of the underlying pre-trained encoder-decoder model. When used with a pre-trained T5, our approach achieves new state-of-the-art results on WebNLG+2020 and EventNarrative G2T generation datasets. Our method also proves to be very effective in the low-resource setting.
@inproceedings{DBLP:conf/emnlp/HanS22,
  author = {Han, Jiuzhou and Shareghi, Ehsan},
  editor = {Goldberg, Yoav and Kozareva, Zornitsa and Zhang, Yue},
  title = {Self-supervised Graph Masking Pre-training for Graph-to-Text Generation},
  booktitle = {Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, {EMNLP} 2022, Abu Dhabi, United Arab Emirates, December 7-11},
  pages = {4845--4853},
  publisher = {Association for Computational Linguistics},
  year = {2022},
  url = {https://aclanthology.org/2022.emnlp-main.321},
  timestamp = {Tue, 07 Feb 2023 17:10:51 +0100},
  biburl = {https://dblp.org/rec/conf/emnlp/HanS22.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org},
}

2021

INLG
Generating Diverse Descriptions from Semantic Graphs

Jiuzhou Han, Daniel Beck, and Trevor Cohn

In Proceedings of the 14th International Conference on Natural Language Generation, INLG 2021, Aberdeen, Scotland, UK, 20-24 September 2021

Text generation from semantic graphs is traditionally performed with deterministic methods, which generate a unique description given an input graph. However, the generation problem admits a range of acceptable textual outputs, exhibiting lexical, syntactic and semantic variation. To address this disconnect, we present two main contributions. First, we propose a stochastic graph-to-text model, incorporating a latent variable in an encoder-decoder model, and its use in an ensemble. Second, to assess the diversity of the generated sentences, we propose a new automatic evaluation metric which jointly evaluates output diversity and quality in a multi-reference setting. We evaluate the models on WebNLG datasets in English and Russian, and show that an ensemble of stochastic models produces diverse sets of generated sentences while retaining similar quality to state-of-the-art models.
@inproceedings{DBLP:conf/inlg/HanBC21,
  author = {Han, Jiuzhou and Beck, Daniel and Cohn, Trevor},
  editor = {Belz, Anya and Fan, Angela and Reiter, Ehud and Sripada, Yaji},
  title = {Generating Diverse Descriptions from Semantic Graphs},
  booktitle = {Proceedings of the 14th International Conference on Natural Language Generation, {INLG} 2021, Aberdeen, Scotland, UK, 20-24 September},
  pages = {1--11},
  publisher = {Association for Computational Linguistics},
  year = {2021},
  url = {https://aclanthology.org/2021.inlg-1.1},
  timestamp = {Mon, 25 Oct 2021 15:03:55 +0200},
  biburl = {https://dblp.org/rec/conf/inlg/HanBC21.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org},
}