Shuzheng Si

Ph.D. student, Tsinghua University

Hi, I’m Shuzheng Si, a first-year Ph.D. student in the Department of Computer Science and Technology at Tsinghua University. I am lucky to be advised by Prof. Maosong Sun and affiliated with the TsinghuaNLP Lab. Previously, I obtained my master’s degree from Peking University, where I was fortunate to be part of the PKU NLP Group under the supervision of Prof. Baobao Chang at the Institute of Computational Linguistics. My research interests lie in Natural Language Processing (NLP) and Large Language Models (LLMs), with a particular focus on Data-Centric Methods and Data Science for NLP, including:

  • πŸ’‘ Scientific Data Acquisition for Post-Training (πŸ•΅πŸ» Better Understanding of Data): My first line of research aims to synthesize and organize data more scientifically (CANOE, MMICL, UltraIF, UltraEdit) and to ensure data quality automatically (NOVA, GATEAU, NUGGETS), thereby further advancing the capabilities of LLMs. Designing more effective data acquisition methods also yields deeper insights into the role of data in LLM training: for instance, when enhancing a specific capability of an LLM, what characteristics must the selected data possess to count as high-quality and to maximize data efficiency? (A toy sketch of this selection idea appears after this list.)
  • πŸš€ Data-Efficient Tuning Methods for Post-Training (πŸ§‘πŸ»β€πŸ”¬ Better Utilization of Data): Current LLMs are trained on vast amounts of data, which significantly drives up training costs. This line of research tries to use the supervision in training data more efficiently, for example by leveraging noisy data collected from real-world scenarios to train models (SpokenWOZ, SANTA, CENSOR, SCL-RAI) and by maximizing the positive impact of limited data on models (ALSACE, LACING); see the second sketch after this list. Ultimately, the goal is data-efficient artificial intelligence: reaching the next level of intelligence (perhaps what people call AGI) at minimal cost.
  • 🌏 Trustworthy and Helpful Language Processing Engine (πŸ§‘πŸ»β€πŸ« Applying My Research to Real-World Consumer Products): I also spend some time applying my research to the development of LingoWhale, created by DeepLang AI. LingoWhale aims to build a next-generation language processing engine that helps users access more valuable information in less time through subscription, aggregation, and summarization (more details in this Chinese-language blog post). It is characterized by an exceptionally low hallucination rate, ensuring an accurate and comprehensive representation of every viewpoint in the source text. To date, it has provided intelligent text-processing services to hundreds of thousands of Chinese users.
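
As a toy illustration of the data-selection idea in the first bullet, the sketch below scores candidate post-training samples and keeps only the top fraction. The scoring heuristic is a hypothetical stand-in chosen for exposition; it is not the actual criterion used in NOVA, GATEAU, or NUGGETS.

```python
# Toy sketch: rank candidate post-training samples by a quality score
# and keep the top fraction. The scorer below is a hypothetical
# heuristic for illustration, not the method of NOVA/GATEAU/NUGGETS.

def quality_score(sample: dict) -> float:
    """Hypothetical scorer: reward longer responses that echo the instruction."""
    instruction = sample["instruction"].lower().split()
    response = sample["response"]
    length_bonus = min(len(response.split()) / 100.0, 1.0)
    overlap_bonus = sum(w in response.lower() for w in instruction) / max(len(instruction), 1)
    return length_bonus + overlap_bonus

def select_top_fraction(pool: list[dict], keep: float = 0.3) -> list[dict]:
    """Keep the highest-scoring `keep` fraction of the candidate pool."""
    ranked = sorted(pool, key=quality_score, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep))]

if __name__ == "__main__":
    pool = [
        {"instruction": "Explain overfitting.",
         "response": "Overfitting occurs when a model memorizes training noise ..."},
        {"instruction": "Explain overfitting.", "response": "idk"},
    ]
    print(select_top_fraction(pool, keep=0.5))
```

In practice the heuristic scorer would be replaced by a learned signal (e.g., model loss or reward-model scores), but the select-then-train loop stays the same.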
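For the noisy-supervision theme in the second bullet, one common generic trick is to softly downweight high-loss (likely mislabeled) samples when aggregating the batch loss. The snippet below is a minimal PyTorch sketch of that generic idea, not the specific mechanism of SANTA, CENSOR, or SCL-RAI.

```python
# Minimal sketch: softly downweight likely-noisy (high-loss) samples.
# Generic loss reweighting for illustration; not the specific method
# of SANTA, CENSOR, or SCL-RAI.

import torch
import torch.nn.functional as F

def reweighted_loss(per_sample_loss: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Weight each sample by softmax(-loss / T), so high-loss samples
    contribute less to the batch objective."""
    weights = torch.softmax(-per_sample_loss.detach() / temperature, dim=0)
    return (weights * per_sample_loss).sum()

# Usage: compute per-sample losses with reduction="none", then aggregate.
logits = torch.randn(4, 10, requires_grad=True)  # batch of 4, 10 classes
labels = torch.randint(0, 10, (4,))
loss = reweighted_loss(F.cross_entropy(logits, labels, reduction="none"))
loss.backward()
```
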
My long-term research goal is to elucidate the influence of data on LLMs and to use these insights to guide the organization, selection, and synthesis of high-quality data, thereby enhancing the capabilities of LLMs to build the next-generation language processing engine. Recently, I have been particularly interested in investigating LLMs’ hallucinations from a data perspective and mitigating them through data-driven methods. Feel free to drop me an email if you are interested in connecting. πŸ•ŠπŸ•ŠπŸ•Š


Education
  • Tsinghua University
    Ph.D. in Computer Science and Technology, Sep. 2024 - Jul. 2028 (expected)

  • Peking University
    M.S. in Software Engineering, Sep. 2021 - Jul. 2024

  • Yunnan University
    B.S. in Information Security (Rank: 1/300+), Sep. 2017 - Jul. 2021

Honors & Awards
  • Merit Student 2022
  • Top 10 Outstanding Students Nomination Award (Ranked 1st) 2020
  • First-Class Scholarship 2020
  • National Scholarship 2019
  • Provincial Scholarship 2018
  • Provincial Merit Student 2018
Experience
  • DeepLang AI
    Research Intern, Apr. 2024 - Present

  • Alibaba DAMO Academy
    Research Intern, Jun. 2022 - Jun. 2023

  • SenseTime Research
    Research Intern, Jul. 2021 - Feb. 2022

Service
  • NLP Research Communities: Reviewer of ACL, EMNLP, NAACL, COLING, and TASLP
  • ML Research Communities: Reviewer of NeurIPS, ICLR, and ICML
  • CV Research Communities: Reviewer of ICCV
  • I am also a member of the BIRD team, led by the talented researcher Jinyang Li, which drives the development of text-to-SQL for real-world database applications.
News
  • May 15, 2025: πŸŽ‰ Five papers accepted to ACL 2025, congrats to all co-authors!
  • Sep 01, 2024: πŸ§‘πŸ»β€πŸ’» Started my Ph.D. journey.
First-Authored Papers (view all papers on Google Scholar)
Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning

Shuzheng Si, Haozhe Zhao, Cheng Gao, Yuzhuo Bai, Zhitong Wang, Bofei Gao, Kangyang Luo, Wenhao Li, Yufei Huang, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun

arXiv 2025

Aligning Large Language Models to Follow Instructions and Hallucinate Less via Effective Data Filtering

Shuzheng Si, Haozhe Zhao, Gang Chen, Cheng Gao, Yuzhuo Bai, Zhitong Wang, Kaikai An, Kangyang Luo, Chen Qian, Fanchao Qi, Baobao Chang, Maosong Sun

ACL 2025

Rethinking Semantic Parsing for Large Language Models: Enhancing LLM Performance with Semantic Hints

Kaikai An*, Shuzheng Si*, Helan Hu, Haozhe Zhao, Yuchi Wang, Qingyan Guo, Baobao Chang

ACL 2025 (* indicates co-first authors)

GATEAU: Selecting Influential Samples for Long Context Alignment

Shuzheng Si, Haozhe Zhao, Gang Chen, Yunshui Li, Kangyang Luo, Chuancheng Lv, Kaikai An, Fanchao Qi, Baobao Chang, Maosong Sun

arXiv 2024; withdrawn from ACL (Findings) 2025

Looking Beyond Text: Reducing Language Bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance

Haozhe Zhao*, Shuzheng Si*, Liang Chen, Yichi Zhang, Maosong Sun, Mingjia Zhang, Baobao Chang

arXiv 2024 (* indicates co-first authors)

Improving the Robustness of Distantly-Supervised Named Entity Recognition via Uncertainty-Aware Teacher Learning and Student-Student Collaborative Learning

Shuzheng Si, Helan Hu, Haozhe Zhao, Shuang Zeng, Kaikai An, Zefan Cai, Baobao Chang

ACL (Findings) 2024

Mitigating Language-Level Performance Disparity in mPLMs via Teacher Language Selection and Cross-lingual Self-Distillation

Haozhe Zhao*, Zefan Cai*, Shuzheng Si*, Liang Chen, Yufeng He, Kaikai An, Baobao Chang

NAACL 2024 (* indicates co-first authors)

MMICL: Empowering Vision-Language Model with Multi-Modal In-Context Learning

Haozhe Zhao*, Zefan Cai*, Shuzheng Si*, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, Baobao Chang

ICLR 2024 (* indicates co-first authors)

SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents

Shuzheng Si, Wentao Ma, Haoyu Gao, Yuchuan Wu, Ting-En Lin, Yinpei Dai, Hangyu Li, Rui Yan, Fei Huang, Yongbin Li

NeurIPS 2023 (reviewers’ scores: 9/9/7/6/5)

SANTA: Separate Strategies for Inaccurate and Incomplete Annotation Noise in Distantly-Supervised Named Entity Recognition

Shuzheng Si, Zefan Cai, Shuang Zeng, Guoqiang Feng, Jiaxing Lin, Baobao Chang

ACL (Findings) 2023

Mining Clues from Incomplete Utterance: A Query-Enhanced Network for Incomplete Utterance Rewriting

Shuzheng Si, Shuang Zeng, Baobao Chang

NAACL 2022

SCL-RAI: Span-based Contrastive Learning with Retrieval Augmented Inference for Unlabeled Entity Problem in NER

Shuzheng Si, Shuang Zeng, Jiaxing Lin, Baobao Chang

COLING 2022