Shuzheng Si

Ph.D. student, Tsinghua University

Hi, I’m Shuzheng Si, a first-year Ph.D. student in the Department of Computer Science and Technology at Tsinghua University. I am lucky to be advised by Prof. Maosong Sun and affiliated with the TsinghuaNLP Lab. Previously, I obtained my master’s degree from Peking University, where I was fortunate to be part of the PKU NLP Group under the supervision of Prof. Baobao Chang at the Institute of Computational Linguistics. My research interests lie in Natural Language Processing (NLP) and Large Language Models (LLMs), with a particular focus on Data-Centric Methods and Data Science for NLP, including:

  • πŸ’‘ Scientific Data Acquisition for Post-Training (πŸ•΅πŸ» Better Understanding of Data): My first line of research aims to synthesize and organize data more scientifically (CANOE, MMICL, UltraIF, UltraEdit) and to ensure data quality automatically (NOVA, GATEAU, NUGGETS), thereby further advancing the capabilities of LLMs. Designing more effective data acquisition methods also yields deeper insights into the role of data in LLM training: for instance, when enhancing a specific capability of an LLM, what characteristics must the selected data possess to count as high-quality and to maximize data efficiency? (A toy sketch of this selection idea appears after this list.)
  • πŸš€ Data-Efficient Tuning Methods for Post-Training (πŸ§‘πŸ»β€πŸ”¬ Better Utilization of Data): Current LLMs are trained on vast amounts of data, which significantly drives up training costs. This line of research tries to use the supervision in training data more efficiently, for example by leveraging noisy data collected from real-world scenarios to train models (SpokenWOZ, SANTA, CENSOR, SCL-RAI) and by maximizing the positive impact of limited data on models (ALSACE, LACING); see the second sketch after this list. Ultimately, the goal is data-efficient artificial intelligence: reaching the next level of intelligence (perhaps what people call AGI) at minimal cost.
  • 🌏 Trustworthy and Helpful Language Processing Engine (πŸ§‘πŸ»β€πŸ« Applying My Research to Real-World Consumer Products): I also spend some time applying my research to the development of LingoWhale, created by DeepLang AI. LingoWhale aims to build a next-generation language processing engine that helps users access more valuable information in less time through subscription, aggregation, and summarization (more details in this Chinese-language blog post). It is characterized by an exceptionally low hallucination rate, ensuring an accurate and comprehensive representation of every viewpoint in the source text. To date, it has provided intelligent text-processing services to hundreds of thousands of Chinese users.
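
As a toy illustration of the data-selection idea in the first bullet, the sketch below scores candidate post-training samples and keeps only the top fraction. The scoring heuristic is a hypothetical stand-in chosen for exposition; it is not the actual criterion used in NOVA, GATEAU, or NUGGETS.

```python
# Toy sketch: rank candidate post-training samples by a quality score
# and keep the top fraction. The scorer below is a hypothetical
# heuristic for illustration, not the method of NOVA/GATEAU/NUGGETS.

def quality_score(sample: dict) -> float:
    """Hypothetical scorer: reward longer responses that echo the instruction."""
    instruction = sample["instruction"].lower().split()
    response = sample["response"]
    length_bonus = min(len(response.split()) / 100.0, 1.0)
    overlap_bonus = sum(w in response.lower() for w in instruction) / max(len(instruction), 1)
    return length_bonus + overlap_bonus

def select_top_fraction(pool: list[dict], keep: float = 0.3) -> list[dict]:
    """Keep the highest-scoring `keep` fraction of the candidate pool."""
    ranked = sorted(pool, key=quality_score, reverse=True)
    return ranked[: max(1, int(len(ranked) * keep))]

if __name__ == "__main__":
    pool = [
        {"instruction": "Explain overfitting.",
         "response": "Overfitting occurs when a model memorizes training noise ..."},
        {"instruction": "Explain overfitting.", "response": "idk"},
    ]
    print(select_top_fraction(pool, keep=0.5))
```

In practice the heuristic scorer would be replaced by a learned signal (e.g., model loss or reward-model scores), but the select-then-train loop stays the same.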
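For the noisy-supervision theme in the second bullet, one common generic trick is to softly downweight high-loss (likely mislabeled) samples when aggregating the batch loss. The snippet below is a minimal PyTorch sketch of that generic idea, not the specific mechanism of SANTA, CENSOR, or SCL-RAI.

```python
# Minimal sketch: softly downweight likely-noisy (high-loss) samples.
# Generic loss reweighting for illustration; not the specific method
# of SANTA, CENSOR, or SCL-RAI.

import torch
import torch.nn.functional as F

def reweighted_loss(per_sample_loss: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    """Weight each sample by softmax(-loss / T), so high-loss samples
    contribute less to the batch objective."""
    weights = torch.softmax(-per_sample_loss.detach() / temperature, dim=0)
    return (weights * per_sample_loss).sum()

# Usage: compute per-sample losses with reduction="none", then aggregate.
logits = torch.randn(4, 10, requires_grad=True)  # batch of 4, 10 classes
labels = torch.randint(0, 10, (4,))
loss = reweighted_loss(F.cross_entropy(logits, labels, reduction="none"))
loss.backward()
```
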
My long-term research goal is to elucidate the influence of data on LLMs and to use these insights to guide the organization, selection, and synthesis of high-quality data, thereby enhancing the capabilities of LLMs to build the next-generation language processing engine. Recently, I have been particularly interested in investigating LLMs’ hallucinations from a data perspective and mitigating them through data-driven methods. Feel free to drop me an email if you are interested in connecting. πŸ•ŠπŸ•ŠπŸ•Š


Education
  • Tsinghua University
    Ph.D. in Computer Science and Technology, Sep. 2024 - Jul. 2028 (expected)

  • Peking University
    M.S. in Software Engineering, Sep. 2021 - Jul. 2024

  • Yunnan University
    B.S. in Information Security (Rank: 1/300+), Sep. 2017 - Jul. 2021

Honors & Awards
  • Merit Student 2022
  • Top 10 Outstanding Students Nomination Award (Ranked 1st) 2020
  • First-Class Scholarship 2020
  • National Scholarship 2019
  • Provincial Scholarship 2018
  • Provincial Merit Student 2018
Experience
  • DeepLang AI
    Research Intern, Apr. 2024 - Present

  • Alibaba DAMO Academy
    Research Intern, Jun. 2022 - Jun. 2023

  • SenseTime Research
    Research Intern, Jul. 2021 - Feb. 2022

Service
  • NLP Research Communities: Reviewer of ACL, EMNLP, NAACL, COLING, and TASLP
  • ML Research Communities: Reviewer of NeurIPS, ICLR, and ICML
  • CV Research Communities: Reviewer of ICCV
  • I am also a member of the BIRD team, led by the talented researcher Jinyang Li, which drives the development of text-to-SQL for real-world database applications.
News
  • May 15, 2025: πŸŽ‰ Five papers accepted to ACL 2025, congrats to all co-authors!
  • Sep 01, 2024: πŸ§‘πŸ»β€πŸ’» Started my Ph.D. journey.
First-Authored Papers (view all papers on Google Scholar)
Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning

Shuzheng Si, Haozhe Zhao, Cheng Gao, Yuzhuo Bai, Zhitong Wang, Bofei Gao, Kangyang Luo, Wenhao Li, Yufei Huang, Gang Chen, Fanchao Qi, Minjia Zhang, Baobao Chang, Maosong Sun

arXiv 2025

Aligning Large Language Models to Follow Instructions and Hallucinate Less via Effective Data Filtering

Shuzheng Si, Haozhe Zhao, Gang Chen, Cheng Gao, Yuzhuo Bai, Zhitong Wang, Kaikai An, Kangyang Luo, Chen Qian, Fanchao Qi, Baobao Chang, Maosong Sun

ACL 2025

Rethinking Semantic Parsing for Large Language Models: Enhancing LLM Performance with Semantic Hints

Kaikai An*, Shuzheng Si*, Helan Hu, Haozhe Zhao, Yuchi Wang, Qingyan Guo, Baobao Chang

ACL 2025 (* indicates co-first authors)

GATEAU: Selecting Influential Samples for Long Context Alignment

Shuzheng Si, Haozhe Zhao, Gang Chen, Yunshui Li, Kangyang Luo, Chuancheng Lv, Kaikai An, Fanchao Qi, Baobao Chang, Maosong Sun

arXiv 2024; withdrawn from ACL (Findings) 2025

Looking Beyond Text: Reducing Language Bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance

Haozhe Zhao*, Shuzheng Si*, Liang Chen, Yichi Zhang, Maosong Sun, Mingjia Zhang, Baobao Chang

arXiv 2024 (* indicates co-first authors)

Improving the Robustness of Distantly-Supervised Named Entity Recognition via Uncertainty-Aware Teacher Learning and Student-Student Collaborative Learning

Shuzheng Si, Helan Hu, Haozhe Zhao, Shuang Zeng, Kaikai An, Zefan Cai, Baobao Chang

ACL (Findings) 2024

Mitigating Language-Level Performance Disparity in mPLMs via Teacher Language Selection and Cross-lingual Self-Distillation

Haozhe Zhao*, Zefan Cai*, Shuzheng Si*, Liang Chen, Yufeng He, Kaikai An, Baobao Chang

NAACL 2024 (* indicates co-first authors)

MMICL: Empowering Vision-Language Model with Multi-Modal In-Context Learning

Haozhe Zhao*, Zefan Cai*, Shuzheng Si*, Xiaojian Ma, Kaikai An, Liang Chen, Zixuan Liu, Sheng Wang, Wenjuan Han, Baobao Chang

ICLR 2024 (* indicates co-first authors)

SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents

Shuzheng Si, Wentao Ma, Haoyu Gao, Yuchuan Wu, Ting-En Lin, Yinpei Dai, Hangyu Li, Rui Yan, Fei Huang, Yongbin Li

NeurIPS 2023 (reviewers’ scores: 9/9/7/6/5)

SANTA: Separate Strategies for Inaccurate and Incomplete Annotation Noise in Distantly-Supervised Named Entity Recognition

Shuzheng Si, Zefan Cai, Shuang Zeng, Guoqiang Feng, Jiaxing Lin, Baobao Chang

ACL (Findings) 2023

Mining Clues from Incomplete Utterance: A Query-Enhanced Network for Incomplete Utterance Rewriting

Shuzheng Si, Shuang Zeng, Baobao Chang

NAACL 2022

SCL-RAI: Span-based Contrastive Learning with Retrieval Augmented Inference for Unlabeled Entity Problem in NER

Shuzheng Si, Shuang Zeng, Jiaxing Lin, Baobao Chang

COLING 2022