Hi, I'm Shuzheng Si, a first-year Ph.D. student in the Department of Computer Science and Technology at Tsinghua University. I am lucky to be advised by Prof. Maosong Sun and affiliated with the TsinghuaNLP Lab. Previously, I obtained my master's degree from Peking University, where I was fortunate to be part of the PKU NLP Group under the supervision of Prof. Baobao Chang at the Institute of Computational Linguistics. My research interests lie in Natural Language Processing (NLP) and Large Language Models (LLMs), specifically focusing on Data-Centric Methods and Data Science for NLP, including:
Scientific Data Acquisition for Post-Training (Better Understanding of Data): My first line of research aims to synthesize and organize data more scientifically (CANOE, MMICL, UltraIF, UltraEdit) and to ensure data quality automatically (NOVA, GATEAU, NUGGETS), thereby further advancing the capabilities of LLMs. By designing more effective data acquisition methods, we can also gain deeper insight into the role of data in LLM training. For instance, when enhancing a specific capability of an LLM, what characteristics should the selected data possess to count as high-quality and to maximize data efficiency? (A minimal score-and-select sketch follows this list.)
Data-Efficient Tuning Methods for Post-Training (Better Utilization of Data): Current LLMs require training on vast amounts of data, which significantly increases training costs. This line of research attempts to use the supervision in training data more efficiently, for example by leveraging (noisy) data collected from real-world scenarios to train models (SpokenWOZ, SANTA, CENSOR, SCL-RAI) and by maximizing the positive impact of limited data on models (ALSACE, LACING); the second sketch after this list illustrates one common noise-handling heuristic. Ultimately, the goal is to achieve data-efficient artificial intelligence and to reach the next level of intelligence (perhaps referred to as AGI) at minimal cost.
Trustworthy and Helpful Language Processing Engine (Applying My Research to Real-World Consumer Products): I also spend some time applying my research to the development of LingoWhale, created by DeepLang AI. LingoWhale aims to build the next-generation language processing engine, helping users access more valuable information in less time through subscription, aggregation, and summarization (more details are in this Chinese blog). It is characterized by an exceptionally low hallucination rate, ensuring an accurate and comprehensive representation of every viewpoint in the original text. To date, it has provided intelligent text information processing services to hundreds of thousands of Chinese users.
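To make the data-quality question above concrete, below is a minimal sketch of score-and-select data curation. The function names and the scoring heuristic are my illustrative assumptions, not the actual methods behind CANOE, NOVA, GATEAU, or NUGGETS; in practice the score might come from a reward model, perplexity, or an LLM-as-judge rating.

```python
# A minimal, illustrative sketch of score-and-select data curation.
# `score_example` is a hypothetical stand-in for any quality signal
# (a reward model, perplexity, an LLM-as-judge rating, ...).

def score_example(example: dict) -> float:
    # Hypothetical heuristic: favor substantive responses relative to prompt length.
    return len(example["response"].split()) / (1 + len(example["prompt"].split()))

def select_top_k(dataset: list[dict], k: int) -> list[dict]:
    """Keep only the k highest-scoring examples for fine-tuning."""
    return sorted(dataset, key=score_example, reverse=True)[:k]

if __name__ == "__main__":
    data = [
        {"prompt": "Explain entropy.", "response": "Entropy measures the average uncertainty of a random variable."},
        {"prompt": "Hi", "response": "Hello!"},
    ]
    print(select_top_k(data, k=1))  # keeps the more informative example
```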
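Similarly, for the noisy-supervision setting, here is a minimal sketch of the widely used small-loss heuristic, which treats the highest-loss examples in a batch as probable label noise. The function name and the 30% drop fraction are illustrative assumptions, not the specific techniques of SANTA, CENSOR, or SCL-RAI.

```python
# An illustrative sketch of the "small-loss" heuristic for noisy supervision:
# drop the highest-loss fraction of a batch, assuming those examples are
# the most likely to be mislabeled. The 30% drop fraction is an assumption.

def small_loss_mask(per_example_losses: list[float], drop_fraction: float = 0.3) -> list[bool]:
    """Return a keep-mask that excludes the highest-loss examples from the update."""
    n = len(per_example_losses)
    n_drop = int(n * drop_fraction)
    ranked = sorted(range(n), key=lambda i: per_example_losses[i])  # indices, ascending by loss
    keep = set(ranked[: n - n_drop])
    return [i in keep for i in range(n)]

if __name__ == "__main__":
    losses = [0.2, 0.4, 3.1, 0.3]   # the 3.1 example looks mislabeled
    print(small_loss_mask(losses))  # [True, True, False, True]
```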
My long-term research goal is to elucidate the influence of data on LLMs and to use these insights to guide the organization, selection, and synthesis of high-quality data, thereby enhancing the capabilities of LLMs and building the next-generation language processing engine. Recently, I have been particularly interested in investigating LLMs' hallucinations from a data perspective and mitigating them through data-driven methods. Feel free to drop me an email if you are interested in connecting.
Education
Tsinghua University
Ph.D. in Computer Science and Technology Sep. 2024 - Jul. 2028 (expected)
Peking University
M.S. in Software Engineering Sep. 2021 - Jul. 2024
Yunnan University
B.S. in Information Security (Rank: 1/300+) Sep. 2017 - Jul. 2021
Honors & Awards
Merit Student 2022
Top 10 Outstanding Students Nomination Award (Ranked 1st) 2020
First-Class Scholarship 2020
National Scholarship 2019
Provincial Scholarship 2018
Provincial Merit Student 2018
Experience
DeepLang AI
Research Intern Apr. 2024 - Present
Alibaba DAMO Academy
Research Intern Jun. 2022 - Jun. 2023
SenseTime Research
Research Intern Jul. 2021 - Feb. 2022
Service
NLP Research Communities: Reviewer of ACL, EMNLP, NAACL, COLING, and TASLP
ML Research Communities: Reviewer of NeurIPS, ICLR, and ICML
CV Research Communities: Reviewer of ICCV
I am also a member of the BIRD team, led by the talented researcher Jinyang Li, which drives the development of text-to-SQL for real-world database applications.
News
2025
- May 15: Five papers accepted at ACL 2025. Congrats to all co-authors!
2024
- Started my Ph.D. journey.
Publications
Improving the Robustness of Distantly-Supervised Named Entity Recognition via Uncertainty-Aware Teacher Learning and Student-Student Collaborative Learning