I am an undergraduate at Shanghai Jiao Tong University, where I have been majoring in Artificial Intelligence since Fall 2023. My research interests span agents, multimodal models, and large language models, and I am currently focusing on agent and multimodal research. I have gained practical experience through coursework and research projects. I am eager to explore these areas further and welcome opportunities to exchange ideas and collaborate with peers who share these interests.
🔬 My Research
- Research Interests: Agents, Multimodal, Large Language Models, AI for Games
- Current Focus: Agent & Multimodal
🎖 Honors and Awards
- 2024, 2025 Zhiyuan Honors Scholarship
Awarded to top students majoring in science.
- 2024, 2025 Undergraduate Excellence Scholarship
Awarded to students with outstanding comprehensive evaluation rankings.
📖 Education
💻 Internships
- Explored Computer Vision fundamentals.
- Learned to read research papers and reproduced basic CV algorithms.
- Conducting research on multimodal large language models.
- Participated in training a GUI recognition model for a future GUI agent (collaborative project with Huawei). Responsible for data synthesis, annotation, and cleaning.
- Conducting research on visual token pruning for multimodal models.
- First author and project leader of AcademiClaw, a benchmark for OpenClaw.
- Participant in davinci-magihuman.
- Built downstream applications based on davinci-magihuman, including digital human meetings and digital human live streaming. [Public soon]
- Participated in the development of WeSay (a speech recognition app). [Private repo]
📝 Publications
AcademiClaw: When Students Set Challenges for AI Agents
- The first academic-level benchmark for OpenClaw sourced directly from undergraduate students.
- 80 complex, long-horizon tasks across 25+ domains; best frontier model achieves only ~55% pass rate.
- Contributes to evaluating academic capabilities and advancing the OpenClaw community.
Speed by Simplicity: A Single-Stream Architecture for Fast Audio-Video Generative Foundation Model
- First open-source model to achieve joint audio-video generation with a pure single-stream Transformer.
- 80% win rate over Ovi 1.1 and 60.9% over LTX 2.3 in human evaluation, reaching open-source SOTA.
- Generates a 5-second clip in ~2s on a single H100; supports 6+ languages.
📂 Projects
AcademiClaw
The first academic-level benchmark for OpenClaw sourced directly from undergraduate students.
Paper-RAG
An intent-aware paper recommendation and research assistant powered by RAG.
GUI-Project
Data preparation for training a GUI recognition model for a future GUI agent.
ViT-on-Image-Classification
Vision Transformer (ViT) for image classification, especially on small-scale datasets (CIFAR-10).