Gengyuan Zhang

pronoun: he/him

Hi, I am Gengyuan (张耕源). I am currently pursuing my PhD at Ludwig Maximilian University of Munich (aka LMU Munich/University of Munich), supervised by Prof. Volker Tresp.

My research interests include Video Understanding and Multimodal Reasoning, at the intersection of Computer Vision and Natural Language Processing.

Prior to this, I received my bachelor's degree (2018) from Zhejiang University, China, and my master's degree (2021) from the Technical University of Munich, Germany.

I am originally from Hunan, China.

uni email: zhang{at}dbs[dot]ifi[dot]lmu[dot]de
personal email: gengyuanmax{at}gmail[dot]com
hobbies: Plants, Crusader Kings III, Travelling, Cooking
have a cute Dackel (dachshund)

I am open to any collaboration and full-time job opportunities.

news

Apr 7, 2025	I start my internship @Amazon London!
Mar 5, 2025	One paper accepted at the ICLR 2025 Workshop on World Models
Feb 26, 2025	Two papers accepted at CVPR 2025! See you in Nashville.
Feb 20, 2025	Our new paper is now on arXiv: Memory Helps, but Confabulation Misleads: Understanding Streaming Events in Videos with MLLMs!
Oct 28, 2024	One new paper is accepted by WACV 2025, Tucson, Arizona!

selected publications

Localizing Events in Videos with Multimodal Queries

Gengyuan Zhang, Mang Ling Ada Fok, Yan Xia, and 5 more authors

arXiv preprint arXiv:2406.10079, 2024

Video understanding is a pivotal task in the digital era, yet the dynamic and multievent nature of videos makes them labor-intensive and computationally demanding to process. Thus, localizing a specific event given a semantic query has gained importance in both user-oriented applications like video search and academic research into video foundation models. A significant limitation in current research is that semantic queries are typically in natural language that depicts the semantics of the target event. This setting overlooks the potential for multimodal semantic queries composed of images and texts. To address this gap, we introduce a new benchmark, ICQ, for localizing events in videos with multimodal queries, along with a new evaluation dataset ICQ-Highlight. Our new benchmark aims to evaluate how well models can localize an event given a multimodal semantic query that consists of a reference image, which depicts the event, and a refinement text to adjust the images’ semantics. To systematically benchmark model performance, we include 4 styles of reference images and 5 types of refinement texts, allowing us to explore model performance across different domains. We propose 3 adaptation methods that tailor existing models to our new setting and evaluate 10 SOTA models, ranging from specialized to large-scale foundation models. We believe this benchmark is an initial step toward investigating multimodal queries in video event localization.

Multi-event Video-Text Retrieval

Gengyuan Zhang, Jisen Ren, Jindong Gu, and 1 more author

In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023

Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of massive video-text data on the Internet. A plethora of work characterized by using a two-stream Vision-Language model architecture that learns a joint representation of video-text pairs has become a prominent approach for the VTR task. However, these models operate under the assumption of bijective video-text correspondences and neglect a more practical scenario where video content usually encompasses multiple events, while texts like user queries or webpage metadata tend to be specific and correspond to single events. This establishes a gap between the previous training objective and real-world applications, leading to the potential performance degradation of earlier models during inference. In this study, we introduce the Multi-event Video-Text Retrieval (MeVTR) task, addressing scenarios in which each video contains multiple different events, as a niche scenario of the conventional Video-Text Retrieval Task. We present a simple model, Me-Retriever, which incorporates key event video representation and a new MeVTR loss for the MeVTR task. Comprehensive experiments show that this straightforward framework outperforms other models in the Video-to-Text and Text-to-Video tasks, effectively establishing a robust baseline for the MeVTR task. We believe this work serves as a strong foundation for future studies.

Memory Helps, but Confabulation Misleads: Understanding Streaming Events in Videos with MLLMs

Gengyuan Zhang, Mingcong Ding, Tong Liu, and 2 more authors

2025

Multimodal large language models (MLLMs) have demonstrated strong performance in understanding videos holistically, yet their ability to process streaming videos, where videos are treated as a sequence of visual events, remains underexplored. Intuitively, leveraging past events as memory can enrich contextual and temporal understanding of the current event. In this paper, we show that leveraging memories as contexts helps MLLMs better understand video events. However, because such memories rely on predictions of preceding events, they may contain misinformation, leading to confabulation and degraded performance. To address this, we propose a confabulation-aware memory modification method that mitigates confabulated memory for memory-enhanced event understanding.

Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries

Roberto Amoroso*, Gengyuan Zhang*, Rajat Koner, and 4 more authors

In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025

Video Question Answering (Video QA) is a challenging video understanding task that requires models to comprehend entire videos, identify the most relevant information based on contextual cues from a given question, and reason accurately to provide answers. Recent advancements in Multimodal Large Language Models (MLLMs) have transformed video QA by leveraging their exceptional commonsense reasoning capabilities. This progress is largely driven by the effective alignment between visual data and the language space of MLLMs. However, for video QA, an additional space-time alignment poses a considerable challenge for extracting question-relevant information across frames. In this work, we investigate diverse temporal modeling techniques to integrate with MLLMs, aiming to achieve question-guided temporal modeling that leverages pre-trained visual and textual alignment in MLLMs. We propose T-Former, a novel temporal modeling method that creates a question-guided temporal bridge between frame-wise visual perception and the reasoning capabilities of LLMs. Our evaluation across multiple video QA benchmarks demonstrates that T-Former competes favorably with existing temporal modeling approaches and aligns with recent advancements in video QA.

Time-dependent Entity Embedding is not All You Need: A Re-evaluation of Temporal Knowledge Graph Completion Models under a Unified Framework

Zhen Han*, Gengyuan Zhang*, Yunpu Ma, and 1 more author

In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Nov 2021

Various temporal knowledge graph (KG) completion models have been proposed in the recent literature. The models usually contain two parts, a temporal embedding layer and a score function derived from existing static KG modeling approaches. Since the approaches differ along several dimensions, including different score functions and training strategies, the individual contributions of different temporal embedding techniques to model performance are not always clear. In this work, we systematically study six temporal embedding approaches and empirically quantify their performance across a wide range of configurations with about 3000 experiments and 13159 GPU hours. We classify the temporal embeddings into two classes: (1) timestamp embeddings and (2) time-dependent entity embeddings. Despite the common belief that the latter is more expressive, an extensive experimental study shows that timestamp embeddings can achieve on-par or even better performance with significantly fewer parameters. Moreover, we find that when trained appropriately, the relative performance differences between various temporal embeddings often shrink and sometimes even reverse when compared to prior results. For example, TTransE, one of the first temporal KG models, can outperform more recent architectures on ICEWS datasets. To foster further research, we provide the first unified open-source framework for temporal KG completion models with full composability, where temporal embeddings, score functions, loss functions, regularizers, and the explicit modeling of reciprocal relations can be combined arbitrarily.

Can Vision-Language Models be a Good Guesser? Exploring VLMs for Times and Location Reasoning

Gengyuan Zhang, Yurui Zhang, Kerui Zhang, and 1 more author

In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Nov 2024

Vision-Language Models (VLMs) are expected to be capable of reasoning with commonsense knowledge as human beings do. One example is that humans can reason where and when an image was taken based on their knowledge. This makes us wonder if, based on visual cues, Vision-Language Models that are pre-trained with large-scale image-text resources can achieve and even outperform humans' capability in reasoning about times and locations. To address this question, we propose a two-stage recognition and reasoning probing task, applied to discriminative and generative VLMs to uncover whether VLMs can recognize times and location-relevant features and further reason about them. To facilitate the investigation, we introduce WikiTiLo, a well-curated image dataset comprising images with rich socio-cultural cues. In the extensive experimental studies, we find that although VLMs can effectively retain relevant features in visual encoders, they still fail to make perfect reasoning. We will release our dataset and codes to facilitate future studies.