Research Areas

Research Philosophy

My research focuses on advancing multimodal artificial intelligence systems that can understand, generate, and interact with complex real-world data across video, language, and 3D spaces.

I am particularly interested in developing AI systems for video understanding and video generation, bridging the gap between visual perception and language understanding through state-of-the-art Vision-Language Models (VLMs).

Primary Research Areas

Video Understanding

CVPR 2025

Developing AI systems for long video analysis with instructed learnable memory. Our ReWind model (CVPR 2025) enables comprehensive understanding of extended video content through novel memory mechanisms.

Long Video Analysis · Memory Models · Temporal Reasoning
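ReWind's actual memory design is described in the paper; purely to illustrate the general idea of a learnable memory for long video, here is a minimal NumPy sketch of a gated state that compresses an arbitrarily long stream of frame features into a fixed-size summary (`gated_memory_update` and `W_gate` are hypothetical names for this sketch, not ReWind's API):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_memory_update(memory, frame_feat, W_gate):
    """Blend new frame features into a fixed-size memory via a learned gate."""
    gate = sigmoid(W_gate @ np.concatenate([memory, frame_feat]))
    return gate * memory + (1.0 - gate) * frame_feat

rng = np.random.default_rng(0)
d = 8                                      # toy feature dimension
memory = np.zeros(d)
W_gate = rng.normal(scale=0.1, size=(d, 2 * d))
for _ in range(100):                       # stream of 100 toy frame features
    frame_feat = rng.normal(size=d)
    memory = gated_memory_update(memory, frame_feat, W_gate)
print(memory.shape)                        # memory stays fixed-size regardless of video length
```

The point of the sketch is the constant-size state: however many frames arrive, downstream reasoning only ever attends to the compressed memory.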

Video Generation

Building production-scale video generation models using Vision-Language Models and diffusion architectures. Focus on controllable generation and high-quality synthetic video production.

Diffusion Models · VLMs · Generative AI

Vision-Language Models

Expertise in state-of-the-art VLMs including LLaVA, CLIP, QwenVL, and LayoutVLM. Developing multimodal perception systems that bridge vision and language understanding.

LLaVA · CLIP · QwenVL · Multimodal AI
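As a toy illustration of the contrastive image-text matching that CLIP popularized and that underlies many of these VLMs (this is not CLIP's implementation — just the cosine-similarity scoring idea run on random stand-in embeddings):

```python
import numpy as np

def cosine_similarity_matrix(image_emb, text_emb):
    """Pairwise cosine similarity between L2-normalized embeddings."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return img @ txt.T

def match_texts_to_images(image_emb, text_emb, temperature=0.07):
    """Softmax over texts for each image, as in CLIP-style zero-shot scoring."""
    logits = cosine_similarity_matrix(image_emb, text_emb) / temperature
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(2, 8))               # toy "image" embeddings
text_emb = np.vstack([image_emb[1] + 0.01 * rng.normal(size=8),
                      image_emb[0] + 0.01 * rng.normal(size=8)])
probs = match_texts_to_images(image_emb, text_emb)
print(probs.argmax(axis=1))                       # -> [1 0]: each image picks its paired text
```

In a real VLM the two embedding matrices come from learned image and text encoders; everything downstream of them (normalize, dot product, temperature-scaled softmax) is this simple.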

Autonomous Systems & 4D LiDAR

Submitted to ECCV 2026

Research on OmniLiDAR: Controllable and Multi-Sensor 4D LiDAR Generation (submitted to ECCV 2026). Focus on 3D scene reconstruction, sensor fusion, and autonomous perception systems.

4D LiDAR · Sensor Fusion · 3D Reconstruction

Vision-Language-Action (VLA)

Developing models that combine visual perception, language understanding, and action prediction for robotics and autonomous systems. Enabling AI systems to understand and execute complex instructions.

Robotics · Action Prediction · Embodied AI

Generative Diffusion Models

Researching state-of-the-art diffusion models for creating and manipulating visual content. Applications in image synthesis, video generation, and creative AI tools.

Diffusion · Image Synthesis · Generative AI
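The forward (noising) half of a DDPM-style diffusion model has a convenient closed form, which is what makes training tractable. A minimal NumPy sketch under a standard linear beta schedule (the hyperparameter values are illustrative defaults, not those of any particular production model):

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear noise schedule."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def q_sample(x0, t, alpha_bar, rng):
    """Closed-form forward diffusion: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps, eps

rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
x0 = rng.normal(size=(4, 4))                     # toy "image"
x_early, _ = q_sample(x0, 10, alpha_bar, rng)    # nearly clean: abar still close to 1
x_late, _ = q_sample(x0, 999, alpha_bar, rng)    # nearly pure Gaussian noise
print(alpha_bar[10], alpha_bar[999])
```

Because any noise level can be sampled in one step, a denoiser can be trained on random timesteps without simulating the whole chain; the generative (reverse) process then inverts this step by step.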

Kinematic Time Series

Deep learning for motion sensor data analysis and trajectory reconstruction using Temporal Convolutional Networks (TCNs). Applications in pen trajectory reconstruction and motion analysis.

TCN · Time Series · Sensor Fusion
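The building block of a TCN is the dilated causal convolution: the output at time t sees only present and past samples, and stacking layers with growing dilation widens the receptive field exponentially. A minimal single-channel NumPy sketch (the function name is illustrative, and a real TCN adds channels, residual connections, and nonlinearities):

```python
import numpy as np

def dilated_causal_conv1d(x, weights, dilation):
    """Causal 1-D convolution: y[t] depends only on x[t], x[t-d], x[t-2d], ..."""
    k = len(weights)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), np.asarray(x, dtype=float)])  # left-pad: no future leakage
    y = np.zeros(len(x))
    for t in range(len(x)):
        taps = xp[t + pad - np.arange(k) * dilation]   # x[t], x[t-d], x[t-2d], ...
        y[t] = np.dot(weights, taps)
    return y

x = np.zeros(16)
x[5] = 1.0                                             # unit impulse at t = 5
y = dilated_causal_conv1d(x, np.array([0.5, 0.25, 0.125]), dilation=2)
print(np.nonzero(y)[0])                                # -> [5 7 9]: only present/past positions respond
```

The impulse test makes causality visible: nothing before t = 5 reacts, which is exactly the property needed for online trajectory reconstruction from streaming sensor data.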

Document AI & OCR

Multimodal perception models for document analysis, handwriting recognition, and historical document processing. Deploying OCR capabilities across multiple languages.

OCR · Handwriting · Document Analysis

Current Research Focus

Multimodal AI at Huawei Finland Research Center

2023 - Present

Leading cutting-edge research on multimodal AI systems, with a focus on:

  • Video Understanding: Co-developed ReWind (CVPR 2025), a large language model for long video understanding with instructed learnable memory
  • Video Generation: Building production-scale video generation models using VLMs and diffusion architectures on ModelArts MLOps platform
  • OCR & Multimodal Perception: Spearheading OCR capabilities, deploying multimodal perception models across multiple languages
  • 4D LiDAR Generation: OmniLiDAR project on controllable and multi-sensor 4D LiDAR generation (submitted to ECCV 2026)

Explore My Publications

Discover research outputs and contributions in these areas

Conference Papers · Journal Papers