Thanh V. T. Tran

I am Thanh Tran, an AI Research Resident at the FPT Software – AI Center, working under the supervision of Dr. Van Nguyen and Professor Truong-Son Hy. I’m starting my Ph.D. at Nanyang Technological University (NTU) in Fall 2026, advised by Professor Woon-Seng Gan.

I’m always open to collaborations, discussions, and new opportunities. Feel free to reach out if you’re interested in my research or would like to discuss potential projects.

Research: My research spans several key areas in artificial intelligence, with a primary focus on multimodal AI, generative models, and AI for scientific discovery.

1. Multimodal AI and Audio-Visual Learning. I develop deep learning models for audio-visual understanding and generation, including video-to-audio synthesis, automated video dubbing, and speech reconstruction from silent videos.

2. Generative Models for Speech and Audio. I work on flow models for text-to-speech and audio generation, aiming to build efficient, low-latency systems for real-world deployment.

3. AI for Scientific Discovery. Inspired by evolutionary algorithms, I optimize protein sequences using black-box optimization methods in discrete and latent spaces.

News

Jun 17, 2026	Flowley got accepted at ECCV 2026, wrapping up my journey at FPT Software – AI Center.
Jun 04, 2026	DiFlow-TTS got accepted at Interspeech 2026 (Long Paper track).
May 01, 2026	DiFlowDubber got accepted at CVPR Findings 2026. DiFlowDubber and Flowley also got accepted at the Sight and Sound Workshop, CVPR 2026.
Jan 10, 2026	Honored to receive the Best Performance Award 2025, ranking in the top 3 out of 100+ AI engineers and researchers at FPT Software – AI Center.
May 20, 2025	RESOUND got accepted at Interspeech 2025.
Dec 21, 2024	ConxGNN got accepted at ICASSP 2025.
Nov 17, 2024	GROOT got accepted at KDD 2025.

Selected Publications

ECCV

Precise Video-to-Audio Generation with Cross-Modal Alignment in Latent Space

Thanh V. T. Tran, Ngoc-Son Nguyen, Luong Tran, and 4 more authors

European Conference on Computer Vision, 2026

Video-to-audio (V2A) generation aims to synthesize realistic audio that is both semantically consistent with and temporally synchronized to a silent video. Despite recent progress, many methods still rely on multi-stage training, resulting in high computational costs and long runtimes, or transform visual input into text to leverage pretrained text-to-audio models, sacrificing fine-grained temporal cues. To overcome these limitations, we propose Flowley, an end-to-end, single-stage training architecture that produces soundtracks by combining visual features with textual prompts. Crucially, we introduce Progressive Soft-masked Cross-Attention, which embeds audio-visual synchronization directly within its attention mechanism, adding no computational cost compared to standard attention layers. We further observe that existing V2A benchmarks lack sound-oriented descriptive captions, which can degrade the quality of the synthesized audio. To remedy this, we propose SoundCap, a plug-and-play pipeline for creating detailed, sound-aware captions that guide the model. Remarkably, without integrating any pretrained audio-visual alignment modules, Flowley achieves state-of-the-art performance on VGGSound across multiple metrics. Moreover, by incorporating SoundCap, we further exceed the performance of the strongest existing closed-source methods in terms of audio quality in the zero-shot setting.

Interspeech

DiFlow-TTS: Compact and Low-Latency Zero-Shot Text-to-Speech with Factorized Discrete Flow Matching

Ngoc-Son Nguyen, Thanh V. T. Tran, Hieu-Nghia Huynh-Nguyen, and 2 more authors

Interspeech, 2026

Zero-shot text-to-speech (TTS) has made significant progress in replicating unseen voices, yet balancing generation quality and inference efficiency remains challenging. Autoregressive models suffer from high latency, while diffusion-based approaches are constrained by training-time configurations. Moreover, most flow-based methods operate in continuous space, which introduces optimization challenges because continuous token spaces are inherently more complex than discrete ones. To address this limitation, we propose DiFlow-TTS, a zero-shot TTS framework based on discrete flow matching. The model consists of a deterministic Phoneme-Content Mapper for linguistic modeling and a Factorized Discrete Flow Denoiser that jointly generates prosody and acoustic token streams. Experimental results demonstrate the effectiveness of our approach across multiple evaluation metrics.

ICASSP

Effective Context Modeling Framework for Emotion Recognition in Conversations

Cuong Tran Van^*, Thanh V. T. Tran^*, Van Nguyen, and 1 more author

International Conference on Acoustics, Speech, and Signal Processing, 2025

Emotion Recognition in Conversations (ERC) facilitates a deeper understanding of the emotions conveyed by speakers in each utterance within a conversation. Recently, Graph Neural Networks (GNNs) have demonstrated their strengths in capturing data relationships, particularly in contextual information modeling and multimodal fusion. However, existing methods often struggle to fully capture the complex interactions between multiple modalities and conversational context, limiting their expressiveness. To overcome these limitations, we propose ConxGNN, a novel GNN-based framework designed to capture contextual information in conversations. ConxGNN features two key parallel modules: a multi-scale heterogeneous graph that captures the diverse effects of utterances on emotional changes, and a hypergraph that models the multivariate relationships among modalities and utterances. The outputs from these modules are integrated into a fusion layer, where a cross-modal attention mechanism is applied to produce a contextually enriched representation. Additionally, ConxGNN tackles the challenge of recognizing minority or semantically similar emotion classes by incorporating a re-weighting scheme into the loss functions. Experimental results on the IEMOCAP and MELD benchmark datasets demonstrate the effectiveness of our method, achieving state-of-the-art performance compared to previous baselines.

KDD

GROOT: Effective Design of Biological Sequences with Limited Experimental Data

Thanh V. T. Tran^*, Nhat Khang Ngo^*, Viet Anh Nguyen, and 1 more author

Conference on Knowledge Discovery and Data Mining, 2025

Latent space optimization (LSO) is a powerful method for designing discrete, high-dimensional biological sequences that maximize expensive black-box functions, such as wet lab experiments. This is accomplished by learning a latent space from available data and using a surrogate model fΦ to guide optimization algorithms toward optimal outputs. However, existing methods struggle when labeled data is limited, as training fΦ with few labeled data points can lead to subpar outputs, offering no advantage over the training data itself. We address this challenge by introducing GROOT, a GRaph-based Latent SmOOThing for Biological Sequence Optimization. In particular, GROOT generates pseudo-labels for neighbors sampled around the training latent embeddings. These pseudo-labels are then refined and smoothed by Label Propagation. Additionally, we justify our approach theoretically and empirically, demonstrating GROOT’s ability to extrapolate to regions beyond the training set while maintaining reliability within an upper bound of their expected distances from the training regions. We evaluate GROOT on various biological sequence design tasks, including protein optimization (GFP and AAV) and three tasks with exact oracles from Design-Bench. The results demonstrate that GROOT equals or surpasses existing methods without requiring access to black-box oracles or vast amounts of labeled data, highlighting its practicality and effectiveness.