DeepAVFusion
Official codebase for "Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling".
Topics: attention-mechanism, audio-visual-correspondence, audio-visual-learning, masked-autoencoder, masked-image-modeling, multimodal-learning, self-supervised-learning, sound-source-localization, sound-source-separation, transformer-architecture
Created: 2023-12-02T11:09:31
Updated: 2025-03-24T16:09:29
Stars: 32
Stars increase: 0