V, a multimodal model that has introduced native visual function calling to bypass text conversion in agentic workflows.
CLIP is one of the most important multimodal foundation models today, aligning visual and textual signals into a shared feature space using a simple contrastive learning loss on large-scale ...
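To make the mechanism in that snippet concrete, here is a minimal sketch of a CLIP-style symmetric contrastive (InfoNCE) loss over a batch of paired embeddings. The function name, the temperature default, and the use of precomputed embeddings are illustrative assumptions, not CLIP's actual training API.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss.
# Assumes precomputed, paired image/text embeddings; all names here
# (clip_contrastive_loss, temperature=0.07) are illustrative.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    # Project both modalities onto the unit sphere so the dot product
    # is a cosine similarity in the shared feature space.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: logits[i, j] compares image i with text j.
    logits = image_emb @ text_emb.t() / temperature

    # Matched pairs sit on the diagonal, so the target for row i is i.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image->text and text->image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Example: a batch of 8 paired embeddings with dimension 512.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt).item())
```

The symmetric form matters: training pulls each image toward its own caption and each caption toward its own image, while pushing away every other pairing in the batch.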
Chinese AI startup Zhipu AI, also known as Z.ai, has released its GLM-4.6V series, a new generation of open-source vision-language models ...
A new technical paper titled “Multimodal Chip Physical Design Engineer Assistant” was published by researchers at National Taiwan University, University of California, Los Angeles and NVIDIA Research.
Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders
Despite significant advances in Multimodal Large Language Models (MLLMs), understanding complex temporal dynamics in videos remains a major challenge. Our experiments show that current Video Large ...
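The paper's title names the core idea, stacking temporal attention inside the vision encoder, without the snippet explaining the layout. Below is a rough, hedged sketch of one plausible reading: self-attention applied along the time axis of per-frame patch features, repeated in a small stack. Every name and hyperparameter here (`TemporalAttentionStack`, the tensor shapes, depth, head count) is an assumption for illustration, not the paper's actual architecture.

```python
# Sketch of temporal self-attention stacked over frame features,
# assuming features shaped [batch, frames, patches, channels].
# This is an illustrative guess at the general idea, not the
# architecture from the cited paper.
import torch
import torch.nn as nn

class TemporalAttentionStack(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8, depth: int = 2):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(depth)
        ])
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(depth)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [B, T, N, D] = batch, frames, patch tokens per frame, channels.
        B, T, N, D = x.shape
        # Fold patch tokens into the batch so attention runs across frames:
        # each spatial location attends to itself at every other timestep.
        x = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        for attn, norm in zip(self.layers, self.norms):
            h = norm(x)
            out, _ = attn(h, h, h)  # temporal self-attention, pre-norm
            x = x + out             # residual connection
        return x.reshape(B, N, T, D).permute(0, 2, 1, 3)

# Example: 2 clips, 16 frames, 196 patches, 768-dim features.
feats = torch.randn(2, 16, 196, 768)
print(TemporalAttentionStack()(feats).shape)  # torch.Size([2, 16, 196, 768])
```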
U.S. tech giants are facing a reckoning from the East. Even as Nvidia pledged today to invest a staggering $100 billion in the data centers of its own customer, OpenAI, a move that raised eyebrows across ...
Abstract: Multimodal medical image fusion (MMIF) extracts the most meaningful information from multiple source images, enabling a more comprehensive and accurate diagnosis. Achieving high-quality ...
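As a point of reference for what "extracting the most meaningful information from multiple source images" can mean in practice, here is a classic baseline sketch: fusing two co-registered grayscale scans by keeping, at each pixel, the source with the stronger local detail. This is a generic illustration, not the method from the cited paper; the window size and the local-variance criterion are assumptions.

```python
# Baseline sketch of pixel-level multimodal image fusion: at each pixel,
# keep the source image (e.g. CT vs. MRI) with higher local activity.
# Not the cited paper's method; window size and criterion are assumptions.
import numpy as np

def local_variance(img: np.ndarray, k: int = 3) -> np.ndarray:
    """Per-pixel variance over a k x k neighborhood."""
    pad = k // 2
    padded = np.pad(img.astype(np.float64), pad, mode="reflect")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (k, k))
    return windows.var(axis=(-1, -2))

def fuse(a: np.ndarray, b: np.ndarray, k: int = 3) -> np.ndarray:
    """Pick, per pixel, whichever source image has more local detail."""
    mask = local_variance(a, k) >= local_variance(b, k)
    return np.where(mask, a, b)

# Example with random stand-ins for two registered 64x64 scans.
ct = np.random.rand(64, 64)
mri = np.random.rand(64, 64)
print(fuse(ct, mri).shape)  # (64, 64)
```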
The authors introduce a densely-sampled dataset where 6 ...