Top 100 · Multimodal AI
100 repositories sorted by multimodal AI
| # | Repository | Stars | Forks | Language | Issues | Description | Last Commit |
|---|---|---|---|---|---|---|---|
| 1 | transformers huggingface | 160.3k | 33.1k | Python | 1049 | 🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training. | 2026-05-05 |
| 2 | anything-llm Mintplex-Labs | 59.6k | 6.4k | JavaScript | 317 | The all-in-one AI productivity accelerator. On device and privacy first with no annoying setup or configuration. | 2026-05-04 |
| 3 | UI-TARS-desktop bytedance | 29.6k | 2.9k | TypeScript | 315 | The Open-Source Multimodal AI Agent Stack: Connecting Cutting-Edge AI Models and Agent Infra | 2026-04-29 |
| 4 | sglang sgl-project | 27.1k | 5.7k | Python | 637 | SGLang is a high-performance serving framework for large language models and multimodal models. | 2026-05-06 |
| 5 | haystack deepset-ai | 25.1k | 2.8k | MDX | 95 | Open-source AI orchestration framework for building context-engineered, production-ready LLM applications. Design modular pipelines and agent workflows with explicit control over retrieval, routing, memory, and generation. Built for scalable agents, RAG, multimodal applications, semantic search, and conversational systems. | 2026-05-05 |
| 6 | LLaVA haotian-liu | 24.8k | 2.8k | Python | 1096 | [NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond. | 2024-08-12 |
| 7 | MiniCPM-o OpenBMB | 24.5k | 1.9k | Python | 26 | A Gemini 2.5 Flash Level MLLM for Vision, Speech, and Full-Duplex Multimodal Live Streaming on Your Phone | 2026-04-27 |
| 8 | unilm microsoft | 22.1k | 2.7k | Python | 641 | Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities | 2026-01-23 |
| 9 | serve jina-ai | 21.9k | 2.2k | Python | 1 | ☁️ Build multimodal AI applications with cloud-native stack | 2025-03-24 |
| 10 | Qwen3-VL QwenLM | 19.1k | 1.8k | Jupyter Notebook | 375 | Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud. | 2026-01-30 |
| 11 | screenpipe screenpipe | 18.5k | 1.7k | Rust | 12 | Run agents that work based on what you do. 24/7 local screen & mic recording for the superintelligence era | 2026-05-06 |
| 12 | Awesome-Multimodal-Large-Language-Models BradyFU | 17.7k | 1.1k | N/A | 45 | ✨✨ Latest Advances on Multimodal Large Language Models ✨✨ | 2026-05-01 |
| 13 | Janus deepseek-ai | 17.7k | 2.2k | Python | 159 | Janus-Series: Unified Multimodal Understanding and Generation Models | 2025-02-01 |
| 14 | NeMo NVIDIA-NeMo | 17.2k | 3.4k | Python | 71 | A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech) | 2026-05-05 |
| 15 | ms-swift modelscope | 14.0k | 1.4k | Python | 984 | Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 600+ LLMs (Qwen3.6, DeepSeek-R1, GLM-5.1, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Gemma4, Llava, Phi4, ...) (AAAI 2025). | 2026-05-05 |
| 16 | pipecat pipecat-ai | 11.9k | 2.0k | Python | 93 | Open Source framework for voice and multimodal conversational AI | 2026-05-06 |
| 17 | rerun rerun-io | 10.6k | 722 | Rust | 1307 | An open source SDK for logging, storing, querying, and visualizing multimodal and multi-rate data | 2026-05-05 |
| 18 | runanywhere-sdks RunanywhereAI | 10.4k | 356 | C++ | 32 | Production ready toolkit to run AI locally | 2026-05-05 |
| 19 | self-operating-computer OthersideAI | 10.2k | 1.4k | Python | 81 | A framework to enable multimodal models to operate a computer. | 2025-09-19 |
| 20 | lancedb lancedb | 10.2k | 869 | HTML | 561 | Developer-friendly OSS embedded retrieval library for multimodal AI. Search More; Manage Less. | 2026-05-05 |
| 21 | pyod yzhao062 | 9.8k | 1.5k | Python | 196 | A Python library for anomaly detection across tabular, time series, graph, text, and image data. 60+ detectors, benchmark-backed ADEngine orchestration, and an agentic workflow for AI agents. | 2026-04-16 |
| 22 | gorse gorse-io | 9.7k | 897 | Go | 102 | AI powered open source recommender system engine supports classical/LLM rankers and multimodal content via embedding | 2026-05-06 |
| 23 | seatunnel apache | 9.3k | 2.2k | Java | 365 | SeaTunnel is a multimodal, high-performance, distributed, massive data integration tool. | 2026-05-05 |
| 24 | inference xorbitsai | 9.3k | 824 | Python | 21 | Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source, speech, and multimodal models on cloud, on-prem, or your laptop, all through one unified, production-ready inference API. | 2026-05-04 |
| 25 | deeplake activeloopai | 9.1k | 709 | C++ | 54 | Deeplake is AI Data Runtime for Agents. It provides serverless postgres with a multimodal datalake, enabling scalable retrieval and training. | 2026-02-16 |
| 26 | BentoML bentoml | 8.6k | 959 | Python | 135 | The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more! | 2026-05-04 |
| 27 | MobileAgent X-PLUG | 8.6k | 871 | Python | 183 | Mobile-Agent: The Powerful GUI Agent Family | 2026-04-14 |
| 28 | mmagic open-mmlab | 7.4k | 1.1k | Jupyter Notebook | 61 | OpenMMLab Multimodal Advanced, Generative, and Intelligent Creation Toolbox. Unlock the magic 🪄: Generative-AI (AIGC), easy-to-use APIs, awesome model zoo, diffusion models, for text-to-image generation, image/video restoration/enhancement, etc. | 2024-08-06 |
| 29 | GLM-4 zai-org | 7.1k | 619 | Python | 35 | GLM-4 series: Open Multilingual Multimodal Chat LMs | 2025-07-04 |
| 30 | all-in-rag datawhalechina | 7.0k | 3.4k | Python | 11 | 🔍 LLM Application Development in Practice, Part 1: a full-stack guide to RAG technology. Read online: https://datawhalechina.github.io/all-in-rag/ | 2026-05-02 |
| 31 | mlx-audio Blaizzy | 6.9k | 578 | Python | 57 | A text-to-speech (TTS), speech-to-text (STT) and speech-to-speech (STS) library built on Apple's MLX framework, providing efficient speech analysis on Apple Silicon. | 2026-05-03 |
| 32 | awesome-multimodal-ml pliang279 | 6.9k | 899 | N/A | 6 | Reading list for research topics in multimodal machine learning | 2024-08-20 |
| 33 | AppAgent TencentQQGYLab | 6.7k | 741 | Python | 87 | AppAgent: Multimodal Agents as Smartphone Users, an LLM-based multimodal agent framework designed to operate smartphone apps. | 2025-03-19 |
| 34 | courses SkalskiP | 6.4k | 594 | Python | 5 | This repository is a curated collection of links to various courses and resources about Artificial Intelligence (AI) | 2024-04-22 |
| 35 | lance lance-format | 6.4k | 650 | Rust | 969 | Open Lakehouse Format for Multimodal AI. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, and PyTorch with more integrations coming. | 2026-05-05 |
| 36 | podcastfy souzatharsis | 6.3k | 722 | Python | 84 | An Open Source Python alternative to NotebookLM's podcast feature: Transforming Multimodal Content into Captivating Multilingual Audio Conversations with GenAI | 2026-05-04 |
| 37 | jaaz 11cafe | 6.2k | 607 | TypeScript | 37 | The world's first open-source multimodal creative assistant. This is a substitute for Canva and Manus that prioritizes privacy and is usable locally. | 2026-03-02 |
| 38 | ai-notes swyxio | 6.2k | 552 | HTML | 3 | Notes for software engineers getting up to speed on new AI developments. Serves as datastore for https://latent.space writing, and product brainstorming, but has cleaned up canonical references under the /Resources folder. | 2026-02-16 |
| 39 | VLM-R1 om-ai-lab | 6.0k | 377 | Python | 164 | Solve Visual Understanding with Reinforced VLMs | 2026-03-12 |
| 40 | Bagel ByteDance-Seed | 5.9k | 523 | Python | 140 | Open-source unified multimodal model | 2026-05-04 |
| 41 | genkit genkit-ai | 5.9k | 727 | TypeScript | 673 | Open-source framework for building AI-powered apps in JavaScript, Go, and Python, built and used in production by Google | 2026-05-06 |
| 42 | pyspur PySpur-Dev | 5.7k | 425 | TypeScript | 29 | A visual playground for agentic workflows: Iterate over your agents 10x faster | 2025-07-20 |
| 43 | mmf facebookresearch | 5.6k | 945 | Python | 115 | A modular framework for vision & language multimodal research from Facebook AI Research (FAIR) | 2026-04-07 |
| 44 | UltraRAG OpenBMB | 5.5k | 413 | Python | 6 | [GitHub Trending #2] A Low-Code MCP Framework for Building Complex and Innovative RAG Pipelines | 2026-05-05 |
| 45 | neuraltalk karpathy | 5.5k | 1.3k | Python | 26 | NeuralTalk is a Python+numpy project for learning Multimodal Recurrent Neural Networks that describe images with sentences. | 2020-12-22 |
| 46 | Daft Eventual-Inc | 5.5k | 461 | Rust | 256 | High-performance data engine for AI and multimodal workloads. Process images, audio, video, and structured data at any scale | 2026-05-05 |
| 47 | DeepSeek-VL2 deepseek-ai | 5.3k | 1.8k | Python | 101 | DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding | 2025-02-26 |
| 48 | xtuner InternLM | 5.1k | 419 | Python | 240 | A Next-Generation Training Engine Built for Ultra-Large MoE Models | 2026-05-05 |
| 49 | align-anything PKU-Alignment | 4.7k | 506 | Python | 29 | Align Anything: Training All-modality Model with Feedback | 2025-11-27 |
| 50 | vllm-omni vllm-project | 4.6k | 876 | Python | 385 | A framework for efficient model inference with omni-modality models | 2026-05-06 |
| 51 | tree-of-thoughts kyegomez | 4.6k | 375 | Python | 12 | Plug in and Play Implementation of Tree of Thoughts: Deliberate Problem Solving with Large Language Models that Elevates Model Reasoning by at least 70% | 2025-07-29 |
| 52 | Awesome-AIGC-Tutorials luban-agi | 4.5k | 301 | N/A | 5 | Curated tutorials and resources for Large Language Models, AI Painting, and more. | 2024-03-31 |
| 53 | ultravox fixie-ai | 4.4k | 372 | Python | 53 | A fast multimodal LLM for real-time voice | 2025-12-12 |
| 54 | img2dataset rom1504 | 4.4k | 375 | Python | 125 | Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine. | 2025-10-19 |
| 55 | VisualGLM-6B zai-org | 4.2k | 424 | Python | 269 | Chinese and English multimodal conversational language model | 2024-08-23 |
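| 56 | Fengshenbang-LM IDEA-CCNL | 4.1k | 380 | Python | 104 | Fengshenbang-LM (封神榜大模型) is an open-source large-model ecosystem led by the Cognitive Computing and Natural Language Research Center at the IDEA Research Institute, serving as infrastructure for Chinese AIGC and cognitive intelligence. | 2024-08-13 |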
| 57 | lmms-eval EvolvingLMMs-Lab | 4.1k | 579 | Python | 26 | One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks | 2026-04-29 |
| 58 | open_flamingo mlfoundations | 4.1k | 319 | Python | 45 | An open-source framework for training large multimodal models. | 2024-08-31 |
| 59 | OmniGen2 VectorSpaceLab | 4.1k | 25 | Jupyter Notebook | 100 | OmniGen2: Exploration to Advanced Multimodal Generation. https://arxiv.org/abs/2506.18871 | 2026-03-20 |
| 60 | mm-cot amazon-science | 4.0k | 331 | Python | 44 | Official implementation for "Multimodal Chain-of-Thought Reasoning in Language Models" (stay tuned and more will be updated) | 2024-06-12 |
| 61 | OmniRoute diegosouzapw | 4.0k | 644 | TypeScript | 40 | Never stop coding. Free AI gateway: one endpoint, 160+ providers, RTK+Caveman stacked compression up to ~95% eligible context savings, smart auto-fallback, MCP/A2A, multimodal APIs, Desktop/PWA. | 2026-05-05 |
| 62 | Qwen2.5-Omni QwenLM | 4.0k | 323 | Jupyter Notebook | 213 | Qwen2.5-Omni is an end-to-end multimodal model by Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, video, and performing real-time speech generation. | 2025-06-12 |
| 63 | mmpretrain open-mmlab | 3.8k | 1.1k | Python | 202 | OpenMMLab Pre-training Toolbox and Benchmark | 2024-11-01 |
| 64 | discoart jina-ai | 3.8k | 243 | Python | 25 | 🪩 Create Disco Diffusion artworks in one line | 2023-05-16 |
| 65 | VILA NVlabs | 3.8k | 319 | Python | 67 | VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud. | 2026-03-12 |
| 66 | NExT-GPT NExT-GPT | 3.6k | 361 | Python | 81 | Code and models for ICML 2024 paper, NExT-GPT: Any-to-Any Multimodal Large Language Model | 2025-05-13 |
| 67 | Awesome-LLM-Reasoning atfortes | 3.6k | 205 | N/A | 5 | From Chain-of-Thought prompting to OpenAI o1 and DeepSeek-R1 🍓 | 2026-04-20 |
| 68 | morphik-core morphik-org | 3.6k | 299 | Python | 13 | The most accurate document search and store for building AI apps | 2026-04-02 |
| 69 | mini-omni gpt-omni | 3.5k | 310 | Python | 36 | Open-source multimodal large language model that can hear and talk while thinking. Featuring real-time end-to-end speech input and streaming audio output conversational capabilities. | 2024-11-05 |
| 70 | mteb embeddings-benchmark | 3.2k | 612 | Python | 273 | MTEB: Massive Text Embedding Benchmark | 2026-05-05 |
| 71 | SimpleMem aiming-lab | 3.2k | 334 | Python | 9 | SimpleMem: Efficient Lifelong Memory for LLM Agents (Text & Multimodal) | 2026-04-04 |
| 72 | InternGPT OpenGVLab | 3.2k | 235 | Python | 19 | InternGPT (iGPT) is an open source demo platform where you can easily showcase your AI models. Now it supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc. Try it at igpt.opengvlab.com (an online demo system supporting DragGAN, ChatGPT, ImageBind, and SAM) | 2024-08-20 |
| 73 | Skywork-R1V SkyworkAI | 3.2k | 280 | Python | 28 | Skywork-R1V is an advanced multimodal AI model series developed by Skywork AI, specializing in vision-language reasoning. | 2025-12-15 |
| 74 | torchscale microsoft | 3.1k | 225 | Python | 29 | Foundation Architecture for (M)LLMs | 2024-04-11 |
| 75 | docarray docarray | 3.1k | 241 | Python | 68 | Represent, send, store and search multimodal data | 2026-03-27 |
| 76 | HunyuanImage-3.0 Tencent-Hunyuan | 3.0k | 162 | Python | 40 | HunyuanImage-3.0: A Powerful Native Multimodal Model for Image Generation | 2026-02-03 |
| 77 | awesome-embodied-vla-va-vln jonyzhang2023 | 3.0k | 137 | N/A | 0 | A curated list of state-of-the-art research in embodied AI, focusing on vision-language-action (VLA) models, vision-language navigation (VLN), and related multimodal learning approaches. | 2026-04-15 |
| 78 | InternLM-XComposer InternLM | 2.9k | 176 | Python | 139 | InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions | 2025-05-26 |
| 79 | vortex vortex-data | 2.9k | 150 | Rust | 189 | An extensible, state-of-the-art framework for columnar compression, and the fastest FOSS columnar file format. Formerly at @spiraldb, now an Incubation Stage project at LFAI&Data, part of the Linux Foundation. | 2026-05-05 |
| 80 | OSWorld xlang-ai | 2.8k | 447 | Python | 147 | [NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments | 2026-05-01 |
| 81 | helm stanford-crfm | 2.8k | 382 | Python | 48 | Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models. | 2026-05-05 |
| 82 | Awesome-AI4Med FreedomIntelligence | 2.8k | 474 | N/A | 0 | A curated list of medical LLMs, multimodal systems, datasets, benchmarks, and more. 🏥 | 2026-04-27 |
| 83 | clip-retrieval rom1504 | 2.8k | 239 | Jupyter Notebook | 80 | Easily compute clip embeddings and build a clip retrieval system with them | 2026-03-28 |
| 84 | datachain datachain-ai | 2.7k | 140 | Python | 63 | Data Memory: the operational data context layer for AI agents - typed, versioned datasets over images, video, docs and tables | 2026-05-05 |
| 85 | MUNIT NVlabs | 2.7k | 485 | Python | 63 | Multimodal Unsupervised Image-to-Image Translation | 2022-09-20 |
| 86 | autodistill autodistill | 2.7k | 213 | Python | 39 | Images to inference with no labeling (use foundation models to train supervised models). | 2025-05-14 |
| 87 | maestro roboflow | 2.7k | 221 | Python | 17 | Streamline the fine-tuning process for multimodal models: PaliGemma 2, Florence-2, and Qwen2.5-VL | 2026-05-01 |
| 88 | OmAgent om-ai-lab | 2.6k | 288 | Python | 7 | [EMNLP-2024] Build multimodal language agents for fast prototype and production | 2025-03-19 |
| 89 | OFA OFA-Sys | 2.6k | 249 | Python | 109 | Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework | 2024-04-24 |
| 90 | mPLUG-Owl X-PLUG | 2.5k | 190 | Python | 99 | mPLUG-Owl: The Powerful Multi-modal Large Language Model Family | 2025-04-02 |
| 91 | OCRFlux chatdoc-com | 2.5k | 151 | Python | 69 | OCRFlux is a lightweight yet powerful multimodal toolkit that significantly advances PDF-to-Markdown conversion, excelling in complex layout handling, complicated table parsing and cross-page content merging. | 2026-04-14 |
| 92 | HuixiangDou InternLM | 2.5k | 181 | Python | 32 | HuixiangDou: Overcoming Group Chat Scenarios with LLM-based Technical Assistance | 2025-11-24 |
| 93 | OmniSVG OmniSVG | 2.5k | 94 | Python | 36 | [NeurIPS 2025] OmniSVG is the first family of end-to-end multimodal SVG generators that leverage pre-trained Vision-Language Models (VLMs), capable of generating complex and detailed SVGs, from simple icons to intricate anime characters. | 2026-03-01 |
| 94 | stability-sdk Stability-AI | 2.4k | 344 | Jupyter Notebook | 37 | SDK for interacting with stability.ai APIs (e.g. stable diffusion inference) | 2025-08-05 |
| 95 | Awesome-Text-to-Image Yutong-Zhou-cv | 2.4k | 205 | N/A | 0 | (ෆ`꒳´ෆ) A Survey on Text-to-Image Generation/Synthesis. | 2026-02-07 |
| 96 | tribev2 facebookresearch | 2.4k | 544 | Jupyter Notebook | 15 | This repository contains the code to train and evaluate TRIBE v2, a multimodal model for brain response prediction | 2026-03-30 |
| 97 | mPLUG-DocOwl X-PLUG | 2.4k | 148 | Python | 70 | mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding | 2025-05-30 |
| 98 | GLM-V zai-org | 2.3k | 167 | Python | 11 | GLM-4.6V/4.5V/4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning | 2026-04-06 |
| 99 | hcaptcha-challenger QIN2DIM | 2.3k | 422 | Python | 41 | 🥂 Gracefully face hCaptcha challenge with multimodal large language model. | 2026-01-28 |
| 100 | perception_models facebookresearch | 2.3k | 157 | Jupyter Notebook | 44 | State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More! | 2026-04-13 |