GitHub Ranking
2026-05-06

👁️ Top 100 · Multimodal AI

100 repositories sorted by Multimodal AI

📦 100 repos 🕐 2026-05-06
# Repository Stars Forks Language Issues Description Last Commit
1 transformers huggingface 160.3k 33.1k Python 1049 🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training. 2026-05-05
2 anything-llm Mintplex-Labs 59.6k 6.4k JavaScript 317 The all-in-one AI productivity accelerator. On device and privacy first with no annoying setup or configuration. 2026-05-04
3 UI-TARS-desktop bytedance 29.6k 2.9k TypeScript 315 The Open-Source Multimodal AI Agent Stack: Connecting Cutting-Edge AI Models and Agent Infra 2026-04-29
4 sglang sgl-project 27.1k 5.7k Python 637 SGLang is a high-performance serving framework for large language models and multimodal models. 2026-05-06
5 haystack deepset-ai 25.1k 2.8k MDX 95 Open-source AI orchestration framework for building context-engineered, production-ready LLM applications. Design modular pipelines and agent workflows with explicit control over retrieval, routing, memory, and generation. Built for scalable agents, RAG, multimodal applications, semantic search, and conversational systems. 2026-05-05
6 LLaVA haotian-liu 24.8k 2.8k Python 1096 [NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond. 2024-08-12
7 MiniCPM-o OpenBMB 24.5k 1.9k Python 26 A Gemini 2.5 Flash Level MLLM for Vision, Speech, and Full-Duplex Multimodal Live Streaming on Your Phone 2026-04-27
8 unilm microsoft 22.1k 2.7k Python 641 Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities 2026-01-23
9 serve jina-ai 21.9k 2.2k Python 1 ☁️ Build multimodal AI applications with cloud-native stack 2025-03-24
10 Qwen3-VL QwenLM 19.1k 1.8k Jupyter Notebook 375 Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud. 2026-01-30
11 screenpipe screenpipe 18.5k 1.7k Rust 12 Run agents that work based on what you do. 24/7 local screen & mic recording for the superintelligence era 2026-05-06
12 Awesome-Multimodal-Large-Language-Models BradyFU 17.7k 1.1k N/A 45 ✨✨ Latest Advances on Multimodal Large Language Models 2026-05-01
13 Janus deepseek-ai 17.7k 2.2k Python 159 Janus-Series: Unified Multimodal Understanding and Generation Models 2025-02-01
14 NeMo NVIDIA-NeMo 17.2k 3.4k Python 71 A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech) 2026-05-05
15 ms-swift modelscope 14.0k 1.4k Python 984 Use PEFT or Full-parameter to CPT/SFT/DPO/GRPO 600+ LLMs (Qwen3.6, DeepSeek-R1, GLM-5.1, InternLM3, Llama4, ...) and 300+ MLLMs (Qwen3-VL, Qwen3-Omni, InternVL3.5, Ovis2.5, GLM4.5v, Gemma4, Llava, Phi4, ...) (AAAI 2025). 2026-05-05
16 pipecat pipecat-ai 11.9k 2.0k Python 93 Open Source framework for voice and multimodal conversational AI 2026-05-06
17 rerun rerun-io 10.6k 722 Rust 1307 An open source SDK for logging, storing, querying, and visualizing multimodal and multi-rate data 2026-05-05
18 runanywhere-sdks RunanywhereAI 10.4k 356 C++ 32 Production ready toolkit to run AI locally 2026-05-05
19 self-operating-computer OthersideAI 10.2k 1.4k Python 81 A framework to enable multimodal models to operate a computer. 2025-09-19
20 lancedb lancedb 10.2k 869 HTML 561 Developer-friendly OSS embedded retrieval library for multimodal AI. Search More; Manage Less. 2026-05-05
21 pyod yzhao062 9.8k 1.5k Python 196 A Python library for anomaly detection across tabular, time series, graph, text, and image data. 60+ detectors, benchmark-backed ADEngine orchestration, and an agentic workflow for AI agents. 2026-04-16
22 gorse gorse-io 9.7k 897 Go 102 AI powered open source recommender system engine supports classical/LLM rankers and multimodal content via embedding 2026-05-06
23 seatunnel apache 9.3k 2.2k Java 365 SeaTunnel is a multimodal, high-performance, distributed, massive data integration tool. 2026-05-05
24 inference xorbitsai 9.3k 824 Python 21 Swap GPT for any LLM by changing a single line of code. Xinference lets you run open-source, speech, and multimodal models on cloud, on-prem, or your laptop, all through one unified, production-ready inference API. 2026-05-04
25 deeplake activeloopai 9.1k 709 C++ 54 Deeplake is AI Data Runtime for Agents. It provides serverless postgres with a multimodal datalake, enabling scalable retrieval and training. 2026-02-16
26 BentoML bentoml 8.6k 959 Python 135 The easiest way to serve AI apps and models - Build Model Inference APIs, Job queues, LLM apps, Multi-model pipelines, and more! 2026-05-04
27 MobileAgent X-PLUG 8.6k 871 Python 183 Mobile-Agent: The Powerful GUI Agent Family 2026-04-14
28 mmagic open-mmlab 7.4k 1.1k Jupyter Notebook 61 OpenMMLab Multimodal Advanced, Generative, and Intelligent Creation Toolbox. Unlock the magic 🪄: Generative-AI (AIGC), easy-to-use APIs, awesome model zoo, diffusion models, for text-to-image generation, image/video restoration/enhancement, etc. 2024-08-06
29 GLM-4 zai-org 7.1k 619 Python 35 GLM-4 series: Open Multilingual Multimodal Chat LMs 2025-07-04
30 all-in-rag datawhalechina 7.0k 3.4k Python 11 🔍 Large Model Application Development in Practice, Part 1: a full-stack guide to RAG technology. Read online: https://datawhalechina.github.io/all-in-rag/ 2026-05-02
31 mlx-audio Blaizzy 6.9k 578 Python 57 A text-to-speech (TTS), speech-to-text (STT) and speech-to-speech (STS) library built on Apple's MLX framework, providing efficient speech analysis on Apple Silicon. 2026-05-03
32 awesome-multimodal-ml pliang279 6.9k 899 N/A 6 Reading list for research topics in multimodal machine learning 2024-08-20
33 AppAgent TencentQQGYLab 6.7k 741 Python 87 AppAgent: Multimodal Agents as Smartphone Users, an LLM-based multimodal agent framework designed to operate smartphone apps. 2025-03-19
34 courses SkalskiP 6.4k 594 Python 5 This repository is a curated collection of links to various courses and resources about Artificial Intelligence (AI) 2024-04-22
35 lance lance-format 6.4k 650 Rust 969 Open Lakehouse Format for Multimodal AI. Convert from Parquet in 2 lines of code for 100x faster random access, vector index, and data versioning. Compatible with Pandas, DuckDB, Polars, Pyarrow, and PyTorch with more integrations coming. 2026-05-05
36 podcastfy souzatharsis 6.3k 722 Python 84 An Open Source Python alternative to NotebookLM's podcast feature: Transforming Multimodal Content into Captivating Multilingual Audio Conversations with GenAI 2026-05-04
37 jaaz 11cafe 6.2k 607 TypeScript 37 The world's first open-source multimodal creative assistant. This is a substitute for Canva and Manus that prioritizes privacy and is usable locally. 2026-03-02
38 ai-notes swyxio 6.2k 552 HTML 3 Notes for software engineers getting up to speed on new AI developments. Serves as datastore for https://latent.space writing, and product brainstorming, but has cleaned up canonical references under the /Resources folder. 2026-02-16
39 VLM-R1 om-ai-lab 6.0k 377 Python 164 Solve Visual Understanding with Reinforced VLMs 2026-03-12
40 Bagel ByteDance-Seed 5.9k 523 Python 140 Open-source unified multimodal model 2026-05-04
41 genkit genkit-ai 5.9k 727 TypeScript 673 Open-source framework for building AI-powered apps in JavaScript, Go, and Python, built and used in production by Google 2026-05-06
42 pyspur PySpur-Dev 5.7k 425 TypeScript 29 A visual playground for agentic workflows: Iterate over your agents 10x faster 2025-07-20
43 mmf facebookresearch 5.6k 945 Python 115 A modular framework for vision & language multimodal research from Facebook AI Research (FAIR) 2026-04-07
44 UltraRAG OpenBMB 5.5k 413 Python 6 [GitHub Trending #2] A Low-Code MCP Framework for Building Complex and Innovative RAG Pipelines 2026-05-05
45 neuraltalk karpathy 5.5k 1.3k Python 26 NeuralTalk is a Python+numpy project for learning Multimodal Recurrent Neural Networks that describe images with sentences. 2020-12-22
46 Daft Eventual-Inc 5.5k 461 Rust 256 High-performance data engine for AI and multimodal workloads. Process images, audio, video, and structured data at any scale 2026-05-05
47 DeepSeek-VL2 deepseek-ai 5.3k 1.8k Python 101 DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding 2025-02-26
48 xtuner InternLM 5.1k 419 Python 240 A Next-Generation Training Engine Built for Ultra-Large MoE Models 2026-05-05
49 align-anything PKU-Alignment 4.7k 506 Python 29 Align Anything: Training All-modality Model with Feedback 2025-11-27
50 vllm-omni vllm-project 4.6k 876 Python 385 A framework for efficient model inference with omni-modality models 2026-05-06
51 tree-of-thoughts kyegomez 4.6k 375 Python 12 Plug in and Play Implementation of Tree of Thoughts: Deliberate Problem Solving with Large Language Models that Elevates Model Reasoning by at least 70% 2025-07-29
52 Awesome-AIGC-Tutorials luban-agi 4.5k 301 N/A 5 Curated tutorials and resources for Large Language Models, AI Painting, and more. 2024-03-31
53 ultravox fixie-ai 4.4k 372 Python 53 A fast multimodal LLM for real-time voice 2025-12-12
54 img2dataset rom1504 4.4k 375 Python 125 Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine. 2025-10-19
55 VisualGLM-6B zai-org 4.2k 424 Python 269 Chinese and English multimodal conversational language model 2024-08-23
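56 Fengshenbang-LM IDEA-CCNL 4.1k 380 Python 104 Fengshenbang-LM (封神榜大模型) is an open-source large-model ecosystem led by the Cognitive Computing and Natural Language Research Center of the IDEA Research Institute, serving as infrastructure for Chinese AIGC and cognitive intelligence. 2024-08-13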
57 lmms-eval EvolvingLMMs-Lab 4.1k 579 Python 26 One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks 2026-04-29
58 open_flamingo mlfoundations 4.1k 319 Python 45 An open-source framework for training large multimodal models. 2024-08-31
59 OmniGen2 VectorSpaceLab 4.1k 25 Jupyter Notebook 100 OmniGen2: Exploration to Advanced Multimodal Generation. https://arxiv.org/abs/2506.18871 2026-03-20
60 mm-cot amazon-science 4.0k 331 Python 44 Official implementation for "Multimodal Chain-of-Thought Reasoning in Language Models" (stay tuned and more will be updated) 2024-06-12
61 OmniRoute diegosouzapw 4.0k 644 TypeScript 40 Never stop coding. Free AI gateway: one endpoint, 160+ providers, RTK+Caveman stacked compression up to ~95% eligible context savings, smart auto-fallback, MCP/A2A, multimodal APIs, Desktop/PWA. 2026-05-05
62 Qwen2.5-Omni QwenLM 4.0k 323 Jupyter Notebook 213 Qwen2.5-Omni is an end-to-end multimodal model by Qwen team at Alibaba Cloud, capable of understanding text, audio, vision, video, and performing real-time speech generation. 2025-06-12
63 mmpretrain open-mmlab 3.8k 1.1k Python 202 OpenMMLab Pre-training Toolbox and Benchmark 2024-11-01
64 discoart jina-ai 3.8k 243 Python 25 🪩 Create Disco Diffusion artworks in one line 2023-05-16
65 VILA NVlabs 3.8k 319 Python 67 VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud. 2026-03-12
66 NExT-GPT NExT-GPT 3.6k 361 Python 81 Code and models for ICML 2024 paper, NExT-GPT: Any-to-Any Multimodal Large Language Model 2025-05-13
67 Awesome-LLM-Reasoning atfortes 3.6k 205 N/A 5 From Chain-of-Thought prompting to OpenAI o1 and DeepSeek-R1 🍓 2026-04-20
68 morphik-core morphik-org 3.6k 299 Python 13 The most accurate document search and store for building AI apps 2026-04-02
69 mini-omni gpt-omni 3.5k 310 Python 36 Open-source multimodal large language model that can hear, talk while thinking. Featuring real-time end-to-end speech input and streaming audio output conversational capabilities. 2024-11-05
70 mteb embeddings-benchmark 3.2k 612 Python 273 MTEB: Massive Text Embedding Benchmark 2026-05-05
71 SimpleMem aiming-lab 3.2k 334 Python 9 SimpleMem: Efficient Lifelong Memory for LLM Agents (Text & Multimodal) 2026-04-04
72 InternGPT OpenGVLab 3.2k 235 Python 19 InternGPT (iGPT) is an open source demo platform where you can easily showcase your AI models. Now it supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc. Try it at igpt.opengvlab.com (an online demo system supporting DragGAN, ChatGPT, ImageBind, and SAM) 2024-08-20
73 Skywork-R1V SkyworkAI 3.2k 280 Python 28 Skywork-R1V is an advanced multimodal AI model series developed by Skywork AI, specializing in vision-language reasoning. 2025-12-15
74 torchscale microsoft 3.1k 225 Python 29 Foundation Architecture for (M)LLMs 2024-04-11
75 docarray docarray 3.1k 241 Python 68 Represent, send, store and search multimodal data 2026-03-27
76 HunyuanImage-3.0 Tencent-Hunyuan 3.0k 162 Python 40 HunyuanImage-3.0: A Powerful Native Multimodal Model for Image Generation 2026-02-03
77 awesome-embodied-vla-va-vln jonyzhang2023 3.0k 137 N/A 0 A curated list of state-of-the-art research in embodied AI, focusing on vision-language-action (VLA) models, vision-language navigation (VLN), and related multimodal learning approaches. 2026-04-15
78 InternLM-XComposer InternLM 2.9k 176 Python 139 InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions 2025-05-26
79 vortex vortex-data 2.9k 150 Rust 189 An extensible, state-of-the-art framework for columnar compression, and the fastest FOSS columnar file format. Formerly at @spiraldb, now an Incubation Stage project at LFAI&Data, part of the Linux Foundation. 2026-05-05
80 OSWorld xlang-ai 2.8k 447 Python 147 [NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments 2026-05-01
81 helm stanford-crfm 2.8k 382 Python 48 Holistic Evaluation of Language Models (HELM) is an open source Python framework created by the Center for Research on Foundation Models (CRFM) at Stanford for holistic, reproducible and transparent evaluation of foundation models, including large language models (LLMs) and multimodal models. 2026-05-05
82 Awesome-AI4Med FreedomIntelligence 2.8k 474 N/A 0 A curated list of medical LLMs, multimodal systems, datasets, benchmarks, and more. 🏥 2026-04-27
83 clip-retrieval rom1504 2.8k 239 Jupyter Notebook 80 Easily compute clip embeddings and build a clip retrieval system with them 2026-03-28
84 datachain datachain-ai 2.7k 140 Python 63 Data Memory: the operational data context layer for AI agents - typed, versioned datasets over images, video, docs and tables 2026-05-05
85 MUNIT NVlabs 2.7k 485 Python 63 Multimodal Unsupervised Image-to-Image Translation 2022-09-20
86 autodistill autodistill 2.7k 213 Python 39 Images to inference with no labeling (use foundation models to train supervised models). 2025-05-14
87 maestro roboflow 2.7k 221 Python 17 Streamline the fine-tuning process for multimodal models: PaliGemma 2, Florence-2, and Qwen2.5-VL 2026-05-01
88 OmAgent om-ai-lab 2.6k 288 Python 7 [EMNLP-2024] Build multimodal language agents for fast prototype and production 2025-03-19
89 OFA OFA-Sys 2.6k 249 Python 109 Official repository of OFA (ICML 2022). Paper: OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework 2024-04-24
90 mPLUG-Owl X-PLUG 2.5k 190 Python 99 mPLUG-Owl: The Powerful Multi-modal Large Language Model Family 2025-04-02
91 OCRFlux chatdoc-com 2.5k 151 Python 69 OCRFlux is a lightweight yet powerful multimodal toolkit that significantly advances PDF-to-Markdown conversion, excelling in complex layout handling, complicated table parsing and cross-page content merging. 2026-04-14
92 HuixiangDou InternLM 2.5k 181 Python 32 HuixiangDou: Overcoming Group Chat Scenarios with LLM-based Technical Assistance 2025-11-24
93 OmniSVG OmniSVG 2.5k 94 Python 36 [NeurIPS 2025] OmniSVG is the first family of end-to-end multimodal SVG generators that leverage pre-trained Vision-Language Models (VLMs), capable of generating complex and detailed SVGs, from simple icons to intricate anime characters. 2026-03-01
94 stability-sdk Stability-AI 2.4k 344 Jupyter Notebook 37 SDK for interacting with stability.ai APIs (e.g. stable diffusion inference) 2025-08-05
95 Awesome-Text-to-Image Yutong-Zhou-cv 2.4k 205 N/A 0 (ෆ`꒳´ෆ) A Survey on Text-to-Image Generation/Synthesis. 2026-02-07
96 tribev2 facebookresearch 2.4k 544 Jupyter Notebook 15 This repository contains the code to train and evaluate TRIBE v2, a multimodal model for brain response prediction 2026-03-30
97 mPLUG-DocOwl X-PLUG 2.4k 148 Python 70 mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding 2025-05-30
98 GLM-V zai-org 2.3k 167 Python 11 GLM-4.6V/4.5V/4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning 2026-04-06
99 hcaptcha-challenger QIN2DIM 2.3k 422 Python 41 🥂 Gracefully face hCaptcha challenge with multimodal large language model. 2026-01-28
100 perception_models facebookresearch 2.3k 157 Jupyter Notebook 44 State-of-the-art Image & Video CLIP, Multimodal Large Language Models, and More! 2026-04-13