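Customizable multi-subject video generation; LLM-driven text-to-image generation; controlling moving-object trajectories in video generation; diffusion models for panoptic segmentation; real-time all-purpose SAM; one unified model for all segmentation tasks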

This article was first published on the WeChat official account 机器感知.

LoMA: Lossless Compressed Memory Attention

The ability to handle long texts is one of the most important capabilities of Large Language Models (LLMs), but as the text length increases, the consumption of resources also increases dramatically. At present, reducing resource consumption by compressing the KV cache is a common approach. Although there are many existing compression methods, they share a common drawback: the compression is not lossless. We propose a new method, Lossless Compressed Memory Attention (LoMA), which allows for lossless compression of information into special memory token KV pairs according to a set compression ratio.
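To give a sense of scale for why KV-cache compression matters, here is a back-of-the-envelope sketch in Python. It is not LoMA's method, which the abstract does not describe in detail; it only applies the standard KV-cache size formula and a fixed compression ratio. The function names and the model dimensions in the usage lines are hypothetical, chosen for illustration.

```python
import math


def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Approximate KV-cache size for one sequence.

    The leading factor 2 accounts for storing both keys and values.
    bytes_per_elem=2 assumes 16-bit storage (an assumption for this sketch).
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def compressed_len(seq_len, ratio):
    """Number of cached positions left if every `ratio` tokens are
    summarized into one memory token (illustrative assumption)."""
    return math.ceil(seq_len / ratio)


# Hypothetical model: 32 layers, 32 KV heads, head dimension 128.
full = kv_cache_bytes(4096, n_layers=32, n_kv_heads=32, head_dim=128)
short = kv_cache_bytes(compressed_len(4096, ratio=4),
                       n_layers=32, n_kv_heads=32, head_dim=128)
print(full / 2**30, "GiB uncompressed")
print(short / 2**30, "GiB at a 4:1 ratio")
```

The cache grows linearly with sequence length, so a fixed compression ratio shrinks it by the same factor.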

Image Translation as Diffusion Visual Programmers

We introduce the novel Diffusion Visual Programmer (DVP), a neuro-symbolic image translation framework. Our proposed DVP seamlessly embeds a condition-flexible diffusion model within the GPT architecture, orchestrating a coherent sequence of visual programs (i.e., computer vision models) for various pro-symbolic steps, which span RoI identification, style transfer, and position manipulation, facilitating transparent and controllable image translation processes. 

CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects

Customized text-to-video generation aims to generate high-quality videos guided by text prompts and subject references. Current approaches are designed for single subjects and struggle with multiple subjects, which is a more challenging and practical scenario. In this work, we aim to promote multi-subject guided text-to-video customization. We propose CustomVideo, a novel framework that can generate identity-preserving videos with the guidance of multiple subjects. We also collect a multi-subject text-to-video generation dataset as a comprehensive benchmark, with 69 individual subjects and 57 meaningful pairs.

A-KIT: Adaptive Kalman-Informed Transformer

The extended Kalman filter (EKF) is a widely adopted method for sensor fusion in navigation applications. Common EKF implementations assume constant process noise, but in real-world scenarios the process noise varies, leading to inaccuracies in the estimated state and potentially causing the filter to diverge. To cope with such situations, we derive and introduce A-KIT, an adaptive Kalman-informed transformer that learns the varying process noise covariance online. A-KIT outperforms the conventional EKF by more than 49.5% and the model-based adaptive EKF by an average of 35.4% in terms of position accuracy.
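The role of the process noise can be seen in a minimal scalar example. The sketch below is a plain linear Kalman filter for a random-walk state. It is not an EKF and not A-KIT, and the function name, default values, and measurement sequence are hypothetical. It only shows that a process noise variance q set too small makes the filter slow to follow a real change in the state, which is the failure mode an adaptive estimate of q is meant to avoid.

```python
def kalman_1d(measurements, q, r, x0=0.0, p0=1.0):
    """Scalar Kalman filter for a random-walk state.

    q: process noise variance (assumed constant here)
    r: measurement noise variance
    Returns the state estimates and the Kalman gains at each step.
    """
    x, p = x0, p0
    estimates, gains = [], []
    for z in measurements:
        # Predict: the state model is x_k = x_{k-1} + w, so only the
        # uncertainty changes, growing by the process noise variance.
        p = p + q
        # Update: blend prediction and measurement using the gain.
        k = p / (p + r)
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
        gains.append(k)
    return estimates, gains


# Hypothetical measurements: the true state jumps from about 1 to about 3.
zs = [1.0, 1.2, 0.9, 3.0, 3.1]
est_small_q, gain_small_q = kalman_1d(zs, q=1e-4, r=1.0)
est_large_q, gain_large_q = kalman_1d(zs, q=1.0, r=1.0)
print("final estimate, small q:", est_small_q[-1])
print("final estimate, large q:", est_large_q[-1])
```

With the small q the gain keeps shrinking and the estimate lags well behind the jump; with the larger q the gain stays high and the estimate tracks it. A larger q also passes more measurement noise through, which is why a fixed value is a compromise.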

DiffusionGPT: LLM-Driven Text-to-Image Generation System

A major challenge persists: current text-to-image systems are often unable to handle diverse inputs, or are limited to single-model results. Current unified attempts often fall into two orthogonal aspects: i) parsing diverse prompts at the input stage; ii) activating an expert model to produce the output. To combine the best of both worlds, we propose DiffusionGPT, which leverages Large Language Models (LLMs) to offer a unified generation system capable of seamlessly accommodating various types of prompts and integrating domain-expert models. Moreover, we introduce Advantage Databases, where the Tree-of-Thought is enriched with human feedback, aligning the model selection process with human preferences.

Motion-Zero: Zero-Shot Moving Object Control Framework for Diffusion-Based Video Generation

In this paper, we propose a novel zero-shot moving object trajectory control framework, Motion-Zero, which enables a text-to-video diffusion model to be controlled by bounding-box trajectories. To this end, an initial noise prior module is designed to provide a position-based prior that improves the stability of the moving object's appearance and the accuracy of its position. In addition, based on the attention map of the U-Net, spatial constraints are applied directly to the denoising process of the diffusion model, which further ensures the positional and spatial consistency of moving objects during inference. Furthermore, temporal consistency is guaranteed with a proposed shift temporal attention mechanism.

A Simple Latent Diffusion Approach for Panoptic Segmentation and Mask Inpainting

This work builds upon Stable Diffusion and proposes a latent diffusion approach for panoptic segmentation. The use of a generative model unlocks the exploration of mask completion or inpainting, which has applications in interactive segmentation. The experimental validation yields promising results for both panoptic segmentation and mask inpainting.

RAP-SAM: Towards Real-Time All-Purpose Segment Anything

This work explores a new real-time segmentation setting, named all-purpose segmentation in real-time, to transfer vision foundation models (VFMs) to real-time deployment. It covers three tasks: interactive segmentation, panoptic segmentation, and video segmentation. We aim to use one model to perform all of these tasks in real time. To this end, we present Real-Time All-Purpose SAM (RAP-SAM), which contains an efficient encoder and an efficient decoupled decoder to perform prompt-driven decoding.

OMG-Seg: Is One Model Good Enough For All Segmentation?

In this work, we address various segmentation tasks, each traditionally tackled by distinct or partially unified models. We propose OMG-Seg, One Model that is Good enough to efficiently and effectively handle all the segmentation tasks, including image semantic, instance, and panoptic segmentation, as well as their video counterparts, open vocabulary settings, prompt-driven, interactive segmentation like SAM, and video object segmentation. To our knowledge, this is the first model to handle all these tasks in one model and achieve satisfactory performance.

FreGrad: Lightweight and Fast Frequency-aware Diffusion Vocoder

The goal of this paper is to generate realistic audio with a lightweight and fast diffusion-based vocoder, named FreGrad. In our experiments, FreGrad achieves 3.7 times faster training time and 2.2 times faster inference speed compared to our baseline while reducing the model size by 0.6 times (only 1.78M parameters) without sacrificing the output quality.
