Tetromino Pixel

"Tetromino Pixel" is an umbrella term for research that uses pixel-level information from images and videos to solve higher-level tasks. Current work relies on deep learning models, including transformers, U-Nets, and diffusion models, to process visual data and combine it with other modalities such as text and 3D point clouds. Applications include image captioning, object detection, 3D reconstruction, and robotic control. The work matters for advancing multimodal AI, for improving the efficiency and interpretability of computer vision systems, and for enabling new capabilities in areas such as autonomous navigation and medical image analysis.

Papers

December 15, 2023

Osprey: Pixel Understanding with Visual Instruction Tuning
Yuqian Yuan, Wentong Li, Jian Liu, Dongqi Tang, Xinjie Luo, Chi Qin, Lei Zhang, Jianke Zhu
Fine Grained Instruction Tuning Tetromino Pixel Visual Instruction Tuning Instruction Data

December 4, 2023

PixelLM: Pixel Reasoning with Large Multimodal Model
Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, Xiaojie Jin
Large Multimodal Model Pixel Level Tetromino Pixel Target Mask Reasoning Segmentation

December 2, 2023

Learning county from pixels: Corn yield prediction with attention-weighted multiple instance learning
Xiaoyu Wang, Yuchi Ma, Qunying Huang, Zhengwei Yang, Zhou Zhang
Multiple Instance Learning Satellite Imagery Tetromino Pixel Yield Prediction Local Learning

November 29, 2023

Text as Images: Can Multimodal Large Language Models Follow Printed Instructions in Pixels?
Xiujun Li, Yujie Lu, Zhe Gan, Jianfeng Gao, William Yang Wang, Yejin Choi
Text Modality Multimodal Large Language Model Multimodal Model Human Instruction Tetromino Pixel Multimodal Instruction Interactive Instruction

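November 27, 2023

From Pixels to Titles: Video Game Identification by Screenshots using Convolutional Neural Networks

Beyond Pixels: Exploring Human-Readable SVG Generation for Simple Images with Vision Language Models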

November 22, 2023

PG-Video-LLaVA: Pixel Grounding Large Video-Language Models
Shehan Munasinghe, Rusiru Thushara, Muhammad Maaz, Hanoona Abdul Rasheed, Salman Khan, Mubarak Shah, Fahad Khan
Large Multimodal Model Video Understanding Tetromino Pixel Large Video Language Model Video Dialog

November 6, 2023

GLaMM: Pixel Grounding Large Multimodal Model
Hanoona Rasheed, Muhammad Maaz, Sahal Shaji Mullappilly, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, Fahad S. Khan
Large Multimodal Model Tetromino Pixel Visual Domain Text Grounding

November 2, 2023

Enriching Phrases with Coupled Pixel and Object Contexts for Panoptic Narrative Grounding
Tianrui Hui, Zihan Ding, Junshi Huang, Xiaoming Wei, Xiaolin Wei, Jiao Dai, Jizhong Han, Si Liu
Contrastive Loss Tetromino Pixel Panoptic Narrative Grounding Object Context

October 16, 2023

Forecaster: Towards Temporally Abstract Tree-Search Planning from Pixels
Thomas Jiralerspong, Flemming Kondrup, Doina Precup, Khimya Khetarpal
Tetromino Pixel Forecast Utterance Dimensional State Space Single Task Learning Reinforcement Learning Tree Search Planning

September 27, 2023

Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation
David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, Mike Zheng Shou
Latent Diffusion Model Video Diffusion Model Tetromino Pixel Text to Video Generation Text to Video Diffusion Model Video Text Alignment Video Generation Benchmark

September 18, 2023

Scribble-based 3D Multiple Abdominal Organ Segmentation via Triple-branch Multi-dilated Network with Pixel- and Class-wise Consistency
Meng Han, Xiangde Luo, Wenjun Liao, Shichuan Zhang, Shaoting Zhang, Guotai Wang
Tetromino Pixel Multi Organ Segmentation Abdominal Organ Segmentation Supervised Segmentation

August 30, 2023

From Pixels to Portraits: A Comprehensive Survey of Talking Head Generation Techniques and Applications
Shreyank N Gowda, Dheeraj Pandey, Shashank Narayana Gowda
Deep Learning Computer Vision Neural Radiance Field Financial Application Comprehensive Survey Tetromino Pixel Head Generation Talking Head Human Portrait

August 1, 2023

Pixel to policy: DQN Encoders for within & cross-game reinforcement learning
Ashrya Agrawal, Priyanshi Shah, Sourabh Prakash
Reinforcement Learning Transfer Learning Tetromino Pixel DQN Agent Cross Domain Reinforcement Learning

June 24, 2023

Learning from Pixels with Expert Observations
Minh-Huy Hoang, Long Dinh, Hai Nguyen
Reinforcement Learning Learning Abstract Sparse Reward Tetromino Pixel Goal Conditioned Reinforcement Learning Expert Level Performance Visual Goal Expert Observation

June 15, 2023

Seeing the Pose in the Pixels: Learning Pose-Aware Representations in Vision Transformers
Dominick Reilly, Aman Chadha, Srijan Das
Vision Transformer Human Pose Tetromino Pixel Pose Prediction Pose Representation Pose Attention

June 1, 2023

Differential Diffusion: Giving Each Pixel Its Strength
Eran Levin, Ohad Fried
Diffusion Model Image Generation Image Synthesis Image Editing Tetromino Pixel Estimated Team Strength

May 31, 2023

From Pixels to UI Actions: Learning to Follow Instructions via Graphical User Interfaces
Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu, Urvashi Khandelwal, Kenton Lee, Kristina Toutanova
Action Space Human Instruction Tetromino Pixel Graphical User Interface

May 29, 2023

RLAD: Reinforcement Learning from Pixels for Autonomous Driving in Urban Environments
Daniel Coelho, Miguel Oliveira, Vitor Santos
Reinforcement Learning Autonomous Driving Urban Environment Tetromino Pixel Convolutional Encoder Urban Autonomous Driving

May 22, 2023

MFT: Long-Term Tracking of Every Pixel
Michal Neoral, Jonáš Šerých, Jiří Matas
Optical Flow Web Tracking Tetromino Pixel Point Tracking Long Term Tracking Dense Tracking Optimal Flow
