Pix2seq
Pix2Seq codebase: multiple tasks with generative modeling (autoregressive and diffusion)
This is the official implementation of Pix2Seq in TensorFlow 2 with efficient TPU/GPU support. The original Pix2Seq aims to be a general framework that turns RGB pixels into semantically meaningful sequences. We have since extended it into a generic codebase, with a task-centric organization that supports different tasks as well as their combination, using generative modeling (both autoregressive and diffusion models; see below).
<div align="center"> <img width="95%" alt="Pix2Seq Illustration" src="pix2seq.gif"> </div> <div align="center"> An illustration of Pix2Seq for object detection (from <a href="https://ai.googleblog.com/2022/04/pix2seq-new-language-interface-for.html">our Google AI blog post</a>). </div>(<span style="color:red">NEW!</span>) FitTransformer (FIT)
We added (official) implementations of FitTransformer (FIT), which can be used as an encoder, a diffusion decoder, or an autoregressive decoder; see architectures/transformers.py.
(<span style="color:red">NEW!</span>) Diffusion models
We added (official) implementations of diffusion models (such as Bit Diffusion and RIN; see references below) built on top of the original Pix2Seq codebase; they can be found in tasks/, models/, and architectures/. Please note that we have not yet added proper documentation for training these models.
Models
<a href="https://colab.research.google.com/github/google-research/pix2seq/blob/master/colabs/pix2seq_inference_object_detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>
Objects365 object detection pretrained checkpoints
Backbone | Total params (M) | Image size | Google cloud storage location
-------------: | ---------------: | ---------: | -----------:
ResNet-50 | 36.6 | 640x640 | gs://pix2seq/obj365_pretrain/resnet_640x640_b256_s400k
ResNet-50 (C4) | 84.7 | 640x640 | gs://pix2seq/obj365_pretrain/resnetc_640x640_b256_s400k
ViT-B | 115.2 | 640x640 | gs://pix2seq/obj365_pretrain/vit_b_640x640_b256_s400k
ViT-L | 341.2 | 640x640 | gs://pix2seq/obj365_pretrain/vit_l_640x640_b256_s400k
COCO object detection fine-tuned checkpoints
Backbone | Total params (M) | Image size | COCO AP | Google cloud storage location
-------------: | ---------------: | ---------: | --------: | -----------:
ResNet-50 | 36.6 | 640x640 | 39.1 | gs://pix2seq/coco_det_finetune/resnet_640x640
ResNet-50 | 36.6 | 1024x1024 | 41.7 | gs://pix2seq/coco_det_finetune/resnet_1024x1024
ResNet-50 | 36.6 | 1333x1333 | 42.6 | gs://pix2seq/coco_det_finetune/resnet_1333x1333
ResNet-50 (C4) | 84.7 | 640x640 | 44.7 | gs://pix2seq/coco_det_finetune/resnetc_640x640
ResNet-50 (C4) | 84.7 | 1024x1024 | 46.9 | gs://pix2seq/coco_det_finetune/resnetc_1024x1024
ResNet-50 (C4) | 84.7 | 1333x1333 | 47.3 | gs://pix2seq/coco_det_finetune/resnetc_1333x1333
ViT-B | 115.2 | 640x640 | 44.2 | gs://pix2seq/coco_det_finetune/vit_b_640x640
ViT-B | 115.2 | 1024x1024 | 46.5 | gs://pix2seq/coco_det_finetune/vit_b_1024x1024
ViT-B | 115.2 | 1333x1333 | 47.1 | gs://pix2seq/coco_det_finetune/vit_b_1333x1333
ViT-L | 341.2 | 640x640 | 47.6 | gs://pix2seq/coco_det_finetune/vit_l_640x640
ViT-L | 341.2 | 1024x1024 | 49.2 | gs://pix2seq/coco_det_finetune/vit_l_1024x1024
ViT-L | 341.2 | 1333x1333 | 50.0 | gs://pix2seq/coco_det_finetune/vit_l_1333x1333
Multitask checkpoints
Jointly fine-tuned on COCO object detection, instance segmentation, captioning, and keypoint detection.
Backbone | Total params (M) | Image size | COCO AP | Google cloud storage location
-------------: | ---------------: | ---------: | --------: | -----------:
ViT-B | 115.2 | 640x640 | 44.2 | gs://pix2seq/multi_task/ckpt/vit_b_640x640
ViT-B | 115.2 | 1024x1024 | 46.5 | gs://pix2seq/multi_task/ckpt/vit_b_1024x1024
Usage
Colabs
See colabs for inference and fine-tuning demos. Give it a try!
Basic setup before running the code
The following setup is required before running the code.
```
git clone https://github.com/google-research/pix2seq.git
pip install -r requirements.txt
```
Download COCO annotations from gs://pix2seq/multi_task/data/coco/json to /tmp/coco_annotations (the directory can be changed in the configs).
```
annotations_dir=/tmp/coco_annotations
mkdir -p $annotations_dir
wget -P $annotations_dir https://storage.googleapis.com/pix2seq/multi_task/data/coco/json/captions_train2017_eval_compatible.json
wget -P $annotations_dir https://storage.googleapis.com/pix2seq/multi_task/data/coco/json/captions_val2017_eval_compatible.json
wget -P $annotations_dir https://storage.googleapis.com/pix2seq/multi_task/data/coco/json/instances_train2017.json
wget -P $annotations_dir https://storage.googleapis.com/pix2seq/multi_task/data/coco/json/instances_val2017.json
wget -P $annotations_dir https://storage.googleapis.com/pix2seq/multi_task/data/coco/json/person_keypoints_train2017.json
wget -P $annotations_dir https://storage.googleapis.com/pix2seq/multi_task/data/coco/json/person_keypoints_val2017.json
```
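Once the downloads finish, the annotations directory should contain the six JSON files:

```
ls $annotations_dir
# captions_train2017_eval_compatible.json   instances_train2017.json
# captions_val2017_eval_compatible.json     instances_val2017.json
# person_keypoints_train2017.json           person_keypoints_val2017.json
```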
(Optional) If accessing the pretrained checkpoints in Cloud slows down or blocks the start of training/eval, you can download them manually with `gsutil cp -r gs://cloud_folder local_folder`, and update `pretrained_ckpt` in the config file accordingly.
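For example, fetching the ViT-B Objects365 checkpoint from the table above (assuming the Cloud SDK's gsutil is installed; the local destination directory is arbitrary):

```
# Copy the ViT-B Objects365 pretrained checkpoint to a local directory,
# then point pretrained_ckpt in the config at it.
mkdir -p /tmp/pix2seq_ckpts
gsutil cp -r gs://pix2seq/obj365_pretrain/vit_b_640x640_b256_s400k /tmp/pix2seq_ckpts/
```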
(Optional) If training fails at startup with an NcclAllReduce error, try a different cross_device_ops for tf.distribute.MirroredStrategy in the build_strategy function in utils.py.
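A minimal sketch of that change (the surrounding build_strategy code is paraphrased here, not the exact upstream source):

```python
# utils.py, build_strategy (sketch): swap NcclAllReduce for another
# cross-device op if NCCL fails on your machine.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce()
)
# tf.distribute.ReductionToOneDevice() is another option to try.
```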
Instructions for training (fine-tuning) object detection models.
Below are the instructions for starting a training job; the provided configuration is set up mainly for fine-tuning the Objects365-pretrained models.
Step 1: check config_det_finetune.py and update it if necessary (e.g., encoder_variant, image_size).
Step 2: run the training command:

```
python3 run.py --mode=train --model_dir=/tmp/model_dir \
  --config=configs/config_det_finetune.py \
  --config.train.batch_size=32 --config.train.epochs=20 \
  --config.optimization.learning_rate=3e-5
```
(Optional) Set up TensorBoard for training curves with `tensorboard --logdir=/tmp/model_dir`. Note: eval on this trial fine-tuning run (ViT-B, 640x640, 20 epochs) should give ~43.5 AP. The exact configurations used to reproduce the COCO fine-tuning results can be found in gs://pix2seq/coco_det_finetune/...
(Optional) Set `--run_eagerly=True` for interactive debugging (it will be slower).
Instructions for evaluation of object detection models.
Below are the instructions for starting an evaluation job, which monitors the specified directory and performs (continuous) evaluation of the latest un-evaluated checkpoints. It can be started in parallel with, or after, the training.
Step 1: check config_det_finetune.py and update it if necessary (e.g., encoder_variant, image_size). Set checkpoint_dir if the checkpoints to evaluate are not in model_dir (e.g., when evaluating our provided fine-tuned checkpoints; see the sketch after these steps).
Step 2: run the evaluation command:

```
python3 run.py --mode=eval --model_dir=/tmp/model_dir \
  --config=configs/config_det_finetune.py \
  --config.dataset.coco_annotations_dir=/path/to/annotations \
  --config.eval.batch_size=40
```
(Optional) Set up TensorBoard for eval curves and detection visualizations with `tensorboard --logdir=/tmp/model_dir`.
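As a concrete sketch, evaluating one of the provided fine-tuned checkpoints (downloaded with gsutil as shown earlier) might look like the following; the `--config.eval.checkpoint_dir` key path is an assumption, so verify the actual field name in config_det_finetune.py:

```
# Hypothetical invocation: the checkpoint_dir override key is assumed,
# not confirmed; check config_det_finetune.py for the exact field.
python3 run.py --mode=eval --model_dir=/tmp/model_dir \
  --config=configs/config_det_finetune.py \
  --config.eval.checkpoint_dir=/tmp/pix2seq_ckpts/vit_b_640x640 \
  --config.eval.batch_size=40
```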
Instructions for evaluation of multi-task models.
In configs/config_multi_task.py, uncomment the line with `checkpoint_dir=get_multi_task_checkpoint_dir(...)`.
To evaluate at image size 1024x1024, update image_size in the config.
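A sketch of such an eval run, in the same override style as the detection commands (the exact image_size key path is an assumption; verify it in configs/config_multi_task.py):

```
# Hypothetical: evaluate the 1024x1024 multi-task checkpoint.
# The --config.model.image_size key is assumed; check the config file.
python3 run.py --mode=eval --model_dir=/tmp/model_dir \
  --config=configs/config_multi_task.py \
  --config.model.image_size=1024
```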
Object detection
