Commit 50f29be: Initial commit.
74 files changed, +78688 -0 lines
README.md

# ImageBART
#### [NeurIPS 2021](https://nips.cc/)

![teaser](assets/modelfigure.png)
<br/>
[Patrick Esser](https://github.com/pesser)\*,
[Robin Rombach](https://github.com/rromb)\*,
[Andreas Blattmann](https://github.com/ablattmann)\*,
[Björn Ommer](https://ommer-lab.com/)<br/>
\* equal contribution

[arXiv](https://arxiv.org/abs/2108.08827) | [BibTeX](#bibtex) | [Poster](assets/imagebart_poster.pdf)

## Requirements
A suitable [conda](https://conda.io/) environment named `imagebart` can be created
and activated with:

```
conda env create -f environment.yaml
conda activate imagebart
```

## Get the Models

We provide pretrained weights and hyperparameters for models trained on the following datasets:

* FFHQ:
  * [4 scales, geometric noise schedule](https://ommer-lab.com/files/ffhq_4_scales_geometric.zip): `wget -c https://ommer-lab.com/files/ffhq_4_scales_geometric.zip`
  * [2 scales, custom noise schedule](https://ommer-lab.com/files/ffhq_2_scales_custom.zip): `wget -c https://ommer-lab.com/files/ffhq_2_scales_custom.zip`
* LSUN, 3 scales, custom noise schedules:
  * [Churches](https://ommer-lab.com/files/churches_3_scales.zip): `wget -c https://ommer-lab.com/files/churches_3_scales.zip`
  * [Bedrooms](https://ommer-lab.com/files/bedrooms_3_scales.zip): `wget -c https://ommer-lab.com/files/bedrooms_3_scales.zip`
  * [Cats](https://ommer-lab.com/files/cats_3_scales.zip): `wget -c https://ommer-lab.com/files/cats_3_scales.zip`
* Class-conditional ImageNet:
  * [5 scales, custom noise schedule](https://ommer-lab.com/files/cin_5_scales_custom.zip): `wget -c https://ommer-lab.com/files/cin_5_scales_custom.zip`
  * [4 scales, geometric noise schedule](https://ommer-lab.com/files/cin_4_scales_geometric.zip): `wget -c https://ommer-lab.com/files/cin_4_scales_geometric.zip`

Download the respective files and extract their contents to a directory `./models/`.

Moreover, we provide all the required VQGANs as a .zip at [https://ommer-lab.com/files/vqgan.zip](https://ommer-lab.com/files/vqgan.zip),
whose contents have to be extracted to `./vqgan/`.
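To fetch everything in one go, the downloads above can be scripted. A minimal sketch, using only the URLs and target directories listed in this section (pick the subset of models you actually need):

```shell
mkdir -p models vqgan

# pretrained ImageBART weights (URLs as listed above; take only what you need)
for name in ffhq_4_scales_geometric ffhq_2_scales_custom \
            churches_3_scales bedrooms_3_scales cats_3_scales \
            cin_5_scales_custom cin_4_scales_geometric; do
  wget -c "https://ommer-lab.com/files/${name}.zip"
  unzip -o "${name}.zip" -d models/
done

# first-stage VQGANs
wget -c https://ommer-lab.com/files/vqgan.zip
unzip -o vqgan.zip -d vqgan/
```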
## Get the Data
Running the training configs or the [inpainting script](scripts/inpaint_imagebart.py) requires
a dataset available locally. For ImageNet and FFHQ, follow the data preparation steps in this repo's parent project, [taming-transformers](https://github.com/CompVis/taming-transformers).
The LSUN datasets can be conveniently downloaded via the script available [here](https://github.com/fyu/lsun).
We performed a custom split into training and validation images, and provide the corresponding filenames
at [https://ommer-lab.com/files/lsun.zip](https://ommer-lab.com/files/lsun.zip).
After downloading, extract the files to `./data/lsun`. The beds/cats/churches subsets should
also be placed/symlinked at `./data/lsun/bedrooms`, `./data/lsun/cats`, and `./data/lsun/churches`, respectively.
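A minimal sketch of the resulting layout, assuming the LSUN subsets have already been downloaded with the script linked above (the `/path/to/...` locations are placeholders for wherever you stored them):

```shell
mkdir -p data/lsun

# our custom train/validation split filenames
wget -c https://ommer-lab.com/files/lsun.zip
unzip -o lsun.zip -d data/lsun/

# symlink the downloaded subsets to the expected locations (paths are placeholders)
ln -s /path/to/lsun/bedrooms data/lsun/bedrooms
ln -s /path/to/lsun/cats     data/lsun/cats
ln -s /path/to/lsun/churches data/lsun/churches
```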
## Inference

### Unconditional Sampling
We provide a script for sampling from the unconditional models trained on the LSUN-{churches,bedrooms,cats} and FFHQ datasets.

#### FFHQ

On the FFHQ dataset, we provide two distinct pretrained models, one with a chain of length 4 and a geometric noise schedule as proposed by Sohl-Dickstein et al. [[1]](#references), and another one with a chain of length 2 and a custom schedule.
These models can be started with
```shell
CUDA_VISIBLE_DEVICES=<gpu_id> streamlit run scripts/sample_imagebart.py configs/sampling/ffhq/<config>
```
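For example, to draw samples from the 4-scale geometric model on GPU 0 (this is the same config file that the image-editing example below uses):

```shell
CUDA_VISIBLE_DEVICES=0 streamlit run scripts/sample_imagebart.py configs/sampling/ffhq/ffhq_4scales_geometric.yaml
```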
#### LSUN
For the models trained on the LSUN datasets, use
```shell
CUDA_VISIBLE_DEVICES=<gpu_id> streamlit run scripts/sample_imagebart.py configs/sampling/lsun/<config>
```
### Class Conditional Sampling on ImageNet

To sample from class-conditional ImageNet models, use
```shell
CUDA_VISIBLE_DEVICES=<gpu_id> streamlit run scripts/sample_imagebart.py configs/sampling/imagenet/<config>
```
### Image Editing with Unconditional Models

We also provide a script for image editing with our unconditional models. For our FFHQ model with a geometric schedule, this can be started with
```shell
CUDA_VISIBLE_DEVICES=<gpu_id> streamlit run scripts/inpaint_imagebart.py configs/sampling/ffhq/ffhq_4scales_geometric.yaml
```
resulting in samples similar to the following.
![teaser](assets/image_editing.png)

## Training
In general, there are two options for training the autoregressive transition probabilities of the
reverse Markov chain: (i) train them jointly, taking into account a weighting of the
individual scale contributions, or (ii) train them independently, which means that each
training process optimizes a single transition and the scales must be stacked after training.
We conduct most of our experiments using the latter option, but provide configurations for both cases.
### Training Scales Independently
For training scales independently, each transition requires a separate optimization process, which can be
started via

```
CUDA_VISIBLE_DEVICES=<gpu_id> python main.py --base configs/<data>/<config>.yaml -t --gpus 0,
```

We provide training configs for a four-scale training of FFHQ using a geometric schedule,
a four-scale geometric training on ImageNet and various three-scale experiments on LSUN.
See also the overview of our [pretrained models](#get-the-models).
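Because every transition is its own optimization process, training an n-scale model means launching the command above once per scale config and stacking the resulting checkpoints afterwards. A hypothetical sketch for a 2-scale FFHQ run (the per-scale config names here are placeholders; substitute the actual files under `configs/ffhq/`):

```shell
# config names below are placeholders; one training run per transition
for cfg in ffhq_scale1 ffhq_scale2; do
  CUDA_VISIBLE_DEVICES=0 python main.py --base "configs/ffhq/${cfg}.yaml" -t --gpus 0,
done
```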
### Training Scales Jointly

For completeness, we also provide a config to run a joint training with 4 scales on FFHQ.
Training can be started by running

```
CUDA_VISIBLE_DEVICES=<gpu_id> python main.py --base configs/ffhq/ffhq_4_scales_joint-training.yaml -t --gpus 0,
```

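If `main.py` follows the taming-transformers convention it also accepts a `--resume` flag, so an interrupted run could be continued from its log directory. This is an assumption about the CLI rather than a documented feature, and the log path below is a placeholder:

```shell
# assumes main.py inherits taming-transformers' --resume flag; logdir is a placeholder
CUDA_VISIBLE_DEVICES=<gpu_id> python main.py --resume logs/<your_run>/ -t --gpus 0,
```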
## Shout-Outs
Many thanks to all who make their work and implementations publicly available.
For this work, these were in particular:

- The extremely clear and extensible encoder-decoder transformer implementations by [lucidrains](https://github.com/lucidrains):
  https://github.com/lucidrains/x-transformers
- Emiel Hoogeboom et al.'s paper on multinomial diffusion and argmax flows: https://arxiv.org/abs/2102.05379

![teaser](assets/foxchain.png)

## References

[1] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N. & Ganguli, S. (2015). Deep Unsupervised Learning using Nonequilibrium Thermodynamics. *Proceedings of the 32nd International Conference on Machine Learning*, PMLR 37:2256-2265.
## BibTeX

```
@article{DBLP:journals/corr/abs-2108-08827,
  author  = {Patrick Esser and
             Robin Rombach and
             Andreas Blattmann and
             Bj{\"{o}}rn Ommer},
  title   = {ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive
             Image Synthesis},
  journal = {CoRR},
  volume  = {abs/2108.08827},
  year    = {2021}
}
```

Binary assets added in this commit:

- assets/foxchain.png (848 KB)
- assets/image_editing.png (1.43 MB)
- assets/imagebart_poster.pdf (1.76 MB)
- assets/modelfigure.png (672 KB)
- assets/sample-001.jpg (17.6 KB)
Config file (filename not shown in this view):

```yaml
model:
  base_learning_rate: 0.0625
  target: imagebart.models.diffusion.DenoisingXTransformer
  params:
    first_stage_key: image
    monitor: val/loss
    n_scales: 2
    single_scale: 1
    top_k: 548
    alpha: 0.0
    redraw_prob: ffhq_bernoulli_PSIM
    use_ema: true

    scheduler_config:
      target: imagebart.lr_scheduler.LambdaWarmUpCosineScheduler
      params:
        verbosity_interval: 0
        warm_up_steps: 10000
        max_decay_steps: 1500001
        lr_start: 2.5e-06
        lr_max: 0.0001
        lr_min: 1.0e-08
    transformer_config:
      target: imagebart.modules.xtransformers.x_transformer.XTransformer
      params:
        wrap_decoder: false
        dim: 1152
        enc_num_tokens: 548
        enc_depth: 32
        enc_heads: 16
        enc_max_seq_len: 257
        dec_num_tokens: 548
        dec_depth: 6
        dec_heads: 16
        tie_token_emb: false
        dec_max_seq_len: 256
    first_stage_config:
      target: imagebart.models.vqgan.VQGANWrapper
      params:
        ckpt_path: vqgan/vqgan-ffhq.ckpt
        remap: data/vqgan_indices/ffhq_indices.npy
        sane_index_shape: true
        embed_dim: 256
        n_embed: 1024
        ddconfig:
          double_z: false
          z_channels: 256
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 1
          - 2
          - 2
          - 4
          num_res_blocks: 2
          attn_resolutions:
          - 16
          dropout: 0.0
        lossconfig:
          target: taming.modules.losses.vqperceptual.DummyLoss

data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 18
    num_workers: 32
    wrap: false
    train:
      target: taming.data.faceshq.FFHQTrain
      params:
        size: 256
    validation:
      target: taming.data.faceshq.FFHQValidation
      params:
        size: 256
```
Config file (filename not shown in this view):

```yaml
model:
  base_learning_rate: 0.0625
  target: imagebart.models.diffusion.DecoderOnlyDenoiser
  params:
    first_stage_key: image
    monitor: val/loss
    n_scales: 2
    single_scale: 2
    top_k: 548
    alpha: 1.0
    redraw_prob: ffhq_bernoulli_PSIM
    use_ema: true
    scheduler_config:
      target: imagebart.lr_scheduler.LambdaWarmUpCosineScheduler
      params:
        verbosity_interval: 0
        warm_up_steps: 10000
        max_decay_steps: 1500001
        lr_start: 2.5e-06
        lr_max: 0.0001
        lr_min: 1.0e-08
    transformer_config:
      target: imagebart.modules.transformer.mingpt.GPT
      params:
        vocab_size: 548
        block_size: 256
        n_layer: 36
        n_head: 16
        n_embd: 1216
    first_stage_config:
      target: imagebart.models.vqgan.VQGANWrapper
      params:
        ckpt_path: vqgan/vqgan-ffhq.ckpt
        remap: data/vqgan_indices/ffhq_indices.npy
        sane_index_shape: true
        embed_dim: 256
        n_embed: 1024
        ddconfig:
          double_z: false
          z_channels: 256
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult:
          - 1
          - 1
          - 2
          - 2
          - 4
          num_res_blocks: 2
          attn_resolutions:
          - 16
          dropout: 0.0
        lossconfig:
          target: taming.modules.losses.vqperceptual.DummyLoss

data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 18
    wrap: false
    train:
      target: taming.data.faceshq.FFHQTrain
      params:
        size: 256
    validation:
      target: taming.data.faceshq.FFHQValidation
      params:
        size: 256
```
Config file (filename not shown in this view):

```yaml
model:
  base_learning_rate: 0.0625
  target: imagebart.models.diffusion.DenoisingXTransformer
  params:
    first_stage_key: "image"
    monitor: "val/loss"
    n_scales: 4
    single_scale: 1
    top_k: 548
    alpha: 0.0
    redraw_prob: geometric
    use_ema: True

    scheduler_config:
      target: imagebart.lr_scheduler.LambdaWarmUpCosineScheduler
      params:
        verbosity_interval: 0   # 0 or negative to disable
        warm_up_steps: 10000
        max_decay_steps: 1500001
        lr_start: 2.5e-6
        lr_max: 1.0e-4
        lr_min: 1.0e-8

    transformer_config:
      target: imagebart.modules.xtransformers.x_transformer.ResidualScaledXTransformer
      params:
        scale_pos: 0
        n_scales: 4
        xt_start: 1
        xt_size: 256   # predict x_{t-1}
        wrap_decoder: False
        dim: 752
        enc_num_tokens: 548
        enc_depth: 18
        enc_heads: 16
        enc_max_seq_len: 257
        dec_num_tokens: 548
        dec_depth: 6
        dec_heads: 16
        tie_token_emb: False
        dec_max_seq_len: 256

    first_stage_config:
      target: imagebart.models.vqgan.VQGANWrapper
      params:
        ckpt_path: vqgan/vqgan-ffhq.ckpt
        remap: "data/vqgan_indices/ffhq_indices.npy"
        sane_index_shape: True
        embed_dim: 256
        n_embed: 1024
        ddconfig:
          double_z: false
          z_channels: 256
          resolution: 256
          in_channels: 3
          out_ch: 3
          ch: 128
          ch_mult: [ 1,1,2,2,4 ]  # num_down = len(ch_mult)-1
          num_res_blocks: 2
          attn_resolutions: [ 16 ]
          dropout: 0.0
        lossconfig:
          target: taming.modules.losses.vqperceptual.DummyLoss

data:
  target: main.DataModuleFromConfig
  params:
    batch_size: 16
    num_workers: 32
    wrap: False
    train:
      target: taming.data.faceshq.FFHQTrain
      params:
        size: 256
    validation:
      target: taming.data.faceshq.FFHQValidation
      params:
        size: 256

lightning:
  callbacks:
    image_logger:
      target: main.ImageLogger
      params:
        batch_frequency: 1000
        max_images: 4
        increase_log_steps: False
  trainer:
    benchmark: True
```