AIGC model

Modeling techniques for AIGC

Taking face generation as an example, we introduce representative generative models and highlight their differences. The generation process aims to estimate the probability distribution of the face data \(p(x)\). By introducing a latent variable \(z\) with prior \(p(z)\) and a conditional model \(p(x|z)\), we can express \(p(x)\) by marginalizing over \(z\) (the law of total probability):

\[p(x) = \int p(x|z)p(z) dz\]
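The integral above is rarely tractable for real face data, but it can be illustrated on a toy model. The sketch below is a minimal example under stated assumptions, not a method from the source: it assumes \(p(z) = \mathcal{N}(0, 1)\) and \(p(x|z) = \mathcal{N}(z, 1)\), for which the marginal is known in closed form as \(\mathcal{N}(0, 2)\). The function names `marginal_mc` and `marginal_exact` are hypothetical.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian N(mean, var) evaluated at x."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def marginal_mc(x, num_samples=100_000, seed=0):
    """Monte Carlo estimate of p(x) = E_{z ~ p(z)}[p(x|z)].

    Toy assumption: p(z) = N(0, 1) and p(x|z) = N(z, 1).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)         # sample the latent variable from the prior
        total += normal_pdf(x, z, 1.0)  # evaluate the likelihood p(x|z)
    return total / num_samples


def marginal_exact(x):
    """Closed-form marginal for this toy model: p(x) = N(0, 2)."""
    return normal_pdf(x, 0.0, 2.0)
```

Averaging \(p(x|z)\) over samples of \(z\) drawn from the prior approximates the integral; for this toy model the estimate can be compared against `marginal_exact`.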

Autoencoder (AE) models \(p(x)\) where \(x\) is a face. We cannot control the generated face.

Variational Autoencoder (VAE) models \(p(x|z)\) where \(z\) is a latent continuous variable (e.g., expression).
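As a minimal sketch of how a continuous latent is typically handled, the code below assumes a common VAE setup that the text does not spell out: a diagonal Gaussian posterior parameterized by `mu` and `log_var`, and a standard normal prior. The helper names are hypothetical; the encoder and decoder networks are omitted.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), elementwise.

    sigma is recovered from the log-variance as exp(0.5 * log_var).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))
```

Under these assumptions, the KL term is zero exactly when the posterior matches the prior, and it grows as `mu` moves away from zero or the variance departs from one.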

Vector Quantized Variational Autoencoder (VQ-VAE) models \(p(x|z)\) where \(z\) is a discrete latent variable (e.g., gender).
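The discrete latent can be pictured as an index into a codebook of learned vectors. The sketch below shows only the nearest-neighbor lookup step, assuming squared Euclidean distance; the codebook values in the usage note are made up for illustration, and `quantize` is a hypothetical helper name.

```python
def quantize(vector, codebook):
    """Return (index, code) for the codebook entry nearest to `vector`.

    Distance is squared Euclidean; ties resolve to the lowest index.
    """
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    index = min(range(len(codebook)), key=lambda k: sq_dist(vector, codebook[k]))
    return index, codebook[index]
```

For example, with the illustrative codebook `[[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]`, the continuous vector `[0.9, 1.2]` maps to index 1, so the latent \(z\) is the integer 1 rather than a real-valued vector.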

Autoregressive models factorize the joint distribution \(p(x_1, x_2, ..., x_T)\) into a product of conditionals \(\prod_{t=1}^{T} p(x_t | x_1, ..., x_{t-1})\), where \(x_1, x_2, ..., x_T\) are the pixels of the face.
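The chain rule behind autoregressive modeling, \(\log p(x_1, ..., x_T) = \sum_t \log p(x_t | x_1, ..., x_{t-1})\), can be sketched on binary pixels. The conditional used here (a logistic function of the number of ones seen so far, with made-up `bias` and `weight`) is purely hypothetical and stands in for a learned network.

```python
import itertools
import math


def conditional_prob_one(prefix, bias=-0.5, weight=1.5):
    """Hypothetical conditional p(x_t = 1 | x_<t): logistic in the count of ones so far."""
    logit = bias + weight * sum(prefix)
    return 1.0 / (1.0 + math.exp(-logit))


def log_joint(pixels):
    """Chain rule: log p(x_1..x_T) = sum over t of log p(x_t | x_<t)."""
    total = 0.0
    for t, x_t in enumerate(pixels):
        p_one = conditional_prob_one(pixels[:t])
        total += math.log(p_one if x_t == 1 else 1.0 - p_one)
    return total


def total_probability(num_pixels):
    """Sum the joint over every binary image of the given length (should be 1)."""
    return sum(
        math.exp(log_joint(seq))
        for seq in itertools.product([0, 1], repeat=num_pixels)
    )
```

Because every conditional is a valid distribution, the product of conditionals is automatically a normalized joint, which `total_probability` checks by brute force for small `num_pixels`.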

Generative Adversarial Networks (GANs) employ a discriminator \(D(x)\) to distinguish the real data \(x\) from the generated data \(G(z)\), while the generator \(G\) tries to fool the discriminator.
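This adversarial game is usually written as a value function \(V(D, G) = \mathbb{E}_x[\log D(x)] + \mathbb{E}_z[\log(1 - D(G(z)))]\), which the discriminator maximizes and the generator minimizes. The sketch below only estimates that value from batches of discriminator outputs; the networks themselves are omitted, and `gan_value` is a hypothetical helper name.

```python
import math


def gan_value(d_real, d_fake):
    """Batch estimate of V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term
```

When the discriminator cannot tell real from generated data and outputs 0.5 everywhere, the value equals \(-\log 4\); a discriminator that separates the two sets well pushes the value toward 0.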

Diffusion Models WIP

Neural Radiance Fields (NeRF) WIP

3D Gaussian Splatting (3DGS) WIP

Generative Video Models WIP

Foundation model development

Developing a large-scale foundation model is challenging because of the high computational cost and the need for large-scale data. In this section, we introduce the key components of foundation model development and show how NeMo can accelerate the process.

NeMo is a scalable, high-performance generative AI framework developed by NVIDIA that provides a set of tools for building large-scale foundation models (LLM, MLLM, and TTS).

https://docs.nvidia.com/nemo-framework/user-guide/latest/_images/nemo-llm-mm-stack.png — NeMo Overview

As shown in the figure, the lifecycle of foundation model development includes the following steps:

data curation: extract or synthesize high-quality data
training and customization: supervised fine-tuning and parameter-efficient fine-tuning
alignment: align the model with human values (DPO, SteerLM, RLHF)
deployment and inference: TensorRT-LLM/vLLM on NVIDIA Triton Inference Server
multimodal model development: multimodal LLMs, vision-language models, text-to-image, and NeRF
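To make the alignment step more concrete, here is a minimal sketch of the DPO loss for a single preference pair, written in plain Python rather than with any NeMo API. It assumes the summed log-probabilities of the chosen and rejected responses under the policy and a frozen reference model are already available; the function name, argument names, and the default `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair: -log sigmoid(beta * margin).

    margin = (policy - reference) log-prob on the chosen response
           - (policy - reference) log-prob on the rejected response.
    """
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    logit = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(logit)) == softplus(-logit), computed in a numerically stable form.
    return max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))
```

If the policy equals the reference model, the loss is \(\log 2\); it falls below that value as the policy raises the chosen response's likelihood relative to the rejected one.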

Tutorial notebooks are listed here.

References

Kaiming He. “6.S978 Deep Generative Models (MIT EECS, 2024 Fall).”