How Stable Diffusion Works | AI Art | GotFunnyPictures

1257 Views • Created about a year ago By Revic • Updated about a year ago


How Stable Diffusion Works

* Stable Diffusion is a denoising algorithm, which means that it tries to remove 'noise' from images.
* The algorithm is calibrated by showing it partial images covered in artificial noise, and seeing how well it guesses what the noise-to-remove is.
* The algorithm never saves the training images. The file size stays the same whether it trains on 1 image or 1 million images (sd_v1-4.ckpt and sd_v1-5.ckpt are both 4,165,411 KB).
* The calibrations are given a tiny nudge depending on how wrong each guess is (a common calibration nudge size is 0.000005), not enough to make a difference from any one image. Eventually a general solution to image denoising emerges.
* Words are mapped to unique weights which are added to the denoising algorithm, and thus the calibration needs to work in balance with the impact of words; e.g. 'photo' or 'cartoon' would weight some denoising choices differently. For example: 'city': [0.0037, 0.0141, 0.0066, 0.0024, -0.0440, -0.0111, -0.0125, 0.0086, ...] and 'building': [0.0051, 0.0122, -0.0262, 0.0161, -0.0058, -0.0008, 0.0072, 0.0058, ...].
* The algorithm doesn't use full-resolution images. A highly compressed internal format is used, where 512x512x3 (for RGB color) becomes 64x64x4. An encoder/decoder model downscales and upscales at each end of the denoiser.
* txt2img generates initial random 'noise,' whereas img2img and training blend an input image with random noise by a given strength factor.
* The denoiser is a 'U-Net' model, which shrinks the image to even smaller resolutions (64x64 down to 32x32 and below) to consider details of different scales, then considers the fine details as it enlarges the image again. Within the U-Net, 'cross-attention' layers are used to associate different parts of the image with different words in the prompt.
* The input text (e.g. "City, street, buildings, photo") is encoded using a third model, the CLIP Text Encoder.
* New faces and art styles which the model never trained on can still be drawn by finding the weights for an imaginary word which would sit somewhere between other words, using textual inversion and running the model in reverse from example output images.
* Eventually the denoiser gets so good that it can resolve an entirely new image from just noise, by running it several times in a row to keep improving the image (the infographic shows results at steps 1, 5, 10, and 20).

(The infographic's diagram shows the pipeline: prompt text goes into the CLIP Text Encoder; a 512x512 image goes through the Encoder to a 64x64 latent, through the U-Net denoiser, and back out through the Decoder to 512x512.)
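The calibration loop described above (cover an image in artificial noise, guess the noise, nudge the weights slightly based on the error) can be sketched in a few lines. This is a minimal illustration, not Stable Diffusion's real training code: the single-matrix `denoiser` is a toy stand-in for the U-Net, and only the learning rate is taken from the infographic's "nudge size."

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the denoiser: one linear layer. The real model is a
# U-Net with hundreds of millions of weights, but the update rule has
# the same shape: guess the noise, then nudge weights by the error.
weights = rng.normal(0, 0.01, size=(16, 16))

learning_rate = 0.000005  # the "common calibration nudge size" from the text

def training_step(clean_latent):
    global weights
    noise = rng.normal(size=clean_latent.shape)   # artificial noise
    noisy = clean_latent + noise                  # cover the image in it
    guess = noisy @ weights                       # model's guess at the noise
    error = guess - noise                         # how wrong the guess is
    # Gradient of the squared error w.r.t. the weights, then a tiny nudge.
    grad = noisy.T @ error / clean_latent.shape[0]
    weights -= learning_rate * grad
    return float((error ** 2).mean())

latent = rng.normal(size=(16, 16))  # a pretend 16x16 latent "image"
losses = [training_step(latent) for _ in range(100)]
```

Note that with a nudge this small, 100 steps on one image barely move the weights at all, which is exactly the point the infographic makes: no single training image leaves a meaningful imprint.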
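The word-to-weights mapping, and the "imaginary word" idea behind textual inversion, can be shown as a simple lookup table. The vectors below reuse the truncated examples from the infographic; real CLIP token embeddings are 768-dimensional, and real textual inversion finds the new vector by gradient descent against example images, not by the plain interpolation used here to show that the slot exists.

```python
import numpy as np

# Each known word maps to a fixed vector of weights (truncated examples
# from the infographic; real CLIP embeddings are 768-dimensional).
embeddings = {
    "city":     np.array([0.0037, 0.0141, 0.0066, 0.0024,
                          -0.0440, -0.0111, -0.0125, 0.0086]),
    "building": np.array([0.0051, 0.0122, -0.0262, 0.0161,
                          -0.0058, -0.0008, 0.0072, 0.0058]),
}

def encode_prompt(prompt):
    # Look up each word's weights; these are what the cross-attention
    # layers in the U-Net attend to while denoising.
    return [embeddings[word] for word in prompt.split() if word in embeddings]

# An "imaginary word" is just a new vector placed between existing ones.
# (Hypothetical placeholder name; interpolation stands in for the real
# gradient-descent search of textual inversion.)
embeddings["<my-style>"] = 0.5 * (embeddings["city"] + embeddings["building"])

vectors = encode_prompt("city building <my-style>")
```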
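The compression step can be made concrete with array shapes. Only the bookkeeping below matches Stable Diffusion (512x512x3 becoming 64x64x4, an 8x spatial downscale); the `encode`/`decode` functions are hypothetical stand-ins for the learned VAE encoder and decoder.

```python
import numpy as np

def encode(image):
    # Hypothetical stand-in for the VAE encoder: average-pool each 8x8
    # patch, then pad the 3 color channels out to 4 latent channels.
    h, w, _ = image.shape
    pooled = image.reshape(h // 8, 8, w // 8, 8, 3).mean(axis=(1, 3))
    extra = pooled.mean(axis=-1, keepdims=True)
    return np.concatenate([pooled, extra], axis=-1)

def decode(latent):
    # Hypothetical inverse: expand each latent pixel back to an 8x8 patch.
    return latent[:, :, :3].repeat(8, axis=0).repeat(8, axis=1)

image = np.zeros((512, 512, 3))   # 512x512 RGB: 786,432 numbers
latent = encode(image)            # 64x64x4 latent: 16,384 numbers, 48x smaller
restored = decode(latent)

print(image.shape, latent.shape, restored.shape)
# (512, 512, 3) (64, 64, 4) (512, 512, 3)
```

The denoiser only ever sees the small 64x64x4 latent, which is a large part of why generation is feasible on consumer GPUs.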
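The difference between the txt2img and img2img starting points, and the "run it several times in a row" sampling loop, can be sketched together. The blend formula follows the strength-factor description in the text; the `denoise_step` function is a hypothetical stand-in for the U-Net, and real samplers (DDIM, Euler, and others) use carefully derived noise schedules rather than this simple pull-toward-target rule.

```python
import numpy as np

rng = np.random.default_rng(0)

def denoise_step(latent, step, total_steps):
    # Hypothetical denoiser: pull the latent a fraction of the way toward
    # a fixed target. The real U-Net instead predicts the noise to remove,
    # conditioned on the prompt's word weights via cross-attention.
    target = np.full_like(latent, 0.5)
    return latent + (target - latent) / (total_steps - step)

def sample(initial_latent, steps=20):
    # Repeatedly improve the latent, as in the infographic's step 1/5/10/20
    # progression, then (in the real pipeline) decode it to an image.
    latent = initial_latent
    for step in range(steps):
        latent = denoise_step(latent, step, steps)
    return latent

# txt2img: start from pure random noise.
txt2img_start = rng.normal(size=(64, 64, 4))

# img2img: blend an input latent with noise by a given strength factor.
strength = 0.6  # 0.0 keeps the input image, 1.0 is pure noise
input_latent = np.zeros((64, 64, 4))
noise = rng.normal(size=(64, 64, 4))
img2img_start = (1 - strength) * input_latent + strength * noise

result = sample(txt2img_start)
```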
Origin Entry:

AI Art

Source

Reddit


Notes

A short, surface-level explanation of how Stable Diffusion generates images, generally accepted as accurate.

Textile Embed
!https://i.kym-cdn.com/photos/images/newsfeed/002/494/743/535.png!
