Fréchet inception distance (FID)
What is the Fréchet inception distance (FID)?
Fréchet inception distance (FID) is a metric for quantifying the realism and diversity of images generated by generative adversarial networks (GANs). Realistic means, for example, that generated images of people look like real images of people. Diverse means the generated images vary enough from one another and from the originals to be interesting and novel.
FID is generally used for analyzing images and not text, sounds or other modalities. Other related metrics are being developed for these domains.
FID is used for comparing a set of images generated by a GAN against a set of real images, for gauging the impact of neural network model changes on realism and for weighing the relative merits of multiple GAN models for generating images. It captures both visual quality and diversity in a single metric. A lower score indicates that the generated images are statistically closer to the real images. For example, a model that often produces people with extra fingers or eyes in the wrong place would tend to receive a worse score.
First introduced in 2017, FID is one of the best automated measures for improving GANs for image generation. However, it has several limitations that developers must consider. Furthermore, it might not work as well for other generative AI techniques, such as diffusion models like Stable Diffusion, variational autoencoders or transformers.
Combining Fréchet distance and inception
Fréchet inception distance is a combination of the terms Fréchet distance and Google's inception model.
The Fréchet distance quantifies the similarity of two curves. First introduced in 1906 by Maurice Fréchet, it can be pictured as the minimum length of leash required between a dog and its walker when each follows a separate curved path from start to finish without backtracking. The same calculation is also useful for many other problems in handwriting recognition, robotics, geographic information systems and protein structure analysis. In FID, the same idea is applied to two probability distributions rather than two curves: the features of the real images and the features of the generated images are each summarized as a multivariate Gaussian, and the distance between the two Gaussians is measured.
The Inception-v3 model used in FID descends from the Inception architecture that Google introduced with its GoogLeNet convolutional neural network in 2014. That architecture was first discussed in a research paper titled "Going deeper with convolutions." Its components transform raw imagery into a latent space for representing the mathematical properties of images at multiple scales and in different locations within the image. For example, this could help align images of a cat in a latent space used for analysis, whether the image is zoomed in on the face or paw or whether the cat is located at the top or the bottom of the image.
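The leash picture can be made concrete with a small sketch. The code below computes the discrete Fréchet distance, a standard approximation that only considers the vertices of two polylines rather than every point along the curves. The function name `discrete_frechet` and the sample paths are illustrative assumptions, and this curve-based calculation is shown only to explain the concept; FID itself uses the form of the distance defined for probability distributions.

```python
import numpy as np

def discrete_frechet(p, q):
    """Discrete Fréchet distance between two polylines given as point lists."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    n, m = len(p), len(q)
    # dist[i, j] is the Euclidean distance between vertex i of p and vertex j of q.
    dist = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    # leash[i, j] is the shortest leash that lets the walker reach p[i]
    # and the dog reach q[j] when neither is allowed to move backward.
    leash = np.empty((n, m))
    leash[0, 0] = dist[0, 0]
    for i in range(1, n):
        leash[i, 0] = max(leash[i - 1, 0], dist[i, 0])
    for j in range(1, m):
        leash[0, j] = max(leash[0, j - 1], dist[0, j])
    for i in range(1, n):
        for j in range(1, m):
            best_previous = min(leash[i - 1, j], leash[i, j - 1], leash[i - 1, j - 1])
            leash[i, j] = max(best_previous, dist[i, j])
    return float(leash[-1, -1])

# Hypothetical paths: two parallel straight lines one unit apart.
walker_path = [(0, 0), (1, 0), (2, 0)]
dog_path = [(0, 1), (1, 1), (2, 1)]
print(discrete_frechet(walker_path, dog_path))
```

Because the two sample paths stay exactly one unit apart, a leash of length one is enough, and the function returns that value.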
The original inception models were introduced to help improve the performance of new neural networks on the ImageNet Large-Scale Visual Recognition Challenge in 2014. The various inception models represent both global and local information in smaller neural network layers for training deep neural networks while reducing computational complexity. Google explored variations, including Inception-ResNet, Inception-v2, Inception-v3 and Inception-v4.
These various inception models are sometimes used to extract features in computer vision tasks and detect objects. Despite not being the latest model, Inception-v3 combined with the Fréchet distance remains the standard choice for analyzing GAN imagery.
Fréchet inception distance vs. inception score
Ian Goodfellow and a research team at the University of Montreal first introduced GANs in a 2014 paper. In the early days, the competing adversarial network, known as the discriminator, was the main mechanism for driving improvements in image quality.
In 2016, Goodfellow worked with researchers at OpenAI to improve GAN training using an inception score. This new metric evaluated the diversity and quality of generated images. It calculated the Kullback-Leibler (KL) divergence for assessing the diversity of generated images. The KL divergence measures how much one probability distribution differs from another. In this case, the two distributions are the class probabilities that the inception model predicts for a single generated image and the average of those predictions across all generated images.
However, the inception score has limitations. It does not always agree with human judgment, and it evaluates generated images on their own without comparing them to real images. It could also be adversely affected by different image sizes. Consequently, researchers continued to explore better techniques for assessing GAN image quality.
A research team at the Johannes Kepler University Linz introduced FID in a 2017 paper. The paper explored a better way of training the GAN generator neural network and the discriminator neural network at two different time scales, meaning with separate learning rates. The team reported that FID captures the similarity of generated images to real ones better than the inception score. Since then, FID has remained one of the most popular approaches for assessing GAN image quality.
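As a rough illustration of the inception score calculation, the sketch below assumes the class probabilities have already been produced by an Inception classifier and are available as an array with one row per generated image. The function name `inception_score`, the `eps` smoothing constant and the toy inputs are assumptions made for this example. Published implementations often also split the images into groups and average the results, which is omitted here.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception score from per-image class probabilities of shape (images, classes)."""
    probs = np.asarray(probs, dtype=float)
    # Marginal class distribution: the average prediction over all images.
    marginal = probs.mean(axis=0, keepdims=True)
    # KL divergence between each image's prediction and the marginal.
    kl_per_image = np.sum(probs * (np.log(probs + eps) - np.log(marginal + eps)), axis=1)
    # The score is the exponential of the average KL divergence.
    return float(np.exp(kl_per_image.mean()))

# Toy case 1: every image gets the same uncertain prediction (low quality signal).
uncertain = np.full((8, 4), 0.25)
# Toy case 2: confident predictions spread evenly over four classes.
confident_and_diverse = np.tile(np.eye(4), (2, 1))

print(inception_score(uncertain))
print(inception_score(confident_and_diverse))
```

In the first toy case each prediction matches the marginal, so the divergence is zero and the score sits at its minimum of 1. In the second, predictions are confident and cover all four classes evenly, so the score approaches the number of classes.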
How is the FID measured?
FID is measured by comparing how real and generated images are represented in an intermediate latent space. That space captures low-level features, such as edges and lines, as well as higher-order phenomena, such as the shapes of eyes or paws. FID is calculated using the following steps:
- Preprocess the images. Ensure the two sets of images, real and generated, are compatible using basic processing. This can include resizing to a given dimension size, such as the 299x299 pixels that Inception-v3 expects as input, and then normalizing pixel values.
- Extract feature representations. Pass the real and generated images through the Inception-v3 model. This transforms the raw pixels into numerical vectors to represent aspects of the images, such as lines, edges and higher-order shapes.
- Calculate statistics. Statistical analysis is performed to determine the mean and covariance matrix of the feature vectors for each set of images.
- Compute the Fréchet distance. Measure the difference between the mean and covariance matrix of the real set and those of the generated set.
- Obtain the FID. The resulting Fréchet distance between the real and generated sets is the FID. Lower numbers indicate the two sets of images are more similar.
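The statistics and distance steps above can be sketched in a few lines. For two Gaussians with means mu_r and mu_g and covariances S_r and S_g, the squared Fréchet distance is ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^(1/2)). The code below assumes the feature vectors have already been extracted; the random arrays are small stand-ins for Inception-v3 activations, and the function names are hypothetical. It obtains the trace of the matrix square root from eigenvalues, which is one of several possible ways to compute that term.

```python
import numpy as np

def feature_statistics(features):
    """Mean vector and covariance matrix of a (samples, dims) feature array."""
    features = np.asarray(features, dtype=np.float64)
    return features.mean(axis=0), np.cov(features, rowvar=False)

def _sqrt_psd(matrix):
    """Square root of a symmetric positive semidefinite matrix."""
    vals, vecs = np.linalg.eigh(matrix)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative round-off
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Squared Fréchet distance between two Gaussians."""
    diff = mu1 - mu2
    # Tr((S1 S2)^(1/2)) equals the sum of square roots of the eigenvalues
    # of the symmetric matrix S1^(1/2) S2 S1^(1/2).
    root1 = _sqrt_psd(sigma1)
    eigvals = np.linalg.eigvalsh(root1 @ sigma2 @ root1)
    trace_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)

def fid_from_features(real_features, generated_features):
    """FID given feature arrays for the real and generated image sets."""
    mu_r, sigma_r = feature_statistics(real_features)
    mu_g, sigma_g = feature_statistics(generated_features)
    return frechet_distance(mu_r, sigma_r, mu_g, sigma_g)

# Stand-in features: 500 samples with 8 dimensions each (not real Inception output).
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(500, 8))
close = rng.normal(0.1, 1.0, size=(500, 8))  # similar distribution
far = rng.normal(2.0, 1.0, size=(500, 8))    # shifted distribution

print(fid_from_features(real, close))
print(fid_from_features(real, far))
```

The shifted set yields a much larger value than the similar set, which matches the rule that lower numbers mean the generated images are closer to the real ones. Note that the calculation always operates on whole sets of images, because a mean and covariance cannot be meaningfully estimated from a single image.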
What is FID used for?
The primary use of FID is to evaluate the quality of images generated by GAN models. It provides a simple metric for assessing sets of generated images or tuning the models used to generate them. Uses of FID include the following:
- GAN evaluation. FID provides a metric for assessing how well a particular GAN model is performing in terms of generating realistic and diverse images. This can help compare different models or track the performance of a single model over the course of training.
- Model selection. FID can help compare the performance of GAN model variations or architectures.
- Tuning hyperparameters. FID can assess the impact of changing hyperparameters on GAN model performance to guide adjustments toward better configurations.
- Novelty detection. FID can help identify batches of images that differ sharply from a reference set, which could indicate novel examples.
- Research. FID provides a simple way of comparing the merits of different GAN models for researchers.
What are the limitations of FID?
FID is widely used in evaluating the quality of images generated by GANs. However, it is not used for other types of media, such as music or text, and it is less established for other kinds of neural network architectures. In addition, several other limitations should be taken into consideration:
- Use of pre-trained models. FID uses a pre-trained Inception-v3 model as part of the process, which could introduce bias based on how the model was trained. This could be an issue if the training data differs substantially from the domain of the generated images. For example, an inception model pre-trained on cats and other animals may not work as well on buildings.
- Insensitivity. FID may miss some aspects of image quality, such as fine-grained details or textures. As a result, certain kinds of image imperfections might not be caught by FID scores.
- Requirement for consistent preprocessing. All the images -- including training data, real images, and generated images -- need to be scaled, cropped and normalized consistently. Differences in preprocessing can affect FID scores.
- Subjectivity. FID scores do not necessarily capture all aspects of human perception and preferences. It's important to include human evaluators as part of the process to refine GAN models.
- Overfitting. Exclusive focus on FID could lead to models that achieve good (low) scores but do not produce realistic-looking images. Consequently, developers need to perform human analysis to weed out problematic models.
The future of FID
FID continues to be a popular metric for assessing the performance of GAN image-generating models. At the same time, other kinds of generative models, such as Stable Diffusion and transformers, are growing in popularity for image generation. These different techniques will require new types of metrics for assessing image quality.
Researchers will also likely refine and improve FID metrics to make them more robust. For example, better pre-trained Inception-v3 models or new inception models could help to overcome bias, or models could be fine-tuned on data sets from different domains.
Other evaluation metrics may also emerge that could separately represent qualities, such as diversity or realism, that are rolled into a single number in FID.
In the meantime, FID provides a relatively simple way to capture the quality of GAN models in a single metric. Like other metrics, it could guide research until better metrics come along.