Exploring Self-Supervised Learning for Computer Vision
Chapter 1: Introduction to Self-Supervised Learning
In the realm of artificial intelligence (AI), a significant portion of its current capabilities derives from supervised models trained on vast datasets. Many of these datasets are annotated by humans, a process that can be tedious, time-intensive, prone to mistakes, and costly. Self-supervised learning (SSL) presents a transformative approach, enabling machines to learn from data that lacks labels. This article delves into the mechanics of SSL and its application within the field of computer vision. We will contrast basic methods with cutting-edge techniques, highlighting SSL's potential in medical diagnostics—a sector that stands to benefit immensely while requiring a thorough understanding of the methodology for effective implementation.
What is Self-Supervised Learning?
Yann LeCun, Chief AI Scientist at Meta, describes self-supervised learning as "one of the most promising ways to build background knowledge and approximate a form of common sense in AI systems." The essence of SSL lies in training models on data without the need for explicit annotations.
In comparison to two conventional learning paradigms—supervised and unsupervised learning—self-supervised learning occupies a unique middle ground. Supervised learning involves feeding the model input data along with corresponding labels, allowing it to discern patterns that can generalize to unseen data. Conversely, unsupervised learning deals solely with inputs, aiming to identify inherent patterns, cluster similar data points, or detect anomalies.
SSL shares characteristics with unsupervised learning as it operates on unlabeled data. However, it also incorporates a supervised element, as the model generates its own pseudo-labels during the training process. This concept is not entirely novel; SSL has long been employed in natural language processing (NLP) to train large language models such as BERT and GPT, where the model predicts masked-out or upcoming words in a sequence from their surrounding context.
In recent years, self-supervised learning has gained traction in computer vision, driven by significant breakthroughs from companies like Google, DeepMind, and Meta. The core principle remains unchanged: models autonomously create pseudo-labels, whether by masking portions of an image or predicting the angle of rotation after transforming the image.
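As a concrete illustration of this principle, here is a minimal NumPy sketch of the rotation pretext task: the pseudo-label is simply the number of quarter-turns applied to the image, so no human annotation is ever needed (the function name and toy image are illustrative, not from any particular library):

```python
import numpy as np

def make_rotation_task(image, rng):
    """Build one pseudo-labeled example for the rotation pretext task:
    rotate the image by a random multiple of 90 degrees and use the
    rotation index as the label. No human annotation is involved."""
    k = int(rng.integers(0, 4))      # 0, 1, 2, or 3 quarter-turns
    rotated = np.rot90(image, k)
    return rotated, k                # (model input, pseudo-label)

# Toy 4x4 "image"; the label is derived purely from the data itself.
rng = np.random.default_rng(0)
image = np.arange(16).reshape(4, 4)
x, y = make_rotation_task(image, rng)
```

A classifier trained on millions of such (rotated image, rotation index) pairs must learn what "upright" looks like, which forces it to pick up on object structure.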
Exploring Medical Applications of Self-Supervised Learning
The medical field generates an overwhelming volume of imaging data. IBM estimates that as much as 90% of all medical data is image-based, with the World Health Organization reporting approximately 3.6 billion X-ray examinations conducted annually. This massive dataset presents an excellent opportunity for machine learning algorithms to assist in diagnostic and treatment processes. However, a notable hurdle exists.
To learn effectively, traditional supervised models require not only examples but also annotations. For instance, when training a model with X-ray images, it must be informed about the medical conditions to identify within them. Unfortunately, obtaining these annotations is challenging and resource-intensive, often necessitating the expertise of doctors whose time is better spent with patients.
Self-Supervised Learning Addresses Annotation Scarcity
In scenarios with limited annotations, like recognizing medical conditions in X-ray images, we often find ourselves with a wealth of data but only a fraction of it labeled. A conventional supervised approach would limit us to the small annotated dataset. However, self-supervised learning allows us to utilize unlabeled images as well.
The initial step involves the self-supervised model generating pseudo-labels from the unlabeled data, a process known as self-supervised pre-training. During this stage, the model undertakes a pretext task—such as predicting a masked segment of an image or determining its rotation angle. We will later explore how to select these pretext tasks effectively.
The outcome is a pre-trained model that has absorbed the patterns within the unlabeled data. While it may lack knowledge about specific medical conditions (as those insights are tied to the labels it hasn't encountered), it can still recognize consistent differences between various X-ray images. This foundational knowledge is what LeCun refers to when discussing the development of background knowledge.
Following the pre-training phase, the model is fine-tuned in a conventional supervised manner using the labeled portion of the dataset. The advantage here is that with the model's newfound background knowledge, providing just a few annotated examples can significantly enhance its ability to tackle the downstream task—detecting medical conditions in X-ray images.
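To make the two-stage recipe concrete, here is a toy NumPy sketch of the simplest form of fine-tuning, often called linear probing: a logistic-regression head is fitted on frozen embeddings that stand in for the pre-trained backbone's output. The synthetic data and all names here are illustrative assumptions, not the actual training setup:

```python
import numpy as np

def fit_linear_head(features, labels, lr=0.1, epochs=300):
    """Fit a logistic-regression head on frozen embeddings, the simplest
    form of fine-tuning (linear probing). `features` stands in for the
    pre-trained backbone's output on the labeled set."""
    n, d = features.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(features @ w + b)))  # sigmoid
        err = p - labels                               # dLoss/dLogits
        w -= lr * features.T @ err / n
        b -= lr * err.mean()
    return w, b

# Synthetic embeddings where dimension 0 encodes the downstream label.
rng = np.random.default_rng(1)
feats = rng.normal(size=(100, 8))
labs = (feats[:, 0] > 0).astype(float)
w, b = fit_linear_head(feats, labs)
preds = (feats @ w + b > 0).astype(float)
```

The point of the sketch: if pre-training has already arranged the embedding space so that the downstream classes are separable, even a tiny labeled set suffices to fit the final decision boundary.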
Pretext Tasks: How Models Learn
Now, let’s examine the pretext tasks that the model engages with during pre-training. A variety of methods exist, and the options are virtually limitless, provided that we can derive pseudo-labels from the input data itself. Some of the most common approaches include:
- Masked Prediction: This involves hiding a section of the input image and tasking the model with predicting the masked area based on the remaining visible content.
- Transformation Prediction: A collection of methods falls under this category, where the model is presented with transformed images—such as rotated or color-shifted images—and is required to predict the parameters of these transformations.
- Jigsaw Puzzle: In this approach, the model is challenged to rearrange pieces of an image that have been randomly shuffled to restore the original configuration.
- Instance Discrimination: This task requires multiple views of the same object (e.g., different angles of a cat), with the goal of recognizing whether two images depict the same entity.
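For the first of these, a tiny NumPy sketch shows how the pseudo-label is manufactured: the hidden patch itself becomes the regression target (the helper name and toy image are illustrative):

```python
import numpy as np

def masked_prediction_pair(image, rng, patch=2):
    """Hide a random square patch of the image; the masked image is the
    model input and the hidden pixels are the regression target."""
    h, w = image.shape
    i = int(rng.integers(0, h - patch + 1))
    j = int(rng.integers(0, w - patch + 1))
    target = image[i:i + patch, j:j + patch].copy()  # the pseudo-label
    masked = image.copy()
    masked[i:i + patch, j:j + patch] = 0             # the visible input
    return masked, target, (i, j)

rng = np.random.default_rng(0)
img = np.arange(1, 26, dtype=float).reshape(5, 5)
masked, target, (i, j) = masked_prediction_pair(img, rng)
```

The other pretext tasks follow the same pattern: the supervision signal (rotation angle, tile permutation, instance identity) is computed from the input itself.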
The aim of each pretext task is to compel the model to learn the underlying structure and patterns in the data. Recent research indicates that contrastive learning techniques are particularly effective in achieving this objective.
Contrastive Learning: A Closer Look
Contrastive learning operates on the principle of comparing samples to identify shared patterns and differentiate them from distinct ones. This approach can be applied in both supervised and self-supervised contexts.
For instance, consider a security scenario in which you want to install a door that opens exclusively for verified employees. You might only have a handful of images of each employee to train your model. A practical solution involves training a model to determine whether two images portray the same individual. During training, the model would receive three images: two of the same person and one of a different individual, learning to recognize the similarities and differences.
The self-supervised variant of contrastive learning employs a similar methodology. The model is presented with three images: an anchor image, a transformed version of that anchor (positive example), and another random image (negative example). The goal is to teach the model that the first two images are similar, while the last one is not.
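The triplet construction can be sketched in a few lines of NumPy. Here a horizontal flip serves as a toy stand-in for the augmentation, and negatives are drawn from the rest of the batch (the function names are illustrative):

```python
import numpy as np

def make_triplets(images, augment, rng):
    """From a batch of unlabeled images, build (anchor, positive, negative)
    triplets: the positive is an augmented copy of the anchor, the
    negative is any other image from the batch."""
    n = len(images)
    triplets = []
    for idx, img in enumerate(images):
        neg_idx = (idx + int(rng.integers(1, n))) % n  # never idx itself
        triplets.append((img, augment(img, rng), images[neg_idx]))
    return triplets

# A toy stand-in augmentation: horizontal flip.
flip = lambda im, rng: im[:, ::-1]

rng = np.random.default_rng(0)
batch = [np.arange(9).reshape(3, 3) + 10 * k for k in range(4)]
triplets = make_triplets(batch, flip, rng)
```

Note that the "labels" (same entity / different entity) fall out of the sampling procedure for free, which is exactly what makes this self-supervised.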
Let’s take a moment to explore some prominent self-supervised contrastive architectures in greater detail.
- Triplet Loss: A straightforward approach where the model processes an anchor, positive, and negative image through a backbone model to generate embeddings. The triplet loss function then encourages the model to position the anchor and positive images closely in the latent space while distancing the negative image.
- SimCLR: Developed by Google Research in 2020, this model creates two augmented views of each image and passes both through a ResNet encoder. Its contrastive NT-Xent loss (a variant of Noise Contrastive Estimation) maximizes the similarity between the two views' embeddings while minimizing their similarity to the embeddings of other images in the batch.
- MoCo: This model from Facebook AI Research sidesteps the large batch sizes that SimCLR requires. It employs two separate encoder networks for the anchor and positive images, the second being a momentum-updated copy of the first, and samples negative examples from a queue-like memory bank of past embeddings.
- BYOL: Developed by DeepMind, BYOL uses an online network and a target network, where the target's weights are an exponential moving average of the online network's. It dispenses with negative examples entirely: rather than contrasting images, it trains the online network to map the positive example and the anchor to the same location in the embedding space.
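As a concrete reference point, the triplet loss in the first bullet above can be written in a few lines of NumPy. The margin value and toy embeddings below are illustrative:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Margin-based triplet loss on embedding vectors: penalize the model
    unless the positive sits at least `margin` closer to the anchor
    (in squared distance) than the negative does."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])        # positive: close to the anchor
far = np.array([2.0, 0.0])      # well-separated negative: zero loss
near = np.array([0.2, 0.0])     # negative too close: positive loss
loss_far, loss_near = triplet_loss(a, p, far), triplet_loss(a, p, near)
```

The hinge at zero means already well-separated triplets contribute no gradient, so training effort concentrates on the hard cases.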
Emerging architectures continue to evolve, and research is increasingly aimed at enhancing model transferability across a variety of tasks. Notable examples include Barlow Twins, SwAV, SimSiam, and more recent models like TiCo and VICRegL.
Choosing Transformations: Critical Considerations
Having discussed the mechanics of self-supervised learning and its resolution of annotation scarcity, as well as various pretext tasks and advanced contrastive architectures, we now turn our attention to the crucial task of selecting transformations for the anchor images. This choice can significantly impact the effectiveness of SSL in real-world applications.
The studies behind SimCLR and MoCo identified the transformations that work best on natural images, such as random cropping, color jitter, and blurring. However, each transformation instills a particular invariance in the model, and these invariances are not always beneficial.
For example, consider a dataset comprising images of birds, flowers, and elephants. Depending on the transformation employed—color shift, rotation, or texture change—certain downstream tasks may be facilitated while others could suffer. If color shift is used, the model will learn to treat semantically similar but differently-colored images as equivalent, potentially hindering its ability to differentiate between species that primarily differ in color.
The lesson here is clear: the choice of transformations must align with the specific downstream tasks you intend for the model to perform. Inappropriate transformations during pre-training can lead to suboptimal performance later on.
Applying Transformations to Medical Imaging
Let’s examine how critical the selection of transformations is for X-ray images. If we disregard the specific task of classifying images based on medical conditions and simply follow the recommendations of researchers to apply random cropping, we may encounter issues.
For instance, if a portion of the image indicating lung damage is inadvertently cropped out, the model could erroneously learn that a damaged lung and a healthy lung are similar. Such misguided pre-training would complicate the model's ability to accurately recognize lung damage later on.
Similarly, applying color jitter or blurring to grayscale X-ray images may prove counterproductive, as these variations might signify specific medical conditions. Therefore, it’s essential to tailor transformations to the dataset and the downstream tasks at hand.
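A domain-aware pipeline along these lines can be sketched in NumPy: mild, geometry-preserving transformations that are plausible for grayscale X-rays, in place of the color jitter and blurring used for natural images. All function names and parameter values here are illustrative assumptions, not the exact pipeline from any paper:

```python
import numpy as np

def compose(*transforms):
    """Chain label-free transformations into one augmentation pipeline."""
    def pipeline(image, rng):
        for t in transforms:
            image = t(image, rng)
        return image
    return pipeline

def horizontal_flip(image, rng):
    return image[:, ::-1] if rng.random() < 0.5 else image

def small_shift(image, rng, max_px=2):
    """Tiny translation (with wrap-around, a simplification) as a mild
    alternative to aggressive random cropping."""
    dx = int(rng.integers(-max_px, max_px + 1))
    return np.roll(image, dx, axis=1)

augment = compose(horizontal_flip, small_shift)
rng = np.random.default_rng(0)
img = np.arange(36, dtype=float).reshape(6, 6)
view = augment(img, rng)
```

The design choice to encode here is that every transformation declares an invariance the downstream task must tolerate; for X-rays, small geometric jitter is usually safe, while intensity changes may not be.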
Real-World Implementation: X-ray Classification
In collaboration with my colleagues at Tooploox, we sought to explore the potential benefits of self-supervised learning in medical diagnosis. We utilized the CheXpert dataset, which contains approximately 220,000 chest X-ray images labeled with ten non-exclusive classes representing various medical conditions.
For our experiments, we selected a random subset of around 200,000 images for self-supervised pre-training, deliberately ignoring the accompanying labels. We applied slight random rotations, horizontal flips, and perspective transformations to the anchor images.
Upon completion of pre-training, we fine-tuned the model using labeled datasets of varying sizes, ranging from 250 to 10,000 images. Our objective was to assess how performance fluctuated with different labeled set sizes.
Finally, we tested our models on a carefully curated subset of 300 manually labeled images, with the aim of evaluating their efficacy.
Performance Assessment
We compared three distinct model architectures:
- A conventional transfer learning model utilizing ResNet18, trained solely on the labeled fine-tuning set—illustrating the scenario without self-supervised learning.
- A simple triplet loss model employing ResNet18 as a backbone, pre-trained using our chosen transformations.
- Meta's MoCo model, also utilizing ResNet18 and our selected transformations.
Each model underwent ten training and testing iterations with different labeled fine-tuning set sizes, and we evaluated their performance using the area under the ROC curve (AUC).
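Since AUC is the metric throughout, here is a compact NumPy implementation of its rank-based definition, useful for sanity-checking such evaluations (the toy labels and scores are made up):

```python
import numpy as np

def auc_score(labels, scores):
    """Area under the ROC curve via its rank interpretation: the
    probability that a random positive is scored above a random
    negative (ties count as half)."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

labels = np.array([1, 1, 0, 0, 1])
scores = np.array([0.9, 0.8, 0.3, 0.4, 0.2])
auc = auc_score(labels, scores)
```

AUC is threshold-free and insensitive to class imbalance in the score distribution, which makes it a reasonable choice for multi-label medical data with rare conditions.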
Results
Our findings revealed that the self-supervised models consistently outperformed the supervised baseline. Noteworthy insights include:
- The greatest improvement from self-supervised pre-training occurred with the smallest labeled set, showing a 10 percentage point advantage with just 250 labeled examples.
- Even with a larger labeled dataset (10,000 examples), self-supervised pre-training still provided a gain of around 6 percentage points over the supervised baseline.
- Among the self-supervised models, MoCo demonstrated greater performance enhancements than the Triplet Loss approach, particularly with smaller labeled datasets.
We also examined class frequencies within our data. The results indicated that self-supervision benefitted all classes, with the most significant advantages observed for relatively rare classes.
Conclusion
Self-supervised learning in computer vision has made remarkable strides over the past few years, with contrastive architectures developed by major AI research institutions, particularly Meta, setting new benchmarks.
This evolution signifies two critical developments: the ability to harness unlabeled datasets efficiently will revolutionize various industries where annotated data is scarce, and the training of foundation models capable of transferring knowledge to diverse downstream tasks is pivotal for advancing AI generalization.
Acknowledgments
This article is derived from a presentation I delivered at the Data Science Summit in Warsaw, Poland, on December 18, 2022. The associated slides are available for reference.
The collaborative research into the application of self-supervised learning in medical contexts was conducted with my colleagues at Tooploox. More details can be found on the company blog.
Thank you for reading! Stay updated on the fast-evolving landscape of machine learning and AI by subscribing to my newsletter, AI Pulse. For consulting inquiries, feel free to reach out or book a one-on-one session with me.
Consider exploring my other articles for more insights!