Exploring the Segment Anything Model: Object Detection Analysis
Chapter 1: Introduction to the Segment Anything Model
In my recent review of the Segment Anything Model (SAM), I discovered that beyond the vit-h (huge) variant, there are two additional configurations referred to as vit-b (base) and vit-l (large). The primary distinction among these three modes relates to their transformer layer configurations, attention heads, and hidden dimension sizes.
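To make that difference concrete, the snippet below summarizes the image-encoder settings of the three variants as a small dictionary. The numbers are the defaults listed in build_sam.py of the official segment-anything repository; verify them against the version of the code you are using:
# Image-encoder settings of the three SAM variants
# (defaults from build_sam.py in the official segment-anything repository)
SAM_ENCODER_CONFIGS = {
    "vit_b": {"hidden_dim": 768,  "layers": 12, "attention_heads": 12},
    "vit_l": {"hidden_dim": 1024, "layers": 24, "attention_heads": 16},
    "vit_h": {"hidden_dim": 1280, "layers": 32, "attention_heads": 16},
}
for name, config in SAM_ENCODER_CONFIGS.items():
    print(name, config)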
I was curious whether these variations would influence the number of objects each SAM mode could identify within the same image. To investigate this, I applied all three versions (vit-h, vit-l, and vit-b) to segment a Landsat-8 image taken over an agricultural region in California, aiming to count the number of delineated blocks. Subsequently, I compared the detected number of blocks against the actual count visible in the image.
If you're intrigued by this comparison and wish to replicate my process, I encourage you to read further.
Section 1.1: Downloading an RGB Landsat-8 Image
To implement different SAM modes on satellite imagery, I utilized a Landsat-8 image, which can be downloaded through the following link. For those wanting to use the same image, refer to the section titled "📥 Download Satellite Images in GeoTIFF Format for Your AOI."
After securing the image, execute the following code to create three directories in your workspace: Landsat-8, Image, and SAM_outputs:
import os
# Folder names
folders = ['Landsat-8', 'Image', 'SAM_outputs']
# Create the folders
for folder in folders:
    if not os.path.exists(folder):
        os.makedirs(folder)
The Landsat-8 directory will house the satellite image in GeoTIFF format, the Image folder will store the PNG file after visualizing the satellite image in RGB, and the SAM_outputs folder will keep the segmented satellite images resulting from the three SAM modes (vit-h, vit-l, vit-b).
Next, we will install the rasterio library, read each band, create a stacked array, visualize the stacked layer using Matplotlib, and save it in both GeoTIFF and PNG formats in the respective folders:
!pip install rasterio
import rasterio
import numpy as np
from rasterio.plot import show
import matplotlib.pyplot as plt
# Open B2, B3, and B4 Landsat-8 raster files
b4 = rasterio.open('/content/Landsat-8/L08.002_SR_B4_CU_doy2023166_aid0001.tif')
b3 = rasterio.open('/content/Landsat-8/L08.002_SR_B3_CU_doy2023166_aid0001.tif')
b2 = rasterio.open('/content/Landsat-8/L08.002_SR_B2_CU_doy2023166_aid0001.tif')
# Read the data from the raster files
b4_data = b4.read(1)
b3_data = b3.read(1)
b2_data = b2.read(1)
# Stack the bands into a single 3D array
rgb = np.stack((b4_data, b3_data, b2_data), axis=0)
# Normalize the data to the 0-1 range
rgb = rgb / np.max(rgb)
# Display the RGB map
show(rgb)
# Save the RGB map as a PNG file
plt.imsave('/content/Image/Landsat-8.png', np.transpose(rgb, (1, 2, 0)))
# Export the stacked data as a new GeoTIFF file
with rasterio.open(
    '/content/Landsat-8/Landsat-8_2023166_RGB.tif', 'w',
    driver='GTiff', height=b2.height, width=b2.width, count=3,
    dtype=rgb.dtype, crs=b2.crs, transform=b2.transform
) as dst:
    dst.write(rgb, indexes=[1, 2, 3])
The resulting plot will display a Landsat-8 image of California captured in June 2023.
Section 1.2: Downloading LandIQ Map
For an accurate block count within the Landsat-8 image, we will utilize a vector layer that outlines actual block boundaries. A good source for this data is the shapefile of agricultural blocks from LandIQ. You can download this layer from their website:
By opening this layer in QGIS and inspecting the attribute table, you can determine the precise number of blocks that fall within the Landsat-8 image: the highest ID in the table is 214, which corresponds to the total number of blocks in this scene.
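If you prefer to stay in Python rather than QGIS, a quick way to get a comparable count is to read the shapefile with geopandas and keep only the blocks inside the image footprint. This is only a sketch: the shapefile path below is a placeholder for wherever you saved the LandIQ download, and the selection uses the raster bounding box rather than the exact footprint, so the count may differ slightly from the attribute table:
import geopandas as gpd
import rasterio

# Placeholder path -- point it to your LandIQ shapefile
blocks = gpd.read_file('/content/LandIQ/landiq_blocks.shp')

# Use the footprint of the exported RGB GeoTIFF to keep only blocks inside the scene
with rasterio.open('/content/Landsat-8/Landsat-8_2023166_RGB.tif') as src:
    bounds = src.bounds
    raster_crs = src.crs.to_wkt()

# Reproject the vector layer to the raster CRS, then select features within the raster bounds
blocks = blocks.to_crs(raster_crs)
blocks_in_scene = blocks.cx[bounds.left:bounds.right, bounds.bottom:bounds.top]

print(f"Blocks intersecting the Landsat-8 scene: {len(blocks_in_scene)}")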
Chapter 2: Applying SAM Modes on the Landsat-8 Image
The first video titled "SAM - Segment Anything Model by Meta AI: Complete Guide | Python Setup & Applications" provides a comprehensive overview of SAM and its applications.
To analyze the various SAM modes on the Landsat-8 image, we'll write Python code. If you're unfamiliar with SAM or how to implement it in Python, please refer to this post:
To apply the different SAM modes to the satellite image, install the supervision library, clone the SAM repository from GitHub, and download the checkpoints for SAM vit-h, vit-l, and vit-b:
!pip install supervision
!git clone https://github.com/facebookresearch/segment-anything.git
%cd segment-anything
Download the models:
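For completeness, the three checkpoints can be fetched with wget. The URLs below are the download links listed in the official segment-anything README, and the commands assume you are already inside the cloned repository folder; double-check them against the repository before running:
!wget -q https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
!wget -q https://dl.fbaipublicfiles.com/segment_anything/sam_vit_l_0b3195.pth
!wget -q https://dl.fbaipublicfiles.com/segment_anything/sam_vit_b_01ec64.pth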
In this step, we will set up the Google Colab environment to use the GPU if available, load the checkpoints, and write a Python script that detects objects in the satellite image with each SAM mode. The segmented images will be saved in PNG format as the script iterates through SAM's vit-h, vit-l, and vit-b modes:
import torch
import cv2
import supervision as sv
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Use the GPU if one is available
DEVICE = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')

MODEL_TYPE = ["vit_h", "vit_l", "vit_b"]

# Read the PNG once and convert it to RGB for SAM
IMAGE_PATH = "/content/Image/Landsat-8.png"
image_bgr = cv2.imread(IMAGE_PATH)
image_rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)

for model in MODEL_TYPE:
    # Pick the checkpoint that matches the current model type
    if model == 'vit_h':
        CHECKPOINT_PATH = f"./sam_{model}_4b8939.pth"
    elif model == 'vit_l':
        CHECKPOINT_PATH = f"./sam_{model}_0b3195.pth"
    else:
        CHECKPOINT_PATH = f"./sam_{model}_01ec64.pth"

    # Load the checkpoint and build an automatic mask generator
    sam = sam_model_registry[model](checkpoint=CHECKPOINT_PATH).to(device=DEVICE)
    mask_generator = SamAutomaticMaskGenerator(sam)

    # Generate masks and count the detected segments
    sam_result = mask_generator.generate(image_rgb)
    detections = sv.Detections.from_sam(sam_result=sam_result)
    num_segments = len(detections)
    print(f"Number of segments detected based on Model {model}: {num_segments}")

    # Annotate the image with one color per detected mask
    mask_annotator = sv.MaskAnnotator(color_lookup=sv.ColorLookup.INDEX)
    annotated_image = mask_annotator.annotate(scene=image_bgr.copy(), detections=detections)

    # Resize the annotated image before saving
    scale_factor = 2  # Adjust this value to control the size of the saved image
    new_size = (int(annotated_image.shape[1] * scale_factor),
                int(annotated_image.shape[0] * scale_factor))
    resized_annotated_image = cv2.resize(annotated_image, new_size, interpolation=cv2.INTER_LINEAR)

    # Save the annotated image as a PNG file (cv2.imwrite expects BGR, matching the annotated scene)
    cv2.imwrite(f"/content/SAM_outputs/Segmented_by_{model}.png", resized_annotated_image)

    # Plot the original and segmented images side by side
    sv.plot_images_grid(
        images=[image_bgr, annotated_image],
        grid_size=(1, 2),
        titles=['Landsat-8', f'Segmented Image by Model: {model}']
    )
The second video titled "Segment Anything 2 (SAM 2): Meta AI's Newest Model | Community Q&A (Jul 30)" delves into the latest iteration of SAM and its functionalities.
Section 2.1: Visualizing the Segmented Images
Upon inspecting the segmented images saved in PNG format, you will find the original satellite image on the left and the corresponding segmented image on the right for each SAM mode:
- The satellite image vs. the segmented image by SAM vit-h
- The satellite image vs. the segmented image by SAM vit-l
- The satellite image vs. the segmented image by SAM vit-b
From the visual assessment, it appears that the performance of vit-h and vit-l is quite similar, indicating that the number of detected blocks in each case should be comparable. However, it is evident that SAM struggled to identify many blocks when operating in the base mode (vit-b), resulting in numerous missed detections.
Section 2.2: Comparing Detected Segments vs. Actual Count
The code above also prints the number of objects (blocks) that SAM detected in each mode. Let's review these figures:
- Detected segments for vit_h: 146
- Detected segments for vit_l: 149
- Detected segments for vit_b: 103
As anticipated, the detected counts for SAM in vit-h and vit-l are quite close (146 versus 149), while the base mode (vit-b) recorded only 103. Given that there are 214 blocks in this image according to the LandIQ layer, we can define accuracy as the ratio of detected blocks to actual blocks. The results are summarized as follows:
- SAM vit-h: (146/214)*100 = 68%
- SAM vit-l: (149/214)*100 = 70%
- SAM vit-b: (103/214)*100 = 48%
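The same percentages can be reproduced with a few lines of Python, using the segment counts printed by the loop above and the 214 blocks from the LandIQ layer:
# Counts reported by the SAM loop and the block total from LandIQ
actual_blocks = 214
detected_segments = {"vit_h": 146, "vit_l": 149, "vit_b": 103}

for model, count in detected_segments.items():
    accuracy = count / actual_blocks * 100
    print(f"SAM {model}: {count}/{actual_blocks} blocks detected ({accuracy:.0f}%)")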
Chapter 3: Conclusion
The Segment Anything Model (SAM) serves as a robust algorithm designed for the automatic segmentation of images and detection of objects without prior training. This algorithm features three distinct configurations. In this discussion, I assessed the efficacy of each mode on a satellite image. The findings indicated that both the "huge" and "large" modes performed similarly, successfully identifying nearly 70% of the blocks in an image containing 214 blocks. Conversely, the algorithm's performance in the "base" mode was notably poorer, missing over 50% of the blocks in the satellite image.
Chapter 4: References
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., Dollár, P., & Girshick, R. (2023). Segment Anything. arXiv:2304.02643.