Reading through the Segment Anything Model (SAM) paper; some notes & insights as I go in this Twitter thread. This whole paper is quite incredible actually, very strong work 1/n

arxiv.org/pdf/2304.02643.pdf
First, it sounds like their prompt module can take arbitrary masks as input. This means you could take an existing supervised model that emits detections & feed THOSE outputs to SAM as prompts, bootstrapping a large amount of pseudo-labeled masks via SAM
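A minimal sketch of that bootstrapping idea, assuming the released segment_anything package (SamPredictor) and a hypothetical detector that emits boxes -- the checkpoint path and detector are placeholders, and I prompt with the detector's boxes here since SAM's mask_input expects low-resolution mask logits rather than full-size masks:

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Placeholder checkpoint path -- swap in the real SAM weights.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

def bootstrap_masks(image, detector):
    """Turn a supervised detector's boxes into SAM masks (pseudo-labels)."""
    predictor.set_image(image)                  # expensive image-encoder pass, done once
    pseudo_masks = []
    for box in detector(image):                 # hypothetical detector; box = [x0, y0, x1, y1]
        masks, scores, _ = predictor.predict(
            box=np.asarray(box),
            multimask_output=False,             # one mask per (unambiguous) box prompt
        )
        pseudo_masks.append(masks[0])           # boolean HxW mask
    return pseudo_masks
```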
2nd, they have the insight that the future of these systems is loosely coupled components that interact with each other.

I think this is why OpenAI is not training GPT-5 -- the future might actually be decomposed modules interacting, rather than yet another giant monolithic GPT-X model
For their image encoding piece, they chose a Masked Autoencoder (MAE) pre-trained ViT, which I personally find very compelling. There are many different ways to get encodings, or to embed Transformers into an architecture, but MAEs have a conceptual simplicity & elegance to them IMHO.
A question I still have: since the image encoder is separate, could one train a custom encoder, such as a geospatial masked autoencoder, to get geospatially aware embeddings? Or is the rest of the system overfit to the particular distribution of its current image encoder?
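Purely a speculative sketch of that swap, assuming the released model's modular layout (sam.image_encoder feeding a (B, 256, 64, 64) embedding to the rest). The stub encoder below only shows the interface a geospatial MAE would have to match; whether the frozen prompt encoder & mask decoder would tolerate a new embedding distribution is exactly the open question:

```python
import torch.nn as nn
from segment_anything import sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder checkpoint path

class GeoMAEEncoder(nn.Module):
    """Hypothetical geospatial MAE-pretrained encoder (stub). To be a drop-in
    replacement it must map (B, 3, 1024, 1024) images to (B, 256, 64, 64)
    embeddings, matching what the rest of SAM expects."""
    def __init__(self):
        super().__init__()
        self.img_size = 1024                          # SamPredictor reads this attribute
        self.proj = nn.Conv2d(3, 256, 16, stride=16)  # stand-in for a real ViT/MAE backbone

    def forward(self, x):
        return self.proj(x)                           # (B, 256, 64, 64)

sam.image_encoder = GeoMAEEncoder()                   # swap the encoder; rest of SAM untouched
```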
They aren't releasing the text encoder with the SAM release, but it sounds like they just took the CLIP text encoder as is and used that in their experiments.
Their mask decoder uses a linear combination of two losses, which is interesting: focal loss (which can be useful for dealing with unbalanced data, though it tends to give uncalibrated probability-like outputs) and dice loss (not quite sure why they use dice loss, to be honest)
I asked GPT-4 for its opinion on why dice loss was used, & its reply sounds reasonable: the dice term helps with smaller objects and directly rewards overlap between the predicted & ground-truth masks
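For concreteness, a sketch of what that linear combination might look like -- my own PyTorch rendering, not the paper's code; the 20:1 focal-to-dice weighting is the one mentioned further down this thread:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, target, alpha=0.25, gamma=2.0):
    """Focal loss on per-pixel mask logits; down-weights easy pixels."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p_t = p * target + (1 - p) * (1 - target)
    alpha_t = alpha * target + (1 - alpha) * (1 - target)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def dice_loss(logits, target, eps=1.0):
    """Dice loss: penalizes low overlap, so small objects still matter."""
    p = torch.sigmoid(logits).flatten(1)
    t = target.flatten(1)
    inter = (p * t).sum(-1)
    return (1 - (2 * inter + eps) / (p.sum(-1) + t.sum(-1) + eps)).mean()

def mask_loss(logits, target, w_focal=20.0, w_dice=1.0):
    """Linear combination, 20:1 focal:dice as noted later in the thread."""
    return w_focal * focal_loss(logits, target) + w_dice * dice_loss(logits, target)
```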
The human-in-the-loop "data engine" they built is also really interesting to me
The 1st stage ran directly in the browser & had annotators make judgements on what they see within ~30 seconds, then move on
Sounds like at this stage they also used the smaller ViT-B model & bootstrapped from common public segmentation datasets, then later "ate their own data dogfood" & retrained on it, moving up to ViT-H (probably starting with ViT-B to prevent overfitting while they still had less data).
This whole paper & body of work are just so strong; very well written
The images in the SA-1B dataset they release are very high resolution (3300 × 4950 pixels), which roughly approximates the size of remote sensing images (sometimes ~4k × 4k), so that's a positive note towards SAM generalizing to remote sensing.
Note that the images were licensed "from a provider that works directly with photographers". This might mean SAM has a bias towards very high-quality professional images, which isn't representative of the real world, where people, sensors, etc. often produce much lower-quality imagery
One of the cool things is that they test SAM zero-shot on a variety of tasks beyond its core promptable segmentation, where it does quite well: edge detection, object proposals, instance segmentation, and text-to-mask generation.
Interesting detail & modification for the text-to-mask experiment they ran: during training they prompt SAM with CLIP *image* embeddings of each mask's crop, then at inference they swap in CLIP *text* embeddings, relying on CLIP's image & text embeddings being trained to align
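The text side isn't in the public release, but the CLIP half is easy to reproduce. A sketch using the open-source clip package -- the specific CLIP variant below is my assumption, and the wiring of the embedding into SAM's prompt encoder is not public:

```python
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-L/14", device=device)        # assumption: pick a variant matching the paper
tokens = clip.tokenize(["a wheel", "a wiper blade"]).to(device)

with torch.no_grad():
    text_emb = model.encode_text(tokens)               # (2, 768) text embeddings
    # In the paper's setup these would be fed to SAM's prompt encoder in place
    # of the CLIP *image* embeddings used at training time.
```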
An interesting detail in their data engine training: when they mix together the manual, semi-automatic, and fully automatic masks, they found they have to boost & oversample the manual/semi-automatic masks so they don't get swamped by the automatic ones
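A sketch of one way to do that kind of oversampling with PyTorch's WeightedRandomSampler; the 10× boost factor is purely a placeholder, not the paper's number:

```python
import torch
from torch.utils.data import WeightedRandomSampler, DataLoader

def make_loader(dataset, stages, boost=10.0, batch_size=8):
    """Oversample manual / semi-automatic masks so the (much larger) fully
    automatic stage doesn't swamp them. `stages[i]` labels each example as
    "manual", "semi_auto", or "auto"; `boost` is a placeholder factor."""
    weights = torch.tensor(
        [boost if s in ("manual", "semi_auto") else 1.0 for s in stages]
    )
    sampler = WeightedRandomSampler(weights, num_samples=len(dataset), replacement=True)
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)
```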
Another interesting detail: training on just 10% of the full dataset yielded almost as good results as the full dataset, which could make replication & follow-on work in this area much easier
BTW, there's a strong focus on hard computational time guarantees at mask generation time (~50 ms). I suspect this is also aimed at future VR/AR applications that will run onboard headsets, prompted by eye gaze, gestures, etc., where hard frames-per-second guarantees matter
Note, though, there's an important subtle issue in SAM: the image encoder is expensive and separate from the mask decoder. So you get ~50 ms for the mask decoder, but the image encoder itself is *expensive*. Maybe distillation techniques can help on the image encoder side?
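That asymmetry shows up directly in the released API: you pay for the image encoder once per image via set_image, and every subsequent prompt only runs the cheap prompt encoder + mask decoder. A rough sketch (checkpoint path, frame, and click stream are all placeholders):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")   # placeholder path
predictor = SamPredictor(sam)

image = np.zeros((1080, 1920, 3), dtype=np.uint8)   # stand-in frame (HxWx3 RGB)
clicks = [(960, 540), (400, 300)]                   # stand-in gaze/gesture points

predictor.set_image(image)                 # slow: one full ViT-H image-encoder pass
for x, y in clicks:
    masks, scores, _ = predictor.predict(  # fast: prompt encoder + mask decoder only
        point_coords=np.array([[x, y]]),
        point_labels=np.array([1]),        # 1 = foreground click
        multimask_output=True,
    )
```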

Other limitations:
The prompt encoder has a learned embedding for whether a point is foreground or background, which I expect to perform badly on remote sensing or biomedical imagery, where there is no clear foreground or background
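For reference, that learned label surfaces in the public API as the 0/1 point_labels argument. Continuing from a SamPredictor with set_image() already called (see the earlier sketches), mixing a "foreground" and a "background" click looks like this -- whether those labels mean anything for nadir satellite tiles is exactly the worry:

```python
import numpy as np

# `predictor` is the SamPredictor from the earlier sketch, image already set.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[512, 384],     # click on the object   -> "foreground"
                           [600, 384]]),   # click to exclude area -> "background"
    point_labels=np.array([1, 0]),         # 1 / 0 select the learned FG / BG embeddings
    multimask_output=False,
)
```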
I thought this was kind of a clever trick: they emit 3 mask predictions to deal with prompt ambiguity, but only backpropagate through the lowest-loss prediction
BTW, that combined loss weights focal and dice at 20:1 (20× focal + 1× dice)
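A sketch of how that "only backprop the lowest loss" trick typically looks -- my own rendering, with a per-pixel BCE standing in for the 20:1 focal+dice combination above:

```python
import torch
import torch.nn.functional as F

def ambiguity_loss(pred_logits, target):
    """pred_logits: (B, 3, H, W) -- three candidate masks per prompt.
    target:       (B, H, W)     -- single ground-truth mask.
    Only the best of the three candidates receives gradient."""
    per_mask = F.binary_cross_entropy_with_logits(   # stand-in for 20*focal + 1*dice
        pred_logits,
        target.unsqueeze(1).expand_as(pred_logits),
        reduction="none",
    ).mean(dim=(2, 3))                               # (B, 3) loss per candidate
    best = per_mask.min(dim=1).values                # lowest-loss candidate per example
    return best.mean()                               # gradient flows only through the min
```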
Note that among their 23 test segmentation datasets there are no remote sensing benchmarks. However, there is a dataset named InstanceBuilding2D consisting of "High-resolution drone UAV images annotated with roof instance segmentation masks."