Prototype exploring semantic search for museum collections using AI-generated visual descriptions and embeddings.
- Search artworks using multilingual natural language queries
- Explore the embedding space with interactive visualizations
- Compare traditional & semantic search techniques side-by-side
- Upload an image to find similar artworks
- Try the Demo - Search 5,280 Open Access Met paintings
- Explore the Embeddings Visualization - See how artworks cluster by text & image similarity
- Technical Guide & Setup - Setup and development guide
Try these queries to see how semantic search finds artworks that traditional keyword search might miss:
"Three Women", "Gazing from the window", "The gnarled tree", "Mother and child", "Old man with a beard", "Person fighting a monster", "People on a bridge", "A banquet scene", "Man on a horse", "Ruins in a landscape", "Sleeping person", "The reclining nude", "Man in armor", "A person with a dog", "Flowers in a vase"
While AI can enhance search capabilities, it can also perpetuate existing biases or create new ones (see "Improving the Search: Uncovering AI bias in digital collections"). AI-generated content should ideally be verified and edited by human experts. Museum collections search is complex and nuanced. This project is a quick prototype; see the Musefully project, which better reflects proper faceted search.
5,280 Open Access paintings with images from The Metropolitan Museum of Art's Open Access collection. View the full Met Open Access dataset
AI is used to generate three types of descriptions for each artwork that adhere to the Cooper Hewitt Guidelines for Image Description:
- Alt text (~15 words) - Concise accessibility description
- Long description (100-300 words) - Detailed visual elements
- Emoji summary (3-8 emojis) - Visual elements as symbols
Two-pass quality control:
- Generation pass - AI creates initial descriptions from the image. Full prompt
- Editorial pass - AI reviews and removes bias, interpretation, or cultural assumptions. Full prompt
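The two passes above can be sketched as a small pipeline. This is a minimal illustration rather than the project's actual code: `generate` and `edit` are hypothetical callables standing in for the two model calls, and the dictionary keys are assumed names.

```python
from typing import Callable, Dict

# Assumed shape: {"alt_text": ..., "long_description": ..., "emojis": ...}
Descriptions = Dict[str, str]


def describe_artwork(
    image: bytes,
    generate: Callable[[bytes], Descriptions],
    edit: Callable[[Descriptions], Descriptions],
) -> Dict[str, Descriptions]:
    """Run the generation pass, then the editorial pass, and keep both outputs."""
    draft = generate(image)  # pass 1: describe what is visible in the image
    final = edit(draft)      # pass 2: strip bias, interpretation, assumptions
    return {"draft": draft, "final": final}
```

Keeping the draft next to the final version makes it possible to review what the editorial pass changed.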
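Visual Descriptions Example: The Death of Socrates, Jacques Louis David, 1787. The first set of descriptions below is the output of the generation pass; the second set is the result after the editorial pass.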
- Alt Text: "Socrates, surrounded by grieving students, reaches for a cup of hemlock in a dimly lit prison cell"
- Long Description: "The painting depicts a dramatic scene within a dimly lit, stone-walled room. In the center, an elderly, muscular man with a white beard, identified as Socrates, sits upright on a bed, draped in a white cloth. He gestures upwards with his right hand, as if speaking, while his left hand reaches for a small, red cup offered by a younger, muscular man in a red tunic, who turns away with a pained expression. To the right of Socrates, several men are gathered, showing various states of distress. One man in a red robe sits on a stone block, looking down thoughtfully. Behind him, others express anguish, some covering their faces, others raising their hands in despair. To the left of Socrates, an older man with a white beard and gray robe sits slumped on a bench, his head bowed in sorrow. Further back, in a darker arched doorway, more figures are visible, including a woman being led away. On the floor in the foreground, near the slumped man, are two rolled scrolls and a broken chain. The room's architecture is simple, with stone blocks forming the walls and an arched opening leading to a darker area. A small, dark object hangs from the ceiling. The overall lighting is dim, with a subtle glow on the central figures."
- Emojis: 🧔🍷😭⛓️📜
- Alt Text: "An elderly man reaches for a cup, surrounded by distressed figures in a dimly lit stone room"
- Long Description: "A dimly lit, stone-walled room contains multiple figures. In the center, an elderly, muscular man with a white beard sits upright on a bed, draped in a white cloth. He gestures upwards with his right hand, while his left hand reaches for a small, red cup. A younger, muscular man in a red tunic offers the cup, turning his head away from the elderly man with a downturned mouth. To the right of the central elderly man, several men are gathered, displaying varied postures. One man in a red robe sits on a stone block, looking downwards. Behind him, other figures cover their faces or raise their hands. To the left of the central elderly man, an older man with a white beard and gray robe sits slumped on a bench, his head bowed. Further back, in a darker arched doorway, more figures are visible, including a woman standing near another figure. On the floor in the foreground, near the slumped man, are two rolled scrolls and a broken chain. The room features stone block walls and an arched opening leading to a darker area. A small, dark object hangs from the ceiling. The overall lighting is dim, with a subtle glow on the central figures."
- Emojis: 🧔🍷👥⛓️📜
Editorial Changes Made:
- Alt Text: Removed specific name "Socrates."
- Alt Text: Removed interpretive terms "grieving students," "hemlock," and "prison cell."
- Alt Text: Replaced with objective visual descriptions like "distressed figures" and "stone room."
- Alt Text: Adjusted word count to be closer to 15 words.
- Long Description: Removed subjective phrase "The painting depicts a dramatic scene."
- Long Description: Removed specific name "Socrates" and the phrase "identified as Socrates."
- Long Description: Removed interpretive phrases such as "as if speaking," "pained expression," "various states of distress," "looking down thoughtfully," "express anguish," "raising their hands in despair," and "bowed in sorrow."
- Long Description: Replaced character-specific references like "To the right of Socrates" with neutral spatial references like "To the right of the central elderly man."
- Long Description: Rephrased "a woman being led away" to "a woman standing near another figure" to remove implied action/intent.
- Long Description: Removed subjective judgment "The room's architecture is simple."
- Long Description: Replaced emotional descriptions of figures with objective descriptions of their postures and expressions (e.g., "downturned mouth," "displaying varied postures," "cover their faces").
- Emoji Summary: Removed "😭" emoji as it represents an emotion, which is explicitly forbidden.
- Emoji Summary: Added "👥" emoji to represent the group of multiple figures, ensuring all main visual elements are covered objectively.
The strict prompts & two-pass editorial process help reduce bias and subjective interpretation, but visual elements may be misidentified or missed entirely. The primary consideration is if, in spite of minor inaccuracies, the descriptions still improve search relevance, especially when used in text embeddings.
Excerpt from visual description of "The Penitent Magdalen" by Georges de La Tour: "The mirror reflects two lit candles, their flames appearing as elongated, bright vertical streaks against the dark background within the frame. One candle is visible on a dark, turned wooden candlestick directly in front of the mirror, while the other is only seen as a reflection." Here, the model seems confused by the mirror and incorrectly identifies two candles when there is only one.
Besides enabling better semantic search, the AI-generated visual descriptions can be used as a dataset for textual analysis. Below are some examples of frequent words used in the Met Paintings.
[Figures: frequent terms by category: Colors, Emojis, Animals, Mythological]
Sometimes strangely accurate revealing details I missed, at other times questionable and problematic, and often hilarious. Dubious practical use but fun.
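As a rough sketch of the textual analysis mentioned above, the following counts how often words from a small category vocabulary appear across descriptions. The vocabulary and the sample sentences are illustrative assumptions, not the project's data or method.

```python
import re
from collections import Counter
from typing import Iterable, List, Set, Tuple

# Hypothetical vocabulary for one category; a real list would be longer.
COLOR_WORDS: Set[str] = {"red", "white", "gray", "blue", "green", "gold"}


def count_terms(descriptions: Iterable[str], vocabulary: Set[str]) -> List[Tuple[str, int]]:
    """Count vocabulary words across all descriptions, most frequent first."""
    counts: Counter = Counter()
    for text in descriptions:
        # Lowercase and split on anything that is not a letter.
        for word in re.findall(r"[a-z]+", text.lower()):
            if word in vocabulary:
                counts[word] += 1
    return counts.most_common()


sample = [
    "A younger man in a red tunic offers a small, red cup.",
    "An older man with a white beard and gray robe sits on a bench.",
]
print(count_terms(sample, COLOR_WORDS))
```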
Below are comparisons of keyword search, text embedding search, and image embedding search for the query "woman looking into mirror".
Search for "woman looking into mirror"
Out of a result set of 20:
- The conventional Elasticsearch keyword search over Met Museum metadata produces only 3 results that I consider highly relevant.
- Text embedding search using Jina v3 embeddings on combined metadata and AI-generated descriptions returns 13 excellent results, including a number of images where the reflection or mirror is not even visible.
- Image embedding search returns 8 highly relevant results, including artworks with no actual mirror but perhaps a sense of mirroring, for example "Portrait of a Woman with a Man at a Casement" by Fra Filippo Lippi and "Dancers, Pink and Green" by Edgar Degas.
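The counts above can be read as precision at 20, meaning the share of the top 20 results judged relevant. A minimal sketch of the arithmetic, using the informal relevance counts reported in the list (the dictionary keys are just labels chosen here):

```python
def precision_at_k(relevant_count: int, k: int) -> float:
    """Fraction of the top-k results that were judged relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return relevant_count / k


judged_relevant = {"keyword": 3, "text_embedding": 13, "image_embedding": 8}
for method, relevant in judged_relevant.items():
    print(f"{method}: {precision_at_k(relevant, 20):.2f}")
# keyword: 0.15
# text_embedding: 0.65
# image_embedding: 0.40
```

These figures reflect one person's judgment on a single query, so they are indicative rather than a benchmark.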
Results that I found exciting are highlighted in the image below. A number of these I probably would have missed if browsing through images.
- "Vilaval Ragini: Folio from a ragamala series (Garland of Musical Modes)": Difficult to see, but the woman on the left is looking into a mirror.
- "Madame Marsollier and Her Daughter" by Jean Marc Nattier: I thought the AI-generated visual description and/or text embeddings had it wrong, but there is indeed a mirror in the painting and it's possible the main figure is looking into it.
- "Portrait of a Woman with a Man at a Casement" by Fra Filippo Lippi: Perhaps the woman is not looking into a mirror, but it does feel like a mirroring.
The /visualize page shows the entire collection as dots on a 2D map, where similar artworks cluster together based on shared themes, styles, and subjects. For example, "Portraits of Men" and "Portraits of Women" appear near each other, as do "Horses" and "Men on Horses". Distinct traditions like "Indian Manuscripts" form separate regions.
- Each dot represents one artwork in the collection
- Distance between dots shows semantic similarity: closer dots are more similar
- Search to highlight relevant results. Larger, brighter dots rank higher
- Color dots by artist, period, tags, or department to reveal patterns
Traditional Elasticsearch text search using BM25 scoring across artwork metadata (title, artist, medium, etc.) and optional AI-generated visual descriptions.
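A keyword query of this kind might look like the request body below, written as a Python dictionary. The field names and boosts are assumptions for illustration; the project's actual index mapping is not shown here. BM25 is Elasticsearch's default similarity, so no scoring configuration appears in the query itself.

```python
from typing import Any, Dict, List


def build_keyword_query(text: str, include_descriptions: bool = False, size: int = 20) -> Dict[str, Any]:
    """Build a multi_match request body over assumed metadata fields."""
    # Hypothetical field names; "^3" boosts matches in that field.
    fields: List[str] = ["title^3", "artist^2", "medium", "classification"]
    if include_descriptions:
        # Optionally search the AI-generated visual descriptions as well.
        fields.append("visual_description")
    return {
        "size": size,
        "query": {
            "multi_match": {
                "query": text,
                "fields": fields,
                "type": "best_fields",
            }
        },
    }


body = build_keyword_query("woman looking into mirror", include_descriptions=True)
```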
Vector similarity search using pre-computed embeddings:
- Jina v3 Text: Advanced text search combining artwork metadata with AI-generated descriptions (768 dimensions)
- SigLIP 2 Cross-Modal: True text-to-image search using Google's SigLIP 2 model (768 dimensions) - enables natural language queries like "red car in snow" or "mourning scene"
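Searching pre-computed embeddings comes down to comparing a query vector with every stored vector and keeping the closest ones. The sketch below assumes cosine similarity as the metric, which is not stated above, and uses tiny 3-dimensional vectors in place of the 768-dimensional ones.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], index: Dict[str, List[float]], k: int = 20) -> List[Tuple[str, float]]:
    """Return the k artwork ids whose embeddings are most similar to the query."""
    scored = [(art_id, cosine_similarity(query, vec)) for art_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index with made-up ids and vectors.
index = {"a": [1.0, 0.0, 0.0], "b": [0.7, 0.7, 0.0], "c": [0.0, 0.0, 1.0]}
print(top_k([1.0, 0.1, 0.0], index, k=2))
```

A brute-force scan like this is fine for a few thousand artworks; larger collections usually rely on an approximate nearest-neighbor index.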
Combines keyword and semantic search with user-adjustable balance control:
- Text Mode: Keyword + Jina v3 text embeddings
- Image Mode: Keyword + SigLIP 2 cross-modal embeddings
- Both Mode: Keyword + both embedding types using RRF
- Balance slider: 0% = pure keyword, 100% = pure semantic, 50% = equal weight
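One way to turn the slider position into per-source weights is sketched below. The function name is hypothetical, and the even split between the two embedding types in Both Mode is an assumption, since the list above does not say how the semantic share is divided.

```python
from typing import Dict


def fusion_weights(mode: str, balance_percent: float) -> Dict[str, float]:
    """Map a 0-100 slider value to weights for each ranking source."""
    if not 0 <= balance_percent <= 100:
        raise ValueError("balance_percent must be between 0 and 100")
    semantic = balance_percent / 100.0  # 100% = pure semantic
    keyword = 1.0 - semantic            # 0% = pure keyword
    if mode == "text":
        return {"keyword": keyword, "text": semantic}
    if mode == "image":
        return {"keyword": keyword, "image": semantic}
    if mode == "both":
        half = semantic / 2.0  # assumed: split evenly between embedding types
        return {"keyword": keyword, "text": half, "image": half}
    raise ValueError(f"unknown mode: {mode}")


print(fusion_weights("text", 50))
```

The resulting weights would then scale each source's contribution in whatever fusion step combines the rankings.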
By clicking on "Image Search" in the search bar, you can upload an image to find visually similar artworks using SigLIP 2 cross-modal embeddings. Such a feature could help museum-goers find more information about an artwork they see in person, or support projects like Google Arts & Culture's "Art Selfie".
Example: for one uploaded image, the first search result was "Aristotle with a Bust of Homer" by Rembrandt (Rembrandt van Rijn).
The artwork detail pages display similar artworks using four different algorithms:
Finds artworks with similar structured metadata using art historical principles:
- Artist (weight: 10) - Same artist indicates strong connection
- Date/Period (weight: 7) - Temporal proximity using Gaussian decay (±25 years)
- Medium (weight: 6) - Similar materials and techniques
- Classification (weight: 5) - Same artwork type (painting, sculpture, etc.)
- Department (weight: 4) - Museum curatorial groupings
- Culture/Nationality (weight: 4) - Cultural and geographic connections
- Period/Dynasty (weight: 3) - Art historical movements
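The weights above can be combined into a single score, as in the following sketch. The weights come from the list; everything else is an assumption: the field names, exact-match comparison for categorical fields, and reading "±25 years" as the standard deviation of the Gaussian.

```python
import math
from typing import Any, Dict, Optional

# Weights taken from the list above; keys are assumed field names.
WEIGHTS = {
    "artist": 10, "date": 7, "medium": 6, "classification": 5,
    "department": 4, "culture": 4, "period": 3,
}


def gaussian_decay(year_a: float, year_b: float, sigma: float = 25.0) -> float:
    """1.0 for identical years, falling off smoothly as the gap grows."""
    return math.exp(-((year_a - year_b) ** 2) / (2 * sigma ** 2))


def metadata_similarity(a: Dict[str, Any], b: Dict[str, Any]) -> float:
    """Weighted sum of per-field matches between two artwork records."""
    score = 0.0
    for field, weight in WEIGHTS.items():
        if field == "date":
            year_a: Optional[float] = a.get("year")
            year_b: Optional[float] = b.get("year")
            if year_a is not None and year_b is not None:
                score += weight * gaussian_decay(year_a, year_b)
        else:
            value = a.get(field)
            if value is not None and value == b.get(field):
                score += weight
    return score
```

With these assumptions, two works by the same artist made 25 years apart would score 10 for the artist plus about 4.2 for the date (7 × e^-0.5), before other fields are considered.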
Uses 768-dimensional text embeddings to find semantically similar artworks based on:
- Artwork metadata (title, artist, date, medium)
- AI-generated visual descriptions
- Contextual understanding of art terminology
Uses 768-dimensional cross-modal embeddings to find visually similar artworks:
- Analyzes visual features like composition, color, style
- Works across different media and periods
- Captures visual patterns independent of metadata
Fuses all three similarity types using weighted Reciprocal Rank Fusion (RRF):
- 35% Jina v3 text embeddings - semantic understanding
- 35% SigLIP 2 visual embeddings - visual appearance
- 30% Elasticsearch metadata - art historical context
Note that Elasticsearch has native RRF but it's only available in the Enterprise plan.
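Weighted RRF can also be computed in application code. In the sketch below, each document earns `weight / (k + rank)` from every ranking it appears in. The 35/35/30 weights are from the list above; the constant `k = 60` is a commonly used default and an assumption here, as are the function and source names.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def weighted_rrf(
    rankings: Dict[str, List[str]],
    weights: Dict[str, float],
    k: int = 60,
) -> List[Tuple[str, float]]:
    """Fuse several ranked id lists into one list, best score first."""
    scores: Dict[str, float] = defaultdict(float)
    for source, ranked_ids in rankings.items():
        weight = weights.get(source, 0.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            # Higher-ranked documents (small rank) contribute more.
            scores[doc_id] += weight / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


weights = {"text": 0.35, "image": 0.35, "metadata": 0.30}
rankings = {
    "text": ["a", "b"],
    "image": ["b", "a"],
    "metadata": ["b", "c"],
}
print([doc_id for doc_id, _ in weighted_rrf(rankings, weights)])
```

Because RRF uses only ranks, the three sources do not need comparable score scales.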
The AI curation process:
- Retrieves top 20 candidates from metadata and text embeddings searches, 5 candidates from image embeddings search
- Removes duplicates and presents candidates without scores to avoid bias
- Applies art historical expertise to select truly meaningful connections
- Enforces diversity rules (max 3 per artist, max 8 per similarity type)
- Returns up to 20 curated recommendations with confidence scores
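The candidate gathering and diversity rules above could be approximated in code as follows. This is a sketch under assumptions: the record fields (`id`, `artist`) and function names are hypothetical, and the project may enforce these rules in the model prompt rather than in code.

```python
from collections import Counter
from typing import Dict, List, Optional

DEFAULT_LIMITS = {"metadata": 20, "text": 20, "image": 5}


def gather_candidates(results: Dict[str, List[dict]], limits: Optional[Dict[str, int]] = None) -> List[dict]:
    """Take the top hits per source, drop duplicates, and omit scores."""
    limits = limits or DEFAULT_LIMITS
    seen = set()
    candidates = []
    for source, hits in results.items():
        for hit in hits[: limits.get(source, 0)]:
            if hit["id"] in seen:
                continue
            seen.add(hit["id"])
            # Scores are deliberately left out so they cannot bias the curator.
            candidates.append({"id": hit["id"], "artist": hit.get("artist"), "source": source})
    return candidates


def enforce_diversity(
    selected: List[dict],
    max_per_artist: int = 3,
    max_per_source: int = 8,
    max_total: int = 20,
) -> List[dict]:
    """Keep items in order while respecting per-artist and per-source caps."""
    per_artist: Counter = Counter()
    per_source: Counter = Counter()
    kept = []
    for item in selected:
        if len(kept) >= max_total:
            break
        if per_artist[item["artist"]] >= max_per_artist:
            continue
        if per_source[item["source"]] >= max_per_source:
            continue
        per_artist[item["artist"]] += 1
        per_source[item["source"]] += 1
        kept.append(item)
    return kept
```

In this sketch the model's selection step would sit between the two functions, with `enforce_diversity` acting as a final check on its output.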
Uses Gemini 2.5 Flash to intelligently select and rank similar artworks:
- Cross-cultural connections: Discovers relationships across time periods and cultures (e.g., Gauguin's Tahitian Madonna with Renaissance Madonnas)
- Thematic relationships: Identifies shared subjects and motifs beyond surface similarities
- Visual intelligence: Considers composition, style, and emotional resonance
- Diversity-aware: Limits over-representation of single artists or similarity types
- Explainable: Each recommendation includes a brief explanation of the connection
See Example here: Holy Family with Saint Anne, French Painter (17th century)
For this example, relying only on metadata, the keyword search does a poor job of finding relevant similar artworks, pulling in various works by unknown "French Painter". Text & image embeddings results are better, especially with theme and style. The AI-curated results are perhaps best in my opinion, but I'm not an art historian and not familiar enough with the collection to make an educated judgment.
See TECHNICAL_GUIDE.md for technical details & setup instructions including prerequisites, environment configuration, and deployment steps.
- Musefully (website, github): Search across museums using Elasticsearch and Next.js
- “Accessible Art Tags” GPT: a specialized GPT that generates alt text and long descriptions following Cooper Hewitt Guidelines for Image Description.
- OpenAI CLIP Embedding Similarity: Examples of OpenAI CLIP Embeddings artwork similarity search.
- MuseRAG++: A Deep Retrieval-Augmented Generation Framework for Semantic Interaction and Multi-Modal Reasoning in Virtual Museums: RAG-powered museum chatbot
- National Museum of Norway Semantic Collection Search (Website, Article): Search via embeddings of GPT-4 Vision image descriptions.
- Semantic Art Search (Github, Website): Explore art through meaning-driven search
- Sketchy Collections (Github, Website): CLIP-based image search tool that lets you explore artworks by drawing or uploading a picture
MIT licensed. Museum data used according to The Metropolitan Museum of Art's open access policy.