TL;DR: An interactive notebook teaching when to use cosine vs. euclidean vs. manhattan vs. hamming distance through real examples. Includes a financial contracts dataset, hands-on experiments, and a production-ready decision framework. ~45 minutes to complete.
- The Story Behind This Project
- The Problem That Started It All
- The Eureka Moment
- What You'll Discover
- The Decision Framework
- Why This Matters
- What's Inside
- Ready to Start?
- Python 3.8+
- ~45 minutes of your time
- Curiosity about why your similarity search isn't working
- No advanced math background needed!
"Why does my similarity search keep returning weird results?"
That was me, four months ago, staring at my screen in frustration. I was building a document search system for financial contracts, and no matter what I tried, the "similar" documents it returned made no sense. A loan agreement would match with an insurance policy. A merger document would be "similar" to a simple purchase order.
I was throwing cosine distance at everything, hoping it would magically work. Spoiler alert: it didn't.
Picture this: You have thousands of financial contracts, each converted into a high-dimensional vector that supposedly captures its "meaning." You want to find similar contracts to help lawyers quickly locate relevant precedents. Sounds simple, right?
Wrong.
Different distance metrics tell completely different stories about what's "similar." And I learned this the hard way when my boss asked why our "AI-powered contract similarity system" thought a simple NDA was most similar to a complex derivatives trading agreement.
The breakthrough came when I stopped thinking about vectors as abstract mathematical objects and started thinking about them as people with preferences.
Meet Alice, Bob, and Carol - three friends with different interests:
- Alice: `[4, 0, 1]` → loves sports, doesn't read, likes movies
- Bob: `[3, 0, 1]` → likes sports, doesn't read, likes movies
- Carol: `[1, 3, 4]` → some sports, loves reading, loves movies
Just by looking at these numbers, you'd say Alice and Bob are most similar, right? They're both sports fans who don't read much.
But here's where it gets interesting: different distance metrics might disagree with your intuition. And that's exactly why understanding them matters for building real-world systems.
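You can check this intuition in a few lines. Here's a minimal sketch (using NumPy and SciPy, which I'm assuming you have installed; the vectors are the ones above) that compares Alice to Bob and Carol under two different metrics:

```python
import numpy as np
from scipy.spatial.distance import cosine, euclidean

# Preference vectors: [sports, reading, movies]
alice = np.array([4, 0, 1])
bob = np.array([3, 0, 1])
carol = np.array([1, 3, 4])

for name, v in [("Bob", bob), ("Carol", carol)]:
    # cosine() returns cosine *distance* (1 - cosine similarity)
    print(f"Alice vs {name}: euclidean={euclidean(alice, v):.3f}, "
          f"cosine={cosine(alice, v):.3f}")
```

For this particular trio both metrics agree that Alice and Bob are closest, but the *gaps* they report differ a lot, and with other vectors the rankings themselves can flip.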
Visualizing how Alice, Bob, and Carol's preferences translate into vectors with different magnitudes and directions
This isn't just another dry tutorial about mathematical formulas. This is the story of how I learned to choose the right tool for the job instead of using the same hammer for every nail.
Act I: The Simple Truth
- Start with Alice, Bob, and Carol
- Understand vectors, magnitude, and dot products through friendship
- See how different metrics "think" about similarity
Act II: The Real World
- Load actual financial contracts from HuggingFace
- Transform 1,024-dimensional document vectors into insights
- Watch each metric tell a different story about the same data
Real financial contracts transformed into high-dimensional vectors: each document becomes a point in 1,024-dimensional space
Act III: The Five Warriors
Each distance metric has its own personality:
- Cosine Distance: The text whisperer (ignores length, focuses on meaning)
- Dot Product: The magnitude lover (bigger = more important)
- Euclidean Distance: The geometric purist (straight line distance)
- Manhattan Distance: The city navigator (robust to outliers)
- Hamming Distance: The binary specialist (counts exact differences)
See how each metric "thinks" differently about the same data: Alice, Bob, and Carol's similarity rankings change dramatically!
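To make the five personalities concrete, here's a quick sketch that evaluates all of them on the same pair of vectors (SciPy assumed; note that SciPy's `hamming` expects comparable, typically binary, inputs, so I binarize the vectors first):

```python
import numpy as np
from scipy.spatial.distance import cosine, euclidean, cityblock, hamming

a = np.array([4, 0, 1])
b = np.array([3, 0, 1])

print("Cosine distance:   ", round(cosine(a, b), 4))     # direction only, ignores length
print("Dot product:       ", float(a @ b))               # similarity, magnitude-sensitive
print("Euclidean distance:", round(euclidean(a, b), 4))  # straight-line distance
print("Manhattan distance:", float(cityblock(a, b)))     # sum of absolute differences
print("Hamming distance:  ", hamming(a > 0, b > 0))      # fraction of differing positions
```

Note that the dot product is a *similarity* (bigger = closer) while the other four are *distances* (smaller = closer), which is exactly the kind of detail that bites you in production.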
Act IV: Interactive Experiments
- Play with real data using interactive widgets
- See how each metric ranks similarity differently
- Understand why your search results were so weird
Act V: The Advanced Insights
- Deep dive into what the data really shows
- Analyze distance distributions and patterns
- Learn the secrets that took me months to figure out
Deep dive into vector distributions, magnitudes, and similarity patterns in high-dimensional space
How different distance metrics behave across thousands of financial contracts: the patterns reveal everything!
By the end of this journey, you'll have the decision framework I use in production systems:
- Text/Documents? → Cosine Distance (your new best friend)
- Neural Networks? → Dot Product (fast and meaningful)
- Images/Spatial Data? → Euclidean Distance (classic for a reason)
- High-Dimensional/Robust? → Manhattan Distance (the reliable workhorse)
- Binary/Categorical? → Hamming Distance (simple and fast)
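The framework above is simple enough to encode directly. This is an illustrative helper (my own naming, not part of the notebook's code) showing how you might wire the defaults into a system:

```python
def pick_metric(data_kind: str) -> str:
    """Map a data type to a sensible default distance metric.

    Illustrative sketch of the decision framework; real systems
    should still validate the choice against their own data.
    """
    framework = {
        "text": "cosine",              # documents/embeddings: direction, not length
        "neural_embeddings": "dot_product",  # fast, magnitude-aware
        "spatial": "euclidean",        # images and geometric data
        "high_dimensional": "manhattan",     # more robust to outliers
        "binary": "hamming",           # categorical/binary features
    }
    # Cosine is a reasonable fallback for most embedding workloads
    return framework.get(data_kind, "cosine")

print(pick_metric("text"))    # cosine
print(pick_metric("binary"))  # hamming
```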
Whether you're:
- Building a recommendation system
- Creating a semantic search engine
- Working with embeddings in ML
- Trying to understand why your similarity search isn't working
- Just curious about how vector databases actually work
This story will save you the months of trial-and-error I went through.
```
distance_metrics_in_vector_search/
├── financial_contracts_analysis.ipynb   # The complete interactive story
├── requirements.txt                     # All dependencies you need
├── README.md                            # This guide you're reading
├── GETTING_STARTED.md                   # Quick setup instructions
└── venv/                                # Virtual environment (after setup)
```
Open `financial_contracts_analysis.ipynb` and follow along with the story. Each cell builds on the previous one, just like chapters in a book. By the end, you'll look at distance metrics the way I do now - as tools with personalities, each perfect for different jobs.
Fair warning: There will be math, but I promise it's the kind that makes sense when you see it in action with real examples.
Found this helpful? Have your own distance metric war stories? Questions about which metric to use for your specific use case?
Open an issue - I love helping people solve their similarity search problems. We've all been there, staring at confusing results and wondering what went wrong.
- Why cosine distance is perfect for text but terrible for other tasks
- When dot product similarity actually makes sense
- How Manhattan distance can save you from outliers
- The secret patterns in high-dimensional spaces
- A decision framework that actually works in production
The biggest revelation? There's no "best" distance metric. Each one is like a different lens for looking at your data. The magic happens when you know which lens to use when.
Four months ago, I was randomly trying different metrics hoping something would work. Today, I can look at a dataset and immediately know which metric will give me the insights I need.
That transformation is what this project is really about.
"The best way to understand distance metrics isn't through formulas - it's through stories, experiments, and seeing them work (or fail) with real data."
Ready to become a distance metrics detective?