TL;DR: An interactive notebook teaching when to use cosine vs. euclidean vs. manhattan vs. hamming distance through real examples. Includes a financial contracts dataset, hands-on experiments, and a production-ready decision framework. ~45 minutes to complete.
- The Story Behind This Project
- The Problem That Started It All
- The Eureka Moment
- What You'll Discover
- The Decision Framework
- Why This Matters
- What's Inside
- Ready to Start?
- Python 3.8+
- ~45 minutes of your time
- Curiosity about why your similarity search isn't working
- No advanced math background needed!
"Why does my similarity search keep returning weird results?"
That was me, four months ago, staring at my screen in frustration. I was building a document search system for financial contracts, and no matter what I tried, the "similar" documents it returned made no sense. A loan agreement would match with an insurance policy. A merger document would be "similar" to a simple purchase order.
I was throwing cosine distance at everything, hoping it would magically work. Spoiler alert: it didn't.
Picture this: You have thousands of financial contracts, each converted into a high-dimensional vector that supposedly captures its "meaning." You want to find similar contracts to help lawyers quickly locate relevant precedents. Sounds simple, right?
Wrong.
Different distance metrics tell completely different stories about what's "similar." And I learned this the hard way when my boss asked why our "AI-powered contract similarity system" thought a simple NDA was most similar to a complex derivatives trading agreement.
The breakthrough came when I stopped thinking about vectors as abstract mathematical objects and started thinking about them as people with preferences.
Meet Alice, Bob, and Carol - three friends with different interests:
- Alice: `[4, 0, 1]` → loves sports, doesn't read, likes movies
- Bob: `[3, 0, 1]` → likes sports, doesn't read, likes movies
- Carol: `[1, 3, 4]` → some sports, loves reading, loves movies
Just by looking at these numbers, you'd say Alice and Bob are most similar, right? They're both sports fans who don't read much.
But here's where it gets interesting: different distance metrics might disagree with your intuition. And that's exactly why understanding them matters for building real-world systems.
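You can check this intuition in a few lines. Here's a minimal sketch (using NumPy and SciPy, which I'm assuming you have installed; the vectors are the ones above) that compares Alice to Bob and Carol under two different metrics:

```python
import numpy as np
from scipy.spatial.distance import cosine, euclidean

# Preference vectors: [sports, reading, movies]
alice = np.array([4, 0, 1])
bob = np.array([3, 0, 1])
carol = np.array([1, 3, 4])

for name, v in [("Bob", bob), ("Carol", carol)]:
    # cosine() returns cosine *distance* (1 - cosine similarity)
    print(f"Alice vs {name}: euclidean={euclidean(alice, v):.3f}, "
          f"cosine={cosine(alice, v):.3f}")
```

For this particular trio both metrics agree that Alice and Bob are closest, but the *gaps* they report differ a lot, and with other vectors the rankings themselves can flip.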
Visualizing how Alice, Bob, and Carol's preferences translate into vectors with different magnitudes and directions
This isn't just another dry tutorial about mathematical formulas. This is the story of how I learned to choose the right tool for the job instead of using the same hammer for every nail.
Act I: The Simple Truth
- Start with Alice, Bob, and Carol
- Understand vectors, magnitude, and dot products through friendship
- See how different metrics "think" about similarity
Act II: The Real World
- Load actual financial contracts from HuggingFace
- Transform 1,024-dimensional document vectors into insights
- Watch each metric tell a different story about the same data
Real financial contracts transformed into high-dimensional vectors: each document becomes a point in 1,024-dimensional space
Act III: The Five Warriors
Each distance metric has its own personality:
- Cosine Distance: The text whisperer (ignores length, focuses on meaning)
- Dot Product: The magnitude lover (bigger = more important)
- Euclidean Distance: The geometric purist (straight line distance)
- Manhattan Distance: The city navigator (robust to outliers)
- Hamming Distance: The binary specialist (counts exact differences)
See how each metric "thinks" differently about the same data: Alice, Bob, and Carol's similarity rankings change dramatically!
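To make the five personalities concrete, here's a quick sketch that evaluates all of them on the same pair of vectors (SciPy assumed; note that SciPy's `hamming` expects comparable, typically binary, inputs, so I binarize the vectors first):

```python
import numpy as np
from scipy.spatial.distance import cosine, euclidean, cityblock, hamming

a = np.array([4, 0, 1])
b = np.array([3, 0, 1])

print("Cosine distance:   ", round(cosine(a, b), 4))     # direction only, ignores length
print("Dot product:       ", float(a @ b))               # similarity, magnitude-sensitive
print("Euclidean distance:", round(euclidean(a, b), 4))  # straight-line distance
print("Manhattan distance:", float(cityblock(a, b)))     # sum of absolute differences
print("Hamming distance:  ", hamming(a > 0, b > 0))      # fraction of differing positions
```

Note that the dot product is a *similarity* (bigger = closer) while the other four are *distances* (smaller = closer), which is exactly the kind of detail that bites you in production.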
Act IV: Interactive Experiments
- Play with real data using interactive widgets
- See how each metric ranks similarity differently
- Understand why your search results were so weird
Act V: The Advanced Insights
- Deep dive into what the data really shows
- Analyze distance distributions and patterns
- Learn the secrets that took me months to figure out
Deep dive into vector distributions, magnitudes, and similarity patterns in high-dimensional space
How different distance metrics behave across thousands of financial contracts: the patterns reveal everything!
By the end of this journey, you'll have the decision framework I use in production systems:
- Text/Documents? → Cosine Distance (your new best friend)
- Neural Networks? → Dot Product (fast and meaningful)
- Images/Spatial Data? → Euclidean Distance (classic for a reason)
- High-Dimensional/Robust? → Manhattan Distance (the reliable workhorse)
- Binary/Categorical? → Hamming Distance (simple and fast)
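The framework above is simple enough to encode directly. This is an illustrative helper (my own naming, not part of the notebook's code) showing how you might wire the defaults into a system:

```python
def pick_metric(data_kind: str) -> str:
    """Map a data type to a sensible default distance metric.

    Illustrative sketch of the decision framework; real systems
    should still validate the choice against their own data.
    """
    framework = {
        "text": "cosine",              # documents/embeddings: direction, not length
        "neural_embeddings": "dot_product",  # fast, magnitude-aware
        "spatial": "euclidean",        # images and geometric data
        "high_dimensional": "manhattan",     # more robust to outliers
        "binary": "hamming",           # categorical/binary features
    }
    # Cosine is a reasonable fallback for most embedding workloads
    return framework.get(data_kind, "cosine")

print(pick_metric("text"))    # cosine
print(pick_metric("binary"))  # hamming
```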
Whether you're:
- Building a recommendation system
- Creating a semantic search engine
- Working with embeddings in ML
- Trying to understand why your similarity search isn't working
- Just curious about how vector databases actually work
This story will save you the months of trial-and-error I went through.
```
distance_metrics_in_vector_search/
├── financial_contracts_analysis.ipynb   # The complete interactive story
├── requirements.txt                     # All dependencies you need
├── README.md                            # This guide you're reading
├── GETTING_STARTED.md                   # Quick setup instructions
└── venv/                                # Virtual environment (after setup)
```
Open `financial_contracts_analysis.ipynb` and follow along with the story. Each cell builds on the previous one, just like chapters in a book. By the end, you'll look at distance metrics the way I do now - as tools with personalities, each perfect for different jobs.
Fair warning: There will be math, but I promise it's the kind that makes sense when you see it in action with real examples.
Found this helpful? Have your own distance metric war stories? Questions about which metric to use for your specific use case?
Open an issue - I love helping people solve their similarity search problems. We've all been there, staring at confusing results and wondering what went wrong.
- Why cosine distance is perfect for text but terrible for other tasks
- When dot product similarity actually makes sense
- How Manhattan distance can save you from outliers
- The secret patterns in high-dimensional spaces
- A decision framework that actually works in production
The biggest revelation? There's no "best" distance metric. Each one is like a different lens for looking at your data. The magic happens when you know which lens to use when.
Four months ago, I was randomly trying different metrics hoping something would work. Today, I can look at a dataset and immediately know which metric will give me the insights I need.
That transformation is what this project is really about.
"The best way to understand distance metrics isn't through formulas - it's through stories, experiments, and seeing them work (or fail) with real data."
Ready to become a distance metrics detective?