@are-ces are-ces commented Aug 26, 2025

What does this PR do?

When running RAG against multiple vector DBs, it can be difficult to trace where retrieved chunks come from. This PR adds vector_db_id to each chunk’s metadata, making it clear which database a given chunk came from. This helps with debugging and with analyzing retrieval behavior across multiple DBs.

Relevant code:

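chunks = []
scores = []

# Tag each chunk with the ID of the vector DB it was retrieved from.
for vector_db_id, result in zip(vector_db_ids, results):
    for chunk, score in zip(result.chunks, result.scores):
        # Some chunks may have no metadata dict yet; create one before writing to it.
        if not hasattr(chunk, "metadata") or chunk.metadata is None:
            chunk.metadata = {}
        chunk.metadata["vector_db_id"] = vector_db_id

        chunks.append(chunk)
        scores.append(score)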
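
To try the loop above in isolation, here is a minimal, self-contained sketch. Chunk and QueryResult are hypothetical stand-ins for the real Llama Stack types (they are not the project's classes), and tag_chunks is an illustrative helper name, not a function from this PR.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Chunk:
    # Hypothetical stand-in for the real chunk type.
    content: str
    metadata: Optional[dict[str, Any]] = None


@dataclass
class QueryResult:
    # Hypothetical stand-in for QueryChunksResponse.
    chunks: list[Chunk] = field(default_factory=list)
    scores: list[float] = field(default_factory=list)


def tag_chunks(vector_db_ids, results):
    """Flatten per-DB results, recording each chunk's source DB in its metadata."""
    chunks, scores = [], []
    for vector_db_id, result in zip(vector_db_ids, results):
        for chunk, score in zip(result.chunks, result.scores):
            if chunk.metadata is None:
                chunk.metadata = {}
            chunk.metadata["vector_db_id"] = vector_db_id
            chunks.append(chunk)
            scores.append(score)
    return chunks, scores


# One result per vector DB, in the same order as the IDs.
results = [
    QueryResult([Chunk("chunk from db1")], [0.9]),
    QueryResult([Chunk("chunk from db2", {"source": "test_source2"})], [0.7]),
]
chunks, scores = tag_chunks(["db1", "db2"], results)
```

Existing metadata keys are preserved; only the vector_db_id key is added or overwritten.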

Test Plan

  • Ran Llama Stack in debug mode.
  • Verified that vector_db_id was added to each chunk’s metadata.
  • Confirmed that the metadata was printed in the console when using the RAG tool.

meta-cla bot commented Aug 26, 2025

Hi @are-ces!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@are-ces changed the title from "Add vector_db_id to chunk metadata" to "feat: Add vector_db_id to chunk metadata" on Aug 26, 2025
@@ -131,8 +131,15 @@ async def query(
    for vector_db_id in vector_db_ids
]
results: list[QueryChunksResponse] = await asyncio.gather(*tasks)
chunks = [c for r in results for c in r.chunks]
scores = [s for r in results for s in r.scores]

Collaborator

Can you add a unit test for this with multiple vector DBs to confirm the behavior works?

Author

Unit test added, thank you!
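
A test along the lines requested might look like the sketch below. It is not the test added in the PR: fake_query and query_all are hypothetical helpers, and SimpleNamespace objects stand in for the real chunk and response types. The point it illustrates is that asyncio.gather returns results in the order the awaitables were passed, even when a later query finishes first, which is what makes zipping vector_db_ids with results safe.

```python
import asyncio
from types import SimpleNamespace


async def fake_query(vector_db_id: str, delay: float):
    # Hypothetical stand-in for a per-DB query: one chunk, one score.
    await asyncio.sleep(delay)
    chunk = SimpleNamespace(content=f"chunk from {vector_db_id}", metadata=None)
    return SimpleNamespace(chunks=[chunk], scores=[1.0])


async def query_all(vector_db_ids):
    # Earlier DBs get longer delays, so they complete last.
    n = len(vector_db_ids)
    tasks = [
        fake_query(db_id, 0.01 * (n - i)) for i, db_id in enumerate(vector_db_ids)
    ]
    # gather() preserves the order of its arguments, not completion order.
    results = await asyncio.gather(*tasks)

    chunks = []
    for vector_db_id, result in zip(vector_db_ids, results):
        for chunk in result.chunks:
            if chunk.metadata is None:
                chunk.metadata = {}
            chunk.metadata["vector_db_id"] = vector_db_id
            chunks.append(chunk)
    return chunks


def test_chunks_are_tagged_with_their_source_db():
    chunks = asyncio.run(query_all(["db1", "db2"]))
    for chunk in chunks:
        # Each chunk's content names the DB that produced it.
        assert chunk.content == f"chunk from {chunk.metadata['vector_db_id']}"
    return chunks


tagged = test_chunks_are_tagged_with_their_source_db()
```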

Collaborator

@franciscojavierarceo franciscojavierarceo left a comment

thanks for this, requesting an added test.

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Aug 27, 2025
# Parse metadata from query result
def parse_metadata(s):
    import ast, re
    match = re.search(r"Metadata:\s*(\{.*\})", s)
Collaborator

why use this and the if statement below? you should just be able to do the eval and evaluate the keys, no?

Author

Because the TextContentItem returned looks like this:
TextContentItem(type='text', text="Result 2\nContent: chunk from db2\nMetadata: {'chunk_id': 'chunk2', 'document_id': 'doc2', 'source': 'test_source2', 'vector_db_id': 'db2'}\n")
the Metadata block needs to be explicitly parsed. That’s why I added the regex match and guard.

That said, I might not be fully understanding your point. Did you mean this?

Collaborator

Ah, I see, the text contains the Metadata JSON as a string! Cool, this makes sense.
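
For reference, the parsing discussed in this thread can be sketched as follows. Only the first lines of the PR's helper are visible above, so the rest is an assumption: this version uses ast.literal_eval (the thread only says "eval") and returns an empty dict when no Metadata block is found. The sample text is the TextContentItem text quoted earlier in the thread.

```python
import ast
import re


def parse_metadata(text: str) -> dict:
    """Extract the dict that follows 'Metadata:' in a RAG tool result string."""
    # '.' does not match newlines, so the match stays on the Metadata line.
    match = re.search(r"Metadata:\s*(\{.*\})", text)
    if match is None:
        # Guard: the text has no Metadata block to parse.
        return {}
    # literal_eval accepts only Python literals, unlike eval().
    return ast.literal_eval(match.group(1))


text = (
    "Result 2\nContent: chunk from db2\n"
    "Metadata: {'chunk_id': 'chunk2', 'document_id': 'doc2', "
    "'source': 'test_source2', 'vector_db_id': 'db2'}\n"
)
metadata = parse_metadata(text)
```

The metadata is rendered with single quotes (a Python dict repr), so json.loads would reject it; that is one reason to parse it as a Python literal instead.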

Collaborator

@franciscojavierarceo franciscojavierarceo left a comment

lgtm

Collaborator

@franciscojavierarceo franciscojavierarceo left a comment

please resolve pre-commit.

Author

are-ces commented Sep 2, 2025

This PR has been closed because the branch got contaminated and/or overwritten.
I’ve created a new PR from a clean branch that contains only the relevant commits: here.
Please review the new PR instead.
