feat: Add vector_db_id to chunk metadata #3255
Hi @are-ces! Thank you for your pull request and welcome to our community.
Action Required
In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.
Process
In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.
Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged accordingly.
If you have received this in error or have any questions, please contact us at [email protected]. Thanks!
@@ -131,8 +131,15 @@ async def query(
    for vector_db_id in vector_db_ids
]
results: list[QueryChunksResponse] = await asyncio.gather(*tasks)
chunks = [c for r in results for c in r.chunks]
scores = [s for r in results for s in r.scores]
can you add a unit test for this with multiple vector DBs to confirm this behavior will work?
Unit test added, thank you!
thanks for this, requesting an added test.
tests/unit/rag/test_rag_query.py
Outdated
# Parse metadata from query result
def parse_metadata(s):
    import ast, re
    match = re.search(r"Metadata:\s*(\{.*\})", s)
why use this and the if statement below? you should just be able to do the eval and evaluate the keys, no?
Because the TextContentItem returned looks like this:
TextContentItem(type='text', text="Result 2\nContent: chunk from db2\nMetadata: {'chunk_id': 'chunk2', 'document_id': 'doc2', 'source': 'test_source2', 'vector_db_id': 'db2'}\n")
the Metadata block needs to be explicitly parsed. That’s why I added the regex match and guard.
That said, I might not be fully understanding your point, did you mean this?
Ah, I see the text contains the Metadata JSON as a string! Cool, this makes sense.
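For readers following the exchange above, here is a minimal sketch of the parsing step being discussed. It reuses the regex quoted in the review and the sample text from the reply. The `{}` fallback for a missing Metadata block is an assumption, since the guard from the original test is not shown in this thread. Note that the Metadata payload is a Python dict literal with single quotes, so `ast.literal_eval` is used rather than a JSON parser.

```python
import ast
import re


def parse_metadata(text: str) -> dict:
    """Extract the dict that follows 'Metadata:' in a rendered chunk string."""
    # '.' does not cross newlines by default, so this captures from the first
    # '{' to the last '}' on the Metadata line only.
    match = re.search(r"Metadata:\s*(\{.*\})", text)
    if match is None:
        # Assumption: return an empty dict when no Metadata block is present.
        return {}
    # literal_eval only evaluates Python literals, never arbitrary expressions.
    return ast.literal_eval(match.group(1))


# Sample text copied from the TextContentItem quoted in the thread.
sample = (
    "Result 2\nContent: chunk from db2\n"
    "Metadata: {'chunk_id': 'chunk2', 'document_id': 'doc2', "
    "'source': 'test_source2', 'vector_db_id': 'db2'}\n"
)
metadata = parse_metadata(sample)
print(metadata["vector_db_id"])  # prints: db2
```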
lgtm
please resolve pre-commit.
This PR has been closed because the branch got contaminated and/or overwritten.
What does this PR do?
When running RAG against multiple vector DBs, it can be difficult to trace where retrieved chunks originate. This PR adds vector_db_id to each chunk’s metadata, making it easier to see which database a given chunk came from. This is helpful for debugging and for analyzing retrieval behavior across multiple DBs.
Relevant code:
Test Plan
vector_db_id was added to each chunk’s metadata.
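The idea described above can be sketched roughly as follows. This is not the PR's actual code: `Chunk`, `fake_query`, and `query_all` are hypothetical stand-ins, and `QueryChunksResponse` here is a simplified dataclass that only borrows the name from the diff. The sketch relies on `asyncio.gather` returning results in the same order as the awaitables it was given, which lets each response be paired with the DB ID that produced it.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Chunk:  # hypothetical stand-in for the real chunk type
    content: str
    metadata: dict = field(default_factory=dict)


@dataclass
class QueryChunksResponse:  # simplified stand-in; name borrowed from the diff
    chunks: list
    scores: list


async def fake_query(vector_db_id: str) -> QueryChunksResponse:
    # Hypothetical replacement for a real vector DB lookup.
    return QueryChunksResponse(
        chunks=[Chunk(content=f"chunk from {vector_db_id}")],
        scores=[1.0],
    )


async def query_all(vector_db_ids: list[str]):
    tasks = [fake_query(db_id) for db_id in vector_db_ids]
    results = await asyncio.gather(*tasks)
    chunks, scores = [], []
    # gather() preserves input order, so zip pairs each response with its DB ID.
    for db_id, result in zip(vector_db_ids, results):
        for chunk, score in zip(result.chunks, result.scores):
            chunk.metadata["vector_db_id"] = db_id  # record the originating DB
            chunks.append(chunk)
            scores.append(score)
    return chunks, scores


chunks, scores = asyncio.run(query_all(["db1", "db2"]))
print([c.metadata["vector_db_id"] for c in chunks])  # prints: ['db1', 'db2']
```

A test along the lines requested in review would query at least two DBs, as here, and assert that each returned chunk carries the ID of the DB it came from.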