feat: parsing directly to a protobuf serialised file #122
Conversation
Reviewer's Guide

This PR integrates a direct protobuf serialization pipeline by defining new ingestion/query protocols, adding the protobuf schema and generated code, implementing a ProtobufFileIngestor that buffers and writes graph nodes/relationships to a .proto binary file, and updating configurations and tests to support the change.

Class diagram for ProtobufFileIngestor and Protocols

classDiagram
class ProtobufFileIngestor {
+output_path: Path
+_nodes: dict[str, pb.Node]
+_relationships: dict[tuple[str, int, str], pb.Relationship]
+LABEL_TO_ONEOF_FIELD: dict[str, str]
+ONEOF_FIELD_TO_LABEL: dict[str, str]
+__init__(output_path: str)
+ensure_node_batch(label: str, properties: dict[str, Any])
+ensure_relationship_batch(from_spec, rel_type, to_spec, properties)
+flush_all()
-_get_node_id(label: str, properties: dict)
}
class IngestorProtocol {
<<Protocol>>
+ensure_node_batch(label: str, properties: dict[str, Any])
+ensure_relationship_batch(from_spec, rel_type, to_spec, properties)
+flush_all()
}
class QueryProtocol {
<<Protocol>>
+fetch_all(query: str, params: dict[str, Any] | None)
+execute_write(query: str, params: dict[str, Any] | None)
}
ProtobufFileIngestor ..|> IngestorProtocol : implements
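As a quick orientation, here is a minimal usage sketch of the ingestor as described by the diagram above. This is not code from the PR: the module path and the (label, key_field, key_value) shape of the relationship specs are inferred from the review comments further down and may differ from the final implementation.

```python
# Hedged sketch of the expected call pattern; module path and spec-tuple shape are assumptions.
from codebase_rag.services.protobuf_service import ProtobufFileIngestor

ingestor = ProtobufFileIngestor("out/graph.bin")

# Buffer nodes; each call stores a pb.Node keyed by its primary-key property.
ingestor.ensure_node_batch("Class", {"qualified_name": "test_project.UserService"})
ingestor.ensure_node_batch(
    "Method", {"qualified_name": "test_project.UserService.get_user"}
)

# Buffer a relationship; the specs appear to be (label, key_field, key_value) tuples.
ingestor.ensure_relationship_batch(
    ("Class", "qualified_name", "test_project.UserService"),
    "DEFINES_METHOD",
    ("Method", "qualified_name", "test_project.UserService.get_user"),
    None,
)

# Serialize the buffered GraphCodeIndex to the output file in one shot.
ingestor.flush_all()
```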
File-Level Changes
Summary of Changes
Hello @jkumz, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a significant new feature that enables the direct serialization of code graph data into a Protobuf binary file. This change enhances data handling efficiency by allowing graph nodes and relationships to be processed and stored in a highly optimized format, suitable for rapid extraction and transmission. It bypasses the traditional graph construction phase, streamlining the overall data pipeline. The update includes the definition of a new Protobuf schema, the implementation of a dedicated ingestor class, and the necessary adjustments to development tooling and dependencies to support this new serialization method.
Highlights
- New Protobuf Ingestor: This PR introduces a new `ProtobufFileIngestor` class, which allows for the direct parsing and serialization of graph nodes and relationships into a compact Protobuf binary file. This approach aims to provide a more efficient method for data extraction and transmission, bypassing the need to construct an in-memory graph first.
- Protobuf Schema Definition: A comprehensive Protobuf schema (`codec/schema.proto`) has been defined. This schema specifies the structure for `GraphCodeIndex` (the top-level container), generic `Node` and `Relationship` messages, and detailed messages for various code entities such as `Project`, `Package`, `Folder`, `File`, `Module`, `Class`, `Function`, `Method`, `ExternalPackage`, `ModuleImplementation`, and `ModuleInterface`.
- Pre-commit Hook Updates: The pre-commit configuration has been updated to accommodate the new Protobuf-generated files. Specifically, `ruff` and `ruff-format` now exclude `codec/schema_pb2.py` and `codec/schema_pb2.pyi` from linting, and `mypy` has been configured to include `types-protobuf` for proper type checking of Protobuf messages.
- Comprehensive Test Coverage: Two new test files (`test_protobuf_end_to_end.py` and `test_protobuf_service.py`) have been added. These tests validate the integrity of the Protobuf serialization pipeline, ensuring that the generated binary files are valid, non-empty, and can be correctly deserialized back into the expected data structures (a minimal read-back sketch follows this list).
- Dependency Management: The project's dependencies (`pyproject.toml` and `uv.lock`) have been updated to include `protobuf` for Protobuf serialization/deserialization and `types-protobuf` for type hinting support, ensuring all necessary libraries are available for the new functionality.
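For orientation, here is a minimal read-back sketch of the serialized output. It is not taken from the PR's tests: the import path `codec.schema_pb2` and the container field names (`nodes`, `relationships`) are assumptions based on the schema description above.

```python
# Hedged sketch: reading a serialized GraphCodeIndex back into memory.
# Assumes generated bindings at codec/schema_pb2.py and top-level repeated fields
# named `nodes` and `relationships` (adjust to the actual schema).
from codec import schema_pb2 as pb

index = pb.GraphCodeIndex()
with open("out/graph.bin", "rb") as f:
    index.ParseFromString(f.read())

print(f"nodes={len(index.nodes)} relationships={len(index.relationships)}")
```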
Hey there - I've reviewed your changes - here's some feedback:
- Rather than committing generated schema_pb2.py and .pyi stubs, consider automating protobuf compilation in CI to avoid merge conflicts and reduce noise.
- In ensure_node_batch, when node_id is empty you're silently skipping creation—consider logging a warning to surface missing primary key data.
- Add error handling around flush_all's file write to catch and log any I/O failures during serialization.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Rather than committing generated schema_pb2.py and .pyi stubs, consider automating protobuf compilation in CI to avoid merge conflicts and reduce noise.
- In ensure_node_batch, when node_id is empty you're silently skipping creation—consider logging a warning to surface missing primary key data.
- Add error handling around flush_all's file write to catch and log any I/O failures during serialization.
## Individual Comments
### Comment 1
<location> `codebase_rag/services/protobuf_service.py:74` </location>
<code_context>
+ if value is None:
+ continue
+ destination_attribute = getattr(payload_message, key)
+ if hasattr(destination_attribute, "extend") and isinstance(value, list):
+ # This is a repeated field. Use .extend() to populate it.
+ destination_attribute.extend(value)
</code_context>
<issue_to_address>
Check for tuple values in repeated fields.
Use `isinstance(value, (list, tuple))` to ensure tuples are also handled when extending repeated fields.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
if hasattr(destination_attribute, "extend") and isinstance(value, list):
    # This is a repeated field. Use .extend() to populate it.
    destination_attribute.extend(value)
=======
if hasattr(destination_attribute, "extend") and isinstance(value, (list, tuple)):
    # This is a repeated field. Use .extend() to populate it.
    destination_attribute.extend(value)
>>>>>>> REPLACE
</suggested_fix>
### Comment 2
<location> `codebase_rag/services/protobuf_service.py:113` </location>
<code_context>
+ )
+ rel.type = pb.Relationship.RelationshipType.RELATIONSHIP_TYPE_UNSPECIFIED
+
+ rel.source_id = str(from_spec[2])
+ rel.target_id = str(to_spec[2])
+
+ if rel.source_id.strip() == "" or rel.target_id.strip() == "":
</code_context>
<issue_to_address>
Consider validating the structure of from_spec and to_spec.
Currently, accessing from_spec[2] and to_spec[2] without validation may cause IndexError if the tuples are shorter than expected. Please add a check to ensure both have at least three elements before accessing.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
rel.source_id = str(from_spec[2])
rel.target_id = str(to_spec[2])
=======
if len(from_spec) < 3 or len(to_spec) < 3:
    logger.warning(
        f"Invalid relationship spec: from_spec={from_spec}, to_spec={to_spec}. Both must have at least 3 elements."
    )
    return
rel.source_id = str(from_spec[2])
rel.target_id = str(to_spec[2])
>>>>>>> REPLACE
</suggested_fix>
### Comment 3
<location> `codebase_rag/services/protobuf_service.py:122` </location>
<code_context>
+ )
+ return
+
+ if properties:
+ rel.properties.update(properties)
+
+ unique_key = (rel.source_id, rel.type, rel.target_id)
</code_context>
<issue_to_address>
Consider handling non-dict property values gracefully.
Validate that properties is a dict before calling update, or clarify the expected type in the documentation.
Suggested implementation:
```python
if properties:
    if isinstance(properties, dict):
        rel.properties.update(properties)
    else:
        logger.warning(
            f"Expected 'properties' to be a dict, got {type(properties).__name__}. Skipping properties update."
        )
```
```python
if properties:
    if isinstance(properties, dict):
        existing_rel.properties.update(properties)
    else:
        logger.warning(
            f"Expected 'properties' to be a dict, got {type(properties).__name__}. Skipping properties update."
        )
```
</issue_to_address>
### Comment 4
<location> `codebase_rag/services/protobuf_service.py:143` </location>
<code_context>
+
+ serialized_data = index.SerializeToString()
+
+ self.output_path.parent.mkdir(parents=True, exist_ok=True)
+ with open(self.output_path, "wb") as f:
+ f.write(serialized_data)
</code_context>
<issue_to_address>
Consider handling file write errors explicitly.
Wrap the file write operation in a try/except block to handle permission or disk errors, and provide a clear error message or log.
</issue_to_address>
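One possible shape for that guard is sketched below. This is an illustrative sketch only, not the PR's code: it reuses the `index`, `self.output_path`, and `logger` names from the quoted context, and the exact error message is an assumption.

```python
# Hedged sketch of an explicit error path around the final write in flush_all().
serialized_data = index.SerializeToString()
try:
    self.output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(self.output_path, "wb") as f:
        f.write(serialized_data)
except OSError as e:
    # Permission problems, missing volumes, and full disks all surface as OSError.
    logger.error(f"Failed to write protobuf index to {self.output_path}: {e}")
    raise
```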
### Comment 5
<location> `codebase_rag/tests/test_protobuf_service.py:47` </location>
<code_context>
+def test_protobuf_ingestor_serialization_and_deserialization(tmp_path: Path) -> None:
</code_context>
<issue_to_address>
Missing tests for error and edge cases in ProtobufFileIngestor.
Please add tests for scenarios like unknown labels, missing properties, duplicates, unknown relationship types, missing source/target IDs, and property merging to improve robustness and error handling.
Suggested implementation:
```python
def test_protobuf_ingestor_serialization_and_deserialization(tmp_path: Path) -> None:
    """
    Tests that the ProtobufFileIngestor correctly serializes data and that
    the output can be deserialized back to the original structure.
    """
    output_file = tmp_path / "test_graph.proto.bin"
    ingestor = ProtobufFileIngestor(str(output_file))
    for node_data in SAMPLE_NODES.values():
        ingestor.ensure_node_batch(
            str(node_data["label"]), cast(dict[str, Any], node_data["properties"])
        )


def test_protobuf_ingestor_error_and_edge_cases(tmp_path: Path) -> None:
    """
    Tests error and edge cases for ProtobufFileIngestor:
    - unknown labels
    - missing properties
    - duplicate nodes
    - unknown relationship types
    - missing source/target IDs
    - property merging
    """
    output_file = tmp_path / "edge_cases_graph.proto.bin"
    ingestor = ProtobufFileIngestor(str(output_file))

    # Unknown label
    with pytest.raises(ValueError):
        ingestor.ensure_node_batch("UnknownLabel", {"qualified_name": "foo"})

    # Missing properties
    with pytest.raises(KeyError):
        ingestor.ensure_node_batch("Class", {})  # Assuming 'qualified_name' is required

    # Duplicate nodes (should merge or raise, depending on implementation)
    node_props = {"qualified_name": "test_project.UserService"}
    ingestor.ensure_node_batch("Class", node_props)
    # Try to add duplicate
    ingestor.ensure_node_batch("Class", node_props)  # Should not raise, but should not duplicate

    # Unknown relationship type
    with pytest.raises(ValueError):
        ingestor.ensure_relationship_batch(
            ("Class", "qualified_name", "test_project.UserService"),
            "UNKNOWN_REL_TYPE",
            ("Method", "qualified_name", "test_project.UserService.get_user"),
            None,
        )

    # Missing source/target IDs
    with pytest.raises(KeyError):
        ingestor.ensure_relationship_batch(
            ("Class", "qualified_name", None),
            "DEFINES_METHOD",
            ("Method", "qualified_name", "test_project.UserService.get_user"),
            None,
        )
    with pytest.raises(KeyError):
        ingestor.ensure_relationship_batch(
            ("Class", "qualified_name", "test_project.UserService"),
            "DEFINES_METHOD",
            ("Method", "qualified_name", None),
            None,
        )

    # Property merging: add node with new property, ensure merge
    ingestor.ensure_node_batch(
        "Class", {"qualified_name": "test_project.UserService", "docstring": "A docstring"}
    )
    # Should merge 'docstring' into existing node
    node = ingestor._nodes["Class"]["test_project.UserService"]
    assert "docstring" in node
    assert node["docstring"] == "A docstring"
```
- Ensure that `pytest` is imported at the top of the file if not already present:
`import pytest`
- Adjust exception types (`ValueError`, `KeyError`) to match the actual implementation of `ProtobufFileIngestor` if they differ.
- If `ProtobufFileIngestor` does not raise on duplicates but merges, update the duplicate node test to assert correct merging instead of expecting an exception.
- If the internal structure (`_nodes`) is different, update the property merging assertion accordingly.
</issue_to_address>
if hasattr(destination_attribute, "extend") and isinstance(value, list):
    # This is a repeated field. Use .extend() to populate it.
    destination_attribute.extend(value)
suggestion: Check for tuple values in repeated fields. Use `isinstance(value, (list, tuple))` to ensure tuples are also handled when extending repeated fields.
Suggested change:
-    if hasattr(destination_attribute, "extend") and isinstance(value, list):
+    if hasattr(destination_attribute, "extend") and isinstance(value, (list, tuple)):
         # This is a repeated field. Use .extend() to populate it.
         destination_attribute.extend(value)
@jkumz I am not sure if this makes sense but please double check.
if properties:
    rel.properties.update(properties)
suggestion: Consider handling non-dict property values gracefully.
Validate that properties is a dict before calling update, or clarify the expected type in the documentation.
Suggested implementation:

if properties:
    if isinstance(properties, dict):
        rel.properties.update(properties)
    else:
        logger.warning(
            f"Expected 'properties' to be a dict, got {type(properties).__name__}. Skipping properties update."
        )

if properties:
    if isinstance(properties, dict):
        existing_rel.properties.update(properties)
    else:
        logger.warning(
            f"Expected 'properties' to be a dict, got {type(properties).__name__}. Skipping properties update."
        )
@jkumz address this as well, please.
Code Review
This pull request introduces a significant new feature for ingesting codebase analysis results into a Protobuf serialized file. The implementation is well-structured, with clear separation of concerns, new protocols for ingestors, and comprehensive tests. The Protobuf schema is well-designed.
My review focuses on improving the robustness and maintainability of the new `ProtobufFileIngestor`. I've identified a critical bug related to populating `Struct` properties that would cause a runtime error. I've also suggested refactoring the property population logic to use standard library functions for better robustness and simpler code. Additionally, there are a couple of smaller suggestions to improve type safety and schema clarity.
Overall, this is a great addition. Addressing these points will make the new ingestion pipeline more reliable and easier to maintain.
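For context on the Struct issue flagged above: protobuf composite (message-typed) fields, including `google.protobuf.Struct`, cannot be replaced by plain assignment from Python and must be populated in place. The standalone illustration below uses the well-known `Value` type rather than this PR's generated classes.

```python
# Minimal illustration of the Struct assignment pitfall; not this PR's actual code.
from google.protobuf import struct_pb2

value = struct_pb2.Value()
# Direct assignment to a composite (message-typed) field is rejected by protobuf:
#     value.struct_value = {"name": "UserService"}   # raises AttributeError
# Composite fields must be populated in place instead:
value.struct_value.update({"name": "UserService", "line": 42})
print(value.struct_value["name"])  # -> "UserService"
```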
    Package package = 2;
    Folder folder = 3;
    Module module = 4;
    Class class = 5;
While `class` is a valid field name in Protobuf, it's a reserved keyword in Python and many other languages. The Protobuf compiler renames it (e.g., to `class_`), but this can be a source of confusion and potential issues with tooling.
To improve clarity and maintainability, consider renaming this field to something that isn't a keyword, like `class_payload`. This change would also require updating the `LABEL_TO_ONEOF_FIELD` mapping in `ProtobufFileIngestor`.
Suggested change:
-    Class class = 5;
+    Class class_payload = 5;
@jkumz is this really an issue? If it is safe to leave it defined as 'class', then I am okay with this, and we can resolve this.
        self._relationships: dict[tuple[str, int, str], pb.Relationship] = {}
        logger.info(f"ProtobufFileIngestor initialized to write to: {self.output_path}")

    def _get_node_id(self, label: str, properties: dict) -> str:
The `properties` argument is missing a type hint. Based on the `mypy` configuration (`disallow_untyped_defs = true`), this should be typed. Using `dict[str, Any]` would be consistent with its usage in `ensure_node_batch` and improve type safety.
Suggested change:
-    def _get_node_id(self, label: str, properties: dict) -> str:
+    def _get_node_id(self, label: str, properties: dict[str, Any]) -> str:
@jkumz address this as well, please
for key, value in properties.items():
    if hasattr(payload_message, key):
        if value is None:
            continue
        destination_attribute = getattr(payload_message, key)
        if hasattr(destination_attribute, "extend") and isinstance(value, list):
            # This is a repeated field. Use .extend() to populate it.
            destination_attribute.extend(value)
        else:
            # This is a scalar field. Use a simple assignment.
            setattr(payload_message, key, value)
The current logic for populating protobuf messages is complex and could be brittle. It manually checks for repeated fields and might not handle all edge cases correctly (e.g., non-list values for repeated fields).
Using `google.protobuf.json_format.ParseDict` would simplify this logic, make it more robust, and align with standard practices for converting dictionaries to protobuf messages. This change would improve maintainability.
You'll need to add `from google.protobuf.json_format import ParseDict` at the top of the file.
try:
    # Filter out None values, as ParseDict can error on them for scalar fields.
    valid_properties = {k: v for k, v in properties.items() if v is not None}
    ParseDict(valid_properties, payload_message, ignore_unknown_fields=True)
except Exception as e:
    logger.error(
        f"Failed to parse properties for label '{label}' into protobuf message: {e}\n"
        f"Properties: {valid_properties}"
    )
    return
@jkumz, please implement this suggestion and mark it as done.
Forgot to mention: parsing directly to Protobuf eliminates the idempotency provided by the MERGE calls, so any downstream consumers will need to deal with this accordingly. There are also more relationships in the proto file than in the graph, again as a result of Memgraph processing: when dumping from Memgraph, the ingestion was filtered by MATCH clauses, so it was not possible to connect a node to a non-existing node; nodes had to exist before MERGING. GraphUpdater only creates Module nodes for files it finds physically within the repository, so it skips Modules for external libraries, which implicitly filters out relationships that would connect those never-created nodes. That implicit filtering isn't in place when parsing straight to Protobuf instead of into Memgraph.

In my testing on the Pydantic library I got ±24K relationships with the Protobuf route and ±18K with the Memgraph dump to JSON. Mostly there are extra IMPORTS and INHERITS relationships to items from external libraries, whether their Modules exist in the directory or not. If someone wants to ingest into Memgraph from the protobuf file in the future and treat it as a way to produce compact snapshots, this isn't an issue, as the same idempotency and filtering can be applied; but for external processing it's worth noting there will be some extra work to do. I think this is fair enough, as it's better to have an excess of info than a lack of info in the case of external processing.
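For downstream consumers that want Memgraph-like semantics, a rough post-processing sketch is shown below. It is illustrative only: the `GraphCodeIndex` field names (`nodes`, `relationships`, `id`, `source_id`, `target_id`, `type`) are assumptions based on the schema and review comments in this PR, so adjust them to the actual generated code.

```python
# Hedged sketch: re-apply the MERGE-style dedup and node-existence filtering downstream.
# Field names are assumptions; rel.source_id/target_id/type follow the review comments above.
from codec import schema_pb2 as pb

index = pb.GraphCodeIndex()
with open("out/graph.bin", "rb") as f:
    index.ParseFromString(f.read())

known_ids = {node.id for node in index.nodes}
seen: set[tuple[str, int, str]] = set()
kept = []
for rel in index.relationships:
    key = (rel.source_id, rel.type, rel.target_id)
    # Drop relationships pointing at nodes that were never created (e.g. external
    # modules) and collapse duplicates, as repeated MERGE calls would have done.
    if rel.source_id in known_ids and rel.target_id in known_ids and key not in seen:
        seen.add(key)
        kept.append(rel)
print(f"kept {len(kept)} of {len(index.relationships)} relationships")
```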
This is an incredible contribution, @jkumz. I will review this PR as soon as possible.
Thanks! Let me know what needs to change to get it merged and I'll be happy to sort it out.
While I review this, can you please clear all AI bot comments if they make sense to you? Thanks
@jkumz Left some comments and suggestions. I think we need one more round of refinement, and we are good to merge.
Please see my comments.
Hey @jkumz, are you going to submit the changes? Just bumping this...
Summary by Sourcery
Introduce Protobuf-based ingestion by defining ingestion/query protocols, generating and integrating a Protobuf schema with Python bindings, and implementing ProtobufFileIngestor for efficient serialization of graph data. Update dependencies, linting, and CI configurations and add tests to verify the new pipeline.
New Features:
Enhancements:
CI:
Tests: