[feat][pip] PIP-420: Provide ability for Pulsar clients to integrate with third-party schema registry service #24328
base: master
This PIP looks meaningful.
The Pulsar client should be able to access a third-party schema registry service to manage schemas (register schema,
get schema, validate schema, etc.). The schema registry service can be an independent service; when a third-party schema registry service is used,
the Pulsar broker doesn't need to care about the schema of the messages.
The specific meaning of "not caring" could be explained in more detail. For example: the broker treats the message as raw bytes and no longer performs additional processing such as schema compatibility checks. This design makes the broker more lightweight, which can significantly improve the overall performance of the system.
The advantages of third-party schema registry services also deserve further elaboration. For example:
- Confluent Schema Registry, for instance, can provide unified schema management across Kafka and Pulsar.
- Such a service can also enable coordinated management of Pulsar topics and data lake table metadata.
Schema is an important feature for messaging systems. Pulsar integrates the schema manager into the Pulsar broker.
The current implementation in Pulsar clients couples schema management with some protocols (creating a producer, adding a consumer subscription).
This increases the Pulsar protocol complexity, and users can’t leverage third-party schema registry services in the Pulsar client.
It might be worth mentioning `SchemaStorage` here:
pulsar/pulsar-common/src/main/java/org/apache/pulsar/common/protocol/schema/SchemaStorage.java
Line 29 in bdf6277
public interface SchemaStorage {
Support for third-party schema registry services can also be implemented through `SchemaStorage`. The Motivation section that follows could explain what advantages this PIP has over implementing `SchemaStorage`, which would make the PIP more persuasive.
Thanks for the reminder. `SchemaStorage` is designed to manage schemas on the server side. This PIP mainly provides the ability to access a third-party schema registry service on the Pulsar client side, so it's not an alternative implementation.
The Pulsar client should ignore the schema information when creating a producer and adding a consumer subscription.

Users can implement the `SchemaInfoProvider` interface and the `Schema` interface to access an external schema registry service.
The `Schema` interface has two main methods, `encode` and `decode`; customized schemas can register or get schemas within these methods.
What should happen when the `encode` method of a custom `Schema` fails while registering the schema with an external schema registry (e.g., due to network issues or an authentication failure), or when the `decode` method cannot find the schema in the external registry for the ID carried in the message?
The PIP could suggest that implementers consider these scenarios, for example whether to throw a specific exception, return null, or retry.
Although the specific implementation is up to the user, it would be better to provide some guidance.
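To make the failure scenarios above concrete, here is a minimal, self-contained Java sketch of one possible policy: bounded retries when registration fails in `encode`, and an explicit exception (rather than null) when `decode` sees an unknown schema ID. Everything in it is an assumption for illustration: `RegistryClient`, `RegistryAccessException`, `RegistryBackedStringSchema`, `FlakyInMemoryRegistry`, and the 4-byte ID prefix are hypothetical and are not part of Pulsar, this PIP, or any registry product. A real implementation would implement Pulsar's `Schema` interface instead of the standalone class shown here.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Small demo: the first register call fails, the retry succeeds, and the value round-trips. */
class RegistrySchemaDemo {
    public static void main(String[] args) {
        RegistryBackedStringSchema schema = new RegistryBackedStringSchema(
                new FlakyInMemoryRegistry(1), "{\"type\":\"string\"}", 3);
        System.out.println(schema.decode(schema.encode("hello")));
    }
}

/** Hypothetical client for an external schema registry (not a real library API). */
interface RegistryClient {
    int register(String schemaDefinition) throws Exception; // may fail: network, auth
    Optional<String> lookup(int schemaId) throws Exception;
}

/** Hypothetical unchecked exception that surfaces registry problems to the caller. */
class RegistryAccessException extends RuntimeException {
    RegistryAccessException(String message, Throwable cause) {
        super(message, cause);
    }
}

/** Mirrors the encode/decode shape of a custom schema for String payloads. */
class RegistryBackedStringSchema {
    private final RegistryClient client;
    private final String schemaDefinition;
    private final int maxAttempts;
    private final Map<Integer, String> knownSchemas = new ConcurrentHashMap<>();
    private volatile Integer cachedId;

    RegistryBackedStringSchema(RegistryClient client, String schemaDefinition, int maxAttempts) {
        this.client = client;
        this.schemaDefinition = schemaDefinition;
        this.maxAttempts = maxAttempts;
    }

    /** Prefixes the payload with the 4-byte schema ID obtained from the registry. */
    byte[] encode(String message) {
        int id = resolveSchemaId();
        byte[] payload = message.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(Integer.BYTES + payload.length).putInt(id).put(payload).array();
    }

    /** Fails loudly when the schema ID in the message is unknown to the registry. */
    String decode(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int id = buf.getInt();
        if (!knownSchemas.containsKey(id)) {
            Optional<String> found;
            try {
                found = client.lookup(id);
            } catch (Exception e) {
                throw new RegistryAccessException("lookup failed for schema id " + id, e);
            }
            String definition = found.orElseThrow(
                    () -> new RegistryAccessException("unknown schema id " + id, null));
            knownSchemas.put(id, definition);
        }
        byte[] payload = new byte[buf.remaining()];
        buf.get(payload);
        return new String(payload, StandardCharsets.UTF_8);
    }

    /** Registers the schema once, retrying up to maxAttempts times before giving up. */
    private int resolveSchemaId() {
        Integer id = cachedId;
        if (id != null) {
            return id;
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                int registered = client.register(schemaDefinition);
                cachedId = registered;
                return registered;
            } catch (Exception e) {
                last = e; // remember the cause and try again
            }
        }
        throw new RegistryAccessException(
                "schema registration failed after " + maxAttempts + " attempts", last);
    }
}

/** In-memory stand-in for a registry whose first N register calls fail. */
class FlakyInMemoryRegistry implements RegistryClient {
    private final Map<Integer, String> byId = new HashMap<>();
    private int failuresLeft;

    FlakyInMemoryRegistry(int failuresBeforeSuccess) {
        this.failuresLeft = failuresBeforeSuccess;
    }

    @Override
    public int register(String schemaDefinition) throws IOException {
        if (failuresLeft > 0) {
            failuresLeft--;
            throw new IOException("simulated network error");
        }
        int id = byId.size() + 1;
        byId.put(id, schemaDefinition);
        return id;
    }

    @Override
    public Optional<String> lookup(int schemaId) {
        return Optional.ofNullable(byId.get(schemaId));
    }
}
```

Retrying inside `encode` blocks the send path, so a real implementation might prefer to register the schema eagerly when the producer is created, or to fail fast and leave retries to the caller. Whichever policy is chosen, throwing a dedicated exception keeps registry failures distinguishable from payload serialization errors.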
the factory can transfer the security configuration to the `SchemaInfoProvider` instance.

# Pulsar-GEO replication
If users can use a third-party schema registry service, it provides a new way to manage schemas for geo-replicated topics.
This could be expanded a bit. For example, if a user relies on an external, globally available schema registry (such as a cross-region replicated Confluent Schema Registry), then schema synchronization in the geo-replication scenario can be handled by that external system, simplifying the schema synchronization needs of Pulsar itself.
}
```

# Pulsar Function
Although this PIP mainly focuses on client changes, how will operators and users see that a topic's schema is externally managed when they use the Pulsar Admin API or tools such as `pulsar-admin`? For example, what should the `pulsar-admin schemas get <topic-name>` command return for such topics? Will there be a new status or flag to indicate that the schema is externally managed? This may be beyond the direct scope of this PIP, but it is worth raising and considering as part of the overall design impact.
Motivation
The Pulsar client should be able to access a third-party schema registry service to manage schemas (register schema, get schema, validate schema, etc.). The schema registry service can be an independent service; when an external schema registry service is used, the Pulsar broker doesn't need to care about the schema of the messages while creating a producer or adding a consumer subscription.