[feat][pip] PIP-420: Provide ability for Pulsar clients to integrate with third-party schema registry service #24328

Open
wants to merge 6 commits into
base: master

gaoran10 (Contributor) commented May 21, 2025

Motivation

It would be better if the Pulsar client could access a third-party schema registry service to manage schemas (register a schema, get a schema, validate a schema, etc.). The schema registry service can be an independent service. When an external schema registry service is used, the Pulsar broker doesn't need to care about the schema of the messages when a producer is created or a consumer subscription is added.

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

github-actions bot added the PIP and doc-not-needed labels May 21, 2025
gaoran10 self-assigned this May 21, 2025

Denovo1998 (Contributor) left a comment

This PIP looks meaningful.


It would be better if the Pulsar client could access a third-party schema registry service to manage schemas (register a schema,
get a schema, validate a schema, etc.). The schema registry service can be an independent service. When a third-party schema registry service is used,
the Pulsar broker doesn't need to care about the schema of the messages.

Contributor

The meaning of "doesn't need to care" could be spelled out. For example: the broker treats the message only as raw bytes and no longer performs additional processing such as schema compatibility checks. This makes the broker more lightweight, which can improve overall system performance.

Regarding the advantages of third-party schema registry services, it is recommended to elaborate further. For example:

  1. Taking Confluent Schema Registry as an example, it can achieve unified Schema management between Kafka and Pulsar.
  2. This service can also achieve collaborative management between Pulsar topic and data lake table metadata.

Comment on lines +5 to +7
Schema is an important feature for messaging systems. Pulsar integrates the schema manager into the Pulsar broker.
The current implementation in Pulsar clients couples schema management with some protocol operations (creating a producer, adding a consumer subscription).
This increases the complexity of the Pulsar protocol, and users can't leverage third-party schema registry services in the Pulsar client.

Contributor

It might be worth mentioning here that support for third-party schema registry services can also be implemented through SchemaStorage. The Motivation section could explain what advantages this PIP has over implementing SchemaStorage, which would make the PIP more persuasive.

Contributor Author

Thanks for the reminder. SchemaStorage is designed to manage schemas on the server side. This PIP mainly provides the ability to access a third-party schema registry service on the Pulsar client side, so it is not an alternative implementation.

The Pulsar client should ignore the schema information when creating a producer or adding a consumer subscription.

Users can implement the `SchemaInfoProvider` interface and the `Schema` interface to access an external schema registry service.
The `Schema` interface has two main methods, `encode` and `decode`. Customized schemas can register or fetch a schema within these methods.
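
A minimal sketch of the encode/decode flow described in the quoted text, kept self-contained so it runs without the Pulsar client. Everything named here is hypothetical: `RegistryClient`, `InMemoryRegistry`, and `RegistryBackedStringSchema` are illustrative stand-ins, not Pulsar or registry APIs, and the 4-byte schema ID prefix is an assumed framing chosen only for the example. A real implementation would put this logic behind Pulsar's `Schema` interface.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for an external schema registry client; not a Pulsar API. */
interface RegistryClient {
    /** Registers a schema definition and returns the ID assigned to it. */
    int register(String schemaDefinition);

    /** Returns the definition for an ID, or null if the ID is unknown. */
    String lookup(int schemaId);
}

/** In-memory registry, used only to keep the sketch runnable. */
class InMemoryRegistry implements RegistryClient {
    private final Map<String, Integer> ids = new HashMap<>();
    private final Map<Integer, String> definitions = new HashMap<>();

    @Override
    public synchronized int register(String schemaDefinition) {
        Integer existing = ids.get(schemaDefinition);
        if (existing != null) {
            return existing; // registering the same definition twice is idempotent
        }
        int id = ids.size() + 1;
        ids.put(schemaDefinition, id);
        definitions.put(id, schemaDefinition);
        return id;
    }

    @Override
    public synchronized String lookup(int schemaId) {
        return definitions.get(schemaId);
    }
}

/** Mirrors the encode/decode shape of a schema; the ID prefix is an assumed wire format. */
class RegistryBackedStringSchema {
    private final RegistryClient registry;
    private final String schemaDefinition;

    RegistryBackedStringSchema(RegistryClient registry, String schemaDefinition) {
        this.registry = registry;
        this.schemaDefinition = schemaDefinition;
    }

    /** Registers (or resolves) the schema, then prefixes the payload with its ID. */
    byte[] encode(String value) {
        int schemaId = registry.register(schemaDefinition);
        byte[] payload = value.getBytes(StandardCharsets.UTF_8);
        return ByteBuffer.allocate(Integer.BYTES + payload.length)
                .putInt(schemaId)
                .put(payload)
                .array();
    }

    /** Reads the ID prefix, checks that the registry knows it, then decodes the payload. */
    String decode(byte[] bytes) {
        ByteBuffer buffer = ByteBuffer.wrap(bytes);
        int schemaId = buffer.getInt();
        if (registry.lookup(schemaId) == null) {
            throw new IllegalStateException("Unknown schema ID " + schemaId);
        }
        byte[] payload = new byte[buffer.remaining()];
        buffer.get(payload);
        return new String(payload, StandardCharsets.UTF_8);
    }
}
```

In this sketch the broker-facing bytes carry only an opaque ID plus payload, which is what lets the broker stay unaware of the schema itself.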

Contributor

What should happen when the `encode` method of a custom Schema fails while registering the schema with an external schema registry (e.g., due to network issues or an authentication failure), or when the `decode` method cannot find the schema in the external registry for the ID carried in the message?

The PIP could suggest that implementers consider these scenarios, for example whether to throw a specific exception, return null, or retry.

Although the specific implementation is up to the user, it would be better to provide some guidance.
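
One possible shape for such guidance, as a rough sketch rather than anything the PIP specifies: wrap registry calls in a bounded retry and surface a dedicated exception once the attempts are exhausted. `SchemaRegistryException` and `RegistryRetry` are hypothetical names invented for this example, and the linear backoff is an arbitrary choice. A real implementation would probably avoid retrying non-transient errors such as authentication failures.

```java
import java.util.concurrent.Callable;

/** Hypothetical exception type for registry failures; not part of Pulsar. */
class SchemaRegistryException extends RuntimeException {
    SchemaRegistryException(String message, Throwable cause) {
        super(message, cause);
    }
}

/** Hypothetical helper that retries a registry call a bounded number of times. */
class RegistryRetry {
    static <T> T callWithRetry(Callable<T> registryCall, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return registryCall.call();
            } catch (Exception e) {
                last = e; // remember the failure so it can be reported as the cause
                if (attempt < maxAttempts) {
                    try {
                        // Linear backoff (assumed policy): wait longer after each failure.
                        Thread.sleep(backoffMillis * attempt);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw new SchemaRegistryException("Interrupted while retrying", ie);
                    }
                }
            }
        }
        throw new SchemaRegistryException(
                "Registry call failed after " + maxAttempts + " attempts", last);
    }
}
```

A custom `encode` could call `RegistryRetry.callWithRetry(() -> client.register(definition), 3, 100L)`, where `client` is whatever registry client the implementer uses, so that callers see one well-defined exception type instead of a null return.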

the factory can transfer the security configuration to the `SchemaInfoProvider` instance.

# Pulsar-GEO replication
If users can use a third-party schema registry service, it provides a new way to manage schemas for geo-replicated topics.

Contributor

This could be expanded a bit. For example, if a user relies on an external, globally available schema registry (such as a cross-region replicated Confluent Schema Registry), then schema synchronization in the geo-replication scenario can be handled by that external system, which reduces the schema synchronization that Pulsar itself has to do.


# Pulsar Function

Contributor

Although this PIP mainly focuses on client changes, how will operators and users find out, through the Pulsar Admin API or tools such as pulsar-admin, that a topic's schema is externally managed? For example, what should the `pulsar-admin schemas get <topic-name>` command return for such topics? Will there be a new status or flag to indicate that the schema is externally managed? This may be beyond the direct scope of this PIP, but it is worth raising as part of the overall design impact.

Labels: PIP, doc-not-needed

2 participants