Skip to content

WIP feat(idgen): Start Implementation of NoSQL with the ID Generation Framework #2131

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

adam-christian-software
Copy link

@adam-christian-software adam-christian-software commented Jul 18, 2025

This PR is still in draft.

Motivation

This implementation starts the series of PRs to implement the NoSQL work presented in https://docs.google.com/document/d/1POUWe0xMZOBoaJ6Rgiw35ziEoc6OEYCiW7Zk6bR9H6M/edit?tab=t.0#heading=h.nx9vzhg2x8v2 with the first implementation here.

Description

This PR is dedicated to creating the ID Generation Framework.

Related to #650 & #844


public interface SnowflakeIdGenerator extends IdGenerator {
/** Offset of the snowflake ID generator since the 1970-01-01T00:00:00Z epoch instant. */
Instant EPOCH_OFFSET =
Copy link
Contributor

@dimas-b dimas-b Jul 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: this is not technically an offset, but an exact moment in time. How about just EPOCH?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the expression is unnecessarily complex, the below one is 100% equivalent and imho easier to grasp:

Instant.parse("2025-03-01T00:00:00Z")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dimas-b - At the risk of sounding too verbose, how about ID_GENERATOR_EPOCH? I'd like to distinguish this from Unix Epoch?

And, I agree @adutra . I'll update.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ID_GENERATOR_EPOCH LGTM. Alt: ID_EPOCH.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like ID_EPOCH. I'll do that.

long generateId();

/** Generate the system ID for a node, solely used by/for node management purposes. */
long systemIdForNode(int nodeId);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this ID expected to be the same for all IdGenerator implementations? What if a future impl. is not node-based?

Should this be pushed down to the Snowflake ID code?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function isn't specific to the particular implementation.
See the upcoming node-id-lease stuff: there is one implementation with one constant configuration per setup.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the more that I read about the node-id lease stuff. I don't know whether this should be in this module. Here's my thinking:

  1. It seems that NodeID generation is special and not a Snowflake ID generation. The implementation is only based upon the node id passed in and some system-wide variables such as SnowflakeIdGeneratorImpl#timestampMax, SnowflakeIdGeneratorImpl#timestampShift, & SnowflakeIdGeneratorImpl#sequenceBits. So, in practice, it's really just the Node ID passed in.
  2. So, given that, I think we could pull this out and just put it into the node leasing modules. That way, we can keep the IdGenerator clean for the cases that require the distributed id generation.

What do y'all think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to Adam's point 1.

... however, I have a bigger concern. Suppose we run with Snowflake IDs for a while and then change to another ID generator. Assume generateId() outputs do not clash. Still, do we expect systemIdForNode(X) to return the same value for all generator implementations and for all possible values of X?

compileOnly(platform(libs.quarkus.bom))
compileOnly("io.quarkus:quarkus-core")

compileOnly(project(":polaris-immutables"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This dependency seems unused. I wonder if you didn't forget to annotate IdGeneratorSource with @PolarisImmutable? It seems like a good candidate for that (lots of anonymous classes implementing this interface).

@Override
void close();

void waitUntilTimeMillisAdvanced();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could be good to add javadocs here (and above), it's not immediately clear what this method is supposed to do (spin-wait until the clock ticks?).

Also neither this method nor sleepMillis throw InterruptedException, which is surprising since they are clearly blocking. It could be good to add an @implSpec note about how the interrupt flag is expected to be handled.

Map<String, String> params();

@PolarisImmutable
interface BuildableIdGeneratorSpec extends IdGeneratorSpec {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This class looks odd? Afaict we could use ImmutableIdGeneratorSpec.builder() instead. The only difference seems to be that there is a default type here, whereas in IdGeneratorSpec there isn't, but it's certainly possible to overcome this limitation somehow.


long timestampFromId(long id);

long timestampUtcFromId(long id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Timestamp UTC" does not make sense to me 🤔
From the implementation class, it seems you meant "timestamp since Unix Epoch" instead.


long constructId(long timestamp, long sequence, long node);

long timestampFromId(long id);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this is understood as "since the Snowflake Epoch (2025-03-01)". Could be good to add javadocs to clarify.


public interface SnowflakeIdGenerator extends IdGenerator {
/** Offset of the snowflake ID generator since the 1970-01-01T00:00:00Z epoch instant. */
Instant EPOCH_OFFSET =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, the expression is unnecessarily complex, the below one is 100% equivalent and imho easier to grasp:

Instant.parse("2025-03-01T00:00:00Z")


@Override
public long currentTimeMillis() {
return SnowflakeIdGenerator.EPOCH_OFFSET_MILLIS;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks suspicious as it is returning a fixed timestamp. Did you mean clockMillis.getAsLong()?

Suggested change
return SnowflakeIdGenerator.EPOCH_OFFSET_MILLIS;
return clockMillis.getAsLong();

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I think that there are some special-casing that is happening and is leaking into our interfaces and implementation. See here as well: https://github.com/apache/polaris/pull/2131/files#r2214779625

We need the Snowflake ID Generation for Commit IDs and Entity Object IDs.

We need a Node ID. Node IDs cannot be Snowflake IDs because Snowflake IDs require a NodeID.

So, I actually think that we need a NodeIdGenerator concept for Node IDs and the concepts presented here for the other IDs. Right now, we are blending them into one interface and I think that is making things a bit confusing.

For example:

  1. IdGeneratorFactory#buildSystemIdGenerator
  2. IdGenerator#systemIdForNode

I think that we probably need to separate these concepts.

What do y'all think, @dimas-b & @snazy?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tend to think node ID generation should be internal to the Snowflake ID code. A sharable NodeIdGenerator generator might be an overkill, but that is a secondary concern.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My initial comment stems from the fact that the clockMillis parameter is part of the method signature (thus part of the API) but is not used, so I don't know why it is there.

}

@Override
public long constructId(long timestamp, long sequence, long nodeId) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method effectively bypasses the Id generator source. Maybe it shouldn't be public? Having two methods in the same API that generate IDs using different mechanisms seems error-prone. OTOH this method seems only called from timeUuidToId, in this same class, so it seems it could be private.

public long generateId() {
var nodeId = idGeneratorSource.nodeId();
checkState(nodeId >= 0, "Cannot generate a new ID, shutting down?");
var nodeIdPattern = ((long) nodeId) << sequenceBits;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: move this variable declaration to line 228. It's not needed before.

@dimas-b dimas-b requested a review from dennishuo July 21, 2025 20:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants