@Copilot Copilot AI commented Oct 19, 2025

  • Explore the codebase and understand the issue
  • Build baseline successfully
  • Identify conditions for the optimization (Leading Beginning anchor + Trailing anchor with fixed length)
  • Modify EmitScan to detect this pattern and skip TryFindNextPossibleStartingPosition
  • Apply same optimization to RegexCompiler.cs
  • Fix to compute trailing anchor when leading anchor takes precedence
  • Build changes successfully
  • Add test cases to Regex.Match.Tests
  • Run all tests - ALL PASS (30,327 functional tests)
  • Address PR feedback:
    • Only apply optimization for End (\z), not EndZ ($) since EndZ allows optional \n
    • Reorder checks to minimize cost (compute maxLength only after checking anchor)
    • Remove [Fact] test and add test cases to Match_MemberData_Cases instead
  • Final validation and completion

Summary

Successfully implemented and refined the optimization for regex patterns with both leading and trailing anchors and fixed length. The optimization now correctly:

  • Only applies to patterns with End (\z) anchor, not EndZ ($)
  • Minimizes computation cost by checking anchor type before computing max length
  • Has test coverage through Regex.Match.Tests for patterns like ^1234\z

Original prompt

This section details on the original issue you should resolve

<issue_title>Earlier fail fast for fixed-length regex patterns with anchors</issue_title>
<issue_description>

Description

When a regex pattern has both leading and trailing anchors and is of fixed length, it seems to me that we might be missing an opportunity to fail fast earlier.

Take e.g. the regex pattern ^1234$.
Currently the emitted TryFindNextPossibleStartingPosition checks that the input has at least 4 characters to continue.

if (pos <= inputSpan.Length - 4 && pos == 0)

When invoking IsMatch("12345") on the generated regex, the flow will be:

  • pass TryFindNextPossibleStartingPosition as the input is longer than 4 characters,
  • enter TryMatchAtCurrentPosition and call .StartsWith("1234"),
  • and only then fail on 5 < slice.Length because the input was too long.

If we tightened the check in TryFindNextPossibleStartingPosition to "exactly 4 characters", we wouldn't have to enter TryMatchAtCurrentPosition.

If I understand the code correctly, we can detect this situation in EmitTryFindNextPossibleStartingPosition when:

  • rm.Tree.FindOptimizations.LeadingAnchor is RegexNodeKind.Beginning and
  • rm.Tree.FindOptimizations.FindMode is FindNextStartingPositionMode.TrailingAnchor_FixedLength_LeftToRight_End or FindNextStartingPositionMode.TrailingAnchor_FixedLength_LeftToRight_EndZ

Am I missing something?</issue_description>

Comments on the Issue (you are @copilot in this section)

@stephentoub Thanks. There are certainly more special-cases we could optimize for; it's primarily a question of how common they are and whether it's worth the extra code to maintain. Looking through https://github.com/dotnet/runtime-assets/blob/main/src/System.Text.RegularExpressions.TestData/Regex_RealWorldPatterns.json, there are several hundred occurrences of this pattern, so I'd be ok with a PR to try to optimize it further.

That said, keep in mind that $ does not mean "exactly at the end". It actually means "at the end or just before an ending \n", so a length check of exactly 4 would be wrong.
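The difference between the two anchors can be checked directly with System.Text.RegularExpressions. This is a minimal sketch; the AnchorSemantics class and its method names are illustrative and are not part of the PR or of dotnet/runtime.

```csharp
using System;
using System.Text.RegularExpressions;

// Illustrative helper, not part of dotnet/runtime.
public static class AnchorSemantics
{
    // `$` (without RegexOptions.Multiline) matches at the very end of the
    // input or just before a final '\n'.
    public static bool MatchesDollar(string input) => Regex.IsMatch(input, @"^1234$");

    // `\z` matches only at the very end of the input.
    public static bool MatchesEnd(string input) => Regex.IsMatch(input, @"^1234\z");

    public static void Main()
    {
        Console.WriteLine(MatchesDollar("1234"));   // True
        Console.WriteLine(MatchesDollar("1234\n")); // True: 5 characters, still a match
        Console.WriteLine(MatchesEnd("1234\n"));    // False
        Console.WriteLine(MatchesDollar("12345"));  // False
    }
}
```

Because "1234\n" has five characters and still matches ^1234$, an "exactly 4 characters" length check would reject a valid input.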

For this specific construction, I think the right answer would be to not emit TryFindNextPossibleStartingPosition at all. You can see in the source generator code there's already special-cases when Scan is emitted to control whether or not we emit TryFindNextPossibleStartingPosition or TryMatchAtCurrentPosition. In this case, I think the right answer would simply be to just emit TryMatchAtCurrentPosition and have Scan just call it.</comment_new>
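As a rough illustration of that suggestion, the sketch below models a Scan that calls TryMatchAtCurrentPosition directly for ^1234$. It is a hand-written simplification, not the code the source generator emits; the AnchoredFixedLengthModel class and both method bodies are assumptions made for the example.

```csharp
using System;

// Hand-written model of the suggested shape; NOT actual generator output.
public static class AnchoredFixedLengthModel
{
    // Simplified stand-in for the emitted TryMatchAtCurrentPosition for ^1234$.
    private static bool TryMatchAtCurrentPosition(ReadOnlySpan<char> input, int pos)
    {
        // ^ : the match must start at the beginning of the input.
        if (pos != 0)
        {
            return false;
        }

        ReadOnlySpan<char> slice = input.Slice(pos);

        // Literal "1234".
        if (!slice.StartsWith("1234".AsSpan()))
        {
            return false;
        }

        // $ : either at the very end, or just before a final '\n'.
        return slice.Length == 4 || (slice.Length == 5 && slice[4] == '\n');
    }

    // With a leading ^, position 0 is the only candidate, so there is no
    // separate "find next possible starting position" step.
    public static bool Scan(ReadOnlySpan<char> input) => TryMatchAtCurrentPosition(input, 0);

    public static void Main()
    {
        Console.WriteLine(Scan("1234"));   // True
        Console.WriteLine(Scan("1234\n")); // True
        Console.WriteLine(Scan("12345"));  // False
        Console.WriteLine(Scan("x1234"));  // False
    }
}
```

In this model the too-long input "12345" is rejected by a single call, without a preliminary length check that would pass and then be contradicted later.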

Fixes #118489



@github-actions github-actions bot added the needs-area-label label Oct 19, 2025
@Copilot Copilot AI changed the title from "[WIP] Optimize fail fast for fixed-length regex patterns" to "Optimize fixed-length regex patterns with both leading and trailing anchors" Oct 19, 2025
Copilot finished work on behalf of stephentoub October 19, 2025 21:01
@Copilot Copilot AI requested a review from stephentoub October 19, 2025 21:01
@stephentoub stephentoub added the area-System.Text.RegularExpressions label and removed the needs-area-label label Oct 19, 2025
Contributor

Tagging subscribers to this area: @dotnet/area-system-text-regularexpressions
See info in area-owners.md if you want to be subscribed.

Comment on lines 803 to 804
trailingAnchor is RegexNodeKind.End or RegexNodeKind.EndZ &&
maxLength == rm.Tree.FindOptimizations.MinRequiredLength;
Member

@copilot, if trailingAnchor is RegexNodeKind.EndZ, will maxLength ever equal MinRequiredLength? Seems like "no" because EndZ has to allow for an optional \n at the end?

Member

@copilot, please reorder the checks to minimize cost, e.g. this can be root.ComputeMaxLength() == rm.Tree.FindOptimizations.MinRequiredLength rather than computing the max earlier, so that we only compute it once we've verified other data we already have (like the kind of anchor in use).

Contributor Author

You're correct. I've updated the code to only apply this optimization for RegexNodeKind.End (the \z anchor), excluding RegexNodeKind.EndZ (the $ anchor) since EndZ allows an optional \n at the end. Fixed in commit 8e7b2d9.

Contributor Author

Done. I've reordered the checks so that ComputeMaxLength() is only called after verifying !rtl and trailingAnchor == RegexNodeKind.End, minimizing the cost when those conditions aren't met. Fixed in commit 8e7b2d9.
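The effect of the reordering can be sketched with short-circuit evaluation. Everything below is a hypothetical stand-in: AnchorKind substitutes for RegexNodeKind, and the computeMaxLength delegate substitutes for root.ComputeMaxLength(); the actual condition in the PR may be shaped differently.

```csharp
using System;

// Hypothetical stand-in for RegexNodeKind; not the real enum.
public enum AnchorKind { None, Beginning, End, EndZ }

public static class CheckOrdering
{
    // Cheap checks on data that is already available come first; the
    // comparatively expensive max-length computation runs only if they pass.
    public static bool CanSkipFindStep(
        bool rtl,
        AnchorKind leading,
        AnchorKind trailing,
        int minRequiredLength,
        Func<int?> computeMaxLength) =>
        !rtl &&
        leading == AnchorKind.Beginning &&
        trailing == AnchorKind.End &&
        computeMaxLength() == minRequiredLength;

    public static void Main()
    {
        int calls = 0;
        Func<int?> computeMax = () => { calls++; return 4; };

        // EndZ ($) fails the anchor check, so the delegate is never invoked.
        Console.WriteLine(CanSkipFindStep(false, AnchorKind.Beginning, AnchorKind.EndZ, 4, computeMax)); // False
        Console.WriteLine(calls); // 0

        // End (\z) passes the cheap checks, so the max length is computed once.
        Console.WriteLine(CanSkipFindStep(false, AnchorKind.Beginning, AnchorKind.End, 4, computeMax)); // True
        Console.WriteLine(calls); // 1
    }
}
```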

Assert.Contains("base.pattern = \"\\n\";", actual);
}

[Fact]
Member

@copilot, we don't need this [Fact] test. Instead, just make sure that there's at least one pattern in Regex.Match.Tests that will exercise this optimization.

Contributor Author

Done. I've removed the [Fact] test and added test cases to Match_MemberData_Cases for patterns like ^1234\z and ^abc\z with various inputs to exercise the optimization. Fixed in commit 8e7b2d9.
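For a sense of what such cases exercise, here is a standalone sketch that runs a few pattern and input pairs through both the interpreter and RegexOptions.Compiled. The AnchoredMatchCases class and the specific inputs are illustrative assumptions; they are not the entries added to Match_MemberData_Cases, whose exact shape is not shown in this thread.

```csharp
using System;
using System.Text.RegularExpressions;

// Illustrative cases only; not the actual test data from the PR.
public static class AnchoredMatchCases
{
    public static readonly (string Pattern, string Input, bool Expected)[] Cases =
    {
        (@"^1234\z", "1234", true),
        (@"^1234\z", "12345", false),   // too long
        (@"^1234\z", "123", false),     // too short
        (@"^1234\z", "1234\n", false),  // \z does not tolerate a trailing newline
        (@"^abc\z", "abc", true),
        (@"^abc\z", "abcd", false),
        (@"^abc\z", "xabc", false),
    };

    // Returns true when every case produces the expected result
    // under the given options.
    public static bool AllPass(RegexOptions options)
    {
        foreach (var (pattern, input, expected) in Cases)
        {
            if (new Regex(pattern, options).IsMatch(input) != expected)
            {
                return false;
            }
        }
        return true;
    }

    public static void Main()
    {
        Console.WriteLine(AllPass(RegexOptions.None));     // True
        Console.WriteLine(AllPass(RegexOptions.Compiled)); // True
    }
}
```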

Copilot finished work on behalf of stephentoub October 20, 2025 03:58
@Copilot Copilot AI requested a review from stephentoub October 20, 2025 03:58