Skip to content

JS-1706 Fix S5868 false positive for regex fragment boundaries#7124

Open
sonar-nigel[bot] wants to merge 9 commits into
masterfrom
fix/JS-1706-fix-fp-on-s5868-unicode-in-variables-concatenated-into-regex-character-classes-gpt-5.5
Open

JS-1706 Fix S5868 false positive for regex fragment boundaries#7124
sonar-nigel[bot] wants to merge 9 commits into
masterfrom
fix/JS-1706-fix-fp-on-s5868-unicode-in-variables-concatenated-into-regex-character-classes-gpt-5.5

Conversation

@sonar-nigel
Copy link
Copy Markdown
Contributor

@sonar-nigel sonar-nigel Bot commented May 29, 2026

Relates to JS-1706

S5868 now avoids reporting Unicode grapheme sequences that are created only by concatenating separately sourced regex fragments into a dynamic character class. The fix tracks fragment seams in reconstructed RegExp constructor patterns and suppresses only reports that cross a real non-literal fragment boundary.

  • Keeps reporting grapheme-like sequences that are fully contained in one fragment, including identifier-resolved fragments.
  • Keeps reporting fully static regex sources split across literal-only concatenation.
  • Narrows seam matching to exact regexpp character boundaries so split escapes are still analyzed correctly.

Vibe Bot added 4 commits May 29, 2026 08:59
Tests cover dynamically concatenated regex character classes where Unicode
adjacency is created across fragment boundaries. They verify that seam-crossing
combining characters, surrogate pairs, emoji modifiers, regional indicators,
and ZWJ sequences are accepted while same-fragment grapheme clusters still
raise issues.

Relates to JS-1706
Track concatenation fragment boundaries while extracting regex patterns and
use them in S5868 to suppress Unicode grapheme reports created only across
dynamic RegExp string joins. Reports for suspicious Unicode sequences that
remain within a single source fragment are still raised, following the
approved Jira implementation guidelines.

Relates to JS-1706.
ESLint 9 flat-config verification no longer calls the custom pragma patch hooks,
so inline configs and disable comments using ESLint rule aliases were not mapped
to sonarjs/Sxxxx rule ids before linting. Patch directive comments before
verification so alias-based inline configs keep Sonar defaults and disable
directives suppress the intended wrapped rules. Update the linter expectation for
prefer-const severity-only inline config to match the preserved Sonar defaults.
Remove stale imports from the split linter test files so the bridge
compile no longer fails with TS6133 unused-import errors.

No SonarQube issues or security hotspots were present in the local scan
inputs, and no NOSONAR suppressions were added.
@sonar-nigel
Copy link
Copy Markdown
Contributor Author

sonar-nigel Bot commented May 29, 2026

Rule Profile

Field Value
Rule S5868 — Unicode Grapheme Clusters should be avoided inside regex character classes
Severity / type Major BUG
Sonar Way Yes
Scope Main
"cc̈d̈d".replace(/[c̈d̈]/g, "X"); // result is "XXXXXX" and not expected "cXXd"

False Positive Pattern

The false positives happen when a RegExp constructor builds a character class through concatenation, for example by adding [ and ] around variables that contain Unicode ranges or escapes. The flattened final pattern can place adjacent code points across source-fragment boundaries, but that adjacency is an artifact of dynamic regex construction rather than a grapheme cluster written inside one character-class fragment.

var nonASCIIidentifierStart = new RegExp("[" + nonASCIIidentifierStartChars + "]");
var nonASCIIidentifier = new RegExp("[" + nonASCIIidentifierStartChars + nonASCIIidentifierChars + "]");

False Negative Risk

The risk is bounded because suppression applies only when the suspicious adjacent regexpp characters straddle an exact recorded seam from a non-literal fragment boundary. Same-fragment Unicode sequences are still reported, including when the fragment comes through an identifier, and literal-only concatenations remain fully static so they are still checked.

const c = "👍";
const p = "[" + c + "]";
new RegExp(p);

Fix Summary

Regex pattern extraction now carries seam metadata alongside the flattened pattern and flags, plus an isPureLiteral marker to distinguish readability-only literal concatenation from dynamic fragment construction. S5868 consults those seams before reporting combined characters, surrogate pairs, modified emoji, regional indicators, and ZWJ sequences.

  • Records concatenation seams only when at least one side is not a pure inline literal.
  • Suppresses only reports crossing the exact boundary between adjacent parsed regex characters.
  • Adds regression coverage for valid dynamic-fragment cases and invalid same-fragment, split-escape, and literal-only concatenation cases.

@sonar-nigel sonar-nigel Bot temporarily deployed to sca-checking May 29, 2026 10:24 Inactive
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 29, 2026

Ruling Report

No changes to ruling expected issues in this PR

Replace the ES2023 Array.prototype.findLast usage in S5868 with an
explicit reverse ancestor scan so the rule compiles under the ESLint
Plugin TypeScript target while preserving the same RegExp constructor
lookup behavior.
@sonar-nigel sonar-nigel Bot temporarily deployed to sca-checking May 29, 2026 11:41 Inactive
@sonar-nigel sonar-nigel Bot marked this pull request as ready for review May 29, 2026 12:20
@francois-mora-sonarsource
Copy link
Copy Markdown
Contributor

The new seam suppression is broader than the intended fix. spansSeam() currently treats any concatenation seam between charA.start and charB.start as if it were a boundary between the two parsed regex characters, but that also matches seams that land inside the raw text of charA.

For example:

const part1 = '\\uD83';
const part2 = 'D\\uDC4D';
new RegExp('[' + part1 + part2 + ']');

Here the concatenation happens inside the escape \uD83D, not between \uD83D and \uDC4D. The final regex is still [\uD83D\uDC4D], so S5868 should still report it, but the current predicate suppresses it.

Could you narrow the seam check to the exact boundary between the adjacent regexpp characters instead of any seam before charB.start? Since seams are tracked in pattern space and regexpp offsets include the opening slash, the aligned case is seam + 1 === charB.start (equivalently seam === charA.end - 1, with charA.end === charB.start).

It would also be good to add an invalid regression test for the escape-split example above, so we keep suppressing only true fragment-boundary cases.

Vibe Bot added 2 commits May 29, 2026 15:37
Comment: The new seam suppression is broader than the intended fix. `spansSeam()` currently treats any concatenation seam between `charA.start` and `charB.start` as if it were a boundary between the two parsed regex characters, but that also matches seams that land inside the raw text of `charA`.

For example:

```js
const part1 = '\\uD83';
const part2 = 'D\\uDC4D';
new RegExp('[' + part1 + part2 + ']');
```

Here the concatenation happens inside the escape `\uD83D`, not between `\uD83D` and `\uDC4D`. The final regex is still `[\uD83D\uDC4D]`, so S5868 should still report it, but the current predicate suppresses it.

Could you narrow the seam check to the exact boundary between the adjacent `regexpp` characters instead of any seam before `charB.start`? Since seams are tracked in pattern space and `regexpp` offsets include the opening slash, the aligned case is `seam + 1 === charB.start` (equivalently `seam === charA.end - 1`, with `charA.end === charB.start`).

It would also be good to add an `invalid` regression test for the escape-split example above, so we keep suppressing only true fragment-boundary cases.
The S5868 unit test for a RegExp constructor built from split string parts
was missing the expected unicode-flag suggestion. ESLint RuleTester requires
test errors to declare suggestions when the rule emits them, so add the
expected suggestion output for that case.
@sonar-nigel sonar-nigel Bot temporarily deployed to sca-checking May 29, 2026 15:40 Inactive
@francois-mora-sonarsource
Copy link
Copy Markdown
Contributor

Improvement suggestion: the seam metadata should not treat every static + boundary as a boundary where S5868 must suppress adjacency checks.

Literal-only concatenation is still a fully static regex source. For example, new RegExp('[A' + '\\u0301]') is equivalent to new RegExp('[A\\u0301]'), so S5868 should still report the Unicode grapheme cluster inside the character class. The same applies to split surrogate literals such as new RegExp('[\\uD83D' + '\\uDC4D]').

A concrete fix would be to track whether an extracted pattern came from a pure inline literal expression:

type ExtractedRegexPattern = {
  pattern: string;
  flags: string;
  seams: number[];
  isPureLiteral: boolean;
};

Then literals, static templates, and String.raw templates return isPureLiteral: true; identifier resolution returns the resolved pattern with isPureLiteral: false; and binary + only records the current split offset when at least one side is not pure literal:

const splitSeams = left.isPureLiteral && right.isPureLiteral ? [] : [splitOffset];

return {
  pattern: left.pattern + right.pattern,
  flags: '',
  seams: [...left.seams, ...splitSeams, ...right.seams.map(s => s + splitOffset)],
  isPureLiteral: left.isPureLiteral && right.isPureLiteral,
};

This keeps the false-positive fix for variable/fragment boundaries, while preserving true positives for hard-coded regexes split only for readability. Regression tests should cover both sides: keep the variable-fragment cases valid, and add invalid cases for new RegExp('[A' + '\\u0301]') and new RegExp('[\\uD83D' + '\\uDC4D]').

Comment: Improvement suggestion: the seam metadata should not treat every static `+` boundary as a boundary where S5868 must suppress adjacency checks.

Literal-only concatenation is still a fully static regex source. For example, `new RegExp('[A' + '\\u0301]')` is equivalent to `new RegExp('[A\\u0301]')`, so S5868 should still report the Unicode grapheme cluster inside the character class. The same applies to split surrogate literals such as `new RegExp('[\\uD83D' + '\\uDC4D]')`.

A concrete fix would be to track whether an extracted pattern came from a pure inline literal expression:

```ts
type ExtractedRegexPattern = {
  pattern: string;
  flags: string;
  seams: number[];
  isPureLiteral: boolean;
};
```

Then literals, static templates, and `String.raw` templates return `isPureLiteral: true`; identifier resolution returns the resolved pattern with `isPureLiteral: false`; and binary `+` only records the current split offset when at least one side is not pure literal:

```ts
const splitSeams = left.isPureLiteral && right.isPureLiteral ? [] : [splitOffset];

return {
  pattern: left.pattern + right.pattern,
  flags: '',
  seams: [...left.seams, ...splitSeams, ...right.seams.map(s => s + splitOffset)],
  isPureLiteral: left.isPureLiteral && right.isPureLiteral,
};
```

This keeps the false-positive fix for variable/fragment boundaries, while preserving true positives for hard-coded regexes split only for readability. Regression tests should cover both sides: keep the variable-fragment cases valid, and add invalid cases for `new RegExp('[A' + '\\u0301]')` and `new RegExp('[\\uD83D' + '\\uDC4D]')`.
@sonar-nigel sonar-nigel Bot temporarily deployed to sca-checking May 29, 2026 16:43 Inactive
@sonarqube-next
Copy link
Copy Markdown

@francois-mora-sonarsource francois-mora-sonarsource requested review from a team and removed request for francois-mora-sonarsource May 29, 2026 17:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant