
fix: handle multi-byte UTF-8 chars in SQL special char detection #4458

Merged: fzipi merged 1 commit into main from fix/multibyte-chars on Feb 17, 2026


fzipi (Member) commented Feb 16, 2026

what

  • Extract multi-byte UTF-8 characters (´ U+00B4, ‘ U+2018, ’ U+2019) from regex character classes into alternations to prevent byte-by-byte matching that caused false positives with non-Latin scripts (Chinese, Japanese, Arabic, Korean, Hebrew)
  • Affects rules: 942420, 942421, 942430, 942431, 942432
  • Creates shared regex-assembly/include/sql-special-chars-anomaly.ra include file and composable .ra files for all 5 rules using named assemblies
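
The byte-by-byte effect described above can be reproduced with a minimal Python sketch. It assumes a byte-oriented regex engine, which Python's `re` emulates when given `bytes` patterns. The names `class_pattern`, `alt_pattern`, and `hits` are hypothetical, and the patterns are simplified stand-ins that hold only the three multi-byte characters, not the actual rule regexes.

```python
import re

# The three multi-byte characters named in this PR: ´ (U+00B4), ‘ (U+2018), ’ (U+2019).
MULTIBYTE = ["\u00b4", "\u2018", "\u2019"]

# In a byte-oriented engine, a character class holding multi-byte characters
# becomes a class of their individual bytes (C2, B4, E2, 80, 98, 99).
class_pattern = re.compile(b"[" + "".join(MULTIBYTE).encode("utf-8") + b"]")

# An alternation keeps each character's full byte sequence intact.
alt_pattern = re.compile(
    b"(?:" + b"|".join(c.encode("utf-8") for c in MULTIBYTE) + b")"
)


def hits(pattern: re.Pattern, text: str) -> int:
    """Count non-overlapping matches of a bytes pattern in UTF-8 encoded text."""
    return len(pattern.findall(text.encode("utf-8")))


# "一" (U+4E00) encodes to E4 B8 80; its last byte collides with a byte of ‘ and ’.
print(hits(class_pattern, "\u4e00"), hits(alt_pattern, "\u4e00"))
# "ش" (U+0634) encodes to D8 B4; its last byte collides with the second byte of ´.
print(hits(class_pattern, "\u0634"), hits(alt_pattern, "\u0634"))
# A real ’ is counted once by the alternation but three times by the byte class.
print(hits(class_pattern, "\u2019"), hits(alt_pattern, "\u2019"))
```

In this sketch the byte class reports hits inside ordinary Chinese and Arabic characters, while the alternation matches only the intended characters.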

Test plan

  • All 52 regression tests pass across 5 rules (942420, 942421, 942430, 942431, 942432)
  • crs-toolchain regex compare confirms generated regex matches for all 5 rules
  • Tests cover: ASCII special chars, multi-byte UTF-8 special chars (´ ‘ ’), and negative tests for non-Latin scripts (Chinese, Japanese, Arabic, Korean, Hebrew)
  • CI passes

refs

github-actions (Bot, Contributor) commented Feb 16, 2026

📊 Quantitative test results for language: eng, year: 2023, size: 10K, paranoia level: 1:
🚀 Quantitative testing did not detect new false positives

Comment thread regex-assembly/include/sql-special-chars-anomaly.ra
fzipi (Member, Author) commented Feb 16, 2026

Need to try using just the include and prefix and suffix. The only thing that should change is the amount of chars in the suffix 🤷

fzipi force-pushed the fix/multibyte-chars branch from e5a106b to a5ea2fd on February 16, 2026 18:34
fzipi changed the title from "fix(942430): handle multi-byte UTF-8 chars in SQL special char detection" to "fix: handle multi-byte UTF-8 chars in SQL special char detection" on Feb 16, 2026
fzipi requested review from Copilot and theseion on February 16, 2026 18:37
fzipi (Member, Author) commented Feb 16, 2026

> Need to try using just the include and prefix and suffix. The only thing that should change is the amount of chars in the suffix 🤷

Worked like a charm! You need to ❤️ the crs-toolchain devs! 🎸
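
The "shared include plus prefix and suffix" idea from this comment can be sketched in Python. This is an illustration under stated assumptions, not the actual `.ra` sources: `shared_include` and `build_rule_pattern` are hypothetical names, the ASCII set is a simplified stand-in, and the repetition counts used below are arbitrary examples rather than the thresholds of the real rules.

```python
import re

# Simplified stand-in for the ASCII special characters (assumption, not the rule's exact set).
SPECIAL_ASCII = b"~!@#$%^&*()-+={}[]|:;\"'`<>"
# Multi-byte characters kept as whole byte sequences via alternation.
SPECIAL_MULTIBYTE = [c.encode("utf-8") for c in ("\u00b4", "\u2018", "\u2019")]


def shared_include() -> bytes:
    """Shared fragment: one special character, ASCII via class, multi-byte via alternation."""
    ascii_class = b"[" + re.escape(SPECIAL_ASCII) + b"]"
    return b"(?:" + b"|".join([ascii_class] + SPECIAL_MULTIBYTE) + b")"


def build_rule_pattern(count: int) -> re.Pattern:
    """Wrap the shared fragment in a common prefix and a suffix that differs only in the count."""
    prefix = b"(?:"
    suffix = b".*?){" + str(count).encode("ascii") + b"}"
    return re.compile(prefix + shared_include() + suffix)


strict = build_rule_pattern(3)   # hypothetical low threshold
relaxed = build_rule_pattern(6)  # hypothetical higher threshold

print(bool(strict.search("a=1;b=2;".encode("utf-8"))))   # four ASCII specials
print(bool(relaxed.search("a=1;b=2;".encode("utf-8"))))  # fewer than six specials
print(bool(strict.search("一二三四五六七八九十".encode("utf-8"))))  # plain Chinese text
```

Each pattern reuses the same fragment, so a change to the special character set only has to be made in one place.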

Copilot AI (Contributor) left a comment

Pull request overview

Updates CRS SQLi “special character anomaly” detection to correctly handle multi-byte UTF-8 quote-like characters, preventing false positives on non‑Latin scripts while keeping behavior consistent across supported engines.

Changes:

  • Refactors 5 SQLi anomaly regexes (942420/942421/942430/942431/942432) to match UTF-8 multi-byte characters via alternation (byte sequences) rather than inside character classes.
  • Introduces a shared regex-assembly include (sql-special-chars-anomaly.ra) plus composable .ra sources for each of the 5 rules.
  • Expands regression coverage with new positive/negative cases (including non‑Latin text) for all affected rules.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| tests/regression/tests/REQUEST-942-APPLICATION-ATTACK-SQLI/942420.yaml | Adds/updates cookie-focused regression tests, including UTF‑8 quotes and non‑Latin negative cases. |
| tests/regression/tests/REQUEST-942-APPLICATION-ATTACK-SQLI/942421.yaml | Adds/updates PL4 cookie anomaly regression tests with UTF‑8 quote and non‑Latin negatives. |
| tests/regression/tests/REQUEST-942-APPLICATION-ATTACK-SQLI/942430.yaml | Adds extensive args anomaly regression tests, including UTF‑8 quotes/acute accent and multiple non‑Latin negatives. |
| tests/regression/tests/REQUEST-942-APPLICATION-ATTACK-SQLI/942431.yaml | Adds args anomaly regression tests (incl. UTF‑8 quote) and improves existing negative array-name cases. |
| tests/regression/tests/REQUEST-942-APPLICATION-ATTACK-SQLI/942432.yaml | Adds args anomaly regression tests for UTF‑8 quotes and multiple negatives to validate FP reductions. |
| rules/REQUEST-942-APPLICATION-ATTACK-SQLI.conf | Updates the 5 rule regexes to avoid byte-by-byte matching of multi-byte UTF‑8 chars. |
| regex-assembly/include/sql-special-chars-anomaly.ra | New shared include defining ASCII and UTF‑8 special-char matching as safe alternations. |
| regex-assembly/942420.ra | New regex-assembly source for rule 942420 using the shared include. |
| regex-assembly/942421.ra | New regex-assembly source for rule 942421 using the shared include. |
| regex-assembly/942430.ra | New regex-assembly source for rule 942430 using the shared include. |
| regex-assembly/942431.ra | New regex-assembly source for rule 942431 using the shared include. |
| regex-assembly/942432.ra | New regex-assembly source for rule 942432 using the shared include. |


Comment thread rules/REQUEST-942-APPLICATION-ATTACK-SQLI.conf
Extract multi-byte UTF-8 characters (´ U+00B4, ‘ U+2018, ’ U+2019)
from regex character classes into alternations to prevent byte-by-byte
matching that caused false positives with non-Latin scripts (Chinese,
Japanese, Arabic, Korean, Hebrew).

Affects rules: 942420, 942421, 942430, 942431, 942432.

Creates shared include file sql-special-chars-anomaly.ra and composable
.ra files for all 5 rules using named assemblies.

Closes #3325

fzipi force-pushed the fix/multibyte-chars branch from a5ea2fd to bd801f3 on February 16, 2026 19:50
fzipi requested a review from a team on February 16, 2026 19:59
fzipi added this pull request to the merge queue on Feb 17, 2026
Merged via the queue into main with commit 6130910 on Feb 17, 2026
8 checks passed
fzipi deleted the fix/multibyte-chars branch on February 17, 2026 13:49

Linked issue (closed by merging this pull request): Fix regex patterns that look for multi-byte characters

3 participants