Skip to content

[SPARK-54908][SQL] Support dateFormat option during JSON schema inference#55417

Draft
yadavay-amzn wants to merge 1 commit intoapache:masterfrom
yadavay-amzn:fix/SPARK-54908-json-date-infer
Draft

[SPARK-54908][SQL] Support dateFormat option during JSON schema inference#55417
yadavay-amzn wants to merge 1 commit intoapache:masterfrom
yadavay-amzn:fix/SPARK-54908-json-date-infer

Conversation

@yadavay-amzn
Copy link
Copy Markdown

What changes were proposed in this pull request?

Add date type inference in JsonInferSchema using the dateFormat option, mirroring existing timestamp inference.

Why are the changes needed?

The dateFormat option is documented but has no effect during schema inference, while timestampFormat works correctly.

Does this PR introduce any user-facing change?

Yes, JSON schema inference now correctly infers DateType when dateFormat is specified.

How was this patch tested?

Added unit tests verifying date inference with custom dateFormat.

Was this patch authored or co-authored using generative AI tooling?

Yes

// this behavior now. See SPARK-46769 for more details.
if (SparkDateTimeUtils.stringToTimestampWithoutTimeZone(utf8Value, false).isDefined) {
TimestampType
} else if (allCatch.opt(dateFormatter.parse(field)).isDefined) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem here is the performance. It becomes too slower

Copy link
Copy Markdown
Author

@yadavay-amzn yadavay-amzn Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the review! Updated with two changes:

  1. Date inference only runs when the user explicitly sets dateFormatInRead. When not set the code path is never reached.

  2. Added DateFormatter.parseOptional() that uses DateTimeFormatter.parseUnresolved() instead of exception-based control flow. So for users who do set dateFormat, the performance is comparable to the existing timestamp inference overhead.

Please let me know if you have any suggestions on how it can be improved. Thanks!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thing is that we should also support timestamp format too. How would we handle the precedence? Suppose that both are set differently.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current code already handles this — timestamp is tried first, date only if timestamp doesn't
match any of the following:

  1. TimestampNTZType (if default NTZ + matches timestampNTZ format)
  2. TimestampType (if matches timestamp format)
  3. DateType (if matches date format)
  4. StringType (fallback)

So if both timestampFormat and dateFormat are set and a string matches both, timestamp wins. Date
inference only kicks in for strings that fail all timestamp parsers.

That said — should we also gate timestamp inference on an explicit timestampFormat being set, for
consistency? Currently it's controlled by inferTimestamp (a separate boolean config). Happy to align
the two if you think that's the right approach.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add the explicit test cases? E.g., timstamp format as YYYY and date format YYYY-DD, vise versa. I want to make sure that this all works sound with all type coercion

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added explicit test cases covering the precedence and type coercion:

  1. Timestamp takes precedence over date — when both formats match the same string (e.g., "2024" with timestampFormat=yyyy and dateFormat=yyyy), TimestampType wins.
  2. Date inferred when timestamp does not match"2024-07-02" with timestampFormat=yyyy-MM-dd HH:mm:ss and dateFormat=yyyy-MM-ddDateType.
  3. No date inference without explicit dateFormat — date-like strings stay StringType when dateFormat is not set.
  4. Type coercion across rows — row 1 infers DateType, row 2 infers TimestampType → merged to TimestampType.

Also added the missing DateType + TimestampType → TimestampType coercion rule (previously only had DateType + TimestampNTZType → TimestampNTZType).

All 5 tests pass on both JsonV1Suite and JsonV2Suite.

@yadavay-amzn yadavay-amzn force-pushed the fix/SPARK-54908-json-date-infer branch from bd5579d to e477cdc Compare April 19, 2026 23:55
…ence

Add date type inference in JsonInferSchema using the dateFormat option,
mirroring existing timestamp inference. When inferTimestamp is enabled,
strings that don't match timestamp formats are now also tried against
the dateFormat. If they match, DateType is inferred instead of StringType.

Also adds DateType vs TimestampNTZType compatibility in compatibleType
so that merging these types produces TimestampNTZType (matching the
existing DateType vs TimestampType -> TimestampType precedence).
@yadavay-amzn yadavay-amzn force-pushed the fix/SPARK-54908-json-date-infer branch from e477cdc to ef8d0bb Compare April 20, 2026 07:16
override def parseOptional(s: String): Option[Int] = {
val pp = new java.text.ParsePosition(0)
val parsed = formatter.parseUnresolved(s, pp)
if (parsed != null && pp.getErrorIndex == -1 && pp.getIndex == s.length) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants