[SPARK-54908][SQL] Support dateFormat option during JSON schema inference#55417
[SPARK-54908][SQL] Support dateFormat option during JSON schema inference#55417yadavay-amzn wants to merge 1 commit intoapache:masterfrom
Conversation
| // this behavior now. See SPARK-46769 for more details. | ||
| if (SparkDateTimeUtils.stringToTimestampWithoutTimeZone(utf8Value, false).isDefined) { | ||
| TimestampType | ||
| } else if (allCatch.opt(dateFormatter.parse(field)).isDefined) { |
There was a problem hiding this comment.
I think the problem here is the performance. It becomes too slower
There was a problem hiding this comment.
Thanks for the review! Updated with two changes:
-
Date inference only runs when the user explicitly sets
dateFormatInRead. When not set the code path is never reached. -
Added
DateFormatter.parseOptional()that usesDateTimeFormatter.parseUnresolved()instead of exception-based control flow. So for users who do setdateFormat, the performance is comparable to the existing timestamp inference overhead.
Please let me know if you have any suggestions on how it can be improved. Thanks!
There was a problem hiding this comment.
Thing is that we should also support timestamp format too. How would we handle the precedence? Suppose that both are set differently.
There was a problem hiding this comment.
The current code already handles this — timestamp is tried first, date only if timestamp doesn't
match any of the following:
- TimestampNTZType (if default NTZ + matches timestampNTZ format)
- TimestampType (if matches timestamp format)
- DateType (if matches date format)
- StringType (fallback)
So if both timestampFormat and dateFormat are set and a string matches both, timestamp wins. Date
inference only kicks in for strings that fail all timestamp parsers.
That said — should we also gate timestamp inference on an explicit timestampFormat being set, for
consistency? Currently it's controlled by inferTimestamp (a separate boolean config). Happy to align
the two if you think that's the right approach.
There was a problem hiding this comment.
Can we add the explicit test cases? E.g., timstamp format as YYYY and date format YYYY-DD, vise versa. I want to make sure that this all works sound with all type coercion
There was a problem hiding this comment.
Added explicit test cases covering the precedence and type coercion:
- Timestamp takes precedence over date — when both formats match the same string (e.g.,
"2024"withtimestampFormat=yyyyanddateFormat=yyyy),TimestampTypewins. - Date inferred when timestamp does not match —
"2024-07-02"withtimestampFormat=yyyy-MM-dd HH:mm:ssanddateFormat=yyyy-MM-dd→DateType. - No date inference without explicit dateFormat — date-like strings stay
StringTypewhendateFormatis not set. - Type coercion across rows — row 1 infers
DateType, row 2 infersTimestampType→ merged toTimestampType.
Also added the missing DateType + TimestampType → TimestampType coercion rule (previously only had DateType + TimestampNTZType → TimestampNTZType).
All 5 tests pass on both JsonV1Suite and JsonV2Suite.
bd5579d to
e477cdc
Compare
…ence Add date type inference in JsonInferSchema using the dateFormat option, mirroring existing timestamp inference. When inferTimestamp is enabled, strings that don't match timestamp formats are now also tried against the dateFormat. If they match, DateType is inferred instead of StringType. Also adds DateType vs TimestampNTZType compatibility in compatibleType so that merging these types produces TimestampNTZType (matching the existing DateType vs TimestampType -> TimestampType precedence).
e477cdc to
ef8d0bb
Compare
| override def parseOptional(s: String): Option[Int] = { | ||
| val pp = new java.text.ParsePosition(0) | ||
| val parsed = formatter.parseUnresolved(s, pp) | ||
| if (parsed != null && pp.getErrorIndex == -1 && pp.getIndex == s.length) { |
What changes were proposed in this pull request?
Add date type inference in JsonInferSchema using the dateFormat option, mirroring existing timestamp inference.
Why are the changes needed?
The dateFormat option is documented but has no effect during schema inference, while timestampFormat works correctly.
Does this PR introduce any user-facing change?
Yes, JSON schema inference now correctly infers DateType when dateFormat is specified.
How was this patch tested?
Added unit tests verifying date inference with custom dateFormat.
Was this patch authored or co-authored using generative AI tooling?
Yes