[SPARK-56547][SQL][CONNECT] Improve error message when DataFrame column is referenced from a wrong DataFrame #55421
Draft
zhengruifeng wants to merge 2 commits into apache:master
### What changes were proposed in this pull request?
When a user references a column from a different DataFrame whose value was
overwritten in the current DataFrame (e.g. `df2.select(df1["id"])` where `df2 =
df1.withColumn("id", ...)`), the analyzer raised a misleading
`UNRESOLVED_COLUMN.WITH_SUGGESTION` error that suggested the exact column name
that was referenced.
This PR detects the case in `ColumnResolutionHelper.resolveDataFrameColumnRecursively`:
when a DataFrame column is resolved against the plan node matching the
`PLAN_ID_TAG` but its resolved attribute is filtered out because it's not in
some ancestor operator's output, we tag the `UnresolvedAttribute`. If the
attribute is still unresolved after the remaining analyzer rules (e.g.
`ResolveReferencesInSort` may still resolve it by promoting hidden output),
`CheckAnalysis` raises the existing `CANNOT_RESOLVE_DATAFRAME_COLUMN` error
which clearly points out the wrong DataFrame reference.
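The tag-and-defer flow described above can be sketched with a toy model. This is plain illustrative Python, not the actual Catalyst classes: `Attribute`, `PlanNode`, the `filtered_out` flag, and the traversal are simplified stand-ins for `AttributeReference` (with its `ExprId`), the logical plan, the new tag, and `resolveDataFrameColumnRecursively`.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Attribute:
    name: str
    expr_id: int  # like Catalyst's ExprId: identity matters, not just the name

@dataclass
class PlanNode:
    plan_id: int
    output: tuple               # Attributes this operator produces
    child: "PlanNode | None" = None

@dataclass
class UnresolvedAttribute:
    name: str
    plan_id: int                # PLAN_ID_TAG from the originating DataFrame
    filtered_out: bool = False  # set when an ancestor drops the resolved attr

def resolve_dataframe_column(attr, current):
    # 1. Find the plan node the DataFrame column was created against.
    node, target = current, None
    while node is not None:
        if node.plan_id == attr.plan_id:
            target = node
            break
        node = node.child
    if target is None:
        return None
    match = next((a for a in target.output if a.name == attr.name), None)
    if match is None:
        return None
    # 2. The resolved attribute must survive every ancestor's output;
    #    otherwise tag the reference so CheckAnalysis can report it properly.
    node = current
    while node is not target:
        if match not in node.output:
            attr.filtered_out = True
            return None
        node = node.child
    return match

def fail_unresolved(attr):
    # Simplified CheckAnalysis: prefer the DataFrame-specific error.
    if attr.filtered_out:
        raise ValueError("CANNOT_RESOLVE_DATAFRAME_COLUMN")
    raise ValueError("UNRESOLVED_COLUMN.WITH_SUGGESTION")

# df1 produces id#1; df2 = df1.withColumn("id", ...) replaces it with id#2,
# so df1["id"] resolves against df1 but is filtered out by df2's output.
df1 = PlanNode(plan_id=1, output=(Attribute("id", 1),))
df2 = PlanNode(plan_id=2, output=(Attribute("id", 2),), child=df1)
```

With this model, resolving `UnresolvedAttribute("id", plan_id=1)` under `df2` returns `None` and sets `filtered_out`, so the check phase raises `CANNOT_RESOLVE_DATAFRAME_COLUMN` instead of the misleading suggestion.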
### Why are the changes needed?
The previous error was confusing — it said a column couldn't be resolved and
suggested the same column name as a fix, giving no hint that the real issue
was a reference to the wrong DataFrame.
### Does this PR introduce _any_ user-facing change?
Yes, the error class for the specific case of referencing a column from a
different DataFrame changes from `UNRESOLVED_COLUMN.WITH_SUGGESTION` to
`CANNOT_RESOLVE_DATAFRAME_COLUMN`. The latter already contains guidance about illegal DataFrame references.
### How was this patch tested?
Updated `test_invalid_column` in `test_connect_error.py` to cover both the
existing case (`cdf3.select(cdf1.b)` where `cdf3 = cdf1.select(cdf1.a)`) and
the new `withColumn` case.
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code
Co-authored-by: Isaac
Any unresolved attribute with `PLAN_ID_TAG` at `CheckAnalysis` time is a Spark Connect DataFrame column reference that couldn't be reached from the current operator, so raise `CANNOT_RESOLVE_DATAFRAME_COLUMN` directly without introducing a dedicated tag.

Co-authored-by: Isaac
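The simplified approach in the note above can be sketched as a toy decision function. This is illustrative Python, not the Scala source; `tags` and `candidates` are stand-ins for the real `TreeNode` tags and the operator's output column names.

```python
# Illustrative sketch of the revised failUnresolvedAttribute decision:
# an unresolved attribute that still carries PLAN_ID_TAG at check time must
# be a Spark Connect DataFrame column reference, because plan-id-based
# resolution already ran and failed for it.
PLAN_ID_TAG = "__plan_id"  # toy stand-in for the real tree-node tag key

def fail_unresolved_attribute(name, tags, candidates):
    """Pick an error class for an unresolved attribute `name`."""
    if PLAN_ID_TAG in tags:
        # A similarity-based suggestion would be misleading here: the column
        # name may exist, but it belongs to a different DataFrame's plan.
        return "CANNOT_RESOLVE_DATAFRAME_COLUMN"
    if candidates:
        # The real rule ranks candidates by name similarity; omitted here.
        return "UNRESOLVED_COLUMN.WITH_SUGGESTION"
    return "UNRESOLVED_COLUMN.WITHOUT_SUGGESTION"
```

The design choice is that the tag itself is a sufficient signal, so no extra bookkeeping tag needs to be threaded through the resolver.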
### What changes were proposed in this pull request?
In `CheckAnalysis.failUnresolvedAttribute`, when the unresolved attribute carries `PLAN_ID_TAG` (i.e. a Spark Connect DataFrame column reference like `df.col_name` or `df["col_name"]`), raise the existing `CANNOT_RESOLVE_DATAFRAME_COLUMN` error instead of the generic `UNRESOLVED_COLUMN`. The tag reliably identifies DataFrame column references that the plan-id-based resolution already tried and failed to resolve, so the dedicated error's message ("illegal references like `df1.select(df2.col("a"))`") is always more accurate than a similarity-based suggestion.
### Why are the changes needed?
The previous error was misleading. For example, `df2.select(df1["id"])` where `df2 = df1.withColumn("id", ...)` produced an `UNRESOLVED_COLUMN.WITH_SUGGESTION` error. The message says `id` can't be resolved and suggests `id` as a fix, with no hint that the real issue is a reference to a column from a different DataFrame whose value was overwritten by `withColumn`.
### Does this PR introduce _any_ user-facing change?
Yes. For DataFrame column references that fail to resolve in Spark Connect, the error class changes from `UNRESOLVED_COLUMN.WITH_SUGGESTION` (or `UNRESOLVED_COLUMN.WITHOUT_SUGGESTION`) to `CANNOT_RESOLVE_DATAFRAME_COLUMN`.
### How was this patch tested?
Updated `test_invalid_column` in `python/pyspark/sql/tests/connect/test_connect_error.py`:
- `cdf3.select(cdf1.b)` where `cdf3 = cdf1.select(cdf1.a)` (existing case, error class updated).
- `cdf3.select(cdf1.a)` where `cdf3 = cdf1.withColumn("a", F.lit(0))` (new case mirroring the reported issue).
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Opus 4.7)