Print Data Using PySpark

When you print a PySpark DataFrame using Python’s built-in print() function, you do not see your data. You see metadata: column names, data types, and partition information. That is not a bug. It is by design. PySpark uses lazy evaluation, which means transformations build an execution plan but nothing executes until an action runs. Understanding this is the first step to working with data in distributed environments.
For tabular data inspection, PySpark provides show(), printSchema(), head(), and collect(). Each serves a distinct purpose and choosing the wrong one causes real problems: collect() crashes drivers on large datasets, print() gives misleading output, and show() silently truncates long strings. This guide covers all of them with working code.
TLDR
- Use show() for console inspection. It is the primary debugging tool in PySpark.
- Use printSchema() to understand column names, types, and nullability before transformations.
- Use head(n) or take(n) for programmatic access to a few rows as Row objects.
- Use collect() only when the entire dataset fits in driver memory.
- Use toPandas() only when the entire dataset fits in driver memory and you need pandas ecosystem tools.
- Always pass truncate=False to show() when you need to inspect full string values.
PySpark DataFrames vs Pandas DataFrames
PySpark DataFrames are distributed collections of data organized into named columns, similar in concept to tables in a relational database. The key difference is where the data lives. A pandas DataFrame is a single in-process Python object backed by NumPy arrays. A PySpark DataFrame is a logical plan distributed across multiple JVM processes on different worker nodes.
When you call df.filter(df.age > 30) in PySpark, nothing happens immediately. The operation queues in a DAG (directed acyclic graph) inside the Catalyst optimizer. When you call df.filter(df.age > 30).show(), Spark compiles the plan, ships the filter to worker nodes, runs the operation in parallel, and streams the first 20 rows back. This matters because printing is not just viewing data. It is triggering a distributed computation.
If you need to compare PySpark and pandas side by side, the practical rule is this: pandas is for data that fits in RAM on a single machine. PySpark is for data distributed across a cluster. The APIs look similar but the execution model is fundamentally different.
print() vs show() vs display()
Python’s print() on a PySpark DataFrame produces output like this:
df = spark.read.csv("titanic.csv", header=True, inferSchema=True)
print(df)
# Output:
# DataFrame[PassengerId: int, Survived: int, Pclass: int, Name: string, Sex: string, Age: double, SibSp: int, Parch: int, Ticket: string, Fare: double, Cabin: string, Embarked: string]
That is the schema representation, not the data. print() is useful for debugging the shape of your DataFrame during development but never for inspecting actual rows.
show() is the standard method for printing rows from a PySpark DataFrame. It triggers an action and prints a formatted table to stdout.
df.show()
# +-----------+--------+------+--------------------+------+----+
# |PassengerId|Survived|Pclass|                Name|   Sex| Age|
# +-----------+--------+------+--------------------+------+----+
# |          1|       0|     3|Braund, Mr. Owen ...|  male|22.0|
# |          2|       1|     1|Cumings, Mrs. Joh...|female|38.0|
# |          3|       1|     3|Heikkinen, Miss. ...|female|26.0|
# +-----------+--------+------+--------------------+------+----+
# only showing top 20 rows
display() is a notebook function, not part of PySpark itself. Databricks notebooks provide it built in, and there it renders the DataFrame as a formatted HTML table with better visual styling than the plain-text output of show(). In standard Jupyter, display() comes from IPython and, on a PySpark DataFrame, shows only the schema representation unless eager evaluation is enabled. In a plain Python script or PySpark shell, display() does not exist.
# Works in Jupyter/DB notebooks, not in spark-submit or pyspark shell
display(df)
The practical summary: use print() for quick type checking, use show() for development debugging, and use display() when you want prettier output in notebook environments.
show() Method Deep Dive
The show() method signature accepts three key parameters: n for the number of rows, truncate for string column width, and vertical for rotated column orientation.
The default n=20 shows 20 rows. Override it by passing an integer:
df.show(5) # shows top 5 rows only
Truncate defaults to True, which caps string columns at 20 characters using ellipsis. Disable truncation entirely with False:
df.show(truncate=False)
Setting truncate=50 limits string display to 50 characters, which is useful when you want partial visibility without the wall-of-text effect that False produces on wide datasets:
df.show(truncate=50)
Vertical mode is useful when a DataFrame has many wide columns. Each row displays as a separate block with one column-value pair per line:
df.select("Name", "Sex", "Age", "Fare", "Pclass", "Survived").show(5, vertical=True)
Here is the output in vertical mode for a single row:
-RECORD 0----------------------
 Name     | Braund, Mr. Owen ...
 Sex      | male
 Age      | 22.0
 Fare     | 7.25
 Pclass   | 3
 Survived | 0
only showing top 5 rows
Vertical mode works well for DataFrames with 5-10 columns. Beyond that, even single-row displays become unwieldy. When debugging complex struct columns or deeply nested data, vertical mode often reveals information that horizontal tables hide entirely.
printSchema() and dtypes
Understanding DataFrame structure matters for everything that follows. PySpark DataFrames store data in a distributed, columnar format optimized for parallel processing, which differs fundamentally from the in-memory row-wise format of pandas. Each partition lives on a different worker node, and schema information is managed centrally by the Catalyst optimizer rather than per-partition.
printSchema() prints the full schema tree including nested struct types and array elements:
df.printSchema()
# root
# |-- PassengerId: integer (nullable = true)
# |-- Survived: integer (nullable = true)
# |-- Pclass: integer (nullable = true)
# |-- Name: string (nullable = true)
# |-- Sex: string (nullable = true)
# |-- Age: double (nullable = true)
# |-- SibSp: integer (nullable = true)
# |-- Parch: integer (nullable = true)
# |-- Ticket: string (nullable = true)
# |-- Fare: double (nullable = true)
# |-- Cabin: string (nullable = true)
# |-- Embarked: string (nullable = true)
The nullable flag tells you whether a column can contain null values, which is critical when writing transformations. If you apply df.fillna() or df.dropna(), knowing which columns are nullable changes your strategy.
For a simplified, flat view showing only the top-level columns with their types, use the dtypes property, which returns a list of (column name, type string) tuples without the full tree format:
for name, dtype in df.dtypes:
    print(name, dtype)
# PassengerId int
# Survived int
# Pclass int
# Name string
# Sex string
# Age double
# SibSp int
# Parch int
# Ticket string
# Fare double
# Cabin string
# Embarked string
Both methods are introspection tools, not actions. They do not trigger job execution because the schema is known at plan-build time.
head() and take() for Programmatic Access
When you need rows inside Python logic rather than console output, head() and take() are the right tools. They return lists of pyspark.sql.Row objects, which behave like dictionaries but support attribute access.
head() returns the first row as a single Row object:
first_row = df.head()
print(first_row.Name) # attribute access
print(first_row["Age"]) # dictionary access
head(n) and take(n) return the first n rows as a list:
first_10 = df.head(10)
first_10_also = df.take(10)
print(type(first_10)) # list
print(first_10[0].Name) # first row's Name
Both methods are actions. They trigger job execution. The number of rows returned controls how much data flows back to the driver. head(100) pulls 100 rows. take(1000) pulls 1000 rows. You control the volume.
These methods are useful for writing assertions in tests:
sample = df.filter(df.Survived == 1).take(5)
assert len(sample) == 5
assert all(row.Survived == 1 for row in sample)
The practical limit is driver memory. If you request take(100000) on a wide DataFrame, you risk OOMing the driver regardless of cluster size. Size your samples conservatively.
Collecting Data to Driver: collect() vs toPandas()
When you need the full dataset on the driver, collect() is the method that brings all partitions back at once. It returns a list of Row objects containing every row from every partition:
all_rows = df.collect()
This makes it useful for small reference datasets or when you need to run Python-specific logic on the complete data. But collect() has a hard limit: your dataset must fit entirely in the driver process memory. A 100GB dataset on a cluster cannot be collected to a driver with 64GB of RAM. That throws an OutOfMemoryError.
For production pipelines processing gigabytes or terabytes, collect() is almost never the answer. The show() method is safer because it internally limits output to 20 rows and streams them rather than loading everything at once.
If you need all data on the driver for local processing, use toPandas() instead. It converts the entire PySpark DataFrame to a pandas DataFrame in the driver's Python process, which enables use of pandas-only libraries:
pandas_df = df.toPandas()
print(pandas_df.head())
The same memory constraint applies. toPandas() also materializes all data through the JVM into Python process memory, which means the data is effectively duplicated during the conversion. For very large datasets, this can cause OOM even when individual partition sizes seem manageable.
For large datasets, consider df.pandas_api() (the pandas API on Spark, available since Spark 3.2), which returns a pandas-like object backed by PySpark without full materialization. But that is an advanced topic beyond this guide.
Jupyter and Notebook Display Tricks
If you are working in a Jupyter notebook with spark.sql.repl.eagerEval.enabled set to true, the notebook automatically renders PySpark DataFrames as HTML tables when they are the last expression in a cell. This display hook is convenient, but it only triggers when the DataFrame is the final statement:
# This displays automatically (DataFrame is last expression)
df.filter(df.Age > 30).select("Name", "Age", "Pclass")
# This does NOT display (variable assignment suppresses output)
result = df.filter(df.Age > 30).select("Name", "Age", "Pclass")
To force display in cases where the DataFrame is not the last expression, use display() explicitly or call df.show():
from IPython.display import display
display(df.filter(df.Age > 30))
In Databricks notebooks, display() renders richer output with built-in charting for common plot types. PySpark DataFrames in standard Jupyter require additional setup for charting.
Calling spark.sparkContext.setLogLevel("WARN") reduces verbose Spark log output during a session, though it does not affect DataFrame output. Setting spark.sql.repl.eagerEval.enabled=true (available since Spark 2.4) enables eager evaluation in notebook REPLs, which renders the top rows of a DataFrame whenever it is echoed as a cell result. This feature is useful during development for quick data inspection but adds evaluation overhead, so it should be disabled in production pipelines.
Common Pitfalls
Lazy Evaluation
When you write df.filter(df.Age > 30).select("Name", "Age") and inspect your code, nothing executes yet. PySpark records the plan. Only when you call an action like show(), collect(), or count() does Spark compile and run the plan across the cluster.
This confuses beginners because there are no errors during the plan-building phase. If your filter has a bug or a column name is wrong, PySpark silently accepts it and reports the error only when an action runs. This delayed feedback loop makes debugging harder.
The practical habit: after writing a chain of transformations, immediately call show(5) or head(5) to verify the output looks right before continuing.
Out of Memory from collect()
Calling collect() on a large DataFrame attempts to pull all rows from all partitions into the driver JVM. If the total dataset size exceeds available driver memory, the JVM throws OutOfMemoryError. This is one of the most common production failures in PySpark applications.
A DataFrame with 10 million rows and 50 columns, each string averaging 100 bytes, is roughly 50GB of data. A driver with 64GB RAM cannot hold it. Even if the math looks close, account for JVM heap overhead, Spark internal structures, and the copy made during deserialization.
The fix: never use collect() on large DataFrames. Use show() for inspection, or write results to distributed storage (Parquet, JDBC, Delta Lake) and read them back in separate jobs.
Truncated Output Hiding Real Values
By default, show() truncates string columns at 20 characters. A truncated email address like "jane.doe@example..." looks clean but hides the rest of the value. The truncation is display-only, yet it can mislead you into writing joins, regexes, or length checks against a value you never saw in full.
The habit: always use truncate=False when inspecting string columns that will be used in joins, regex extraction, or type conversion:
df.select("email", "ticket", "name").show(truncate=False)
Wrong Method in Wrong Environment
display() works in Jupyter and Databricks notebooks. It does not exist in pyspark shell or spark-submit. Beginners sometimes write notebooks that work and then wonder why the same code fails in a production submission. Test your code in the same environment where it runs.
head() vs first() vs take()
df.first() returns the very first row of the DataFrame as a single Row object, equivalent to df.head(1)[0]. df.head(n) returns a list of n Row objects. df.take(n) is identical to df.head(n) in behavior. Choose based on naming clarity: use first() when you want one row, head(n) or take(n) when you want multiple.
FAQ
Why does print() not show my PySpark DataFrame data?
Python’s print() calls the DataFrame’s string representation, which outputs the schema (column names and types) rather than actual rows. PySpark uses lazy evaluation, so transformations like select() or filter() do not execute until you call an action method like show().
How do I show all rows in a PySpark DataFrame?
There is no built-in method to show all rows at once. PySpark is designed for distributed data where printing millions of rows is impractical. Use show(n) with a reasonable sample size, or write to a file and inspect it with external tools. For small DataFrames, collect() brings all rows to the driver.
What is the difference between collect() and toPandas()?
collect() returns a list of PySpark Row objects. toPandas() converts the DataFrame to a single pandas DataFrame in the driver’s Python memory. Both require the full dataset to fit in driver memory. toPandas() is useful when you need pandas ecosystem libraries like matplotlib or scikit-learn.
How do I prevent PySpark from truncating string columns in show()?
Pass truncate=False to show(): df.show(truncate=False). This displays the full string value regardless of length.
How do I print the schema of a PySpark DataFrame?
Use printSchema(): df.printSchema(). This prints the full tree including column names, data types, nullability, and nested struct types. Use the dtypes property for a flat list of top-level (name, type) pairs.
Is PySpark lazy or eager?
PySpark transformations (select, filter, withColumn, join) are lazy. They build an execution plan but do not execute until an action (show, collect, count, write) is called. Session creation and DataFrame definition are eager.


