PySpark DataFrame show() Method


The show() method in PySpark DataFrame is used to display the contents of the DataFrame. This can be extremely useful when you are working with data and need to visualize it, check the result of a transformation, or debug an issue. It prints the DataFrame in a tabular format that is much easier to read than the list of Row objects returned by the collect() method.

Here is an example of how to use the show() method:

from pyspark.sql import SparkSession

# Initialize a SparkSession
spark = SparkSession.builder \
    .appName("PySpark show() method example") \
    .getOrCreate()
# Example DataFrame
data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]
df = spark.createDataFrame(data, ["Name", "Value"])

# Use show() to display the DataFrame
df.show()

Running this code will output:

+-------+-----+
|   Name|Value|
+-------+-----+
|  Alice|    1|
|    Bob|    2|
|Charlie|    3|
+-------+-----+

As you can see, the show() method provides a neat and clean visualization of the DataFrame content.

Customizing the show() Method

The show() method has some optional parameters that you can use to customize the output:

  • n: The number of rows to show. The default is 20.
  • truncate: If set to True (the default), truncate strings longer than 20 characters. If set to a number n greater than 1, truncate strings to n characters and right-align all cells.
  • vertical: If set to True, print output rows vertically (one line per column value).

Here’s an example that uses these parameters:

# Show the first 2 rows, truncate strings to 10 characters, and print rows vertically
df.show(2, truncate=10, vertical=True)

This will output:

-RECORD 0------
 Name  | Alice
 Value | 1     
-RECORD 1------
 Name  | Bob   
 Value | 2     
only showing top 2 rows

In this output, only the first 2 rows are displayed, the string values are truncated to 10 characters (not noticeable in this example as all strings are shorter than 10 characters), and the rows are printed vertically.

Using show() with Column Expressions

You can use column expressions with the show() method to transform the data before displaying it. For example, you can add a new column, filter rows, or sort the DataFrame. Here’s an example:

from pyspark.sql.functions import col, desc

# Add a new column, filter rows, sort the DataFrame, and show the result
df.withColumn("ValueSquared", col("Value") ** 2) \
  .filter(col("Value") > 1) \
  .sort(desc("Value")) \
  .show()
This will output:

+-------+-----+------------+
|   Name|Value|ValueSquared|
+-------+-----+------------+
|Charlie|    3|         9.0|
|    Bob|    2|         4.0|
+-------+-----+------------+

In this output, a new column “ValueSquared” is added, rows with “Value” less than or equal to 1 are filtered out, and the DataFrame is sorted in descending order by “Value”.


The show() method is a fundamental tool when working with PySpark DataFrames. It enables you to visualize your data quickly and efficiently, which is crucial when developing and debugging PySpark applications.

However, remember that show() triggers an action on your DataFrame, which means it will force the execution of the transformations that you’ve defined on your DataFrame. While this is not an issue for small DataFrames, it can be a problem for large DataFrames as it can take a significant amount of time to compute. Therefore, use it judiciously when working with large DataFrames.

Lastly, keep in mind that show() is mainly intended for debugging and data exploration. For production applications, consider using Spark’s write operations to store your DataFrame’s content into a persistent storage system.
