Limit in PySpark explained with examples


Oftentimes, you might want to limit how many rows you extract from a DataFrame. For example, you might want just the top 10 rows of some DataFrame. You can do this by using the limit method.

Let’s create a PySpark DataFrame.

from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession
spark = SparkSession.builder.appName('LifeWithData.com').getOrCreate()

sampleData = [
    ('Eleven', 18, 'F', 99),
    ('Mike', 20, 'M', 85),
    ('Lucas', 20, 'M', 82),
    ('Will', 18, 'M', 70),
    ('Max', 19, 'F', 80),
    ('Dustin', 17, 'M', 70),
    ('Steve', 20, 'M', 80),
    ('Nancy', 20, 'F', 75)
]

columns = ['Name', 'Age', 'Sex', 'Marks']

# Build the DataFrame from the local data with the given column names
df = spark.createDataFrame(data=sampleData, schema=columns)
df.show()

We can use the limit method in PySpark like this:

df.limit(5).show()

The equivalent query in SQL is:

SELECT * FROM dfTable LIMIT 5
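
To actually run this SQL against our DataFrame, we first need to register it as a temporary view. Here is a minimal sketch, assuming we register it under the name dfTable used in the query above:

# Register the DataFrame as a temporary view named dfTable
# so that it can be queried with spark.sql
df.createOrReplaceTempView('dfTable')

spark.sql('SELECT * FROM dfTable LIMIT 5').show()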

Now, let’s order the results by Marks in descending order and show only the top 5.

df.orderBy(df["Marks"].desc()).limit(5).show()

In SQL, this is written as:

SELECT * FROM dfTable ORDER BY Marks DESC LIMIT 5
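
With the dfTable view registered as above, the same query can be run through spark.sql:

spark.sql('SELECT * FROM dfTable ORDER BY Marks DESC LIMIT 5').show()

Also note that limit is a transformation: it returns a new DataFrame rather than bringing rows to the driver, so you can keep chaining methods on the result. For example:

# limit returns a new DataFrame, so further transformations can be chained
top5 = df.orderBy(df['Marks'].desc()).limit(5)

# take(5) returns the first 5 rows as a list of Row objects,
# roughly equivalent to df.limit(5).collect()
rows = df.take(5)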
