
Limit in PySpark
Oftentimes, you might want to limit the number of rows you extract from a DataFrame. For example, you might want just the top 10 rows of some DataFrame. You can do this with the limit method.
First, let's create a PySpark DataFrame.
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession
spark = SparkSession.builder.appName('LifeWithData.com').getOrCreate()

# Sample data: (Name, Age, Sex, Marks)
sampleData = [
    ('Eleven', 18, 'F', 99),
    ('Mike', 20, 'M', 85),
    ('Lucas', 20, 'M', 82),
    ('Will', 18, 'M', 70),
    ('Max', 19, 'F', 80),
    ('Dustin', 17, 'M', 70),
    ('Steve', 20, 'M', 80),
    ('Nancy', 20, 'F', 75)
]
columns = ['Name', 'Age', 'Sex', 'Marks']

# Build the DataFrame and display it
df = spark.createDataFrame(data=sampleData, schema=columns)
df.show()

We can use limit in PySpark like this:
df.limit(5).show()
The equivalent in SQL is:
SELECT * FROM dfTable LIMIT 5
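Note that this SQL only runs against a registered view; a DataFrame is not automatically visible to spark.sql. A minimal sketch, assuming the view name dfTable from the query above:

# Register df as a temporary view so it can be queried with SQL
df.createOrReplaceTempView('dfTable')
spark.sql('SELECT * FROM dfTable LIMIT 5').show()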
Now, let's order the rows by Marks in descending order and show only the top 5 results.
df.orderBy(df["Marks"].desc()).limit(5).show()

In SQL, this is written as:
SELECT * FROM dfTable ORDER BY Marks DESC LIMIT 5
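Keep in mind that limit(n) is different from show(n): show(n) only displays n rows, while limit(n) returns a new DataFrame with at most n rows that you can keep transforming or collect. A small sketch of that follow-on use:

# limit() is a transformation, so the result is itself a DataFrame
top5 = df.orderBy(df['Marks'].desc()).limit(5)

# Collect the limited result back to the driver as Row objects
names = [row['Name'] for row in top5.collect()]
print(names)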