Union and UnionAll – Merge DataFrames in PySpark


Union and UnionAll –

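In PySpark, union() and unionAll() merge (concatenate) the rows of two DataFrames; chain the calls to combine more than two. union() matches columns by position rather than by name, so both DataFrames must have the same number of columns, in the same order, with compatible data types. If the column counts differ, the union fails.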

Let’s create two DataFrames to work with.

First DataFrame –

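from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; the pyspark shell already provides one as `spark`
spark = SparkSession.builder.appName("union-example").getOrCreate()

sampleData1 = [
    ('Eleven', 18, 'F', 99),
    ('Mike', 20, 'M', 85),
    ('Lucas', 20, 'M', 82),
    ('Will', 18, 'M', 70),
    ('Max', 19, 'F', 80)
]

columns1 = ['Name', 'Age', 'Sex', 'Marks']

df1 = spark.createDataFrame(data=sampleData1, schema=columns1)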

Second DataFrame –

sampleData2 = [
    ('Eleven', 18, 'F', 99),
    ('Dustin', 17, 'M', 70),
    ('Steve', 20, 'M', 80),
    ('Nancy', 20, 'F', 75),
    ('Mike', 20, 'M', 85)
]

columns2 = ['Name', 'Age', 'Sex', 'Marks']

df2 = spark.createDataFrame(data=sampleData2, schema=columns2)

Merge two or more dataframes using Union –

The union() method in PySpark merges two DataFrames and returns a new DataFrame containing all the rows from both, including any duplicate records.

Let’s merge df1 and df2.

df3 = df1.union(df2)

Merge two or more dataframes using unionAll –

The unionAll() method returns the same result as union(). It was deprecated in PySpark 2.0.0 in favor of union(); since Spark 3.0 it is no longer marked as deprecated and is kept as an alias for union(). Prefer union() in new code.

df4 = df1.unionAll(df2)

Merge dataframes without duplicates –

To merge dataframes without duplicates, first we use the union() method to combine two dataframes and then use the distinct() method to remove any duplicate records.

df5 = df1.union(df2).distinct()
