How to Merge or Join two DataFrames in R

Spread the love

Merging DataFrames is a pivotal task in the data manipulation process, vital for aggregating data from various sources or preparing data for subsequent analysis. In R, dplyr is a popular package for data manipulation due to its versatile and intuitive set of functions. This article will explore the extensive capabilities of dplyr in merging DataFrames, delving into the functions, methodologies, and best practices.

1. Creating Sample DataFrames:

Let’s create two sample DataFrames for illustrative purposes.

# Creating DataFrame df1
df1 <- data.frame(
  ID = c(1, 2, 3, 4),
  Name = c("John", "Jane", "Mike", "Amy")
)

# Creating DataFrame df2
df2 <- data.frame(
  ID = c(3, 4, 5, 6),
  Score = c(85, 95, 80, 90)
)

Output:

# df1
  ID Name
1  1 John
2  2 Jane
3  3 Mike
4  4  Amy

# df2
  ID Score
1  3    85
2  4    95
3  5    80
4  6    90

2. Utilizing dplyr for Merging:

The dplyr package offers various functions to perform different types of merges.

2.1. Inner Join:

The inner_join() function merges two DataFrames, retaining only rows with matching keys in both.

# Inner Join
df_merged <- inner_join(df1, df2, by = "ID")

Output:

  ID Name Score
1  3 Mike    85
2  4  Amy    95

2.2. Left Join:

The left_join() function retains all rows from the left DataFrame and matching rows from the right DataFrame. Unmatched rows will have NA in the new columns.

# Left Join
df_merged <- left_join(df1, df2, by = "ID")

Output:

  ID Name Score
1  1 John    NA
2  2 Jane    NA
3  3 Mike    85
4  4  Amy    95

2.3. Right Join:

Conversely, right_join() retains all rows from the right DataFrame and the matching rows from the left DataFrame.

# Right Join
df_merged <- right_join(df1, df2, by = "ID")

Output:

  ID Name Score
1  3 Mike    85
2  4  Amy    95
3  5 <NA>    80
4  6 <NA>    90

2.4. Full Join:

The full_join() function retains all rows from both DataFrames, inserting NA where there are unmatched keys.

# Full Join
df_merged <- full_join(df1, df2, by = "ID")

Output:

  ID Name Score
1  1 John    NA
2  2 Jane    NA
3  3 Mike    85
4  4  Amy    95
5  5 <NA>    80
6  6 <NA>    90

2.5. Semi Join:

The semi_join() function retains all rows in the left DataFrame where there are matching keys in the right DataFrame.

# Semi Join
df_merged <- semi_join(df1, df2, by = "ID")

Output:

  ID Name
1  3 Mike
2  4  Amy

2.6. Anti Join:

The anti_join() function retains all rows from the left DataFrame where there are not matching keys in the right DataFrame.

# Anti Join
df_merged <- anti_join(df1, df2, by = "ID")

Output:

  ID Name
1  1 John
2  2 Jane

3. Advanced Merging Techniques:

3.1. Merging by Multiple Columns:

When joining by multiple columns, pass a character vector of column names to the by argument.

# Sample DataFrames
df1 <- data.frame(
  ID = c(1, 2, 3, 4),
  Name = c("John", "Jane", "Mike", "Amy"),
  Age = c(21, 22, 23, 24)
)

df2 <- data.frame(
  ID = c(3, 4, 5, 6),
  Name = c("Mike", "Amy", "Alex", "Sophie"),
  Score = c(85, 95, 80, 90)
)

# Merge by Multiple Columns
df_merged <- left_join(df1, df2, by = c("ID", "Name"))

Output:

  ID Name Age Score
1  1 John  21    NA
2  2 Jane  22    NA
3  3 Mike  23    85
4  4  Amy  24    95

3.2. Handling Suffixes:

If there are overlapping column names in the merging DataFrames, dplyr will automatically add suffixes. You can control the suffixes using the suffix argument.

# Handling Suffixes
df_merged <- left_join(df1, df2, by = "ID", suffix = c("_left", "_right"))

Output:

  ID Name_left Age Name_right Score
1  1      John  21       <NA>    NA
2  2      Jane  22       <NA>    NA
3  3      Mike  23       Mike    85
4  4       Amy  24        Amy    95

Conclusion:

Merging DataFrames is a fundamental aspect of data manipulation, enabling the consolidation of varied information into cohesive datasets. The dplyr package in R provides a versatile and user-friendly toolkit for executing diverse merge operations, including inner, left, right, full, semi, and anti joins, each serving distinct purposes and use cases.

Posted in RTagged

Leave a Reply