How to Delete Rows in R?

Deleting rows is a common operation in data manipulation and analysis in R. You may need to remove rows because they are duplicates, outliers, or fail some other criterion of the analysis. This article covers several ways to delete rows from a dataframe, using both base R and contributed packages, and illustrates each one with an example.

1. Using Row Indexes with Square Brackets

In R, you can remove rows by subsetting the dataframe with square brackets. A negative row index drops that row and keeps all the others.

# Sample DataFrame
df <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  Value = c(10, 20, 30, 40, 50)
)

# Deleting the 2nd row
df <- df[-2, ]

# Output DataFrame
print(df)

Output:

  ID Value
1  1    10
3  3    30
4  4    40
5  5    50
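
The same negative-index idea extends to several rows at once. The sketch below rebuilds the sample data under a hypothetical name, df_demo, drops rows 2 and 4 in one step, and illustrates a pitfall that is easy to miss: combining a negative index with which() returns an empty dataframe when nothing matches.

```r
# df_demo is a hypothetical copy of the sample data used above
df_demo <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  Value = c(10, 20, 30, 40, 50)
)

# Drop rows 2 and 4 in one step
dropped <- df_demo[-c(2, 4), ]

# Pitfall: no row has Value > 100, so which() returns integer(0),
# and df_demo[-integer(0), ] selects zero rows instead of all rows
idx <- which(df_demo$Value > 100)
empty <- df_demo[-idx, ]

# Safer: apply the negative index only when there is something to drop
safe <- if (length(idx) > 0) df_demo[-idx, ] else df_demo
```

A logical condition avoids this pitfall entirely, which is one reason the next method is often preferred when rows are chosen by value rather than by position.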

2. Using Logical Conditions

Logical conditions can be used with square brackets to subset a dataframe and remove rows meeting certain criteria.

# Removing rows where Value is less than 30
df <- df[df$Value >= 30, ]

# Output DataFrame
print(df)

Output:

  ID Value
3  3    30
4  4    40
5  5    50
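
One caveat with logical subsetting is missing data: a comparison against NA yields NA, and indexing with NA produces a row filled with NA values. The following sketch, using a hypothetical dataframe df_na, shows the effect and one common way around it with which().

```r
# df_na is a hypothetical dataframe with one missing Value
df_na <- data.frame(
  ID = c(1, 2, 3, 4),
  Value = c(10, NA, 30, 40)
)

# The condition is FALSE, NA, TRUE, TRUE; the NA position
# yields an extra row in which every column is NA
with_na_row <- df_na[df_na$Value >= 30, ]

# which() keeps only the positions where the condition is TRUE
clean <- df_na[which(df_na$Value >= 30), ]
```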

3. Using the subset() Function

The subset() function in base R keeps the rows that satisfy a condition, so you remove rows by writing the condition for the rows you want to keep.

# Removing rows where ID is 3
df <- subset(df, ID != 3)

# Output DataFrame
print(df)

Output:

  ID Value
4  4    40
5  5    50
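
As a further illustration, subset() can remove several values at once with %in%, and it treats a missing condition as FALSE, so rows with NA in the tested column are dropped rather than turned into NA rows. This is a minimal sketch on a hypothetical dataframe df_demo, not part of the running example.

```r
# df_demo is a hypothetical dataframe with one missing Value
df_demo <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  Value = c(10, 20, NA, 40, 50)
)

# Remove the rows whose ID is 2 or 4
kept <- subset(df_demo, !(ID %in% c(2, 4)))

# The condition is NA for row 3; subset() treats it as FALSE,
# so that row is dropped along with the row where Value is 10
kept_big <- subset(df_demo, Value >= 20)
```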

4. Using the dplyr Package

The filter() function in dplyr keeps the rows that satisfy a condition, so you remove rows by stating the condition for the rows you want to keep.

# Sample DataFrame
df <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  Value = c(10, 20, 30, 40, 50)
)

library(dplyr)

# Removing rows where ID is 1
df <- df %>% filter(ID != 1)

# Output DataFrame
print(df)

Output:

  ID Value
1  2    20
2  3    30
3  4    40
4  5    50
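
Note that the output above is renumbered from 1 to 4, whereas the base R examples kept the original row labels. The sketch below compares the two on a hypothetical dataframe df_demo while removing several IDs at once; it assumes dplyr is installed.

```r
library(dplyr)

# df_demo is a hypothetical copy of the sample data
df_demo <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  Value = c(10, 20, 30, 40, 50)
)

# Keep every row whose ID is not 2 or 4; row labels restart at 1
kept <- df_demo %>% filter(!ID %in% c(2, 4))

# Base R equivalent for comparison: the original row labels survive
kept_base <- df_demo[!df_demo$ID %in% c(2, 4), ]
```

If later code relies on row labels, it is usually safer to keep an explicit ID column than to depend on either behavior.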

5. Using the slice() Function

The slice() function from the dplyr package selects rows by position. Negative positions drop the corresponding rows.

# Removing the 1st row
df <- df %>% slice(-1)

# Output DataFrame
print(df)

Output:

  ID Value
1  3    30
2  4    40
3  5    50
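
slice() also accepts a vector of negative positions, which removes several rows in one call. A short sketch, again on a hypothetical df_demo and assuming dplyr is installed:

```r
library(dplyr)

# df_demo is a hypothetical copy of the sample data
df_demo <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  Value = c(10, 20, 30, 40, 50)
)

# Drop the 1st and 3rd rows
no_first_third <- df_demo %>% slice(-c(1, 3))

# Drop the first two rows using a range
no_first_two <- df_demo %>% slice(-(1:2))
```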

6. Using the na.omit() Function

The na.omit() function removes rows containing NA values.

Let’s create a dataframe with some NA values and then use the na.omit() function to remove the rows containing NA values.

# Creating a sample DataFrame with NA values
df <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  Name = c("John", "Sara", NA, "Anna", "Mike"),
  Age = c(21, 35, 30, 25, NA)
)

# Displaying original DataFrame
print("Original DataFrame:")
print(df)

# Applying na.omit() to remove rows containing NA values
df_no_na <- na.omit(df)

# Displaying DataFrame after removing rows with NA values
print("DataFrame after omitting NA values:")
print(df_no_na)

Output:

[1] "Original DataFrame:"
  ID Name Age
1  1 John  21
2  2 Sara  35
3  3 <NA>  30
4  4 Anna  25
5  5 Mike  NA

[1] "DataFrame after omitting NA values:"
  ID Name Age
1  1 John  21
2  2 Sara  35
4  4 Anna  25

Rows 3 and 5 of the original dataframe, which had NA values in the Name and Age columns respectively, are omitted from the df_no_na dataframe.
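
na.omit() looks at every column. When only some columns matter, is.na() or complete.cases() on a column subset gives finer control. The sketch below rebuilds the same data as a hypothetical df_demo to show both.

```r
# df_demo is a hypothetical copy of the dataframe above
df_demo <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  Name = c("John", "Sara", NA, "Anna", "Mike"),
  Age = c(21, 35, 30, 25, NA)
)

# Drop rows only when Age is missing; a missing Name is tolerated
age_known <- df_demo[!is.na(df_demo$Age), ]

# complete.cases() on selected columns extends this to several columns
both_known <- df_demo[complete.cases(df_demo[, c("Name", "Age")]), ]
```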

In-Depth Examples:

A. Combining Conditions to Remove Rows

# Sample DataFrame
df <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  Value = c(10, 20, 30, 40, 50)
)

# Removing rows where ID is less than 3 or Value is greater than 40
df <- df[!(df$ID < 3 | df$Value > 40), ]
print(df)

Output:

  ID Value
3  3    30
4  4    40

This will remove rows where the ID is less than 3 or the Value is greater than 40.
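
The negated "or" can equally be written as a positive "and" over the rows to keep, which some readers find easier to follow. A minimal sketch on a hypothetical df_demo, checking that both forms give the same result:

```r
# df_demo is a hypothetical copy of the sample data
df_demo <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  Value = c(10, 20, 30, 40, 50)
)

# Remove rows where ID < 3 or Value > 40
removed_or <- df_demo[!(df_demo$ID < 3 | df_demo$Value > 40), ]

# Equivalent: keep rows where ID >= 3 and Value <= 40
kept_and <- df_demo[df_demo$ID >= 3 & df_demo$Value <= 40, ]

same_result <- identical(removed_or, kept_and)
```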

B. Removing Duplicate Rows

# Creating a sample DataFrame with duplicate rows
df <- data.frame(
  ID = c(1, 2, 2, 3, 4, 4, 5),
  Name = c("John", "Sara", "Sara", "Anna", "Mike", "Mike", "Eva"),
  Age = c(21, 35, 35, 25, 30, 30, 22)
)

# Displaying the original DataFrame
print("Original DataFrame:")
print(df)

# Removing duplicate rows
df_no_duplicates <- df[!duplicated(df), ]

# Displaying the DataFrame after removing duplicate rows
print("DataFrame after removing duplicate rows:")
print(df_no_duplicates)

Output:

[1] "Original DataFrame:"
  ID Name Age
1  1 John  21
2  2 Sara  35
3  2 Sara  35
4  3 Anna  25
5  4 Mike  30
6  4 Mike  30
7  5  Eva  22

[1] "DataFrame after removing duplicate rows:"
  ID Name Age
1  1 John  21
2  2 Sara  35
4  3 Anna  25
5  4 Mike  30
7  5  Eva  22

Here, the rows 3 and 6 from the original dataframe, which were duplicates of rows 2 and 5 respectively, have been removed in the df_no_duplicates dataframe.
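
duplicated() on the whole dataframe only flags rows that match in every column. To treat rows as duplicates based on a key column, pass that column instead. The sketch below uses a hypothetical df_demo in which two rows share an ID but differ in Age.

```r
# df_demo is a hypothetical dataframe: rows 2 and 3 share an ID only
df_demo <- data.frame(
  ID = c(1, 2, 2, 3),
  Name = c("John", "Sara", "Sara", "Anna"),
  Age = c(21, 35, 36, 25)
)

# Whole-row comparison finds no duplicates because Age differs
whole_row <- df_demo[!duplicated(df_demo), ]

# Comparing on ID keeps the first occurrence of each ID
first_kept <- df_demo[!duplicated(df_demo$ID), ]

# fromLast = TRUE keeps the last occurrence instead
last_kept <- df_demo[!duplicated(df_demo$ID, fromLast = TRUE), ]
```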

C. Using filter() with Multiple Conditions

# Sample DataFrame
df <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  Value = c(10, 20, 30, 40, 50)
)

library(dplyr)

# Removing rows where ID is 4 and Value is 40
df <- df %>% filter(!(ID == 4 & Value == 40))
print(df)

Output:

  ID Value
1  1    10
2  2    20
3  3    30
4  5    50
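
The explicit !( ... & ... ) matters here. Conditions separated by commas in filter() are combined with "and", so negating each condition separately removes far more rows. The following sketch uses a hypothetical df_demo chosen to make the difference visible; it assumes dplyr is installed.

```r
library(dplyr)

# df_demo is a hypothetical dataframe: only row 1 has ID 4 AND Value 40
df_demo <- data.frame(
  ID = c(4, 4, 5),
  Value = c(40, 99, 40)
)

# Removes only the row where both conditions hold
both <- df_demo %>% filter(!(ID == 4 & Value == 40))

# Keeps rows where ID is not 4 AND Value is not 40,
# which removes every row where either condition holds
either <- df_demo %>% filter(ID != 4, Value != 40)
```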

D. Combining the slice() and n() Functions

library(dplyr)

# Removing the last row of the dataframe (df is the result of example C)
df <- df %>% slice(1:(n()-1))
print(df)

Output:

  ID Value
1  1    10
2  2    20
3  3    30

This will remove the last row of the dataframe.
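
A range such as 1:(n()-1) counts down when only one row is left, because 1:0 evaluates to c(1, 0). A negative position avoids the issue. The sketch below shows slice(-n()) and the base R head() alternative on a hypothetical df_demo, and demonstrates the edge case with plain base R indexing; it assumes dplyr is installed.

```r
library(dplyr)

# df_demo is a hypothetical three-row dataframe
df_demo <- data.frame(
  ID = c(1, 2, 3),
  Value = c(10, 20, 30)
)

# Drop the last row with a negative position
without_last <- df_demo %>% slice(-n())

# Base R alternative: head() with a negative count
without_last_base <- head(df_demo, -1)

# Edge case: with a single row, 1:(nrow - 1) is c(1, 0),
# so the row is kept instead of removed
one_row <- df_demo[1, ]
kept_by_accident <- one_row[1:(nrow(one_row) - 1), ]

# head() handles the single-row case as expected
emptied <- head(one_row, -1)
```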

E. Using drop_na() to Remove Rows with NA Values

# Creating a sample DataFrame with NA values
df <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  Name = c("John", "Sara", NA, "Anna", "Mike"),
  Age = c(21, 35, 30, 25, NA)
)

library(tidyr)

# Removing rows with NA values in any column
df <- drop_na(df)
print(df)

Output:

  ID Name Age
1  1 John  21
2  2 Sara  35
3  4 Anna  25

This will remove any rows with NA values in any of the columns of the dataframe.
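
drop_na() also accepts column names, in which case only those columns are checked for missing values. A minimal sketch on a hypothetical copy of the data, assuming tidyr is installed:

```r
library(tidyr)

# df_demo is a hypothetical copy of the dataframe above
df_demo <- data.frame(
  ID = c(1, 2, 3, 4, 5),
  Name = c("John", "Sara", NA, "Anna", "Mike"),
  Age = c(21, 35, 30, 25, NA)
)

# Only rows with a missing Age are removed; the missing Name in row 3 stays
age_known <- drop_na(df_demo, Age)
```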

Conclusion

Deleting rows is a routine part of data manipulation and preprocessing in R, whether you are removing duplicates, filtering out irrelevant data, or handling missing values.

Base R and contributed packages such as dplyr and tidyr offer several functions and operators for deleting rows from a dataframe by position, by condition, or by missing values. Understanding these approaches lets you choose the one that best suits your data and prepare it efficiently for further analysis.