Leetcode – Most Common Word Solution in Python


Text processing and analysis are common themes in algorithmic challenges, and the “Most Common Word” problem on Leetcode is a testament to this. It provides a real-world scenario of extracting meaningful information from textual data while dealing with noise and exceptions. This article will delve deep into the problem, unpacking its nuances, possible solution paths, and the actual implementation in Python.

Table of Contents

  1. Problem Statement
  2. Preliminary Thought Process
  3. Solution Strategies
  4. Python Implementation
  5. Testing and Analysis
  6. Further Considerations and Enhancements
  7. Conclusion

1. Problem Statement

Given a paragraph of text and a list of banned words, determine the most frequent word that is not in the list of banned words. The word should be returned in lowercase, and it is guaranteed that at least one word in the paragraph is not banned. Words in the paragraph can be separated by any non-alphanumeric character.

2. Preliminary Thought Process

Several aspects need consideration:

  1. Normalization: Ensuring that words are processed in a standard format, e.g., converting everything to lowercase.
  2. Tokenization: Splitting the paragraph into individual words.
  3. Filtering: Removing the banned words.
  4. Counting: Keeping track of word frequencies to find the most common word.
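As a quick illustration of these four steps, here is a minimal sketch using only built-in string methods and a plain dict; the polished implementation appears in Section 4. The function name `most_common_word_sketch` is just a placeholder for this walkthrough:

```python
def most_common_word_sketch(paragraph, banned):
    # 1. Normalization: lowercase the whole paragraph
    text = paragraph.lower()
    # 2. Tokenization: replace non-alphanumeric characters with spaces, then split
    tokens = "".join(ch if ch.isalnum() else " " for ch in text).split()
    # 3. Filtering: drop banned words
    allowed = [w for w in tokens if w not in banned]
    # 4. Counting: tally frequencies in a plain dict
    counts = {}
    for w in allowed:
        counts[w] = counts.get(w, 0) + 1
    return max(counts, key=counts.get)

print(most_common_word_sketch("Bob hit a ball, the hit BALL flew.", ["hit"]))  # "ball"
```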

3. Solution Strategies

A direct solution path involves the following steps:

  1. Preprocess the Paragraph: Convert the entire paragraph to lowercase and split it into words using regex or built-in string methods.
  2. Filter and Count: Iterate through the words, ignore banned ones, and use a dictionary or Counter to track the frequencies.
  3. Extract Most Common: Once we have the counts, identify and return the most frequent word.

4. Python Implementation

Let’s walk through a Pythonic approach step by step:

import re
from collections import Counter

def mostCommonWord(paragraph, banned):
    # Step 1: Normalize and tokenize
    words = re.findall(r'\w+', paragraph.lower())
    
    # Step 2: Filter and count
    counter = Counter(word for word in words if word not in banned)
    
    # Step 3: Extract the most common word
    return counter.most_common(1)[0][0]

How the Code Works:

  1. We use the re (regex) library to tokenize the paragraph into words. The regex pattern \w+ extracts maximal runs of word characters (letters, digits, and underscores) as words.
  2. The paragraph is also converted to lowercase using lower() to ensure uniformity.
  3. We use Python’s Counter from the collections module to track word frequencies. While creating the counter, we filter out words present in the banned list.
  4. counter.most_common(1) returns a list with a tuple of the most common word and its frequency. We extract the word using appropriate indexing.
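To make that last step concrete, most_common(k) returns a list of the k most frequent (word, count) pairs in descending order of count, which is why we index with [0][0]:

```python
from collections import Counter

counts = Counter(["ball", "bob", "ball", "flew"])
print(counts.most_common(1))        # [('ball', 2)] - list of (word, count) tuples
print(counts.most_common(1)[0][0])  # 'ball' - the word itself
```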

5. Testing and Analysis

Running a sample test:

paragraph = "Bob hit a ball, the hit BALL flew far after it was hit."
banned = ["hit"]
print(mostCommonWord(paragraph, banned))  # Expected output: "ball"
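A second, hand-made case (not from the problem statement) with heavier punctuation exercises the tokenization step, written out inline here so it runs on its own:

```python
import re
from collections import Counter

paragraph = "a, a, a, a, b,b,b,c, c"
banned = ["a"]
words = re.findall(r'\w+', paragraph.lower())
print(Counter(w for w in words if w not in banned).most_common(1)[0][0])  # "b"
```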

Time Complexity: O(n · m) as written, where n is the number of words in the paragraph and m is the length of the banned list, because each `word not in banned` check scans the list. Converting the banned list to a set reduces this to O(n + m).

Space Complexity: O(n + m), accounting for the storage of the words from the paragraph and the banned list.

6. Further Considerations and Enhancements

  1. Handling Edge Cases: What if there are multiple words with the same maximum frequency? Our current solution returns the first one it encounters.
  2. Efficiency: Storing the banned words in a set reduces each membership check from O(m) to O(1) on average.
  3. Expandability: In real-world scenarios, the list of banned words can be dynamic or vast, requiring more scalable solutions.
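As a sketch of the efficiency point above, converting banned to a set before the loop turns each membership test into an average O(1) operation, bringing the whole run down to O(n + m):

```python
import re
from collections import Counter

def mostCommonWord(paragraph, banned):
    banned_set = set(banned)                       # O(1) average-time membership tests
    words = re.findall(r'\w+', paragraph.lower())  # normalize and tokenize
    counter = Counter(w for w in words if w not in banned_set)
    return counter.most_common(1)[0][0]

print(mostCommonWord("Bob hit a ball, the hit BALL flew far after it was hit.", ["hit"]))  # "ball"
```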

7. Conclusion

The “Most Common Word” problem offers an insightful journey into text processing, capturing the essence of data extraction and noise filtering. The Python language provides powerful tools and libraries that simplify these tasks, enabling efficient solutions. Understanding the problem, identifying the steps, and methodically implementing the solution are key.
