Python has a built in search function. Here's how to use it

Deepjyoti Barman @deepjyoti30
Jan 4, 2021 2:46 PM UTC
Post cover

I was going through the Python docs and guess what I came across? A function that helps you get closest matches to a list of other strings. Now why would that function be useful, you'd ask. Let me give you a proper example.

TL;DR: Python has a built in function get_close_matches that computes jaccard in order to determine if two strings are similar and accordingly finds the closest matches.

The problem

About 1 and a half years ago, I was working on playx. It is a tool that lets you stream music from the commandline directly from YouTube. While working on it, we decided that adding a feature to cache the files would be pretty nice. Now, while implementing that, we came across a problem. We needed to check if a song is already cached.

Now, the issue is, on YouTube, one song is uploaded by various channels which means the titles vary and since we were caching the songs using the title of the video from YouTube, that meant that if we did a direct match of the titles, it would be a very inefficient way to check if the song has been cached already.

For example, if we have the song name Sundown. This song might have been saved as Sundown - Gordon Lightfoot. However, the next time user tries to play the same song which had a different title, something like Sundown - Lyrics only | Gordon Lightfoot. In this case, a simple match would indicate that both songs are different and the tool would end up caching both even though the songs are same.

If you got an idea of the problem from reading the above, I think you would agree with me that it is similar to searching a particular file in a list of files.

Solving the above problem the hard way

At that point, Nishan came up with a nice solution to find songs that are similar to each other and not cache them.

TL;DR: Compute jaccard for the song name passed by the user and the list of files we have which will give us a relative idea of how similar a song is to the other based on the title.

First things first, if we look at the two title Sundown and Sundown - Gordon Lightfoot, we would think that the titles might mean it's the same song (to some extent). So how do we check that using our code?

One solution to do that is to compute jaccard between the two titles.

Jaccard Coefficient is a coefficient that is calculated between two similar entitites and it indicates how similar the two entities are. It is a naive way to find it, but atleast, it works better than direct matching.

So how do we compute the jaccard between two strings? From the wiki definition, jaccard is the ratio of all the similar properties of the two entities upon all the possible properties of the two entities. That means, it would be a ratio of all the similar characters in the two strings upon all the unique characters in the two strings.

How do we find that by code?

We can just split the two strings into sets. We can then, find the ratio of the sets intersection upon the sets union.

Following code does exactly that

def compute_jaccard(first_string, second_string):
   first_str_chars = set(first_string.split())
   second_str_chars = set(second_string.split())

   similarity = first_str_chars.intersection(second_str_chars)
   all_unique_chars = first_str_chars.union(second_str_chars)

   return len(similarity) / len(all_unique_chars)

The above code has a function that accepts two strings. It splits the two strings based on space into sets. The sets are then intersected and union'ed. Finally the jaccard is computed by counting the number of similar words to the number of total unique words.

Once we have the jaccard_coefficient, we can just keep a minimum amount that the jaccard should be which would indicate how similar the two string would be.

For example, if we have a jaccard coefficient of 0.75, it means the two strings are 75% (0.75 * 100) similar.

In a search algorithm

We can use the above logic to sort a list of results based on the jaccard coefficient computed and accordingly find the most similar files based on their names.

Python's built in solution

Python has a built in function in the difflib package named as get_close_matches.

It takes four arguments in the following order:

  • word: The string which we are searching for
  • possibilites: The list of strings we are searching it in
  • n: Indicates how many results to return based on the search. By default it is set to 3. Can be a positive integer.
  • cutoff: An float between 0 and 1 which will effectively how similar the two strings should be. By default set to 0.6

If you look at the above parameters are bit carefully, you'll see that the cutoff might be effectively the value of the jaccard coefficient that will indicate how similar the two strings are.

Let's get back to our initial problem. We want to see if a song similar to that entered by the user is already cached.

So, if we have a list of already cached files and we have the name of the song user entered, we can find if that song exists in the following way:

from difflib import get_close_matches

song_user_entered = "Sundown"

cached_songs = [
   "Sundown Gordon Lightfoot",
   "Sundown lyrics gordon lightfoot",
   "sundown hd audio gordon",
   "some other song",
   "couchiching gordon lightfoot",
   "never gonna give you up"
]

results = get_close_matches(song_user_entered, cached_songs, n=1)

# results is an emtpy list

The above code runs all right but we get an empty list. This is because the cutoff parameter (how similar the strings are) is set to 0.6 by default and none of the strings pass that number if you try checking that.

Instead, if we lower the cutoff in the following way:

results = get_close_matches(song_user_entered, cached_songs,
                            n=1, cutoff=0.3)
# results is ['Sundown Gordon Lightfoot']

The cutoff parameter is the value of jaccard which the function uses to filter out the results.

Thus, now, we know that one of the results was similar.

Verifying if Jaccard is computed

Just for fun, let's see if actually, jaccard is being calculated by the get_close_matched function.

First, we will compute the jaccard ourselves using our compute_jaccard function and then we will see if the results match.

for song in cached_songs:
   print(song, compute_jaccard(song, song_user_entered))

Above gives us the following output:

Sundown Gordon Lightfoot 0.3333333333333333
Sundown lyrics gordon lightfoot 0.25
sundown hd audio gordon 0.0
some other song 0.0
couchiching gordon lightfoot 0.0
never gonna give you up 0.0

Which justifies the result that only the first string is returned by the get_close_matches function. Thus, we know that the function uses jaccard to determine similarity.

Discussion