TF-IDF Website Document Search

by Nigel Story
nigelstorydata.com


1. Introduction

It happens all too often. Coworkers and friends asking, "Hey, didn't you write some code that does XYZ? Where can I find that?" or "I'm working on something that you mentioned having done before. Can I see your code?" I figured having my own website to which I could direct them would save me a little time, but then the followup questions start: "Which project is it in? Where is that on your site?" And things are made worse when even I can't remember.

I needed a search bar. Not a run-of-the-mill keyword lookup, I needed something that could search through my write-ups and code: something where I could search "Poisson simulation" or "dbscan clustering" and get good results. With a simple keyword search, it would be impossible to anticipate what a person might want to look for, so in this project we will implement a searching and ranking algorithm based on NLP that can return results for any relevant query a user might search.

1.1 Methods

To achieve a more robust search experience, we will combine fuzzy keyword matching techniques with text vectorization and cosine similarity. Specifically, we will start by reading and preprocessing HTML documents created from my own Jupyter Notebook projects that appear on nigelstorydata.com, then we will create a Term Frequency Inverse Document Frequency (TF-IDF) vectorizer to convert each document and users' search queries to vector representations. We will then use the cosine similarity between each document and search query as the first component of our results ranking.

Next, we will create a three-tiered tagging system for our documents: one tier for title matching, one for category matching, and one for miscellaneous tags. The tags we assign will contribute to keyword matching scores. The matching between search terms and tags will be fuzzy, or intentionally imprecise: we match based on the Levenshtein distance ratio between the search terms and each tag, and it is this ratio that contributes to the total search rank score.

The scores are combined as follows: $\text{Search Rank Score} = \text{Title Match Score} + 0.3 \cdot \text{Category Match Score} + 0.5 \cdot \text{Tag Match Score} + \text{Cosine Similarity}$

This search algorithm is currently deployed on the website. Search queries are sent via POST request to the site server asynchronously, and the data processing and responses are handled via CGI.

1.2 Package Imports

We'll be relying on libraries from a typical data science stack: namely Pandas, NumPy, and scikit-learn. We'll also use the nltk library for NLP preprocessing tasks, Beautiful Soup to extract our text data from the Jupyter notebooks, and the Levenshtein module to assist with our fuzzy keyword matching. Models and other objects will be persisted using the pickle library.

In [3]:
import warnings
warnings.filterwarnings("ignore")
In [4]:
import os
import re
import string
import collections
import numpy as np
import pandas as pd
import Levenshtein
from bs4 import BeautifulSoup
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import pickle as pkl
In [5]:
# Download nltk dependency datasets.

# nltk.download('stopwords')
# nltk.download('punkt')
# nltk.download('wordnet')

2. Data Preparation

The data preparation process will consist of three steps: reading in the raw HTML text, parsing the HTML to extract the text that is displayed in the notebooks, and finally the standard NLP text preprocessing (removing punctuation and stopwords, standardizing text, lemmatization, etc.).

2.1 Reading in the Jupyter Notebooks

The notebooks are stored in a local directory and have already been converted to HTML using the nbconvert functionality. To read in the HTML files, we can loop through the files of the directory and read them in one by one. However, for our initial development, we will only read in the text from the first file, which corresponds to the LSTM Anomaly Detection project notebook.

In [6]:
folder = "/Applications/XAMPP/htdocs/nigelstorydata/jupyter_notebooks"
In [7]:
notebook_html = [f for f in os.listdir(folder) if f.endswith('.html')]
notebook_html
Out[7]:
['anomaly.html',
 'acnh.html',
 'senate.html',
 'sales_simulation.html',
 'james_webb_gan.html',
 'waybill.html',
 'senate_etl.html',
 'hamlet.html']
In [8]:
with open(os.path.join(folder, notebook_html[0])) as f:
    html = f.read()

Let's preview the first 100 characters of the text.

In [9]:
html[:100]
Out[9]:
'<!DOCTYPE html>\n<html>\n<head><meta charset="utf-8" />\n\n<title>LSTM Autoencoding Anomaly Detector</ti'

2.2 Parsing HTML & Data Extraction

The raw HTML text is in no shape to perform any analysis on; the HTML tags are cumbersome and not informative, and the added JavaScript and hashed text used to render the notebooks will heavily skew the TF-IDF results.

To overcome this, we'll start by extracting only the text contained in HTML tags whose contents are actually displayed, like h1 or p tags (see the code cells below for the full list in get_tags). We will use the Beautiful Soup library to assist in parsing the HTML and extracting the text.

In [10]:
soup = BeautifulSoup(html, 'html.parser')
In [11]:
get_tags = ['title', 'h1', 'h2', 'h3', 'h4', 'h5', 'li', 'p', 'em', 'a', 'span', 'code']
In [12]:
doc_text = []
for t in get_tags:
    txt = soup.find_all(t)
    doc_text.append([s.text.replace('¶', '').strip() for s in txt])
    
doc_text = ' '.join([' '.join(x) for x in doc_text])

Let's take another look at the preview.

In [13]:
print(doc_text[:100])
LSTM Autoencoding Anomaly Detector LSTM Autoencoding Anomaly Detector Contents Introduction ETL Anom

This looks much better than the raw HTML, but it still has a ways to go before it's ready for use in an NLP application. Next, we'll handle the cleaning and preprocessing of the document text.

2.3 Text Preprocessing

To preprocess our text, we'll do the following:

1. Convert the text to lowercase
2. Tokenize the text
3. Remove stopwords
4. Remove punctuation
5. Remove numbers
6. Strip extra spaces
7. Lemmatize each word

To assist with steps 4 and 5, we will use two custom functions: replace_punc and replace_nums. The prep_text function will be our preprocessing pipeline, and will be used to process not only the documents, but also search queries, so that we compare consistent vocabularies.

In [14]:
STOPWORDS = set(stopwords.words('english'))
PUNCTUATION = set(string.punctuation)
lemmatizer = WordNetLemmatizer()
In [15]:
def replace_punc(s):
    """Replace punctuation in a string with space.
    """
    for p in string.punctuation:
        s = s.replace(p, ' ')
        
    return s

def replace_nums(s):
    """Replace numbers in a string with space.
    """
    return re.sub(r'\d+', ' ', s)
In [16]:
def prep_text(doc_text):    
    """Text preprocessing pipeline.
    """
    clean = [s.lower() for s in word_tokenize(doc_text)]
    clean = [s for s in clean if s not in STOPWORDS]
    clean = [replace_punc(s) for s in clean if s not in PUNCTUATION]
    clean = [replace_nums(s) for s in clean]
    clean = [s.strip() for s in clean]
    clean = [lemmatizer.lemmatize(s) for s in clean]

    return ' '.join(clean)
In [17]:
clean_text = prep_text(doc_text)
In [18]:
print(clean_text[:100])
lstm autoencoding anomaly detector lstm autoencoding anomaly detector content introduction etl anoma

Now the text is ready for NLP and to be vectorized using TF-IDF.


3. Term Frequency Inverse Document Frequency

Term Frequency Inverse Document Frequency (TF-IDF) is a text vectorization technique that scores a word's salience by comparing its frequency of use within a given document to its frequency of use in the whole corpus of documents. For more information about its uses and calculation, check out this MonkeyLearn.com article.
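In one common textbook formulation (scikit-learn's TfidfVectorizer uses a smoothed variant and normalizes each document vector, but the intuition is the same), the score of a term $t$ in a document $d$ drawn from a corpus of $N$ documents is

$\text{tfidf}(t, d) = \text{tf}(t, d) \cdot \log\frac{N}{\text{df}(t)}$

where $\text{tf}(t, d)$ counts how often $t$ occurs in $d$, and $\text{df}(t)$ counts how many documents contain $t$. A term that is frequent in one document but rare across the corpus scores high, while a term that is common everywhere scores near zero.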

At a high level, what we want to do is convert each document and search term into a vector, and then our search algorithm can rate how similar the search terms are to each document by measuring the angle between the two vectors, or the cosine similarity. This will be our baseline search result ranking metric.
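Concretely, for a query vector $q$ and a document vector $d$, the cosine similarity is

$\text{Cosine Similarity}(q, d) = \frac{q \cdot d}{\lVert q \rVert \, \lVert d \rVert}$

which equals 1 when the two vectors point in the same direction and 0 when they are orthogonal. Since TF-IDF vectors are non-negative, the score always lands between 0 and 1.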

3.1 TF-IDF Vector Encoding

We will implement our TF-IDF vectorization using the TfidfVectorizer class from scikit-learn. We will look at n-grams of one to three words, with a limit of 500 terms for our vocabulary.

Later, we will be able to persist trained TF-IDF vectorizers with the pickle package, so that we can use the same vectorizer to transform search terms.

In [19]:
tfv = TfidfVectorizer(ngram_range=(1, 3), max_features=500)
In [20]:
vec = tfv.fit_transform([clean_text])

Let's preview the results. Since we only trained this TF-IDF model on one document, we are only returned one vector.

In [40]:
df = pd.DataFrame([(x,y) for x,y in zip(tfv.get_feature_names(), vec.toarray()[0])], columns=['word', 'val'])
df.sort_values('val', ascending=False).head()
Out[40]:
word val
73 data 0.551041
232 model 0.298870
33 bearing 0.196133
7 anomaly 0.168114
211 loss 0.149435

3.2 Search Term Cosine Similarity

Now that we have a trained TF-IDF model and a vectorized document, we can test how cosine similarity will rank search terms.

The document we trained the TF-IDF model on was titled LSTM Autoencoding Anomaly Detector, and it talked quite a bit about anomalies and identifying them. It did not, however, use the word "probability" very often, so I will use the search terms "anomaly" and "probability" to test that our algorithm is working so far.

In [41]:
search_term = 'anomaly'
sim = cosine_similarity(tfv.transform([search_term]), vec)
print(f'Cosine Similarity ({search_term}): {sim[0][0]:0.4f}')
Cosine Similarity (anomaly): 0.1681
In [43]:
search_term = 'probability'
sim = cosine_similarity(tfv.transform([search_term]), vec)
print(f'Cosine Similarity ({search_term}): {sim[0][0]:0.4f}')
Cosine Similarity (probability): 0.0000

We can see that "anomaly" has quite a high cosine similarity, at 0.17, compared to "probability", which never appears in the text and scores 0.

Now we can move on to creating the algorithms that add to our cosine similarity score by comparing the search terms against intentionally chosen tags.


4. Search Result Ranking

TF-IDF and cosine similarity are only the first pieces of our document search algorithm. To give myself the ability to add more information to each project, I can create a list of tags that boost the salience of chosen words. For example, in the project "Analysis of U.S. Senate Polarization," the word "senate" is used quite a lot, but if a user searched "politics," they might not find this project in their results despite it being a highly political project. By adding "politics" to the project's tags, I can explicitly boost words' importance to improve search results.

4.1 Project Tag Scoring

The tags will be stored in a dictionary with one key for each project (since we are only testing with one project, there is only one key here). Each entry will contain the project's name, the project's category, and a list of relevant search tags that are closely related to the project.

Here, I've decided to add tags for specific packages used in the project: Keras, TensorFlow, and PySpark. I've also added tags for "time series," so that I can search based on analysis techniques.

The tags are preprocessed in exactly the same manner as the document text to ensure consistency of results.

In [23]:
tags = {
    0: {
        'name': prep_text('lstm autoencoding anomaly detection').split(),
        'category': prep_text('machine learning').split(),
        'tags': prep_text('machine learning nigel story lstm anomaly detect autoencoder apache spark pyspark sklearn keras tensorflow time series').split()
    }
}

To improve robustness, our algorithm will not use a direct keyword search of the tags, but rather perform fuzzy matching, as discussed in the following section.

4.2 Levenshtein Ratio Scoring

The scoring of tag matching will be accomplished by analyzing the Levenshtein minimum edit distance ratio. The idea is to make our search robust against minor typos and differences in inflection, like pluralization and conjugation. For example, exact keyword matching would perform poorly if a user queried "detector" when "detection" is one of the tags. Using the Levenshtein ratio remedies these kinds of small differences.
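As a quick illustration (a minimal sketch using the Levenshtein module imported earlier; exact values depend on the package's cost model), the "detector"/"detection" pair from the example above scores much higher than an unrelated pair would:

# An exact keyword match would treat "detector" and "detection" as entirely
# different words, but the Levenshtein ratio credits their long shared prefix.
Levenshtein.ratio('detector', 'detection')   # roughly 0.82 -- a strong fuzzy match
Levenshtein.ratio('detector', 'clustering')  # far lower -- unrelated words contribute nothing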

The way we will use this ratio is to take each word of the user's search query and compute the Levenshtein ratio between that word and each tag for each project. If the ratio is greater than 0.5, it is added to the final search ranking score. We will test this method using the search term "anomaly detection."

In [24]:
# test search term
q = 'anomaly detection'

# preprocess the search terms
q = prep_text(q)
ql = q.split()
In [46]:
# project name match
title_score = 0
for w in ql:
    for n in tags[0]['name']:
        lr = Levenshtein.ratio(w, n)
        if lr > 0.5:
            title_score += lr

# project category match
cat_score = 0
for w in ql:
    for n in tags[0]['category']:
        lr = Levenshtein.ratio(w, n)
        if lr > 0.5:
            cat_score += lr
            
# project miscellaneous tag match          
tag_score = 0
for w in ql:
    for n in tags[0]['tags']:
        lr = Levenshtein.ratio(w, n)
        if lr > 0.5:
            tag_score += lr
In [47]:
print(f"title_score: {title_score:0.4f}, cat_score: {cat_score:0.4f}, tag_score: {tag_score:0.4f}")
title_score: 2.0000, cat_score: 0.0000, tag_score: 1.8000

We can see above that our search matched two words from the project's title and two miscellaneous tags ("anomaly" and "detect"). Since the words "anomaly detection" do not appear in the project's category, "machine learning," the category match score is 0.

In the next section, we tie everything together by combining the match scores with the cosine similarity computed in section 3.

4.3 Total Search Rank Score

Now, we will combine all the scores from previous sections into the final Search Rank Score. We will do this simply by taking a weighted sum of our calculated scores.

A match on a project's title is the most important and should receive the most weight, so it will receive a weight of 1. The category score will receive a smaller weight of 0.3, since we would like to allow a project's contents to overrule its category if they receive a high enough cosine similarity score. Tag matches will have slightly lower scores on average than category matches, so we boost them a bit with a weight of 0.5. And finally, the cosine similarity will receive a weight of 1. Cosine similarity values are comparatively very small unless there is a very strong match between the search query and the document's contents, and in that case we want the cosine similarity to carry its full weight. This also allows for strong results if the user searches for terms that are not in any of the project tags.

In [26]:
cos_sim = cosine_similarity(tfv.transform([q]), vec)[0][0]

Let's compute the final search rank score for the same query as before: "anomaly detection."

In [48]:
print(f"title_score: {title_score:0.4f}, cat_score: {cat_score:0.4f}, tag_score: {tag_score:0.4f}, cos_sim: {cos_sim:0.4f}")
title_score: 2.0000, cat_score: 0.0000, tag_score: 1.8000, cos_sim: 0.1402
In [28]:
score = title_score + 0.3*cat_score + 0.5*tag_score + cos_sim
score
Out[28]:
3.0401989083110235

We can see above that the content match based on cosine similarity gives a slight boost to the ranking score, but the main drivers are the matches on the title and the tags.


5. Prediction Pipeline

Bringing it all together...

Here is a demonstration of how the pipeline works in production.

1. The trained TF-IDF model is read into the environment
2. We have predefined project document search tags
3. The search terms are processed and scored
4. The final results are displayed

5.1 Loading Trained TF-IDF Vectorizer

This TF-IDF model has been previously trained on all of the projects on the website.

In [29]:
# load in model
with open('/Applications/XAMPP/htdocs/nigelstorydata/py/trained_models/tf_idf_model_2022_08_30.pkl', 'rb') as f:
    tf_idf = pkl.load(f)
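The training script itself is not shown in this notebook, but as a rough sketch (assuming the pickled dictionary's layout, with a 'trained_model' key holding the fitted vectorizer and a 'vectorized_projects' key holding the document vectors, as used in section 5.3), it might look something like this:

# Hypothetical persistence step: fit a vectorizer on every preprocessed
# project document, then pickle it together with the document vectors.
def read_notebook_text(path):
    # Extract displayed text from one notebook HTML file, as in section 2.2.
    with open(path) as f:
        soup = BeautifulSoup(f.read(), 'html.parser')
    parts = [s.text.replace('¶', '').strip() for t in get_tags for s in soup.find_all(t)]
    return ' '.join(parts)

all_docs = [prep_text(read_notebook_text(os.path.join(folder, f))) for f in notebook_html]

tfv_all = TfidfVectorizer(ngram_range=(1, 3), max_features=500)
doc_vecs = tfv_all.fit_transform(all_docs)

with open('tf_idf_model_2022_08_30.pkl', 'wb') as f:
    pkl.dump({'trained_model': tfv_all, 'vectorized_projects': doc_vecs}, f)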

5.2 Project Document Search Tags

Note that here we are using tags for all of the projects on the website.

In [30]:
tags = {
    0: {
        'name': prep_text('lstm autoencoding anomaly detection').split(),
        'category': prep_text('machine learning').split(),
        'tags': prep_text('machine learning nigel story lstm anomaly apis detect autoencoder neural network apache spark pyspark sklearn keras tensorflow time series').split()
    },
    1: {
        'name': prep_text('Time Series Clustering Analysis of ACNH Turnip Price Trends animal crossing new horizons').split(),
        'category': prep_text('machine learning').split(),
        'tags': prep_text('machine learning nigel story clustering time series dbscan kmeans web scraping neural network turnips animal eda crossing stalk stock market prediction sklearn random forest multiclass classification imbalanced data').split()
    },
    2: {
        'name': prep_text('analysis of us united states senate polarization').split(),
        'category': prep_text('machine learning').split(),
        'tags': prep_text('machine learning nigel story politics clustering cluster analysis centroids magnitude lstm kmeans forecasting arima time series sklearn keras tensorflow sql database eda').split()
    },
    3: {
        'name': prep_text('retail simulation Sales Simulation with Customer-Level Behavior and Preference; Regional Seasonality; and Product, Category, and Brand Data').split(),
        'category': prep_text('data engineering simulation').split(),
        'tags': prep_text('simulation nigel story retail shopping business poisson inverse transform montecarlo seasonality web scraping object oriented').split()
    },
    4: {
        'name': prep_text('James Webb Telescope Images GAN generative adversarial network').split(),
        'category': prep_text('machine learning simulation').split(),
        'tags': prep_text('machine learning nigel story gan generative adversarial network space telescope pytorch image processing computer vision neural network deep learning convolutional cnn tcnn transposed james webb nasa').split()
    },
    5: {
        'name': prep_text('Hazardous Freight Prediction: Siamese Neural Network and NLP natural language processing train').split(),
        'category': prep_text('machine learning').split(),
        'tags': prep_text('machine learning nigel story cargo freight train siamese neural network text vectorization vector embedding train logistics hazard binary classification etl deep learning sklearn').split()
    },
    6: {
        'name': prep_text('Senate Votes ETL and Database Population').split(),
        'category': prep_text('data engineering').split(),
        'tags': prep_text('data engineering nigel story mysql sql database population data cleaning munging web scraping web crawling bot automation ').split()
    },
    7: {
        'name': prep_text('Hamlet NLP Preprocessing').split(),
        'category': prep_text('data engineering machine learning').split(),
        'tags': prep_text('hamlet william shakespeare machine learning data engineering nigel story nlp natural language processing apache spark pyspark map reduce nltk sklearn tf idf term frequency inverse document bag of words').split()
    }
    
}

5.3 Testing Results

Let's perform a final test of our algorithm using the search term "time series" to search through all of the projects on the website.

In [33]:
def title_score(q_ls, proj_attr):
    """Sum Levenshtein ratios above 0.5 between query words and the project's name words."""
    score = 0
    for w in q_ls:
        for n in proj_attr['name']:
            lr = Levenshtein.ratio(w, n)
            if lr > 0.5:
                score += lr
    return score


def category_score(q_ls, proj_attr):
    """Sum Levenshtein ratios above 0.5 between query words and the project's category words."""
    score = 0
    for w in q_ls:
        for n in proj_attr['category']:
            lr = Levenshtein.ratio(w, n)
            if lr > 0.5:
                score += lr
    return score


def tag_score(q_ls, proj_attr):
    """Sum Levenshtein ratios above 0.5 between query words and the project's miscellaneous tags."""
    score = 0
    for w in q_ls:
        for n in proj_attr['tags']:
            lr = Levenshtein.ratio(w, n)
            if lr > 0.5:
                score += lr
    return score


def search_score(q, proj_attrs, proj_tfidf_vec, tfidf_model):
    """Ingest search query and calculate the search rank score.
    
    Args:
        q (str): User search term/query.
        proj_attrs (dict): dictionary of all project tags.
        proj_tfidf_vec (np.array): TF-IDF vector for all documents in corpus.
        tfidf_model (sklearn.TfidfVectorizer): pretrained TF-IDF vectorizer model.
        
    Returns:
        (np.array): Search rank scores for each project.
    """
    
    prepped_q = prep_text(q)
    q_ls = prepped_q.split()
    
    tls = []
    tgs = []
    cts = []
    for p, atts in proj_attrs.items():
        tls.append(title_score(q_ls, atts))
        tgs.append(tag_score(q_ls, atts))
        cts.append(category_score(q_ls, atts))
    
    cos_sim = cosine_similarity(tfidf_model.transform([prepped_q]), proj_tfidf_vec)
    
    return np.array(tls) + 0.3*np.array(cts) + 0.5*np.array(tgs) + cos_sim[0]
In [34]:
search_term = 'time series'
score = search_score(search_term, tags, tf_idf['vectorized_projects'], tf_idf['trained_model'])
res = pd.DataFrame(zip(notebook_html, score.T), columns=['proj', 'score']).sort_values('score', ascending=False)
res.loc[res['score']>0.7, :]
Out[34]:
proj score
1 acnh.html 3.737743
5 waybill.html 1.749067
0 anomaly.html 1.158529
2 senate.html 1.018685
4 james_webb_gan.html 1.014170

These results are very strong. The top result refers to the ACNH Time Series Clustering project, and it's followed by the Hazardous Freight Prediction: Siamese Neural Network and NLP project, which also uses time series data. In fact, all of the results except for the last relate strongly to the search terms.

We'll take a look at how all this looks in production in the next section.


6. In Production

In production, the algorithm works exactly the same way it does here; there are just the added hurdles of passing data between an HTML/PHP/JavaScript website and Python.

Python has many libraries to facilitate this kind of data pipeline, like Flask or Django, but the quickest and easiest solution is CGI, the Common Gateway Interface. Enabling CGI on a web server allows programming languages like Perl or Python to respond to GET and POST requests and pass data to front-end web applications.

When a user submits a search on the website, a POST request is sent asynchronously via AJAX to the CGI Python script. This script loads in a pretrained TF-IDF model, preprocesses the search text, calculates the search rank scores, and finally returns the results. In production, the results of a search are returned as templated HTML snippets that can be displayed directly on the site and viewed by the front-end user.
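As an illustration, here is a minimal sketch of what such a CGI endpoint could look like (this is hypothetical, not the site's actual script; the search_utils module, the 'q' form field, and the response markup are all assumptions):

#!/usr/bin/env python3
# Hypothetical CGI endpoint: read a search query from a POST request, score it
# against the pickled TF-IDF model, and return HTML snippets for the front end.
import cgi
import pickle as pkl

from search_utils import search_score, tags, notebook_html  # hypothetical shared module

form = cgi.FieldStorage()
query = form.getfirst('q', '')

with open('trained_models/tf_idf_model_2022_08_30.pkl', 'rb') as f:
    tf_idf = pkl.load(f)

scores = search_score(query, tags, tf_idf['vectorized_projects'], tf_idf['trained_model'])

# CGI responses are written to stdout: headers first, then a blank line, then the body.
print('Content-Type: text/html')
print()
for proj, score in sorted(zip(notebook_html, scores), key=lambda t: -t[1]):
    if score > 0.7:
        print(f'<div class="search-result">{proj} ({score:.3f})</div>')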


7. Conclusion

I get a huge amount of satisfaction when data science can be used in a visible way, and it doesn't get much more visible than this. Using TF-IDF combined with fuzzy matching on keywords produces strong results, and though I believe some additional tweaking may be required before it's perfect, I am well pleased with these results so far.

This algorithm can be improved in the short term in two ways: (1) finding an optimal balance of the weights used in the calculation of the search rank score, and (2) adding a richer set of tags to augment search results.

Thanks for reading!