
DocumentSearchEngine

Document Search Engine project with TF-IDF and the Google Universal Sentence Encoder model

Install / Use

/learn @zayedrais/DocumentSearchEngine

README

In this post, we will build a **semantic document search engine** using the [20newsgroup open-source dataset](http://qwone.com/~jason/20Newsgroups/).

# Prerequisites

- [Python 3.5](https://www.python.org/)+
- [pip 19](https://pypi.org/project/pip/)+ or pip3
- [NLTK](https://www.nltk.org/)
- [Scikit-learn](https://scikit-learn.org/stable/)
- [TensorFlow-GPU](https://www.tensorflow.org)

# 1. Getting Ready

For this post you will need the prerequisites listed above. If you do not have them yet, please install them first.

# 2. Data collection

Here we use the 20newsgroup dataset to build a text search engine that takes keywords or sentences as input.

The 20 Newsgroups data set is a collection of approximately 11K newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups.

```python
news = pd.read_json('https://raw.githubusercontent.com/zayedrais/DocumentSearchEngine/master/data/newsgroups.json')
```

## 2.1 Data cleaning

Before the cleaning phase, we extract the subject of each document from its text.

```python
for i, txt in enumerate(news['content']):
    subject = re.findall('Subject:(.*\n)', txt)
    if len(subject) != 0:
        news.loc[i, 'Subject'] = str(i) + ' ' + subject[0]
    else:
        news.loc[i, 'Subject'] = 'NA'
df_news = news[['Subject', 'content']]
```

Now we remove unwanted data from the text content and the subject of the dataset.

```python
df_news.content = df_news.content.replace(to_replace='from:(.*\n)', value='', regex=True)   # remove "From:" email lines
df_news.content = df_news.content.replace(to_replace='lines:(.*\n)', value='', regex=True)  # remove "Lines:" header
df_news.content = df_news.content.replace(to_replace='[!"#$%&\'()*+,/:;<=>?@[\\]^_`{|}~]', value=' ', regex=True)  # remove punctuation
df_news.content = df_news.content.replace(to_replace='-', value=' ', regex=True)
df_news.content = df_news.content.replace(to_replace='\s+', value=' ', regex=True)          # collapse newlines and repeated whitespace
df_news.content = df_news.content.apply(lambda x: x.strip())                                # trim leading/trailing whitespace
```

## 2.2 Data preprocessing

Preprocessing is one of the major steps when dealing with any kind of text model. At this stage we look at the distribution of our data, decide which techniques are needed, and how deeply we should clean.

## Lowercase

Convert the text to lower case, i.e. '**Dogs**' into '**dogs**'.

```python
df_news['content'] = [entry.lower() for entry in df_news['content']]
```

## Word Tokenization

Word tokenization is the process of dividing a sentence into individual words.

"**John is running in the track**" → '**john**', '**is**', '**running**', '**in**', '**the**', '**track**'

```python
df_news['Word tokenize'] = [word_tokenize(entry) for entry in df_news.content]
```

## Stop words

Stop words are the most commonly occurring words, which don't add value to the document vector. In fact, removing them improves computation and space efficiency. The [NLTK](https://www.nltk.org/) library has a method to download the stopwords.

![Downloading NLTK stopwords](https://miro.medium.com/max/601/1*PdgWsOM1ep9Z2rfkQ6UJZA.png)

## Word Lemmatization

Lemmatization reduces a word to its root form. Unlike stemming, lemmatization ensures that the reduced word is again a dictionary word (a word present in the same language). WordNetLemmatizer can be used to lemmatize any word.

i.e. **rocks → rock, better → good, corpora → corpus**

Here we create a `wordLemmatizer` function to remove **single characters** and **stopwords**, and **lemmatize** the words.

```python
# WordNetLemmatizer requires POS tags to know whether a word is a noun,
# verb, adjective, etc. By default it assumes a noun.
def wordLemmatizer(data):
    tag_map = defaultdict(lambda: wn.NOUN)
    tag_map['J'] = wn.ADJ
    tag_map['V'] = wn.VERB
    tag_map['R'] = wn.ADV
    file_clean_k = pd.DataFrame()
    for index, entry in enumerate(data):
        # Empty list to store the words that pass the filters below
        Final_words = []
        word_Lemmatized = WordNetLemmatizer()
        # pos_tag provides the tag, i.e. whether the word is a Noun (N), Verb (V), etc.
        for word, tag in pos_tag(entry):
            # Keep only alphabetic, multi-character, non-stop words
            if len(word) > 1 and word not in stopwords.words('english') and word.isalpha():
                word_Final = word_Lemmatized.lemmatize(word, tag_map[tag[0]])
                Final_words.append(word_Final)
        # Store the processed words for each document
        file_clean_k.loc[index, 'Keyword_final'] = str(Final_words)
    return file_clean_k
```
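Once the documents are cleaned and lemmatized, the TF-IDF side of the search engine mentioned in the project description ranks documents against a query by cosine similarity. The excerpt above is truncated before that step, so here is a minimal, self-contained sketch of the idea on a toy corpus (not the repository's exact code; the real pipeline would feed the preprocessed 20newsgroup text into the vectorizer):

```python
# Minimal TF-IDF search sketch over a toy corpus using scikit-learn.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "dogs are running in the park",
    "a document search engine ranks documents by relevance",
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(docs)  # shape: (n_docs, n_terms)

def search(query, top_k=2):
    """Return indices of the top_k documents most similar to the query."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_vectors).ravel()
    return scores.argsort()[::-1][:top_k]

ranked = search("search engine for documents")
print(ranked[0])  # the third document (index 2) ranks first
```

The same interface generalizes to the Universal Sentence Encoder variant: replace the TF-IDF vectors with sentence embeddings and keep the cosine-similarity ranking unchanged.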

Related Skills

View on GitHub
GitHub Stars: 55
Category: Data
Updated: 3 mo ago
Forks: 24

Languages

Jupyter Notebook

Security Score

82/100

Audited on Dec 23, 2025

No findings