67 skills found · Page 1 of 3
CatchTheTornado / Text Extract ApiDocument (PDF, Word, PPTX ...) extraction and parse API using state of the art modern OCRs + Ollama supported models. Anonymize documents. Remove PII. Convert any document or picture to structured JSON or Markdown
NanoNets / DocstrangeExtract and convert data from any document, images, pdfs, word doc, ppt or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.
lovit / Soynlp한국어 자연어처리를 위한 파이썬 라이브러리입니다. 단어 추출/ 토크나이저 / 품사판별/ 전처리의 기능을 제공합니다.
yongzhuo / Macropodus自然语言处理工具Macropodus,基于Albert+BiLSTM+CRF深度学习网络架构,中文分词,词性标注,命名实体识别,新词发现,关键词,文本摘要,文本相似度,科学计算器,中文数字阿拉伯数字(罗马数字)转换,中文繁简转换,拼音转换。tookit(tool) of NLP,CWS(chinese word segnment),POS(Part-Of-Speech Tagging),NER(name entity recognition),Find(new words discovery),Keyword(keyword extraction),Summarize(text summarization),Sim(text similarity),Calculate(scientific calculator),Chi2num(chinese number to arabic number)
lluisgomez / TextProposalsImplementation of the method proposed in the papers " TextProposals: a Text-specific Selective Search Algorithm for Word Spotting in the Wild" and "Object Proposals for Text Extraction in the Wild" (Gomez & Karatzas), 2016 and 2015 respectively.
rock3125 / Enhanced Subject Verb Object ExtractionEnhanced Subject Word Object Extraction
yafangy / Review Aspect ExtractionCNN-based model to realize aspect extraction of restaurant reviews based on pre-trained word embeddings and part-of-speech tagging
abhishek305 / PyBot A ChatBot For Answering Python Queries Using NLPPybot can change the way learners try to learn python programming language in a more interactive way. This chatbot will try to solve or provide answer to almost every python related issues or queries that the user is asking for. We are implementing NLP for improving the efficiency of the chatbot. We will include voice feature for more interactivity to the user. By utilizing NLP, developers can organize and structure knowledge to perform tasks such as automatic summarization, translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation. NLTK has been called “a wonderful tool for teaching and working in, computational linguistics using Python,” and “an amazing library to play with natural language.The main issue with text data is that it is all in text format (strings). However, the Machine learning algorithms need some sort of numerical feature vector in order to perform the task. So before we start with any NLP project we need to pre-process it to make it ideal for working. Converting the entire text into uppercase or lowercase, so that the algorithm does not treat the same words in different cases as different Tokenization is just the term used to describe the process of converting the normal text strings into a list of tokens i.e words that we actually want. Sentence tokenizer can be used to find the list of sentences and Word tokenizer can be used to find the list of words in strings.Removing Noise i.e everything that isn’t in a standard number or letter.Removing Stop words. Sometimes, some extremely common words which would appear to be of little value in helping select documents matching a user need are excluded from the vocabulary entirely. These words are called stop words.Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base or root form — generally a written word form. Example if we were to stem the following words: “Stems”, “Stemming”, “Stemmed”, “and Stemtization”, the result would be a single word “stem”. A slight variant of stemming is lemmatization. The major difference between these is, that, stemming can often create non-existent words, whereas lemmas are actual words. So, your root stem, meaning the word you end up with, is not something you can just look up in a dictionary, but you can look up a lemma. Examples of Lemmatization are that “run” is a base form for words like “running” or “ran” or that the word “better” and “good” are in the same lemma so they are considered the same.
BryceWG / Blinko ExtentionAI-based Chrome web content extraction and summarization extension for Blinko. Supports one-click summarization of web pages, word selection saving, quick recording, and more, with content synchronization to the Blinko server.
YuanLi95 / EEGA For JMEREThis is code for Joint Multimodal Entity-Relation Extraction Based on Edge-enhanced Graph Alignment Network and Word-pair Relation Tagging (AAAI 2023)
appautomaton / Document SKILLsA collection of Claude Code / Codex AGENTIC SKILLs for document manipulation - PDF extraction/forms, Excel analysis/formulas, Word, and PowerPoint
arsena-k / Word2Vec Bias ExtractionHow are words loaded with meaning? Repository accompanying research by Alina Arseniev-Koehler and Jacob G. Foster, titled "Machine learning as a model for cultural learning: teaching an algorithm what it means to be fat." https://journals.sagepub.com/doi/full/10.1177/00491241221122603
noc-lab / Clinical Concept ExtractionClinical Concept Extraction with Contextual Word Embedding
nok / Rake Text RubyImplementation of the Rapid Automatic Keyword Extraction algorithm in Ruby, a multi-word keywords extraction.
u-prashant / RAKERAKE-Keyword is a Python library that can extract keywords from any document or a piece of text. It is based on RAKE algorithm. Rapid Automatic Keyword Extraction (RAKE) is a keyword extraction method that is extremely efficient and operates on individual documents. It tries to determine the key phrases in a text by calculating the co-occurrences of every word in a key phrase and also its frequency in the entire text.
eyaler / Hebrew TokenizerA field-tested Hebrew tokenizer for dirty texts (ben-yehuda project, bible, cc100, mc4, opensubs, oscar, twitter) focused on multi-word expression extraction.
arpit3043 / Extractive Text SummerizationSummarization systems often have additional evidence they can utilize in order to specify the most important topics of document(s). For example, when summarizing blogs, there are discussions or comments coming after the blog post that are good sources of information to determine which parts of the blog are critical and interesting. In scientific paper summarization, there is a considerable amount of information such as cited papers and conference information which can be leveraged to identify important sentences in the original paper. How text summarization works In general there are two types of summarization, abstractive and extractive summarization. Abstractive Summarization: Abstractive methods select words based on semantic understanding, even those words did not appear in the source documents. It aims at producing important material in a new way. They interpret and examine the text using advanced natural language techniques in order to generate a new shorter text that conveys the most critical information from the original text. It can be correlated to the way human reads a text article or blog post and then summarizes in their own word. Input document → understand context → semantics → create own summary. 2. Extractive Summarization: Extractive methods attempt to summarize articles by selecting a subset of words that retain the most important points. This approach weights the important part of sentences and uses the same to form the summary. Different algorithm and techniques are used to define weights for the sentences and further rank them based on importance and similarity among each other. Input document → sentences similarity → weight sentences → select sentences with higher rank. The limited study is available for abstractive summarization as it requires a deeper understanding of the text as compared to the extractive approach. Purely extractive summaries often times give better results compared to automatic abstractive summaries. This is because of the fact that abstractive summarization methods cope with problems such as semantic representation, inference and natural language generation which is relatively harder than data-driven approaches such as sentence extraction. There are many techniques available to generate extractive summarization. To keep it simple, I will be using an unsupervised learning approach to find the sentences similarity and rank them. One benefit of this will be, you don’t need to train and build a model prior start using it for your project. It’s good to understand Cosine similarity to make the best use of code you are going to see. Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. Since we will be representing our sentences as the bunch of vectors, we can use it to find the similarity among sentences. Its measures cosine of the angle between vectors. Angle will be 0 if sentences are similar. All good till now..? Hope so :) Next, Below is our code flow to generate summarize text:- Input article → split into sentences → remove stop words → build a similarity matrix → generate rank based on matrix → pick top N sentences for summary.
raphael-sch / HC Sentence SummarizationDiscrete Optimization for Unsupervised Sentence Summarization with Word-Level Extraction
jeekim / FasttextrankA word embedding and graph-based keyword extraction tool
fakabbir / OCRProbabilistic Key Value pair extraction using word weights from Invoices - Non Searchable PDF