SkillAgentSearch skills...

Dobbi

An open-source NLP library: fast text cleaning and preprocessing

Install / Use

/learn @iaramer/Dobbi
About this skill

Quality Score

0/100

Supported Platforms

Universal

README

<h1 align='center'> 🌴 dobbi 🦕 </h1> <p align='center'> Takes care of all of this boring NLP stuff <br> <br> <img alt="PyPI - Python Version" src="https://img.shields.io/pypi/pyversions/dobbi"> <a href='https://pypi.org/project/dobbi/'><img alt="Version" src="https://img.shields.io/pypi/v/dobbi?logo=pypi"></a> <a href='https://opensource.org/licenses/Apache-2.0'><img alt="GitHub" src="https://img.shields.io/github/license/iaramer/dobbi"></a><br> </p>

Description

An open-source NLP library: fast text cleaning and preprocessing.

TL;DR

This library provides a quick and ready-to-use text preprocessing tools for text cleaning and normalization. You can simply remove hashtags, nicknames, emoji, url addresses, punctuation, whitespace and whatever.

Installation

To download dobbi, either fork this GitHub repo or simply use Pypi via pip:

$ pip install dobbi

Usage

Import the library:

import dobbi

Interaction

The library uses method chaining in order to simplify text processing:

import pandas as pd

d = {'text': ['#fun #lol   Why  @Alex33 is so funny here: https://some-url.com',
              '#looool     =)      😍 such lovely!?*!!!%&']}
df = pd.DataFrame(d)

cln_func = dobbi.clean() \
    .hashtag() \
    .nickname() \
    .url() \
    .function()
df['text'] = df['text'].map(cln_func)

repl_func = dobbi.replace() \
    .emoji() \
    .emoticon() \
    .punctuation() \
    .function()
df['text'] = df['text'].map(repl_func)

Result:

print(df['text'][0])  # 'Why is so funny here'
print(df['text'][1])  # 'TOKEN_EMOTICON_HAPPY_FACE_OR_SMILEY TOKEN_EMOJI_SMILING_FACE_WITH_HEART_EYES such lovely'

Supported methods and patterns

The process consists of three stages:

  1. Initialization methods: initialize a dobbi Work object
  2. Intermediate methods: chain patterns in the needed order
  3. Terminal methods: choose if you need a function or a result

Initialization functions:

  • dobbi.clean()
  • dobbi.collect()
  • dobbi.replace()

Intermediate methods (pattern processing choice):

  • regexp() - custom regular expressions
  • url() - URLs
  • html() - HTML and "<...>" type markups
  • punctuation() - punctuation
  • hashtag() - hashtags
  • emoji() - emoji
  • emoticons() - emoticons
  • whitespace() - any type of whitespaces
  • nickname() - @-starting nicknames

Terminal methods:

  • execute(str) - executes chosen methods on the provided string.
  • function() - returns a function which is a combination of the chosen methods.

Examples

1) Clean a random Twitter message

dobbi.clean() \
    .hashtag() \
    .nickname() \
    .url() \
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result:

'Why is so funny? Check here:'

2) Replace nicknames and urls with tokens

dobbi.replace() \
    .hashtag('') \
    .nickname() \
    .url('__CUSTOM_URL_TOKEN__') \
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result:

'Why TOKEN_NICKNAME is so funny? Check here: __CUSTOM_URL_TOKEN__'

3) Get the text cleanup function

func = dobbi.clean() \
    .url() \
    .hashtag() \
    .punctuation() \
    .whitespace() \
    .html() \
    .function()
func('\t #fun #lol    Why  @Alex33 is so... funny? <tag> \nCheck\there: https://some-url.com')

Result:

'Why Alex33 is so funny Check here'
  1. Chain regexp methods
dobbi.clean() \
    .regexp('#\w+') \
    .regexp('@\w+') \
    .regexp('https?://\S+') \
    .execute('#fun #lol    Why  @Alex33 is so funny? Check here: https://some-url.com')

Result:

'Why is so funny? Check here:'
  1. Remove emoji and emoticons
em_func = dobbi.clean() \
    .emoji() \
    .emoticon() \
    .punctuation() \
    .function()
em_func('Great! =) :D  😍 😋such lovely!?*!!!%&')

Result:

'Great such lovely'

Additional

Please pay attention that the functions are applied in the order you've specified them. So, you're better to chain .punctuation() as one of the last functions.

Call for collaboration 🤗

If you enjoyed the project I would be grateful if you supported it :)

Below is the list of useful features I would be happy to share with you:

  • [ ] Finding bugs
  • [ ] Making code optimizations
  • [ ] Writing tests
  • [ ] Help with new features development
View on GitHub
GitHub Stars23
CategoryDevelopment
Updated2y ago
Forks0

Languages

Python

Security Score

80/100

Audited on Jan 4, 2024

No findings