PrepGem

PrepGem is a Python package for preprocessing text data, designed to simplify the text-cleaning process for natural language processing (NLP) projects.

Features

PrepGem offers the following features:

Handle Missing Values: Easily handle missing values in specified DataFrame columns.
Clean HTML Text: Remove HTML tags and special characters from text or DataFrame columns.
Remove URLs: Remove URLs from text or DataFrame columns.
Remove Punctuation: Remove punctuation from text or DataFrame columns.
Remove Emojis: Remove emojis from text or DataFrame columns.
Remove Foreign Letters: Remove foreign letters from text or DataFrame columns.
Remove Numbers: Remove numbers from text or DataFrame columns.
Lowercasing: Convert text to lowercase in text or DataFrame columns.
Remove White Spaces: Remove extra white spaces from text or DataFrame columns.
Remove Repeated Characters: Remove repeated characters in words from text or DataFrame columns.
Remove Nonsense Words: Remove nonsense words from text or DataFrame columns.
Spell Correction: Perform spell-checking on text or DataFrame columns.
Nonsense Words and Spell Check: Perform spell-checking and remove nonsense words from text or DataFrame columns.
Tokenize: Tokenize text using NLTK's word_tokenize function.
Remove Stopwords: Remove stopwords from text tokens.
Stemming: Perform stemming on text tokens.

Installation

You can install PrepGem via pip:

pip install prepgem

Usage

Importing the module python

import prepgem

Basic Usage

text = "This is an example text for preprocessing."
cleaned_text = prepgem.preprocess_text(text)
print(cleaned_text)

Preprocessing a single text

text = "This is an example text for preprocessing."
cleaned_text = prepgem.preprocess_single_text(text)
print(cleaned_text)

Preprocessing a DataFrame

import pandas as pd

# Create a sample DataFrame
data = {
    'text_column': ["This is an example text.", "Another example text with numbers: 12345."]
}
df = pd.DataFrame(data)

# Preprocess text column in the DataFrame
cleaned_df = prepgem.preprocess_dataframe(df, columns=['text_column'])
print(cleaned_df)

Default preprocessing pipeline

Default available preprocessing step is:

clean_html_text.
remove_urls
remove_punctuation
remove_emojis
remove_foreign_letters
remove_numbers
lowercasing
remove_white_spaces
remove_repeated_characters
nosense_words_and_spell_check
tokenize
remove_stopwords
stemming

text = "This is an example text with <html> tags and URLs: https://example.com."
cleaned_text = prepgem.preprocess_text(text)
print(cleaned_text)

Custom preprocessing pipeline

You can customize the preprocessing steps by passing a list of parameters to the preprocess_text method. Available parameters include:

clean_html_text.
remove_urls
remove_punctuation
remove_emojis
remove_foreign_letters
remove_numbers
lowercasing
remove_white_spaces
remove_repeated_characters
remove_nonsense_words
spell_corrector
nosense_words_and_spell_check
tokenize
remove_stopwords
stemming
handle_missing_values

Example usage

text = "This is an example text with <html> tags and URLs: https://example.com."
cleaned_text = prepgem.preprocess_text(text, pipeline=["clean_html_text","nosense_words_and_spell_check"])
print(cleaned_text)

You can customize the preprocessing steps by passing a parameter remove with value of True remove=True to the preprocess_text method to remove a step. Available parameters include:

Example usage

text = "This is an example text with <html> tags and URLs: https://example.com."
cleaned_text = prepgem.preprocess_text(text, pipeline=["clean_html_text"], remove=True)
print(cleaned_text)

You can use all step as normal function just by passing The text or DataFrame containing the text column to be cleaned

from prepgem import remove_urls

# Example text with URLs
text_with_urls = "This is an example text with URLs: https://example.com and http://www.example.org."

# Remove URLs from the text

cleaned_text = remove_urls(text_with_urls)

print("Original text:")
print(text_with_urls)
print("\nText after removing URLs:")
print(cleaned_text)

This will output:

Original text:
This is an example text with URLs: https://example.com and http://www.example.org.

Text after removing URLs:
This is an example text with URLs:  and .

Name		Name	Last commit message	Last commit date
Latest commit History 23 Commits
.github/workflows		.github/workflows
.idea		.idea
prepgem		prepgem
README.md		README.md
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

PrepGem

Features

Installation

Usage

Importing the module python

Basic Usage

Preprocessing a single text

Preprocessing a DataFrame

Default preprocessing pipeline

Custom preprocessing pipeline

Example usage

Example usage

About

Uh oh!

Releases 4

Packages

Uh oh!

Languages

simoderyouch/PrepGem

Folders and files

Latest commit

History

Repository files navigation

PrepGem

Features

Installation

Usage

Importing the module python

Basic Usage

Preprocessing a single text

Preprocessing a DataFrame

Default preprocessing pipeline

Custom preprocessing pipeline

Example usage

Example usage

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases 4

Packages 0

Uh oh!

Languages

Packages