Basic usage
TextScrubber
The text_scrubber.text_scrubber.TextScrubber class cleans a single string or a collection of strings. It is constructed and configured by chaining building blocks:
from text_scrubber import TextScrubber

ts = (TextScrubber().to_ascii()
                    .lowercase()
                    .tokenize()
                    .remove_stop_words()
                    .join())
which can then be used as:
ts.transform('héLlô there, WòrlD') # outputs 'hello world'
or with an iterable of inputs:
ts.transform(['héLlô there, WòrlD', 'slímm̀er ÀI']) # outputs ['hello world', 'slimmer ai']
For a complete list of building blocks, please refer to the text_scrubber.text_scrubber.TextScrubber API reference.
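To make the chaining idea concrete, here is a minimal standard-library sketch of how such a builder could work. It is not the library's implementation: MiniScrubber, its internals, and the STOP_WORDS list are hypothetical names invented for this illustration, and the real TextScrubber may tokenize and filter differently.

```python
import unicodedata

# Hypothetical stop-word list, chosen only for this illustration.
STOP_WORDS = {"a", "an", "the", "there", "of", "and"}


class MiniScrubber:
    """Toy builder: every building block appends one step and returns self."""

    def __init__(self):
        self._steps = []

    def _add(self, func):
        self._steps.append(func)
        return self  # returning self is what makes chaining possible

    def to_ascii(self):
        def strip_accents(text):
            # Decompose accented letters, then drop the non-ASCII marks.
            decomposed = unicodedata.normalize("NFKD", text)
            return decomposed.encode("ascii", "ignore").decode("ascii")
        return self._add(strip_accents)

    def lowercase(self):
        return self._add(str.lower)

    def tokenize(self):
        # Split on whitespace and trim surrounding punctuation.
        return self._add(lambda text: [t.strip(".,;:!?") for t in text.split()])

    def remove_stop_words(self):
        return self._add(
            lambda tokens: [t for t in tokens if t and t not in STOP_WORDS]
        )

    def join(self, sep=" "):
        return self._add(sep.join)

    def transform(self, data):
        # A single string runs through every step in order.
        if isinstance(data, str):
            result = data
            for step in self._steps:
                result = step(result)
            return result
        # Any other iterable is transformed item by item.
        return [self.transform(item) for item in data]


ms = MiniScrubber().to_ascii().lowercase().tokenize().remove_stop_words().join()
print(ms.transform("héLlô there, WòrlD"))  # prints: hello world
```

Each step receives the output of the previous one, so the order of the building blocks matters: join() only makes sense after tokenize(), for example.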
Geo
The text_scrubber.geo module contains functions to normalize geographical data. They handle spelling errors, country name variations, etc.:
from text_scrubber.geo import normalize_country, normalize_state, normalize_city
# Countries
normalize_country('Peoples rep. of China') # ['China']
normalize_country('Deutschland') # ['Germany']
normalize_country('st Nevis and Kitties') # ['Saint Kitts and Nevis']
normalize_country('ira') # ['Iran', 'Iraq']
# States
normalize_state('Qld') # [('Queensland', 'Australia')]
normalize_state('AR') # [('Arkansas', 'United States'),
# ('Arunachal Pradesh', 'India')]
normalize_state('King Kong') # [('Hong Kong', 'China')]
# Cities
normalize_city('Leibnitz') # [('Leibnitz', 'Austria')]
normalize_city('heidelberg') # [('Heidelberg', 'Australia'), ('Heidelberg', 'Germany'),
# ('Heidelberg', 'South Africa'),
# ('Heidelberg', 'United States')]
normalize_city('texas') # [('Texas City', 'United States')]
normalize_city('Pari') # [('Parai', 'Brazil'), ('Paris', 'Canada'),
# ('Paris', 'France'), ('Paris', 'United States'),
# ('Parit', 'Malaysia'), ('Pariz', 'Czech Republic')]
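The examples above return a list of candidates rather than a single answer. As a rough illustration of why an ambiguous query such as 'ira' can yield several matches, the following sketch uses approximate string matching from Python's standard difflib module. The COUNTRIES table, the candidate_countries function, and the 0.8 cutoff are assumptions made for this example; the matching strategy used by text_scrubber.geo itself may be quite different.

```python
import difflib

# Hypothetical, tiny lookup table: cleaned name -> canonical name.
COUNTRIES = {
    "iran": "Iran",
    "iraq": "Iraq",
    "china": "China",
    "germany": "Germany",
}


def candidate_countries(query, cutoff=0.8):
    """Return every canonical name whose key is close enough to the query."""
    cleaned = query.strip().lower()
    # An exact hit is unambiguous, so return it alone.
    if cleaned in COUNTRIES:
        return [COUNTRIES[cleaned]]
    # Otherwise keep all keys whose similarity ratio reaches the cutoff.
    close = difflib.get_close_matches(cleaned, list(COUNTRIES), n=5, cutoff=cutoff)
    return sorted(COUNTRIES[key] for key in close)


print(candidate_countries("ira"))     # prints: ['Iran', 'Iraq']
print(candidate_countries("Germny"))  # prints: ['Germany']
```

Because 'ira' is equally close to 'iran' and 'iraq', both survive the cutoff, which mirrors the multi-candidate results shown above.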
Warning
The list of states and cities is likely incomplete for some countries.
Note
Whenever a country is considered part of another country, normalize_country will return the latter. E.g., Puerto Rico is mapped to United States and Greenland to Denmark.
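The mapping described in this note can be pictured as a simple lookup from a territory to its parent country. The snippet below is only a sketch of that idea: PARENT_COUNTRY and resolve_parent are hypothetical names, and the table holds just the two examples mentioned above.

```python
# Hypothetical table containing only the two examples from the note.
PARENT_COUNTRY = {
    "Puerto Rico": "United States",
    "Greenland": "Denmark",
}


def resolve_parent(country):
    """Return the parent country if one is known, else the input unchanged."""
    return PARENT_COUNTRY.get(country, country)


print(resolve_parent("Greenland"))  # prints: Denmark
print(resolve_parent("France"))     # prints: France
```

If your application needs to keep territories separate, it has to record the original input before normalizing, since the normalized result no longer carries that information.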
Cleaning
Clean functions are available for countries, states, and cities; they all follow the same cleaning pipeline:
from text_scrubber.geo import clean_country, clean_state, clean_city
clean_country('cent afr rep.') # 'central african republic'
clean_state('Hyōgo') # 'hyogo'
clean_city('płońsk') # 'plonsk'
clean_city('neustadt/westerwald') # 'neustadt westerwald'
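Judging from the outputs above, the pipeline folds accents to ASCII, lowercases, and replaces punctuation with spaces. The following standard-library sketch reproduces that behaviour for three of the examples. It is an approximation under stated assumptions: clean_sketch and EXTRA_FOLDS are hypothetical names, and abbreviation expansion (such as 'cent afr rep.' becoming 'central african republic') is deliberately left out.

```python
import re
import unicodedata

# Letters that NFKD normalization does not decompose.
# An illustrative subset, not an exhaustive table.
EXTRA_FOLDS = str.maketrans({"ł": "l", "Ł": "L", "ø": "o", "Ø": "O", "ß": "ss"})


def clean_sketch(text):
    """Approximate cleaning: ASCII-fold, lowercase, punctuation to spaces."""
    text = text.translate(EXTRA_FOLDS)
    # Split accented letters into base letter + combining mark, drop the marks.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    text = text.lower()
    # Collapse every run of non-alphanumeric characters into one space.
    text = re.sub(r"[^a-z0-9]+", " ", text)
    return text.strip()


print(clean_sketch("Hyōgo"))                # prints: hyogo
print(clean_sketch("płońsk"))               # prints: plonsk
print(clean_sketch("neustadt/westerwald"))  # prints: neustadt westerwald
```

The explicit EXTRA_FOLDS table is needed because characters such as 'ł' have no Unicode decomposition and would otherwise be dropped by the ASCII encoding step.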