Basic usage

TextScrubber

The text_scrubber.text_scrubber.TextScrubber class cleans a single string or a collection of strings. It is constructed and configured by chaining building blocks:

from text_scrubber import TextScrubber

ts = (TextScrubber().to_ascii()
                    .lowercase()
                    .tokenize()
                    .remove_stop_words()
                    .join())

which can then be used as:

ts.transform('héLlô there, WòrlD')  # outputs 'hello world'

or with an iterable of inputs:

ts.transform(['héLlô there, WòrlD', 'slímm̀er ÀI'])  # outputs ['hello world', 'slimmer ai']

For a complete list of building blocks please refer to the text_scrubber.text_scrubber.TextScrubber API reference.
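To make the order of the building blocks above concrete, here is a minimal standard-library sketch of the same five steps. It is not the TextScrubber implementation. The function names, the STOP_WORDS set, and the regular-expression tokenizer are assumptions made for illustration. The ASCII step relies on Unicode decomposition, so characters without a decomposition (for example 'ł') would simply be dropped here.

```python
import re
import unicodedata

# Hypothetical stop word list, for illustration only; the library ships its own.
STOP_WORDS = {"a", "an", "and", "of", "the", "there"}


def to_ascii(text: str) -> str:
    """Strip accents by decomposing characters and dropping non-ASCII code points."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def scrub(text: str) -> str:
    """Apply the steps in order: to_ascii, lowercase, tokenize, remove stop words, join."""
    lowered = to_ascii(text).lower()
    tokens = re.findall(r"[a-z0-9]+", lowered)  # assumed tokenizer: alphanumeric runs
    kept = [token for token in tokens if token not in STOP_WORDS]
    return " ".join(kept)


print(scrub("héLlô there, WòrlD"))  # hello world
```

The order matters in this sketch: stop word removal only works after lowercasing and tokenizing, because the stop word set holds lowercase tokens.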

Geo

The text_scrubber.geo module contains functions that normalize geographical data, handling spelling errors, country name variations, and similar issues:

from text_scrubber.geo import normalize_country, normalize_state, normalize_city

# Countries
normalize_country('Peoples rep. of China')  # ['China']
normalize_country('Deutschland')            # ['Germany']
normalize_country('st Nevis and Kitties')   # ['Saint Kitts and Nevis']
normalize_country('ira')                    # ['Iran', 'Iraq']

# States
normalize_state('Qld')         # [('Queensland', 'Australia')]
normalize_state('AR')          # [('Arkansas', 'United States'),
                               #  ('Arunachal Pradesh', 'India')]
normalize_state('King Kong')   # [('Hong Kong', 'China')]

# Cities
normalize_city('Leibnitz')    # [('Leibnitz', 'Austria')]
normalize_city('heidelberg')  # [('Heidelberg', 'Australia'), ('Heidelberg', 'Germany'),
                              #  ('Heidelberg', 'South Africa'),
                              #  ('Heidelberg', 'United States')]
normalize_city('texas')       # [('Texas City', 'United States')]
normalize_city('Pari')        # [('Parai', 'Brazil'), ('Paris', 'Canada'),
                              #  ('Paris', 'France'), ('Paris', 'United States'),
                              #  ('Parit', 'Malaysia'), ('Pariz', 'Czech Republic')]
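As the examples above show, each normalize function returns a list of candidates, which can hold more than one entry when the input is ambiguous (e.g., 'ira' or 'AR'). A caller that needs exactly one answer therefore has to decide what to do in that case. The helper below is a hypothetical sketch and not part of text_scrubber. It accepts a result only when it is unambiguous, and it uses literal lists copied from the examples in place of real library calls.

```python
from typing import Optional, Sequence, TypeVar

T = TypeVar("T")


def single_match(candidates: Sequence[T]) -> Optional[T]:
    """Return the only candidate, or None if there are zero or several.

    Hypothetical helper, not part of text_scrubber.
    """
    return candidates[0] if len(candidates) == 1 else None


# Literal lists standing in for the normalize_country / normalize_state results above.
print(single_match(["China"]))                      # China
print(single_match(["Iran", "Iraq"]))               # None
print(single_match([("Queensland", "Australia")]))  # ('Queensland', 'Australia')
```

Returning None for ambiguous input is only one possible policy; taking the first candidate or asking the user to choose may fit other applications better.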

Warning

The lists of states and cities are probably incomplete for some countries.

Note

Whenever a country is considered part of another country, normalize_country returns the latter. For example, Puerto Rico is mapped to United States and Greenland to Denmark.

Cleaning

Clean functions are available for countries, states, and cities. They all follow the same cleaning pipeline:

from text_scrubber.geo import clean_country, clean_state, clean_city

clean_country('cent afr rep.')     # 'central african republic'
clean_state('Hyōgo')               # 'hyogo'
clean_city('płońsk')               # 'plonsk'
clean_city('neustadt/westerwald')  # 'neustadt westerwald'