Change log

0.4.1

(2022-05-27)

  • Documentation issue prevented GitHub from uploading text-scrubber to PyPI

0.4.0

(2022-05-27)

  • Common country replacements have been updated to allow for fuzzy matching

  • Country codes added such that they can be matched through text_scrubber.geo.normalize_country()

  • The geo normalize functions now return the canonical name together with the matched name

  • The geo normalize functions now return a text_scrubber.geo.normalize.Location object

  • The geo find in string functions now return a text_scrubber.geo.find_in_string.ExtractedLocation object

0.3.2

(2022-05-19)

  • Updated MANIFEST.in file to include the Cython files

0.3.1

(2022-05-19)

  • Optimized Levenshtein and trigram similarity functions, and all normalization functions. Speed-ups of x20-40 are to be expected

0.3.0

(2022-04-13)

  • Renamed normalize_state to text_scrubber.geo.normalize_region(), as it now handles all kinds of regions

  • Expanded countries, regions, and cities with geonames database, increasing the completeness of the geo database

  • text_scrubber.geo.normalize_country(), text_scrubber.geo.normalize_region(), and text_scrubber.geo.normalize_city() now return the match scores as well

  • text_scrubber.geo.normalize_region() and text_scrubber.geo.normalize_city() also return the corresponding normalized country

  • Added text_scrubber.geo.find_country_in_string(), text_scrubber.geo.find_city_in_string(), and text_scrubber.geo.find_region_in_string() functions that find a location in a string

  • Updated cleaning pipeline of text_scrubber.geo.clean_country(), text_scrubber.geo.clean_city(), and text_scrubber.geo.clean_region()

  • Added case_sensitive boolean flag to text_scrubber.text_scrubber.TextScrubber.remove_stop_words()

  • Improved speed of trigram matching by mapping trigrams to integer indices

0.2.1

(2022-03-02)

  • Information about the cities in a country is loaded on the fly.

0.2.0

(2021-05-10)

  • Replaced unidecode by anyascii, which has a more relaxed license. Output of to_ascii can change because of it

0.1.1

(2020-09-10)

  • Removed Python 3.5 support

0.1.0

(2020-09-10)

  • First release