Change log¶
0.5.0¶
(2024-08-26)
Added
text_scrubber.text_scrubber.TextScrubber.convert_html_entities()
Added
text_scrubber.text_scrubber.TextScrubber.fix_bad_unicode()
Added
text_scrubber.text_scrubber.TextScrubber.latex_to_text()
Added
text_scrubber.text_scrubber.TextScrubber.normalize_unicode()
Removed Python 3.6 and 3.7 support
Replaced dependency Levenshtein with rapidfuzz for licensing reasons
0.4.2¶
(2023-04-14)
Changed ownership of and moved repository from Slimmer-AI to sybrenjansen
0.4.1¶
(2022-05-27)
Documentation issue prevented GitHub from uploading
text-scrubber
to PyPI
0.4.0¶
(2022-05-27)
Common country replacements have been updated to allow for fuzzy matching
Country codes added such that they can be matched through
text_scrubber.geo.normalize_country()
The geo normalize functions now return the canonical name together with the matched name
The geo normalize functions now return a
text_scrubber.geo.normalize.Location
objectThe geo find in string functions now return a
text_scrubber.geo.find_in_string.ExtractedLocation
object
0.3.2¶
(2022-05-19)
Updated MANIFEST.in file to include the Cython files
0.3.1¶
(2022-05-19)
Optimized Levenshtein and trigram similarity functions, and all normalization functions. Speed-ups of x20-40 are to be expected
0.3.0¶
(2022-04-13)
Renamed normalize_state to
text_scrubber.geo.normalize_region()
, as it now handles all kinds of regionsExpanded countries, regions, and cities with geonames database, increasing the completeness of the geo database
text_scrubber.geo.normalize_country()
,text_scrubber.geo.normalize_region()
, andtext_scrubber.geo.normalize_city()
now return the match scores as welltext_scrubber.geo.normalize_region()
andtext_scrubber.geo.normalize_city()
also return the corresponding normalized countryAdded
text_scrubber.geo.find_country_in_string()
,text_scrubber.geo.find_city_in_string()
, andtext_scrubber.geo.find_region_in_string()
functions that find a location in a stringUpdated cleaning pipeline of
text_scrubber.geo.clean_country()
,text_scrubber.geo.clean_city()
, andtext_scrubber.geo.clean_region()
Added
case_sensitive
boolean flag totext_scrubber.text_scrubber.TextScrubber.remove_stop_words()
Improved speed of trigram matching by mapping trigrams to integer indices
0.2.1¶
(2022-03-02)
Information about the cities in a country is loaded on the fly.
0.2.0¶
(2021-05-10)
Replaced unidecode by anyascii, which has a more relaxed license. Output of to_ascii can change because of it
0.1.1¶
(2020-09-10)
Removed Python 3.5 support
0.1.0¶
(2020-09-10)
First release