Basic usage
===========

.. contents:: Contents
    :depth: 2
    :local:

TextScrubber
------------

The :obj:`text_scrubber.text_scrubber.TextScrubber` class cleans a single or a collection of strings. It can be easily
constructed and configured with building blocks:

.. code-block:: python

    from text_scrubber import TextScrubber

    ts = (TextScrubber().to_ascii()
                        .lowercase()
                        .tokenize()
                        .remove_stop_words()
                        .join())

which can then be used as:

.. code-block:: python

    ts.transform('héLlô there, WòrlD')  # outputs 'hello world'

or with an iterable of input:

.. code-block:: python

    ts.transform(['héLlô there, WòrlD', 'slímm̀er ÀI'])  # outputs ['hello world', 'slimmer AI']

For a complete list of building blocks please refer to the :obj:`text_scrubber.text_scrubber.TextScrubber` API
reference.


Geo
---

The :obj:`text_scrubber.geo` module contains functions to normalize geographical data which deal with spelling errors,
country name variations, etc.:

.. code-block:: python

    from text_scrubber.geo import normalize_country, normalize_state, normalize_city

    # Countries
    normalize_country('Peoples rep. of China')  # ['China']
    normalize_country('Deutschland')            # ['Germany']
    normalize_country('st Nevis and Kitties')   # ['Saint Kitts and Nevis']
    normalize_country('ira')                    # ['Iran', 'Iraq']

    # States
    normalize_state('Qld')         # [('Queensland', 'Australia')]
    normalize_state('AR')          # [('Arkansas', 'United States'),
                                   #  ('Arunachal Pradesh', 'India')]
    normalize_state('King Kong')   # [('Hong Kong', 'China')]

    # Cities
    normalize_city('Leibnitz')    # [('Leibnitz', 'Austria')]
    normalize_city('heidelberg')  # [('Heidelberg', 'Australia'), ('Heidelberg', 'Germany'),
                                  #  ('Heidelberg', 'South Africa'),
                                  #  ('Heidelberg', 'United States')]
    normalize_city('texas')       # [('Texas City', 'United States')]
    normalize_city('Pari')        # [('Parai', 'Brazil'), ('Paris', 'Canada'),
                                  #  ('Paris', 'France'), ('Paris', 'United States'),
                                  #  ('Parit', 'Malaysia'), ('Pariz', 'Czech Republic')]

.. warning::

    There's a good chance that the list of states/cities is not complete for all countries.

.. note::

    Whenever a country is considered part of another country ``normalize_country`` will return the latter.
    E.g., ``Puerto Rico`` is mapped to ``United States`` and ``Greenland`` to ``Denmark``.


Cleaning
~~~~~~~~

There are clean functions available for countries/states/cities, which all follow the same cleaning pipeline:

.. code-block:: python

    from text_scrubber.geo import clean_country, clean_state, clean_city

    clean_country('cent afr rep.')     # 'central african republic'
    clean_state('Hyōgo')               # 'hyogo'
    clean_city('płońsk')               # 'plonsk'
    clean_city('neustadt/westerwald')  # 'neustadt westerwald'