# String Grouper  
<!-- Some cool decorations -->
[![pypi](https://badgen.net/pypi/v/string-grouper)](https://pypi.org/project/string-grouper)
[![license](https://badgen.net/pypi/license/string_grouper)](https://github.com/Bergvca/string_grouper)
[![lastcommit](https://badgen.net/github/last-commit/Bergvca/string_grouper)](https://github.com/Bergvca/string_grouper)
[![codecov](https://codecov.io/gh/Bergvca/string_grouper/branch/master/graph/badge.svg?token=AGK441CQDT)](https://codecov.io/gh/Bergvca/string_grouper)
<!-- [![github](https://shields.io/github/v/release/Bergvca/string_grouper)](https://github.com/Bergvca/string_grouper) -->

<details>
<summary>Click to see image</summary>
<br>
<center><img width="100%" src="https://raw.githubusercontent.com/Bergvca/string_grouper/master/tutorials/sec__edgar_company_info_group003c.svg"></center>

The image displayed above is a visualization of the graph-structure of one of the groups of strings found by `string_grouper`.  Each circle (node) represents a string, and each connecting arc (edge) represents a match between a pair of strings with a similarity score above a given threshold score (here `0.8`).  

The ***centroid*** of the group, as determined by `string_grouper` (see [tutorials/group_representatives.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/group_representatives.md) for an explanation), is the largest node, also with the most edges originating from it.  A thick line in the image denotes a strong similarity between the nodes at its ends, while a faint thin line denotes weak similarity.

The power of `string_grouper` is discernible from this image: in large datasets, `string_grouper` is often able to resolve indirect associations between strings even when direct matches between those strings cannot be computed using conventional methods with a lower threshold similarity score (for example, because of memory limitations).

<div style="text-align: center"> &mdash;&mdash;&mdash;</div>

<sup>This image was designed using the graph-visualization software Gephi 0.9.2 with data generated by `string_grouper` operating on the [sec__edgar_company_info.csv](https://www.kaggle.com/dattapiy/sec-edgar-companies-list/version/1) sample data file.</sup>

---
</details>


**`string_grouper`** is a library that makes finding groups of similar strings within a single list, or across multiple lists, of strings easy and fast. **`string_grouper`** uses **tf-idf** to calculate [**cosine similarities**](https://towardsdatascience.com/understanding-cosine-similarity-and-its-application-fd42f585296a) within a single list or between two lists of strings. The full process is described in the blog post [Super Fast String Matching in Python](https://bergvca.github.io/2017/10/14/super-fast-string-matching.html).
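
To make the idea of tf-idf weighted cosine similarity concrete, here is a minimal pure-Python sketch. It is not the library's implementation: the helper names (`char_ngrams`, `tfidf_vectors`, `cosine`), the smoothed idf formula, and the sample strings are assumptions chosen for illustration only.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(strings, n=3):
    """Build one L2-normalised tf-idf vector (as a dict) per string."""
    counts = [Counter(char_ngrams(s, n)) for s in strings]
    n_docs = len(strings)
    # Document frequency: in how many strings does each n-gram occur?
    df = Counter(gram for c in counts for gram in c)
    vectors = []
    for c in counts:
        # Smoothed idf (an assumption of this sketch): rare n-grams weigh more.
        vec = {g: tf * (math.log((1 + n_docs) / (1 + df[g])) + 1)
               for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()} if norm else {})
    return vectors


def cosine(u, v):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(w * v.get(g, 0.0) for g, w in u.items())


names = ['Hyper Startup Inc', 'Hyper Startup Incorporated', 'Mega Enterprises Corp']
vecs = tfidf_vectors(names)
sim_close = cosine(vecs[0], vecs[1])  # strings sharing many n-grams
sim_far = cosine(vecs[0], vecs[2])    # strings sharing no n-grams
print(round(sim_close, 3), round(sim_far, 3))
```

Because the vectors are normalised, the cosine similarity reduces to a dot product, which is what makes a sparse matrix multiplication a natural fit for comparing many strings at once.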

## Installing

`pip install string-grouper`

## Usage

```python
import pandas as pd
from string_grouper import (
    match_strings, match_most_similar,
    group_similar_strings, compute_pairwise_similarities,
    StringGrouper,
)
```

As shown above, the library may be used together with `pandas`, and contains four high-level functions (`match_strings`, `match_most_similar`, `group_similar_strings`, and `compute_pairwise_similarities`) that can be used directly, and one class (`StringGrouper`) that allows for a more interactive approach.

The permitted calling patterns of the four functions, and their return types, are:

| Function        | Parameters | `pandas` Return Type |
| -------------: |:-------------|:-----:|
| `match_strings`| `(master, **kwargs)`| `DataFrame` |
| `match_strings`| `(master, duplicates, **kwargs)`| `DataFrame` |
| `match_strings`| `(master, master_id=id_series, **kwargs)`| `DataFrame` |
| `match_strings`| `(master, duplicates, master_id, duplicates_id, **kwargs)`| `DataFrame` |
| `match_most_similar`| `(master, duplicates, **kwargs)`| `Series` (if kwarg `ignore_index=True`) otherwise `DataFrame` (default)|
| `match_most_similar`| `(master, duplicates, master_id, duplicates_id, **kwargs)`| `DataFrame` |
| `group_similar_strings`| `(strings_to_group, **kwargs)`| `Series` (if kwarg `ignore_index=True`) otherwise `DataFrame` (default)|
| `group_similar_strings`| `(strings_to_group, strings_id, **kwargs)`| `DataFrame` |
| `compute_pairwise_similarities`| `(string_series_1, string_series_2, **kwargs)`| `Series` |

In the rest of this document, the names `Series` and `DataFrame` refer to the familiar `pandas` object types.

#### Parameters:

|Name | Description |
|:--- | :--- |
|**`master`** | A `Series` of strings to be matched with themselves (or with those in `duplicates`). |
|**`duplicates`** | A `Series` of strings to be matched with those of `master`. |
|**`master_id`** (or `id_series`) | A `Series` of IDs corresponding to the strings in `master`. |
|**`duplicates_id`** | A `Series` of IDs corresponding to the strings in `duplicates`. |
|**`strings_to_group`** | A `Series` of strings to be grouped. |
|**`strings_id`** | A `Series` of IDs corresponding to the strings in `strings_to_group`. |
|**`string_series_1(_2)`** | A `Series` of strings each of which is to be compared with its corresponding string in `string_series_2(_1)`. |
|**`**kwargs`** | Keyword arguments (see [below](#kwargs)).|

***New in version 0.6.0***<a name="corpus"></a>: each of the high-level functions listed above also has a `StringGrouper` method counterpart of the same name and parameters.  Calling such a method on an instance of `StringGrouper` does not rebuild the instance's underlying corpus; instead, the existing corpus is reused to perform the string-comparisons.  The input `Series` passed to the method (`master`, `duplicates`, and so on) are thus encoded, or transformed, into tf-idf matrices using this corpus.  For example:
```python
# Build a corpus using strings in the pandas Series master:
sg = StringGrouper(master)
# The following method-calls will compare strings first in
# pandas Series new_master_1 and next in new_master_2
# using the corpus already built above without rebuilding or
# changing it in any way:
matches1 = sg.match_strings(new_master_1)
matches2 = sg.match_strings(new_master_2)
```

#### Functions:

* #### `match_strings` 
   Returns a `DataFrame` containing similarity-scores of all matching pairs of highly similar strings from `master` (and `duplicates` if given).  Each matching pair in the output appears in its own row/record consisting of
   
   1. its "left" part: a string (with/without its index-label) from `master`, 
   2. its similarity score, and  
   3. its "right" part: a string (with/without its index-label) from `duplicates` (or `master` if `duplicates` is not given), 
   
   in that order.  Thus the column-names of the output are a collection of three groups:
   
   1. The name of `master` and the name(s) of its index (or index-levels) all prefixed by the string `'left_'`,
   2. `'similarity'` whose column has the similarity-scores as values, and 
   3. The name of `duplicates` (or `master` if `duplicates` is not given) and the name(s) of its index (or index-levels) prefixed by the string `'right_'`.
   
   Indexes (or their levels) only appear when the keyword argument `ignore_index=False` (the default). (See [tutorials/ignore_index_and_replace_na.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/ignore_index_and_replace_na.md) for a demonstration.)
   
   If either `master` or `duplicates` has no name, it assumes the name `'side'` which is then prefixed as described above.  Similarly, if any of the indexes (or index-levels) has no name it assumes its `pandas` default name (`'index'`, `'level_0'`, and so on) and is then prefixed as described above.
   
   In other words, if only parameter `master` is given, the function will return pairs of highly similar strings within `master`.  This can be seen as a self-join where both `'left_'` and `'right_'` prefixed columns come from `master`. If both parameters `master` and `duplicates` are given, it will return pairs of highly similar strings between `master` and `duplicates`. This can be seen as an inner-join where `'left_'` and `'right_'` prefixed columns come from `master` and `duplicates` respectively.     
   
   The function also supports optionally inputting IDs (`master_id` and `duplicates_id`) corresponding to the strings being matched.  In which case, the output includes two additional columns whose names are the names of these optional `Series` prefixed by `'left_'` and `'right_'` accordingly, and containing the IDs corresponding to the strings in the output.  If any of these `Series` has no name, then it assumes the name `'id'` and is then prefixed as described above.
   
   
* #### `match_most_similar` 
   If `ignore_index=True`, returns a `Series` of strings, where for each string in `duplicates` the most similar string in `master` is returned.  If there are no similar strings in `master` for a given string in `duplicates` (because there is no potential match where the cosine similarity is above the threshold \[default: 0.8\]) then the original string in `duplicates` is returned.  The output `Series` thus has the same length and index as `duplicates`.  
   
   For example, if an input `Series` with the values `['foooo', 'bar', 'baz']` is passed as the argument `master`, and `['foooob', 'bar', 'new']` as the values of the argument `duplicates`, the function will return a `Series` with values: `['foooo', 'bar', 'new']`.
   
   The name of the output `Series` is the same as that of `master` prefixed with the string `'most_similar_'`.  If `master` has no name, it is assumed to have the name `'master'` before being prefixed.
       
   If `ignore_index=False` (the default), `match_most_similar` returns a `DataFrame` containing the same `Series` described above as one of its columns.  So it inherits the same index and length as `duplicates`.  The rest of its columns correspond to the index (or index-levels) of `master` and thus contain the index-labels of the most similar strings being output as values.  If there are no similar strings in `master` for a given string in `duplicates` then the value(s) assigned to this index-column(s) for that string is `NaN` by default.  However, if the keyword argument `replace_na=True`, then these `NaN` values are replaced with the index-label(s) of that string in `duplicates`.  Note that such replacements can only occur if the indexes of `master` and `duplicates` have the same number of levels.  (See [tutorials/ignore_index_and_replace_na.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/ignore_index_and_replace_na.md#MMS) for a demonstration.)
   
   Each column-name of the output `DataFrame` has the same name as its corresponding column, index, or index-level of `master` prefixed with the string `'most_similar_'`.
  
   If both parameters `master_id` and `duplicates_id` are also given, then a `DataFrame` is always returned with the same column(s) as described above, but with an additional column containing those IDs from these input `Series` corresponding to the output strings.  This column's name is the same as that of `master_id` prefixed in the same way as described above.  If `master_id` has no name, it is assumed to have the name `'master_id'` before being prefixed.


* #### `group_similar_strings` 
  Takes a single `Series` of strings (`strings_to_group`) and groups them by assigning to each string one string from `strings_to_group` chosen as the group-representative for each group of similar strings found. (See [tutorials/group_representatives.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/group_representatives.md) for details on how the group-representatives are chosen.)   
  
  If `ignore_index=True`, the output is a `Series` (with the same name as `strings_to_group` prefixed by the string `'group_rep_'`) of the same length and index as `strings_to_group` containing the group-representative strings.  If `strings_to_group` has no name then the name of the returned `Series` is `'group_rep'`.  
   
  For example, an input `Series` with values: `['foooo', 'foooob', 'bar']` will return `['foooo', 'foooo', 'bar']`.  Here `'foooo'` and `'foooob'` are grouped together into group `'foooo'` because they are found to be similar.  Another example can be found [below](#dedup).
  
   If `ignore_index=False`, the output is a `DataFrame` containing the above output `Series` as one of its columns with the same name.  The remaining column(s) correspond to the index (or index-levels) of `strings_to_group` and contain the index-labels of the group-representatives as values.  These columns have the same names as their counterparts prefixed by the string `'group_rep_'`. 
   
   If `strings_id` is also given, then the IDs from `strings_id` corresponding to the group-representatives are also returned in an additional column (with the same name as `strings_id` prefixed as described above).  If `strings_id` has no name, it is assumed to have the name `'id'` before being prefixed.
   

* #### `compute_pairwise_similarities`
   Returns a `Series` of cosine similarity scores of the same length and index as `string_series_1`.  Each score is the cosine similarity between the two strings found at the same position (row) in the input `Series` `string_series_1` and `string_series_2`.  This can be seen as an element-wise comparison between the two input `Series`.
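
The fallback rule of `match_most_similar` described above (return the best match in `master`, or the original string when nothing reaches the threshold) can be sketched in plain Python. This is only an illustration: the function `most_similar` is hypothetical, and it uses `difflib.SequenceMatcher` as a stand-in similarity measure rather than the tf-idf cosine similarity that `string_grouper` computes.

```python
from difflib import SequenceMatcher


def most_similar(master, duplicates, min_similarity=0.8):
    """For each string in duplicates, return its best match in master,
    or the string itself when no candidate reaches min_similarity."""
    result = []
    for dup in duplicates:
        best, best_score = dup, 0.0
        for candidate in master:
            # Stand-in similarity (NOT string_grouper's tf-idf cosine similarity).
            score = SequenceMatcher(None, dup, candidate).ratio()
            if score >= min_similarity and score > best_score:
                best, best_score = candidate, score
        result.append(best)
    return result


matched = most_similar(['foooo', 'bar', 'baz'], ['foooob', 'bar', 'new'])
print(matched)  # ['foooo', 'bar', 'new']
```

The output has one entry per string in `duplicates`, in the same order, which mirrors the statement that the returned `Series` has the same length and index as `duplicates`.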
   

All four functions are built on the class **`StringGrouper`**. The class can be used indirectly through the pre-defined high-level functions above, or directly for a more interactive approach in which matches can be added or removed as needed.
   

#### Options:

* #### <a name="kwargs"></a>`kwargs`

   All keyword arguments not mentioned in the function definitions above are used to update the default settings. The following optional arguments can be used:

   * **`ngram_size`**: The number of characters in each n-gram. Default is `3`.
   * **`regex`**: The regex string used to clean-up the input string. Default is `r"[,-./]|\s"`.
   * **`ignore_case`**: Determines whether or not letter case in strings should be ignored. Defaults to `True`.
   * **`tfidf_matrix_dtype`**: The datatype for the tf-idf values of the matrix components. Allowed values are `numpy.float32` and `numpy.float64`.  Default is `numpy.float32`.  (Note: `numpy.float32` often leads to faster processing and a smaller memory footprint albeit less numerical precision than `numpy.float64`.)
   * **`max_n_matches`**: The maximum number of matching strings in `master` allowed per string in `duplicates`. Default is the total number of strings in `master`.
   * **`min_similarity`**: The minimum cosine similarity for two strings to be considered a match.
    Defaults to `0.8`.
   * **`number_of_processes`**: The number of processes used by the cosine similarity calculation. Defaults to
    the number of cores on the machine minus one.
   * **`ignore_index`**: Determines whether indexes are ignored or not.  If `False` (the default), index-columns will appear in the output, otherwise not.  (See [tutorials/ignore_index_and_replace_na.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/ignore_index_and_replace_na.md) for a demonstration.)
   * **`replace_na`**: For function `match_most_similar`, determines whether `NaN` values in index-columns are replaced or not by index-labels from `duplicates`. Defaults to `False`.  (See [tutorials/ignore_index_and_replace_na.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/ignore_index_and_replace_na.md) for a demonstration.)
   * **`include_zeroes`**: When `min_similarity` &le; 0, determines whether zero-similarity matches appear in the output.  Defaults to `True`.  (See [tutorials/zero_similarity.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/zero_similarity.md).)  **Note:** If `include_zeroes` is `True` and the kwarg `max_n_matches` is set, then it must be sufficiently high to capture ***all*** nonzero-similarity-matches; otherwise an error is raised and `string_grouper` suggests an alternative value for `max_n_matches`.  To let `string_grouper` choose an appropriate value for `max_n_matches` automatically, do not set this kwarg at all.
   * **`group_rep`**: For function `group_similar_strings`, determines how group-representatives are chosen.  Allowed values are `'centroid'` (the default) and `'first'`.  See [tutorials/group_representatives.md](https://github.com/Bergvca/string_grouper/blob/master/tutorials/group_representatives.md) for an explanation.
   * **`force_symmetries`**: In cases where `duplicates` is `None`, specifies whether corrections should be made to the results to account for symmetry, thus compensating for those losses of numerical significance which violate the symmetries. Defaults to `True`.
   * **`n_blocks`**: This parameter is a tuple of two `int`s provided to help boost performance, if possible, of processing large DataFrames (see [Subsection Performance](#perf)), by splitting the DataFrames into `n_blocks[0]` blocks for the left operand (of the underlying matrix multiplication) and into `n_blocks[1]` blocks for the right operand before performing the string-comparisons block-wise.  Defaults to `None`, in which case automatic splitting occurs if an `OverflowError` would otherwise occur.
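
As a rough illustration of how `regex`, `ignore_case`, and `ngram_size` interact, the sketch below cleans a string and cuts it into character n-grams. The helper `clean_and_split` is hypothetical and written only for this example; it is assumed to approximate the preprocessing, not to reproduce the library's code.

```python
import re


def clean_and_split(text, ngram_size=3, regex=r"[,-./]|\s", ignore_case=True):
    """Hypothetical helper: normalise a string, then cut it into character n-grams."""
    if ignore_case:
        text = text.lower()
    # Remove every character matched by the clean-up pattern
    # (commas, hyphens, periods, slashes and whitespace for the default regex).
    text = re.sub(regex, "", text)
    return [text[i:i + ngram_size] for i in range(len(text) - ngram_size + 1)]


grams = clean_and_split("A.B. Corp")
print(grams)  # ['abc', 'bco', 'cor', 'orp']
```

With these defaults, `"A.B. Corp"` and `"AB Corp"` yield identical n-grams, which is why punctuation and spacing variants of a name can still reach a high similarity score.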

## Examples

In this section we will cover a few use cases for which `string_grouper` may be used. We will use the same data set of company names as used in: [Super Fast String Matching in Python](https://bergvca.github.io/2017/10/14/super-fast-string-matching.html).

### Find all matches within a single data set


```python
import pandas as pd
import numpy as np
from string_grouper import (
    match_strings, match_most_similar,
    group_similar_strings, compute_pairwise_similarities,
    StringGrouper,
)
```


```python
company_names = '/media/chris/data/dev/name_matching/data/sec_edgar_company_info.csv'
# We only look at the first 50k as an example:
companies = pd.read_csv(company_names)[0:50000]
# Create all matches:
matches = match_strings(companies['Company Name'])
# Look at only the non-exact matches:
matches[matches['left_Company Name'] != matches['right_Company Name']].head()
```


<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>left_index</th>
      <th>left_Company Name</th>
      <th>similarity</th>
      <th>right_Company Name</th>
      <th>right_index</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>15</th>
      <td>14</td>
      <td>0210, LLC</td>
      <td>0.870291</td>
      <td>90210 LLC</td>
      <td>4211</td>
    </tr>
    <tr>
      <th>167</th>
      <td>165</td>
      <td>1 800 MUTUALS ADVISOR SERIES</td>
      <td>0.931615</td>
      <td>1 800 MUTUALS ADVISORS SERIES</td>
      <td>166</td>
    </tr>
    <tr>
      <th>168</th>
      <td>166</td>
      <td>1 800 MUTUALS ADVISORS SERIES</td>
      <td>0.931615</td>
      <td>1 800 MUTUALS ADVISOR SERIES</td>
      <td>165</td>
    </tr>
    <tr>
      <th>172</th>
      <td>168</td>
      <td>1 800 RADIATOR FRANCHISE INC</td>
      <td>1.000000</td>
      <td>1-800-RADIATOR FRANCHISE INC.</td>
      <td>201</td>
    </tr>
    <tr>
      <th>178</th>
      <td>173</td>
      <td>1 FINANCIAL MARKETPLACE SECURITIES LLC        ...</td>
      <td>0.949364</td>
      <td>1 FINANCIAL MARKETPLACE SECURITIES, LLC</td>
      <td>174</td>
    </tr>
  </tbody>
</table>
</div>


### Find all matches between two data sets
The `match_strings` function finds similar items between two data sets as well. This can be seen as an inner join between two data sets:


```python
# Create a small set of artificial company names:
duplicates = pd.Series(['S MEDIA GROUP', '012 SMILE.COMMUNICATIONS', 'foo bar', 'B4UTRADE COM CORP'])
# Create all matches:
matches = match_strings(companies['Company Name'], duplicates)
matches
```


<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>left_index</th>
      <th>left_Company Name</th>
      <th>similarity</th>
      <th>right_side</th>
      <th>right_index</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>12</td>
      <td>012 SMILE.COMMUNICATIONS LTD</td>
      <td>0.944092</td>
      <td>012 SMILE.COMMUNICATIONS</td>
      <td>1</td>
    </tr>
    <tr>
      <th>1</th>
      <td>49777</td>
      <td>B.A.S. MEDIA GROUP</td>
      <td>0.854383</td>
      <td>S MEDIA GROUP</td>
      <td>0</td>
    </tr>
    <tr>
      <th>2</th>
      <td>49855</td>
      <td>B4UTRADE COM CORP</td>
      <td>1.000000</td>
      <td>B4UTRADE COM CORP</td>
      <td>3</td>
    </tr>
    <tr>
      <th>3</th>
      <td>49856</td>
      <td>B4UTRADE COM INC</td>
      <td>0.810217</td>
      <td>B4UTRADE COM CORP</td>
      <td>3</td>
    </tr>
    <tr>
      <th>4</th>
      <td>49857</td>
      <td>B4UTRADE CORP</td>
      <td>0.878276</td>
      <td>B4UTRADE COM CORP</td>
      <td>3</td>
    </tr>
  </tbody>
</table>
</div>


Out of the four company names in `duplicates`, three companies are found in the original company data set. One company is found three times.

### Finding duplicates from a (database extract to) DataFrame where IDs for rows are supplied.

A very common scenario is the case where duplicate records for an entity have been entered into a database. That is, there are two or more records where a name field has slightly different spelling. For example, "A.B. Corporation" and "AB Corporation". Using the optional ID parameters of the `match_strings` function, such duplicates can be found easily. A [tutorial](https://github.com/Bergvca/string_grouper/blob/master/tutorials/tutorial_1.md) that steps through the process with an example data set is available.


### For a second data set, find only the most similar match

In the example above, it's possible that multiple matches are found for a single string. Sometimes we just want a string to match with a single most similar string. If there are no similar strings found, the original string should be returned:


```python
# Create a small set of artificial company names:
new_companies = pd.Series(['S MEDIA GROUP', '012 SMILE.COMMUNICATIONS', 'foo bar', 'B4UTRADE COM CORP'],
                          name='New Company')
# Find the most similar match for each new company:
matches = match_most_similar(companies['Company Name'], new_companies, ignore_index=True)
# Display the results:
pd.concat([new_companies, matches], axis=1)
```


<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>New Company</th>
      <th>most_similar_Company Name</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>S MEDIA GROUP</td>
      <td>B.A.S. MEDIA GROUP</td>
    </tr>
    <tr>
      <th>1</th>
      <td>012 SMILE.COMMUNICATIONS</td>
      <td>012 SMILE.COMMUNICATIONS LTD</td>
    </tr>
    <tr>
      <th>2</th>
      <td>foo bar</td>
      <td>foo bar</td>
    </tr>
    <tr>
      <th>3</th>
      <td>B4UTRADE COM CORP</td>
      <td>B4UTRADE COM CORP</td>
    </tr>
  </tbody>
</table>
</div>



### <a name="dedup"></a>Deduplicate a single data set and show items with most duplicates

The `group_similar_strings` function groups strings that are similar using a single linkage clustering algorithm. That is, if item A and item B are similar; and item B and item C are similar; but the similarity between A and C is below the threshold; then all three items are grouped together. 

```python
# Add the grouped strings:
companies['deduplicated_name'] = group_similar_strings(companies['Company Name'],
                                                       ignore_index=True)
# Show items with most duplicates:
companies.groupby('deduplicated_name')['Line Number'].count().sort_values(ascending=False).head(10)
```




    deduplicated_name
    ADVISORS DISCIPLINED TRUST                                      1824
    AGL LIFE ASSURANCE CO SEPARATE ACCOUNT                           183
    ANGELLIST-ART-FUND, A SERIES OF ANGELLIST-FG-FUNDS, LLC          116
    AMERICREDIT AUTOMOBILE RECEIVABLES TRUST 2001-1                   87
    ACE SECURITIES CORP. HOME EQUITY LOAN TRUST, SERIES 2006-HE2      57
    ASSET-BACKED PASS-THROUGH CERTIFICATES SERIES 2004-W1             40
    ALLSTATE LIFE GLOBAL FUNDING TRUST 2005-3                         39
    ALLY AUTO RECEIVABLES TRUST 2014-1                                33
    ANDERSON ROBERT E /                                               28
    ADVENT INTERNATIONAL GPE VIII LIMITED PARTNERSHIP                 28
    Name: Line Number, dtype: int64
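
The transitive (single-linkage) behaviour described above, where A and C end up in one group because both match B, can be sketched with a small union-find structure. The function `group_by_single_linkage` is hypothetical, and picking the first member of each group as its representative is an assumption of this sketch, not a statement about how the library chooses representatives.

```python
def group_by_single_linkage(items, matches):
    """Group items so that any chain of pairwise matches ends up in one group.

    `matches` is an iterable of (i, j) index pairs judged similar.
    Returns, for each item, the first item of its group as a stand-in representative.
    """
    parent = list(range(len(items)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in matches:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            # Keep the smaller index as root so the representative is the first member.
            parent[max(root_i, root_j)] = min(root_i, root_j)

    return [items[find(i)] for i in range(len(items))]


items = ['A', 'B', 'C', 'D']
# A~B and B~C are matches; A~C is not, and D matches nothing.
reps = group_by_single_linkage(items, [(0, 1), (1, 2)])
print(reps)  # ['A', 'A', 'A', 'D']
```

Only the list of matching pairs is needed to form the groups, so no similarity between A and C ever has to be computed.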


The `group_similar_strings` function also works with IDs: imagine a `DataFrame` (`customers_df`) with the following content:
```python
# Create a small set of artificial customer names:
customers_df = pd.DataFrame(
   [
      ('BB016741P', 'Mega Enterprises Corporation'),
      ('CC082744L', 'Hyper Startup Incorporated'),
      ('AA098762D', 'Hyper Startup Inc.'),
      ('BB099931J', 'Hyper-Startup Inc.'),
      ('HH072982K', 'Hyper Hyper Inc.')
   ],
   columns=('Customer ID', 'Customer Name')
).set_index('Customer ID')
# Display the data:
customers_df
```

<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Customer Name</th>
    </tr>
    <tr>
      <th>Customer ID</th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>BB016741P</th>
      <td>Mega Enterprises Corporation</td>
    </tr>
    <tr>
      <th>CC082744L</th>
      <td>Hyper Startup Incorporated</td>
    </tr>
    <tr>
      <th>AA098762D</th>
      <td>Hyper Startup Inc.</td>
    </tr>
    <tr>
      <th>BB099931J</th>
      <td>Hyper-Startup Inc.</td>
    </tr>
    <tr>
      <th>HH072982K</th>
      <td>Hyper Hyper Inc.</td>
    </tr>
  </tbody>
</table>
</div>

The output of `group_similar_strings` can be directly used as a mapping table:
```python
# Group customers with similar names:
customers_df[["group-id", "name_deduped"]]  = \
    group_similar_strings(customers_df["Customer Name"])
# Display the mapping table:
customers_df
```

<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Customer Name</th>
      <th>group-id</th>
      <th>name_deduped</th>
    </tr>
    <tr>
      <th>Customer ID</th>
      <th></th>
      <th></th>
      <th></th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>BB016741P</th>
      <td>Mega Enterprises Corporation</td>
      <td>BB016741P</td>
      <td>Mega Enterprises Corporation</td>
    </tr>
    <tr>
      <th>CC082744L</th>
      <td>Hyper Startup Incorporated</td>
      <td>CC082744L</td>
      <td>Hyper Startup Incorporated</td>
    </tr>
    <tr>
      <th>AA098762D</th>
      <td>Hyper Startup Inc.</td>
      <td>AA098762D</td>
      <td>Hyper Startup Inc.</td>
    </tr>
    <tr>
      <th>BB099931J</th>
      <td>Hyper-Startup Inc.</td>
      <td>AA098762D</td>
      <td>Hyper Startup Inc.</td>
    </tr>
    <tr>
      <th>HH072982K</th>
      <td>Hyper Hyper Inc.</td>
      <td>HH072982K</td>
      <td>Hyper Hyper Inc.</td>
    </tr>
  </tbody>
</table>
</div>

Note that here `customers_df` initially had only one column "Customer Name" (before the `group_similar_strings` function call); and it acquired two more columns "group-id" (the index-column) and "name_deduped" after the call through a "[setting with enlargement](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#setting-with-enlargement)" (a `pandas` feature).

### <a name="dot"></a>Simply compute the cosine similarities of pairs of strings

Sometimes we have pairs of strings that have already been matched but whose similarity scores need to be computed.  For this purpose we provide the function `compute_pairwise_similarities`:

```python
# Create a small DataFrame of pairs of strings:
pair_s = pd.DataFrame(
    [
        ('Mega Enterprises Corporation', 'Mega Enterprises Corporation'),
        ('Hyper Startup Inc.', 'Hyper Startup Incorporated'),
        ('Hyper Startup Inc.', 'Hyper Startup Inc.'),
        ('Hyper Startup Inc.', 'Hyper-Startup Inc.'),
        ('Hyper Hyper Inc.', 'Hyper Hyper Inc.'),
        ('Mega Enterprises Corporation', 'Mega Enterprises Corp.')
    ],
    columns=('left', 'right')
)
# Display the data:
pair_s
```




<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>left</th>
      <th>right</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Mega Enterprises Corporation</td>
      <td>Mega Enterprises Corporation</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Hyper Startup Inc.</td>
      <td>Hyper Startup Incorporated</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Hyper Startup Inc.</td>
      <td>Hyper Startup Inc.</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Hyper Startup Inc.</td>
      <td>Hyper-Startup Inc.</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Hyper Hyper Inc.</td>
      <td>Hyper Hyper Inc.</td>
    </tr>
    <tr>
      <th>5</th>
      <td>Mega Enterprises Corporation</td>
      <td>Mega Enterprises Corp.</td>
    </tr>
  </tbody>
</table>
</div>




```python
# Compute their cosine similarities and display them:
pair_s['similarity'] = compute_pairwise_similarities(pair_s['left'], pair_s['right'])
pair_s
```




<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>left</th>
      <th>right</th>
      <th>similarity</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>Mega Enterprises Corporation</td>
      <td>Mega Enterprises Corporation</td>
      <td>1.000000</td>
    </tr>
    <tr>
      <th>1</th>
      <td>Hyper Startup Inc.</td>
      <td>Hyper Startup Incorporated</td>
      <td>0.633620</td>
    </tr>
    <tr>
      <th>2</th>
      <td>Hyper Startup Inc.</td>
      <td>Hyper Startup Inc.</td>
      <td>1.000000</td>
    </tr>
    <tr>
      <th>3</th>
      <td>Hyper Startup Inc.</td>
      <td>Hyper-Startup Inc.</td>
      <td>1.000000</td>
    </tr>
    <tr>
      <th>4</th>
      <td>Hyper Hyper Inc.</td>
      <td>Hyper Hyper Inc.</td>
      <td>1.000000</td>
    </tr>
    <tr>
      <th>5</th>
      <td>Mega Enterprises Corporation</td>
      <td>Mega Enterprises Corp.</td>
      <td>0.826463</td>
    </tr>
  </tbody>
</table>
</div>



## The StringGrouper class

The four functions mentioned above all create a `StringGrouper` object behind the scenes and call different methods on it. The `StringGrouper` class keeps track of all tuples of similar strings and creates the groups out of these. Since matches are often not perfect, a common workflow is to:

1. Create matches
2. Manually inspect the results
3. Add and remove matches where necessary
4. Create groups of similar strings

The `StringGrouper` class allows for this without having to re-calculate the cosine similarity matrix. See below for an example. 


```python
company_names = '/media/chris/data/dev/name_matching/data/sec_edgar_company_info.csv'
companies = pd.read_csv(company_names)
```

1. Create matches


```python
# Create a new StringGrouper
string_grouper = StringGrouper(companies['Company Name'], ignore_index=True)
# Check if the ngram function does what we expect:
string_grouper.n_grams('McDonalds')
```

    ['McD', 'cDo', 'Don', 'ona', 'nal', 'ald', 'lds']


```python
# Now fit the StringGrouper - this will take a while since we are calculating cosine similarities on 600k strings
string_grouper = string_grouper.fit()
```

```python
# Add the grouped strings
companies['deduplicated_name'] = string_grouper.get_groups()
```

Suppose we know that `PWC HOLDING CORP` and `PRICEWATERHOUSECOOPERS LLP` are the same company. `StringGrouper` will not match these since they are not similar enough.


```python
companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')]
```


<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Line Number</th>
      <th>Company Name</th>
      <th>Company CIK Key</th>
      <th>deduplicated_name</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>478441</th>
      <td>478442</td>
      <td>PRICEWATERHOUSECOOPERS LLP                              /TA</td>
      <td>1064284</td>
      <td>PRICEWATERHOUSECOOPERS LLP                              /TA</td>
    </tr>
    <tr>
      <th>478442</th>
      <td>478443</td>
      <td>PRICEWATERHOUSECOOPERS LLP</td>
      <td>1186612</td>
      <td>PRICEWATERHOUSECOOPERS LLP                              /TA</td>
    </tr>
    <tr>
      <th>478443</th>
      <td>478444</td>
      <td>PRICEWATERHOUSECOOPERS SECURITIES LLC</td>
      <td>1018444</td>
      <td>PRICEWATERHOUSECOOPERS LLP                              /TA</td>
    </tr>
  </tbody>
</table>
</div>


```python
companies[companies.deduplicated_name.str.contains('PWC')]
```


<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Line Number</th>
      <th>Company Name</th>
      <th>Company CIK Key</th>
      <th>deduplicated_name</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>485535</th>
      <td>485536</td>
      <td>PWC CAPITAL INC.</td>
      <td>1690640</td>
      <td>PWC CAPITAL INC.</td>
    </tr>
    <tr>
      <th>485536</th>
      <td>485537</td>
      <td>PWC HOLDING CORP</td>
      <td>1456450</td>
      <td>PWC HOLDING CORP</td>
    </tr>
    <tr>
      <th>485537</th>
      <td>485538</td>
      <td>PWC INVESTORS, LLC</td>
      <td>1480311</td>
      <td>PWC INVESTORS, LLC</td>
    </tr>
    <tr>
      <th>485538</th>
      <td>485539</td>
      <td>PWC REAL ESTATE VALUE FUND I LLC</td>
      <td>1668928</td>
      <td>PWC REAL ESTATE VALUE FUND I LLC</td>
    </tr>
    <tr>
      <th>485539</th>
      <td>485540</td>
      <td>PWC SECURITIES CORP                                     /BD</td>
      <td>1023989</td>
      <td>PWC SECURITIES CORP                                     /BD</td>
    </tr>
    <tr>
      <th>485540</th>
      <td>485541</td>
      <td>PWC SECURITIES CORPORATION</td>
      <td>1023989</td>
      <td>PWC SECURITIES CORPORATION</td>
    </tr>
    <tr>
      <th>485541</th>
      <td>485542</td>
      <td>PWCC LTD</td>
      <td>1172241</td>
      <td>PWCC LTD</td>
    </tr>
    <tr>
      <th>485542</th>
      <td>485543</td>
      <td>PWCG BROKERAGE, INC.</td>
      <td>67301</td>
      <td>PWCG BROKERAGE, INC.</td>
    </tr>
  </tbody>
</table>
</div>


We can put these in the same group with the `add_match` function:


```python
string_grouper = string_grouper.add_match('PRICEWATERHOUSECOOPERS LLP', 'PWC HOLDING CORP')
companies['deduplicated_name'] = string_grouper.get_groups()
# Now let's check again:

companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')]
```


<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Line Number</th>
      <th>Company Name</th>
      <th>Company CIK Key</th>
      <th>deduplicated_name</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>478441</th>
      <td>478442</td>
      <td>PRICEWATERHOUSECOOPERS LLP                              /TA</td>
      <td>1064284</td>
      <td>PRICEWATERHOUSECOOPERS LLP                              /TA</td>
    </tr>
    <tr>
      <th>478442</th>
      <td>478443</td>
      <td>PRICEWATERHOUSECOOPERS LLP</td>
      <td>1186612</td>
      <td>PRICEWATERHOUSECOOPERS LLP                              /TA</td>
    </tr>
    <tr>
      <th>478443</th>
      <td>478444</td>
      <td>PRICEWATERHOUSECOOPERS SECURITIES LLC</td>
      <td>1018444</td>
      <td>PRICEWATERHOUSECOOPERS LLP                              /TA</td>
    </tr>
    <tr>
      <th>485536</th>
      <td>485537</td>
      <td>PWC HOLDING CORP</td>
      <td>1456450</td>
      <td>PRICEWATERHOUSECOOPERS LLP                              /TA</td>
    </tr>
  </tbody>
</table>
</div>


This can also be used to merge two groups:


```python
string_grouper = string_grouper.add_match('PRICEWATERHOUSECOOPERS LLP', 'ZUCKER MICHAEL')
companies['deduplicated_name'] = string_grouper.get_groups()

# Now let's check again:
companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')]
```


<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Line Number</th>
      <th>Company Name</th>
      <th>Company CIK Key</th>
      <th>deduplicated_name</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>478441</th>
      <td>478442</td>
      <td>PRICEWATERHOUSECOOPERS LLP                              /TA</td>
      <td>1064284</td>
      <td>PRICEWATERHOUSECOOPERS LLP                              /TA</td>
    </tr>
    <tr>
      <th>478442</th>
      <td>478443</td>
      <td>PRICEWATERHOUSECOOPERS LLP</td>
      <td>1186612</td>
      <td>PRICEWATERHOUSECOOPERS LLP                              /TA</td>
    </tr>
    <tr>
      <th>478443</th>
      <td>478444</td>
      <td>PRICEWATERHOUSECOOPERS SECURITIES LLC</td>
      <td>1018444</td>
      <td>PRICEWATERHOUSECOOPERS LLP                              /TA</td>
    </tr>
    <tr>
      <th>485536</th>
      <td>485537</td>
      <td>PWC HOLDING CORP</td>
      <td>1456450</td>
      <td>PRICEWATERHOUSECOOPERS LLP                              /TA</td>
    </tr>
    <tr>
      <th>662585</th>
      <td>662586</td>
      <td>ZUCKER MICHAEL</td>
      <td>1629018</td>
      <td>PRICEWATERHOUSECOOPERS LLP                              /TA</td>
    </tr>
    <tr>
      <th>662604</th>
      <td>662605</td>
      <td>ZUCKERMAN MICHAEL</td>
      <td>1303321</td>
      <td>PRICEWATERHOUSECOOPERS LLP                              /TA</td>
    </tr>
    <tr>
      <th>662605</th>
      <td>662606</td>
      <td>ZUCKERMAN MICHAEL</td>
      <td>1496366</td>
      <td>PRICEWATERHOUSECOOPERS LLP                              /TA</td>
    </tr>
  </tbody>
</table>
</div>


We can remove strings from groups in the same way:


```python
string_grouper = string_grouper.remove_match('PRICEWATERHOUSECOOPERS LLP', 'ZUCKER MICHAEL')
companies['deduplicated_name'] = string_grouper.get_groups()

# Now let's check again:
companies[companies.deduplicated_name.str.contains('PRICEWATERHOUSECOOPERS LLP')]
```


<div>
<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>Line Number</th>
      <th>Company Name</th>
      <th>Company CIK Key</th>
      <th>deduplicated_name</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>478441</th>
      <td>478442</td>
      <td>PRICEWATERHOUSECOOPERS LLP                              /TA</td>
      <td>1064284</td>
      <td>PRICEWATERHOUSECOOPERS LLP                              /TA</td>
    </tr>
    <tr>
      <th>478442</th>
      <td>478443</td>
      <td>PRICEWATERHOUSECOOPERS LLP</td>
      <td>1186612</td>
      <td>PRICEWATERHOUSECOOPERS LLP                              /TA</td>
    </tr>
    <tr>
      <th>478443</th>
      <td>478444</td>
      <td>PRICEWATERHOUSECOOPERS SECURITIES LLC</td>
      <td>1018444</td>
      <td>PRICEWATERHOUSECOOPERS LLP                              /TA</td>
    </tr>
    <tr>
      <th>485536</th>
      <td>485537</td>
      <td>PWC HOLDING CORP</td>
      <td>1456450</td>
      <td>PRICEWATERHOUSECOOPERS LLP                              /TA</td>
    </tr>
  </tbody>
</table>
</div>
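
Adding or removing a match does not require recomputing similarities, because groups can be rebuilt from the list of matching pairs alone. The sketch below illustrates that idea with a small union-find structure in plain Python. It is an illustration only, not the library's implementation: the function `group_strings` is hypothetical, and it simply picks the first string of a merged pair as the representative, whereas `string_grouper` has its own rules for choosing group representatives.

```python
def group_strings(strings, matches):
    """Map each string to a group representative, given pairs of matching strings."""
    parent = {s: s for s in strings}

    def find(s):
        # Follow parent links up to the root, shortening the path on the way.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups
    return {s: find(s) for s in strings}


names = ['PRICEWATERHOUSECOOPERS LLP', 'PWC HOLDING CORP',
         'ZUCKER MICHAEL', 'ZUCKERMAN MICHAEL']
matches = [('ZUCKER MICHAEL', 'ZUCKERMAN MICHAEL')]

before = group_strings(names, matches)
# Adding a pair by hand merges the groups of its two members.
after = group_strings(names, matches + [('PRICEWATERHOUSECOOPERS LLP', 'PWC HOLDING CORP')])

print(before['PWC HOLDING CORP'])  # PWC HOLDING CORP
print(after['PWC HOLDING CORP'])   # PRICEWATERHOUSECOOPERS LLP
```

In this picture, removing a match amounts to dropping the pair from the list and rebuilding the groups from the pairs that remain.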

# Performance<a name="perf"></a>

### <a name="Semilogx"></a>Semilogx plots of run-times of `match_strings()` vs the number of blocks (`n_blocks[1]`) into which the right matrix-operand of the dataset (663 000 strings from sec__edgar_company_info.csv) was split before performing the string comparison.  As shown in the legend, each plot corresponds to the number `n_blocks[0]` of blocks into which the left matrix-operand was split.
![Semilogx](https://raw.githubusercontent.com/Bergvca/string_grouper/master/images/BlockNumberSpaceExploration1.png)

String comparison, as implemented by `string_grouper`, is essentially matrix 
multiplication.  A pandas Series of strings is converted (tokenized) into a 
matrix.  That matrix is then multiplied by its own transpose (or by the transpose of another such matrix).  

Here is an illustration of multiplication of two matrices ***D*** and ***M***<sup>T</sup>:
![Block Matrix 1 1](https://raw.githubusercontent.com/Bergvca/string_grouper/master/images/BlockMatrix_1_1.png)
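
As a rough illustration of why string comparison reduces to a matrix product, the sketch below turns each string into a unit-length vector of trigram counts and multiplies the resulting matrix by its own transpose. Each entry of the product is then the cosine similarity of a pair of strings. This is a simplification under stated assumptions: it uses raw trigram counts and dense Python lists, while `string_grouper` works with weighted n-gram vectors and sparse matrices. The helper `trigram_vector` is hypothetical.

```python
from collections import Counter
from math import sqrt


def trigram_vector(text, vocabulary):
    """Return the unit-length trigram-count vector of `text` over `vocabulary`."""
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    norm = sqrt(sum(c * c for c in counts.values()))
    return [counts[gram] / norm for gram in vocabulary]


strings = ['Mega Enterprises Corporation', 'Mega Enterprises Corp.', 'Hyper Startup Inc.']

# One matrix column per distinct trigram, one matrix row per string.
vocabulary = sorted({s[i:i + 3] for s in strings for i in range(len(s) - 2)})
matrix = [trigram_vector(s, vocabulary) for s in strings]

# Matrix times its own transpose: entry [i][j] is the cosine similarity
# between strings[i] and strings[j].
similarity = [[sum(x * y for x, y in zip(row_a, row_b)) for row_b in matrix]
              for row_a in matrix]

for row in similarity:
    print([round(value, 3) for value in row])
```

The diagonal entries are 1 (every string matches itself), and the two "Mega Enterprises" strings score much higher with each other than either does with "Hyper Startup Inc.".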

It turns out that when the matrix (or Series) is very large, the computer 
proceeds quite slowly with the multiplication (apparently due to the RAM being 
too full).  Some computers give up with an `OverflowError`.

To circumvent this issue, `string_grouper` now allows the division of the Series 
into smaller chunks (or blocks) and multiplies the chunks one pair at a time 
instead to get the same result:

![Block Matrix 2 2](https://raw.githubusercontent.com/Bergvca/string_grouper/master/images/BlockMatrix_2_2.png)
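
The equivalence pictured above can be checked on a toy example. The following sketch, in plain Python with small integer matrices, splits both operands into row-blocks, multiplies every pair of blocks, and stitches the tiles back together. All function names are hypothetical, and the splitting scheme is an assumption made for illustration; it does not claim to reproduce how `string_grouper` partitions the data internally.

```python
def multiply_transposed(left, right):
    """Return left x right^T for matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(l_row, r_row)) for r_row in right]
            for l_row in left]


def split_rows(matrix, n_blocks):
    """Split a matrix into n_blocks consecutive row-blocks of near-equal size."""
    size, extra = divmod(len(matrix), n_blocks)
    blocks, start = [], 0
    for i in range(n_blocks):
        stop = start + size + (1 if i < extra else 0)
        blocks.append(matrix[start:stop])
        start = stop
    return blocks


def blockwise_multiply_transposed(left, right, n_blocks=(1, 1)):
    """Compute left x right^T one block pair at a time."""
    result = []
    for left_block in split_rows(left, n_blocks[0]):
        # One tile per right block; tiles in the same block-row sit side by side.
        tiles = [multiply_transposed(left_block, right_block)
                 for right_block in split_rows(right, n_blocks[1])]
        for i in range(len(left_block)):
            result.append([value for tile in tiles for value in tile[i]])
    return result


D = [[1, 0, 2], [0, 3, 1], [4, 1, 0], [2, 2, 2]]
M = [[1, 1, 0], [0, 2, 1], [3, 0, 1], [1, 1, 1], [0, 0, 5]]

full = multiply_transposed(D, M)
blocked = blockwise_multiply_transposed(D, M, n_blocks=(2, 3))
print(blocked == full)  # True
```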

But surprise ... the run-time of the process is sometimes drastically reduced 
as a result.  For example, the speed-up of the following call is about 500% 
(here, the Series is divided into 200 blocks on the right operand, that is, 
1 block on the left &times; 200 on the right) compared to the same call with no
splitting \[`n_blocks=(1, 1)`, the default, which is what previous versions 
(0.5.0 and earlier) of `string_grouper` did\]:

```python
import pandas as pd
from string_grouper import match_strings

# A DataFrame of 663 000 records:
companies = pd.read_csv('data/sec__edgar_company_info.csv')

# The following call is more than 6 times faster than earlier versions of 
# match_strings() (that is, when n_blocks=(1, 1))!
match_strings(companies['Company Name'], n_blocks=(1, 200))
```

Further exploration of the block number space ([see plot above](#Semilogx)) has revealed that for any fixed 
number of right blocks, the run-time gets longer the larger the number of left 
blocks specified.  For this reason, it is recommended *not* to split the left matrix.

![Block Matrix 1 2](https://raw.githubusercontent.com/Bergvca/string_grouper/master/images/BlockMatrix_1_2.png)

In general,

&nbsp;&nbsp;&nbsp;***total runtime*** = `n_blocks[0]` &times; `n_blocks[1]` &times; ***mean runtime per block-pair***

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; = ***Left Operand Size*** &times; ***Right Operand Size*** &times; 

&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; ***mean runtime per block-pair*** / (***Left Block Size*** &times; ***Right Block Size***)

So for given left and right operands, minimizing the ***total runtime*** is the same as minimizing the

&nbsp;&nbsp;&nbsp;***runtime per string-pair comparison*** &#8797; <br>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;***mean runtime per block-pair*** / (***Left Block Size*** &times; ***Right Block Size***)


[Below is a log-log-log contour plot](#ContourPlot) of the ***runtime per string-pair comparison*** scaled by its value
at ***Left Block Size*** = ***Right Block Size*** = 5000.  Here, ***Block Size***
is the number of strings in that block, and ***mean runtime per block-pair*** is the time taken for the following call to run:
```python
# note the parameter order!
match_strings(right_Series, left_Series, n_blocks=(1, 1))
```
where `left_Series` and `right_Series`, corresponding to ***Left Block*** and ***Right Block*** respectively, are random subsets of the Series `companies['Company Name']` from the
[sec__edgar_company_info.csv](https://www.kaggle.com/dattapiy/sec-edgar-companies-list/version/1) sample data file.

<a name="ContourPlot"></a> ![ContourPlot](https://raw.githubusercontent.com/Bergvca/string_grouper/master/images/ScaledRuntimeContourPlot.png)

It can be seen that when `right_Series` is roughly the size of 80&nbsp;000 (denoted by the 
white dashed line in the contour plot above), the runtime per string-pair comparison is at 
its lowest for any fixed `left_Series` size.  Above ***Right Block Size*** = 80&nbsp;000, the 
matrix-multiplication routine begins to feel the limits of the computer's 
available memory space and thus its performance deteriorates, as evidenced by the increase 
in runtime per string-pair comparison there (above the white dashed line).  This knowledge 
could serve as a guide for estimating the optimum block numbers &mdash;
namely those that divide the Series into blocks of size roughly equal to 
80&nbsp;000 for the right operand (or `right_Series`).
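
As a starting point, the guideline above can be turned into a small helper that leaves the left operand unsplit and chooses just enough right blocks to keep each one at or below a target size. The function `suggest_n_blocks` is hypothetical and not part of `string_grouper`. The default of 80 000 comes from the contour plot above, which reflects one particular machine and dataset, so treat the result as a first guess to be tuned.

```python
from math import ceil


def suggest_n_blocks(n_right_strings, target_block_size=80_000):
    """Return an n_blocks tuple: left operand whole, right blocks <= target size."""
    n_right_blocks = max(1, ceil(n_right_strings / target_block_size))
    return (1, n_right_blocks)


print(suggest_n_blocks(663_000))  # (1, 9)
```

The returned tuple could then be passed as the `n_blocks` argument of `match_strings()`.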

So what are the optimum block number values for *any* given Series? There is 
no general answer: they likely depend on the data itself and, as hinted above, 
may vary from computer to computer.  

Nevertheless, we encourage users to experiment with the `n_blocks` 
parameter to boost the performance of `string_grouper` whenever possible.
