# -*- coding: utf-8 -*-
from setuptools import setup

packages = \
['data_diff']

package_data = \
{'': ['*']}

install_requires = \
['click>=8.1,<9.0',
 'dsnparse',
 'rich>=10.16.2,<11.0.0',
 'runtype>=0.2.4,<0.3.0']

extras_require = \
{'mysql': ['mysql-connector-python'],
 'pgsql': ['psycopg2'],
 'preql': ['preql>=0.2.13,<0.3.0'],
 'snowflake': ['snowflake-connector-python']}

entry_points = \
{'console_scripts': ['data-diff = data_diff.__main__:main']}

setup_kwargs = {
    'name': 'data-diff',
    'version': '0.0.4',
    'description': 'Command-line tool and Python library to efficiently diff rows across two different databases.',
    'long_description': '# **data-diff**\n\n**data-diff is currently under heavy development. If you run into issues,\nplease file an issue and we\'ll help you out ASAP!**\n\n**data-diff** is a command-line tool and Python library to efficiently diff\nrows across two different databases.\n\n* 🪢 Verifies across [many different databases][dbs] (e.g. Postgres -> Snowflake)\n* 🔍 Outputs [diff of rows](#example-output) in detail\n* 🚨 Simple CLI/API to create monitoring and alerts\n* 🔥 Verify 25M+ rows in less than 10s\n* ♾️  Works for tables with 10s of billions of rows\n\n**data-diff** splits the table into smaller segments, then checksums each\nsegment in both databases. When the checksums for a segment aren\'t equal, it\nwill further divide that segment into yet smaller segments, checksumming those\nuntil it gets to the differing row(s). See [Technical Explanation][tech-explain] for more\ndetails.\n\nThis approach has similar performance to `count(*)` when there are few/no\nchanges, but is able to output each differing row (and it might even [be\nfaster][perf]). By pushing the compute into the databases, it\'s _much_ faster\nthan querying for and comparing every row.\n\n## Table of Contents\n\n- [Common use-cases](#common-use-cases)\n- [Example output](#example-output)\n- [Supported Databases](#supported-databases)\n- [How to install](#how-to-install)\n- [How to use](#how-to-use)\n- [Technical Explanation](#technical-explanation)\n- [Performance Considerations](#performance-considerations)\n- [Development Setup](#development-setup)\n\n## Common use-cases\n\n* **Verify data migrations.** Verify that all data was copied in a critical\n  migration, e.g. from Heroku Postgres to Amazon RDS.\n* **Verifying data pipelines.** Moving data from a relational database to a\n  warehouse/data lake with Fivetran, Airbyte, Debezium, or some other pipeline.\n* **Alerting and maintaining data integrity SLOs.** You can create and monitor\n  your SLO of e.g. 
99.999% data integrity, and alert your team when data is\n  missing.\n* **Debugging complex data pipelines.** When data gets lost in pipelines that\n  may span a half-dozen systems, without verifying each intermediate datastore\n  it\'s extremely difficult to track down where a row got lost.\n* **Detecting hard deletes for an `updated_at`-based pipeline**. If you\'re\n  copying data to your warehouse based on an `updated_at`-style column, then\n  you\'ll miss hard-deletes that **data-diff** can find for you.\n* **Make your replication self-healing.** You can use **data-diff** to\n  self-heal by using the diff output to write/update rows in the target\n  database.\n\n## Example output\n\nBelow we run a comparison with the CLI for 25M rows in Postgres where the\nright-hand table is missing a single row with `id=12500048`:\n\n```\n$ data-diff \\\n    postgres://postgres:password@localhost/postgres rating \\\n    postgres://postgres:password@localhost/postgres rating_del1 \\\n    --bisection-threshold 100000 \\ # for readability, try default first\n    --bisection-factor 6 \\ # for readability, try default first\n    --update-column timestamp \\\n    --verbose\n[10:15:00] INFO - Diffing tables | segments: 6, bisection threshold: 100000.\n[10:15:00] INFO - . Diffing segment 1/6, key-range: 1..4166683, size: 4166682\n[10:15:03] INFO - . Diffing segment 2/6, key-range: 4166683..8333365, size: 4166682\n[10:15:06] INFO - . Diffing segment 3/6, key-range: 8333365..12500047, size: 4166682\n[10:15:09] INFO - . Diffing segment 4/6, key-range: 12500047..16666729, size: 4166682\n[10:15:12] INFO - . . Diffing segment 1/6, key-range: 12500047..13194494, size: 694447\n[10:15:13] INFO - . . . Diffing segment 1/6, key-range: 12500047..12615788, size: 115741\n[10:15:13] INFO - . . . . Diffing segment 1/6, key-range: 12500047..12519337, size: 19290\n[10:15:13] INFO - . . . . Diff found 1 different rows.\n[10:15:13] INFO - . . . . 
Diffing segment 2/6, key-range: 12519337..12538627, size: 19290\n[10:15:13] INFO - . . . . Diffing segment 3/6, key-range: 12538627..12557917, size: 19290\n[10:15:13] INFO - . . . . Diffing segment 4/6, key-range: 12557917..12577207, size: 19290\n[10:15:13] INFO - . . . . Diffing segment 5/6, key-range: 12577207..12596497, size: 19290\n[10:15:13] INFO - . . . . Diffing segment 6/6, key-range: 12596497..12615788, size: 19291\n[10:15:13] INFO - . . . Diffing segment 2/6, key-range: 12615788..12731529, size: 115741\n[10:15:13] INFO - . . . Diffing segment 3/6, key-range: 12731529..12847270, size: 115741\n[10:15:13] INFO - . . . Diffing segment 4/6, key-range: 12847270..12963011, size: 115741\n[10:15:14] INFO - . . . Diffing segment 5/6, key-range: 12963011..13078752, size: 115741\n[10:15:14] INFO - . . . Diffing segment 6/6, key-range: 13078752..13194494, size: 115742\n[10:15:14] INFO - . . Diffing segment 2/6, key-range: 13194494..13888941, size: 694447\n[10:15:14] INFO - . . Diffing segment 3/6, key-range: 13888941..14583388, size: 694447\n[10:15:15] INFO - . . Diffing segment 4/6, key-range: 14583388..15277835, size: 694447\n[10:15:15] INFO - . . Diffing segment 5/6, key-range: 15277835..15972282, size: 694447\n[10:15:15] INFO - . . Diffing segment 6/6, key-range: 15972282..16666729, size: 694447\n+ (12500048, 1268104625)\n[10:15:16] INFO - . Diffing segment 5/6, key-range: 16666729..20833411, size: 4166682\n[10:15:19] INFO - . 
Diffing segment 6/6, key-range: 20833411..25000096, size: 4166685\n```\n\n## Supported Databases\n\n| Database      | Connection string                                                             | Status |\n|---------------|-------------------------------------------------------------------------------|--------|\n| Postgres      | `postgres://user:password@hostname:5432/database`                             |  💚    |\n| MySQL         | `mysql://user:password@hostname:3306/database`                                |  💚    |\n| Snowflake     | `snowflake://user:password@account/warehouse?database=database&schema=schema` |  💚    |\n| Oracle        | `oracle://username:password@hostname/database`                                |  💛    |\n| BigQuery      | `bigquery:///`                                                                |  💛    |\n| Redshift      | `redshift://username:password@hostname:5439/database`                         |  💛    |\n| Presto        | `presto://username:password@hostname:8080/database`                           |  💛    |\n| ElasticSearch |                                                                               |  📝    |\n| Databricks    |                                                                               |  📝    |\n| Planetscale   |                                                                               |  📝    |\n| Clickhouse    |                                                                               |  📝    |\n| Pinot         |                                                                               |  📝    |\n| Druid         |                                                                               |  📝    |\n| Kafka         |                                                                               |  📝    |\n\n* 💚: Implemented and thoroughly tested.\n* 💛: Implemented, but not thoroughly tested yet.\n* ⏳: Implementation in progress.\n* 📝: Implementation planned. 
Contributions welcome.\n\nIf a database is not on the list, we\'d still love to support it. Open an issue\nto discuss it.\n\n# How to install\n\nRequires Python 3.7+ with pip.\n\n```pip install data-diff```\n\nor when you need extras like mysql and postgres\n\n```pip install "data-diff[mysql,pgsql]"```\n\n# How to use\n\nUsage: `data-diff DB1_URI TABLE1_NAME DB2_URI TABLE2_NAME [OPTIONS]`\n\nOptions:\n\n  - `--help` - Show help message and exit.\n  - `-k` or `--key-column` - Name of the primary key column\n  - `-t` or `--update-column` - Name of updated_at/last_updated column\n  - `-c` or `--columns` - List of names of extra columns to compare\n  - `-l` or `--limit` - Maximum number of differences to find (limits maximum bandwidth and runtime)\n  - `-s` or `--stats` - Print stats instead of a detailed diff\n  - `-d` or `--debug` - Print debug info\n  - `-v` or `--verbose` - Print extra info\n  - `-i` or `--interactive` - Confirm queries, implies `--debug`\n  - `--min-age` - Considers only rows older than specified.\n                  Example: `--min-age=5min` ignores rows from the last 5 minutes.\n                  Valid units: `d, days, h, hours, min, minutes, mon, months, s, seconds, w, weeks, y, years`\n  - `--max-age` - Considers only rows younger than specified. See `--min-age`.\n  - `--bisection-factor` - Segments per iteration. When set to 2, it performs binary search.\n  - `--bisection-threshold` - Minimal bisection threshold. i.e. maximum size of pages to diff locally.\n  - `-j` or `--threads` - Number of worker threads to use per database. Default=1.\n\n# Technical Explanation\n\nIn this section we\'ll be doing a walk-through of exactly how **data-diff**\nworks, and how to tune `--bisection-factor` and `--bisection-threshold`.\n\nLet\'s consider a scenario with an `orders` table with 1M rows. 
Fivetran is\nreplicating it continuously from Postgres to Snowflake:\n\n```\n┌─────────────┐                        ┌─────────────┐\n│  Postgres   │                        │  Snowflake  │\n├─────────────┤                        ├─────────────┤\n│             │                        │             │\n│             │                        │             │\n│             │  ┌─────────────┐       │ table with  │\n│ table with  ├──┤ replication ├──────▶│ ?maybe? all │\n│lots of rows!│  └─────────────┘       │  the same   │\n│             │                        │    rows.    │\n│             │                        │             │\n│             │                        │             │\n│             │                        │             │\n└─────────────┘                        └─────────────┘\n```\n\nIn order to check whether the two tables are the same, **data-diff** splits\nthe table into `--bisection-factor=10` segments.\n\nWe also have to choose which columns we want to checksum. In our case, we care\nabout the primary key, `--key-column=id` and the update column\n`--update-column=updated_at`. `updated_at` is updated every time the row is, and\nwe have an index on it.\n\n**data-diff** starts by querying both databases for the `min(id)` and `max(id)`\nof the table. 
Then it splits the table into `--bisection-factor=10` segments of\n`1M/10 = 100K` keys each:\n\n```\n┌──────────────────────┐              ┌──────────────────────┐\n│       Postgres       │              │      Snowflake       │\n├──────────────────────┤              ├──────────────────────┤\n│      id=1..100k      │              │      id=1..100k      │\n├──────────────────────┤              ├──────────────────────┤\n│    id=100k..200k     │              │    id=100k..200k     │\n├──────────────────────┤              ├──────────────────────┤\n│    id=200k..300k     ├─────────────▶│    id=200k..300k     │\n├──────────────────────┤              ├──────────────────────┤\n│    id=300k..400k     │              │    id=300k..400k     │\n├──────────────────────┤              ├──────────────────────┤\n│         ...          │              │         ...          │\n├──────────────────────┤              ├──────────────────────┤\n│       900k..1M       │              │       900k..1M       │\n└───────────────────▲──┘              └▲─────────────────────┘\n                    ┃                  ┃\n                    ┃                  ┃\n                    ┃ checksum queries ┃\n                    ┃                  ┃\n                  ┌─┻──────────────────┻────┐\n                  │        data-diff        │\n                  └─────────────────────────┘\n```\n\nNow **data-diff** will start running `--threads=1` queries in parallel that\nchecksum each segment. The queries for checksumming each segment will look\nsomething like this, depending on the database:\n\n```sql\nSELECT count(*),\n    sum(cast(conv(substring(md5(concat(cast(id as char), cast(timestamp as char))), 18), 16, 10) as unsigned))\nFROM `rating_del1`\nWHERE (id >= 1) AND (id < 100000)\n```\n\nThis keeps the amount of data that has to be transferred between the databases\nto a minimum, making it very performant! 
Additionally, if you have an index on\n`updated_at` (highly recommended) then the query will be fast as the database\nonly has to do a partial index scan between `id=1..100k`.\n\nIf you are not sure whether the queries are using an index, you can run it with\n`--interactive`. This puts **data-diff** in interactive mode where it shows an\n`EXPLAIN` before executing each query, requiring confirmation to proceed.\n\nAfter running the checksum queries on both sides, we see that all segments\nare the same except `id=100k..200k`:\n\n```\n┌──────────────────────┐              ┌──────────────────────┐\n│       Postgres       │              │      Snowflake       │\n├──────────────────────┤              ├──────────────────────┤\n│    checksum=0102     │              │    checksum=0102     │\n├──────────────────────┤   mismatch!  ├──────────────────────┤\n│    checksum=ffff     ◀──────────────▶    checksum=aaab     │\n├──────────────────────┤              ├──────────────────────┤\n│    checksum=abab     │              │    checksum=abab     │\n├──────────────────────┤              ├──────────────────────┤\n│    checksum=f0f0     │              │    checksum=f0f0     │\n├──────────────────────┤              ├──────────────────────┤\n│         ...          │              │         ...          
│\n├──────────────────────┤              ├──────────────────────┤\n│    checksum=9494     │              │    checksum=9494     │\n└──────────────────────┘              └──────────────────────┘\n```\n\nNow **data-diff** will do exactly as it just did for the _whole table_ for only\nthis segment: Split it into `--bisection-factor` segments.\n\nHowever, this time, because each segment has `100k/10=10k` entries, which is\nless than the `--bisection-threshold` it will pull down every row in the segment\nand compare them in memory in **data-diff**.\n\n```\n┌──────────────────────┐              ┌──────────────────────┐\n│       Postgres       │              │      Snowflake       │\n├──────────────────────┤              ├──────────────────────┤\n│    id=100k..110k     │              │    id=100k..110k     │\n├──────────────────────┤              ├──────────────────────┤\n│    id=110k..120k     │              │    id=110k..120k     │\n├──────────────────────┤              ├──────────────────────┤\n│    id=120k..130k     │              │    id=120k..130k     │\n├──────────────────────┤              ├──────────────────────┤\n│    id=130k..140k     │              │    id=130k..140k     │\n├──────────────────────┤              ├──────────────────────┤\n│         ...          │              │         ...          │\n├──────────────────────┤              ├──────────────────────┤\n│      190k..200k      │              │      190k..200k      │\n└──────────────────────┘              └──────────────────────┘\n```\n\nFinally **data-diff** will output the `(id, updated_at)` for each row that was different:\n\n```\n(122001, 1653672821)\n```\n\nIf you pass `--stats` you\'ll see e.g. what % of rows were different.\n\n## Performance Considerations\n\n* Ensure that you have indexes on the columns you are comparing. Preferably a\n  compound index. 
You can run with `--interactive` to see an `EXPLAIN` for the\n  queries.\n* Consider increasing the number of simultaneous threads executing\n  queries per database with `--threads`. For databases that limit concurrency\n  per query, e.g. Postgres/MySQL, this can improve performance dramatically.\n  This is how comparisons with **data-diff** can be faster than `count(*)` which\n  has limited concurrency, and in some cases will never complete due to\n  timeouts.\n* If you are only interested in _whether_ something changed, pass `--limit 1`.\n  This can be useful if changes are very rare. This is often faster than doing a\n  `count(*)`, for the reason mentioned above.\n* If the table is _very_ large, consider a larger `--bisection-factor`, as explained in\n  the [technical explanation][tech-explain]. Otherwise you may run into timeouts.\n* If there are a lot of changes, consider a larger `--bisection-threshold`,\n  as explained in the [technical explanation][tech-explain].\n* If there are very large gaps in your table, e.g. 10s of millions of\n  continuous rows missing, then **data-diff** may perform poorly doing lots of\n  queries for ranges of rows that do not exist (see [technical\n  explanation][tech-explain]). There are various things we could do to optimize\n  the algorithm for this case with complexity that has not yet been introduced;\n  please open an issue.\n* The fewer columns you verify (passed with `--columns`), the faster\n  **data-diff** will be. On one extreme you can verify every column, on the\n  other you can verify _only_ `updated_at`, if you trust it enough. You can also\n  _only_ verify `id` if you\'re interested in only presence, e.g. to detect\n  missing hard deletes. 
You can also do a hybrid where you verify\n  `updated_at` and the most critical value, e.g. a money value in `amount`, but\n  do not verify a large serialized column like `json_settings`.\n\n# Development Setup\n\nThe development setup centers around using `docker-compose` to boot up various\ndatabases, and then inserting data into them.\n\nOn Mac, for better Docker performance, we suggest enabling the following in the Docker UI:\n\n* Use new Virtualization Framework\n* Enable VirtioFS accelerated directory sharing\n\n**1. Install Data Diff**\n\nWhen developing/debugging, it\'s recommended to install dependencies and run it\ndirectly with `poetry` rather than go through the package.\n\n```\n$ brew install mysql postgresql # MacOS dependencies for C bindings\n$ apt-get install libpq-dev libmysqlclient-dev # Debian dependencies\n\n$ pip install poetry # Python dependency isolation tool\n$ poetry install # Install dependencies\n```\n**2. Start Databases**\n\n[Install **docker-compose**][docker-compose] if you haven\'t already.\n\n```shell-session\n$ docker-compose up -d mysql postgres # run mysql and postgres dbs in background\n```\n\n[docker-compose]: https://docs.docker.com/compose/install/\n\n**3. Run Unit Tests**\n\n```shell-session\n$ poetry run python3 -m unittest\n```\n\n**4. 
Seed the Database(s)**\n\nFirst, download the CSVs of seeding data:\n\n```shell-session\n$ curl https://datafold-public.s3.us-west-2.amazonaws.com/1m.csv -o dev/ratings.csv\n\n# For a larger data-set (but takes 25x longer to import):\n# - curl https://datafold-public.s3.us-west-2.amazonaws.com/25m.csv -o dev/ratings.csv\n```\n\nNow you can insert it into the testing database(s):\n\n```shell-session\n# It\'s optional to seed more than one to run data-diff(1) against.\n$ preql -f dev/prepare_db.pql mysql://mysql:Password1@127.0.0.1:3306/mysql\n$ preql -f dev/prepare_db.pql postgres://postgres:Password1@127.0.0.1:5432/postgres\n\n# Cloud databases\n$ preql -f dev/prepare_db.pql snowflake://<uri>\n$ preql -f dev/prepare_db.pql mssql://<uri>\n$ preql -f dev/prepare_db_bigquery.pql bigquery:///<project> # Bigquery has its own scripts\n```\n\n**5. Run data-diff against the seeded database**\n\n```bash\npoetry run python3 -m data_diff postgres://user:password@host:db Rating mysql://user:password@host:db Rating_del1 -c timestamp --stats\n\nDiff-Total: 250156 changed rows out of 25000095\nDiff-Percent: 1.0006%\nDiff-Split: +250156  -0\n```\n\n# License\n\n[MIT License](https://github.com/datafold/data-diff/blob/master/LICENSE)\n\n[dbs]: #supported-databases\n[tech-explain]: #technical-explanation\n[perf]: #performance-considerations\n',
    'author': 'Erez Shinnan',
    'author_email': 'erezshin@gmail.com',
    'maintainer': None,
    'maintainer_email': None,
    'url': 'https://github.com/datafold/data-diff',
    'packages': packages,
    'package_data': package_data,
    'install_requires': install_requires,
    'extras_require': extras_require,
    'entry_points': entry_points,
    'python_requires': '>=3.7,<4.0',
}


setup(**setup_kwargs)
