Metadata-Version: 2.1
Name: dgraphpandas
Version: 0.0.6
Summary: Transform Pandas DataFrames into Exports to be sent to DGraph
Home-page: https://github.com/kiran94/dgraphpandas
License: MIT
Description: # dgraphpandas
        
        [![Python Build](https://github.com/kiran94/dgraphpandas/actions/workflows/python-package.yml/badge.svg)](https://github.com/kiran94/dgraphpandas/actions/workflows/python-package.yml) ![PyPI](https://img.shields.io/pypi/v/dgraphpandas?color=blue&style=flat-square) [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
        
        A library (with an accompanying CLI tool) to transform [Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/index.html#user-guide) DataFrames into exports ([RDF](https://en.wikipedia.org/wiki/Resource_Description_Framework)) to be sent to the [DGraph Live Loader](https://dgraph.io/docs/deploy/fast-data-loading/live-loader/).
        
        - [dgraphpandas](#dgraphpandas)
          - [Usage](#usage)
            - [Command Line](#command-line)
            - [Module](#module)
            - [Working with Larger Files](#working-with-larger-files)
              - [Command Line](#command-line-1)
              - [Module](#module-1)
          - [Configuration](#configuration)
            - [Additional Configuration](#additional-configuration)
          - [Samples](#samples)
          - [Local Setup](#local-setup)
        
        ## Usage
        
        ```sh
        python -m pip install dgraphpandas
        ```
        
        ### Command Line
        
        This is a real example which you can find in the samples folder and run from the root of this repository.
        
        ```sh
        python -m dgraphpandas \
          --config samples/planets/dgraphpandas.json \
          --config_file_key planet \
          --file samples/planets/solar_system.csv \
          --output samples/planets/output
        ```
        
        ### Module
        
        ```py
        from dgraphpandas.strategies.horizontal import horizontal_transform
        from dgraphpandas.strategies.vertical import vertical_transform
        from dgraphpandas.writers.upserts import generate_upserts
        
        # Define a Configuration for your data file(s). Explained further in the Configuration section.
        config = {
          "transform": "horizontal",
          "files": {
            "planet": {
              "subject_fields": ["id"],
              "edge_fields": ["type"],
              "type_overrides": {
                "order_from_sun": "int32",
                "diameter_earth_relative": "float32",
                "diameter_km": "float32",
                "mass_earth_relative": "float32",
                "mean_distance_from_sun_au": "float32",
                "orbital_period_years": "float32",
                "orbital_eccentricity": "float32",
                "mean_orbital_velocity_km_sec": "float32",
                "rotation_period_days": "float32",
                "inclination_axis_degrees": "float32",
                "mean_temperature_surface_c": "float32",
                "gravity_equator_earth_relative": "float32",
                "escape_velocity_km_sec": "float32",
                "mean_density": "float32",
                "number_moons": "int32",
                "rings": "bool"
              },
              "ignore_fields": ["image", "parent"]
            }
          }
        }
        
        # Perform a Horizontal Transform on the passed file using the config/key
        intrinsic, edges = horizontal_transform('solar_system.csv', config, "planet")
        
        # Generate RDF Upsert statements
        intrinsic_upserts, edges_upserts = generate_upserts(intrinsic, edges)
        
        # Do something with these statements, e.g. write to zip and ship to DGraph
        # The cli will zip this output automatically
        print(intrinsic_upserts)
        print(edges_upserts)
        ```
        
        ### Working with Larger Files
        
        If you have very large input files, it may make sense to break them up into smaller ones to reduce the likelihood of memory issues.
        
        The cli tool can break exports up into chunks for you. If you are using the module directly, the example below shows how to use pandas to break up your file.
        
        #### Command Line
        
        In the CLI you have the `chunk_size` parameter to determine an upper limit for your files.
        
        ```sh
        python -m dgraphpandas \
          -c samples/netflix/dgraphpandas.json \
          -ck title -f samples/netflix/input/netflix_titles.csv \
          -o samples/netflix/output \
          --chunk_size 1000
        ```
        
        When you pass this, only `chunk_size` lines will be pushed through the RDF generation logic at a time and the output will be indexed per chunk. For example:
        
        ```sh
        ❯ ls -la samples/netflix/output/
        total 12M
        drwxr-xr-x 2 kiran kiran 4.0K Apr  4 18:13 .
        drwxr-xr-x 6 kiran kiran 4.0K Apr  4 16:45 ..
        -rw-r--r-- 1 kiran kiran 706K Apr  4 18:13 netflix_titles_edges.gz
        -rw-r--r-- 1 kiran kiran 706K Apr  4 18:13 netflix_titles_edges_2.gz
        -rw-r--r-- 1 kiran kiran 706K Apr  4 18:13 netflix_titles_edges_3.gz
        -rw-r--r-- 1 kiran kiran 706K Apr  4 18:13 netflix_titles_edges_4.gz
        -rw-r--r-- 1 kiran kiran 706K Apr  4 18:13 netflix_titles_edges_5.gz
        -rw-r--r-- 1 kiran kiran 706K Apr  4 18:13 netflix_titles_edges_6.gz
        -rw-r--r-- 1 kiran kiran 706K Apr  4 18:13 netflix_titles_edges_7.gz
        -rw-r--r-- 1 kiran kiran 706K Apr  4 18:13 netflix_titles_edges_8.gz
        -rw-r--r-- 1 kiran kiran 701K Apr  4 18:13 netflix_titles_intrinsic.gz
        -rw-r--r-- 1 kiran kiran 701K Apr  4 18:13 netflix_titles_intrinsic_2.gz
        -rw-r--r-- 1 kiran kiran 701K Apr  4 18:13 netflix_titles_intrinsic_3.gz
        -rw-r--r-- 1 kiran kiran 701K Apr  4 18:13 netflix_titles_intrinsic_4.gz
        -rw-r--r-- 1 kiran kiran 701K Apr  4 18:13 netflix_titles_intrinsic_5.gz
        -rw-r--r-- 1 kiran kiran 701K Apr  4 18:13 netflix_titles_intrinsic_6.gz
        -rw-r--r-- 1 kiran kiran 701K Apr  4 18:13 netflix_titles_intrinsic_7.gz
        -rw-r--r-- 1 kiran kiran 701K Apr  4 18:13 netflix_titles_intrinsic_8.gz
        ```
        
        You can then take these exports and live load them as normal.
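        
        Before live loading, it can be useful to check what each chunk contains. The following is a minimal sketch using only the standard library. It assumes the `.gz` files are gzip-compressed text; `peek_exports` is a hypothetical helper and not part of dgraphpandas.
        
        ```python
        import gzip
        from pathlib import Path
        
        
        def peek_exports(output_dir, pattern="*.gz"):
            """Return the first line of every gzip export in output_dir, keyed by file name."""
            first_lines = {}
            for path in sorted(Path(output_dir).glob(pattern)):
                # Open in text mode so the statements come back as str rather than bytes
                with gzip.open(path, "rt", encoding="utf-8") as handle:
                    first_lines[path.name] = handle.readline().rstrip("\n")
            return first_lines
        ```
        
        Calling `peek_exports("samples/netflix/output")` would then return one entry per chunk file.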
        
        #### Module
        
        When you are using the module directly, the transform methods also accept a `DataFrame`, so you can chunk the input yourself before passing it in.
        
        For example:
        
        ```py
        import pandas as pd
        
        from dgraphpandas.strategies.horizontal import horizontal_transform
        from dgraphpandas.writers.upserts import generate_upserts
        
        # dgraphpandas_config is your configuration dict (see the Configuration section)
        # Each chunk is only loaded into memory when its loop iteration is reached.
        for index, frame in enumerate(pd.read_csv('your_input.csv', chunksize=1000)):
        
          # Generate for this Chunk
          intrinsic, edges = horizontal_transform(frame, dgraphpandas_config, 'your_input_key')
        
          # Generate Rdf Upserts for this Chunk
          intrinsic_upserts, edges_upserts = generate_upserts(intrinsic, edges)
        
          # Then you can do whatever you want with these
        ```
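        
        To see the chunking behaviour in isolation, here is a small sketch that only uses pandas. The in-memory CSV is made-up data standing in for `your_input.csv`, and the transform call is left out so the snippet runs without a configuration.
        
        ```python
        import io
        
        import pandas as pd
        
        # Hypothetical in-memory CSV standing in for 'your_input.csv'
        csv_text = "id,name\n1,mercury\n2,venus\n3,earth\n4,mars\n5,jupiter\n"
        
        # With chunksize set, read_csv returns an iterator of DataFrames
        # rather than a single DataFrame
        chunk_sizes = []
        for index, frame in enumerate(pd.read_csv(io.StringIO(csv_text), chunksize=2)):
            # Each frame could be passed to horizontal_transform at this point
            chunk_sizes.append(len(frame))
        
        print(chunk_sizes)  # [2, 2, 1]
        ```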
        
        ## Configuration
        
        A Configuration file influences how we transform a DataFrame. It consists of:
        
        - Global configuration options
          - Options which will be applied to all files
          - These can either be defined in the configuration or as `kwargs` in the transform method.
          - A collection of `files`
        
        - File configuration options
          - Options which will be applied only to this entry
          - `subject_fields` is required so the unique identifier for a row in the DataFrame can be found
          - `edge_fields` are optional and if provided will generate edge output
          - `type_overrides` are optional but recommended to ensure the correct type is attached in RDF
        
        *If you are running this with the module and passing options via `kwargs`, then these options may also be a callable (for example a lambda) which takes the DataFrame. For example, if you didn't want to hard code all your edge fields and were following a convention that all edge fields have the suffix `_id`, then you could set `edge_fields` to `lambda frame: frame.loc[frame['predicate'].str.endswith('_id'), 'predicate'].unique().tolist()`*
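        
        The lambda above can be tried on its own with plain pandas. This is a sketch only: the frame is made-up data, and its subject/predicate/object layout is an assumption inferred from the `predicate` column the lambda reads.
        
        ```python
        import pandas as pd
        
        # Hypothetical frame; the column layout is an assumption
        frame = pd.DataFrame({
            "subject": [1, 1, 1, 2, 2, 2],
            "predicate": ["name", "school_id", "class_id", "name", "school_id", "class_id"],
            "object": ["ann", 10, 7, "bob", 10, 8],
        })
        
        # Select every distinct predicate which ends with '_id'
        edge_fields = lambda frame: frame.loc[
            frame["predicate"].str.endswith("_id"), "predicate"
        ].unique().tolist()
        
        print(edge_fields(frame))  # ['school_id', 'class_id']
        ```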
        
        ```py
        config = {
          "transform": "horizontal",
          "files": {
            "planet": {
              "subject_fields": ["id"],
              "edge_fields": ["type"],
              "type_overrides": {
                "order_from_sun": "int32",
                "diameter_earth_relative": "float32",
                "diameter_km": "float32",
                "mass_earth_relative": "float32",
                "mean_distance_from_sun_au": "float32",
                "orbital_period_years": "float32",
                "orbital_eccentricity": "float32",
                "mean_orbital_velocity_km_sec": "float32",
                "rotation_period_days": "float32",
                "inclination_axis_degrees": "float32",
                "mean_temperature_surface_c": "float32",
                "gravity_equator_earth_relative": "float32",
                "escape_velocity_km_sec": "float32",
                "mean_density": "float32",
                "number_moons": "int32",
                "rings": "bool"
              },
              "ignore_fields": ["image", "parent"]
            }
          }
        }
        ```
        
        ### Additional Configuration
        
        **Global Level**
        
        These options can be placed on the root of the config or passed as `kwargs` directly.
        
        - `add_dgraph_type_records`
            - DGraph has a special field called `dgraph.type`, which can be used to query via the `type()`
            function. If `add_dgraph_type_records` is enabled, then we add `dgraph.type` fields
            to the current frame.
        - `strip_id_from_edge_names`
            - It's common for a data set to have a reference to another 'table' using an `_id` convention
            - For example, if you had a Student & School then it might make more sense to have (Student) - school -> (School) rather than having an `_id` in the predicate.
        - `drop_na_intrinsic_objects`
            - Automatically drop intrinsic records where the object is NA. In a relational model you might have a column with a null entry, whereas in a graph model you may want to omit the attribute altogether
        - `drop_na_edge_objects`
            - Same as `drop_na_intrinsic_objects` but for edges
        - `key_separator`
            - Separator used to combine key fields. For example if the key separator was `_` and we were operating on an intrinsic attribute for a customer with id 1 then the `xid` would be `customer_1`
        - `illegal_characters`
            - Characters to strip from intrinsic and edge subjects. If the unique identifier has characters not supported by RDF/DGraph, they are stripped away; otherwise the statements would not be accepted by live loading.
        - `illegal_characters_intrinsic_object`
            - Same as `illegal_characters` but for the object on intrinsic fields. These have a different set of illegal characters because objects on intrinsic records are actual data values and are quoted. They can therefore accept many more characters than the subject.
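        
        To make a few of these options concrete, here is an illustrative sketch in plain Python. It is not the library's implementation: `build_xid` and `strip_id_suffix` are hypothetical helpers, and the default illegal characters are made up for the example.
        
        ```python
        def build_xid(node_type, identifier, key_separator="_", illegal_characters=(" ", "%", "\\")):
            """Join the node type and identifier, then strip unwanted characters."""
            xid = f"{node_type}{key_separator}{identifier}"
            for character in illegal_characters:
                xid = xid.replace(character, "")
            return xid
        
        
        def strip_id_suffix(edge_name):
            """Turn an edge name such as 'school_id' into 'school'."""
            return edge_name[: -len("_id")] if edge_name.endswith("_id") else edge_name
        
        
        print(build_xid("customer", 1))        # customer_1
        print(build_xid("customer", "a b%1"))  # customer_ab1
        print(strip_id_suffix("school_id"))    # school
        ```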
        
        **File Level**
        - `type_overrides`
            - Recommended. This ensures that values are treated as the intended type and that the output RDF has the correct type mapped into it. Without this, fields will go under the default RDF type `<xs:string>`, but you may want a field to be a true int in RDF.
            - Additionally certain data types such as `datetime64` will activate special handling to ensure the output in RDF is within the correct format to be ingested into DGraph.
            - Supported Types can be found [here](https://github.com/kiran94/dgraphpandas/blob/main/dgraphpandas/types.py)
        - `csv_edges`
            - Sometimes a vendor will provide a data file where a single column is actually a csv list and each csv value should be broken into multiple RDF statements (because they relate to independent entities). Adding that column into this list will do that.
            - For example, in the [Netflix sample / title file](https://github.com/kiran94/dgraphpandas/blob/main/samples/netflix/dgraphpandas.json) we have a `cast` column where the values are `actor_1, actor2`. Adding `cast` to `csv_edges` will ensure that the movie has 2 different relationships, one for each cast member.
        - `ignore_fields`
            - Add fields in the input that we don't care about to this list so they aren't present in the output
        - `override_edge_name`
            - Ensure that the edge has a different predicate and/or target_node_type to what is defined in the file.
            - For example, in the [Pokemon sample / pokemon_species](https://github.com/kiran94/dgraphpandas/blob/main/samples/pokemon/dgraphpandas.json) file you will see a column called `evolves_from_species` which tells us, for a given pokemon, which other pokemon it evolves from. If we were to use the raw data here we would get an `evolves_from_species` edge with an incorrect target xid. Instead we want to override the `target_node_type` to `pokemon` so the edge correctly loops back to a node of the same type.
        - `pre_rename`
            - Rename intrinsic predicates or edge names to something else
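        
        The ideas behind `type_overrides` and `csv_edges` can be sketched with plain pandas. This illustrates the concepts only and is not how dgraphpandas implements them; the rows are made-up data loosely modelled on the Netflix sample.
        
        ```python
        import pandas as pd
        
        # Hypothetical rows; column names are assumptions for this example
        frame = pd.DataFrame({
            "show_id": ["s1", "s2"],
            "release_year": ["2019", "2020"],
            "cast": ["actor_1, actor_2", "actor_3"],
        })
        
        # type_overrides idea: cast a column so it is no longer treated as a string
        frame = frame.astype({"release_year": "int32"})
        
        # csv_edges idea: split the comma separated column and emit one row per value
        frame["cast"] = frame["cast"].str.split(",")
        exploded = frame.explode("cast").reset_index(drop=True)
        exploded["cast"] = exploded["cast"].str.strip()
        
        print(exploded[["show_id", "cast"]].values.tolist())
        # [['s1', 'actor_1'], ['s1', 'actor_2'], ['s2', 'actor_3']]
        ```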
        
        ## Samples
        
        Samples can be found [here](https://github.com/kiran94/dgraphpandas/tree/main/samples). They follow a convention where the download script can be found within the `input` directory and the config, generate_upsert and publish scripts can be found at the root of each respective sample.
        
        There are also [Jupyter Notebooks](https://github.com/kiran94/dgraphpandas/tree/main/samples/notebooks) which should show step by step examples.
        
        ## Local Setup
        
        These steps assume you have already cloned the repo and have a terminal open in the root of the project.
        
        ```sh
        # Create Virtual Environment and Activate it
        conda create -n dgraphpandas python=3.6 # or venv
        conda activate dgraphpandas
        
        # Restore packages
        python -m pip install -r requirements-dev.txt
        python -m pip install -r requirements.txt
        
        # Run Flake
        flake8 --count .
        
        # Run Tests
        python -m unittest
        
        # Create & Run DGraph
        docker-compose up
        
        # Try a Sample
        # See Sample section for more details
        # It should help getting some data,
        # generating rdf and publishing to your
        # local DGraph
        
        # Install a Local Copy of the Library
        python -m pip install -e .
        
        # Remember to Uninstall once ready
        python -m pip uninstall dgraphpandas -y
        ```
        
Keywords: dgraph,pandas,rdf
Platform: UNKNOWN
Requires-Python: >=3.6
Description-Content-Type: text/markdown
