Metadata-Version: 2.1
Name: json-stream
Version: 1.0.0
Summary: Streaming JSON decoder
Home-page: https://github.com/daggaz/json-stream
Author: Jamie Cockburn
Author-email: jamie_cockburn@hotmail.co.uk
License: UNKNOWN
Project-URL: Bug Reports, https://github.com/daggaz/json-stream/issues
Project-URL: Source, https://github.com/daggaz/json-stream/
Description: # json-stream
        
        Simple streaming JSON parser.
        
        `json-stream` is a JSON parser just like the standard library's
         [`json.load()`](https://docs.python.org/3/library/json.html#json.load). It 
         will read a JSON document and convert it into native python types.
        
        ```python
        import json_stream
        data = json_stream.load(f)
        ```
        
        Features:
        * stream all JSON data types (objects or lists)
        * stream nested data
        * simple pythonic `list`-like/`dict`-like interface
        * stream truncated or malformed JSON data (up to the first error)
        * pure python
        * no dependencies
        
        Unlike `json.load()`, `json-stream` can _stream_ JSON data from a file-like
        object. This has the following benefits:
        
        * It does not require the whole json document to be into memory up-front
        * It can start producing data before the entire document has finished loading
        * It only requires enough memory to hold the data currently being parsed
        
        ## What are the problems with standard `json.load()`?
        
        The problem with the `json.load()` stem from the fact that it must read
        the whole JSON document into memory before parsing it.
        
        ### Memory usage
        
        `json.load()` first reads the whole document into memory as a string. It
        then starts parsing that string and converting the whole document into python
        types again stored in memory. For a very large document, this could be more
        memory than you have available to your system.
        
        `json_stream.load()` does not read the whole document into memory, it only
        buffers enough from the stream to produce the next item of data.
        
        Additionally, in transient mode (see below) `json-stream` also doesn't store 
        up all of the parsed data in memory.
        
        ### Latency
        
        `json.load()` produces all the data after parsing the whole document. If you
        only care about the first 10 items in a list of 2 million items, then you
        have wait until all 2 million items have been parsed first.
        
        `json_stream.load()` produces data as soon as it is available in the stream.
        
        ## Usage
        
        ### `json_stream.load()`
        
        `json_stream.load()` has two modes of operation, controlled by
        the `persistent` argument (default false).
        
        #### Transient mode (default)
        
        This mode is appropriate if you can consume the data iteratively. You cannot 
        move backwards through the stream to read data that has already been skipped
        over. It is the mode you **must** use if you want process large amounts of
        JSON data without consuming large amounts of memory required.
        
        In transient mode, only the data currently being read is stored in memory. Any
        data previously read from the stream is discarded (it's up to you what to do 
        with it) and attempting to access this data results in a
        `TransientAccessException`.
        
        ```python
        import json_stream
        
        # JSON: {"count": 3, "results": ["a", "b", "c"]}
        data = json_stream.load(f)  # data is a transient dict-like object 
        # stream has been read up to "{"
        
        # use data like a dict
        results = data["results"]  # results is a transient list-like object
        # stream has been read up to "[", we now cannot read "count"
        
        # iterate transient list
        for result in results:
            print(result)  # prints a, b, c
        # stream has been read up to "]"
        
        # attempt to read "count" from earlier in stream
        count = data["count"]  # will raise exception
        # stream is now exhausted
        
        # attempt to read from list that has already be iterated
        for result in results:  # will raise exception
            pass
        ```
        
        #### Persistent mode
        
        In persistent mode all previously read data is stored in memory as
        it is parsed. The returned `dict`-like or `list`-like objects
        can be used just like normal data structures.
        
        If you request an index or key that has already been read from the stream
        then it is retrieved from memory. If you request an index or key that has
        not yet been read from the stream, then the request blocks until that item
        is found in the stream.
        
        ```python
        import json_stream
        
        # JSON: {"count": 1, "results": ["a", "b", "c"]}
        data = json_stream.load(f, persistent=True)
        # data is a streaming  dict-like object 
        # stream has been read up to "{"
        
        # use data like a dict
        results = data["results"]  # results is a streaming list-like object
        # stream has been read up to "["
        # count has been stored data
        
        # use results like a list
        a_result = results[1]  # a_result = "b"
        # stream has been read up to the middle of list
        # "a" and "b" have been stored in results
        
        # read earlier data from memory
        count = data["count"]  # count = 1
        
        # consume rest of list
        results.read_all()
        # stream has been read up to "}"
        # "c" is now stored in results too
        # results.is_streaming() == False
        
        # consume everything
        data.read_all()
        # stream is now exhausted
        # data.is_streaming() == False
        ```
        
        Persistent mode is not appropriate if you care about memory consumption, but
        provides an identical experience compared to `json.load()`.
        
        ### visitor pattern
        
        You can also parse using a visitor-style approach where a function you supply
        is called for each data item as it is parsed (depth-first).
        
        This uses a transient parser under the hood, so does not consume memory for
        the whole document.
        
        ```python
        import json_stream
        
        # JSON: {"x": 1, "y": {}, "xxxx": [1,2, {"yyyy": 1}, "z", 1, []]}
        
        def visitor(item, path):
            print(f"{item} at path {path}")
        
        json_stream.visit(f, visitor)
        ```
        
        Output:
        ```
        1 at path ('x',)
        {} at path ('y',)
        1 at path ('xxxx', 0)
        2 at path ('xxxx', 1)
        1 at path ('xxxx', 2, 'yyyy')
        z at path ('xxxx', 3)
        1 at path ('xxxx', 4)
        [] at path ('xxxx', 5)
        ```
        
        ### Stream a URL
        
        #### urllib
        
        ```python
        import urllib.request
        import json_stream
        
        with urllib.request.urlopen('http://example.com/data.json') as response:
            data = json_stream.load(response)
        ```
        
        #### requests
        
        ```python
        import requests
        import json_stream.requests
        
        with requests.get('http://example.com/data.json', stream=True) as response:
            data = json_stream.requests.load(response)
        ```
        
        # Future improvements
        
        * Allow long strings in the JSON to be read as streams themselves
        * Allow transient mode on seekable streams to seek to data earlier in
        the stream instead of raising a `TransientAccessException`
        * A more efficient tokenizer?
        
        # Alternatives
        
        ## NAYA
        
        [NAYA](https://github.com/danielyule/naya) is a pure python JSON parser for
        parsing a simple JSON list as a stream.
        
        ### Why not NAYA?
        
        * It can only stream JSON containing a top-level list 
        * It does not provide a pythonic `dict`/`list`-like interface 
        
        ## Yajl-Py
        
        [Yajl-Py]() is a wrapper around the Yajl JSON library that can be used to 
        generate SAX style events while parsing JSON.
        
        ### Why not Yajl-Py?
        
        * It's not pure python
        * It does not provide a pythonic `dict`/`list`-like interface 
        
        # Acknowledgements
        
        The JSON tokenizer used in the project was taken from the
        [NAYA](https://github.com/danielyule/naya) project.
         
Keywords: sample,setuptools,development
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Libraries
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.5
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3 :: Only
Requires-Python: >=3.5, <4
Description-Content-Type: text/markdown
