Metadata-Version: 2.1
Name: code_tokenize
Version: 0.0.1.post1
Summary: Fast program tokenization and structural analysis in Python
Home-page: https://github.com/cedricrupb/code_tokenize
Author: Cedric Richter
Author-email: cedricr.upb@gmail.com
License: apache-2.0
Download-URL: https://github.com/cedricrupb/code_tokenize/archive/refs/tags/v0.0.1.tar.gz
Keywords: code,tokenization,tokenize,program,language processing
Platform: UNKNOWN
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Build Tools
Classifier: License :: OSI Approved :: Apache Software License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Description-Content-Type: text/markdown
License-File: LICENSE

<p align="center">
  <img height="150" src="https://github.com/cedricrupb/ptokenizers/raw/main/resources/code_tokenize.svg" />
</p>

------------------------------------------------

Programming Language Processing (PLP) brings the capabilities of modern NLP systems to the world of programming languages.
To achieve high performance, existing PLP methods often take advantage of the fully defined nature of programming languages. In particular, the syntactic structure can be exploited to gain knowledge about programs.

Code(dot)tokenize provides easy access to the syntactic structure of a program. The tokenizer converts a program into a sequence of program tokens ready for further end-to-end processing.
By relating each token to an AST node, it is possible to extend the program representation easily with further syntactic information.
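
To illustrate the idea of relating tokens to AST nodes, here is a minimal sketch that uses only the Python standard library (`tokenize` and `ast`). It is not how code(dot)tokenize is implemented (the library relies on Tree-Sitter), and the helper names `innermost_node` and `tokens_with_nodes` are hypothetical. The sketch assumes Python 3.8+ (for `end_lineno` and `end_col_offset`) and ASCII-only source text.

```python
import ast
import io
import tokenize

SOURCE = 'def my_func():\n    print("Hello World")\n'


def innermost_node(node, pos):
    """Return the deepest AST node whose source span contains pos, or None."""
    # Look into the children first, so the most specific node wins.
    for child in ast.iter_child_nodes(node):
        found = innermost_node(child, pos)
        if found is not None:
            return found
    # Some nodes (e.g. Module, arguments, Load) carry no position information.
    start_line = getattr(node, "lineno", None)
    end_line = getattr(node, "end_lineno", None)
    if start_line is None or end_line is None:
        return None
    # Assumption: ASCII source, so byte offsets equal character columns.
    start = (start_line, node.col_offset)
    end = (end_line, node.end_col_offset)
    return node if start <= pos < end else None


def tokens_with_nodes(source):
    """Pair each token's text with the type name of its innermost AST node."""
    tree = ast.parse(source)
    keep = (tokenize.NAME, tokenize.OP, tokenize.STRING, tokenize.NUMBER)
    pairs = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type not in keep:
            continue  # skip NEWLINE, INDENT, DEDENT, ENDMARKER, ...
        node = innermost_node(tree, tok.start)
        pairs.append((tok.string, type(node).__name__ if node is not None else None))
    return pairs


for text, node_type in tokens_with_nodes(SOURCE):
    print(text, node_type)
```

In this sketch, the token `print` is related to a `Name` node and the string literal to a `Constant` node, while punctuation falls back to the enclosing `Call` or `FunctionDef` node. Once such a link exists, further syntactic information (for example, the parent node type) can be attached to each token.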

## Installation
The package is currently only tested under Python 3. It can be installed via:
```
pip install code-tokenize
```


## Library highlights
Whether you are looking for a fast multilingual program tokenizer or want to start your next PLP project, here are some reasons why you should build upon code(dot)tokenize:

* **Easy to use** All it takes to tokenize your code is a single function call:
```
import code_tokenize as ctok

ctok.tokenize(
    '''
        def my_func():
            print("Hello World")
    ''',
    lang="python",
)
```

* **Most programming languages supported** Since all our tokenizers are backed by [Tree-Sitter](https://tree-sitter.github.io/tree-sitter/), we support a long list of programming languages, including popular ones such as Python, Java, and JavaScript.


## Roadmap
code(dot)tokenize is currently under active development. To support a wide range of PLP methods, the following features are planned for future versions:

- **Token tagging** Automatically identify certain token types, including variable usages, definitions, and type usages.

- **Syntactic relations** Automatically identify syntactic relations between tokens, such as read and write relations or structural dependencies.

- **Basic CFG analysis** Automatically identify statement heads that are connected via control flow.


