# Dora
**Repository Path**: lanicon/Dora
## Basic Information
- **Project Name**: Dora
- **Description**: Tools for exploratory data analysis in Python
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-08-21
- **Last Updated**: 2024-11-26
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# Dora
Exploratory data analysis toolkit for Python.
## Contents
- [Summary](#summary)
- [Setup](#setup)
- [Usage](#use)
- [Reading Data & Configuration](#config)
- [Cleaning](#clean)
- [Feature Selection & Extraction](#feature)
- [Visualization](#visual)
- [Model Validation](#model)
- [Data Versioning](#version)
- [Testing](#test)
- [Contribute](#contribute)
- [License](#license)
## Summary
Dora is a Python library designed to automate the painful parts of exploratory data analysis.
The library contains convenience functions for data cleaning, feature selection & extraction, visualization, partitioning data for model validation, and versioning transformations of data.
The library uses and is intended to be a helpful addition to common Python data analysis tools such as pandas, scikit-learn, and matplotlib.
## Setup
```
$ pip3 install Dora
$ python3
>>> from Dora import Dora
```
## Usage
#### Reading Data & Configuration
```python
# without initial config
>>> dora = Dora()
>>> dora.configure(output = 'A', data = 'path/to/data.csv')
# is the same as
>>> import pandas as pd
>>> dataframe = pd.read_csv('path/to/data.csv')
>>> dora = Dora(output = 'A', data = dataframe)
>>> dora.data
A B C D useless_feature
0 1 2 0 left 1
1 4 NaN 1 right 1
2 7 8 2 left 1
```
#### Cleaning
```python
# read data with missing and poorly scaled values
>>> import pandas as pd
>>> df = pd.DataFrame([
... [1, 2, 100],
... [2, None, 200],
... [1, 6, None]
... ])
>>> dora = Dora(output = 0, data = df)
>>> dora.data
0 1 2
0 1 2 100
1 2 NaN 200
2 1 6 NaN
# impute the missing values (using the average of each column)
>>> dora.impute_missing_values()
>>> dora.data
0 1 2
0 1 2 100
1 2 4 200
2 1 6 150
# scale the values of the input variables (center to mean and scale to unit variance)
>>> dora.scale_input_values()
>>> dora.data
0 1 2
0 1 -1.224745 -1.224745
1 2 0.000000 1.224745
2 1 1.224745 0.000000
```
#### Feature Selection & Extraction
```python
# feature selection / removing a feature
>>> dora.data
A B C D useless_feature
0 1 2 0 left 1
1 4 NaN 1 right 1
2 7 8 2 left 1
>>> dora.remove_feature('useless_feature')
>>> dora.data
A B C D
0 1 2 0 left
1 4 NaN 1 right
2 7 8 2 left
# extract an ordinal feature through one-hot encoding
>>> dora.extract_ordinal_feature('D')
>>> dora.data
A B C D=left D=right
0 1 2 0 1 0
1 4 NaN 1 0 1
2 7 8 2 1 0
# extract a transformation of another feature
>>> dora.extract_feature('C', 'twoC', lambda x: x * 2)
>>> dora.data
A B C D=left D=right twoC
0 1 2 0 1 0 0
1 4 NaN 1 0 1 2
2 7 8 2 1 0 4
```
#### Visualization
```python
# plot a single feature against the output variable
dora.plot_feature('column-name')
# render plots of each feature against the output variable
dora.explore()
```
#### Model Validation
```python
# create random partition of training / validation data (~ 80/20 split)
dora.set_training_and_validation()
# train a model on the data
X = dora.training_data[dora.input_columns()]
y = dora.training_data[dora.output]
some_model.fit(X, y)
# validate the model
X = dora.validation_data[dora.input_columns()]
y = dora.validation_data[dora.output]
some_model.score(X, y)
```
#### Data Versioning
```python
# save a version of your data
>>> dora.data
A B C D useless_feature
0 1 2 0 left 1
1 4 NaN 1 right 1
2 7 8 2 left 1
>>> dora.snapshot('initial_data')
# keep track of changes to data
>>> dora.remove_feature('useless_feature')
>>> dora.extract_ordinal_feature('D')
>>> dora.impute_missing_values()
>>> dora.scale_input_values()
>>> dora.data
A B C D=left D=right
0 1 -1.224745 -1.224745 0.707107 -0.707107
1 4 0.000000 0.000000 -1.414214 1.414214
2 7 1.224745 1.224745 0.707107 -0.707107
>>> dora.logs
["self.remove_feature('useless_feature')", "self.extract_ordinal_feature('D')", 'self.impute_missing_values()', 'self.scale_input_values()']
# use a previous version of the data
>>> dora.snapshot('transform1')
>>> dora.use_snapshot('initial_data')
>>> dora.data
A B C D useless_feature
0 1 2 0 left 1
1 4 NaN 1 right 1
2 7 8 2 left 1
>>> dora.logs
[]
# switch back to your transformation
>>> dora.use_snapshot('transform1')
>>> dora.data
A B C D=left D=right
0 1 -1.224745 -1.224745 0.707107 -0.707107
1 4 0.000000 0.000000 -1.414214 1.414214
2 7 1.224745 1.224745 0.707107 -0.707107
>>> dora.logs
["self.remove_feature('useless_feature')", "self.extract_ordinal_feature('D')", 'self.impute_missing_values()', 'self.scale_input_values()']
```
## Testing
To run the test suite, simply run `python3 spec.py` from the `Dora` directory.
## Contribute
Pull requests welcome! Feature requests / bugs will be addressed through issues on this repository. While not every feature request will necessarily be handled by me, maintaining a record for interested contributors is useful.
Additionally, feel free to submit pull requests which add features or address bugs yourself.
## License
**The MIT License (MIT)**
> Copyright (c) 2016 Nathan Epstein
>
> Permission is hereby granted, free of charge, to any person obtaining a copy
> of this software and associated documentation files (the "Software"), to deal
> in the Software without restriction, including without limitation the rights
> to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
> copies of the Software, and to permit persons to whom the Software is
> furnished to do so, subject to the following conditions:
>
> The above copyright notice and this permission notice shall be included in
> all copies or substantial portions of the Software.
>
> THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
> AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
> THE SOFTWARE.