# GRASP

**Repository Path**: Leon_02/GRASP

## Basic Information

- **Project Name**: GRASP
- **Description**: Accurate Prediction of Genome-wide RNA Secondary Structure Based on Extreme Gradient Boosting
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-04-14
- **Last Updated**: 2021-04-14

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# TH-GRASS: Accurate Prediction of Genome-wide RNA Secondary Structure Based on Extreme Gradient Boosting

## Getting Start
These instructions will get you a copy of the project up and running on your local machine.

### Environment

TH-GRASS has been implemented in `Python3`.

### Requirements
Installing requirements:  
```
pip3 install -r requirements.txt
```

or avoiding problems in multiple Python environments:  

```
python3 -m pip install -r requirements.txt
```

## Training

### Command line:  
```
    usage: train.py [-h] [--input INPUT] [--training_mode TRAINING_MODE]
                    [--dataset DATASET] [--window_size WINDOW_SIZE]
                    [--n_jobs N_JOBS]

    optional arguments:
    -h, --help            show this help message and exit
    --input INPUT         The full path of the input file.
    --training_mode TRAINING_MODE
                            There are two training mode: single or global,
                            'single' is trained by single dataset and 'global' is
                            trained by all datasets.
    --dataset DATASET     This is for dataset, 'py' is for PARS-Yeast, 'ph' is
                            for PARS-Human, 'pdb' is for NMR/X-ray and 'global' is
                            for the three dataset mixed.
    --window_size WINDOW_SIZE
                            The window size when truncating RNA sequences.
    --n_jobs N_JOBS       Number of jobs to run in parallel.
```

### Start Training: 
The input data has been pre-processed from raw dataset. [Here](https://github.com/sysu-yanglab/TH-GRASS/tree/master/preprocessing "Data preprocess") is the data preprocessing methods.  

Example of training:  
* Using one of the datasets(PARS-Yeast, PARS-Human or SS_PDB) as training:
```
    python3 train.py --input='./data/PARS_yeast/py_encode_37.csv' --training_mode='single' --dataset='py' --window_size=37
    python3 train.py --input='./data/PARS_human/ph_encode_37.csv' --training_mode='single' --dataset='ph' --window_size=37
    python3 train.py --input='./data/NMR_X-Ray/pdb_encode_37.csv' --training_mode='single' --dataset='pdb' --window_size=37
```
That you can obtain three single models and save them in the directory: "./model/".  

* Using all of the datasets as training:
```
    python3 train.py --training_mode="global" --dataset="global" --window_size=37
```
That you can obtain the consensus model and it has been saved in the directory: "./model".

### Output:
* model:  
    The models file have been saved in the `'./model/'` directory.  
    If you follow the above commands to training, you will get some models named like "py_37.model", where 'py' is the name of training dataset, '37' is the window size you select.

* parameters:   
    Because training with `GridSearchCV` and `StratifiedKFold` function, the best parameter combinations would be selected from the parameter candidates and it would be saved in the directory: `'./parameters/'`.

## Evaluation

The performance were mainly evaluated by AUC.  

### Command line:  
```
    usage: evaluation.py [-h] [--mode MODE] --test_dataset TEST_DATASET
                        --train_model TRAIN_MODEL [--window_size WINDOW_SIZE]
                        [--output OUTPUT]

    optional arguments:
    -h, --help            show this help message and exit
    --mode MODE           The trained model is trained by two training modes:
                            single or global, 'single' is trained by single
                            dataset and 'global' is trained by all datasets.
    --test_dataset TEST_DATASET
                            This is for test-dataset. Choose from: 'py' is PARS-
                            Yeast 'ph' is PARS-Human, 'pdb' is NMR/X-ray.
    --train_model TRAIN_MODEL
                            This is for using which trained model. Choose from:
                            'py' is PARS-Yeast 'ph' is PARS-Human, 'pdb' is
                            NMR/X-ray and 'global' is Consensus-model.
    --window_size WINDOW_SIZE
                            The window size when truncating RNA sequences.
    --output OUTPUT         The directory for saving predictions on test-dataset.
```

### Usage:
* Single models  
``` 
    python3 evaluation.py --mode='single' --test_dataset='ph' --train_model='py' --window_size=37
```

* Consensus model
```
    python3 evaluation.py --mode='global' --train_model='global' --window_size=37
```

## Testing
Also, you can submit your own test RNA-sequences.

### Input file  
FASTA format, please ensure that your sequences only contain A, C, G, T, and U.

### Usage  

* Command line:  
```
    usage: test.py [-h] --input_fasta INPUT_FASTA --model_file MODEL_FILE
                --window_size WINDOW_SIZE --output_file OUTPUT_FILE

    optional arguments:
    -h, --help            show this help message and exit
    --input_fasta INPUT_FASTA
                            Input the test file of RNA sequences(FASTA format,
                            please ensure that your sequences only contain A, C,
                            G, T, and U)
    --model_file MODEL_FILE
                            The full path of the trained model.
    --window_size WINDOW_SIZE
                            The window size when truncating RNA sequences.
    --output_file OUTPUT_FILE
                            The full path of the output file.
```

* Run:  
```
    python3 test.py --input_fasta=InputFile --model_file='./model/PARS_yeast/py_37.model' --window_size=37 --output_file='./output/preds'
```

## Cite
If you find this work useful in your research, please consider citing the paper:  
**"Accurate Prediction of Genome-wide RNA Secondary Structure Based on eXtreme Gradient Boosting."**


## Contact
`yuedong.yang@gmail.com`, `raojh6@mail2.sysu.edu.cn`  or `keyaobin@mail2.sysu.edu.cn`