# e2efold
**Repository Path**: Leon_02/e2efold
## Basic Information
- **Project Name**: e2efold
- **Description**: pytorch implementation for "RNA Secondary Structure Prediction By Learning Unrolled Algorithms"
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-04-15
- **Last Updated**: 2021-04-15
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# E2Efold: RNA Secondary Structure Prediction By Learning Unrolled Algorithms
PyTorch implementation of [RNA Secondary Structure Prediction By Learning Unrolled Algorithms](https://openreview.net/forum?id=S1eALyrYDH) [1]
[[Paper](https://openreview.net/pdf?id=S1eALyrYDH)] [[Presentation](https://iclr.cc/virtual_2020/poster_S1eALyrYDH.html)] [[Slides](http://xinshi-chen.com/papers/slides/iclr2020-e2efold.pdf)]
## Setup
### Install the package
The environment that we use is given in `environment.yml`. To reproduce exactly the same environment, run the following command.
```
conda env create -f environment.yml
```
Please navigate to the root of this repository, and run the following command to install the package e2efold.
```
source activate rna_ss # activate the environment
pip install -e .
```
### Data
Please download the RNA secondary structure [data](https://drive.google.com/open?id=19KPRYJjjMJh1qdMhtmUoYA_ncw3ocAHc) and put all the `.tgz` files in the `/data` folder. Then, from inside that folder, run:
```
tar -xzf rnastralign_all.tgz
tar -xzf rnastralign_all_600.tgz
tar -xzf archiveII_all.tgz
```
These files contain the processed data. For reference, the code used to preprocess the data is also provided in the `/data` folder.
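If you prefer to unpack the archives from Python rather than with `tar`, the step above can be sketched with the standard `tarfile` module. This is a minimal sketch: the helper name `extract_archives` is hypothetical, and it assumes the three archive names listed above sit directly in the data folder.

```python
import tarfile
from pathlib import Path

# Archive names taken from the tar commands above.
ARCHIVES = ["rnastralign_all.tgz", "rnastralign_all_600.tgz", "archiveII_all.tgz"]


def extract_archives(data_dir, names=ARCHIVES):
    """Extract each archive found in data_dir; return the names that were missing.

    Hypothetical helper, not part of the e2efold package.
    """
    data_dir = Path(data_dir)
    missing = []
    for name in names:
        archive = data_dir / name
        if not archive.is_file():
            missing.append(name)
            continue
        # "r:gz" opens a gzip-compressed tar archive for reading.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(path=data_dir)
    return missing
```

Calling `extract_archives("data")` from the repository root unpacks whatever is present and reports which downloads are still missing.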
### Folder structure
Finally the project should have the following folder structure:
```
e2efold
|___e2efold  # source code
|___e2efold_productive  # productive code for handling new sequences
|___data  # data
    |___archiveII_all
    |___rnastralign_all_600
    |___rnastralign_all
    |___preprocess_archiveii.py  # just as a reference. no need to run.
    |......
|___models_ckpt  # trained models
|___results
|___experiment_archiveii
|___experiment_rnastralign
|___slides_and_articles  # slides and articles related to the project
...
```
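Before running the experiments, it can help to confirm that the layout is in place. The following is a minimal sketch, not part of the repository: the helper name `missing_paths` is hypothetical, and the list only covers the folders named in this README, assuming the three data folders sit inside `data` after extraction.

```python
from pathlib import Path

# Folders named in this README; the data subfolders are assumed to be
# the result of extracting the .tgz files inside data/.
EXPECTED = [
    "e2efold",
    "e2efold_productive",
    "data/archiveII_all",
    "data/rnastralign_all_600",
    "data/rnastralign_all",
    "models_ckpt",
    "experiment_archiveii",
    "experiment_rnastralign",
]


def missing_paths(root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under root."""
    root = Path(root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    for rel in missing_paths("."):
        print("missing:", rel)
```

Run it from the repository root; an empty output means every listed folder was found.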
## Prediction for user's input sequence
To use our trained model directly to make predictions for any RNA sequence, please refer to the information in the `/e2efold_productive` folder.
## Reproduce experimental results in the paper
To reproduce the experiments in our paper, please refer to the following steps:
## Test with trained model
You can download the [pretrained models](https://drive.google.com/open?id=1m038Fw0HBGEzsvhS0mRxd0U7cGXqLAVt) and put the `.pt` files in the folder `/models_ckpt`.
### RNAStralign
You can navigate to the `/experiment_rnastralign` folder and run the following commands to test the model on the RNAStralign test dataset:
```
python e2e_learning_stage3.py -c config.json --test True
python e2e_learning_stage3_rnastralign_all_long.py -c config_long.json --test True
```
### ArchiveII
You can navigate to the `/experiment_archiveii` folder and run the following commands to test the model on the ArchiveII data. Note that the saved model is trained on the RNAStralign database.
```
# For sequences shorter than 600
python e2e_learning_stage3.py -c config.json
# For sequences from 600 to 1800 (the model does not perform well on long sequences in ArchiveII)
python e2e_learning_stage3_rnastralign_all_long.py -c config_long.json
```
## Reproduce the training process or re-train the model on a new dataset
The model is trained on the RNAStralign training set. To reproduce the training process, you can navigate to the folder
`experiment_rnastralign` and run:
```
# For sequences shorter than 600
python e2e_learning_stage1.py -c config.json # pre-train the score network
python e2e_learning_stage3.py -c config.json # end-to-end training
# For sequences from 600 to 1800
python e2e_learning_stage1_rnastralign_all_long.py -c config_long.json
python e2e_learning_stage3_rnastralign_all_long.py -c config_long.json
```
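The two-stage order matters: stage 1 pre-trains the score network and stage 3 then trains end to end. A minimal sketch of a wrapper that runs the stages in that order and stops on the first failure is shown below. The helper names `training_commands` and `run_training` are hypothetical; the script and config names are the ones from the commands above, and `cwd` is assumed to be the folder that contains those scripts.

```python
import subprocess
import sys

# Script and config names copied from the commands above.
STAGES = {
    "short": [  # sequences shorter than 600
        ["e2e_learning_stage1.py", "-c", "config.json"],
        ["e2e_learning_stage3.py", "-c", "config.json"],
    ],
    "long": [  # sequences from 600 to 1800
        ["e2e_learning_stage1_rnastralign_all_long.py", "-c", "config_long.json"],
        ["e2e_learning_stage3_rnastralign_all_long.py", "-c", "config_long.json"],
    ],
}


def training_commands(length_range):
    """Return the argv lists for the given range, stage 1 first."""
    return [[sys.executable, *args] for args in STAGES[length_range]]


def run_training(length_range, cwd="."):
    """Run the stages in order; check=True raises if a stage exits non-zero."""
    for cmd in training_commands(length_range):
        subprocess.run(cmd, cwd=cwd, check=True)
```

For example, `run_training("short", cwd="experiment_rnastralign")` would run the pre-training stage and only start end-to-end training if it succeeds.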
Given the training logic implemented in the above Python files, you can modify the data generator to re-train the model on other datasets. Our data generator is defined in `e2efold/data_generator.py`. One option is to define a subclass of the `RNASSDataGenerator` class.
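The subclassing idea can be sketched as follows. Because the constructor and methods of `RNASSDataGenerator` are not described in this README, the sketch uses a stand-in base class, `BaseGenerator`, with a hypothetical `load_data` hook and `next_batch` method. Every name below is an assumption made for illustration, as is the restriction to the `ACGU` alphabet; check `e2efold/data_generator.py` for the real interface before adapting it.

```python
import random


class BaseGenerator:
    """Stand-in for RNASSDataGenerator (hypothetical interface, not the real one)."""

    def __init__(self, data_dir):
        self.data_dir = data_dir
        self.samples = self.load_data()

    def load_data(self):
        raise NotImplementedError

    def next_batch(self, batch_size, rng=random):
        # Sample without replacement, capped at the dataset size.
        return rng.sample(self.samples, min(batch_size, len(self.samples)))


class MyDatasetGenerator(BaseGenerator):
    """Serves (sequence, base_pairs) records for a new dataset."""

    def __init__(self, data_dir, records):
        # Set records before the base constructor calls load_data().
        self.records = records
        super().__init__(data_dir)

    def load_data(self):
        valid = []
        for seq, pairs in self.records:
            seq = seq.upper()
            # Each pair (i, j) must index two distinct positions with i < j.
            in_range = all(0 <= i < j < len(seq) for i, j in pairs)
            if set(seq) <= set("ACGU") and in_range:
                valid.append((seq, pairs))
        return valid
```

Overriding only the loading step keeps the batching logic in one place, so the training scripts would need no changes beyond constructing the new class.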
## Citation
If you find this library useful in your research, please consider citing:
```
@article{chen2020rna,
title={RNA Secondary Structure Prediction By Learning Unrolled Algorithms},
author={Chen, Xinshi and Li, Yu and Umarov, Ramzan and Gao, Xin and Song, Le},
journal={arXiv preprint arXiv:2002.05810},
year={2020}
}
```
## References
[1] Xinshi Chen*, Yu Li*, Ramzan Umarov, Xin Gao, Le Song. "RNA Secondary Structure Prediction By Learning Unrolled Algorithms." In *International Conference on Learning Representations*, 2020.