# e2efold

**Repository Path**: Leon_02/e2efold

## Basic Information

- **Project Name**: e2efold
- **Description**: pytorch implementation for "RNA Secondary Structure Prediction By Learning Unrolled Algorithms"
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2021-04-15
- **Last Updated**: 2021-04-15

## Categories & Tags

**Categories**: Uncategorized

**Tags**: None

## README

# E2Efold: RNA Secondary Structure Prediction By Learning Unrolled Algorithms

PyTorch implementation of [RNA Secondary Structure Prediction By Learning Unrolled Algorithms](https://openreview.net/forum?id=S1eALyrYDH) [1].

[[Paper](https://openreview.net/pdf?id=S1eALyrYDH)] [[Presentation](https://iclr.cc/virtual_2020/poster_S1eALyrYDH.html)] [[Slides](http://xinshi-chen.com/papers/slides/iclr2020-e2efold.pdf)] [GaTech news] [Chinese news] [Chinese introduction] [Plain explanation]

## Setup

### Install the package

The environment that we use is specified in `environment.yml`. To reproduce it exactly, run the following command:

```
conda env create -f environment.yml
```

Then navigate to the root of this repository and run the following commands to install the `e2efold` package:

```
source activate rna_ss # activate the environment
pip install -e .
```

### Data

Please download the RNA secondary structure [data](https://drive.google.com/open?id=19KPRYJjjMJh1qdMhtmUoYA_ncw3ocAHc) and put all the `.tgz` files in the `/data` folder. Then extract them from inside that folder:

```
tar -xzf rnastralign_all.tgz
tar -xzf rnastralign_all_600.tgz
tar -xzf archiveII_all.tgz
```

These files contain the processed data. For reference, the code used to preprocess the data is also provided in the `/data` folder.
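If you would rather script the extraction step than run `tar` by hand, the following is a minimal sketch that uses only the Python standard library. The function name `extract_archives` and the default `data` directory (relative to the repository root) are illustrative assumptions and are not part of e2efold; the archive names are taken from the `tar` commands above.

```python
import tarfile
from pathlib import Path

# Archive names taken from the tar commands in the Data section.
ARCHIVES = ("rnastralign_all.tgz", "rnastralign_all_600.tgz", "archiveII_all.tgz")


def extract_archives(data_dir="data", archives=ARCHIVES):
    """Extract each archive found in data_dir and return the names that were missing.

    Hypothetical helper: `data_dir` is assumed to be the folder that holds
    the downloaded `.tgz` files.
    """
    data_dir = Path(data_dir)
    missing = []
    for name in archives:
        archive = data_dir / name
        if not archive.is_file():
            # Record the missing archive instead of failing, so the caller can report it.
            missing.append(name)
            continue
        # Equivalent to running `tar -xzf <name>` inside data_dir.
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(path=data_dir)
    return missing


if __name__ == "__main__":
    not_found = extract_archives()
    if not_found:
        print("Missing archives:", ", ".join(not_found))
```

Returning the list of missing archives, rather than raising on the first one, lets you see in a single run which downloads still need to be placed in the data folder.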
### Folder structure

Finally, the project should have the following folder structure:

```
e2efold
|___e2efold  # source code
|___e2efold_productive  # productive code for handling new sequences
|___data  # data
    |___archiveII_all
    |___rnastralign_all_600
    |___rnastralign_all
    |___preprocess_archiveii.py  # just as a reference. no need to run.
    |......
|___models_ckpt  # trained models
|___results
|___experiment_archiveii
|___experiment_rnastralign
|___slides_and_articles  # slides and articles related to the project
...
```

## Prediction for user's input sequence

To use our trained model directly to make predictions for any RNA sequence, please refer to the information in the `/e2efold_productive` folder.

## Reproduce experimental results in the paper

To reproduce the experiments in our paper, please follow the steps below.

## Test with trained model

Download the [pretrained models](https://drive.google.com/open?id=1m038Fw0HBGEzsvhS0mRxd0U7cGXqLAVt) and put the `.pt` files in the `/models_ckpt` folder.

### RNAStralign

Navigate to the `/experiment_rnastralign` folder and run the following commands to test the model on the RNAStralign test dataset:

```
python e2e_learning_stage3.py -c config.json --test True
python e2e_learning_stage3_rnastralign_all_long.py -c config_long.json --test True
```

### ArchiveII

Navigate to the `/experiment_archiveii` folder and run the following commands to test the model on the ArchiveII data. Note that the saved model is trained on the RNAStralign database.

```
# For sequences shorter than 600
python e2e_learning_stage3.py -c config.json
# For sequences from 600 to 1800 (the model does not perform well on long sequences in ArchiveII)
python e2e_learning_stage3_rnastralign_all_long.py -c config_long.json
```

## Reproduce the training process or re-train the model on a new dataset

The model is trained on the RNAStralign training set.
To reproduce the training process, navigate to the `/experiment_rnastralign` folder and run:

```
# For sequences shorter than 600
python e2e_learning_stage1.py -c config.json # pre-train the score network
python e2e_learning_stage3.py -c config.json # end-to-end training
# For sequences from 600 to 1800
python e2e_learning_stage1_rnastralign_all_long.py -c config_long.json
python e2e_learning_stage3_rnastralign_all_long.py -c config_long.json
```

The training logic is implemented in the Python files above, so you can re-train the model on other datasets by modifying the data generator. Our data generator is defined in `e2efold/data_generator.py`. One option is to define a subclass of the `RNASSDataGenerator` class.

## Citation

If you find this library useful in your research, please consider citing:

```
@article{chen2020rna,
  title={RNA Secondary Structure Prediction By Learning Unrolled Algorithms},
  author={Chen, Xinshi and Li, Yu and Umarov, Ramzan and Gao, Xin and Song, Le},
  journal={arXiv preprint arXiv:2002.05810},
  year={2020}
}
```

## References

[1] Xinshi Chen*, Yu Li*, Ramzan Umarov, Xin Gao, Le Song. "RNA Secondary Structure Prediction By Learning Unrolled Algorithms." In *International Conference on Learning Representations*, 2020.