# Enzymatic Transformer

This repo complements the "[Predicting Enzymatic Reactions with a Molecular Transformer](https://chemrxiv.org/articles/preprint/Predicting_Enzymatic_Reactions_with_a_Molecular_Transformer/13161359/1)" publication.

## Requirements

### Specific versions used:

- Python: 3.6.10
- Torch: 1.5.1
- TorchText: 0.6.1
- OpenNMT: 1.1.1
- RDKit: 2017.09.1

### Conda Environment Setup

```bash
conda create -n enztrans_test python=3.6
conda activate enztrans_test
conda install -c rdkit rdkit=2017.09.1 -y
conda install -c pytorch pytorch=1.5.1 -y

git clone https://github.com/reymond-group/OpenNMT-py.git
cd OpenNMT-py
git checkout Enzymatic_Transformer
pip install -e .
```

## Quickstart

The training and evaluation were performed using [OpenNMT-py](https://github.com/OpenNMT/OpenNMT-py). The full OpenNMT documentation can be found [here](https://opennmt.net/OpenNMT-py/).

### Step 1: Tokenization

The reaction SMILES are tokenized using the tokenization function available from the Molecular Transformer [here](https://github.com/pschwllr/MolecularTransformer) (a copy is reproduced at the end of this step for convenience).

Enzyme sentences are tokenized using the Hugging Face tokenizers available [here](https://github.com/huggingface/tokenizers/tree/master/bindings/python#build-your-own). The custom tokenizer can be built from a file containing the list of sentences using the following commands:

```python
from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer2 = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer2.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer2.decoder = decoders.ByteLevel()
tokenizer2.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(vocab_size=9000, min_frequency=2, limit_alphabet=55,
                              special_tokens=['ase', 'hydro', 'mono', 'cyclo', 'thermo', 'im'])
# Note: in recent versions of the tokenizers library the argument order is
# tokenizer2.train(files, trainer) instead.
tokenizer2.train(trainer, ["list_of_sentences.txt"])
```

Then, sentences of the dataset are tokenized using the following function:

```python
def enzyme_sentence_tokenizer(sentence):
    '''
    Tokenize a sentence, optimized for enzyme-like descriptions & names
    '''
    encoded = tokenizer2.encode(sentence)
    my_list = [item for item in encoded.tokens if 'Ġ' != item]
    my_list = [item.replace('Ġ', '_') for item in my_list]
    my_list = ' '.join(my_list)
    return my_list
```
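For convenience, the regex-based `smi_tokenizer` from the Molecular Transformer repository is reproduced below (see the linked repository for the authoritative version). The last line, which concatenates the tokenized substrate SMILES with the tokenized enzyme sentence, is purely illustrative: the exact source-line format used for training is defined by the dataset files, not by this snippet.

```python
import re

def smi_tokenizer(smi):
    """
    Tokenize a SMILES molecule or reaction (regex reproduced from the Molecular Transformer repo)
    """
    pattern = "(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"
    regex = re.compile(pattern)
    tokens = [token for token in regex.findall(smi)]
    assert smi == ''.join(tokens)
    return ' '.join(tokens)

print(smi_tokenizer('CC(=O)OC1=CC=CC=C1C(=O)O'))
# C C ( = O ) O C 1 = C C = C C = C 1 C ( = O ) O

# Illustrative only: one possible way to assemble a source line from a substrate SMILES
# and an enzyme sentence, using the tokenizers defined above.
src_line = smi_tokenizer('CC(=O)OC1=CC=CC=C1C(=O)O') + ' ' + \
    enzyme_sentence_tokenizer('carboxylesterase from Bacillus subtilis')
```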
### Step 2: Preprocess the data

```bash
DATASET=data/uspto_dataset
DATASET_TRANSFER=data/transfer_dataset

preprocess.py -train_ids ENZR ST_sep_aug \
	-train_src DATADIR/src_train.txt $DATASET_TRANSFER/src-train.txt \
	-train_tgt DATADIR/tgt_train.txt $DATASET_TRANSFER/tgt-train.txt \
	-valid_src DATADIR/src_val.txt -valid_tgt $DATASET_TRANSFER/multi_task/tgt_val.txt \
	-save_data DATADIR/Preprocessed \
	-src_seq_length 3000 -tgt_seq_length 3000 \
	-src_vocab_size 3000 -tgt_vocab_size 3000 \
	-share_vocab -lower
```

### Step 3: Training of the model

The Enzymatic Transformer was trained using the following parameters:

Multi-task transfer learning:

```bash
WEIGHT1=1
WEIGHT2=9

train.py -data DATADIR/Preprocessed \
	-save_model ENZR_MTL -seed 42 -train_steps 200000 -param_init 0 \
	-param_init_glorot -max_generator_batches 32 -batch_size 6144 \
	-batch_type tokens -normalization tokens -max_grad_norm 0 -accum_count 4 \
	-optim adam -adam_beta1 0.9 -adam_beta2 0.998 -decay_method noam \
	-warmup_steps 8000 -learning_rate 4 -label_smoothing 0.0 -layers 4 \
	-rnn_size 384 -word_vec_size 384 \
	-encoder_type transformer -decoder_type transformer \
	-dropout 0.1 -position_encoding -global_attention general \
	-global_attention_function softmax -self_attn_type scaled-dot \
	-heads 8 -transformer_ff 2048 \
	-data_ids ENZR ST_sep_aug -data_weights $WEIGHT1 $WEIGHT2 \
	-valid_steps 5000 -valid_batch_size 4 -early_stopping_criteria accuracy
```

### Step 4: Model prediction

A reaction can be predicted after tokenization using the following command:

```bash
translate.py -model model_uspto_ENZR_multitask.pt \
	-src DATASET/src_test.txt \
	-output predictions.txt \
	-batch_size 64 -replace_unk -max_length 1000 \
	-log_probs -beam_size 5 -n_best 5
```

## Citation

### Enzymatic Transformer:

```bibtex
@article{kreutter_predicting_2020,
	title = {Predicting {Enzymatic} {Reactions} with a {Molecular} {Transformer}},
	author = {Kreutter, David and Schwaller, Philippe and Reymond, Jean-Louis},
	url = {https://chemrxiv.org/articles/preprint/Predicting_Enzymatic_Reactions_with_a_Molecular_Transformer/13161359/1},
	doi = {10.26434/chemrxiv.13161359.v1},
	urldate = {2020-10-30},
	month = oct,
	year = {2020},
	note = {Publisher: ChemRxiv}
}
```

### Original OpenNMT-py:

If you reuse this code please also cite the underlying code framework:

[OpenNMT: Neural Machine Translation Toolkit](https://arxiv.org/pdf/1805.11462.pdf)

[OpenNMT technical report](https://www.aclweb.org/anthology/P17-4012/)

```bibtex
@inproceedings{opennmt,
	author = {Guillaume Klein and Yoon Kim and Yuntian Deng and Jean Senellart and Alexander M. Rush},
	title = {Open{NMT}: Open-Source Toolkit for Neural Machine Translation},
	booktitle = {Proc. ACL},
	year = {2017},
	url = {https://doi.org/10.18653/v1/P17-4012},
	doi = {10.18653/v1/P17-4012}
}
```
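## Scoring the predictions

As a complement to Step 4, below is a minimal sketch (not part of the original repository) for scoring the `-n_best 5` output of `translate.py`. It assumes `predictions.txt` contains five tokenized candidate products per source line and that a hypothetical `tgt_test.txt` file holds one tokenized ground-truth product per line; both sides are canonicalized with RDKit before comparison.

```python
# Minimal scoring sketch, assuming 5 candidates per source line (matching -n_best 5)
# and a hypothetical tokenized ground-truth file 'tgt_test.txt'.
from rdkit import Chem

def canonicalize(tokenized_smiles):
    """Strip whitespace tokens and return a canonical SMILES, or None if invalid."""
    mol = Chem.MolFromSmiles(tokenized_smiles.replace(' ', ''))
    return Chem.MolToSmiles(mol) if mol is not None else None

n_best = 5
with open('predictions.txt') as f:
    preds = [canonicalize(line.strip()) for line in f]
with open('tgt_test.txt') as f:
    targets = [canonicalize(line.strip()) for line in f]

top1 = top5 = 0
for i, tgt in enumerate(targets):
    candidates = preds[i * n_best:(i + 1) * n_best]
    top1 += int(tgt is not None and candidates[0] == tgt)
    top5 += int(tgt is not None and tgt in candidates)

print(f'top-1: {top1 / len(targets):.3f}  top-5: {top5 / len(targets):.3f}')
```

Canonicalizing both the predictions and the references makes the string comparison insensitive to atom ordering and SMILES notation differences.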