# HedgeCode

**Repository Path**: chengong1012/HedgeCode

## Basic Information

- **Project Name**: HedgeCode
- **Description**: No description available
- **Primary Language**: Python
- **License**: Not specified
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No

## Statistics

- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-08-05
- **Last Updated**: 2025-08-05

## Categories & Tags

**Categories**: Uncategorized
**Tags**: None

## README

# HedgeCode: A Multi-Task Hedging Contrastive Learning Framework for Code Search

> *HedgeCode* consists of three stages. It aligns the representation spaces of code and text in the first stage, further optimizes representation learning in the second stage, and finally uses the trained model for code search.

![1](Figure/Overview.png)

> In the representation alignment stage, we align the representation spaces of code and text through a relevance detection task and design a hedging contrastive learning (HCL) method to capture fine-grained differences between code and text. In the multi-task joint learning stage, we employ joint learning to optimize the code-text relevance detection task (CTRD), the code-text contrastive learning task (CTC), and the code search task (CS). In the code search stage, we employ the trained encoder to retrieve code from the codebase.

## Source Code

#### Environment

```bash
conda env create -f env-hedgecode.yaml
conda activate hedgecode
```

#### Dataset

> Two datasets were used in the experiments. Their statistics are as follows:

**Data statistics of CodeSearchNet.**

| PL | Training | Validation | Test | Candidate Codes |
| :--------- | :------: | :----: | :----: | :----: |
| Ruby | 24,927 | 1,400 | 1,261 | 4,360 |
| JavaScript | 58,025 | 3,885 | 3,291 | 13,981 |
| Java | 164,923 | 5,183 | 10,955 | 40,347 |
| Go | 167,288 | 7,325 | 8,122 | 28,120 |
| PHP | 241,241 | 12,982 | 14,014 | 52,660 |
| Python | 251,820 | 13,914 | 14,918 | 43,827 |

**Data statistics of Relevance Detection Pairs.**

| PL | Training | Validation | Test |
| :--------- | :------: | :----: | :----: |
| Ruby | 74,781 | 4,200 | 3,783 |
| JavaScript | 174,075 | 11,655 | 9,873 |
| Java | 494,769 | 15,549 | 32,865 |
| Go | 501,864 | 21,975 | 24,366 |
| PHP | 723,723 | 38,946 | 42,042 |
| Python | 755,460 | 41,742 | 44,754 |

* Please refer to the [README.md](./dataset/README.md) in the dataset folder for more details of the datasets.

#### Model Training

##### Encoder

HedgeCode is a model-agnostic framework: any transformer-based code large language model (code LLM) can be integrated as the encoder. In this study, we integrate three widely used code LLMs as encoders ([CodeBERT](https://huggingface.co/microsoft/codebert-base), [UniXcoder](https://huggingface.co/microsoft/unixcoder-base), and [CoCoSoDa](https://huggingface.co/DeepSoftwareAnalytics/CoCoSoDa)).

##### Training and Evaluation

HedgeCode has two training stages: the Representation Alignment Stage and the Multi-task Joint Learning Stage.

###### 1. Representation Alignment Stage

> Detector training.

~~~bash
cd ./RA
python representation_alignment.py \
    --language=ruby \
    --output_dir=./save_results \
    --detection_dir="../dataset/detection dataset" \
    --encoder=codebert \
    --nl_length=128 \
    --code_length=256 \
    --loss_type=hcl \
    --batch_size=64 \
    --learning_rate=1e-6 \
    --num_train_epochs=100
~~~
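As a rough intuition for what the alignment stage optimizes, the sketch below implements a plain in-batch InfoNCE contrastive loss between pooled code and text embeddings. It is only an illustrative stand-in: the actual HCL objective (`--loss_type=hcl`) additionally models fine-grained hedging negatives, and the function name, shapes, and temperature here are assumptions rather than the repository's interface.

```python
# Illustrative sketch only: a plain in-batch InfoNCE loss between code and text
# embeddings. The repository's HCL loss refines this idea with hedging negatives;
# names and shapes below are assumptions, not the project's actual API.
import torch
import torch.nn.functional as F

def info_nce(code_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.05):
    """code_emb, text_emb: (batch, dim) pooled encoder outputs for paired code/text."""
    code_emb = F.normalize(code_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = text_emb @ code_emb.t() / temperature                 # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)   # diagonal = matching pairs
    # Symmetric loss: text-to-code and code-to-text retrieval directions
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Example usage with random tensors standing in for encoder outputs
loss = info_nce(torch.randn(8, 768), torch.randn(8, 768))
```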
###### 2. Multi-task Joint Learning Stage

> Training with CodeBERT.

```bash
cd ./MJL/HedgeCode_CodeBERT
lang=ruby
mode=hcl
mkdir -p ./saved_models/$lang/$mode
python run.py \
    --output_dir=./saved_models/$lang/$mode \
    --config_name=microsoft/codebert-base \
    --model_name_or_path=microsoft/codebert-base \
    --tokenizer_name=microsoft/codebert-base \
    --do_train \
    --do_eval \
    --do_test \
    --train_data_file=../../dataset/codesearchnet/$lang/train.jsonl \
    --eval_data_file=../../dataset/codesearchnet/$lang/valid.jsonl \
    --test_data_file=../../dataset/codesearchnet/$lang/test.jsonl \
    --codebase_file=../../dataset/codesearchnet/$lang/codebase.jsonl \
    --detector_path="../../RA/save_results/$lang/codebert/$mode/detector.pth" \
    --num_train_epochs=100 \
    --code_length=256 \
    --nl_length=128 \
    --train_batch_size=64 \
    --eval_batch_size=64 \
    --learning_rate=2e-5 \
    --seed=123456 \
    --fewshot=False \
    2>&1 | tee saved_models/$lang/$mode/train.log
```

> Training with UniXcoder.

```bash
cd ./MJL/HedgeCode_Unixcoder
lang=ruby
mode=hcl
mkdir -p ./saved_models/$lang/$mode
python run.py \
    --output_dir=./saved_models/$lang/$mode \
    --config_name=microsoft/unixcoder-base \
    --model_name_or_path=microsoft/unixcoder-base \
    --tokenizer_name=microsoft/unixcoder-base \
    --do_train \
    --do_eval \
    --do_test \
    --train_data_file=../../dataset/codesearchnet/$lang/train.jsonl \
    --eval_data_file=../../dataset/codesearchnet/$lang/valid.jsonl \
    --test_data_file=../../dataset/codesearchnet/$lang/test.jsonl \
    --codebase_file=../../dataset/codesearchnet/$lang/codebase.jsonl \
    --detector_path="../../RA/save_results/$lang/unixcoder/$mode/detector.pth" \
    --num_train_epochs=100 \
    --code_length=256 \
    --nl_length=128 \
    --train_batch_size=64 \
    --eval_batch_size=64 \
    --learning_rate=2e-5 \
    --seed=123456 \
    --fewshot=False \
    2>&1 | tee saved_models/$lang/$mode/train.log
```

> Training with CoCoSoDa.

~~~bash
cd ./MJL/HedgeCode_CoCoSoDa
lang=ruby
mode=hcl
bash run.sh $lang $mode
~~~

#### Zero-shot and Few-shot Code Search

##### Zero-shot with detector

> First, recall the top-N codes from the codebase with a ball tree.

~~~bash
cd ./RA/zero-shot
lang=ruby
python recall_topn.py \
    --language=$lang \
    --output_dir=./zero-shot/$lang \
    --query_file="../../dataset/detection dataset/pairs/$lang/test.jsonl" \
    --codebase_file="../../dataset/codesearchnet/$lang/codebase.jsonl" \
    --plugin_checkpoint_path="../save_results/$lang/codebert/hcl/detector.pth" \
    --encoder=codebert \
    --batch_size=1024 \
    --topK=1000
~~~

> Then, apply the trained detector to search code from the codebase.

~~~bash
lang=ruby
python zero_shot_search.py \
    --language=$lang \
    --output_dir=./zero-shot/$lang \
    --pair_dataset_file="./zero-shot/$lang/detection_pair_dataset.jsonl" \
    --plugin_checkpoint_path="../save_results/$lang/codebert/hcl/detector.pth" \
    --encoder=codebert \
    --batch_size=5120 \
    --topK=1000
~~~
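The recall step above shortlists top-K candidates by nearest-neighbor search in embedding space before the detector searches within them. The snippet below is a minimal sketch of that idea using `sklearn.neighbors.BallTree`; the array shapes and `top_k` default are placeholders, and `recall_topn.py` remains the authoritative implementation.

```python
# Minimal sketch of ball-tree candidate recall (not the repository's recall_topn.py).
# Assumes query/code embeddings have already been produced by the trained encoder.
import numpy as np
from sklearn.neighbors import BallTree

def recall_top_n(query_embs: np.ndarray, code_embs: np.ndarray, top_k: int = 1000) -> np.ndarray:
    """Return indices of the top_k nearest codebase entries for each query."""
    tree = BallTree(code_embs)                                    # index over codebase embeddings
    _, indices = tree.query(query_embs, k=min(top_k, len(code_embs)))
    return indices                                                # shape: (num_queries, top_k)

# Example with random vectors standing in for encoder outputs
candidates = recall_top_n(np.random.randn(4, 768), np.random.randn(5000, 768), top_k=10)
```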
##### Few-shot code search

> Training with CodeBERT.

~~~bash
cd ./MJL/HedgeCode_CodeBERT
lang=ruby
mode=hcl
mkdir -p ./saved_models/$lang/$mode/fewshot
python run.py \
    --output_dir=./saved_models/$lang/$mode/fewshot \
    --config_name=microsoft/codebert-base \
    --model_name_or_path=microsoft/codebert-base \
    --tokenizer_name=microsoft/codebert-base \
    --do_train \
    --do_eval \
    --do_test \
    --train_data_file=../../dataset/codesearchnet/$lang/train.jsonl \
    --eval_data_file=../../dataset/codesearchnet/$lang/valid.jsonl \
    --test_data_file=../../dataset/codesearchnet/$lang/test.jsonl \
    --codebase_file=../../dataset/codesearchnet/$lang/codebase.jsonl \
    --detector_path="../../RA/save_results/$lang/codebert/$mode/detector.pth" \
    --num_train_epochs=100 \
    --code_length=256 \
    --nl_length=128 \
    --train_batch_size=64 \
    --eval_batch_size=64 \
    --learning_rate=2e-5 \
    --seed=123456 \
    --fewshot \
    2>&1 | tee saved_models/$lang/$mode/train.log
~~~

> Training with UniXcoder.

~~~bash
cd ./MJL/HedgeCode_Unixcoder
lang=ruby
mode=hcl
mkdir -p ./saved_models/$lang/$mode/fewshot
python run.py \
    --output_dir=./saved_models/$lang/$mode/fewshot \
    --config_name=microsoft/unixcoder-base \
    --model_name_or_path=microsoft/unixcoder-base \
    --tokenizer_name=microsoft/unixcoder-base \
    --do_train \
    --do_eval \
    --do_test \
    --train_data_file=../../dataset/codesearchnet/$lang/train.jsonl \
    --eval_data_file=../../dataset/codesearchnet/$lang/valid.jsonl \
    --test_data_file=../../dataset/codesearchnet/$lang/test.jsonl \
    --codebase_file=../../dataset/codesearchnet/$lang/codebase.jsonl \
    --detector_path="../../RA/save_results/$lang/unixcoder/$mode/detector.pth" \
    --num_train_epochs=100 \
    --code_length=256 \
    --nl_length=128 \
    --train_batch_size=64 \
    --eval_batch_size=64 \
    --learning_rate=2e-5 \
    --seed=123456 \
    --fewshot \
    2>&1 | tee saved_models/$lang/$mode/train.log
~~~

> Training with CoCoSoDa. Add the parameter "--fewshot" in the "run.sh" file, and then execute the following script.

~~~bash
cd ./MJL/HedgeCode_CoCoSoDa
lang=ruby
mode=hcl
bash run.sh $lang $mode
~~~

#### Ablation Study

> Training with CodeBERT. Add the parameters "--without_ctrd" and "--without_ctc" to the script.

~~~bash
cd ./MJL/HedgeCode_CodeBERT
lang=ruby
mode=hcl
mkdir -p ./saved_models/$lang/$mode/ablation
python run.py \
    --output_dir=./saved_models/$lang/$mode/ablation \
    --without_ctc \
    --without_ctrd \
    --config_name=microsoft/codebert-base \
    --model_name_or_path=microsoft/codebert-base \
    --tokenizer_name=microsoft/codebert-base \
    --do_train \
    --do_eval \
    --do_test \
    --train_data_file=../../dataset/codesearchnet/$lang/train.jsonl \
    --eval_data_file=../../dataset/codesearchnet/$lang/valid.jsonl \
    --test_data_file=../../dataset/codesearchnet/$lang/test.jsonl \
    --codebase_file=../../dataset/codesearchnet/$lang/codebase.jsonl \
    --detector_path="../../RA/save_results/$lang/codebert/$mode/detector.pth" \
    --num_train_epochs=100 \
    --code_length=256 \
    --nl_length=128 \
    --train_batch_size=64 \
    --eval_batch_size=64 \
    --learning_rate=2e-5 \
    --seed=123456 \
    2>&1 | tee saved_models/$lang/$mode/train.log
~~~
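Conceptually, the `--without_ctrd` and `--without_ctc` switches drop the corresponding terms from the joint objective so that only the remaining tasks are optimized. The snippet below sketches that gating as a simple sum of per-task losses; the loss names and equal weighting are illustrative assumptions, not the exact combination implemented in `run.py`.

```python
# Illustrative sketch of how the ablation flags could gate the joint objective.
# The per-task losses and equal weights are placeholders, not the repository's formulation.
import torch

def joint_loss(loss_cs: torch.Tensor,
               loss_ctrd: torch.Tensor,
               loss_ctc: torch.Tensor,
               without_ctrd: bool = False,
               without_ctc: bool = False) -> torch.Tensor:
    total = loss_cs                      # the code search (CS) task is always optimized
    if not without_ctrd:
        total = total + loss_ctrd        # code-text relevance detection (CTRD) term
    if not without_ctc:
        total = total + loss_ctc         # code-text contrastive learning (CTC) term
    return total

# Example: full multi-task objective vs. the "--without_ctc" ablation
full = joint_loss(torch.tensor(0.5), torch.tensor(0.3), torch.tensor(0.2))
ablated = joint_loss(torch.tensor(0.5), torch.tensor(0.3), torch.tensor(0.2), without_ctc=True)
```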
> Training with UniXcoder. Add the parameters "--without_ctrd" and "--without_ctc" to the script.

~~~bash
cd ./MJL/HedgeCode_Unixcoder
lang=ruby
mode=hcl
mkdir -p ./saved_models/$lang/$mode/ablation
python run.py \
    --output_dir=./saved_models/$lang/$mode/ablation \
    --without_ctc \
    --without_ctrd \
    --config_name=microsoft/unixcoder-base \
    --model_name_or_path=microsoft/unixcoder-base \
    --tokenizer_name=microsoft/unixcoder-base \
    --do_train \
    --do_eval \
    --do_test \
    --train_data_file=../../dataset/codesearchnet/$lang/train.jsonl \
    --eval_data_file=../../dataset/codesearchnet/$lang/valid.jsonl \
    --test_data_file=../../dataset/codesearchnet/$lang/test.jsonl \
    --codebase_file=../../dataset/codesearchnet/$lang/codebase.jsonl \
    --detector_path="../../RA/save_results/$lang/unixcoder/$mode/detector.pth" \
    --num_train_epochs=100 \
    --code_length=256 \
    --nl_length=128 \
    --train_batch_size=64 \
    --eval_batch_size=64 \
    --learning_rate=2e-5 \
    --seed=123456 \
    2>&1 | tee saved_models/$lang/$mode/train.log
~~~

> Training with CoCoSoDa. Add the parameters "--without_ctrd" and "--without_ctc" in the "run.sh" file, and then execute the following script.

~~~bash
cd ./MJL/HedgeCode_CoCoSoDa
lang=ruby
mode=hcl
bash run.sh $lang $mode
~~~

#### Code Search Results (MRR Score)

| Model | Ruby | JavaScript | Go | Python | Java | PHP | Avg. |
| -------------- | :-------: | :--------: | :-------: | :-------: | :-------: | :-------: | :-------: |
| RoBERTa | 62.8 | 56.2 | 85.9 | 61.0 | 62.0 | 57.9 | 64.3 |
| CodeBERT | 67.9 | 62.0 | 88.2 | 67.2 | 67.6 | 62.8 | 69.3 |
| GraphCodeBERT | 70.3 | 64.4 | 89.7 | 69.2 | 69.1 | 64.9 | 71.3 |
| CodeT5 | 71.9 | 65.5 | 88.8 | 69.8 | 68.6 | 64.5 | 71.5 |
| SYNCOBERT | 72.2 | 67.7 | 91.3 | 72.4 | 72.3 | 67.8 | 74.0 |
| UniXcoder | 74.0 | 68.4 | 91.5 | 72.0 | 72.6 | 67.6 | 74.4 |
| CodeT5+ | 77.7 | 70.8 | 92.4 | 75.6 | 76.1 | 69.8 | 77.1 |
| CodeRetriever | 77.1 | 71.9 | 92.4 | 75.8 | 76.5 | 70.8 | 77.4 |
| CoCoSoDa | 81.8 | 76.4 | 92.1 | 75.7 | 76.3 | 70.3 | 78.8 |
| **HedgeCode** | **82.5** | **77.1** | **92.7** | **77.6** | **78.5** | **73.8** | **80.3** |
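For reference, MRR (Mean Reciprocal Rank) averages 1/rank of the correct result over all queries, and the table reports MRR × 100. A minimal sketch of the metric, assuming the input is the 1-based rank of each query's ground-truth code, is shown below.

```python
# Minimal MRR computation: ranks are the 1-based positions of each query's
# ground-truth code in the ranked retrieval results.
from typing import Iterable

def mean_reciprocal_rank(ranks: Iterable[int]) -> float:
    ranks = list(ranks)
    return sum(1.0 / r for r in ranks) / len(ranks)

# Example: correct code ranked 1st, 2nd, and 4th for three queries
print(round(mean_reciprocal_rank([1, 2, 4]) * 100, 1))  # 58.3, i.e. MRR x 100
```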