# DNABoundedHomopolymerEncoding **Repository Path**: mirrors_microsoft/DNABoundedHomopolymerEncoding ## Basic Information - **Project Name**: DNABoundedHomopolymerEncoding - **Description**: A codec for encoding and decoding binary strings into DNA strands with bounded homopolymer runs - **Primary Language**: Unknown - **License**: MIT - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2025-10-08 - **Last Updated**: 2025-10-11 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README # Bounded Homopolymer Encoding for DNA Storage Codec for converting binary strings into DNA-like sequences over the alphabet `{A, C, G, T}` while guaranteeing that no symbol repeats more than *k* times consecutively (length *k* homopolymer run). The repository ships a fast C++ encoder/decoder that supports maximum homopolymer run lengths `k ∈ {1, 2, 3, 4, 5}`. See the [writeup](https://github.com/microsoft/DNABoundedHomopolymerEncoding/blob/main/BoundedHomopolymerEncoding.pdf) for how the encoding and decoding algorithms work. This code is written as part of the [DNA Storage Project](https://www.microsoft.com/en-us/research/project/dna-storage/) at Microsoft. ## Performance - About 50Mbps for encoding and 80 Mbps for decoding on a single core. Can be easily parallelized across multiple cores. - Max run length 1 is about 3 times faster, because it is basically base 2 to base 3 conversion. - Uses GMP multi precision library for maintaining large integers. ## Why bounded homopolymers? DNA synthesis and sequencing are sensitive to long runs of identical bases (e.g., `AAA` or `CCCC`). By bounding the length of these homopolymer runs, stored strands become more robust to physical errors. The encoder in this repository ensures that every generated strand respects the chosen bound *k* while still allowing you to recover the original binary payload exactly. ## Features - Deterministic encoding of binary strings into base-4 strings subject to a homopolymer constraint. - Efficient decoder that returns the original binary message. - Supports five constraint strengths (`k = 1…5`). - Includes benchmarking harnesses to measure encoding/decoding throughput with GMP-backed big-integer arithmetic. ## Prerequisites The C++ sources depend on: - A C++17-compatible compiler (tested with `g++`). - [GNU Multiple Precision Arithmetic Library (GMP)](https://gmplib.org/) with C++ bindings (`libgmpxx`). ### Installing GMP ```bash # Ubuntu / Debian sudo apt-get update sudo apt-get install -y libgmp3-dev # macOS using Homebrew brew install gmp ``` For Windows, download the prebuilt binaries or build from source following the instructions at [gmplib.org](https://gmplib.org/) or the [NYU GMP guide](https://cs.nyu.edu/~exact/core/gmp/index.html). ## Building the encoder/decoder Clone this repository and compile the main driver: ```bash git clone https://github.com/microsoft/DNABoundedHomopolymerEncoding.git cd DNABoundedHomopolymerEncoding g++ -O3 BoundedHomopolymerEncoding.cpp -lgmpxx -lgmp -o bounded_homopolymer ``` This produces an executable named `bounded_homopolymer` that links against GMP. If you also need the rate benchmarking tool, build `BHE_rates.cpp` in the same way: ```bash g++ -O3 BHE_rates.cpp -lgmpxx -lgmp -o bhe_rates ``` ## Usage ### Command line driver The compiled `bounded_homopolymer` binary encodes random binary messages and checks that the decoder inverts the process. ```bash ./bounded_homopolymer ``` - `max_homopolymer_run_length` (`k`): Choose from `1` to `5`. This bounds the length of any homopolymer run in the encoded strand. For example, `k = 3` guarantees that substrings such as `AAAA` or `GGGG` never appear; the maximum run is `AAA`, `CCC`, `GGG`, or `TTT`. - `encoding_length`: Desired length (in bases) of the encoded strand. - `input_data_length`: Length (in bits) of the binary message to encode. - `number_trials`: Number of random messages to encode/decode for benchmarking. Example run: ```bash ./bounded_homopolymer 3 25 40 100 ``` The program prints the maximum encodable message size for the provided parameters and reports encoding/decoding timings. If any trial fails to round-trip, an error message is emitted. ### Estimating minimal encoding lengths The companion `bhe_rates` tool leverages the same encoder to estimate the smallest strand length that can host a target number of data bits while honoring the homopolymer constraint. It evaluates each `k ∈ {1,…,5}` and reports the achievable rate (bits per base). ```bash ./bhe_rates ``` - `encoding_length`: Strand length (in bases) you are willing to allocate. - Output: For each `k ∈ {1,…,5}`, the tool reports the maximum payload size (in bits) that fits within `encoding_length` bases while respecting the homopolymer constraint, plus the corresponding rate. Example: ```bash ./bhe_rates 96 ``` Sample output: ``` k max_input_bits rate(bits/base) ------------------------------------------ 1 152 1.583333 2 184 1.916667 3 190 1.979167 4 191 1.989583 5 191 1.989583 ``` If a given constraint cannot store any payload at that length, the tool reports `0` bits for that row. ### Using the classes in your own code Both `BoundedHomopolymerEncoder` and `BoundedHomopolymerDecoder` are self-contained in `BoundedHomopolymerEncoding.cpp`. You can include the file or extract the classes into your project. ```cpp #include "BoundedHomopolymerEncoding.cpp" int main() { const int k = 3; // Maximum homopolymer run length const int encoded_len = 25; // Length of the output DNA strand const int data_len = 12; // Length of the binary input BoundedHomopolymerEncoder encoder(k, encoded_len, data_len); std::string input = "101101001011"; // Provide your own binary string std::string encoded = encoder.encode(input); // Map digits {0,1,2,3} to {A,C,G,T} (or your preferred ordering) to obtain a DNA strand. BoundedHomopolymerDecoder decoder(k, encoded_len, data_len); std::string recovered = decoder.decode(encoded); } ``` ## Troubleshooting - **`Max data bits that can be encoded` warning:** Reduce `input_data_length` or increase `encoding_length` so the message fits the selected constraint. - **Linker errors about GMP:** Confirm `libgmp` and `libgmpxx` are installed and search paths are visible to your compiler. ## Acknowledgements The author would like to thank [Jon Schneider](https://jschnei.github.io/) for his insightful comments which motivated the encoding and decoding algorithms. ## Contributing This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit [Contributor License Agreements](https://cla.opensource.microsoft.com). When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA. This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. ## Trademarks This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft's Trademark & Brand Guidelines](https://www.microsoft.com/legal/intellectualproperty/trademarks/usage/general). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.