# deltalake-cdc

**Repository Path**: mirrors_ClickHouse/deltalake-cdc

## Basic Information

- **Project Name**: deltalake-cdc
- **License**: MIT
- **Default Branch**: master
- **Created**: 2025-08-19
- **Last Updated**: 2026-01-03

# Delta Lake to ClickHouse CDC Pipeline

This project provides tools to write sample data to a Delta Lake table and stream its changes to ClickHouse using the Delta Lake Change Data Feed (CDF).

## Limitations

- Only INSERTs and UPDATEs are supported (DELETEs are ignored).
- The data generator writes a fixed schema to the Delta table.

## Prerequisites

- Python 3.8+
- AWS credentials configured with access to S3
- ClickHouse server (local or cloud)
- Required Python packages (install with `pip install -r requirements.txt`)

## 1. Generate Sample Data

First, generate some sample data and write it to a Delta Lake table in S3:

```bash
python data_generator.py -p s3://your-bucket/path/to/deltalake/table -r us-east-1
```

Options:

- `-p, --bucket_path`: S3 path where the Delta table will be stored (required)
- `-r, --delta_region`: AWS region for the S3 bucket (default: us-east-1)
- `-b, --batch-size`: Number of rows per batch (default: 10000)

## 2. Query Delta Lake from ClickHouse

You can query the Delta Lake table directly from ClickHouse using the DeltaLake table engine:

```sql
CREATE TABLE my_delta_table
ENGINE = DeltaLake('s3://your-bucket/path/to/table')
```

## 3. Create Destination Table in ClickHouse

Create a table in ClickHouse to store the CDC changes.
The schema should match your Delta table, plus the CDF metadata columns (`_change_type`, `_commit_version`, `_commit_timestamp`):

```sql
CREATE TABLE default.my_cdc_table
(
    `id` String,
    `name` String,
    `age` Int64,
    `created_at` DateTime,
    `_change_type` String,
    `_commit_version` Int64,
    `_commit_timestamp` DateTime
)
ENGINE = ReplacingMergeTree(`_commit_version`)
PARTITION BY toYYYYMM(`created_at`)
ORDER BY (name, age)
SETTINGS index_granularity = 8192;
```

## 4. Run the CDC Script

Run the CDC script to stream changes from the Delta Lake table to ClickHouse:

```bash
python main.py \
  -p "s3://your-bucket/path/to/table" \
  -r "us-east-2" \
  -t "default.my_cdc_table" \
  -H "host.us-west-2.aws.clickhouse.cloud" \
  -u "default" \
  -P "password" \
  --access-key "[EXAMPLE]" \
  --secret-key "[EXAMPLE]" \
  -v 1
```
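
To make the limitation on DELETEs and the role of `ReplacingMergeTree(_commit_version)` more concrete, here is a minimal, self-contained sketch in plain Python. It is not the code in `main.py`; the function names `select_upserts` and `latest_by_key` are hypothetical, and dropping `update_preimage` rows is an assumption about how such a pipeline would typically behave. The `_change_type` values are the ones emitted by the Delta Lake Change Data Feed, and the key columns mirror the `ORDER BY (name, age)` of the destination table above.

```python
from typing import Dict, Iterable, List, Tuple

# Change types we forward to ClickHouse. "delete" is ignored (see Limitations);
# skipping "update_preimage" is an assumption made for this illustration.
KEPT_CHANGE_TYPES = {"insert", "update_postimage"}


def select_upserts(rows: Iterable[dict]) -> List[dict]:
    """Keep inserts and post-update images; drop deletes and pre-update images."""
    return [row for row in rows if row["_change_type"] in KEPT_CHANGE_TYPES]


def latest_by_key(rows: Iterable[dict], key_columns: Tuple[str, ...]) -> List[dict]:
    """Imitate ReplacingMergeTree(_commit_version): for each sorting key,
    keep the row with the highest version (the later row wins on ties)."""
    latest: Dict[tuple, dict] = {}
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        current = latest.get(key)
        if current is None or row["_commit_version"] >= current["_commit_version"]:
            latest[key] = row
    return list(latest.values())


# Hypothetical change feed: two inserts, one update, one delete.
changes = [
    {"id": "1", "name": "Ada", "age": 36, "created_at": "2025-01-01 00:00:00",
     "_change_type": "insert", "_commit_version": 1},
    {"id": "2", "name": "Bob", "age": 41, "created_at": "2025-01-01 00:00:00",
     "_change_type": "insert", "_commit_version": 1},
    {"id": "1", "name": "Ada", "age": 36, "created_at": "2025-01-01 00:00:00",
     "_change_type": "update_preimage", "_commit_version": 2},
    {"id": "1", "name": "Ada", "age": 36, "created_at": "2025-02-01 00:00:00",
     "_change_type": "update_postimage", "_commit_version": 2},
    {"id": "2", "name": "Bob", "age": 41, "created_at": "2025-01-01 00:00:00",
     "_change_type": "delete", "_commit_version": 3},
]

upserts = select_upserts(changes)
current_state = latest_by_key(upserts, ("name", "age"))

# Ada appears once with her updated row; Bob is still present because
# the delete was dropped rather than applied.
for row in current_state:
    print(row["name"], row["_commit_version"], row["_change_type"])
```

Because ClickHouse deduplicates on the sorting key, an update that changes `name` or `age` would produce a new key rather than replace the old row, so the `ORDER BY` columns are worth choosing with that in mind.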