# ClickLoad

Orchestration for incrementally and reliably loading a large dataset with trillions of rows over a long period of time.

We described the data loading orchestration mechanism used by the script in detail in a [blog post](https://clickhouse.com/blog/supercharge-your-clickhouse-data-loads-part3).

## Capabilities

- Reliably imports data from files hosted in an object storage bucket into ClickHouse
- Supports [any partitioning key](./internals/README.md#support-for-arbitrary-partitioning-keys), projections, and materialized views
- Job queue for the files to be imported; scales linearly
- Continuous data loading [can be set up](./examples/pypi/README.md#setting-up-a-continuous-data-load)

## Pre-requisites

- Python 3.10+
- clickhouse-client
- ClickHouse instance with support for [KeeperMap](https://clickhouse.com/docs/en/engines/table-engines/special/keeper-map) and [keeper_map_strict_mode](https://clickhouse.com/docs/en/engines/table-engines/special/keeper-map#updates)
- ~1 GB of RAM per Keeper node per 1 million scheduled files in the KeeperMap-backed job task table

## Installing

`pip install -r requirements.txt`

Pre-create the tables in ClickHouse: the job task table below, and the target table that will receive the data.

### Table schema for the job task table

```sql
CREATE TABLE tasks
(
    file_path String,
    file_paths Array(String),
    worker_id String DEFAULT '',
    started_time DateTime DEFAULT 0,
    scheduled DateTime MATERIALIZED now()
)
ENGINE = KeeperMap('tasks')
PRIMARY KEY file_path;
```
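The target table is not created by the scripts, so it has to exist before the load starts. The sketch below is purely illustrative: the table name, columns, engine, and partitioning key are placeholders for your own dataset, not something ClickLoad prescribes.

```sql
-- Hypothetical target table: name, columns, PARTITION BY, and ORDER BY are
-- placeholders for your own dataset. ClickLoad works with any partitioning
-- key, and with projections and materialized views defined on the table.
CREATE TABLE my_data
(
    timestamp DateTime,
    project   String,
    payload   String
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(timestamp)
ORDER BY (project, timestamp);
```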
## Running

### Scheduling files for ClickHouse import

```shell
usage: queue_files.py [-h]

  # ① ClickHouse connection settings for the instance hosting the job task table
  --host HOST
  --port PORT
  --username USERNAME
  --password PASSWORD

  # ② file scheduling settings
  --file FILE                     # The file containing the set of object storage URLs for the files to be loaded
  --task_database DATABASE        # Name of the ClickHouse database for the task table
  --task_table TABLE              # Name of the task table
  [--files_chunk_size_min SIZE]   # How many files are atomically processed together at a minimum
  [--files_chunk_size_max SIZE]   # How many files are atomically processed together at a maximum
```

### Starting a worker that continuously imports scheduled files into ClickHouse

```shell
usage: worker.py [-h]

  # ① ClickHouse connection settings for the target instance
  --host HOST
  --port PORT
  --username USERNAME
  --password PASSWORD

  # ② data loading - main settings
  --database DATABASE             # Name of the target ClickHouse database
  --table TABLE                   # Name of the target table
  --task_database DATABASE        # Name of the ClickHouse database for the task table
  --task_table TABLE              # Name of the task table
  [--worker_id ID]                # Unique id for this worker
  [--files_chunk_size_max SIZE]   # How many files are atomically processed together at a maximum

  # ③ data loading - optional settings
  [--cfg.function CFG.FUNCTION]                    # Name of the table function for accessing the to-be-loaded files
  [--cfg.bucket_access_key CFG.ACCESS_KEY]         # Access key for the object storage bucket hosting the files to be loaded
  [--cfg.bucket_access_secret CFG.ACCESS_SECRET]   # Access secret for the object storage bucket hosting the files to be loaded
  [--cfg.format CFG.FORMAT]                        # Name of the file format used
  [--cfg.structure CFG.STRUCTURE]                  # Structure of the file data
  [--cfg.select CFG.SELECT]                        # Custom SELECT clause for retrieving the file data
  [--cfg.where CFG.WHERE]                          # Custom WHERE clause for retrieving the file data
  [--cfg.query_settings CFG.QUERY_SETTINGS [CFG.QUERY_SETTINGS ...]]   # Custom query-level settings
```

## Example

We provide an example of how to use the script with a large example data set [here](./examples/pypi/README.md).
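For orientation, a hypothetical end-to-end run of the two scripts is sketched below. Every host, port, credential, database, and table name is a placeholder, and the `s3` table function and `Parquet` format are assumptions about the source bucket rather than defaults of the scripts.

```shell
# Schedule the object storage URLs listed in files_to_load.txt
# (assumed here to contain one object URL per line)
python3 queue_files.py \
    --host my-clickhouse-host --port 9440 \
    --username default --password '<password>' \
    --file files_to_load.txt \
    --task_database default --task_table tasks

# Start one worker that claims scheduled files from the task table and
# imports them into default.my_data; additional workers can be started the
# same way on other machines, each with a distinct --worker_id
python3 worker.py \
    --host my-clickhouse-host --port 9440 \
    --username default --password '<password>' \
    --database default --table my_data \
    --task_database default --task_table tasks \
    --worker_id worker-1 \
    --cfg.function s3 \
    --cfg.format Parquet
```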