How ChPulse Solves ClickHouse Data Update Performance Issues

Learn how ChPulse addresses the performance challenges of updating and deleting data in ClickHouse while maintaining real-time data accuracy.

In the world of data analytics, ClickHouse shines with its extremely fast query speed and excellent analytical capabilities. However, as a database designed for analytical workloads, it suffers from a problem common to such data warehouses: poor performance when updating and deleting data.

ChPulse provides a complete real-time cloud data warehouse solution, including real-time data integration and ClickHouse-optimized data ingestion. This article describes, from a technical perspective, the challenges of ClickHouse data ingestion, how we designed and optimized our solution to address them, and how we keep data in ClickHouse always up to date.

Technical Challenges of ClickHouse Data Ingestion

ClickHouse is an excellent analytical data warehouse with very high data insertion and aggregation performance. By contrast, its update and delete performance is poor. The reason becomes clear when we examine its default storage engine, MergeTree.

The MergeTree engine, as the name suggests, consists of "Merge" and "Tree." The "Tree" means that data is organized according to sorting rules established by the primary key, making queries more efficient. This is an indexing method adopted by many databases.

The reason for ClickHouse's poor update and delete performance lies in the "Merge" part. In ClickHouse, each write produces one or more small, immutable data files on disk, called "parts." The engine periodically merges these small parts into larger ones according to configured rules, a process known as "Merge."
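
A quick way to observe these parts is to query ClickHouse's standard system.parts table; each row returned corresponds to one immutable part on disk (this assumes a table named table, matching the placeholder used in the examples below):

SELECT partition, name, rows, active
FROM system.parts
WHERE table = 'table' AND active
ORDER BY name;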

Merging removes duplicate data, optimizes sorting, and reorganizes data by partition. These benefits come at a price: to maximize performance, data in this structure cannot be changed once written. To update or delete a record, ClickHouse must find and rewrite every part that contains it, affecting not just one record but all data in those parts, which is fundamentally far less efficient than an in-place update in a traditional transactional database.

For data updates and deletions, the official recommendation is to use the ALTER command, as shown below:

  • Update data: ALTER TABLE table UPDATE value = newValue WHERE XXX
  • Delete data: ALTER TABLE table DELETE WHERE XXX

In actual testing, we found that ALTER runs at only about 1/1000 the throughput of INSERT, which cannot meet the update-performance requirements of real-time integration scenarios.
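
For reference, these ALTER statements execute asynchronously as "mutations," and their progress can be tracked through ClickHouse's standard system.mutations table:

SELECT mutation_id, command, is_done
FROM system.mutations
WHERE table = 'table' AND NOT is_done;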

Replacing Updates and Deletions with Insertions

To solve the efficiency problem of updates and deletions, we designed a ClickHouse-optimized solution based on the characteristics of MergeTree. It includes three parts: table statement adjustments, update operation adjustments, and delete operation adjustments.

1. Table Statement Adjustments

When creating tables, we chose the ReplacingMergeTree engine as the standard engine, leveraging its ability to deduplicate rows during merges.

In addition to retaining ordinary primary keys, we added several fields, including:

  • version field: Records a version number, set to the current time when writing, which is also the basis for merge deduplication
  • is_deleted field: Indicates whether the data has been deleted

Example table creation statement:

CREATE TABLE table (
    id UInt64,
    value String,
    version UInt64 DEFAULT now(),
    is_deleted UInt8 DEFAULT 0
) ENGINE = ReplacingMergeTree(version)
PRIMARY KEY id
ORDER BY id;

After creating the table, ChPulse automatically creates a corresponding view with a _view suffix that simplifies querying:

CREATE VIEW table_view AS
SELECT * FROM table FINAL WHERE is_deleted = 0;

2. Converting Update Operations to Insert Operations

To optimize update efficiency, traditional update operations are converted to insert operations. The rule: whenever a record needs to be updated, insert a new row for it; the version field ensures the new row supersedes the old one.

For example, to update a record from id=1, value='A' to id=1, value='B', the previous operation statement was:

ALTER TABLE table UPDATE value = 'B' WHERE id = 1

The optimized operation statement is:

INSERT INTO table (id, value) VALUES (1, 'B')

Since ClickHouse data insertion is very fast, this operation completes roughly 1,000 times faster than the ALTER equivalent in our testing.
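
To illustrate the semantics with the example table above, here is a minimal sequence: two inserts for the same id, followed by a FINAL query that returns only the latest version:

INSERT INTO table (id, value) VALUES (1, 'A');
INSERT INTO table (id, value) VALUES (1, 'B');

SELECT * FROM table FINAL WHERE id = 1;
-- Returns a single row with value = 'B'. Note that now() has one-second
-- precision; if two rows tie on version, ReplacingMergeTree keeps the
-- last row inserted.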

3. Converting Delete Operations to Insert Operations

Similar to update operations, delete operations are also converted to insert operations. The rule is: each time data needs to be deleted, insert a new data record and set the is_deleted field to 1.

For example, to delete the record id=1, value='A', the previous operation statement was:

ALTER TABLE table DELETE WHERE id=1

The optimized operation statement is:

INSERT INTO table (id, is_deleted) VALUES(1, 1)

With this method, deletes run at the same speed as inserts, preserving overall data-processing efficiency.
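
Continuing the example, deleting the id = 1 record is just another insert, and the view hides the tombstone immediately:

INSERT INTO table (id, is_deleted) VALUES (1, 1);

SELECT * FROM table_view WHERE id = 1;
-- Returns no rows: the tombstone now has the highest version and is
-- filtered out by the view's is_deleted = 0 condition.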

Query Optimization and Data Cleanup

After using the above optimization solution, query operations can be simplified through the automatically created view. The system provides two query approaches:

1. Direct Table Queries

When querying the table directly, you need to add appropriate conditions:

  1. Add the FINAL keyword after the table name in every query so that results reflect only the latest version of each row
  2. Add is_deleted = 0 to every query's WHERE clause to filter out deleted data

For example: Before optimization, the SQL statement to query table was:

SELECT * FROM table

After optimization, it needs to be adjusted to:

SELECT * FROM table FINAL WHERE is_deleted = 0

2. View-Based Queries

For simpler day-to-day querying, use the automatically created view:

SELECT * FROM table_view

This view automatically applies the FINAL keyword and filters out deleted records (is_deleted = 0), providing a cleaner query interface.

3. Data Cleanup Process

To maintain optimal performance and storage efficiency, ChPulse implements a periodic cleanup process:

  • OPTIMIZE TABLE Operations: Periodically executed to merge data parts and physically remove records marked as deleted (is_deleted = 1); see the sketch after this list
  • Write Blocking: During OPTIMIZE TABLE execution, write operations are temporarily blocked to ensure data consistency
  • Batch Cleanup: Deleted records are removed in batches during the merge process, improving overall system efficiency
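
As a minimal sketch of one possible cleanup cycle (the exact commands ChPulse issues internally may differ): with ReplacingMergeTree(version) alone, OPTIMIZE collapses each id to its latest version but still keeps the tombstone rows, so a follow-up mutation removes them:

-- Collapse each id to its single latest version across all parts
OPTIMIZE TABLE table FINAL;

-- Physically drop the remaining tombstone rows
ALTER TABLE table DELETE WHERE is_deleted = 1;

Because the preceding OPTIMIZE leaves at most one tombstone row per deleted key, this mutation touches far less data than running ALTER DELETE against live traffic.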

Principle Analysis

Compared to using ALTER TABLE for data updates and deletions, ChPulse's solution replaces part-rewriting operations with fast data insertion, yielding roughly a thousand-fold performance improvement in our tests.

For the inserted data: in update scenarios, the version field ensures the updated row is the one retained during the next merge; in deletion scenarios, the is_deleted field marks rows for deletion, and these rows are physically removed during OPTIMIZE TABLE operations, ultimately ensuring data consistency.

During queries, the combination of the FINAL keyword and the is_deleted = 0 filter (handled automatically by the view) guarantees correct results even before merges complete. Periodic OPTIMIZE TABLE operations then physically remove deleted data from storage, maintaining performance and storage efficiency, with writes briefly paused during cleanup to guarantee consistency.

It's worth noting that for this method to work, ChPulse must receive the complete row for each update, which depends on the source database's CDC and logging configuration. When this feature is enabled, ChPulse performs preliminary checks to verify that the source database's configuration meets the requirements for fast updates. If it does not, ChPulse suggests staying with the traditional ALTER mode for updates and deletions, trading change throughput for guaranteed data accuracy.

Conclusion

Through careful table structure design and by converting updates and deletes into inserts, we have overcome ClickHouse's challenges with efficient data ingestion, bringing update and delete speed in line with insert speed and truly achieving real-time data synchronization and efficient processing. This solution not only improves performance but also gives users a smoother experience.

In the future, we will further optimize the solution and explore more possibilities to fully unleash the potential of ClickHouse and empower real-time data processing.