Welcome to the NoSQL migration guide. This guide is designed to assist you in migration your data from Cassandra to purpose-built NoSQL databases on AWS using CQLReplicator.
The guide covers multiple migration use cases, providing detailed instructions and examples to help you understand the process. Whether you're moving data from Cassandra to Amazon Keyspaces, S3, or Amazon MemoryDB for Redis.
In addition to migration use cases, the guide explores how you can leverage your data for Data Analytics/ML and incremental backup. Lear how to extract insights from Amazon Keyspaces, optimize your storage, and make data-driven decisions.
The objective of this use-case is to support customers in seamlessly migrating from self-managed Cassandra clusters to Amazon Keyspaces. This migration approach ensures zero downtime, no code compilation, predictable incremental traffic, and migration costs.
Identify the keyspaces and tables in your Cassandra cluster that you want to migrate. You can use the DESCRIBE
command
in CQLSH
to list all keyspaces and tables.
Amazon Keyspaces is compatible with CQL 3.11 API(backward-compatibility with version 2.x). Check with supported Cassandra APIs, operations, functions, and data types in Amazon Keyspaces.
Ensure that you have enough capacity/resources (CPUs, memory, network throughput, and storage) on the Cassandra cluster to handle migration workload.
Estimate the number of rows per table in your Cassandra cluster. This will help you estimate the migration costs for AWS Glue (which be used for CQLReplicator) and Amazon Keyspaces (which will store the migrated data).
Deploy Amazon Keyspaces tables that correspond to your Cassandra tables. Remember to pr-warm or provision these tables to ensure optimal performance.
Follow the instructions provided in the CQLReplicator documentation to configure, initialize, and run the migration tool.
The objective of this use-case is to support customers in seamlessly migrating from self-managed Cassandra clusters to Amazon MemoryDB. This migration approach ensures zero downtime, no code compilation, predictable incremental traffic, and migration costs.
Identify the keyspaces and tables in your Cassandra cluster that you want to migrate. You can use the DESCRIBE
command
in CQLSH
to list all keyspaces and tables.
CQLReplicator performs two keys transformations:
Primary Key Conversion: CQLReplicator converts the primary key from Cassandra into a string format for MemoryDB,
e.g.
a primary key composed of multiple partition and clustering keys would be transformed into a single key1#key2#key3
in
MemoryDB.
Regular columns as JSON: Regular columns from Cassandra are stored as JSON text in MemoryDB. This allows for
flexible and efficient
data access in MemoryDB, as JSON is widely used data interchange format.
Ensure that you have enough capacity/resources (CPUs, memory, network throughput, and storage) on the Cassandra cluster to handle migration workload.
Estimate the number of rows per table in your Cassandra cluster. This will help you estimate the migration costs for AWS Glue (which be used for CQLReplicator) and traffic against Amazon MemoryDB (which will store the migrated data).
Deploy Amazon MemoryDB cluster that correspond to your Cassandra tables.
Follow the instructions provided in the CQLReplicator documentation to configure, initialize, and run the migration tool.
Materialized view (MV) is a powerful concept to manage and access your data. MV allows you to create a new table (the materialized view) that is differently partitioned. This allows you to create MVs that are optimized for your specific access pattern.
It's important to remember to provision enough Read Capacity Units (RCUs) on the source table. This ensures that your source table can handle the additional read workload from CQLReplicator without impacting of your other applications.
The main change you need to do is to replace a content of conf/CassandraConnector.conf
by conf/KeyspacesConnector.conf
.
So, that conf/CassandraConnector.conf
points CQLReplicator to Amazon Keyspaces instead of Cassandra cluster.
Follow the instructions provided in the CQLReplicator documentation to configure, initialize, and run your replication process between the source table (base table) and the target table (MV).
By using MV concept and CQLReplicator, you can create efficient, scalable, and flexible data access patterns with Amazon Keyspaces.
Incremental backups are another way to back up your Amazon Keyspaces data on Amazon S3. With incremental backups every 2 minutes, only the data has changed since your last backup is persisted.
CQLReplicator can be used to facilitate this process. You can point CQLReplicator to a source table in Amazon Keyspaces and a target Amazon S3 bucket. CQLReplicator flushes parquet files by 32MB into the Amazon S3 bucket.
The structure of the S3 prefix is as follows: bucket-name/[tile]/[operation]/[timestamp]/*.parquet
.
The tile is the current CQLReplicator number that handles an operation, from 0 to N tiles, the operations include
INSERT
, UPDATE
, and DELETE
, the timestamp is a local timestamp in following format yyyy-mm-dd hh:mm:ss
.
The incremental backup feature in CQLReplicator, facilitated by AWS Glue, not only ensures incremental data backup but also opens up opportunities for advanced analytics and machine learning (ML).
The parquet files, stored in Amazon S3 as part of the backup process, can be directly used by various AWS services for analytics and ML purposes:
- Amazon SageMaker: You can use the Parquet files in Amazon SageMaker to train ML models. SageMaker provides a complete set of tools for every set of the ML workflow.
- Amazon EMR: Amazon EMR can process large amount of data stored in Parquet files. You can run big data frameworks such as Apache Spark and Hadoop on EMR to analyze your data.
- Amazon Athena: Amazon Athena allows you to query data directly from S3 using standard SQL. You can use Athena to run ad-hoc queries on Parquet files.
- AWS Glue: AWS Glue can catalog your Parquet files in S3 and make them available for ETL jobs. You can transform your data and make it available for a wide range of analytics and ML services.
By leveraging these services, you can extract valuable insights from your data, build predictive models, and make data-driven decisions.
CQLReplicator is a tool to replicate payloads (INSERTS, UPDATES) from Amazon Keyspaces to Amazon OpenSearch Service.
Identify the keyspaces and tables in your source that you want to replicate to Amazon OpenSearch Service.
You can use the DESCRIBE
command in CQLSH
to list all keyspaces and tables.
Create a search index target_keyspace_name-target_table_name
Amazon OpenSearch Service (optional). if the search index
doesn't exist
CQLReplicator will create a simple schema that maps all your columns to search fields. All fields will be text
.
Ensure that you have enough RCUs (read capacity units) on the Keyspaces table to handle replication traffic.
Estimate the number of rows in your source table. This will help you estimate the replication costs for AWS Glue (which be used for CQLReplicator) and traffic against Amazon OpenSearch service.
Deploy Amazon OpenSearch Service that correspond to your Amazon Keyspaces tables.
Follow the instructions provided in the CQLReplicator documentation to configure, initialize, and run the migration tool.
if you want quickly validate the replicated workload with SQL use opensearchsql
You can replicate the Amazon Keyspaces table from a different account without a shared VPC endpoint by using the public endpoint or deploying a private VPC endpoint in the target account. When not using a shared VPC, each account requires its own VPC endpoint. In our example Account B requires its own VPC endpoints to share the table. When using VPC endpoints in this configuration, Amazon Keyspaces appears as a single node cluster to the Cassandra client driver instead of a multi-node cluster.
Identify the keyspaces and tables in your source that you want to replicate to Account B.
You can use the DESCRIBE
command in CQLSH
to list all keyspaces and tables.
Ensure that you have enough RCUs (read capacity units) on the Keyspaces table to handle replication traffic.
conf/KeyspacesConnector.conf
points CQLReplicator to Amazon Keyspaces in Account B.conf/CassandraConnector.conf
points CQLReplicator to Amazon Keyspaces in Account A.
Upon connection, the client driver reaches the DNS server which returns one of the available endpoints in the account’s
VPC.
But the client driver is not able to access the system.peers
table to discover additional endpoints.
Because there are fewer hosts available, the driver makes fewer connections. To adjust this, increase the connection
pool
setting of the driver by a factor of three in KeyspacesConnector.conf
.
connection {
max-requests-per-connection = 2000
pool {
local.size = 3
}
}
During the init phase use the --skip-keyspaces-ledger
to skip the migration.ledger
creation in Account A.
After, you need to manually create the migration.ledger
table in Account B before running the replication process.
aws keyspaces create-table --keyspace-name migration --table-name ledger --schema-definition '{
"allColumns": [ { "name": "ks", "type": "text" },
{ "name": "tbl", "type": "text" },
{ "name": "tile", "type": "int" },
{ "name": "ver", "type": "text" },
{ "name": "dt_load", "type": "timestamp" },
{ "name": "dt_offload", "type": "timestamp" },
{ "name": "load_status", "type": "text" },
{ "name": "location", "type": "text" },
{ "name": "offload_status", "type": "text" } ],
"partitionKeys": [ { "name": "ks" }, { "name": "tbl" } ],
"clusteringKeys": [ { "name": "tile", "orderBy": "ASC" }, { "name": "ver", "orderBy": "ASC" } ] }'
Follow the instructions provided in the CQLReplicator documentation to configure, initialize, and run the migration tool.
This tool licensed under the Apache-2 License. See the LICENSE file.