At the moment you are forced to use AWS S3 or the file system for Datomic backups. The latest Datomic Pro version uses the aws-s3-1.x SDK, where it is not possible to change the S3_ENDPOINT via an env var. That option was introduced in AWS SDK for Java v2 2.28.1.
Any chance to get a Datomic-specific env var or Java property to modify the S3_ENDPOINT? This would allow backups to Google Cloud Storage and many other S3-compatible storage services.
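For comparison, here is roughly what the endpoint override looks like on the v2 SDK (just a sketch, not Datomic code; the GCS HMAC keys and the "auto" region are placeholders). If I read the release notes correctly, since 2.28.1 the v2 SDK can also pick the endpoint up from the AWS_ENDPOINT_URL_S3 env var without any code change:

import java.net.URI;

import software.amazon.awssdk.auth.credentials.AwsBasicCredentials;
import software.amazon.awssdk.auth.credentials.StaticCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

public class GcsS3V2Client {
    public static void main(String[] args) {
        // Point the v2 client at the GCS interoperability endpoint.
        S3Client s3 = S3Client.builder()
                .endpointOverride(URI.create("https://storage.googleapis.com"))
                .region(Region.of("auto")) // required by the SDK, ignored by GCS
                .forcePathStyle(true)      // GCS requires path-style access
                .credentialsProvider(StaticCredentialsProvider.create(
                        AwsBasicCredentials.create("GCS_ACCESS_KEY", "GCS_SECRET_KEY")))
                .build();

        // Example usage
        s3.listBuckets().buckets().forEach(b -> System.out.println(b.name()));
    }
}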
@maxweber I think I follow, and I recall that you requested this in a past conversation with me. We have a current project to move to AWS SDK v2, and I am glad you brought this up because I want to ensure we exercise this option when we update to v2.
I am curious whether you have tried using a file path pointing at a Google Cloud Storage location and if that works.
I just realized there is no way that could work. I will discuss with dev.
@jaret thanks a lot for discussing this with the dev team. Having this would be awesome, and we could get rid of our AWS account (which we only keep for the backups).
@maxweber Incremental backup should work if the file location is the same. Have you tried it?
Additionally, I believe it's worth re-testing, as the backup code has changed over the years (reducing the number of operations needed; see releases like Datomic Pro Change Log | Datomic), and it's possible that whatever friction you are recalling has actually been resolved. We also now have a first-class feature for verifying backups, called verify-backup, which lets you read all the files of a backup and verify its integrity.
How frequently are you backing up?
Do you have any other problems backing up, such as speed?
If you want to move to GCP, you could just take a full backup and point it at GCP using a file backup, as sketched below. If it goes too slowly, we can help with tuning to possibly make it faster.
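Something along these lines (just a sketch; the db URI and the gcsfuse mount path are placeholders, and bin/datomic help shows the exact arguments for each command):

# Full backup to a file URI on a locally mounted GCS bucket (placeholder paths)
bin/datomic backup-db datomic:dev://localhost:4334/my-db file:/mnt/gcs-backups/my-db

# Read the backup back and check its integrity (see the docs for exact options)
bin/datomic verify-backup file:/mnt/gcs-backups/my-db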
Tried it multiple times. I guess the problem is that the Datomic backup does a .listFiles call on java.io.File (or something similar). That's totally fine for a normal file system, but on gcsfuse that operation causes thousands of list calls to the Google Cloud Storage API, since gcsfuse does not cache the entire directory tree.
On our old system (Storrito 1.0), every 4 hours. On our new system (Storrito 2.0), we plan to back up much more often (there we also have one logical db per customer).
I gave up on finding a solution based on gcsfuse or something similar. Adding an option to set the S3 endpoint would probably solve all the associated troubles. ChatGPT gave me this example code for the AWS S3 SDK 1.x lib:
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class GcsS3Client {
    public static void main(String[] args) {
        // Example credentials: GCS does not use them in the same way, but the SDK requires them.
        BasicAWSCredentials credentials = new BasicAWSCredentials("GCS_ACCESS_KEY", "GCS_SECRET_KEY");

        // GCS S3-compatible endpoint
        String gcsEndpoint = "https://storage.googleapis.com";

        // Region doesn't matter for GCS, but is required by the SDK
        String region = "auto"; // Can be anything, commonly "auto" or "us-east1"

        AmazonS3 s3Client = AmazonS3ClientBuilder.standard()
                .withEndpointConfiguration(new EndpointConfiguration(gcsEndpoint, region))
                .withCredentials(new AWSStaticCredentialsProvider(credentials))
                .withPathStyleAccessEnabled(true) // GCS requires path-style access
                .build();

        // Example usage
        s3Client.listBuckets().forEach(bucket -> System.out.println(bucket.getName()));
    }
}
Configuring the endpoint, region, and maybe PathStyleAccessEnabled via env vars or Java system properties could probably make Datomic backup compatible with most storage services that offer an S3-compatible API. I guess upgrading to the AWS 2.x SDKs is a lot of work and would delay a solution for this challenge for quite a while?
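Just to make the idea concrete, an override on the current 1.x SDK could look roughly like this. The property and env-var names (datomic.s3Endpoint, DATOMIC_S3_ENDPOINT, etc.) are purely hypothetical, invented here for illustration; nothing like them exists in Datomic today:

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain;
import com.amazonaws.client.builder.AwsClientBuilder.EndpointConfiguration;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class ConfigurableS3Client {

    // Hypothetical sketch: the property/env-var names below are made up for this example.
    static AmazonS3 buildClient() {
        String endpoint = System.getProperty("datomic.s3Endpoint", System.getenv("DATOMIC_S3_ENDPOINT"));
        String region = System.getProperty("datomic.s3Region", "auto");
        boolean pathStyle = Boolean.parseBoolean(System.getProperty("datomic.s3PathStyle", "true"));

        AmazonS3ClientBuilder builder = AmazonS3ClientBuilder.standard()
                .withCredentials(new DefaultAWSCredentialsProviderChain());

        if (endpoint != null) {
            // Only override when an endpoint is explicitly configured; otherwise keep the AWS defaults.
            builder.withEndpointConfiguration(new EndpointConfiguration(endpoint, region))
                   .withPathStyleAccessEnabled(pathStyle);
        }
        return builder.build();
    }
}

With something along those lines, pointing backups at GCS, MinIO, or any other S3-compatible storage would just be a matter of setting a couple of env vars.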