I'm trying to build a production-esque setup of Signoz for my client in China - they're insisting on using Fargate for most of it and Terraform for the build.
I've created a bunch of modules to try and stand all this up in the EU first, then we'll take bits across into their existing ecosystem.
Figured I'd discuss it here, to give others ideas and get some feedback on the bits that aren't working. So far I have the following:
• Networking Stack
◦ Route53 zones for external access
◦ VPCs with public and private subnets, internet gateways, NAT gateways and a bastion host.
◦ Internal Service Discovery Zones (CloudMap)
◦ Generic Log Groups
◦ ECS Cluster (on Fargate) and the associated execution and task IAM roles/policies.
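For reference, the Cloud Map piece of the networking stack is shaped roughly like this. This is a minimal sketch rather than my real module code; the namespace name, VPC reference and service name are placeholders:
```
# Private DNS namespace for internal service discovery (Cloud Map).
resource "aws_service_discovery_private_dns_namespace" "internal" {
  name = "signoz.internal" # hypothetical internal zone name
  vpc  = aws_vpc.main.id
}

# Example service registered in the namespace, resolvable as
# clickhouse-1.signoz.internal from inside the VPC.
resource "aws_service_discovery_service" "clickhouse_1" {
  name = "clickhouse-1"

  dns_config {
    namespace_id   = aws_service_discovery_private_dns_namespace.internal.id
    routing_policy = "MULTIVALUE"

    dns_records {
      type = "A"
      ttl  = 10
    }
  }
}
```
The ECS services below register into this namespace so everything can find everything else by name.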
• Database Stack
◦ Creates 2 x EC2 instances for Clickhouse
▪︎ UserData scripts install the SSM agent, install the clickhouse server and client, template out a cluster.xml, and start clickhouse.
▪︎ The cluster.xml defines 3 x keepers (clickhouse keeper, not zookeeper; more on that below) plus 2 replicas in 1 shard with replication between them, adding macros for the shard and replica IDs (sketched below).
▪︎ TODO: S3 backups - need to think about the AWS secrets and how to use the IAM instance profile to fetch them dynamically somehow....
▪︎ TODO: Mount a third external volume to store the data on and amend the mount points etc.
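For anyone following along, the cluster.xml the UserData templates out is roughly this shape. I'm showing it as a Terraform heredoc purely for illustration; the hostnames, cluster name and macro values here are assumptions, not the exact config:
```
locals {
  # Illustrative cluster.xml: 1 shard, 2 replicas, 3 keepers.
  clickhouse_cluster_xml = <<-XML
    <clickhouse>
      <remote_servers>
        <signoz_cluster>
          <shard>
            <replica><host>clickhouse-1.signoz.internal</host><port>9000</port></replica>
            <replica><host>clickhouse-2.signoz.internal</host><port>9000</port></replica>
          </shard>
        </signoz_cluster>
      </remote_servers>

      <!-- ClickHouse still reads keeper endpoints from the <zookeeper> section -->
      <zookeeper>
        <node><host>keeper-1.signoz.internal</host><port>9181</port></node>
        <node><host>keeper-2.signoz.internal</host><port>9181</port></node>
        <node><host>keeper-3.signoz.internal</host><port>9181</port></node>
      </zookeeper>

      <!-- Per-node macros; the replica value differs per EC2 instance -->
      <macros>
        <shard>01</shard>
        <replica>clickhouse-1</replica>
      </macros>
    </clickhouse>
  XML
}
```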
◦ Creates 3 x Fargate tasks and services running clickhouse keeper
▪︎ Dynamically creates the clickhouse keeper XML files, listing the 3 servers in the raft config and injecting the right server ID (sketched below).
▪︎ Uses the service discovery zones so the keepers can be reached by clickhouse and by each other.
▪︎ TODO: need to investigate whether the volumes (i.e. the log and snapshot directories) need to be persistent or whether they can exist just for the lifetime of the task.
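The keeper config each Fargate task gets is along these lines. Again this is a heredoc sketch rather than the real template; var.server_id stands in for however the module injects the per-task ID, and the hostnames and paths are assumptions. The log and snapshot paths are exactly the directories the persistence TODO above is about:
```
variable "server_id" {
  type = number # injected per keeper task: 1, 2 or 3
}

locals {
  keeper_xml = <<-XML
    <clickhouse>
      <keeper_server>
        <tcp_port>9181</tcp_port>
        <server_id>${var.server_id}</server_id>

        <!-- The paths the persistence question applies to -->
        <log_storage_path>/var/lib/clickhouse/coordination/log</log_storage_path>
        <snapshot_storage_path>/var/lib/clickhouse/coordination/snapshots</snapshot_storage_path>

        <raft_configuration>
          <server><id>1</id><hostname>keeper-1.signoz.internal</hostname><port>9234</port></server>
          <server><id>2</id><hostname>keeper-2.signoz.internal</hostname><port>9234</port></server>
          <server><id>3</id><hostname>keeper-3.signoz.internal</hostname><port>9234</port></server>
        </raft_configuration>
      </keeper_server>
    </clickhouse>
  XML
}
```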
◦ TODO: Whilst replication IS working properly, I need to think about how we "load balance" this. Currently we have clickhouse-1 and clickhouse-2 DNS entries, plus a bare clickhouse record that has both IPs in it (sketched below). Query Service is just targeting clickhouse-1 for now - more on that later.
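One way the combined clickhouse name could be wired up in Cloud Map is a service with both EC2 IPs registered as instances. Note this only gives round-robin DNS, not health-checked load balancing, so it doesn't really answer the TODO; resource names and IP references below are placeholders:
```
# Combined "clickhouse" service holding both replica IPs.
resource "aws_service_discovery_service" "clickhouse_all" {
  name = "clickhouse"

  dns_config {
    namespace_id   = aws_service_discovery_private_dns_namespace.internal.id
    routing_policy = "MULTIVALUE"

    dns_records {
      type = "A"
      ttl  = 10
    }
  }
}

# Register each EC2 ClickHouse node's private IP against the service.
resource "aws_service_discovery_instance" "clickhouse_1" {
  instance_id = "clickhouse-1"
  service_id  = aws_service_discovery_service.clickhouse_all.id

  attributes = {
    AWS_INSTANCE_IPV4 = aws_instance.clickhouse_1.private_ip
  }
}

resource "aws_service_discovery_instance" "clickhouse_2" {
  instance_id = "clickhouse-2"
  service_id  = aws_service_discovery_service.clickhouse_all.id

  attributes = {
    AWS_INSTANCE_IPV4 = aws_instance.clickhouse_2.private_ip
  }
}
```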
• Signoz Stack
◦ Creates an Alert Manager Service
▪︎ Single service currently; TODO: investigate fault tolerance if needed.
▪︎ Points queryService.url at query-service.fqdn:8085 via the command line.
▪︎ Has a persistent /data mount from EFS (referenced via storage.path) for any data that needs storing (see the task definition sketch after this block).
▪︎ Registers itself in service discovery for use later.
▪︎ TODO: Add health checks
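To make the alertmanager wiring concrete, the task definition is roughly shaped like this. It's a sketch only: the image tag, CPU/memory sizing and resource names are placeholders, while the queryService.url flag and the /data mount are as described above:
```
resource "aws_ecs_task_definition" "alertmanager" {
  family                   = "signoz-alertmanager"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 256
  memory                   = 512

  container_definitions = jsonencode([{
    name  = "alertmanager"
    image = "signoz/alertmanager:0.23.5" # placeholder tag
    command = [
      "--queryService.url=http://query-service.signoz.internal:8085",
      "--storage.path=/data",
    ]
    mountPoints = [{
      sourceVolume  = "alertmanager-data"
      containerPath = "/data"
    }]
  }])

  # Persistent /data backed by EFS so alertmanager state survives restarts.
  volume {
    name = "alertmanager-data"
    efs_volume_configuration {
      file_system_id = aws_efs_file_system.alertmanager.id
    }
  }
}
```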
◦ Creates a Query Service
▪︎ Single service currently. Q: from what I can see this has to be a single service due to the local DB.
▪︎ Dynamic prometheus.yml pointing at the alertmanager and clickhouse endpoints (currently clickhouse-1; sketched below).
▪︎ Has a persistent disk via EFS, for the /shared/dashboards and prometheus config
▪︎ ClickhouseUrl is currently just clickhouse-1.
▪︎ TODO: Investigate how we should be talking to a replicated clickhouse, i.e. do I need an LB with only one node active at a time, or not.
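The generated prometheus.yml handed to query-service currently looks something like this (heredoc for illustration; the intervals and hostnames are placeholders, and the clickhouse-1 URL is the bit the replicated-clickhouse TODO above is about):
```
locals {
  prometheus_yml = <<-YAML
    global:
      scrape_interval: 60s

    alerting:
      alertmanagers:
        - static_configs:
            - targets:
                - alertmanager.signoz.internal:9093

    # Metrics are read back out of ClickHouse; currently pinned to clickhouse-1.
    remote_read:
      - url: tcp://clickhouse-1.signoz.internal:9000/?database=signoz_metrics
  YAML
}
```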
◦ Creates a Frontend Service
▪︎ Single service currently - presumably I can scale this to multiple without issue.
▪︎ Dynamic NGINX config pointing at the right query-service endpoints.
▪︎ TODO: Investigate the proxy_pass behaviour further: if query-service changes IP it breaks, because the DNS changes and NGINX keeps the address it resolved at startup; my resolver configs don't seem to work and I don't really want yet more LBs (possible fix sketched below).
▪︎ Attaches itself to an LB (see below)
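On the proxy_pass / DNS point: the pattern that usually works on Fargate is to give NGINX the VPC resolver explicitly and put the upstream in a variable, which forces re-resolution per the valid= TTL instead of caching the IP for the worker's lifetime. A hedged sketch, not my real config; the ports, resolver IP and Cloud Map name are assumptions (the VPC resolver is also reachable at the VPC CIDR base +2):
```
locals {
  nginx_conf = <<-CONF
    server {
      listen 3301;

      # Amazon-provided DNS; makes NGINX honour the TTL instead of
      # caching the query-service IP at startup.
      resolver 169.254.169.253 valid=10s;

      # Using a variable defers resolution to request time.
      set $query_service http://query-service.signoz.internal:8080;

      location /api {
        proxy_pass $query_service;
      }

      location / {
        root  /usr/share/nginx/html;
        index index.html;
      }
    }
  CONF
}
```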
◦ LB / SSL Stack
▪︎ Creates ACM certs for the domain and subdomains as needed (as SANs), plus the DNS validation records.
▪︎ Creates the real DNS entries as needed and routes them to an LB.
▪︎ Creates an external-facing LB
• listens on 80 (redirect to 443)
• listens on 443
◦ uses rules to map to the right target group
◦ if www.fqdn -> ui target group (from the frontend service)
◦ default action = a fixed 404 response (listener sketched below)
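The 443 listener rules boil down to something like this (a sketch; the ARNs, priority and hostname are placeholders):
```
resource "aws_lb_listener" "https" {
  load_balancer_arn = aws_lb.external.arn
  port              = 443
  protocol          = "HTTPS"
  certificate_arn   = aws_acm_certificate.main.arn

  # Anything that doesn't match a rule gets a 404.
  default_action {
    type = "fixed-response"

    fixed_response {
      content_type = "text/plain"
      message_body = "not found"
      status_code  = "404"
    }
  }
}

# www.fqdn -> UI target group from the frontend service.
resource "aws_lb_listener_rule" "ui" {
  listener_arn = aws_lb_listener.https.arn
  priority     = 10

  condition {
    host_header {
      values = ["www.example.com"] # placeholder hostname
    }
  }

  action {
    type             = "forward"
    target_group_arn = aws_lb_target_group.ui.arn
  }
}
```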
◦ Migration Service Stack
▪︎ Creates a task definition which we can run via local shell scripts for upgrades:
• Runs an otel migration container that currently points at clickhouse-1 but has replication set to true.
• We run this as a bootstrap task AFTER creating the DBs, and it appears to set up the databases in replicated format OK (rough task definition sketched below).
• Shell script runs the task, tails the logs and checks for errors.
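For completeness, the migration task definition is roughly as below. Treat the image tag and the exact flag names as assumptions from memory that may differ between SigNoz versions; the DSN and replication flag just mirror the description above:
```
resource "aws_ecs_task_definition" "schema_migrator" {
  family                   = "signoz-schema-migrator"
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = 256
  memory                   = 512

  container_definitions = jsonencode([{
    name  = "schema-migrator"
    image = "signoz/signoz-schema-migrator:0.88.11" # placeholder tag
    command = [
      # Assumed flags: point at clickhouse-1 with replication enabled,
      # matching the behaviour described above.
      "--dsn=tcp://clickhouse-1.signoz.internal:9000",
      "--replication=true",
    ]
  }])
}
```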
• Collector Stack
◦ Creates an otel collector service (a single service currently, but scaling it out shouldn't be an issue)
▪︎ Dynamically creates the otel-collector config and the opamp config
▪︎ Is configured to listen for otel firehose, otel gRPC and otel HTTP, plus health, zpages and pprof (a trimmed config sketch follows at the end of this section).
▪︎ Q: DOCKER_MULTI_NODE_CLUSTER - this seems confusing; some docs say it needs to be set to true, but only ONCE. How does that differ from the otel-collector-migration stack, if that's the case?
◦ Adds a firehose target to the LB and sets it up via a new firehose_rule.
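The collector config the stack templates out is roughly this shape. It's a trimmed heredoc sketch with a traces pipeline only; the receiver ports, firehose record_type and DSN are placeholders, and the real config also carries metrics/logs pipelines:
```
locals {
  otel_collector_yaml = <<-YAML
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      awsfirehose:
        endpoint: 0.0.0.0:8443
        record_type: cwmetrics

    extensions:
      health_check: {}
      zpages: {}
      pprof: {}

    exporters:
      clickhousetraces:
        datasource: tcp://clickhouse-1.signoz.internal:9000/?database=signoz_traces

    service:
      extensions: [health_check, zpages, pprof]
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [clickhousetraces]
  YAML
}
```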
Most of it seems to be sort of working so far. I can log in to the UI with no 500 errors or anything any more, though obviously I don't have any "data" yet. The OTEL collectors are the bit I'm currently working on, but I figured I'd share progress to date to see if there's anything obvious I've missed so far.