Let’s take a look at what is deployed on AWS upon running the quick start example script.
Note: you can easily edit the script or run each of the Terraform modules independently, giving you the flexibility to design the topology of your pipeline according to your needs.
Collector load balancer
This is an application load balancer (ALB) for your inbound HTTP/S traffic. Traffic is routed from the load balancer to the collector.
For further details on the resources, default and required input variables, and outputs, see the terraform-aws-alb module GitHub repository.
Find out more about the Collector terraform module, and explore the full set of variables here: https://registry.terraform.io/modules/snowplow-devops/collector-kinesis-ec2/aws/latest
Enrich is a Snowplow app written in Scala which:
- Reads raw Snowplow events off a Kinesis stream populated by the Scala Stream Collector
- Validates each raw event
- Enriches each event (e.g. infers the location of the user from their IP address)
- Writes the enriched Snowplow event to another stream
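The four steps above can be sketched in a few lines of Python. This is a simplified illustration only, not how Enrich is implemented: the `Streams` class, the `validate` rule, and the `geo_by_ip` lookup are all hypothetical stand-ins (real validation is done against Iglu schemas, and real streams are Kinesis streams rather than in-memory lists).

```python
from dataclasses import dataclass, field


@dataclass
class Streams:
    """In-memory stand-ins for the enriched and bad streams (hypothetical)."""
    enriched: list = field(default_factory=list)
    bad: list = field(default_factory=list)


def validate(raw_event: dict) -> bool:
    # Hypothetical rule: require the fields this sketch relies on.
    return "event_id" in raw_event and "user_ipaddress" in raw_event


def enrich(raw_event: dict) -> dict:
    # Hypothetical enrichment: look the IP address up in a tiny static table.
    geo_by_ip = {"203.0.113.7": "GB"}
    enriched = dict(raw_event)
    enriched["geo_country"] = geo_by_ip.get(raw_event["user_ipaddress"])
    return enriched


def process(raw_events: list, streams: Streams) -> None:
    """Read raw events, validate and enrich them, and write to the output streams."""
    for raw_event in raw_events:
        if validate(raw_event):
            streams.enriched.append(enrich(raw_event))
        else:
            streams.bad.append(raw_event)


streams = Streams()
process(
    [
        {"event_id": "e1", "user_ipaddress": "203.0.113.7"},
        {"event_id": "e2"},  # no IP address, so it fails validation
    ],
    streams,
)
```

After this runs, `streams.enriched` holds the first event with a `geo_country` field added, and `streams.bad` holds the second event.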
Find out more about the Enrich module and explore the full set of variables available here: https://registry.terraform.io/modules/snowplow-devops/enrich-kinesis-ec2/aws/latest
Your Kinesis streams are a key component in ensuring a non-lossy pipeline: they provide a crucial back-up and also serve as a mechanism to drive real-time use cases from the enriched stream.
Find out more about the Kinesis stream module and explore the full set of variables available here: https://registry.terraform.io/modules/snowplow-devops/kinesis-stream/aws/latest
Collector payloads are written to this raw Kinesis stream before being picked up by the Enrich application. The S3 loader (raw) also reads from this raw stream and writes to the raw S3 folder.
Events that have been validated and enriched by the Enrich application are written to this enriched stream. The S3 loader (enriched) reads from this enriched stream and writes to the enriched folder on S3.
Bad 1 stream
This bad stream is for events that the collector, enrich or S3 loader (raw and enriched) applications fail to process. For instance, an event can fail at the collector if it is too large for the stream, which creates a size violation bad row, or it can fail during enrichment due to a schema violation or an enrichment failure. More details can be found here.
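As a rough illustration of the size violation case, the sketch below routes a payload either to the raw stream or to the bad 1 stream depending on its serialized size. The `route_payload` function, the `MAX_RECORD_BYTES` value, and the shape of the failure record are assumptions made for this example; they are not the collector's actual limit or bad row format.

```python
import json

# Hypothetical limit for this sketch; the real limit depends on your stream settings.
MAX_RECORD_BYTES = 1_000_000


def route_payload(payload: dict, max_bytes: int = MAX_RECORD_BYTES):
    """Return (stream_name, record) for a collector payload."""
    size = len(json.dumps(payload).encode("utf-8"))
    if size > max_bytes:
        # Too large for the stream: emit a small failure record instead.
        failure = {
            "failure": "size_violation",
            "actual_bytes": size,
            "max_bytes": max_bytes,
        }
        return "bad-1", failure
    return "raw", payload


small_stream, small_record = route_payload({"e": "pv"}, max_bytes=50)
large_stream, large_record = route_payload({"e": "x" * 100}, max_bytes=50)
```

With the 50-byte limit used here, the small payload is routed to `raw` unchanged, while the large one is routed to `bad-1` as a failure record.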
Bad 2 stream
This bad stream is for failed events generated by the S3 loader as it tries to write from the bad 1 stream to the bad folder on S3.
Iglu allows you to publish, test and serve schemas via an easy-to-use RESTful interface. It is split into a few services.
Iglu load balancer
This load balances the inbound traffic and routes traffic to the Iglu Server.
Find out more about the application load balancer module and explore the full set of variables available here: https://registry.terraform.io/modules/snowplow-devops/alb/aws/latest
The Iglu Server serves requests for Iglu schemas stored in your schema registry.
Find out more about the Iglu Server module and explore the full set of variables available here: https://registry.terraform.io/modules/snowplow-devops/iglu-server-ec2/aws/latest
This is the Iglu Server database where the Iglu schemas themselves are stored.
Find out more about the RDS module and explore the full set of variables available here: https://registry.terraform.io/modules/snowplow-devops/rds/aws/latest
Find out more about the S3 loader module and explore the full set of variables available here: https://registry.terraform.io/modules/snowplow-devops/s3-loader-kinesis-ec2/aws/latest
S3 loader raw
Responsible for reading from the raw stream (i.e. events from the collector that have not yet been validated or enriched) and writing to the raw folder on S3. Any events that have failed to be processed by the raw S3 loader get written to your bad-1 stream.
S3 loader bad
Responsible for reading from the bad-1 stream and writing to the bad folder on S3. Any events that fail to be processed by the bad S3 loader get written to the bad-2 stream.
S3 loader enriched
Responsible for reading from the enriched stream and writing to your enriched folder on S3. Any events that fail to be processed by the enriched S3 loader get written to the bad-1 stream.
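The way the streams and S3 loaders fit together can be summarised as a small routing table. The sketch below is only a model of the descriptions on this page: the stream and folder labels are illustrative, not real resource names, and the Postgres loader is left out because its failure destination is not described here.

```python
# Each application: (reads from, writes to on success, writes to on failure).
# All names are illustrative labels taken from the descriptions above.
TOPOLOGY = {
    "enrich": ("raw", "enriched", "bad-1"),
    "s3-loader-raw": ("raw", "s3://raw", "bad-1"),
    "s3-loader-enriched": ("enriched", "s3://enriched", "bad-1"),
    "s3-loader-bad": ("bad-1", "s3://bad", "bad-2"),
}


def consumers_of(stream: str) -> list:
    """List the applications that read from the given stream."""
    return sorted(app for app, (source, _, _) in TOPOLOGY.items() if source == stream)


def failure_stream(app: str) -> str:
    """Return the stream that receives events the application fails to process."""
    return TOPOLOGY[app][2]


raw_readers = consumers_of("raw")
bad_1_readers = consumers_of("bad-1")
bad_loader_failures = failure_stream("s3-loader-bad")
```

Here `raw_readers` contains both `enrich` and `s3-loader-raw`, which reflects that the raw stream has two independent consumers, and `bad_loader_failures` is `bad-2`.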
S3 loader bucket
The S3 bucket to which the S3 loaders write your raw, enriched and bad data.
Find out more about the S3 bucket module and explore the full set of variables available here: https://registry.terraform.io/modules/snowplow-devops/s3-bucket/aws/latest
The Postgres loader is the Snowplow application responsible for reading the enriched and bad data and loading it into Postgres.
Find out more about the Postgres loader module and explore the full set of variables available here: https://registry.terraform.io/modules/snowplow-devops/postgres-loader-kinesis-ec2/aws/latest
On the first run of each of the applications (Enrich, S3 loaders, Postgres loaders), the Kinesis Client Library (KCL) creates a DynamoDB table to keep track of what the application has consumed from the stream so far. Each Kinesis consumer maintains its own checkpoint information.
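To make the idea of per-consumer checkpoints concrete, here is a minimal sketch in which two applications read the same records at their own pace. `CheckpointTable` and `consume` are hypothetical names for this example; the real library tracks leases and sequence numbers in DynamoDB rather than list offsets in memory.

```python
class CheckpointTable:
    """Stand-in for one application's checkpoint table (hypothetical)."""

    def __init__(self):
        self._positions = {}

    def get(self, shard_id: str) -> int:
        # A shard that has never been read starts at position 0.
        return self._positions.get(shard_id, 0)

    def put(self, shard_id: str, position: int) -> None:
        self._positions[shard_id] = position


def consume(table: CheckpointTable, shard_id: str, records: list, batch_size: int) -> list:
    """Read the next batch after the stored checkpoint, then advance the checkpoint."""
    start = table.get(shard_id)
    batch = records[start:start + batch_size]
    table.put(shard_id, start + len(batch))
    return batch


records = ["a", "b", "c", "d"]
enrich_table = CheckpointTable()  # one table per application
loader_table = CheckpointTable()

first_enrich = consume(enrich_table, "shard-0", records, batch_size=3)
first_loader = consume(loader_table, "shard-0", records, batch_size=1)
second_enrich = consume(enrich_table, "shard-0", records, batch_size=3)
```

Because each application has its own table, the loader starts from the first record even though Enrich has already read past it.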
The DynamoDB autoscaling module enables autoscaling for a target DynamoDB table. Note that there is a kcl_write_max_capacity variable which can be set to your expected RPS, but setting it high will incur more cost.
You can find further details here: https://registry.terraform.io/modules/snowplow-devops/dynamodb-autoscaling/aws/latest
Have more questions? Take a look at our Quick Start FAQs or reach out to us on Discourse!