query spark delta tables with aws athena

stevenhurwitt/reddit-streaming: streaming eight subreddits from reddit api using kafka producer & spark structured streaming. (github.com )

spark is a popular framework for all sorts of data engineering work. APIs are available for python, scala, java, r, etc but it was written in scala so is the fastest there (presumably. this is said not to be true in 2023 w/ pyspark updates).

data lakes have risen in popularity over the past few years as cloud storage has exploded as a cheap & reliable globally available web storage service.

instead of storing everything in a complicated data model in a database or data warehouse we can store tables in a familiar folder structure partitioned by certain columns.

object storage exists on all clouds and is used as a data lake for spark parquet/delta tables (specifically i am using delta tables stored on s3 on aws).

parquet/delta tables are file formats specifically optimized for larger columnar structured datasets. These can be partitioned and stored into different nodes of a cluster, which is how spark works in the backend (specifically im using delta which is like parquet but includes compression, ACID transactions and time travel). these can be queried with sql engines.

cluster orchestration can be done with managed services like databricks or on kubernetes, glue, lambda, docker-compose, etc. we use docker-compose here.