query spark delta tables with aws athena

query spark delta tables with aws athena

streaming data from 8 subreddits using the reddit api, kafka, pyspark and delta tables on s3 data lake transformed by glue jobs & queried w/ athena.

stevenhurwitt/reddit-streaming: streaming eight subreddits from reddit api using kafka producer & spark structured streaming. (github.com)

spark is a popular framework for all sorts of data engineering work. APIs are available for python, scala, java, r, etc but it was written in scala so is the fastest there (presumably. this is said not to be true in 2023 w/ pyspark updates).

data lakes have risen in popularity over the past few years as cloud storage has exploded as a cheap & reliable globally available web storage service.

instead of storing everything in a complicated data model in a database or data warehouse we can store tables in a familiar folder structure partitioned by certain columns.

object storage exists on all clouds and is used as a data lake for spark parquet/delta tables (specifically i am using delta tables stored on s3 on aws).

parquet/delta tables are file formats specifically optimized for larger columnar structured datasets. These can be partitioned and stored into different nodes of a cluster, which is how spark works in the backend (specifically im using delta which is like parquet but includes compression, ACID transactions and time travel). these can be queried with sql engines.

cluster orchestration can be done with managed services like databricks or on kubernetes, glue, lambda, docker-compose, etc. we use docker-compose here.

producing raw data

reddit-streaming/reddit_producer.py at last-stable-ubuntu · stevenhurwitt/reddit-streaming · GitHub

reads data from reddit api and sends as kakfa messages through a producer.

to-do: could microbatch to optimize delta tables

https://imgur.com/a/dhTU0zx

streaming kafka w/ pyspark

reddit-streaming/reddit_streaming.py at last-stable-ubuntu · stevenhurwitt/reddit-streaming · GitHub

reads kafka stream and writes to delta table in s3 bucket ‘s3://reddit-streaming-stevenhurwitt-2{subreddit}’

https://imgur.com/a/m4rnAzd

transforming data w/ glue

reddit-streaming/aws-curation.py at last-stable-ubuntu · stevenhurwitt/reddit-streaming · GitHub

curates raw data to s3://reddit-streaming-stevenhurwitt-2{subreddit}_clean.

Glue crawlers generate symlink manifest to specify latest version of table (avoids duplicates in sql).

https://imgur.com/a/fyDLUI3

query tables w/ athena

create tables in Athena w/ queries from delta files on s3. need required iam roles. saved to s3 bucket: s3://xyz

example query & result:

select * from reddit.aws_clean order by date desc limit 100;

https://imgur.com/a/mI7Am7V