query spark delta tables with aws athena
streaming data from 8 subreddits using the reddit api, kafka, pyspark and delta tables on s3 data lake transformed by glue jobs & queried w/ athena.
stevenhurwitt/reddit-streaming: streaming eight subreddits from reddit api using kafka producer & spark structured streaming. (github.com)
spark is a popular framework for all sorts of data engineering work. APIs are available for python, scala, java, r, etc but it was written in scala so is the fastest there (presumably. this is said not to be true in 2023 w/ pyspark updates).
data lakes have risen in popularity over the past few years as cloud storage has exploded as a cheap & reliable globally available web storage service.
instead of storing everything in a complicated data model in a database or data warehouse we can store tables in a familiar folder structure partitioned by certain columns.
object storage exists on all clouds and is used as a data lake for spark parquet/delta tables (specifically i am using delta tables stored on s3 on aws).
parquet/delta tables are file formats specifically optimized for larger columnar structured datasets. These can be partitioned and stored into different nodes of a cluster, which is how spark works in the backend (specifically im using delta which is like parquet but includes compression, ACID transactions and time travel). these can be queried with sql engines.
cluster orchestration can be done with managed services like databricks or on kubernetes, glue, lambda, docker-compose, etc. we use docker-compose here.
producing raw data
reddit-streaming/reddit_producer.py at last-stable-ubuntu · stevenhurwitt/reddit-streaming · GitHub
reads data from reddit api and sends as kakfa messages through a producer.
to-do: could microbatch to optimize delta tables
streaming kafka w/ pyspark
reddit-streaming/reddit_streaming.py at last-stable-ubuntu · stevenhurwitt/reddit-streaming · GitHub
reads kafka stream and writes to delta table in s3 bucket ‘s3://reddit-streaming-stevenhurwitt-2{subreddit}’
transforming data w/ glue
reddit-streaming/aws-curation.py at last-stable-ubuntu · stevenhurwitt/reddit-streaming · GitHub
curates raw data to s3://reddit-streaming-stevenhurwitt-2{subreddit}_clean.
Glue crawlers generate symlink manifest to specify latest version of table (avoids duplicates in sql).
query tables w/ athena
create tables in Athena w/ queries from delta files on s3. need required iam roles. saved to s3 bucket: s3://xyz…
example query & result:
select * from reddit.aws_clean order by date desc limit 100;