Read and group JSON files by date element using PySpark
I have multiple JSON files (roughly 10 TB in total) in an S3 bucket, and I need to organize these files by a date element that is present in every JSON document.
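To make the goal concrete, here is a minimal sketch of the kind of job I have in mind, not a tested solution at this scale. It assumes each file holds one JSON document per line and that the date element is an ISO-8601 string. The field name `date`, the derived column `event_date`, and the function names are placeholders I made up for illustration.

```python
from datetime import date


def partition_dir(raw_date: str) -> str:
    """Directory name that a partitioned write would use for one document.

    Assumes raw_date starts with an ISO-8601 day, e.g. "2019-03-07T12:30:00Z".
    """
    day = date.fromisoformat(raw_date[:10])
    return f"event_date={day.isoformat()}"


def organize_by_date(src: str, dst: str, date_field: str = "date") -> None:
    """Read JSON documents from src and write them to dst, one folder per day."""
    # Imported here so partition_dir can be used without a Spark installation.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("group-json-by-date").getOrCreate()

    # spark.read.json expects one JSON document per line by default.
    docs = spark.read.json(src)

    # Derive a day-level column from the (assumed) ISO-8601 date field.
    dated = docs.withColumn("event_date", F.to_date(F.col(date_field)))

    # partitionBy writes one sub-directory per distinct event_date value.
    dated.write.partitionBy("event_date").mode("overwrite").json(dst)
```

Calling `organize_by_date` with an input and an output S3 path (both hypothetical here) should leave one `event_date=YYYY-MM-DD` folder per day under the output path. `partition_dir` only shows which folder a given document would end up in.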
I am using docker-compose to set up a scalable Airflow cluster. I based my approach on this Dockerfile: https://hub.docker.com/r/puckel/docker-airflow/
I would like to create a conditional task in Airflow as described in the schema below. The expected scenario is the following: