I am new to Astronomer/Airflow, and I am trying to execute a PySpark ETL script using Airflow.
Scenario:
Read a JSON file from an S3 location using a defined schema, add a date column at the end for partitioning, and write the result as Parquet files to a different S3 location.
Executing code:
dataframe = sqlContext.read.schema(schema_string).json(s3_file_prefixes)  -- got success
dataframe = adding_date_column_function(dataframe, 'system_timestamp')  -- got success
dataframe.write.partitionBy('date_column').format("parquet").mode("append").save("s3://bucket_name/key_prefix")  -- step failed
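For context, a simplified, self-contained version of the script is below. schema_string, the S3 paths, and adding_date_column_function are placeholders for my actual code; the helper just derives the partition date from the timestamp column:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("s3_json_to_parquet").getOrCreate()

# placeholder for my actual helper: derives a date column from the timestamp column
def adding_date_column_function(df, ts_column):
    return df.withColumn("date_column", F.to_date(F.col(ts_column)))

schema_string = "id STRING, system_timestamp TIMESTAMP"   # placeholder schema (DDL string)
s3_file_prefixes = "s3://source_bucket/key_prefix/"        # placeholder input path

dataframe = spark.read.schema(schema_string).json(s3_file_prefixes)
dataframe = adding_date_column_function(dataframe, "system_timestamp")
dataframe.write.partitionBy("date_column").format("parquet").mode("append").save("s3://bucket_name/key_prefix")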
Error:
py4j.protocol.Py4JJavaError: An error occurred while calling o58.save.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
requirements.txt:
low-code-dags==0.8.1
PyMySQL==1.0.2
dbt==0.20.2
apache-airflow-providers-qubole==1.0.2
apache-airflow-providers-snowflake==1.1.1
snowflake-connector-python==2.4.1
snowflake-sqlalchemy==1.2.4
SQLAlchemy==1.3.23
apache-airflow-providers-apache-spark
apache-airflow-providers-jdbc
boto
apache-airflow-providers-amazon
apache-airflow[s3,jdbc,hdfs]
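For completeness, the DAG task that launches the script looks roughly like this (the dag id, application path, and connection id are simplified stand-ins for my setup):

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="s3_json_to_parquet",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    run_etl = SparkSubmitOperator(
        task_id="run_etl",
        application="/usr/local/airflow/include/etl_script.py",  # stand-in path to the PySpark script above
        conn_id="spark_default",
    )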
I got stuck at the step where the data frame is written out as Parquet. I have tried multiple options but could not get past it. Can you please help me figure out where exactly I went wrong, or whether any JARs have to be added? Please let me know if you need additional information on this.
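One direction I was wondering about (not sure if it is the correct fix) is switching the output path to the s3a:// scheme and pulling in the hadoop-aws package when the SparkSession is built, along the lines of the sketch below. The hadoop-aws version here is a guess on my side and would need to match the Hadoop version bundled with Spark:

spark = (
    SparkSession.builder
    .appName("s3_json_to_parquet")
    # assumption: the hadoop-aws version must match the Hadoop build shipped with Spark
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.4")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

dataframe.write.partitionBy("date_column").format("parquet").mode("append").save("s3a://bucket_name/key_prefix")

Would something like this be the right approach, or does the JAR have to be installed differently in the Astronomer image?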
Thanks in advance.