I am new to Astronomer/Airflow, and I am trying to execute a PySpark ETL script using Airflow.
Scenario:
Read a JSON file from an S3 location using a defined schema, add a date column at the end for partitioning, and write the result as Parquet files to a different S3 location.
Executing code:
dataframe = sqlContext.read.schema(schema_string).json(s3_file_prefixes)  -- got success
dataframe = adding_date_column_function(dataframe, 'system_timestamp')  -- got success
dataframe.write.partitionBy('date_column').format("parquet").mode("append").save("s3://bucket_name/key_prefix")  -- step failed
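For context, a simplified, self-contained version of the script is below. schema_string, the S3 paths, and adding_date_column_function are placeholders for my actual code; the helper just derives the partition date from the timestamp column:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("s3_json_to_parquet").getOrCreate()

# placeholder for my actual helper: derives a date column from the timestamp column
def adding_date_column_function(df, ts_column):
    return df.withColumn("date_column", F.to_date(F.col(ts_column)))

schema_string = "id STRING, system_timestamp TIMESTAMP"   # placeholder schema (DDL string)
s3_file_prefixes = "s3://source_bucket/key_prefix/"        # placeholder input path

dataframe = spark.read.schema(schema_string).json(s3_file_prefixes)
dataframe = adding_date_column_function(dataframe, "system_timestamp")
dataframe.write.partitionBy("date_column").format("parquet").mode("append").save("s3://bucket_name/key_prefix")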
Error:
py4j.protocol.Py4JJavaError: An error occurred while calling o58.save.
: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
requirements.txt:
low-code-dags==0.8.1
PyMySQL==1.0.2
dbt==0.20.2
apache-airflow-providers-qubole==1.0.2
apache-airflow-providers-snowflake==1.1.1
snowflake-connector-python==2.4.1
snowflake-sqlalchemy==1.2.4
SQLAlchemy==1.3.23
apache-airflow-providers-apache-spark
apache-airflow-providers-jdbc
boto
apache-airflow-providers-amazon
apache-airflow[s3,jdbc,hdfs]
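For completeness, the DAG task that launches the script looks roughly like this (the dag id, application path, and connection id are simplified stand-ins for my setup):

from datetime import datetime

from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

with DAG(
    dag_id="s3_json_to_parquet",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    run_etl = SparkSubmitOperator(
        task_id="run_etl",
        application="/usr/local/airflow/include/etl_script.py",  # stand-in path to the PySpark script above
        conn_id="spark_default",
    )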
I got stuck at the step where the data frame is written out as Parquet. I have tried multiple options but could not get past it. Can you please help me figure out where exactly I went wrong, or whether any JARs have to be added? Please let me know if you need additional information on this.
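One direction I was wondering about (not sure if it is the correct fix) is switching the output path to the s3a:// scheme and pulling in the hadoop-aws package when the SparkSession is built, along the lines of the sketch below. The hadoop-aws version here is a guess on my side and would need to match the Hadoop version bundled with Spark:

spark = (
    SparkSession.builder
    .appName("s3_json_to_parquet")
    # assumption: the hadoop-aws version must match the Hadoop build shipped with Spark
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.4")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

dataframe.write.partitionBy("date_column").format("parquet").mode("append").save("s3a://bucket_name/key_prefix")

Would something like this be the right approach, or does the JAR have to be installed differently in the Astronomer image?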
Thanks in advance.