parquet2csv

Synopsis

starlake parquet2csv [options]

Description

Converts Parquet files to CSV. The input folder hierarchy must follow the form /input_folder/domain/schema/part*.parquet. Once converted, the CSV output is written to /output_folder/domain/schema.csv. When the specified number of output partitions is 1, /output_folder/domain/schema.csv is a single file containing the data; otherwise it is a folder containing the part*.csv files. When output_folder is not specified, the input_folder is used as the base output folder.
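For example, with a domain named sales and a schema named orders (the names used below), the expected input layout is:

/input_folder/sales/orders/part-00000.parquet
/input_folder/sales/orders/part-00001.parquet
...

and the output is written to /output_folder/sales/orders.csv, either as a single file (when --partitions is 1) or as a folder of part*.csv files.

The following command converts the orders schema of the sales domain into a single CSV file with a header row: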

starlake parquet2csv
--input_dir /tmp/datasets/accepted/
--output_dir /tmp/datasets/csv/
--domain sales
--schema orders
--option header=true
--option separator=,
--partitions 1
--write_mode overwrite
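Under the hood, the conversion amounts to a Spark read of the Parquet files followed by a CSV write. The Scala sketch below illustrates the equivalent Spark calls for the example above; it is only an illustration under that assumption, not Starlake's actual implementation, and the paths and options simply mirror the command-line example:

import org.apache.spark.sql.SparkSession

object Parquet2CsvSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet2csv-sketch")
      .master("local[*]")
      .getOrCreate()

    // Read every part*.parquet file under input_dir/domain/schema
    val orders = spark.read.parquet("/tmp/datasets/accepted/sales/orders")

    // Coalesce to the requested number of partitions (1 => a single part file)
    orders
      .coalesce(1)
      .write
      .mode("overwrite")          // --write_mode overwrite
      .option("header", "true")   // --option header=true
      .option("sep", ",")         // --option separator=,
      .csv("/tmp/datasets/csv/sales/orders.csv")

    spark.stop()
  }
}

Note that Spark itself always writes a directory of part files; producing a flat orders.csv file, as described above for a single partition, implies an additional step that moves the single part file to the target name.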

Parameters

| Parameter | Cardinality | Description |
| --- | --- | --- |
| --input_dir:&lt;value&gt; | Required | Full path to the input directory |
| --output_dir:&lt;value&gt; | Optional | Full path to the output directory; if not specified, input_dir is used as the output directory |
| --domain:&lt;value&gt; | Optional | Domain name to convert. All schemas in this domain are converted. If not specified, all schemas of all domains are converted |
| --schema:&lt;value&gt; | Optional | Schema name to convert. If not specified, all schemas are converted |
| --delete_source:&lt;value&gt; | Optional | Whether the source Parquet files should be deleted after conversion |
| --write_mode:&lt;value&gt; | Optional | One of OVERWRITE or APPEND |
| --options:k1=v1,k2=v2... | Optional | Any Spark CSV option to use (sep, delimiter, quote, quoteAll, escape, header, ...) |
| --partitions:&lt;value&gt; | Optional | Number of output partitions |