PyArrow Parsers Module Description

This module houses parser classes that are responsible for parsing data on the workers for the PyArrow storage format. Parsers for the PyArrow storage format follow the interface of the pandas format parsers: the parser class for each file format implements a parse method, which parses the specified part of the file and builds PyArrow tables from the parsed data, based on the specified chunk size and number of splits. The resulting PyArrow tables are used as the partition payloads in the PyarrowOnRayDataframe.
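The flow described above (read a byte range, prepend the header, parse, split) can be illustrated with a minimal standard-library sketch. It is not Modin's implementation: `parse_range` is a hypothetical helper, the `csv` module stands in for PyArrow's CSV reader, plain dicts of column values stand in for PyArrow tables, and splitting column-wise into roughly equal chunks is an assumption. The dtype metadata element is omitted for brevity.

```python
import csv
import io
import math
import os
import tempfile


def parse_range(fname, num_splits, start, end, header):
    """Hypothetical stand-in for a worker-side parse step (not Modin's code)."""
    # Read only the bytes in [start, end) that this worker is responsible for.
    with open(fname, "rb") as f:
        f.seek(start)
        payload = f.read(end - start)

    # Prepend the header so the fragment parses like a complete CSV file.
    text = header + payload.decode("utf-8")
    rows = list(csv.reader(io.StringIO(text, newline="")))
    names, data = rows[0], rows[1:]

    # Assumption: the parsed data is split column-wise into num_splits chunks.
    chunksize = max(1, math.ceil(len(names) / num_splits))
    chunks = []
    for i in range(num_splits):
        lo = i * chunksize
        hi = min(lo + chunksize, len(names))
        chunks.append({names[j]: [row[j] for row in data] for j in range(lo, hi)})

    # Chunks first, then the row count (the dtype element is left out here).
    return chunks + [len(data)]


# Usage: parse only the first two data rows of a three-row file.
header = "a,b,c\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"a,b,c\n1,2,3\n4,5,6\n7,8,9\n")
    path = tmp.name

# Each line is 6 bytes, so bytes [6, 18) cover data rows one and two.
result = parse_range(path, num_splits=2, start=6, end=18, header=header)
os.remove(path)
```

Because each worker receives only a byte range, the header has to be supplied separately; otherwise a fragment taken from the middle of the file would have no column names.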

Public API

This module houses Modin parser classes that are used for data parsing on the workers.

class modin.core.storage_formats.pyarrow.parsers.PyarrowCSVParser

Class for handling CSV files on the workers using the PyArrow storage format.

parse(fname, num_splits, start, end, header, **kwargs)

Parse a CSV file into PyArrow tables.

Parameters

fname (str) – Name of the CSV file to parse.
num_splits (int) – Number of partitions to split the resulting PyArrow table into.
start (int) – Position in the specified file to start parsing from.
end (int) – Position in the specified file to end parsing at.
header (str) – Header line that will be interpreted as the first line of the parsed CSV file.
**kwargs (kwargs) – Accepted for compatibility only. Does not affect the result.

Returns

List with the split parse results and their metadata:

The first num_splits elements are PyArrow tables, each representing the corresponding chunk.
The next element is the number of rows in the parsed table.
The last element is a pandas Series containing the data types of each column of the parsed table.

Return type

list
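Since the return value is a flat list, a caller has to slice it by position. The following is a minimal sketch of that unpacking, assuming the layout described above. `unpack_parse_result` is a hypothetical helper, not part of Modin, and the strings and dict in the usage lines are stand-ins for real PyArrow tables and a pandas Series.

```python
def unpack_parse_result(result, num_splits):
    """Split the flat list into (tables, num_rows, dtypes).

    Hypothetical helper; assumes the layout documented above:
    num_splits tables, then the row count, then the dtypes.
    """
    if len(result) != num_splits + 2:
        raise ValueError(
            f"expected {num_splits + 2} elements, got {len(result)}"
        )
    tables = result[:num_splits]      # one table per split
    num_rows = result[num_splits]     # row count of the parsed table
    dtypes = result[num_splits + 1]   # per-column data types
    return tables, num_rows, dtypes


# Usage with stand-in values (placeholders, not real PyArrow objects).
fake_result = ["table_0", "table_1", 100, {"a": "int64", "b": "float64"}]
tables, num_rows, dtypes = unpack_parse_result(fake_result, num_splits=2)
```

Checking the length up front catches a mismatch between the num_splits passed to the parser and the one used when unpacking, which would otherwise silently shift the metadata elements.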