:orphan:

IO Module Description
"""""""""""""""""""""

Dispatcher Classes Workflow Overview
''''''''''''''''''''''''''''''''''''

Calls from ``read_*`` functions of execution-specific IO classes (for example,
``PandasOnRayIO`` for the Ray engine and pandas storage format) are forwarded to the
``_read`` function of the file format-specific class (for example, ``CSVDispatcher``
for CSV files), where function parameters are preprocessed to check whether they are
supported (defaulting to pandas if not) and common metadata is computed for all
partitions. The file is then split into chunks (the splitting mechanism is described
below) and the data is used to launch tasks on the remote workers. After the remote
tasks finish, additional postprocessing is performed on the results, and a new query
compiler with the imported data is returned.

Data File Splitting Mechanism
'''''''''''''''''''''''''''''

Modin's file splitting mechanism differs depending on the data format type:

* text format type - the file is split into bytes according to user-specified
  arguments. In the simplest case, when no row-related parameters (such as
  ``nrows`` or ``skiprows``) are passed, chunk limits (start and end bytes) are
  derived by dividing the file size by the number of partitions (chunks can
  differ slightly from each other because an end byte may occur inside a line,
  in which case the last byte of that line is used instead of the initial
  value; a sketch of this byte-alignment logic is shown after this section).
  In other cases the same splitting mechanism is used, but chunk sizes are
  defined according to the number of lines each partition should contain.

* columnar store type - the file is split so that each chunk contains
  approximately the same number of columns.

* SQL type - chunking is obtained by wrapping the initial SQL query in a query
  that specifies the initial row offset and the number of rows in the chunk.

After file splitting is complete, the chunk data is passed to the parser
functions (``PandasCSVParser.parse`` for the ``read_csv`` function with pandas
storage format) for further processing on each worker.
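The byte-splitting approach for text format files can be illustrated with a
minimal sketch. The ``compute_chunks`` helper below is hypothetical and not
part of Modin; the real logic lives in ``partitioned_file`` of
``text_file_dispatcher.py`` and additionally handles headers, row-related
parameters and compression:

.. code-block:: python

    import os

    def compute_chunks(path, num_partitions):
        """Split a text file into byte ranges aligned to line boundaries."""
        file_size = os.path.getsize(path)
        chunk_size = file_size // num_partitions
        chunks = []
        with open(path, "rb") as f:
            start = 0
            while start < file_size:
                # tentative end byte, derived from the uniform chunk size
                f.seek(start + chunk_size)
                # the end byte usually occurs inside a line, so extend the
                # boundary to the end of that line
                f.readline()
                end = min(f.tell(), file_size)
                chunks.append((start, end))
                start = end
        return chunks

Each resulting ``(start, end)`` pair could then be shipped to a remote worker,
which reads only that byte range of the file and parses it with pandas.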
Submodules Description
''''''''''''''''''''''

The ``modin.core.io`` module is used mostly for storing utilities and
dispatcher classes for reading files of different formats.

* ``io.py`` - class containing basic utilities and the default implementation
  of IO functions.

* ``file_dispatcher.py`` - class that reads data from different kinds of files
  and handles some utility functions common to all formats. This class also
  contains the ``read`` function, which is the entry point for the ``_read``
  functions of all dispatchers.

* text - directory for storing all text file format dispatcher classes

  * ``text_file_dispatcher.py`` - class for reading text format files. This
    class holds the ``partitioned_file`` function for splitting text format
    files into chunks, the ``offset`` function for moving the file offset by
    the specified number of bytes, the ``_read_rows`` function for moving the
    file offset by the specified number of rows, and many other functions.

  * format/feature specific dispatchers: ``csv_dispatcher.py``,
    ``excel_dispatcher.py``, ``fwf_dispatcher.py`` and ``json_dispatcher.py``.

* column_stores - directory for storing all columnar store file format
  dispatcher classes

  * ``column_store_dispatcher.py`` - class for reading columnar type files.
    This class holds the ``build_query_compiler`` function, which performs
    file splitting, deploys remote tasks and postprocesses the results, among
    many other functions.

  * format/feature specific dispatchers: ``feather_dispatcher.py``,
    ``hdf_dispatcher.py`` and ``parquet_dispatcher.py``.

* sql - directory for storing the SQL dispatcher class

  * ``sql_dispatcher.py`` - class for reading SQL queries or database tables.

Public API
''''''''''

.. automodule:: modin.core.io
  :members:

Handling ``skiprows`` Parameter
'''''''''''''''''''''''''''''''

Handling the ``skiprows`` parameter in pandas import functions can be very
tricky, especially for the ``read_csv`` function, because of its
interconnection with the ``header`` parameter. This section covers the
techniques of ``skiprows`` processing used by both pandas and Modin.

Processing ``skiprows`` by pandas
=================================

Let's consider a simple snippet with ``pandas.read_csv`` in order to
understand the interconnection of the ``header`` and ``skiprows`` parameters:

.. code-block:: python

    import pandas
    from io import StringIO

    data = """0
    1
    2
    3
    4
    5
    6
    7
    8
    """

    # `header` parameter absence is equivalent to `header="infer"` or `header=0`
    # rows 1, 5, 6, 7, 8 are read with header "0"
    df = pandas.read_csv(StringIO(data), skiprows=[2, 3, 4])
    # rows 5, 6, 7, 8 are read with header "1", row 0 is skipped additionally
    df = pandas.read_csv(StringIO(data), skiprows=[2, 3, 4], header=1)
    # rows 6, 7, 8 are read with header "5", rows 0, 1 are skipped additionally
    df = pandas.read_csv(StringIO(data), skiprows=[2, 3, 4], header=2)

In the examples above the list-like ``skiprows`` value is fixed while
``header`` is varied. In the first example, with no ``header`` provided, rows
2, 3, 4 are skipped and row 0 is considered the header. In the second example
``header == 1``, so the zeroth row is skipped and the next available row is
considered the header. The third example illustrates the case when both
``header`` and ``skiprows`` are present - the ``skiprows`` rows are dropped
first and then the ``header`` is derived from the remaining rows (rows before
the header are skipped too). The examples above consider only list-like
``skiprows`` and integer ``header`` parameters, but the same logic applies to
other types of these parameters.

Processing ``skiprows`` by Modin
================================

As can be seen, skipping rows in pandas import functions is complicated, and
distributing this logic across multiple workers can complicate it even more.
Thus, in some rare corner cases, Modin uses the default pandas implementation
to avoid excessive code complication. Modin uses two techniques for skipping
rows:

1) During file partitioning (setting the file limits that should be read by
   each partition), exact rows can be excluded from the partitioning scope, so
   they won't be read at all and can be considered skipped. This is the most
   efficient way of skipping rows since it doesn't require any actual data
   reading or postprocessing, but in this case the ``skiprows`` parameter can
   only be an integer. Modin always uses this approach when possible.

2) Rows can be dropped after the full dataset import. This is a more expensive
   way since it requires extra IO work and postprocessing afterwards, but the
   ``skiprows`` parameter can be of any non-integer type supported by
   ``pandas.read_csv``.

In some cases, if ``skiprows`` is a uniformly distributed array (e.g.
[1, 2, 3]), it can be "squashed" and represented as an integer to enable a
fastpath that skips these rows during file partitioning (the first option); a
sketch of this check is shown below.
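The contiguity check behind the "squashing" fastpath can be sketched as
follows. The ``is_squashable`` helper is hypothetical and only illustrates the
idea; Modin's actual check is part of the text dispatcher internals:

.. code-block:: python

    import numpy as np

    def is_squashable(skiprows):
        """Check whether a list-like ``skiprows`` is a contiguous run of row
        numbers and thus can be represented as a single integer."""
        arr = np.sort(np.asarray(skiprows))
        return bool(np.array_equal(arr, np.arange(arr[0], arr[0] + len(arr))))

    assert is_squashable([1, 2, 3])      # can be read as skiprows=3
    assert is_squashable([3, 4, 5])      # contiguous, but needs pre-reading (see below)
    assert not is_squashable([1, 3, 5])  # falls back to dropping rows after import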
However, if there is a gap between the first row to skip and the last line of
the header (which will be skipped too, since the header is read by each
partition to ensure metadata is defined properly), this gap should be read
first. This is arranged by assigning the first partition to read these rows
via the ``pre_reading`` parameter.

Let's consider an example of skipping rows during partitioning when
``header="infer"`` and ``skiprows=[3, 4, 5]``. In this specific case the
fastpath can be taken since ``skiprows`` is a uniformly distributed array, so
we can "squash" it to an integer and set the "partitioning" skiprows to 3. But
if no additional action is taken, these three rows will be skipped right after
the header line, which corresponds to ``skiprows=[1, 2, 3]``. To avoid this
discrepancy, we need to assign the first partition to read the data between
the header line and the first row to skip by setting the special
``pre_reading`` parameter to 2. Then, after the rows designated for skipping
during partitioning are dropped, the rest of the data is divided between the
remaining partitions; see the row assignment below:

.. code-block::

    0 - header line (skip during partitioning)
    1 - pre reading (assign to read by the first partition)
    2 - pre reading (assign to read by the first partition)
    3 - "partitioning" skiprows (skip during partitioning)
    4 - "partitioning" skiprows (skip during partitioning)
    5 - "partitioning" skiprows (skip during partitioning)
    6 - data to partition (divide between the rest of partitions)
    7 - data to partition (divide between the rest of partitions)
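The bookkeeping from this example can be expressed as a short sketch. The
``split_skiprows`` helper is hypothetical; its names follow the prose above,
not Modin's internal signatures:

.. code-block:: python

    def split_skiprows(skiprows, header_rows=1):
        """Compute the ``pre_reading`` amount and the "partitioning"
        skiprows integer for a contiguous ``skiprows`` run."""
        first_skipped = min(skiprows)
        # rows between the header and the first skipped row must be read
        # by the first partition
        pre_reading = first_skipped - header_rows
        partitioning_skiprows = len(skiprows)
        return pre_reading, partitioning_skiprows

    # header="infer" consumes row 0, rows 1-2 are pre-read by the first
    # partition, rows 3-5 are skipped during partitioning, and the rest is
    # divided between the remaining partitions
    assert split_skiprows([3, 4, 5]) == (2, 3)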