Experimental IO Module Description#

The module is used mostly for storing experimental utils and dispatcher classes for reading/writing files of different formats.

Submodules Description#

  • text - directory for storing all text file format dispatcher classes

    • format/feature specific dispatchers: csv_glob_dispatcher.py, custom_text_dispatcher.py

  • sql - directory for storing the SQL dispatcher class

    • format/feature specific dispatchers: sql_dispatcher.py

  • pickle - directory for storing the Pickle dispatcher class

    • format/feature specific dispatchers: pickle_dispatcher.py

Public API#

Implementations of experimental IO functions.

class modin.experimental.core.io.ExperimentalCSVGlobDispatcher#

Class contains utils for reading multiple .csv files simultaneously.

classmethod file_exists(file_path: str, storage_options=None) → bool#

Check if the file_path is valid.

Parameters:
  • file_path (str) – String representing a path.

  • storage_options (dict, optional) – The storage_options keyword argument passed through from read_* functions.

Returns:

True if the path is valid.

Return type:

bool

classmethod get_path(file_path: str, storage_options=None) → list#

Return the path of the file(s).

Parameters:
  • file_path (str) – String representing a path.

  • storage_options (dict, optional) – The storage_options keyword argument passed through from read_* functions.

Returns:

List of strings of absolute file paths.

Return type:

list
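As a rough illustration of what resolving a glob pattern into a list of absolute paths can look like, here is a minimal sketch for local files only. It is not Modin's implementation: the helper names resolve_glob_paths and glob_path_exists are hypothetical, storage_options and remote file systems are ignored, and treating a pattern as valid when it matches at least one file is an assumption.

```python
import glob
import os


def resolve_glob_paths(file_path: str) -> list:
    """Hypothetical helper: expand a local glob pattern into absolute paths."""
    # glob.glob makes no ordering guarantee, so sort for a stable result.
    return sorted(os.path.abspath(p) for p in glob.glob(file_path))


def glob_path_exists(file_path: str) -> bool:
    """Hypothetical helper: assume a pattern is valid if it matches any file."""
    return len(resolve_glob_paths(file_path)) > 0
```

For example, resolve_glob_paths("data/part_*.csv") would return the matching files in sorted order, or an empty list if nothing matches.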

classmethod partitioned_file(files, fnames: List[str], num_partitions: int = None, nrows: int = None, skiprows: int = None, skip_header: int = None, quotechar: bytes = b'"', is_quoting: bool = True) → List[List[Tuple[str, int, int]]]#

Compute chunk sizes in bytes for every partition.

Parameters:
  • files (file or list of files) – File(s) to be partitioned.

  • fnames (str or list of str) – File name(s) to be partitioned.

  • num_partitions (int, optional) – Number of partitions to split a file into. If not specified, the value is taken from modin.config.NPartitions.get().

  • nrows (int, optional) – Number of rows of file to read.

  • skiprows (int, optional) – Specifies rows to skip.

  • skip_header (int, optional) – Specifies header rows to skip.

  • quotechar (bytes, default: b'"') – Indicate quote in a file.

  • is_quoting (bool, default: True) – Whether or not to consider quotes.

Returns:

List in which each element is a list of tuples. Each tuple contains the name of the data file the chunk belongs to, the chunk start offset, and the chunk end offset within that file.

Return type:

list

Notes

This method is implemented separately because the logic gets really complicated if we try to reuse TextFileDispatcher.partitioned_file for multiple files.
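To make the shape of the return value concrete, the following sketch splits several files into roughly equal byte ranges and groups them per partition. It is a simplified illustration, not the Modin algorithm: the function name split_files_into_chunks is hypothetical, and it deliberately ignores line boundaries, quoting, nrows, skiprows and skip_header, all of which a real CSV partitioner has to respect.

```python
import os
from typing import List, Tuple


def split_files_into_chunks(
    fnames: List[str], num_partitions: int
) -> List[List[Tuple[str, int, int]]]:
    """Hypothetical sketch: assign byte ranges of several files to partitions."""
    sizes = [os.path.getsize(name) for name in fnames]
    total = sum(sizes)
    # Target bytes per partition (ceiling division, never less than 1).
    chunk_size = max(1, -(-total // num_partitions))

    result: List[List[Tuple[str, int, int]]] = []
    current: List[Tuple[str, int, int]] = []
    remaining = chunk_size  # bytes still to assign to the current partition

    for name, size in zip(fnames, sizes):
        start = 0
        while start < size:
            # Take as much of this file as the current partition can hold.
            end = min(size, start + remaining)
            current.append((name, start, end))
            remaining -= end - start
            start = end
            if remaining == 0:
                # Partition is full: close it and start a new one.
                result.append(current)
                current = []
                remaining = chunk_size
    if current:
        result.append(current)
    return result
```

Note that a single partition may hold chunks from more than one file, which is why each element of the outer list is itself a list of (file name, start offset, end offset) tuples.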

class modin.experimental.core.io.ExperimentalCustomTextDispatcher#

Class handles utils for reading custom text files.

class modin.experimental.core.io.ExperimentalGlobDispatcher#

Class implements reading/writing different formats, parallelizing by the number of files.

classmethod write(qc, **kwargs)#

When * is in the filename, each partition is written to its own separate file.

The filename is determined as follows:

  • if * is in the filename, it is replaced by the ascending sequence 0, 1, 2, …

  • if * is not in the filename, the default implementation is used.

Parameters:
  • qc (BaseQueryCompiler) – The query compiler of the Modin dataframe that we want to run to_<format>_glob on.

  • **kwargs (dict) – Parameters for pandas.to_<format>(**kwargs).
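The filename rule described above can be sketched as a small standalone function. This is only an illustration of the naming scheme: expand_glob_filename is a hypothetical helper that is not part of Modin, and replacing every occurrence of * (rather than only the first) is an assumption.

```python
from typing import List


def expand_glob_filename(filename: str, num_partitions: int) -> List[str]:
    """Hypothetical helper: produce one output name per partition."""
    if "*" not in filename:
        # No wildcard: a single file name, handled by the default writer.
        return [filename]
    # Wildcard present: substitute the ascending sequence 0, 1, 2, ...
    return [filename.replace("*", str(i)) for i in range(num_partitions)]
```

With three partitions, the pattern "out_*.pkl" would yield out_0.pkl, out_1.pkl and out_2.pkl, while "out.pkl" is left unchanged.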

class modin.experimental.core.io.ExperimentalSQLDispatcher#

Class handles experimental utils for reading SQL queries or database tables.

classmethod preprocess_func()#

Prepare a function for transmission to remote workers.