Experimental IO Module Description#
The module mostly stores experimental utilities and dispatcher classes for reading and writing files in different formats.
Submodules Description#
text - directory for storing all text file format dispatcher classes
format/feature specific dispatchers:
csv_glob_dispatcher.py, custom_text_dispatcher.py
sql - directory for storing SQL dispatcher class
format/feature specific dispatchers:
sql_dispatcher.py
pickle - directory for storing Pickle dispatcher class
format/feature specific dispatchers:
pickle_dispatcher.py
Public API#
Experimental IO functions implementations.
- class modin.experimental.core.io.ExperimentalCSVGlobDispatcher#
Class contains utils for reading multiple .csv files simultaneously.
- classmethod file_exists(file_path: str, storage_options=None) bool #
Check if the file_path is valid.
- Parameters:
file_path (str) – String representing a path.
storage_options (dict, optional) – Keyword from read_* functions.
- Returns:
True if the path is valid.
- Return type:
bool
- classmethod get_path(file_path: str, storage_options=None) list #
Return the path of the file(s).
- Parameters:
file_path (str) – String representing a path.
storage_options (dict, optional) – Keyword from read_* functions.
- Returns:
List of strings of absolute file paths.
- Return type:
list
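As an illustration of what file_exists and get_path are documented to do for a glob pattern, here is a minimal sketch using only the standard library. It is not Modin's implementation: the helper names expand_local_glob and local_glob_exists are hypothetical, only the local filesystem is handled, and storage_options is ignored.

```python
import glob
import os


def expand_local_glob(file_path: str) -> list:
    """Return sorted absolute paths of local files matching a glob pattern."""
    # Hypothetical helper: local files only, no remote storage support.
    return sorted(os.path.abspath(p) for p in glob.glob(file_path))


def local_glob_exists(file_path: str) -> bool:
    """Treat a pattern as valid when at least one local file matches it."""
    return len(expand_local_glob(file_path)) > 0
```

Sorting the matches is a choice made for this sketch so that the order of the returned paths is reproducible between runs.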
- classmethod partitioned_file(files, fnames: List[str], num_partitions: int = None, nrows: int = None, skiprows: int = None, skip_header: int = None, quotechar: bytes = b'"', is_quoting: bool = True) List[List[Tuple[str, int, int]]] #
Compute chunk sizes in bytes for every partition.
- Parameters:
files (file or list of files) – File(s) to be partitioned.
fnames (str or list of str) – File name(s) to be partitioned.
num_partitions (int, optional) – Number of partitions to split the file(s) into. If not specified, the value is taken from modin.config.NPartitions.get().
nrows (int, optional) – Number of rows of file to read.
skiprows (int, optional) – Specifies rows to skip.
skip_header (int, optional) – Specifies header rows to skip.
quotechar (bytes, default: b'"') – Indicate quote in a file.
is_quoting (bool, default: True) – Whether or not to consider quotes.
- Returns:
List, where each element is a list of tuples. Each tuple contains the name of the data file the chunk belongs to, the chunk start offset, and the chunk end offset within that file.
- Return type:
list
Notes
This method has its own implementation because the logic gets really complicated if we try to reuse TextFileDispatcher.partitioned_file.
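The documented return shape (one list of (file name, start offset, end offset) tuples per partition) can be illustrated with a simplified sketch. This is an assumption-laden illustration, not the dispatcher's algorithm: the function split_byte_ranges is hypothetical, it works from file sizes alone, and it ignores row boundaries, quoting, nrows, skiprows and skip_header, all of which the real method has to respect.

```python
from typing import List, Tuple


def split_byte_ranges(
    sizes: List[Tuple[str, int]], num_partitions: int
) -> List[List[Tuple[str, int, int]]]:
    """Split files into roughly equal byte ranges, one list per partition."""
    total = sum(size for _, size in sizes)
    # Ceiling division so that num_partitions chunks cover every byte.
    chunk = max(1, -(-total // num_partitions))
    partitions: List[List[Tuple[str, int, int]]] = []
    current: List[Tuple[str, int, int]] = []
    room = chunk  # bytes still free in the partition being filled
    for fname, size in sizes:
        start = 0
        while start < size:
            take = min(room, size - start)
            current.append((fname, start, start + take))
            start += take
            room -= take
            if room == 0:
                # Partition is full: store it and open a new one.
                partitions.append(current)
                current, room = [], chunk
    if current:
        partitions.append(current)
    return partitions
```

In this sketch a partition may hold ranges from more than one file, which is why each inner element carries its own file name.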
- class modin.experimental.core.io.ExperimentalCustomTextDispatcher#
Class handles utils for reading custom text files.
- class modin.experimental.core.io.ExperimentalGlobDispatcher#
Class implements reading/writing different formats, parallelizing by the number of files.
- classmethod write(qc, **kwargs)#
When * is in the filename, each partition is written to its own separate file.
The filenames are determined as follows:
- if * is in the filename, it is replaced by the ascending sequence 0, 1, 2, …
- if * is not in the filename, the default implementation is used.
- Parameters:
qc (BaseQueryCompiler) – The query compiler of the Modin dataframe that we want to run to_<format>_glob on.
**kwargs (dict) – Parameters for pandas.to_<format>(**kwargs).
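The filename rule described for write can be sketched as follows. The helper partition_filenames is hypothetical and only models the naming step. Returning the unchanged path when there is no * is an assumption standing in for "the default implementation", whose behavior is not described here.

```python
def partition_filenames(filepath: str, num_partitions: int) -> list:
    """Replace '*' with 0, 1, 2, ... to get one filename per partition."""
    if "*" not in filepath:
        # Assumption: without a wildcard there is a single output file.
        return [filepath]
    return [filepath.replace("*", str(i)) for i in range(num_partitions)]
```

For example, under these assumptions a path such as "out_*.pkl" with three partitions yields "out_0.pkl", "out_1.pkl" and "out_2.pkl".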