
Scale your pandas workflow by changing a single line of code¶
Modin uses Ray or Dask to provide an effortless way to speed up your pandas notebooks, scripts, and libraries. Unlike other distributed DataFrame libraries, Modin provides seamless integration and compatibility with existing pandas code. Even using the DataFrame constructor is identical.
To use Modin, replace the pandas import:
import modin.pandas as pd
import numpy as np
frame_data = np.random.randint(0, 100, size=(2**10, 2**8))
df = pd.DataFrame(frame_data)
To use Modin, you do not need to know how many cores your system has and you do not need to specify how to distribute the data. In fact, you can continue using your previous pandas notebooks while experiencing a considerable speedup from Modin, even on a single machine. Once you’ve changed your import statement, you’re ready to use Modin just like you would pandas.
Installation and choosing your compute engine¶
Modin can be installed from PyPI:
pip install modin
If you don’t have Ray or Dask installed, you will need to install Modin with one of the targets:
pip install "modin[ray]" # Install Modin dependencies and Ray to run on Ray
pip install "modin[dask]" # Install Modin dependencies and Dask to run on Dask
pip install "modin[all]" # Install all of the above
Modin will automatically detect which engine you have installed and use that for scheduling computation!
If you want to choose a specific compute engine to run on, you can set the environment
variable MODIN_ENGINE
and Modin will do computation with that engine:
export MODIN_ENGINE=ray # Modin will use Ray
export MODIN_ENGINE=dask # Modin will use Dask
This can also be done within a notebook/interpreter before you import Modin:
import os
os.environ["MODIN_ENGINE"] = "ray" # Modin will use Ray
os.environ["MODIN_ENGINE"] = "dask" # Modin will use Dask
import modin.pandas as pd
Faster pandas, even on your laptop¶

The modin.pandas DataFrame is an extremely light-weight parallel DataFrame. Modin transparently distributes the data and computation so that all you need to do is continue using the pandas API as you were before installing Modin. Unlike other parallel DataFrame systems, Modin is an extremely light-weight, robust DataFrame. Because it is so light-weight, Modin provides speed-ups of up to 4x on a laptop with 4 physical cores.
In pandas, you are only able to use one core at a time when you are doing computation of any kind. With Modin, you are able to use all of the CPU cores on your machine. Even in read_csv, we see large gains by efficiently distributing the work across your entire machine.
import modin.pandas as pd
df = pd.read_csv("my_dataset.csv")
Modin is a DataFrame for datasets from 1MB to 1TB+¶
We have focused heavily on bridging the solutions between DataFrames for small data (e.g. pandas) and large data. Often data scientists require different tools for doing the same thing on different sizes of data. The DataFrame solutions that exist for 1MB do not scale to 1TB+, and the overheads of the solutions for 1TB+ are too costly for datasets in the 1KB range. With Modin, because of its light-weight, robust, and scalable nature, you get a fast DataFrame at 1MB and 1TB+.
Modin is currently under active development. Requests and contributions are welcome!
If you are interested in contributing, please refer to the ‘developer documentation’ section, where you can find a ‘Getting started’ guide, docs on system architecture and internal implementation details, and lots of other useful information.
Installation¶
There are several ways to install Modin. Most users will want to install with pip or conda, but some users may want to build from the master branch on the GitHub repo. The master branch has the most recent patches, but may be less stable than a release installed from pip or conda.
Installing with pip¶
Stable version¶
Modin can be installed with pip on Linux, Windows, and MacOS. Two engines are available for those platforms: Ray and Dask.
To install the most recent stable release run the following:
pip install -U modin # -U for upgrade in case you have an older version
If you don’t have Ray or Dask installed, you will need to install Modin with one of the targets:
pip install "modin[ray]" # Install Modin dependencies and Ray to run on Ray
pip install "modin[dask]" # Install Modin dependencies and Dask to run on Dask
pip install "modin[all]" # Install all of the above
Modin will automatically detect which engine you have installed and use that for scheduling computation!
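If you want to see for yourself which engines are importable in your environment before relying on auto-detection, a quick stdlib-only probe works (this is a sketch; Modin's own detection logic may differ):

```python
import importlib.util

def available_engines():
    """Return the subset of ("ray", "dask") that can be imported here."""
    return [name for name in ("ray", "dask")
            if importlib.util.find_spec(name) is not None]

print(available_engines())
```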
Release candidates¶
Before most major releases, we will upload a release candidate to PyPI. If you would like to install a pre-release of Modin, run the following:
pip install --pre modin
These pre-releases are uploaded for dependencies and users to test their existing code to ensure that it still works. If you find something wrong, please raise an issue or email the bug reporter: bug_reports@modin.org.
Installing specific dependency sets¶
Modin has a number of specific dependency sets for running Modin on different backends or for different functionalities of Modin. Here is a list of dependency sets for Modin:
pip install "modin[dask]" # If you want to use the Dask backend
Installing with conda¶
Using conda-forge channel¶
Modin releases can be installed using conda
from conda-forge channel. Starting from 0.10.1
it is possible to install modin with chosen engine(s) alongside. Current options are:
Package name in conda-forge | Engine(s)          | Supported OSs
--------------------------- | ------------------ | ---------------------
modin                       |                    | Linux, Windows, MacOS
modin-dask                  | Dask               | Linux, Windows, MacOS
modin-ray                   | Ray                | Linux, Windows
modin-omnisci               | OmniSci            | Linux
modin-all                   | Dask, Ray, OmniSci | Linux
So, to install the Dask and Ray engines into a conda environment, the following command should be used:
conda install -c conda-forge modin-ray modin-dask
The full set of engines can be installed into a conda environment by specifying:
conda install -c conda-forge modin-all
or explicitly:
conda install -c conda-forge modin-ray modin-dask modin-omnisci
Using Intel® Distribution of Modin¶
With conda it is also possible to install the Intel Distribution of Modin, a special version of Modin that is part of the Intel® oneAPI AI Analytics Toolkit. This version of Modin is powered by the OmniSci engine, which contains a number of optimizations for Intel hardware. More details can be found on the Intel Distribution of Modin page.
Installing from the GitHub master branch¶
If you’d like to try Modin using the most recent updates from the master branch, you can also use pip.
pip install git+https://github.com/modin-project/modin
This will install directly from the repo without you having to manually clone it! Please be aware that these changes have not made it into a release and may not be completely stable.
Windows¶
All Modin engines except OmniSci are available on both Windows and Linux, as mentioned above. The default engine on Windows is Ray. It is also possible to use Windows Subsystem for Linux (WSL), but this is generally not recommended due to the limitations and poor performance of Ray on WSL, which adds roughly a 2-3x overhead.
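For example, if you are on Windows and prefer Dask over the default Ray engine, you can pin it with the MODIN_ENGINE variable described earlier (set it before importing Modin):

```python
import os

# Modin reads MODIN_ENGINE at import time, so set it first.
os.environ["MODIN_ENGINE"] = "dask"

# import modin.pandas as pd  # safe to import Modin after this point
```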
Building Modin from Source¶
If you’re planning on contributing to Modin, you will need to ensure that you are building Modin from the local repository that you are working off of. Occasionally, there are issues when a Modin install from PyPI overlaps with one from source. To avoid these issues, we recommend uninstalling Modin before you install from source:
pip uninstall modin
To build from source, you first must clone the repo. We recommend forking the repository first through the GitHub interface, then cloning as follows:
git clone https://github.com/<your-github-username>/modin.git
Once cloned, cd into the modin directory and use pip to install:
cd modin
pip install -e .
Using Modin¶
Modin is an early stage DataFrame library that wraps pandas and transparently distributes the data and computation, accelerating your pandas workflows with one line of code change. The user does not need to know how many cores their system has, nor do they need to specify how to distribute the data. In fact, users can continue using their previous pandas notebooks while experiencing a considerable speedup from Modin, even on a single machine. Only a modification of the import statement is needed, as we demonstrate below. Once you’ve changed your import statement, you’re ready to use Modin just like you would pandas, since the API is identical to pandas.
Quickstart¶
# import pandas as pd
import modin.pandas as pd
That’s it. You’re ready to use Modin on your previous pandas notebooks.
We currently have most of the pandas API implemented and are working toward full functional parity with pandas (as well as even more tools and features).
Using Modin on a Single Node¶
In local mode (without a cluster), Modin will create and manage a local (Dask or Ray) cluster for execution.
In order to use the most up-to-date version of Modin, please follow the instructions on the installation page.
Once you import the library, you should see something similar to the following output:
>>> import modin.pandas as pd
Waiting for redis server at 127.0.0.1:14618 to respond...
Waiting for redis server at 127.0.0.1:31410 to respond...
Starting local scheduler with the following resources: {'CPU': 4, 'GPU': 0}.
======================================================================
View the web UI at http://localhost:8889/notebooks/ray_ui36796.ipynb?token=ac25867d62c4ae87941bc5a0ecd5f517dbf80bd8e9b04218
======================================================================
Once you have executed import modin.pandas as pd, you’re ready to begin running your pandas pipeline as you were before.
APIs Supported¶
Please note, the API is not yet complete. For some methods, you may see the following:
NotImplementedError: To contribute to Modin, please visit github.com/modin-project/modin.
We have compiled a list of currently supported methods.
If you would like to request a particular method be implemented, feel free to open an issue. Before you open an issue please make sure that someone else has not already requested that functionality.
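Until full parity lands, a defensive pattern is to fall back to pandas when a Modin method raises NotImplementedError. The sketch below assumes the Modin object exposes a `_to_pandas()` helper (an internal Modin detail, so treat it as an assumption):

```python
def call_with_pandas_fallback(obj, method_name, *args, **kwargs):
    """Try the Modin method first; on NotImplementedError, retry on a pandas copy."""
    try:
        return getattr(obj, method_name)(*args, **kwargs)
    except NotImplementedError:
        # Assumption: the object can materialize itself as a pandas DataFrame.
        return getattr(obj._to_pandas(), method_name)(*args, **kwargs)
```

For example, `call_with_pandas_fallback(df, "lookup", rows, cols)`, where "lookup" stands in for any not-yet-implemented method.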
Using Modin on a Cluster (experimental)¶
Modin is able to utilize Ray’s built-in autoscaled cluster. However, this usage is still under heavy development. To launch a Ray autoscaled cluster using Amazon Web Service (AWS), you can use the file examples/cluster/aws_example.yaml as the config file when launching an autoscaled Ray cluster. For the commands, refer to the autoscaler documentation.
We will provide a sample config file for private servers and other cloud service providers as we continue to develop and improve Modin’s cluster support.
See more on the Modin in the Cloud documentation page.
Advanced usage (experimental)¶
In some cases, it may be useful to customize your Ray environment. Below, we have listed a few ways you can solve common problems in data management with Modin by customizing your Ray environment. It is possible to use any of Ray’s initialization parameters, which are all found in Ray’s documentation.
import ray
ray.init()
import modin.pandas as pd
Modin will automatically connect to the Ray instance that is already running. This way, you can customize your Ray environment for use in Modin!
Exceeding memory (Out of core pandas)¶
Modin experimentally supports out of core operations. See more on the Out of Core documentation page.
Reducing or limiting the resources Modin can use¶
By default, Modin will use all of the resources available on your machine. It is possible, however, to limit the amount of resources Modin uses to free resources for another task or user. Here is how you would limit the number of CPUs Modin used in your bash environment variables:
export MODIN_CPUS=4
You can also specify this in your Python script with os.environ. Make sure you update MODIN_CPUS before you import Modin!:
import os
os.environ["MODIN_CPUS"] = "4"
import modin.pandas as pd
If you’re using a specific engine and want more control over the environment Modin uses, you can start Ray or Dask in your environment and Modin will connect to it. Make sure you start the environment before you import Modin!
import ray
ray.init(num_cpus=4)
import modin.pandas as pd
Specifying num_cpus limits the number of processors that Modin uses. You may also specify more processors than you have available on your machine; however, this will not improve the performance (and might end up hurting the performance of the system).
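A small stdlib sketch for choosing a sensible cap, following the caveat above about not exceeding the machine's core count (the requested count here is a hypothetical value):

```python
import os

requested = 16              # hypothetical desired worker count
cpus = os.cpu_count() or 1  # os.cpu_count() can return None

# Never ask Modin for more workers than the machine reports.
os.environ["MODIN_CPUS"] = str(min(requested, cpus))
```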
Examples¶
You can find an example on our recent blog post or on the Jupyter Notebook that we used to create the blog post.
Out of Core in Modin¶
If you are working with very large files or would like to exceed your memory, you may change the primary location of the DataFrame. If you would like to exceed memory, you can use your disk as an overflow for the memory.
Starting Modin with out of core enabled¶
Out of core is now enabled by default for both Ray and Dask engines.
Disabling Out of Core¶
Out of core is enabled by the compute engine selected. To disable it, start your preferred compute engine with the appropriate arguments. For example:
import modin.pandas as pd
import ray
ray.init(_plasma_directory="/tmp") # setting to disable out of core in Ray
df = pd.read_csv("some.csv")
If you are using Dask, you have to modify local configuration files. Visit the Dask documentation on object spilling to see how.
Running an example with out of core¶
Before you run this, please make sure you follow the instructions listed above.
import modin.pandas as pd
import numpy as np
frame_data = np.random.randint(0, 100, size=(2**20, 2**8)) # 2GB each
df = pd.DataFrame(frame_data).add_prefix("col")
big_df = pd.concat([df for _ in range(20)]) # 20x2GB frames
print(big_df)
nan_big_df = big_df.isna() # The performance here represents a simple map
print(big_df.groupby("col1").count()) # group by on a large dataframe
This example creates a 40GB DataFrame from 20 identical 2GB DataFrames and performs various operations on them. Feel free to play around with this code and let us know what you think!
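The "2GB each" figure checks out: each frame holds 2**20 x 2**8 64-bit integers. A quick sanity check of the arithmetic:

```python
rows, cols = 2**20, 2**8
bytes_per_value = 8  # int64, the numpy default on most 64-bit platforms

frame_gb = rows * cols * bytes_per_value / 2**30  # bytes -> GiB
total_gb = frame_gb * 20                          # 20 concatenated frames
print(frame_gb, total_gb)                         # 2.0 40.0
```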
Examples¶
scikit-learn with LinearRegression¶
Here is a Jupyter Notebook example which uses Modin with scikit-learn’s LinearRegression.
Overview¶
Modin aims to not only optimize Pandas, but also provide a comprehensive, integrated toolkit for data scientists. We are actively developing data science tools such as DataFrame - spreadsheet integration, DataFrame algebra, progress bars, SQL queries on DataFrames, and more. Join the Discourse for the latest updates!
Modin Spreadsheet API: Render Dataframes as Spreadsheets¶
The Spreadsheet API for Modin allows you to render the dataframe as a spreadsheet to easily explore your data and perform operations on a graphical user interface. The API also includes features for recording the changes made to the dataframe and exporting them as reproducible code. Built on top of Modin and SlickGrid, the spreadsheet interface is able to provide interactive response times even at a scale of billions of rows. See our Modin Spreadsheet API documentation for more details.

Progress Bar¶
Visual progress bar for Dataframe operations such as groupby and fillna, as well as for file reading operations such as read_csv. Built using the tqdm library and Ray execution engine. See Progress Bar documentation for more details.

Dataframe Algebra¶
A minimal set of operators that can be composed to express any dataframe query for use in query planning and optimization. See our paper for more information, and full documentation is coming soon!
SQL on Modin Dataframes¶

Read about Modin Dataframe support for SQL queries in this recent blog post. Check out the Modin SQL documentation as well!
Distributed XGBoost on Modin¶
Modin provides an implementation of distributed XGBoost machine learning algorithm on Modin DataFrames. See our Distributed XGBoost on Modin documentation for details about installation and usage, as well as Modin XGBoost architecture documentation for information about implementation and internal execution flow.
SQL on Modin Dataframes¶
MindsDB has teamed up with Modin to bring in-memory SQL to distributed Modin Dataframes. Now you can run SQL alongside the pandas API without copying or going through your disk. What this means is that you can now have a SQL solution that you can seamlessly scale horizontally and vertically, by leveraging the incredible power of Ray.
A Short Example Using the Google Play Store¶
import modin.pandas as pd
import modin.experimental.sql as sql
# read google play app store list from csv
gstore_apps_df = pd.read_csv("https://tinyurl.com/googleplaystorecsv")

Imagine that you want to quickly select from ‘gstore_apps_df’ the columns App, Category, and Rating, where Price is ‘0’.
# You can then define the query that you want to perform
sql_str = "SELECT App, Category, Rating FROM gstore_apps WHERE Price = '0'"
# And simply apply that query to a dataframe
result_df = sql.query(sql_str, gstore_apps=gstore_apps_df)
# Or, in this case, where the query only requires one table,
# you can also ignore the FROM part in the query string:
query_str = "SELECT App, Category, Rating WHERE Price = '0' "
# sql.query can take query strings without a FROM statement;
# you can pass the table as the "from" argument instead. Since "from"
# is a reserved word in Python, pass it via dict unpacking:
result_df = sql.query(query_str, **{"from": gstore_apps_df})
Writing Complex Queries¶
Let’s explore a more complicated example.
gstore_reviews_df = pd.read_csv("https://tinyurl.com/gstorereviewscsv")

Say we want to retrieve the top 10 app categories ranked by best average ‘sentiment_polarity’ where the average ‘sentiment_subjectivity’ is less than 0.5.
Since ‘Category’ is on the gstore_apps_df and sentiment_polarity is on gstore_reviews_df, we need to join the two tables, and operate averages on that join.
# Single query with join and group by
sql_str = """
SELECT
category,
avg(sentiment_polarity) as avg_sentiment_polarity,
avg(sentiment_subjectivity) as avg_sentiment_subjectivity
FROM (
SELECT
category,
CAST(sentiment_polarity as float) as sentiment_polarity,
CAST(sentiment_subjectivity as float) as sentiment_subjectivity
FROM gstore_apps_df
INNER JOIN gstore_reviews_df
ON gstore_apps_df.app = gstore_reviews_df.app
) sub
GROUP BY category
HAVING avg_sentiment_subjectivity < 0.5
ORDER BY avg_sentiment_polarity DESC
LIMIT 10
"""
# Run query using apps and reviews dataframes,
# NOTE: that you simply pass the names of the tables in the query as arguments
result_df = sql.query( sql_str,
gstore_apps_df = gstore_apps_df,
gstore_reviews_df = gstore_reviews_df)
Or, you can combine the best of SQL and Python by running the query in multiple parts (it’s up to you).
# join the items and reviews
result_df = sql.query( """
SELECT
category,
sentiment_polarity,
sentiment_subjectivity
FROM gstore_apps_df INNER JOIN gstore_reviews_df
ON gstore_apps_df.app = gstore_reviews_df.app """,
gstore_apps_df = gstore_apps_df,
gstore_reviews_df = gstore_reviews_df )
# group by category and calculate averages
result_df = sql.query( """
SELECT
category,
avg(sentiment_polarity) as avg_sentiment_polarity,
avg(sentiment_subjectivity) as avg_sentiment_subjectivity
GROUP BY category
HAVING CAST(avg_sentiment_subjectivity as float) < 0.5
ORDER BY avg_sentiment_polarity DESC
LIMIT 10""",
**{"from": result_df})  # "from" is a Python keyword, so pass it via dict unpacking
If you have a cluster or even a computer with more than one CPU core, you can write SQL and Modin will run those queries in a distributed and optimized way.
Further Examples and Full Documentation¶
In the meantime, you can check out our Example Notebook that contains more examples and ideas, as well as this blog explaining Modin SQL usage.
Modin Spreadsheets API¶
Getting started¶
Install Modin-spreadsheet using pip:
pip install "modin[spreadsheet]"
The following code snippet creates a spreadsheet using the FiveThirtyEight dataset on labor force information by college majors (licensed under CC BY 4.0):
import modin.pandas as pd
import modin.spreadsheet as mss
df = pd.read_csv('https://raw.githubusercontent.com/fivethirtyeight/data/master/college-majors/all-ages.csv')
spreadsheet = mss.from_dataframe(df)
spreadsheet

Basic Manipulations through User Interface¶
The Spreadsheet API allows users to manipulate the DataFrame with simple graphical controls for sorting, filtering, and editing.
Here are the instructions for each operation:
Sort: Click on the column header of the column to sort on.
Filter: Click on the filter button on the column header and apply the desired filter to the column. The filter dropdown changes depending on the type of the column. Multiple filters are automatically combined.
Edit Cell: Double click on a cell and enter the new value.
Add Rows: Click on the “Add Row” button in the toolbar to duplicate the last row in the DataFrame. The duplicated values provide a convenient default and can be edited as necessary.
Remove Rows: Select row(s) and click the “Remove Row” button. Select a single row by clicking on it. Multiple rows can be selected with Cmd+Click (Windows: Ctrl+Click) on the desired rows or with Shift+Click to specify a range of rows.
Some of these operations can also be done through the spreadsheet’s programmatic interface. Sorts and filters can be reset using the toolbar buttons. Edits and adding/removing rows can only be undone manually.
Virtual Rendering¶
The spreadsheet will only render data based on the user’s viewport. This allows for quick rendering even on very large DataFrames because only a handful of rows are loaded at any given time. As a result, scrolling and viewing your data is smooth and responsive.
Transformation History and Exporting Code¶
All operations on the spreadsheet are recorded and are easily exported as code for sharing or reproducibility. This history is automatically displayed in the history cell, which is generated below the spreadsheet whenever the spreadsheet widget is displayed. The history cell is displayed by default, but this can be turned off. The Modin Spreadsheet API provides a few methods for interacting with the history:
SpreadsheetWidget.get_history(): Retrieves the transformation history in the form of reproducible code.
SpreadsheetWidget.filter_relevant_history(persist=True): Returns the transformation history that contains only code relevant to the final state of the spreadsheet. The persist parameter determines whether the internal state and the displayed history is also filtered.
SpreadsheetWidget.reset_history(): Clears the history of transformation.
Customizable Interface¶
The spreadsheet widget provides a number of options that allows the user to change the appearance and the interactivity of the spreadsheet. Options include:
Row height/Column width
Preventing edits, sorts, or filters on the whole spreadsheet or on a per-column basis
Hiding the toolbar and history cell
Float precision
Highlighting of cells and rows
Viewport size
Converting Spreadsheets To and From Dataframes¶
- modin.spreadsheet.general.from_dataframe(dataframe, show_toolbar=None, show_history=None, precision=None, grid_options=None, column_options=None, column_definitions=None, row_edit_callback=None)
Renders a DataFrame or Series as an interactive spreadsheet, represented by an instance of the SpreadsheetWidget class. The SpreadsheetWidget instance is constructed using the options passed in to this function. The dataframe argument to this function is used as the df kwarg in the call to the SpreadsheetWidget constructor, and the rest of the parameters are passed through as is. If the dataframe argument is a Series, it will be converted to a DataFrame before being passed in to the SpreadsheetWidget constructor as the df kwarg.
- Return type
SpreadsheetWidget
- Parameters
dataframe (DataFrame) – The DataFrame that will be displayed by this instance of SpreadsheetWidget.
grid_options (dict) – Options to use when creating the SlickGrid control (i.e. the interactive grid). See the Notes section below for more information on the available options, as well as the default options that this widget uses.
precision (integer) – The number of digits of precision to display for floating-point values. If unset, we use the value of pandas.get_option(‘display.precision’).
show_toolbar (bool) – Whether to show a toolbar with options for adding/removing rows. Adding/removing rows is an experimental feature which only works with DataFrames that have an integer index.
show_history (bool) – Whether to show the cell containing the spreadsheet transformation history.
column_options (dict) – Column options that are to be applied to every column. See the Notes section below for more information on the available options, as well as the default options that this widget uses.
column_definitions (dict) – Column options that are to be applied to individual columns. The keys of the dict should be the column names, and each value should be the column options for a particular column, represented as a dict. The available options for each column are the same options that are available to be set for all columns via the column_options parameter. See the Notes section below for more information on those options.
row_edit_callback (callable) – A callable that is called to determine whether a particular row should be editable or not. Its signature should be callable(row), where row is a dictionary which contains a particular row’s values, keyed by column name. The callback should return True if the provided row should be editable, and False otherwise.
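For instance, a row_edit_callback that only permits editing rows whose value in a hypothetical "Year" column is recent might look like:

```python
def row_edit_callback(row):
    """Allow edits only for rows from 2020 onward.

    `row` is a dict of one row's values keyed by column name;
    the "Year" column here is purely illustrative.
    """
    return row.get("Year", 0) >= 2020
```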
Notes
The following dictionary is used for grid_options if none are provided explicitly:
{
    # SlickGrid options
    'fullWidthRows': True,
    'syncColumnCellResize': True,
    'forceFitColumns': False,
    'defaultColumnWidth': 150,
    'rowHeight': 28,
    'enableColumnReorder': False,
    'enableTextSelectionOnCells': True,
    'editable': True,
    'autoEdit': False,
    'explicitInitialization': True,

    # Modin-spreadsheet options
    'maxVisibleRows': 15,
    'minVisibleRows': 8,
    'sortable': True,
    'filterable': True,
    'highlightSelectedCell': False,
    'highlightSelectedRow': True
}
The first group of options are SlickGrid “grid options” which are described in the SlickGrid documentation.
The second group of options are options that were added specifically for modin-spreadsheet and therefore are not documented in the SlickGrid documentation. The following bullet points describe these options.
maxVisibleRows The maximum number of rows that modin-spreadsheet will show.
minVisibleRows The minimum number of rows that modin-spreadsheet will show.
sortable Whether the modin-spreadsheet instance will allow the user to sort columns by clicking the column headers. When this is set to False, nothing will happen when users click the column headers.
filterable Whether the modin-spreadsheet instance will allow the user to filter the grid. When this is set to False, the filter icons won’t be shown for any columns.
highlightSelectedCell If you set this to True, the selected cell will be given a light blue border.
highlightSelectedRow If you set this to False, the light blue background that’s shown by default for selected rows will be hidden.
The following dictionary is used for column_options if none are provided explicitly:
{
    # SlickGrid column options
    'defaultSortAsc': True,
    'maxWidth': None,
    'minWidth': 30,
    'resizable': True,
    'sortable': True,
    'toolTip': "",
    'width': None,

    # Modin-spreadsheet column options
    'editable': True,
}
The first group of options are SlickGrid “column options” which are described in the SlickGrid documentation.
The editable option was added specifically for modin-spreadsheet and therefore is not documented in the SlickGrid documentation. This option specifies whether a column should be editable or not.
See also
set_defaults
Permanently set global defaults for the parameters of show_grid, with the exception of the dataframe and column_definitions parameters, since those depend on the particular set of data being shown by an instance, and therefore aren’t parameters we would want to set for all SpreadsheetWidget instances.
set_grid_option
Permanently set global defaults for individual grid options. Does so by changing the defaults that the show_grid method uses for the grid_options parameter.
SpreadsheetWidget
The widget class that is instantiated and returned by this method.
- modin.spreadsheet.general.to_dataframe(spreadsheet)
Get a copy of the DataFrame that reflects the current state of the spreadsheet SpreadsheetWidget instance’s UI. This includes any sorting or filtering changes, as well as edits that have been made by double clicking cells.
- Return type
DataFrame
- Parameters
spreadsheet (SpreadsheetWidget) – The SpreadsheetWidget instance whose current state will be returned as a DataFrame.
Further API Documentation¶
- class modin_spreadsheet.grid.SpreadsheetWidget(**kwargs)
The widget class which is instantiated by the show_grid method. This class can be constructed directly, but that’s not recommended because then default options have to be specified explicitly (since default options are normally provided by the show_grid method).
The constructor for this class takes all the same parameters as show_grid, with one exception: the required data_frame parameter is replaced by an optional keyword argument called df.
See also
show_grid
The method that should be used to construct SpreadsheetWidget instances, because it provides reasonable defaults for all of the modin-spreadsheet options.
- df
Get/set the DataFrame that’s being displayed by the current instance. This DataFrame will NOT reflect any sorting/filtering/editing changes that are made via the UI. To get a copy of the DataFrame that does reflect sorting/filtering/editing changes, use the get_changed_df() method.
- Type
DataFrame
- grid_options
Get/set the grid options being used by the current instance.
- Type
dict
- precision
Get/set the precision options being used by the current instance.
- Type
integer
- show_toolbar
Get/set the show_toolbar option being used by the current instance.
- Type
bool
- show_history
Get/set the show_history option being used by the current instance.
- Type
bool
- column_options
Get/set the column options being used by the current instance.
- Type
dict
- column_definitions
Get/set the column definitions (column-specific options) being used by the current instance.
- Type
dict
- add_row(row=None)
Append a row at the end of the DataFrame. Values for the new row can be provided via the row argument, which is optional for DataFrames that have an integer index, and required otherwise. If the row argument is not provided, the last row will be duplicated and the index of the new row will be the index of the last row plus one.
- Parameters
row (list (default: None)) – A list of 2-tuples of (column name, column value) that specifies the values for the new row.
See also
SpreadsheetWidget.remove_rows
The method for removing a row (or rows).
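The row argument is a list of (column name, column value) 2-tuples, for example (the column names here are illustrative, not part of any real dataset):

```python
# Hypothetical columns matching the widget's DataFrame
new_row = [("App", "MyApp"), ("Category", "TOOLS"), ("Rating", 4.5)]

# spreadsheet.add_row(row=new_row)  # assumes `spreadsheet` is a SpreadsheetWidget
```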
- change_grid_option(option_name, option_value)
Change a SlickGrid grid option without rebuilding the entire grid widget. Not all options are supported at this point so this method should be considered experimental.
- Parameters
option_name (str) – The name of the grid option to be changed.
option_value (str) – The new value for the grid option.
- change_selection(rows=[])
Select a row (or rows) in the UI. The indices of the rows to select are provided via the optional rows argument.
- Parameters
rows (list (default: [])) – A list of indices of the rows to select. For a multi-indexed DataFrame, each index in the list should be a tuple, with each value in each tuple corresponding to a level of the MultiIndex. The default value of [] results in no rows being selected (i.e. it clears the selection).
- edit_cell(index, column, value)
Edit a cell of the grid, given the index and column of the cell to edit, as well as the new value of the cell. Results in a cell_edited event being fired.
- Parameters
index (object) – The index of the row containing the cell that is to be edited.
column (str) – The name of the column containing the cell that is to be edited.
value (object) – The new value for the cell.
- get_changed_df()
Get a copy of the DataFrame that was used to create the current instance of SpreadsheetWidget which reflects the current state of the UI. This includes any sorting or filtering changes, as well as edits that have been made by double clicking cells.
- Return type
- get_selected_df()
Get a DataFrame which reflects the current state of the UI and only includes the currently selected row(s). Internally it calls get_changed_df() and then filters down to the selected rows using iloc.
- Return type
- get_selected_rows()
Get the currently selected rows.
- Return type
List of integers
- off(names, handler)
Remove a modin-spreadsheet event handler that was registered with the current instance’s on method.
- Parameters
names (list, str, All (default: All)) – The names of the events for which the specified handler should be uninstalled. If names is All, the specified handler is uninstalled from the list of notifiers corresponding to all events.
handler (callable) – A callable that was previously registered with the current instance’s on method.
See also
SpreadsheetWidget.on
The method for hooking up instance-level handlers that this off method can remove.
- on(names, handler)
Setup a handler to be called when a user interacts with the current instance.
- Parameters
names (list, str, All) – If names is All, the handler will apply to all events. If a list of str, the handler will apply to all events named in the list. If a str, the handler will apply to just the event with that name.
handler (callable) – A callable that is called when the event occurs. Its signature should be handler(event, spreadsheet_widget), where event is a dictionary and spreadsheet_widget is the SpreadsheetWidget instance that fired the event. The event dictionary at least holds a name key which specifies the name of the event that occurred.
Notes
Here’s the list of events that you can listen to on SpreadsheetWidget instances via the on method:
[ 'cell_edited', 'selection_changed', 'viewport_changed', 'row_added', 'row_removed', 'filter_dropdown_shown', 'filter_changed', 'sort_changed', 'text_filter_viewport_changed', 'json_updated' ]
The following bullet points describe the events listed above in more detail. Each event bullet point is followed by sub-bullets which describe the keys that will be included in the event dictionary for each event.
cell_edited The user changed the value of a cell in the grid.
index The index of the row that contains the edited cell.
column The name of the column that contains the edited cell.
old The previous value of the cell.
new The new value of the cell.
filter_changed The user changed the filter setting for a column.
column The name of the column for which the filter setting was changed.
filter_dropdown_shown The user showed the filter control for a column by clicking the filter icon in the column’s header.
column The name of the column for which the filter control was shown.
json_updated A user action causes SpreadsheetWidget to send rows of data (in json format) down to the browser. This happens as a side effect of certain actions such as scrolling, sorting, and filtering.
triggered_by The name of the event that resulted in rows of data being sent down to the browser. Possible values are change_viewport, change_filter, change_sort, add_row, remove_row, and edit_cell.
range A tuple specifying the range of rows that have been sent down to the browser.
row_added The user added a new row using the “Add Row” button in the grid toolbar.
index The index of the newly added row.
source The source of this event. Possible values are api (an api method call) and gui (the grid interface).
row_removed The user removed one or more rows using the “Remove Row” button in the grid toolbar.
indices The indices of the removed rows, specified as an array of integers.
source The source of this event. Possible values are api (an api method call) and gui (the grid interface).
selection_changed The user changed which rows were highlighted in the grid.
old An array specifying the indices of the previously selected rows.
new The indices of the rows that are now selected, again specified as an array.
source The source of this event. Possible values are api (an api method call) and gui (the grid interface).
sort_changed The user changed the sort setting for the grid.
old The previous sort setting for the grid, specified as a dict with the following keys:
column The name of the column that the grid was sorted by
ascending Boolean indicating ascending/descending order
new The new sort setting for the grid, specified as a dict with the following keys:
column The name of the column that the grid is currently sorted by
ascending Boolean indicating ascending/descending order
text_filter_viewport_changed The user scrolled the new rows into view in the filter dropdown for a text field.
column The name of the column whose filter dropdown is visible
old A tuple specifying the previous range of visible rows in the filter dropdown.
new A tuple specifying the range of rows that are now visible in the filter dropdown.
viewport_changed The user scrolled the new rows into view in the grid.
old A tuple specifying the previous range of visible rows.
new A tuple specifying the range of rows that are now visible.
The event dictionary for every type of event will contain a name key specifying the name of the event that occurred. That key is excluded from the lists of keys above to avoid redundancy.
See also
on
Same as the instance-level on method except it listens for events on all instances rather than on an individual SpreadsheetWidget instance.
SpreadsheetWidget.off
Unhook a handler that was hooked up using the instance-level on method.
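To make the handler contract concrete, here is a minimal sketch of the handler(event, spreadsheet_widget) signature described above. The widget is stubbed with None and the event dict is fabricated for illustration; only the documented keys (name, index, column, old, new) are assumed.

```python
# Sketch of a SpreadsheetWidget event handler. The event dict always carries
# a 'name' key; cell_edited additionally carries index, column, old, and new.
seen = []

def on_cell_edited(event, spreadsheet_widget):
    # Record the interesting fields of the event.
    seen.append((event["name"], event["column"], event["old"], event["new"]))

# Simulate the widget firing the event (in practice, registered via widget.on):
fake_event = {"name": "cell_edited", "index": 0, "column": "a", "old": 1, "new": 2}
on_cell_edited(fake_event, None)
```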
- remove_row(rows=None)
Alias for remove_rows, which is provided for convenience because this was the previous name of that method.
- remove_rows(rows=None)
Remove a row (or rows) from the DataFrame. The indices of the rows to remove can be provided via the optional rows argument. If the rows argument is not provided, the row (or rows) that are currently selected in the UI will be removed.
- Parameters
rows (list (default: None)) – A list of indices of the rows to remove from the DataFrame. For a multi-indexed DataFrame, each index in the list should be a tuple, with each value in each tuple corresponding to a level of the MultiIndex.
See also
SpreadsheetWidget.add_row
The method for adding a row.
SpreadsheetWidget.remove_row
Alias for this method.
- toggle_editable()
Change whether the grid is editable or not, without rebuilding the entire grid widget.
Progress Bar¶
The progress bar allows users to see the estimated progress and completion time of each line they run, in environments such as a shell or Jupyter notebook.

Quickstart¶
The progress bar uses the tqdm library for its display; install it with pip:
pip install tqdm
Enable the progress bar in your notebook by running the following:
import modin.pandas as pd
from tqdm import tqdm
from modin.config import ProgressBar
ProgressBar.enable()
Distributed XGBoost on Modin¶
Modin provides a distributed implementation of the XGBoost machine learning algorithm on Modin DataFrames. Please note that this feature is experimental and its behavior or interfaces may change.
Install XGBoost on Modin¶
Modin comes with all required dependencies except the xgboost package by default.
Currently, distributed XGBoost on Modin is only supported on the Ray backend; see the installation page for more information on installing Modin with the Ray backend.
To install the xgboost package, you can use pip:
pip install xgboost
XGBoost Train and Predict¶
Distributed XGBoost functionality is placed in the modin.experimental.xgboost module, which provides a drop-in replacement API for the xgboost train and Booster.predict functions.
Module holds public interfaces for Modin XGBoost.
- modin.experimental.xgboost.train(params: Dict, dtrain: modin.experimental.xgboost.xgboost.DMatrix, *args, evals=(), num_actors: Optional[int] = None, evals_result: Optional[Dict] = None, **kwargs)
Run distributed training of XGBoost model.
During training, dtrain is evenly distributed between workers according to the IP addresses of its partitions (if dtrain is not evenly distributed over nodes, some partitions are re-distributed between nodes); xgb.train runs on each worker for its subset of dtrain, and the training results of each worker are reduced using Rabit Context.
- Parameters
params (dict) – Booster params.
dtrain (modin.experimental.xgboost.DMatrix) – Data to be trained against.
*args (iterable) – Other parameters for xgboost.train.
evals (list of pairs (modin.experimental.xgboost.DMatrix, str), default: empty) – List of validation sets for which metrics will be evaluated during training. Validation metrics help us track the performance of the model.
num_actors (int, optional) – Number of actors for training. If unspecified, this value will be computed automatically.
evals_result (dict, optional) – Dict to store evaluation results in.
**kwargs (dict) – Other parameters are the same as xgboost.train.
- Returns
A trained booster.
- Return type
- class modin.experimental.xgboost.Booster(params=None, cache=(), model_file=None)
A Modin Booster of XGBoost.
Booster is the model of XGBoost, containing low-level routines for training, prediction, and evaluation.
- Parameters
params (dict, optional) – Parameters for boosters.
cache (list, default: empty) – List of cache items.
model_file (string/os.PathLike/xgb.Booster/bytearray, optional) – Path to the model file if it is a string or PathLike; otherwise an xgb.Booster or bytearray.
- predict(data: modin.experimental.xgboost.xgboost.DMatrix, **kwargs)
Run distributed prediction with a trained booster.
During execution it runs xgb.predict on each worker for its subset of data and creates a Modin DataFrame with the prediction results.
- Parameters
data (modin.experimental.xgboost.DMatrix) – Input data used for prediction.
**kwargs (dict) – Other parameters are the same as for xgboost.Booster.predict.
- Returns
Modin DataFrame with prediction results.
- Return type
modin.pandas.DataFrame
ModinDMatrix¶
Data is passed to modin.experimental.xgboost functions via a Modin DMatrix object.
Module holds public interfaces for Modin XGBoost.
- class modin.experimental.xgboost.DMatrix(data, label=None)
DMatrix holds references to the partitions of a Modin DataFrame.
On initialization, unwrapping of the Modin DataFrame’s partitions is started.
- Parameters
data (modin.pandas.DataFrame) – Data source of DMatrix.
label (modin.pandas.DataFrame or modin.pandas.Series, optional) – Labels used for training.
Notes
Currently, DMatrix supports only the data and label parameters, and the Modin DMatrix accepts only modin.pandas.DataFrame as an input.
Single Node / Cluster Setup¶
The XGBoost part of Modin uses Ray resources in the same way as other Modin functions.
To start the Ray runtime on a single node:
import ray
ray.init()
If you already have a Ray cluster running, you can connect to it as follows:
import ray
ray.init(address='auto')
Detailed information about initializing the Ray runtime can be found on the starting Ray page.
Usage example¶
In the example below, we train an XGBoost model using the Iris dataset and get predictions on the same data. All processing runs in single-node mode.
from sklearn import datasets
import ray
ray.init() # Start the Ray runtime for single-node
import modin.pandas as pd
import modin.experimental.xgboost as xgb
# Load iris dataset from sklearn
iris = datasets.load_iris()
# Create Modin DataFrames
X = pd.DataFrame(iris.data)
y = pd.DataFrame(iris.target)
# Create DMatrix
dtrain = xgb.DMatrix(X, y)
dtest = xgb.DMatrix(X, y)
# Set training parameters
xgb_params = {
    "eta": 0.3,
    "max_depth": 3,
    "objective": "multi:softprob",
    "num_class": 3,
    "eval_metric": "mlogloss",
}
steps = 20
# Create dict for evaluation results
evals_result = dict()
# Run training
model = xgb.train(
    xgb_params,
    dtrain,
    steps,
    evals=[(dtrain, "train")],
    evals_result=evals_result,
)
# Print evaluation results
print(f'Evals results:\n{evals_result}')
# Predict results
prediction = model.predict(dtest)
# Print prediction results
print(f'Prediction results:\n{prediction}')
Modin in the Cloud¶
Modin implements functionality that allows you to transfer computation to the cloud with minimal effort. Please note that this feature is experimental and its behavior or interfaces may change.
Prerequisites¶
Sign up with a cloud provider and obtain a credentials file. Note that only AWS is currently supported; more providers are planned. (AWS credentials file format)
Setup environment¶
pip install modin[remote]
This command installs the following dependencies:
RPyC - enables remote procedure calls.
Cloudpickle - enables pickling of functions and classes, which is used in our distributed runtime.
Boto3 - enables creating and setting up AWS cloud machines. An optional library for the Ray Autoscaler.
- Notes:
It also needs the Ray Autoscaler component, which is implicitly installed with Ray (note that Ray from conda is currently missing that component!). More information is available in the Ray docs.
Architecture¶

- Notes:
To get maximum performance, reduce the amount of data transferred between the local and remote environments as much as possible.
To ensure correct operation, the versions of all Python libraries (including the interpreter) must be the same in the local and remote environments.
Public interface¶
- exception modin.experimental.cloud.CannotDestroyCluster(*args, cause: Optional[BaseException] = None, traceback: Optional[str] = None, **kw)¶
Raised when cluster cannot be destroyed in the cloud
- exception modin.experimental.cloud.CannotSpawnCluster(*args, cause: Optional[BaseException] = None, traceback: Optional[str] = None, **kw)¶
Raised when cluster cannot be spawned in the cloud
- exception modin.experimental.cloud.ClusterError(*args, cause: Optional[BaseException] = None, traceback: Optional[str] = None, **kw)¶
Generic cluster operating exception
- modin.experimental.cloud.create_cluster(provider: Union[modin.experimental.cloud.cluster.Provider, str], credentials: Optional[str] = None, region: Optional[str] = None, zone: Optional[str] = None, image: Optional[str] = None, project_name: Optional[str] = None, cluster_name: str = 'modin-cluster', workers: int = 4, head_node: Optional[str] = None, worker_node: Optional[str] = None, add_conda_packages: Optional[list] = None, cluster_type: str = 'rayscale') modin.experimental.cloud.cluster.BaseCluster ¶
Creates an instance of a cluster with the desired characteristics in a cloud. Upon entering a context via a with statement, Modin will redirect its work to the remote cluster. The spawned cluster can be destroyed manually, or it will be destroyed when the program exits.
- Parameters
provider (str or instance of Provider class) – The name of the provider to use, or a Provider object. If a Provider object is given, then credentials, region and zone are ignored.
credentials (str, optional) – Path to the file which holds credentials used by the given cloud provider. If not specified, the cloud provider will use its default means of finding credentials on the system.
region (str, optional) – Region in the cloud where the cluster is spawned. If omitted, a default for the given provider is used.
zone (str, optional) – Availability zone (part of a region) where the cluster is spawned. If omitted, a default for the given provider and region is used.
image (str, optional) – Image to use for spawning head and worker nodes. If omitted, a default for the given provider is used.
project_name (str, optional) – Project name to assign to the cluster in the cloud, for easier manual tracking.
cluster_name (str, optional) – Name to be given to the cluster. To spawn multiple clusters in a single region and zone, use different names.
workers (int, optional) – How many worker nodes to spawn in the cluster. The head node is not counted here.
head_node (str, optional) – What machine type to use for the head node in the cluster.
worker_node (str, optional) – What machine type to use for worker nodes in the cluster.
add_conda_packages (list, optional) – Custom conda packages for remote environments. By default the remote Modin version is the same as the local version.
cluster_type (str, optional) – How to spawn the cluster. Currently only spawning by Ray autoscaler (“rayscale” for general and “omnisci” for OmniSci-based) is supported.
- Returns
The object that knows how to destroy the cluster and how to activate it as remote context. Note that by default spawning and destroying of the cluster happens in the background, as it’s usually a rather lengthy process.
- Return type
BaseCluster descendant
Notes
Cluster computation can work even when proxies are required to access the cloud. Set the normal “http_proxy”/”https_proxy” variables for HTTP/HTTPS proxies and the “MODIN_SOCKS_PROXY” variable for a SOCKS proxy before calling the function.
Using a SOCKS proxy requires Ray newer than 0.8.6, which might need to be installed manually.
- modin.experimental.cloud.get_connection()¶
Returns an RPyC connection object to execute Python code remotely on the active cluster.
Usage examples¶
"""
This is a very basic sample script for running things remotely.
It requires `aws_credentials` file to be present in current working directory.
On credentials file format see https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html#cli-configure-files-where
"""
import logging
import modin.pandas as pd
from modin.experimental.cloud import cluster
# set up verbose logging so Ray autoscaler would print a lot of things
# and we'll see that stuff is alive and kicking
logging.basicConfig(format="%(asctime)s %(message)s")
logger = logging.getLogger()
logger.setLevel(logging.DEBUG)
example_cluster = cluster.create("aws", "aws_credentials")
with example_cluster:
    remote_df = pd.DataFrame([1, 2, 3, 4])
    print(len(remote_df))  # len() is executed remotely
Some more examples can be found in the examples/cluster folder.
Modin vs. pandas¶
Modin exposes the pandas API through modin.pandas
, but it does not inherit the same
pitfalls and design decisions that make it difficult to scale. This page will discuss
how Modin’s dataframe implementation differs from pandas, and how Modin scales pandas.
Scalability of implementation¶
The pandas implementation is inherently single-threaded. This means that only one of your CPU cores can be utilized at any given time. In a laptop, it would look something like this with pandas:

However, Modin’s implementation enables you to use all of the cores on your machine, or all of the cores in an entire cluster. On a laptop, it will look something like this:

The additional utilization leads to improved performance. However, if you want to scale to an entire cluster, Modin suddenly looks something like this:

Modin is able to efficiently make use of all of the hardware available to it!
Memory usage and immutability¶
The pandas API contains many cases of “inplace” updates, which are known to be controversial. This is due in part to the way pandas manages memory: the user may think they are saving memory, but pandas is usually copying the data whether an operation was inplace or not.
Modin allows for inplace semantics, but the underlying data structures within Modin’s implementation are immutable, unlike pandas. This immutability gives Modin the ability to internally chain operators and better manage memory layouts, because they will not be changed. This leads to improvements over pandas in memory usage in many common cases, due to the ability to share common memory blocks among all dataframes.
Modin provides the inplace semantics by having a mutable pointer to the immutable internal Modin dataframe. This pointer can change, but the underlying data cannot, so when an inplace update is triggered, Modin will treat it as if it were not inplace and just update the pointer to the resulting Modin dataframe.
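The pointer-swap idea above can be sketched in plain Python. This is an illustrative model under stated assumptions, not Modin's actual classes: an immutable frame whose operations return new frames, plus a handle whose "inplace" update merely repoints.

```python
# Illustrative sketch (not Modin internals): inplace semantics on top of
# immutable storage via a mutable pointer to an immutable frame.
class ImmutableFrame:
    def __init__(self, data):
        self._data = tuple(data)  # immutable storage

    def with_appended(self, value):
        # Every operation returns a new frame; the original is untouched.
        return ImmutableFrame(self._data + (value,))

class FrameHandle:
    def __init__(self, frame):
        self._frame = frame  # mutable pointer to an immutable frame

    def append(self, value, inplace=False):
        result = self._frame.with_appended(value)
        if inplace:
            self._frame = result  # repoint; the old frame stays shareable
            return None
        return FrameHandle(result)

df = FrameHandle(ImmutableFrame([1, 2, 3]))
snapshot = df._frame        # another reference to the immutable frame
df.append(4, inplace=True)  # "inplace" just swaps the pointer
```

Because the old frame is never mutated, any other holder of `snapshot` still sees the original data, which is what allows memory blocks to be shared among dataframes.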
API vs implementation¶
It is well known that the pandas API contains many duplicate ways of performing the same operation. Modin instead enforces that any one behavior have one and only one implementation internally. This guarantee enables Modin to focus on and optimize a smaller code footprint while still guaranteeing that it covers the entire pandas API. Modin has an internal algebra, which is roughly 15 operators, narrowed down from the original >200 that exist in pandas. The algebra is grounded in both practical and theoretical work. Learn more in our VLDB 2020 paper. More information about this algebra can be found in the System Architecture documentation.
Modin vs. Dask Dataframe¶
Dask’s Dataframe is effectively a meta-frame, partitioning and scheduling many smaller
pandas.DataFrame
objects. The Dask DataFrame does not implement the entire pandas
API, and it isn’t trying to. See this explained in the Dask DataFrame documentation.
The TL;DR is that Modin’s API is identical to pandas, whereas Dask’s is not. Note: The projects are fundamentally different in their aims, so a fair comparison is challenging.
API¶
The API of Modin and Dask are different in several ways, explained here.
Dask DataFrame¶
Dask is currently missing multiple APIs from pandas that Modin has implemented. Of note:
Dask does not implement iloc
, MultiIndex
, apply(axis=0)
, quantile
(approximate quantile is available), median
, and more. Some of these APIs cannot be
implemented efficiently or at all given the architecture design tradeoffs made in Dask’s
implementation, and others simply require engineering effort. iloc
, for example, can
be implemented, but it would be inefficient, and apply(axis=0)
cannot be implemented
at all in Dask’s architecture.
Dask DataFrames API is also different from the pandas API in that it is lazy and needs
.compute()
calls to materialize the DataFrame. This makes the API less convenient
but allows Dask to do certain query optimizations/rearrangement, which can give speedups
in certain situations. Several additional APIs exist in the Dask DataFrame API that
expose internal state about how the data is chunked and other data layout details, and
ways to manipulate that state.
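The eager-versus-lazy distinction can be sketched in a few lines of plain Python (illustrative only; this is not Dask's API): operations record themselves in a deferred pipeline, and nothing executes until an explicit compute() call.

```python
# Illustrative sketch of lazy evaluation: map() records the operation instead
# of running it, and compute() materializes the whole pipeline at once.
class Lazy:
    def __init__(self, thunk):
        self._thunk = thunk

    def map(self, fn):
        # Build a deferred step; no work happens here.
        return Lazy(lambda: [fn(x) for x in self._thunk()])

    def compute(self):
        # Only now does the recorded pipeline actually run.
        return self._thunk()

pipeline = Lazy(lambda: [1, 2, 3]).map(lambda x: x * 10)
result = pipeline.compute()
```

Deferring execution this way is what gives a lazy system room to reorder or fuse steps before running them, at the cost of requiring the explicit materialization call.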
Semantically, Dask sorts the index
, which does not allow for user-specified order.
In Dask’s case, this was done for optimization purposes, to speed up other computations
which involve the row index.
Modin¶
Modin is targeted toward parallelizing the entire pandas API, without exception. As the pandas API continues to evolve, so will Modin’s pandas API. Modin is intended to be used as a drop-in replacement for pandas, such that even if the API is not yet parallelized, it still works by falling back to running pandas. One of the key features of being a drop-in replacement is that not only does it work for existing code, but a user who wishes to go back to running pandas directly may do so at no cost. There’s no lock-in: Modin notebooks can be converted to and from pandas as the user prefers.
In the long-term, Modin is planned to become a data science framework that supports all popular APIs (SQL, pandas, etc.) with the same underlying execution.
Architecture¶
The differences in Modin and Dask’s architectures are explained in this section.
Dask DataFrame¶
Dask DataFrame uses row-based partitioning, similar to Spark. This can be seen in their documentation. They also have a custom index object for indexing into the object, which is not pandas compatible. Dask DataFrame seems to treat operations on the DataFrame as MapReduce operations, which is a good paradigm for the subset of the pandas API they have chosen to implement, but makes certain operations impossible. Dask Dataframe is also lazy and places a lot of partitioning responsibility on the user.
Modin¶
Modin’s partitioning is much more flexible, so the system can scale in both directions and have finer-grained partitioning. This is explained at a high level in Modin’s documentation. Because we have this finer-grained control over the partitioning, we can support a number of operations that are very challenging in MapReduce systems (e.g. transpose, median, quantile). This flexibility in partitioning also gives Modin tremendous power to implement efficient straggler mitigation and improvements in utilization over the entire cluster.
Modin is also architected to run on a variety of systems. The goal is that users can take the same notebook to different clusters or different environments and it will still just work, running on whatever hardware is available. Modin supports running on Dask’s compute engine in addition to Ray. The architecture of Modin is extremely modular; because of this modularity, we are able to add different execution engines or compile to different memory formats. Modin can run on a Dask cluster in the same way that Dask DataFrame can, but they will still be different in all of the ways described above.
Modin’s implementation is grounded in theory, which is what enables us to implement the entire pandas API.
Modin vs. Koalas and Spark¶
Coming Soon…
Supported APIs and Defaulting to pandas¶
For your convenience, we have compiled a list of currently implemented APIs and methods available in Modin. This documentation is updated as new methods and APIs are merged into the master branch, so it may not correspond exactly to the most recent release. In order to install the latest version of Modin, follow the directions found on the installation page.
Questions on implementation details¶
If you have a question about the implementation details or would like more information about an API or method in Modin, please contact the Modin developer mailing list.
Defaulting to pandas¶
The remaining unimplemented methods default to pandas. This allows users to continue using Modin even though their workloads contain functions not yet implemented in Modin. Here is a diagram of how we convert to pandas and perform the operation:

We first convert to a pandas DataFrame, then perform the operation. There is a performance penalty for going from a partitioned Modin DataFrame to pandas because of the communication cost and single-threaded nature of pandas. Once the pandas operation has completed, we convert the DataFrame back into a partitioned Modin DataFrame. This way, operations performed after something defaults to pandas will be optimized with Modin.
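The round-trip described above can be sketched in plain Python (the function name and the list-of-lists stand-in for partitions are illustrative, not Modin internals): gather the partitions, run the single-threaded operation, then re-partition the result.

```python
# Hypothetical sketch of the "default to pandas" pattern. `partitions` stands
# in for a partitioned Modin DataFrame; `pandas_op` is the single-threaded op.
def default_to_pandas(partitions, pandas_op):
    # 1. Gather distributed partitions into one in-memory list (the costly step).
    materialized = [row for part in partitions for row in part]
    # 2. Run the single-threaded operation on the whole data.
    result = pandas_op(materialized)
    # 3. Re-partition so that later operations are parallel again.
    chunk = max(1, len(result) // len(partitions))
    return [result[i:i + chunk] for i in range(0, len(result), chunk)]

parts = [[3, 1], [2, 4]]
new_parts = default_to_pandas(parts, sorted)  # a "not yet parallel" sort
```

The performance penalty lives entirely in step 1 and step 2; step 3 is what lets subsequent operations run in parallel again.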
The exact methods we have implemented are listed in the respective subsections:
We have taken a community-driven approach to implementing new methods. We did a study on pandas usage to learn what the most-used APIs are. Modin currently supports 93% of the pandas API based on our study of pandas usage, and we are actively expanding the API.
pd.DataFrame supported APIs¶
The following table lists both implemented and not implemented methods. If you have need of an operation that is listed as not implemented, feel free to open an issue on the GitHub repository, or give a thumbs up to already created issues. Contributions are also welcome!
The following table is structured as follows: the first column contains the method name. The second column is a flag for whether or not there is an implementation in Modin for the method in the left column. Y stands for yes, N stands for no, P stands for partial (meaning some parameters may not be supported yet), and D stands for default to pandas.
DataFrame method | pandas Doc link | Implemented? (Y/N/P/D) | Notes for Current implementation
[Table body omitted: the method-name and doc-link columns were lost in extraction, leaving only the implementation flags. Surviving notes include: correlation is available only for the pearson method and its floating point precision may slightly differ from pandas (other methods default to pandas); covariance precision may also slightly differ; iteration methods are not parallelized in Python; sorting and merging shuffle data; to_pickle defaults to pandas, with an experimental to_pickle_distributed implementation available.]
pd.Series supported APIs¶
The following table lists both implemented and not implemented methods. If you have need of an operation that is listed as not implemented, feel free to open an issue on the GitHub repository, or give a thumbs up to already created issues. Contributions are also welcome!
The following table is structured as follows: the first column contains the method name. The second column is a flag for whether or not there is an implementation in Modin for the method in the left column. Y stands for yes, N stands for no, P stands for partial (meaning some parameters may not be supported yet), and D stands for default to pandas. To learn more about the implementations that default to pandas, see the related section on Defaulting to pandas.
Series method | Modin Implementation? (Y/N/P/D) | Notes for Current implementation
[Table body omitted: the method-name column was lost in extraction, leaving only the implementation flags. Surviving notes include: correlation is available only for the pearson method and its floating point precision may slightly differ from pandas (other methods default to pandas); covariance precision may also slightly differ; the indices order of one method’s resulting object may differ from pandas.]
pandas Utilities Supported¶
If you import modin.pandas as pd, the following operations are available from pd.<op>, e.g. pd.concat. If you do not see an operation that pandas enables and would like to request it, feel free to open an issue. Make sure you tell us your primary use-case so we can make it happen faster!
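As a minimal sketch of one such module-level utility, the snippet below uses plain pandas so it runs without Modin installed; with Modin, only the import line would change to `import modin.pandas as pd`, since the API is identical.

```python
# Module-level utilities such as pd.concat work on Modin objects exactly
# as they do in pandas. Plain pandas is used here so the snippet is
# self-contained; with Modin installed, only the import changes.
import pandas as pd  # with Modin: import modin.pandas as pd

left = pd.DataFrame({"x": [1, 2]})
right = pd.DataFrame({"x": [3, 4]})

combined = pd.concat([left, right], ignore_index=True)
print(combined["x"].tolist())  # [1, 2, 3, 4]
```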
This table follows the same structure and Y/N/P/D legend as the table above.
Utility method |
Modin Implementation? (Y/N/P/D) |
Notes for Current implementation |
(method name lost) | Y | The indices order of resulting object may differ from pandas. |
(The remaining rows of this table lost their method-name column during extraction; only a mix of Y and D flags survives.)
Other objects & structures¶
The following objects are not currently distributed by Modin. All of these objects are compatible with the distributed components of Modin. If you are interested in contributing a distributed version of any of these objects, feel free to open a pull request.
Panel
Index
MultiIndex
CategoricalIndex
DatetimeIndex
Timedelta
Timestamp
NaT
PeriodIndex
Categorical
Interval
UInt8Dtype
UInt16Dtype
UInt32Dtype
UInt64Dtype
SparseDtype
Int8Dtype
Int16Dtype
Int32Dtype
Int64Dtype
CategoricalDtype
DatetimeTZDtype
IntervalDtype
PeriodDtype
RangeIndex
Int64Index
UInt64Index
Float64Index
TimedeltaIndex
IntervalIndex
IndexSlice
TimeGrouper
Grouper
array
Period
DateOffset
ExcelWriter
SparseArray
SparseSeries
SparseDataFrame
pd.read_<file> and I/O APIs¶
A number of IO methods default to pandas. We have parallelized read_csv and read_parquet, though many of the remaining methods can be relatively easily parallelized. Some of the operations default to the pandas implementation, meaning the data will be read in serially as a single, non-distributed DataFrame and then distributed. Performance will be affected by this.
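As a sketch of the call pattern, the snippet below uses plain pandas so it is self-contained; with Modin installed the identical `read_csv` call is parallelized, and only the import line differs.

```python
# read_csv has the same signature under modin.pandas as under pandas;
# Modin parallelizes the read, while pandas performs it serially.
import csv
import os
import tempfile

import pandas as pd  # with Modin: import modin.pandas as pd

# Write a small CSV file to read back.
with tempfile.NamedTemporaryFile(
    "w", suffix=".csv", delete=False, newline=""
) as f:
    writer = csv.writer(f)
    writer.writerow(["a", "b"])
    writer.writerows([[1, 2], [3, 4]])
    path = f.name

df = pd.read_csv(path)  # parallelized by Modin; serial in pandas
print(df.shape)  # (2, 2)
os.unlink(path)
```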
This table follows the same structure and Y/N/P/D legend as the tables above.
IO method |
Modin Implementation? (Y/N/P/D) |
Notes for Current implementation |
(method name lost) | P | Implemented for |
(method name lost) | D | Experimental implementation: read_pickle_distributed |
(The remaining rows of this table lost their method-name column during extraction; read_csv and read_parquet are parallelized (Y), and most of the other IO methods default to pandas (D).)
Contributing¶
Getting Started¶
If you’re interested in getting involved in the development of Modin, but aren’t sure where to start, take a look at the issues tagged Good first issue or Documentation. These are issues that would be good for getting familiar with the codebase and better understanding some of the more complex components of the architecture. There is documentation here about the architecture that you will want to review in order to get started.
Also, feel free to join the discussions on the developer mailing list.
Certificate of Origin¶
To keep a clear track of who did what, we use a sign-off procedure on patches and pull requests that are being sent (the same requirements for using the signed-off-by process as the Linux kernel has: https://www.kernel.org/doc/html/v4.17/process/submitting-patches.html). The sign-off is a simple line at the end of the explanation for the patch, which certifies that you wrote it or otherwise have the right to pass it on as an open-source patch. The rules are pretty simple: you can contribute if you can certify the below:
CERTIFICATE OF ORIGIN V 1.1¶
“By making a contribution to this project, I certify that:
1. The contribution was created in whole or in part by me and I have the right to submit it under the open source license indicated in the file; or
2. The contribution is based upon previous work that, to the best of my knowledge, is covered under an appropriate open source license and I have the right under that license to submit that work with modifications, whether created in whole or in part by me, under the same open source license (unless I am permitted to submit under a different license), as indicated in the file; or
3. The contribution was provided directly to me by some other person who certified (a), (b) or (c) and I have not modified it.
4. I understand and agree that this project and the contribution are public and that a record of the contribution (including all personal information I submit with it, including my sign-off) is maintained indefinitely and may be redistributed consistent with this project or the open source license(s) involved.”
This is my commit message
Signed-off-by: Awesome Developer <developer@example.org>
Code without a proper sign-off cannot be merged into the master branch. Note: you must use your real name (sorry, no pseudonyms or anonymous contributions).
The text can either be manually added to your commit body, or you can add either -s or --signoff to your usual git commit commands:
git commit --signoff
git commit -s
This will use your default git configuration, which is found in .git/config. To change this, you can use the following commands:
git config --global user.name "Awesome Developer"
git config --global user.email "awesome.developer@example.org"
If you have authored a commit that is missing the signed-off-by line, you can amend your commits and push them to GitHub.
git commit --amend --signoff
If you’ve pushed your changes to GitHub already, you’ll need to force push your branch after this with git push -f.
Commit Message formatting¶
To ensure that all commit messages in the master branch follow a specific format, we enforce the following format for all commit messages:
FEAT-#9999: Add `DataFrame.rolling` functionality, to enable rolling window operations
The FEAT
component represents the type of commit. This component of the commit
message can be one of the following:
FEAT: A new feature that is added
DOCS: Documentation improvements or updates
FIX: A bugfix contribution
REFACTOR: Moving or removing code without change in functionality
TEST: Test updates or improvements
The #9999
component of the commit message should be the issue number in the Modin
GitHub issue tracker: https://github.com/modin-project/modin/issues. This is important
because it links commits to their issues.
The commit message should follow the colon (:) and be descriptive and succinct.
General Rules for committers¶
Try to give the PR a name that is as descriptive as possible.
Try to keep PRs as small as possible. One PR should be making one semantically atomic change.
Don’t merge your own PRs even if you are technically able to do it.
Development Dependencies¶
We recommend doing development in a virtualenv or conda environment, though this decision is ultimately yours. You will want to run the following in order to install all of the required dependencies for running the tests and formatting the code:
conda env create --file environment-dev.yml
# or
pip install -r requirements-dev.txt
Code Formatting and Lint¶
We use black for code formatting. Before you submit a pull request, please make sure that you run the following from the project root:
black modin/
We also use flake8 to check linting errors. Running the following from the project root will ensure that it passes the lint checks on GitHub Actions:
flake8 .
We test that this has been run on our GitHub Actions test suite. If you do this and find that the tests are still failing, try updating your versions of black and flake8.
Adding a test¶
If you find yourself fixing a bug or adding a new feature, don’t forget to add a test to the test suite to verify its correctness! More on testing and the layout of the tests can be found in our testing documentation. We ask that you follow the existing structure of the tests for ease of maintenance.
Running the tests¶
To run the entire test suite, run the following from the project root:
pytest modin/pandas/test
The test suite is very large, and may take a long time if you run every test. If you’ve only modified a small amount of code, it may be sufficient to run a single test or some subset of the test suite. In order to run a specific test run:
pytest modin/pandas/test -k "test_new_functionality"
The entire test suite is automatically run for each pull request.
Performance measurement¶
We use the Asv tool for performance tracking of various Modin functionality. The results can be viewed here: Asv dashboard.
More information can be found in the Asv readme.
Building documentation¶
To build the documentation, please follow the steps below from the project root:
cd docs
pip install -r requirements-doc.txt
sphinx-build -b html . build
To visualize the documentation locally, run the following from the build folder:
python -m http.server <port>
# python -m http.server 1234
then open the browser at 0.0.0.0:<port> (e.g. 0.0.0.0:1234).
Contributing a new execution framework or in-memory format¶
If you are interested in contributing support for a new execution framework or in-memory format, please make sure you understand the architecture of Modin.
The best place to start the discussion for adding a new execution framework or in-memory format is the developer mailing list.
More docs on this coming soon…
System Architecture¶
In this section, we will lay out the overall system architecture for Modin, as well as go into detail about the component design, implementation and other important details. This document also contains important reference information for those interested in contributing new functionality, bugfixes and enhancements.
High-Level Architectural View¶
The diagram below outlines the general layered view to the components of Modin with a short description of each major section of the documentation following.

Modin is logically separated into different layers that represent the hierarchy of a typical Database Management System. Abstracting out each component allows us to individually optimize and swap out components without affecting the rest of the system. We can implement, for example, new compute kernels that are optimized for a certain type of data and can simply plug it in to the existing infrastructure by implementing a small interface. It can still be distributed by our choice of compute engine with the logic internally.
System View¶
If we look at the overall class structure of the Modin system from the very top, it looks something like this:

The user (a data scientist) interacts with the Modin system by sending interactive or batch commands through the API, and Modin executes them using various backend execution engines: Ray, Dask and MPI are currently supported.
Subsystem/Container View¶
If we drill down to the next level of detail, we will see that inside Modin the layered architecture is implemented using several interacting components:

For simplicity, the other backend systems (Dask and MPI) are omitted and only the Ray backend is shown.
The Dataframe subsystem is the backbone of dataframe storage and query compilation. It is responsible for dispatching ingress/egress to the appropriate module, receiving the pandas API calls and invoking the query compiler to convert the calls to the internal intermediate Dataframe Algebra.
The Data Ingress/Egress module works in conjunction with the Dataframe and Partitions subsystems to read data split into partitions and send data to the appropriate node for storage.
The Query Planner is the subsystem that translates the pandas API to an intermediate Dataframe Algebra representation (a DAG) and performs an initial set of optimizations.
The Query Executor is responsible for taking the Dataframe Algebra DAG, performing further optimizations based on the selected backend execution subsystem, and mapping or compiling the Dataframe Algebra DAG to an actual execution sequence.
The Backends module is responsible for mapping an abstract operation to an actual executor call, e.g. pandas, PyArrow, or a custom backend.
The Orchestration subsystem is responsible for spawning and controlling the actual execution environment for the selected backend. It spawns the actual nodes, fires up the execution environment (e.g. Ray), monitors the state of executors, and provides telemetry.
Component View¶
Base Frame Objects¶
Modin partitions data to scale efficiently.
To keep track of everything, a few key classes are introduced: Frame, Partition, AxisPartition and PartitionManager.
Frame is the class conforming to DataFrame Algebra.
Partition is an element of an NxM grid which, when combined, represents the Frame.
AxisPartition is a joined group of Partition-s along some axis (either rows or labels).
PartitionManager is the manager that implements the primitives used for DataFrame Algebra operations over Partition-s.
PandasFrame¶
The class is base for any frame class of pandas
backend and serves as the intermediate level
between pandas
query compiler and conforming partition manager. All queries formed
at the query compiler layer are ingested by this class and then conveyed jointly with the stored partitions
into the partition manager for processing. Direct partitions manipulation by this class is prohibited except
cases if an operation is striclty private or protected and called inside of the class only. The class provides
significantly reduced set of operations that fit plenty of pandas operations.
Main tasks of PandasFrame
are storage of partitions, manipulation with labels of axes and
providing set of methods to perform operations on the internal data.
As mentioned above, PandasFrame shouldn’t work with stored partitions directly; the responsibility for modifying the partitions array lies with PandasFramePartitionManager. For example, the method broadcast_apply_full_axis() redirects applying a function to the PandasFramePartitionManager.broadcast_axis_partitions method.
PandasFrame can be created from pandas.DataFrame or pyarrow.Table (via the methods from_pandas() and from_arrow(), respectively). Also, PandasFrame can be converted to np.array or pandas.DataFrame (via the methods to_numpy() and to_pandas(), respectively).
Manipulation of axis labels happens through internal methods for replacing labels with new ones, adding prefixes/suffixes, etc.
Public API¶
- class modin.engines.base.frame.data.PandasFrame(partitions, index, columns, row_lengths=None, column_widths=None, dtypes=None)¶
An abstract class that represents the parent class for any pandas backend dataframe class.
This class provides interfaces to run operations on dataframe partitions.
- Parameters
partitions (np.ndarray) – A 2D NumPy array of partitions.
index (sequence) – The index for the dataframe. Converted to a
pandas.Index
.columns (sequence) – The columns object for the dataframe. Converted to a
pandas.Index
.row_lengths (list, optional) – The length of each partition in the rows. The “height” of each of the block partitions. Is computed if not provided.
column_widths (list, optional) – The width of each partition in the columns. The “width” of each of the block partitions. Is computed if not provided.
dtypes (pandas.Series, optional) – The data types for the dataframe columns.
- add_prefix(prefix, axis)¶
Add a prefix to the current row or column labels.
- Parameters
prefix (str) – The prefix to add.
axis (int) – The axis to update.
- Returns
A new dataframe with the updated labels.
- Return type
- add_suffix(suffix, axis)¶
Add a suffix to the current row or column labels.
- Parameters
suffix (str) – The suffix to add.
axis (int) – The axis to update.
- Returns
A new dataframe with the updated labels.
- Return type
- apply_full_axis(axis, func, new_index=None, new_columns=None, dtypes=None)¶
Perform a function across an entire axis.
- Parameters
axis ({0, 1}) – The axis to apply over (0 - rows, 1 - columns).
func (callable) – The function to apply.
new_index (list-like, optional) – The index of the result. We may know this in advance, and if not provided it must be computed.
new_columns (list-like, optional) – The columns of the result. We may know this in advance, and if not provided it must be computed.
dtypes (list-like, optional) – The data types of the result. This is an optimization because there are functions that always result in a particular data type, and allows us to avoid (re)computing it.
- Returns
A new dataframe.
- Return type
Notes
The data shape may change as a result of the function.
- apply_full_axis_select_indices(axis, func, apply_indices=None, numeric_indices=None, new_index=None, new_columns=None, keep_remaining=False)¶
Apply a function across an entire axis for a subset of the data.
- Parameters
axis (int) – The axis to apply over.
func (callable) – The function to apply.
apply_indices (list-like, default: None) – The labels to apply over.
numeric_indices (list-like, default: None) – The indices to apply over.
new_index (list-like, optional) – The index of the result. We may know this in advance, and if not provided it must be computed.
new_columns (list-like, optional) – The columns of the result. We may know this in advance, and if not provided it must be computed.
keep_remaining (boolean, default: False) – Whether or not to drop the data that is not computed over.
- Returns
A new dataframe.
- Return type
- apply_select_indices(axis, func, apply_indices=None, row_indices=None, col_indices=None, new_index=None, new_columns=None, keep_remaining=False, item_to_distribute=None)¶
Apply a function for a subset of the data.
- Parameters
axis ({0, 1}) – The axis to apply over.
func (callable) – The function to apply.
apply_indices (list-like, default: None) – The labels to apply over. Must be given if axis is provided.
row_indices (list-like, default: None) – The row indices to apply over. Must be provided with col_indices to apply over both axes.
col_indices (list-like, default: None) – The column indices to apply over. Must be provided with row_indices to apply over both axes.
new_index (list-like, optional) – The index of the result. We may know this in advance, and if not provided it must be computed.
new_columns (list-like, optional) – The columns of the result. We may know this in advance, and if not provided it must be computed.
keep_remaining (boolean, default: False) – Whether or not to drop the data that is not computed over.
item_to_distribute ((optional)) – The item to split up so it can be applied over both axes.
- Returns
A new dataframe.
- Return type
- astype(col_dtypes)¶
Convert the columns dtypes to given dtypes.
- Parameters
col_dtypes (dictionary of {col: dtype,...}) – Where col is the column name and dtype is a NumPy dtype.
- Returns
Dataframe with updated dtypes.
- Return type
BaseDataFrame
- property axes¶
Get index and columns that can be accessed with an axis integer.
- Returns
List with two values: index and columns.
- Return type
list
- binary_op(op, right_frame, join_type='outer')¶
Perform an operation that requires joining with another Modin DataFrame.
- Parameters
op (callable) – Function to apply after the join.
right_frame (PandasFrame) – Modin DataFrame to join with.
join_type (str, default: "outer") – Type of join to apply.
- Returns
New Modin DataFrame.
- Return type
- broadcast_apply(axis, func, other, join_type='left', preserve_labels=True, dtypes=None)¶
Broadcast axis partitions of other to partitions of self and apply a function.
- Parameters
axis ({0, 1}) – Axis to broadcast over.
func (callable) – Function to apply.
other (PandasFrame) – Modin DataFrame to broadcast.
join_type (str, default: "left") – Type of join to apply.
preserve_labels (bool, default: True) – Whether to keep labels from the self Modin DataFrame or not.
dtypes ("copy" or None, default: None) – Whether to keep old dtypes or infer new dtypes from the data.
- Returns
New Modin DataFrame.
- Return type
- broadcast_apply_full_axis(axis, func, other, new_index=None, new_columns=None, apply_indices=None, enumerate_partitions=False, dtypes=None)¶
Broadcast partitions of other Modin DataFrame and apply a function along full axis.
- Parameters
axis ({0, 1}) – Axis to apply over (0 - rows, 1 - columns).
func (callable) – Function to apply.
other (PandasFrame or list) – Modin DataFrame(s) to broadcast.
new_index (list-like, optional) – Index of the result. We may know this in advance, and if not provided it must be computed.
new_columns (list-like, optional) – Columns of the result. We may know this in advance, and if not provided it must be computed.
apply_indices (list-like, default: None) – Indices of axis ^ 1 to apply function over.
enumerate_partitions (bool, default: False) – Whether to pass the partition index into the applied func or not. Note that func must be able to accept a partition_idx kwarg.
dtypes (list-like, default: None) – Data types of the result. This is an optimization because there are functions that always result in a particular data type, and allows us to avoid (re)computing it.
- Returns
New Modin DataFrame.
- Return type
- broadcast_apply_select_indices(axis, func, other, apply_indices=None, numeric_indices=None, keep_remaining=False, broadcast_all=True, new_index=None, new_columns=None)¶
Apply a function to select indices at specified axis and broadcast partitions of other Modin DataFrame.
- Parameters
axis ({0, 1}) – Axis to apply function along.
func (callable) – Function to apply.
other (PandasFrame) – Partitions of which should be broadcasted.
apply_indices (list, default: None) – List of labels to apply (if numeric_indices are not specified).
numeric_indices (list, default: None) – Numeric indices to apply (if apply_indices are not specified).
keep_remaining (bool, default: False) – Whether to drop the data that is not computed over or not.
broadcast_all (bool, default: True) – Whether to broadcast the whole axis of the right frame to every partition or just a subset of it.
new_index (pandas.Index, optional) – Index of the result. We may know this in advance, and if not provided it must be computed.
new_columns (pandas.Index, optional) – Columns of the result. We may know this in advance, and if not provided it must be computed.
- Returns
New Modin DataFrame.
- Return type
- property columns¶
Get the columns from the cache object.
- Returns
An index object containing the column labels.
- Return type
pandas.Index
- classmethod combine_dtypes(list_of_dtypes, column_names)¶
Describe how data types should be combined when they do not match.
- Parameters
list_of_dtypes (list) – A list of pandas Series with the data types.
column_names (list) – The names of the columns that the data types map to.
- Returns
A pandas Series containing the finalized data types.
- Return type
pandas.Series
- concat(axis, others, how, sort)¶
Concatenate self with one or more other Modin DataFrames.
- Parameters
axis ({0, 1}) – Axis to concatenate over.
others (list) – List of Modin DataFrames to concatenate with.
how (str) – Type of join to use for the axis.
sort (bool) – Whether to sort the result or not.
- Returns
New Modin DataFrame.
- Return type
- copy()¶
Copy this object.
- Returns
A copied version of this object.
- Return type
- property dtypes¶
Compute the data types if they are not cached.
- Returns
A pandas Series containing the data types for this dataframe.
- Return type
pandas.Series
- filter_full_axis(axis, func)¶
Filter data based on the function provided along an entire axis.
- Parameters
axis (int) – The axis to filter over.
func (callable) – The function to use for the filter. This function should filter the data itself.
- Returns
A new filtered dataframe.
- Return type
- finalize()¶
Perform all deferred calls on partitions.
This makes self Modin Dataframe independent of a history of queries that were used to build it.
- fold(axis, func)¶
Perform a function across an entire axis.
- Parameters
axis (int) – The axis to apply over.
func (callable) – The function to apply.
- Returns
A new dataframe.
- Return type
Notes
The data shape is not changed (length and width of the table).
- fold_reduce(axis, func)¶
Apply a function that reduces the Frame Manager to a Series but requires knowledge of the full axis.
- Parameters
axis ({0, 1}) – The axis to apply the function to (0 - index, 1 - columns).
func (callable) – The function to reduce the Manager by. This function takes in a Manager.
- Returns
Modin series (1xN frame) containing the reduced data.
- Return type
- classmethod from_arrow(at)¶
Create a Modin DataFrame from an Arrow Table.
- Parameters
at (pyarrow.table) – Arrow Table.
- Returns
New Modin DataFrame.
- Return type
- from_labels() modin.engines.base.frame.data.PandasFrame ¶
Convert the row labels to a column of data, inserted at the first position.
Gives a result similar to pandas.DataFrame.reset_index: each level of self.index will be added as a separate column of data.
- Returns
A PandasFrame with new columns from index labels.
- Return type
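Since from_labels is defined to mirror pandas.DataFrame.reset_index, its behavior can be illustrated at the pandas level; the snippet below is a sketch against plain pandas, not Modin internals.

```python
# from_labels mirrors pandas.DataFrame.reset_index: every level of the
# index becomes an ordinary data column inserted at the front.
import pandas as pd

df = pd.DataFrame({"v": [10, 20]}, index=pd.Index(["a", "b"], name="k"))

flat = df.reset_index()
print(list(flat.columns))  # ['k', 'v']
print(flat["k"].tolist())  # ['a', 'b']
```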
- classmethod from_pandas(df)¶
Create a Modin DataFrame from a pandas DataFrame.
- Parameters
df (pandas.DataFrame) – A pandas DataFrame.
- Returns
New Modin DataFrame.
- Return type
- groupby_reduce(axis, by, map_func, reduce_func, new_index=None, new_columns=None, apply_indices=None)¶
Group by another Modin DataFrame and aggregate the result.
- Parameters
axis ({0, 1}) – Axis to groupby and aggregate over.
by (PandasFrame or None) – A Modin DataFrame to group by.
map_func (callable) – Map component of the aggregation.
reduce_func (callable) – Reduce component of the aggregation.
new_index (pandas.Index, optional) – Index of the result. We may know this in advance, and if not provided it must be computed.
new_columns (pandas.Index, optional) – Columns of the result. We may know this in advance, and if not provided it must be computed.
apply_indices (list-like, default: None) – Indices of axis ^ 1 to apply groupby over.
- Returns
New Modin DataFrame.
- Return type
- property index¶
Get the index from the cache object.
- Returns
An index object containing the row labels.
- Return type
pandas.Index
- map(func, dtypes=None)¶
Perform a function that maps across the entire dataset.
- Parameters
func (callable) – The function to apply.
dtypes (dtypes of the result, default: None) – The data types for the result. This is an optimization because there are functions that always result in a particular data type, and this allows us to avoid (re)computing it.
- Returns
A new dataframe.
- Return type
- map_reduce(axis, map_func, reduce_func=None)¶
Apply function that will reduce the data to a pandas Series.
- Parameters
axis ({0, 1}) – 0 for columns and 1 for rows.
map_func (callable) – Callable function to map the dataframe.
reduce_func (callable, default: None) – Callable function to reduce the dataframe. If None, then map_func is applied twice.
- Returns
A new dataframe.
- Return type
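The map-reduce pattern this method implements can be sketched in plain pandas over two manually split partitions; the helper names here are illustrative, not Modin internals.

```python
# Sketch of map_reduce: a map function produces a partial result per
# partition, then a reduce function combines the partials into the
# final Series-shaped answer. The partition split is done by hand.
import pandas as pd

partitions = [pd.DataFrame({"a": [1, 2]}), pd.DataFrame({"a": [3, 4]})]

map_func = lambda part: part.sum()  # per-partition partial sums

def reduce_func(partials):
    # Line up the partial Series side by side and reduce across them.
    return pd.concat(partials, axis=1).sum(axis=1)

partials = [map_func(p) for p in partitions]
total = reduce_func(partials)
print(int(total["a"]))  # 10
```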
- mask(row_indices=None, row_numeric_idx=None, col_indices=None, col_numeric_idx=None)¶
Lazily select columns or rows from given indices.
- Parameters
row_indices (list of hashable, optional) – The row labels to extract.
row_numeric_idx (list of int, optional) – The row indices to extract.
col_indices (list of hashable, optional) – The column labels to extract.
col_numeric_idx (list of int, optional) – The column indices to extract.
- Returns
A new PandasFrame from the mask provided.
- Return type
Notes
If both row_indices and row_numeric_idx are set, row_indices will be used. The same rule applies to col_indices and col_numeric_idx.
- numeric_columns(include_bool=True)¶
Return the names of numeric columns in the frame.
- Parameters
include_bool (bool, default: True) – Whether to consider boolean columns as numeric.
- Returns
List of column names.
- Return type
list
- synchronize_labels(axis=None)¶
Synchronize labels by lazily applying the index object for the specified axis to self._partitions.
Adds a set_axis function to the call queue of each partition in self._partitions to apply the new axis.
- Parameters
axis (int, default: None) – The axis to apply to. If it’s None applies to both axes.
- to_labels(column_list: List[Hashable]) modin.engines.base.frame.data.PandasFrame ¶
Move one or more columns into the row labels. Previous labels are dropped.
- Parameters
column_list (list of hashable) – The list of column names to place as the new row labels.
- Returns
A new PandasFrame that has the updated labels.
- Return type
- to_numpy(**kwargs)¶
Convert this Modin DataFrame to a NumPy array.
- Parameters
**kwargs (dict) – Additional keyword arguments to be passed in to_numpy.
- Returns
- Return type
np.ndarray
- to_pandas()¶
Convert this Modin DataFrame to a pandas DataFrame.
- Returns
- Return type
- transpose()¶
Transpose the index and columns of this Modin DataFrame.
Reflect this Modin DataFrame over its main diagonal by writing rows as columns and vice-versa.
- Returns
New Modin DataFrame.
- Return type
PandasFramePartition¶
The class is the base for any partition class of the pandas backend and serves as the last level at which operations conveyed from the partition manager are performed on an individual block partition.
The class provides an API that has to be overridden by child classes in order to manipulate the data and metadata they store.
The public API exposed by the children of this class is used in PandasFramePartitionManager.
The objects wrapped by the child classes are treated as immutable by PandasFramePartitionManager subclasses, and there is no logic for updating them in place.
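The deferred call-queue pattern described here can be sketched in pure Python; the class name and bookkeeping below are hypothetical simplifications, not Modin's actual implementation.

```python
# Minimal sketch of the partition call-queue pattern: functions are
# queued lazily, the wrapped object is treated as immutable (queuing
# returns a NEW partition), and the queue is drained only when the
# data is actually needed.
class SketchPartition:
    def __init__(self, obj, call_queue=None):
        self._obj = obj
        self._call_queue = list(call_queue or [])

    def add_to_apply_calls(self, func, *args, **kwargs):
        # Nothing executes yet; return a new partition with the
        # function appended to the queue.
        return SketchPartition(
            self._obj, self._call_queue + [(func, args, kwargs)]
        )

    def drain_call_queue(self):
        # Execute queued functions in insertion order.
        for func, args, kwargs in self._call_queue:
            self._obj = func(self._obj, *args, **kwargs)
        self._call_queue = []

    def get(self):
        self.drain_call_queue()
        return self._obj

part = SketchPartition([1, 2, 3])
part = part.add_to_apply_calls(lambda data, n: data + [n], 4)
part = part.add_to_apply_calls(lambda data: [x * 10 for x in data])
print(part.get())  # [10, 20, 30, 40]
```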
Public API¶
- class modin.engines.base.frame.partition.PandasFramePartition¶
An abstract class that is the base for any partition class of the pandas backend.
The class provides an API that has to be overridden by child classes.
- add_to_apply_calls(func, *args, **kwargs)¶
Add a function to the call queue.
- Parameters
func (callable) – Function to be added to the call queue.
*args (iterable) – Additional positional arguments to be passed in func.
**kwargs (dict) – Additional keyword arguments to be passed in func.
- Returns
New PandasFramePartition object with the function added to the call queue.
- Return type
Notes
This function will be executed when apply is called. The queued functions will be executed in the order they were inserted; apply’s func operates last and its result is returned.
- apply(func, *args, **kwargs)¶
Apply a function to the object wrapped by this partition.
- Parameters
func (callable) – Function to apply.
*args (iterable) – Additional positional arguments to be passed in func.
**kwargs (dict) – Additional keyword arguments to be passed in func.
- Returns
New PandasFramePartition object.
- Return type
Notes
It is up to the implementation how kwargs are handled. They are an important part of many implementations. As of right now, they are not serialized.
- drain_call_queue()¶
Execute all operations stored in the call queue on the object wrapped by this partition.
- classmethod empty()¶
Create a new partition that wraps an empty pandas DataFrame.
- Returns
New PandasFramePartition object.
- Return type
- get()¶
Get the object wrapped by this partition.
- Returns
The object that was wrapped by this partition.
- Return type
object
Notes
This is the opposite of the classmethod put. E.g. if you assign x = PandasFramePartition.put(1), x.get() should always return 1.
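The put/get round-trip invariant can be illustrated with a trivial in-process stand-in for a distributed object store. The class and the dict-based store below are assumptions made for the sketch, not Modin's code:

```python
class StoredPartition:
    """Sketch of the put/get contract: put wraps an object in a partition
    backed by a shared store; get retrieves exactly what was put.
    The class-level dict is a stand-in for Ray's or Dask's object store."""

    _store = {}

    def __init__(self, oid):
        self._oid = oid

    @classmethod
    def put(cls, obj):
        # Store the object and wrap the resulting key in a partition.
        oid = len(cls._store)
        cls._store[oid] = obj
        return cls(oid)

    def get(self):
        # Must return exactly the object that was put.
        return self._store[self._oid]
```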
- length()¶
Get the length of the object wrapped by this partition.
- Returns
The length of the object.
- Return type
int
- mask(row_indices, col_indices)¶
Lazily create a mask that extracts the indices provided.
- Parameters
row_indices (list-like, slice or label) – The indices for the rows to extract.
col_indices (list-like, slice or label) – The indices for the columns to extract.
- Returns
New PandasFramePartition object.
- Return type
PandasFramePartition
- classmethod preprocess_func(func)¶
Preprocess a function before an apply call.
- Parameters
func (callable) – Function to preprocess.
- Returns
An object that can be accepted by apply.
- Return type
callable
Notes
This is a classmethod because the definition of how to preprocess should be class-wide. Also, we may want to use this before we deploy a preprocessed function to multiple PandasFramePartition objects.
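One plausible reading of this contract: preprocessing publishes the function to a shared registry once, so that many partitions can reuse a cheap handle. The registry below is a hypothetical stand-in (in the Ray implementation, for example, the analogous step is putting the function into the object store):

```python
class FuncStore:
    """Class-wide function preprocessing sketch (hypothetical): each
    function is registered once and partitions receive a lightweight
    handle instead of the function itself."""

    _funcs = {}

    @classmethod
    def preprocess_func(cls, func):
        # Reuse the same handle for the same function object.
        handle = id(func)
        cls._funcs.setdefault(handle, func)
        return handle

    @classmethod
    def resolve(cls, handle):
        # A partition's apply would resolve the handle back to the callable.
        return cls._funcs[handle]
```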
- classmethod put(obj)¶
Put an object into a store and wrap it with partition object.
- Parameters
obj (object) – An object to be put.
- Returns
New PandasFramePartition object.
- Return type
PandasFramePartition
- to_numpy(**kwargs)¶
Convert the object wrapped by this partition to a NumPy array.
- Parameters
**kwargs (dict) – Additional keyword arguments to be passed in to_numpy.
- Returns
- Return type
np.ndarray
Notes
If the underlying object is a pandas DataFrame, this will return a 2D NumPy array.
- to_pandas()¶
Convert the object wrapped by this partition to a pandas DataFrame.
- Returns
- Return type
pandas.DataFrame
Notes
If the underlying object is a pandas DataFrame, this will likely only need to call get.
- wait()¶
Wait for completion of computations on the object wrapped by the partition.
- width()¶
Get the width of the object wrapped by the partition.
- Returns
The width of the object.
- Return type
int
BaseFrameAxisPartition¶
The class is base for any axis partition class and serves as the last level on which operations that were conveyed from the partition manager are being performed on an entire column or row.
The class provides an API that has to be overridden by the child classes in order to manipulate on a list of block partitions (making up column or row partition) they store.
The procedures that use this class and its methods assume that they have some global knowledge about the entire axis. This may require the implementation to use concatenation or append on the list of block partitions.
The PandasFramePartitionManager
object that controls these objects (through the API exposed here) maintains an invariant that
objects of this class are never returned from a function. It assumes that
PandasFramePartition
objects are always what is stored, and structures itself accordingly.
Public API¶
- class modin.engines.base.frame.axis_partition.BaseFrameAxisPartition¶
An abstract class that represents the parent class for any axis partition class.
This class is intended to simplify the way that operations are performed.
- apply(func, num_splits=None, other_axis_partition=None, maintain_partitioning=True, **kwargs)¶
Apply a function to this axis partition along full axis.
- Parameters
func (callable) – The function to apply. This will be preprocessed according to the corresponding PandasFramePartition objects.
num_splits (int, default: None) – The number of times to split the result object.
other_axis_partition (BaseFrameAxisPartition, default: None) – Another BaseFrameAxisPartition object to be applied to func. This is for operations that are between two data sets.
maintain_partitioning (bool, default: True) – Whether to keep the partitioning in the same orientation as it was previously or not. This is important because we may be operating on an individual axis partition and not touching the rest. In this case, we have to return the partitioning to its previous orientation (the lengths will remain the same). This is ignored between two axis partitions.
**kwargs (dict) – Additional keywords arguments to be passed in func.
- Returns
A list of PandasFramePartition objects.
- Return type
list
Notes
The procedures that invoke this method assume full axis knowledge. Implement this method accordingly.
You must return a list of PandasFramePartition objects from this method.
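The full-axis contract can be sketched on a single machine with NumPy arrays standing in for block partitions. apply_along_full_axis is a hypothetical helper, not part of the Modin API:

```python
import numpy as np

def apply_along_full_axis(blocks, func, num_splits):
    """Concatenate the blocks that make up one axis partition, run func
    with knowledge of the whole axis, then re-split the result into
    num_splits block partitions (hypothetical sketch)."""
    full = np.concatenate(blocks)               # assemble the entire axis
    result = func(full)                         # func sees the whole axis at once
    return np.array_split(result, num_splits)   # back to a list of blocks
```

A cumulative sum is a typical full-axis operation: each output value depends on everything before it, so it cannot be computed block-wise.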
- force_materialization(get_ip=False)¶
Materialize axis partitions into a single partition.
- Parameters
get_ip (bool, default: False) – Whether to get node ip address to a single partition or not.
- Returns
An axis partition containing only a single materialized partition.
- Return type
BaseFrameAxisPartition
- shuffle(func, lengths, **kwargs)¶
Shuffle the order of the data in this axis partition based on the lengths.
- Parameters
func (callable) – The function to apply before splitting.
lengths (list) – The list of partition lengths to split the result into.
**kwargs (dict) – Additional keywords arguments to be passed in func.
- Returns
A list of PandasFramePartition objects split by lengths.
- Return type
list
- unwrap(squeeze=False, get_ip=False)¶
Unwrap partitions from this axis partition.
- Parameters
squeeze (bool, default: False) – Flag used to unwrap only one partition.
get_ip (bool, default: False) – Whether to get node ip address to each partition or not.
- Returns
List of partitions from this axis partition.
- Return type
list
Notes
If get_ip=True, a list of tuples of Ray.ObjectRef/Dask.Future to node ip addresses and unwrapped partitions, respectively, is returned if Ray/Dask is used as an engine (i.e. [(Ray.ObjectRef/Dask.Future, Ray.ObjectRef/Dask.Future), …]).
PandasFrameAxisPartition¶
The class is base for any axis partition class of pandas
backend.
Subclasses must implement list_of_blocks
which represents data wrapped by the PandasFramePartition
objects and creates something interpretable as a pandas DataFrame.
See modin.engines.ray.pandas_on_ray.axis_partition.PandasOnRayFrameAxisPartition
for an example on how to override/use this class when the implementation needs to be augmented.
Public API¶
- class modin.engines.base.frame.axis_partition.PandasFrameAxisPartition¶
An abstract class created to simplify and consolidate the code for axis partitions that run pandas.
Because much of that code is similar, this allows us to reuse it.
- apply(func, num_splits=None, other_axis_partition=None, maintain_partitioning=True, **kwargs)¶
Apply a function to this axis partition along full axis.
- Parameters
func (callable) – The function to apply.
num_splits (int, default: None) – The number of times to split the result object.
other_axis_partition (PandasFrameAxisPartition, default: None) – Another PandasFrameAxisPartition object to be applied to func. This is for operations that are between two data sets.
maintain_partitioning (bool, default: True) – Whether to keep the partitioning in the same orientation as it was previously or not. This is important because we may be operating on an individual AxisPartition and not touching the rest. In this case, we have to return the partitioning to its previous orientation (the lengths will remain the same). This is ignored between two axis partitions.
**kwargs (dict) – Additional keywords arguments to be passed in func.
- Returns
A list of PandasFramePartition objects.
- Return type
list
- classmethod deploy_axis_func(axis, func, num_splits, kwargs, maintain_partitioning, *partitions)¶
Deploy a function along a full axis.
- Parameters
axis ({0, 1}) – The axis to perform the function along.
func (callable) – The function to perform.
num_splits (int) – The number of splits to return (see split_result_of_axis_func_pandas).
kwargs (dict) – Additional keywords arguments to be passed in func.
maintain_partitioning (bool) – If True, keep the old partitioning if possible. If False, create a new partition layout.
*partitions (iterable) – All partitions that make up the full axis (row or column).
- Returns
A list of pandas DataFrames.
- Return type
list
- classmethod deploy_func_between_two_axis_partitions(axis, func, num_splits, len_of_left, other_shape, kwargs, *partitions)¶
Deploy a function along a full axis between two data sets.
- Parameters
axis ({0, 1}) – The axis to perform the function along.
func (callable) – The function to perform.
num_splits (int) – The number of splits to return (see split_result_of_axis_func_pandas).
len_of_left (int) – The number of values in partitions that belong to the left data set.
other_shape (np.ndarray) – The shape of the right frame in terms of partitions, i.e. (other_shape[i-1], other_shape[i]) will indicate the slice to restore the i-1 axis partition.
kwargs (dict) – Additional keywords arguments to be passed in func.
*partitions (iterable) – All partitions that make up the full axis (row or column) for both data sets.
- Returns
A list of pandas DataFrames.
- Return type
list
- shuffle(func, lengths, **kwargs)¶
Shuffle the order of the data in this axis partition based on the lengths.
- Parameters
func (callable) – The function to apply before splitting.
lengths (list) – The list of partition lengths to split the result into.
**kwargs (dict) – Additional keywords arguments to be passed in func.
- Returns
A list of PandasFramePartition objects split by lengths.
- Return type
list
PandasFramePartitionManager¶
The class is the base for any partition manager class of the pandas
backend and serves as an
intermediate level between the pandas
base frame and the conforming partition class.
The class is responsible for partition manipulation and for applying a function to individual partitions:
block partitions, row partitions or column partitions; i.e., the class can form axis partitions from
block partitions in order to apply a function when an operation requires access to an entire column or row.
The class translates the frame API into the partition API and can also perform some preprocessing
depending on the partition type to improve performance (for example,
preprocess_func()
).
The main task of the partition manager is to keep knowledge of how partitions are stored and managed internal to itself, so that surrounding code can use it through a lean API without worrying about implementation details.
The partition manager can apply a user-passed (arbitrary) function in different modes:
block-wise (apply a function to individual block partitions):
optionally accepting partition indices along each axis
optionally accepting an item to be split so that parts of it are sent to each partition
along a full axis (apply a function to an entire column or row made up of block partitions, when the user function needs information about the whole axis)
It can also broadcast partitions from right to left when executing certain operations, making right partitions available to functions executed where the left partitions live.
The partition manager is also used to create “logical” (axis) partitions by joining existing partitions along a specified axis (either rows or columns), and to concatenate different partition sets along a given axis.
It also maintains a mapping from “external” (end-user-visible) indices along all axes to internal indices, which are actually pairs of a partition index and an index inside that partition, and manages conversion to NumPy and pandas representations.
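The external-to-internal index bookkeeping amounts to a prefix-sum lookup. A minimal sketch, where global_to_internal is a hypothetical helper introduced only for illustration:

```python
from bisect import bisect_right
from itertools import accumulate

def global_to_internal(global_idx, part_lengths):
    """Map an end-user row index to (partition number, row inside that
    partition), given the row-partition lengths (hypothetical helper)."""
    starts = [0] + list(accumulate(part_lengths))  # partition start offsets
    if not 0 <= global_idx < starts[-1]:
        raise IndexError(global_idx)
    part = bisect_right(starts, global_idx) - 1    # which partition holds it
    return part, global_idx - starts[part]         # offset inside that partition
```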
Public API¶
- class modin.engines.base.frame.partition_manager.PandasFramePartitionManager¶
Base class for managing the dataframe data layout and operators across the distribution of partitions.
Partition class is the class to use for storing each partition. Each partition must extend the PandasFramePartition class.
- classmethod apply_func_to_indices_both_axis(partitions, func, row_partitions_list, col_partitions_list, item_to_distribute=None)¶
Apply a function along both axes.
- Parameters
partitions (np.ndarray) – The partitions to which the func will apply.
func (callable) – The function to apply.
row_partitions_list (list) – List of row partitions.
col_partitions_list (list) – List of column partitions.
item_to_distribute (item, default: None) – The item to split up so it can be applied over both axes.
- Returns
A NumPy array with partitions.
- Return type
np.ndarray
Notes
For your func to operate directly on the indices provided, it must use row_internal_indices, col_internal_indices as keyword arguments.
- classmethod apply_func_to_select_indices(axis, partitions, func, indices, keep_remaining=False)¶
Apply a function to select indices.
- Parameters
axis ({0, 1}) – Axis to apply the func over.
partitions (np.ndarray) – The partitions to which the func will apply.
func (callable) – The function to apply to these indices of partitions.
indices (dict) – The indices to apply the function to.
keep_remaining (bool, default: False) – Whether or not to keep the other partitions. Some operations may want to drop the remaining partitions and keep only the results.
- Returns
A NumPy array with partitions.
- Return type
np.ndarray
Notes
Your internal function must take a kwarg internal_indices for this to work correctly. This prevents information leakage of the internal index to the external representation.
- classmethod apply_func_to_select_indices_along_full_axis(axis, partitions, func, indices, keep_remaining=False)¶
Apply a function to a select subset of full columns/rows.
- Parameters
axis ({0, 1}) – The axis to apply the function over.
partitions (np.ndarray) – The partitions to which the func will apply.
func (callable) – The function to apply.
indices (list-like) – The global indices to apply the func to.
keep_remaining (bool, default: False) – Whether or not to keep the other partitions. Some operations may want to drop the remaining partitions and keep only the results.
- Returns
A NumPy array with partitions.
- Return type
np.ndarray
Notes
This should be used when you need to apply a function that relies on some global information for the entire column/row, but only need to apply a function to a subset. For your func to operate directly on the indices provided, it must use internal_indices as a keyword argument.
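A sketch of what a conforming function might look like, using a plain list as a stand-in for a partition's data (fill_selected is a hypothetical example, not a Modin function):

```python
def fill_selected(block, value, internal_indices=None):
    """A user function written for the select-indices contract: the manager
    translates global positions into block-local ones and passes them via
    the mandatory internal_indices keyword argument (hypothetical sketch)."""
    out = list(block)
    for i in internal_indices or []:
        # Only the selected block-local positions are touched.
        if out[i] is None:
            out[i] = value
    return out
```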
- classmethod axis_partition(partitions, axis)¶
Logically partition along given axis (columns or rows).
- Parameters
partitions (list-like) – List of partitions to be combined.
axis ({0, 1}) – 0 for column partitions, 1 for row partitions.
- Returns
A list of BaseFrameAxisPartition objects.
- Return type
list
- classmethod binary_operation(axis, left, func, right)¶
Apply a function that requires two PandasFrame objects.
- Parameters
axis ({0, 1}) – The axis to apply the function over (0 - rows, 1 - columns).
left (np.ndarray) – The partitions of left PandasFrame.
func (callable) – The function to apply.
right (np.ndarray) – The partitions of right PandasFrame.
- Returns
A NumPy array with new partitions.
- Return type
np.ndarray
- classmethod broadcast_apply(axis, apply_func, left, right, other_name='r')¶
Broadcast the right partitions to left and apply apply_func function.
- Parameters
axis ({0, 1}) – Axis to apply and broadcast over.
apply_func (callable) – Function to apply.
left (NumPy 2D array) – Left partitions.
right (NumPy 2D array) – Right partitions.
other_name (str, default: "r") – Name of key-value argument for apply_func that is used to pass right to apply_func.
- Returns
An array of partition objects.
- Return type
NumPy array
Notes
This will often be overridden by implementations. It materializes the entire partitions of the right and applies them to the left through apply.
- classmethod broadcast_apply_select_indices(axis, apply_func, left, right, left_indices, right_indices, keep_remaining=False)¶
Broadcast the right partitions to left and apply apply_func to selected indices.
- Parameters
axis ({0, 1}) – Axis to apply and broadcast over.
apply_func (callable) – Function to apply.
left (NumPy 2D array) – Left partitions.
right (NumPy 2D array) – Right partitions.
left_indices (list-like) – Indices to apply function to.
right_indices (dictionary of indices of right partitions) – Indices of the right partitions that you want to bring to the specified left partition; for example, the dict {key: {key1: [0, 1], key2: [5]}} means that into left[key] you want to broadcast the [right[key1], right[key2]] partitions, and the internal indices for right must be [[0, 1], [5]].
keep_remaining (bool, default: False) – Whether or not to keep the other partitions. Some operations may want to drop the remaining partitions and keep only the results.
- Returns
An array of partition objects.
- Return type
NumPy array
Notes
Your internal function must take these kwargs: [internal_indices, other, internal_other_indices] to work correctly!
- classmethod broadcast_axis_partitions(axis, apply_func, left, right, keep_partitioning=False, apply_indices=None, enumerate_partitions=False, lengths=None)¶
Broadcast the right partitions to left and apply apply_func along full axis.
- Parameters
axis ({0, 1}) – Axis to apply and broadcast over.
apply_func (callable) – Function to apply.
left (NumPy 2D array) – Left partitions.
right (NumPy 2D array) – Right partitions.
keep_partitioning (boolean, default: False) – The flag to keep partition boundaries for Modin Frame. Setting it to True disables shuffling data from one partition to another.
apply_indices (list of ints, default: None) – Indices of axis ^ 1 to apply function over.
enumerate_partitions (bool, default: False) – Whether or not to pass partition index into apply_func. Note that apply_func must be able to accept partition_idx kwarg.
lengths (list of ints, default: None) – The list of lengths to shuffle the object.
- Returns
An array of partition objects.
- Return type
NumPy array
- classmethod column_partitions(partitions)¶
Get the list of BaseFrameAxisPartition objects representing column-wise partitions.
- Parameters
partitions (list-like) – List of (smaller) partitions to be combined to column-wise partitions.
- Returns
A list of BaseFrameAxisPartition objects.
- Return type
list
Notes
Each value in this list will be a BaseFrameAxisPartition object. BaseFrameAxisPartition is located in axis_partition.py.
- classmethod concat(axis, left_parts, right_parts)¶
Concatenate the blocks of partitions with another set of blocks.
- Parameters
axis (int) – The axis to concatenate to.
left_parts (np.ndarray) – NumPy array of partitions to concatenate with.
right_parts (np.ndarray or list) – NumPy array of partitions to be concatenated.
- Returns
A new NumPy array with concatenated partitions.
- Return type
np.ndarray
Notes
Assumes that the blocks are already the same shape on the dimension being concatenated. A ValueError will be thrown if this condition is not met.
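Using NumPy arrays as stand-ins for 2-D grids of partitions, the precondition can be sketched as follows (concat_grids is a hypothetical helper):

```python
import numpy as np

def concat_grids(axis, left, right):
    """Concatenate two 2-D grids of partitions along axis, mirroring the
    documented precondition: the grids must already match on the
    non-concatenated dimension (hypothetical sketch)."""
    if left.shape[1 - axis] != right.shape[1 - axis]:
        raise ValueError("partition grids differ on the non-concatenated axis")
    return np.concatenate([left, right], axis=axis)
```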
- classmethod concatenate(dfs)¶
Concatenate pandas DataFrames, preserving the ‘category’ dtype.
- Parameters
dfs (list) – List of pandas DataFrames to concatenate.
- Returns
A pandas DataFrame
- Return type
pandas.DataFrame
- classmethod finalize(partitions)¶
Perform all deferred calls on partitions.
- Parameters
partitions (np.ndarray) – Partitions of Modin Dataframe on which all deferred calls should be performed.
- classmethod from_arrow(at, return_dims=False)¶
Return the partitions from Apache Arrow (PyArrow).
- Parameters
at (pyarrow.table) – Arrow Table.
return_dims (bool, default: False) – If it’s True, return as (np.ndarray, row_lengths, col_widths), else np.ndarray.
- Returns
A NumPy array with partitions (with dimensions or not).
- Return type
np.ndarray or (np.ndarray, row_lengths, col_widths)
- classmethod from_pandas(df, return_dims=False)¶
Return the partitions from pandas.DataFrame.
- Parameters
df (pandas.DataFrame) – A pandas.DataFrame.
return_dims (bool, default: False) – If it’s True, return as (np.ndarray, row_lengths, col_widths), else np.ndarray.
- Returns
A NumPy array with partitions (with dimensions or not).
- Return type
np.ndarray or (np.ndarray, row_lengths, col_widths)
- classmethod get_indices(axis, partitions, index_func=None)¶
Get the internal indices stored in the partitions.
- Parameters
axis ({0, 1}) – Axis to extract the labels over.
partitions (np.ndarray) – NumPy array with PandasFramePartition’s.
index_func (callable, default: None) – The function to be used to extract the indices.
- Returns
A pandas Index object.
- Return type
pandas.Index
Notes
These are the global indices of the object. This is mostly useful when you have deleted rows/columns internally, but do not know which ones were deleted.
- classmethod groupby_reduce(axis, partitions, by, map_func, reduce_func, apply_indices=None)¶
Group the data using the map_func provided along the axis over the partitions, then reduce using reduce_func.
- Parameters
axis ({0, 1}) – Axis to groupby over.
partitions (NumPy 2D array) – Partitions of the ModinFrame to groupby.
by (NumPy 2D array) – Partitions of ‘by’ to broadcast.
map_func (callable) – Map function.
reduce_func (callable) – Reduce function.
apply_indices (list of ints, default: None) – Indices of axis ^ 1 to apply function over.
- Returns
Partitions with applied groupby.
- Return type
NumPy array
- classmethod lazy_map_partitions(partitions, map_func)¶
Apply map_func to every partition in partitions lazily.
- Parameters
partitions (NumPy 2D array) – Partitions of Modin Frame.
map_func (callable) – Function to apply.
- Returns
An array of partitions
- Return type
NumPy array
- classmethod map_axis_partitions(axis, partitions, map_func, keep_partitioning=False, lengths=None, enumerate_partitions=False)¶
Apply map_func to every partition in partitions along given axis.
- Parameters
axis ({0, 1}) – Axis to perform the map across (0 - index, 1 - columns).
partitions (NumPy 2D array) – Partitions of Modin Frame.
map_func (callable) – Function to apply.
keep_partitioning (bool, default: False) – Whether to keep partitioning for Modin Frame. Setting it to True stops data shuffling between partitions.
lengths (list of ints, default: None) – List of lengths to shuffle the object.
enumerate_partitions (bool, default: False) – Whether or not to pass partition index into map_func. Note that map_func must be able to accept partition_idx kwarg.
- Returns
An array of new partitions for Modin Frame.
- Return type
NumPy array
Notes
This method should be used in the case when map_func relies on some global information about the axis.
- classmethod map_partitions(partitions, map_func)¶
Apply map_func to every partition in partitions.
- Parameters
partitions (NumPy 2D array) – Partitions housing the data of Modin Frame.
map_func (callable) – Function to apply.
- Returns
An array of partitions
- Return type
NumPy array
- classmethod preprocess_func(map_func)¶
Preprocess a function to be applied to PandasFramePartition objects.
- Parameters
map_func (callable) – The function to be preprocessed.
- Returns
The preprocessed version of the map_func provided.
- Return type
callable
Notes
Preprocessing does not require any specific format, only that the PandasFramePartition.apply method will recognize it (for the subclass being used).
If your PandasFramePartition objects assume that a function provided is serialized or wrapped or in some other format, this is the place to add that logic. It is possible that this can also just return map_func if the apply method of the PandasFramePartition object you are using does not require any modification to a given function.
- classmethod row_partitions(partitions)¶
Get the list of BaseFrameAxisPartition objects representing row-wise partitions.
- Parameters
partitions (list-like) – List of (smaller) partitions to be combined to row-wise partitions.
- Returns
A list of BaseFrameAxisPartition objects.
- Return type
list
Notes
Each value in this list will be a BaseFrameAxisPartition object. BaseFrameAxisPartition is located in axis_partition.py.
- classmethod simple_shuffle(axis, partitions, map_func, lengths)¶
Shuffle the data so that the partition lengths match the given lengths, by calling map_func.
- Parameters
axis ({0, 1}) – Axis to perform the map across (0 - index, 1 - columns).
partitions (NumPy 2D array) – Partitions of Modin Frame.
map_func (callable) – Function to apply.
lengths (list(int)) – List of lengths to shuffle the object.
- Returns
An array of new partitions for a Modin Frame.
- Return type
NumPy array
- classmethod to_numpy(partitions, **kwargs)¶
Convert NumPy array of PandasFramePartition to NumPy array of data stored within partitions.
- Parameters
partitions (np.ndarray) – NumPy array of PandasFramePartition.
**kwargs (dict) – Keyword arguments for PandasFramePartition.to_numpy function.
- Returns
A NumPy array.
- Return type
np.ndarray
- classmethod to_pandas(partitions)¶
Convert NumPy array of PandasFramePartition to pandas DataFrame.
- Parameters
partitions (np.ndarray) – NumPy array of PandasFramePartition.
- Returns
A pandas DataFrame
- Return type
pandas.DataFrame
Generic Ray-based members¶
Objects that are backend-agnostic but require a Ray-specific implementation
are placed in modin.engines.ray.generic.
Their purpose is to implement certain parallel I/O operations and to serve as a foundation for building backend-specific objects:
GenericRayFramePartitionManager
– implements a parallel to_numpy().
- class modin.engines.ray.generic.io.RayIO¶
Base class for doing I/O operations over Ray.
- classmethod to_csv(qc, **kwargs)¶
Write records stored in the qc to a CSV file.
- Parameters
qc (BaseQueryCompiler) – The query compiler of the Modin dataframe that we want to run to_csv on.
**kwargs (dict) – Parameters for pandas.to_csv(**kwargs).
- classmethod to_sql(qc, **kwargs)¶
Write records stored in the qc to a SQL database.
- Parameters
qc (BaseQueryCompiler) – The query compiler of the Modin dataframe that we want to run to_sql on.
**kwargs (dict) – Parameters for pandas.to_sql(**kwargs).
- class modin.engines.ray.generic.frame.partition_manager.GenericRayFramePartitionManager¶
The class implements the interface in PandasFramePartitionManager.
- classmethod to_numpy(partitions, **kwargs)¶
Convert partitions into a NumPy array.
- Parameters
partitions (NumPy array) – A 2-D array of partitions to convert to local NumPy array.
**kwargs (dict) – Keyword arguments to pass to each partition's to_numpy() call.
- Returns
- Return type
NumPy array
PandasOnRay Frame Implementation¶
Modin implements Frame, PartitionManager, AxisPartition and Partition classes
specific to the PandasOnRay backend:
PandasOnRayFrame¶
The class is a specific implementation of the PandasFrame class using the Ray distributed engine. It serves as an intermediate level between
PandasQueryCompiler
and
PandasOnRayFramePartitionManager.
Public API¶
- class modin.engines.ray.pandas_on_ray.frame.data.PandasOnRayFrame(partitions, index, columns, row_lengths=None, column_widths=None, dtypes=None)¶
The class implements the interface in PandasFrame using Ray.
- Parameters
partitions (np.ndarray) – A 2D NumPy array of partitions.
index (sequence) – The index for the dataframe. Converted to a pandas.Index.
columns (sequence) – The columns object for the dataframe. Converted to a pandas.Index.
row_lengths (list, optional) – The length of each partition in the rows. The “height” of each of the block partitions. Computed if not provided.
column_widths (list, optional) – The width of each partition in the columns. The “width” of each of the block partitions. Computed if not provided.
dtypes (pandas.Series, optional) – The data types for the dataframe columns.
- classmethod combine_dtypes(list_of_dtypes, column_names)¶
Describe how data types should be combined when they do not match.
- Parameters
list_of_dtypes (list) – A list of pandas.Series with the data types.
column_names (list) – The names of the columns that the data types map to.
- Returns
A pandas.Series containing the finalized data types.
- Return type
pandas.Series
PandasOnRayFramePartition¶
The class is the specific implementation of PandasFramePartition,
providing the API to perform operations on a block partition, namely a pandas.DataFrame,
using Ray as the execution engine.
In addition to wrapping a pandas DataFrame, the class also holds the following metadata:
length - length of the wrapped pandas DataFrame
width - width of the wrapped pandas DataFrame
ip - IP address of the node that holds the wrapped pandas DataFrame
An operation on a block partition can be performed in two modes:
asynchronously - via
apply()
lazily - via
add_to_apply_calls()
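The two modes can be contrasted with a thread-pool sketch on a single machine. AsyncPartition is a hypothetical class; Modin's Ray implementation uses Ray futures, not concurrent.futures:

```python
from concurrent.futures import ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=2)

class AsyncPartition:
    """Sketch: apply() submits work immediately and returns a partition
    wrapping a future (asynchronous), while add_to_apply_calls() only
    records the call (lazy). Hypothetical, not Modin's implementation."""

    def __init__(self, future, call_queue=()):
        self._future = future
        self._call_queue = list(call_queue)

    @classmethod
    def put(cls, obj):
        return cls(_pool.submit(lambda: obj))

    def add_to_apply_calls(self, func):
        # Lazy: nothing is scheduled yet.
        return AsyncPartition(self._future, self._call_queue + [func])

    def apply(self, func):
        queued, prev = self._call_queue, self._future

        def run():
            data = prev.result()
            for f in queued:   # drain lazy calls first, in order
                data = f(data)
            return func(data)  # the applied function runs last

        # Asynchronous: scheduled now, result collected later.
        return AsyncPartition(_pool.submit(run))

    def get(self):
        return self.apply(lambda x: x)._future.result()
```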
Public API¶
- class modin.engines.ray.pandas_on_ray.frame.partition.PandasOnRayFramePartition(object_id, length=None, width=None, ip=None, call_queue=None)¶
The class implements the interface in PandasFramePartition.
- Parameters
object_id (ray.ObjectRef) – A reference to the pandas.DataFrame that needs to be wrapped with this class.
length (ray.ObjectRef or int, optional) – Length of the wrapped pandas.DataFrame, or a reference to it.
width (ray.ObjectRef or int, optional) – Width of the wrapped pandas.DataFrame, or a reference to it.
ip (ray.ObjectRef or str, optional) – IP address of the node that holds the wrapped pandas.DataFrame, or a reference to it.
call_queue (list) – Call queue that needs to be executed on the wrapped pandas.DataFrame.
- add_to_apply_calls(func, *args, **kwargs)¶
Add a function to the call queue.
- Parameters
func (callable or ray.ObjectRef) – Function to be added to the call queue.
*args (iterable) – Additional positional arguments to be passed in func.
**kwargs (dict) – Additional keyword arguments to be passed in func.
- Returns
A new PandasOnRayFramePartition object.
- Return type
PandasOnRayFramePartition
Notes
It does not matter whether func is a callable or a ray.ObjectRef. Ray will handle it correctly either way. The keyword arguments are sent as a dictionary.
- apply(func, *args, **kwargs)¶
Apply a function to the object wrapped by this partition.
- Parameters
func (callable or ray.ObjectRef) – A function to apply.
*args (iterable) – Additional positional arguments to be passed in func.
**kwargs (dict) – Additional keyword arguments to be passed in func.
- Returns
A new PandasOnRayFramePartition object.
- Return type
PandasOnRayFramePartition
Notes
It does not matter whether func is a callable or a ray.ObjectRef. Ray will handle it correctly either way. The keyword arguments are sent as a dictionary.
- drain_call_queue()¶
Execute all operations stored in the call queue on the object wrapped by this partition.
- classmethod empty()¶
Create a new partition that wraps an empty pandas DataFrame.
- Returns
A new PandasOnRayFramePartition object.
- Return type
PandasOnRayFramePartition
- get()¶
Get the object wrapped by this partition out of the Plasma store.
- Returns
The object from the Plasma store.
- Return type
object
- ip()¶
Get the node IP address of the object wrapped by this partition.
- Returns
IP address of the node that holds the data.
- Return type
str
- length()¶
Get the length of the object wrapped by this partition.
- Returns
The length of the object.
- Return type
int
- mask(row_indices, col_indices)¶
Lazily create a mask that extracts the indices provided.
- Parameters
row_indices (list-like, slice or label) – The indices for the rows to extract.
col_indices (list-like, slice or label) – The indices for the columns to extract.
- Returns
A new PandasOnRayFramePartition object.
- Return type
PandasOnRayFramePartition
- classmethod preprocess_func(func)¶
Put a function into the Plasma store to use in apply.
- Parameters
func (callable) – A function to preprocess.
- Returns
A reference to func.
- Return type
ray.ObjectRef
- classmethod put(obj)¶
Put an object into the Plasma store and wrap it with a partition object.
- Parameters
obj (any) – An object to be put.
- Returns
A new PandasOnRayFramePartition object.
- Return type
PandasOnRayFramePartition
- to_numpy(**kwargs)¶
Convert the object wrapped by this partition to a NumPy array.
- Parameters
**kwargs (dict) – Additional keyword arguments to be passed in to_numpy.
- Returns
- Return type
np.ndarray
- to_pandas()¶
Convert the object wrapped by this partition to a pandas.DataFrame.
- Returns
- Return type
pandas.DataFrame
- wait()¶
Wait for completion of computations on the object wrapped by the partition.
- width()¶
Get the width of the object wrapped by the partition.
- Returns
The width of the object.
- Return type
int
PandasOnRayFrameAxisPartition¶
This class is the specific implementation of PandasFrameAxisPartition
,
providing the API to perform operations on an axis partition, using Ray as an execution engine. The axis partition is
a wrapper over a list of block partitions that are stored in this class.
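To make the wrapper relationship concrete, here is a minimal, purely illustrative sketch (no Ray involved; all class names are hypothetical, not Modin's actual internals) of an axis partition that concatenates its block partitions, applies a function across the full axis, and re-splits the result into num_splits chunks:

```python
class ToyBlockPartition:
    def __init__(self, data):
        self.data = data  # a plain list standing in for a pandas DataFrame


class ToyAxisPartition:
    """Wraps a list of block partitions that together span one axis."""

    def __init__(self, list_of_blocks):
        self.list_of_blocks = list_of_blocks

    def deploy_axis_func(self, func, num_splits):
        # Concatenate blocks so `func` sees the whole axis at once.
        full = [x for block in self.list_of_blocks for x in block.data]
        result = func(full)
        # Split the result back into `num_splits` roughly equal chunks.
        step = -(-len(result) // num_splits)  # ceiling division
        return [ToyBlockPartition(result[i:i + step])
                for i in range(0, len(result), step)]


axis = ToyAxisPartition([ToyBlockPartition([1, 2]), ToyBlockPartition([3, 4])])
parts = axis.deploy_axis_func(lambda xs: [x * 10 for x in xs], num_splits=2)
# parts[0].data == [10, 20], parts[1].data == [30, 40]
```

The real `deploy_axis_func` additionally handles `maintain_partitioning` and remote execution, but the concatenate-apply-resplit shape is the same idea.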
Public API¶
- class modin.engines.ray.pandas_on_ray.frame.axis_partition.PandasOnRayFrameAxisPartition(list_of_blocks, get_ip=False)¶
The class implements the interface in
PandasFrameAxisPartition
.- Parameters
list_of_blocks (list) – List of
PandasOnRayFramePartition
objects.get_ip (bool, default: False) – Whether to get node IP addresses of corresponding partitions or not.
- classmethod deploy_axis_func(axis, func, num_splits, kwargs, maintain_partitioning, *partitions)¶
Deploy a function along a full axis.
- Parameters
axis ({0, 1}) – The axis to perform the function along.
func (callable) – The function to perform.
num_splits (int) – The number of splits to return (see
split_result_of_axis_func_pandas
).kwargs (dict) – Additional keywords arguments to be passed in func.
maintain_partitioning (bool) – If True, keep the old partitioning if possible. If False, create a new partition layout.
*partitions (iterable) – All partitions that make up the full axis (row or column).
- Returns
A list of
pandas.DataFrame
-s.- Return type
list
- classmethod deploy_func_between_two_axis_partitions(axis, func, num_splits, len_of_left, other_shape, kwargs, *partitions)¶
Deploy a function along a full axis between two data sets.
- Parameters
axis ({0, 1}) – The axis to perform the function along.
func (callable) – The function to perform.
num_splits (int) – The number of splits to return (see
split_result_of_axis_func_pandas
).len_of_left (int) – The number of values in partitions that belong to the left data set.
other_shape (np.ndarray) – The shape of right frame in terms of partitions, i.e. (other_shape[i-1], other_shape[i]) will indicate slice to restore i-1 axis partition.
kwargs (dict) – Additional keywords arguments to be passed in func.
*partitions (iterable) – All partitions that make up the full axis (row or column) for both data sets.
- Returns
A list of
pandas.DataFrame
-s.- Return type
list
- instance_type¶
alias of
ray._raylet.ObjectRef
- partition_type¶
alias of
modin.engines.ray.pandas_on_ray.frame.partition.PandasOnRayFramePartition
PandasOnRayFrameColumnPartition¶
Public API¶
- class modin.engines.ray.pandas_on_ray.frame.axis_partition.PandasOnRayFrameColumnPartition(list_of_blocks, get_ip=False)¶
The column partition implementation.
All of the implementation for this class is in the parent class, and this class defines the axis to perform the computation over.
- Parameters
list_of_blocks (list) – List of
PandasOnRayFramePartition
objects.get_ip (bool, default: False) – Whether to get node IP addresses of corresponding partitions or not.
PandasOnRayFrameRowPartition¶
Public API¶
- class modin.engines.ray.pandas_on_ray.frame.axis_partition.PandasOnRayFrameRowPartition(list_of_blocks, get_ip=False)¶
The row partition implementation.
All of the implementation for this class is in the parent class, and this class defines the axis to perform the computation over.
- Parameters
list_of_blocks (list) – List of
PandasOnRayFramePartition
objects.get_ip (bool, default: False) – Whether to get node IP addresses of corresponding partitions or not.
PandasOnRayFramePartitionManager¶
This class is the specific implementation of PandasFramePartitionManager
using the Ray distributed engine. This class is responsible for partition manipulation and applying a function to
block/row/column partitions.
Public API¶
- class modin.engines.ray.pandas_on_ray.frame.partition_manager.PandasOnRayFramePartitionManager¶
The class implements the interface in PandasFramePartitionManager.
- classmethod apply_func_to_indices_both_axis(partitions, func, row_partitions_list, col_partitions_list, item_to_distribute=None)¶
Apply a function along both axes.
- Parameters
partitions (np.ndarray) – The partitions to which the func will apply.
func (callable) – The function to apply.
row_partitions_list (list) – List of row partitions.
col_partitions_list (list) – List of column partitions.
item_to_distribute (item, optional) – The item to split up so it can be applied over both axes.
- Returns
A NumPy array with partitions.
- Return type
np.ndarray
Notes
For your func to operate directly on the indices provided, it must use
row_internal_indices
andcol_internal_indices
as keyword arguments.
- classmethod apply_func_to_select_indices(axis, partitions, func, indices, keep_remaining=False)¶
Apply a func to select indices of partitions.
- Parameters
axis ({0, 1}) – Axis to apply the func over.
partitions (np.ndarray) – The partitions to which the func will apply.
func (callable) – The function to apply to these indices of partitions.
indices (dict) – The indices to apply the function to.
keep_remaining (bool, default: False) – Whether or not to keep the other partitions. Some operations may want to drop the remaining partitions and keep only the results.
- Returns
A NumPy array with partitions.
- Return type
np.ndarray
Notes
Your internal function must take a kwarg internal_indices for this to work correctly. This prevents information leakage of the internal index to the external representation.
- classmethod apply_func_to_select_indices_along_full_axis(axis, partitions, func, indices, keep_remaining=False)¶
Apply a func to a select subset of full columns/rows.
- Parameters
axis ({0, 1}) – The axis to apply the func over.
partitions (np.ndarray) – The partitions to which the func will apply.
func (callable) – The function to apply.
indices (list-like) – The global indices to apply the func to.
keep_remaining (bool, default: False) – Whether or not to keep the other partitions. Some operations may want to drop the remaining partitions and keep only the results.
- Returns
A NumPy array with partitions.
- Return type
np.ndarray
Notes
This should be used when you need to apply a function that relies on some global information for the entire column/row, but only need to apply a function to a subset. For your func to operate directly on the indices provided, it must use internal_indices as a keyword argument.
- classmethod binary_operation(axis, left, func, right)¶
Apply a function that requires partitions of two
PandasOnRayFrame
objects.- Parameters
axis ({0, 1}) – The axis to apply the function over (0 - rows, 1 - columns).
left (np.ndarray) – The partitions of left
PandasOnRayFrame
.func (callable) – The function to apply.
right (np.ndarray) – The partitions of right
PandasOnRayFrame
.
- Returns
A NumPy array with new partitions.
- Return type
np.ndarray
- classmethod broadcast_apply(axis, apply_func, left, right, other_name='r')¶
Broadcast the right partitions to the left ones and apply apply_func to selected indices.
- Parameters
axis ({0, 1}) – Axis to apply and broadcast over.
apply_func (callable) – Function to apply.
left (np.ndarray) – NumPy 2D array of left partitions.
right (np.ndarray) – NumPy 2D array of right partitions.
other_name (str, default: "r") – Name of key-value argument for apply_func that is used to pass right to apply_func.
- Returns
An array of partition objects.
- Return type
np.ndarray
- classmethod get_indices(axis, partitions, index_func=None)¶
Get the internal indices stored in the partitions.
- Parameters
axis ({0, 1}) – Axis to extract the labels over.
partitions (np.ndarray) – NumPy array with
PandasFramePartition
-s.index_func (callable, default: None) – The function to be used to extract the indices.
- Returns
A
pandas.Index
object.- Return type
pandas.Index
Notes
These are the global indices of the object. This is mostly useful when you have deleted rows/columns internally, but do not know which ones were deleted.
- classmethod lazy_map_partitions(partitions, map_func)¶
Apply map_func to every partition in partitions lazily.
- Parameters
partitions (np.ndarray) – A NumPy 2D array of partitions to perform operation on.
map_func (callable) – Function to apply.
- Returns
A NumPy array of partitions.
- Return type
np.ndarray
- classmethod map_axis_partitions(axis, partitions, map_func, keep_partitioning=False, lengths=None, enumerate_partitions=False)¶
Apply map_func to every partition in partitions along given axis.
- Parameters
axis ({0, 1}) – Axis to perform the map across (0 - index, 1 - columns).
partitions (np.ndarray) – A NumPy 2D array of partitions to perform operation on.
map_func (callable) – Function to apply.
keep_partitioning (bool, default: False) – Whether to keep partitioning for Modin Frame. Setting it to True prevents data shuffling between partitions.
lengths (list of ints, default: None) – List of lengths to shuffle the object.
enumerate_partitions (bool, default: False) – Whether or not to pass partition index into map_func. Note that map_func must be able to accept partition_idx kwarg.
- Returns
A NumPy array of new partitions for Modin Frame.
- Return type
np.ndarray
Notes
This method should be used in the case when map_func relies on some global information about the axis.
- classmethod map_partitions(partitions, map_func)¶
Apply map_func to every partition in partitions.
- Parameters
partitions (np.ndarray) – A NumPy 2D array of partitions to perform operation on.
map_func (callable) – Function to apply.
- Returns
A NumPy array of partitions.
- Return type
np.ndarray
cuDFOnRay Frame Implementation¶
Modin implements Frame
, PartitionManager
, AxisPartition
, Partition
and
GPUManager
classes specific to the cuDFOnRay
backend:
cuDFOnRayFrame¶
The class is the specific implementation of PandasFrame
class using Ray distributed engine. It serves as an intermediate level between
cuDFQueryCompiler
and
cuDFOnRayFramePartitionManager
.
Public API¶
- class modin.engines.ray.cudf_on_ray.frame.data.cuDFOnRayFrame(partitions, index, columns, row_lengths=None, column_widths=None, dtypes=None)¶
The class implements the interface in
PandasOnRayFrame
using cuDF.- Parameters
partitions (np.ndarray) – A 2D NumPy array of partitions.
index (sequence) – The index for the dataframe. Converted to a
pandas.Index
.columns (sequence) – The columns object for the dataframe. Converted to a
pandas.Index
.row_lengths (list, optional) – The length of each partition in the rows. The “height” of each of the block partitions. Is computed if not provided.
column_widths (list, optional) – The width of each partition in the columns. The “width” of each of the block partitions. Is computed if not provided.
dtypes (pandas.Series, optional) – The data types for the dataframe columns.
- mask(row_indices=None, row_numeric_idx=None, col_indices=None, col_numeric_idx=None)¶
Lazily select columns or rows from given indices.
- Parameters
row_indices (list of hashable, optional) – The row labels to extract.
row_numeric_idx (list of int, optional) – The row indices to extract.
col_indices (list of hashable, optional) – The column labels to extract.
col_numeric_idx (list of int, optional) – The column indices to extract.
- Returns
A new
cuDFOnRayFrame
from the mask provided.- Return type
Notes
If both row_indices and row_numeric_idx are set, row_indices will be used. The same rule applied to col_indices and col_numeric_idx.
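The precedence rule above can be illustrated with a small, hypothetical helper (not Modin code) that resolves row selectors the same way, preferring labels over positional indices:

```python
def resolve_rows(index, row_indices=None, row_numeric_idx=None):
    """Return positional row numbers; labels win over positions."""
    if row_indices is not None:
        # Label-based selection takes precedence when both are given.
        return [index.index(label) for label in row_indices]
    if row_numeric_idx is not None:
        return list(row_numeric_idx)
    return list(range(len(index)))  # no selection: keep all rows


labels = ["a", "b", "c", "d"]
resolve_rows(labels, row_indices=["b", "d"], row_numeric_idx=[0])  # -> [1, 3]
```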
- synchronize_labels(axis=None)¶
Synchronize labels by applying the index object (Index or Columns) to the partitions eagerly.
- Parameters
axis ({0, 1, None}, default: None) – The axis to apply to. If None, it applies to both axes.
cuDFOnRayFramePartition¶
The class is the specific implementation of PandasFramePartition
,
providing the API to perform operations on a block partition, namely, cudf.DataFrame
,
using Ray as an execution engine.
An operation on a block partition can be performed asynchronously in two ways:
apply()
returnsray.ObjectRef
with integer key of operation result from internal storage.add_to_apply_calls()
returns a newcuDFOnRayFramePartition
object that is based on result of operation.
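The two styles can be sketched with a toy, key-based store (all names here are illustrative; no Ray or cuDF required): `apply` returns the storage key of the result, while `add_to_apply_calls` wraps that key in a fresh partition object.

```python
class ToyStore:
    """Stands in for the GPU manager's int-keyed storage dict."""

    def __init__(self):
        self.storage, self._next = {}, 0

    def store(self, value):
        key, self._next = self._next, self._next + 1
        self.storage[key] = value
        return key


class ToyPartition:
    def __init__(self, store, key):
        self.store, self.key = store, key

    def apply(self, func):
        # Style 1: compute eagerly, return the integer key of the result.
        return self.store.store(func(self.store.storage[self.key]))

    def add_to_apply_calls(self, func):
        # Style 2: same computation, wrapped in a new partition object.
        return ToyPartition(self.store, self.apply(func))


store = ToyStore()
p = ToyPartition(store, store.store([1, 2, 3]))
key = p.apply(sum)              # integer key of the stored result (6)
q = p.add_to_apply_calls(sum)   # new partition wrapping its own result key
```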
Public API¶
- class modin.engines.ray.cudf_on_ray.frame.partition.cuDFOnRayFramePartition(gpu_manager, key, length=None, width=None)¶
The class implements the interface in
PandasFramePartition
using cuDF on Ray.- Parameters
gpu_manager (modin.engines.ray.cudf_on_ray.frame.GPUManager) – A gpu manager to store cuDF dataframes.
key (ray.ObjectRef or int) – An integer key (or reference to key) associated with
cudf.DataFrame
stored in gpu_manager.length (ray.ObjectRef or int, optional) – Length or reference to it of wrapped
pandas.DataFrame
.width (ray.ObjectRef or int, optional) – Width or reference to it of wrapped
pandas.DataFrame
.
- add_to_apply_calls(func, **kwargs)¶
Apply func to this partition and create a new one.
- Parameters
func (callable) – A function to apply.
**kwargs (dict) – Additional keywords arguments to be passed in func.
- Returns
New partition based on result of func.
- Return type
Notes
We eagerly schedule the apply func and produce a new
cuDFOnRayFramePartition
.
- apply(func, **kwargs)¶
Apply func to this partition.
- Parameters
func (callable) – A function to apply.
**kwargs (dict) – Additional keywords arguments to be passed in func.
- Returns
A reference to integer key of result in internal dict-storage of self.gpu_manager.
- Return type
ray.ObjectRef
- apply_result_not_dataframe(func, **kwargs)¶
Apply func to this partition.
- Parameters
func (callable) – A function to apply.
**kwargs (dict) – Additional keywords arguments to be passed in func.
- Returns
A reference to integer key of result in internal dict-storage of self.gpu_manager.
- Return type
ray.ObjectRef
- copy()¶
Create a full copy of this object.
- Returns
- Return type
- free()¶
Free the dataframe and its associated self.key from self.gpu_manager.
- get()¶
Get object stored by this partition from self.gpu_manager.
- Returns
- Return type
ray.ObjectRef
- get_gpu_manager()¶
Get gpu manager associated with this partition.
- Returns
GPUManager
associated with this object.- Return type
modin.engines.ray.cudf_on_ray.frame.GPUManager
- get_key()¶
Get integer key of this partition in dict-storage of self.gpu_manager.
- Returns
- Return type
int
- get_object_id()¶
Get object stored for this partition from self.gpu_manager.
- Returns
- Return type
ray.ObjectRef
- length()¶
Get the length of the object wrapped by this partition.
- Returns
The length (or reference to length) of the object.
- Return type
int or ray.ObjectRef
- mask(row_indices, col_indices)¶
Select columns or rows from given indices.
- Parameters
row_indices (list of hashable) – The row labels to extract.
col_indices (list of hashable) – The column labels to extract.
- Returns
A reference to integer key of result in internal dict-storage of self.gpu_manager.
- Return type
ray.ObjectRef
- classmethod preprocess_func(func)¶
Put func to Ray object store.
- Parameters
func (callable) – Function to put.
- Returns
A reference to func in Ray object store.
- Return type
ray.ObjectRef
- classmethod put(gpu_manager, pandas_dataframe)¶
Put pandas_dataframe to gpu_manager.
- Parameters
gpu_manager (modin.engines.ray.cudf_on_ray.frame.GPUManager) – A gpu manager to store cuDF dataframes.
pandas_dataframe (pandas.DataFrame/pandas.Series) – A
pandas.DataFrame/pandas.Series
to put.
- Returns
A reference to integer key of added pandas.DataFrame to internal dict-storage in gpu_manager.
- Return type
ray.ObjectRef
- to_numpy()¶
Convert this partition to NumPy array.
- Returns
- Return type
NumPy array
- to_pandas()¶
Convert this partition to pandas.DataFrame.
- Returns
- Return type
- width()¶
Get the width of the object wrapped by this partition.
- Returns
The width (or reference to width) of the object.
- Return type
int or ray.ObjectRef
cuDFOnRayFrameAxisPartition¶
The base class for any axis partition class based on the Ray engine and cuDF backend. It provides the API to perform operations on an axis partition, using Ray as the execution engine. The axis partition is made up of a list of block partitions that are stored in this class.
Public API¶
- class modin.engines.ray.cudf_on_ray.frame.axis_partition.cuDFOnRayFrameAxisPartition(partitions)¶
Base class for any axis partition class for cuDF backend.
- Parameters
partitions (np.ndarray) – NumPy array with
cuDFOnRayFramePartition
-s.
- partition_type¶
alias of
modin.engines.ray.cudf_on_ray.frame.partition.cuDFOnRayFramePartition
cuDFOnRayFrameColumnPartition¶
Public API¶
- class modin.engines.ray.cudf_on_ray.frame.axis_partition.cuDFOnRayFrameColumnPartition(partitions)¶
The column partition implementation of
cuDFOnRayFrameAxisPartition
.- Parameters
partitions (np.ndarray) – NumPy array with
cuDFOnRayFramePartition
-s.
- reduce(func)¶
Reduce partitions along self.axis and apply func.
- Parameters
func (callable) – A func to apply.
- Returns
- Return type
cuDFOnRayFrameRowPartition¶
Public API¶
- class modin.engines.ray.cudf_on_ray.frame.axis_partition.cuDFOnRayFrameRowPartition(partitions)¶
The row partition implementation of
cuDFOnRayFrameAxisPartition
.- Parameters
partitions (np.ndarray) – NumPy array with
cuDFOnRayFramePartition
-s.
- reduce(func)¶
Reduce partitions along self.axis and apply func.
- Parameters
func (callable) – A function to apply.
- Returns
- Return type
Notes
Since we are using row partitions, we can bypass the Ray plasma store during axis reduction functions.
cuDFOnRayFramePartitionManager¶
This class is the specific implementation of GenericRayFramePartitionManager
.
It serves as an intermediate level between cuDFOnRayFrame
and cuDFOnRayFramePartition
class.
This class is responsible for partition manipulation and applying a function to
block/row/column partitions.
Public API¶
- class modin.engines.ray.cudf_on_ray.frame.partition_manager.cuDFOnRayFramePartitionManager¶
The class implements the interface in
GenericRayFramePartitionManager
using cuDF on Ray.- classmethod from_pandas(df, return_dims=False)¶
Create partitions from
pandas.DataFrame/pandas.Series
.- Parameters
df (pandas.DataFrame/pandas.Series) – A
pandas.DataFrame
to add.return_dims (bool, default: False) – Whether to return dimensions or not.
- Returns
List of partitions in case return_dims == False, tuple (partitions, row lengths, col widths) in other case.
- Return type
list or tuple
- classmethod lazy_map_partitions(partitions, map_func)¶
Apply map_func to every partition lazily.
Compared to Modin-CPU, the Modin-GPU lazy version represents:
A scheduled function in the Ray task graph.
A non-materialized key.
- Parameters
partitions (np.ndarray) – NumPy array with partitions.
map_func (callable) – The function to apply.
- Returns
A NumPy array of
cuDFOnRayFramePartition
objects.- Return type
np.ndarray
GPUManager¶
The Ray actor-class stores cudf.DataFrame
-s and executes operations on them.
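The actor pattern can be sketched without Ray or cuDF as a plain class (all names illustrative): a manager keeps dataframes in an int-keyed dict, and every operation stores its result and returns the new key.

```python
class ToyGPUManager:
    """Toy stand-in for GPUManager: dict-keyed storage, key-returning ops."""

    def __init__(self, gpu_id):
        self.gpu_id = gpu_id
        self.store = {}   # stands in for cudf_dataframe_dict
        self._next = 0

    def put(self, df):
        key, self._next = self._next, self._next + 1
        self.store[key] = df
        return key

    def apply(self, first, func):
        # Apply func to the stored value and store the result under a new key.
        return self.put(func(self.store[first]))

    def reduce(self, first, others, func):
        # Concatenate the frames behind `others`, then apply `func` to
        # (first, concatenated). Mirroring the note on `reduce` below,
        # `func` must accept None when `others` is empty.
        other = None
        if others:
            other = [x for k in others for x in self.store[k]]
        return self.put(func(self.store[first], other))

    def free(self, key):
        del self.store[key]


mgr = ToyGPUManager(gpu_id=0)
a = mgr.put([1, 2])
b = mgr.put([3])
c = mgr.put([4])
out = mgr.reduce(a, [b, c], lambda left, other: left + (other or []))
# mgr.store[out] == [1, 2, 3, 4]
```

In the real class these methods run inside a Ray actor, so callers receive `ray.ObjectRef`s that resolve to the integer keys shown here.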
Public API¶
- class modin.engines.ray.cudf_on_ray.frame.gpu_manager.GPUManager(gpu_id)¶
Ray actor-class to store
cudf.DataFrame
-s and execute functions on it.- Parameters
gpu_id (int) – The identifier of GPU.
- apply(first, other, func, **kwargs)¶
Apply func to values associated with first/other keys of self.cudf_dataframe_dict, storing the result.
Store the return value of func (a new
cudf.DataFrame
) into self.cudf_dataframe_dict.- Parameters
first (int) – The first key associated with dataframe from self.cudf_dataframe_dict.
other (int or ray.ObjectRef) – The second key associated with dataframe from self.cudf_dataframe_dict. If it isn’t a real key, the func will be applied to the first only.
func (callable) – A function to apply.
**kwargs (dict) – Additional keywords arguments to be passed in func.
- Returns
The new key of the new dataframe stored in self.cudf_dataframe_dict (will be a
ray.ObjectRef
in outside level).- Return type
int
- apply_non_persistent(first, other, func, **kwargs)¶
Apply func to values associated with first/other keys of self.cudf_dataframe_dict.
- Parameters
first (int) – The first key associated with dataframe from self.cudf_dataframe_dict.
other (int) – The second key associated with dataframe from self.cudf_dataframe_dict. If it isn’t a real key, the func will be applied to the first only.
func (callable) – A function to apply.
**kwargs (dict) – Additional keywords arguments to be passed in func.
- Returns
The result of the func (will be a
ray.ObjectRef
in outside level).- Return type
The return type of func
- free(key)¶
Free the dataframe and its associated key from self.cudf_dataframe_dict.
- Parameters
key (int) – The key to be deleted.
- get_id()¶
Get the self.gpu_id from this object.
- Returns
The gpu_id from this object (will be a
ray.ObjectRef
in outside level).- Return type
int
- get_oid(key)¶
Get the value from self.cudf_dataframe_dict by key.
- Parameters
key (int) – The key to get value.
- Returns
Dataframe corresponding to key (will be a ray.ObjectRef in outside level).
- Return type
cudf.DataFrame
- put(pandas_df)¶
Convert pandas_df to
cudf.DataFrame
and put it to self.cudf_dataframe_dict.- Parameters
pandas_df (pandas.DataFrame/pandas.Series) – A pandas DataFrame/Series to be added.
- Returns
The key associated with added dataframe (will be a
ray.ObjectRef
in outside level).- Return type
int
- reduce(first, others, func, axis=0, **kwargs)¶
Apply func to values associated with the first key and others keys of self.cudf_dataframe_dict, storing the result.
Dataframes associated with others keys will be concatenated into one dataframe.
Store the return value of func (a new
cudf.DataFrame
) into self.cudf_dataframe_dict.- Parameters
first (int) – The first key associated with dataframe from self.cudf_dataframe_dict.
others (list of int / list of ray.ObjectRef) – The list of keys associated with dataframe from self.cudf_dataframe_dict.
func (callable) – A function to apply.
axis ({0, 1}, default: 0) – An axis corresponding to a particular row/column of the dataframe.
**kwargs (dict) – Additional keywords arguments to be passed in func.
- Returns
The new key of the new dataframe stored in self.cudf_dataframe_dict (will be a
ray.ObjectRef
in outside level).- Return type
int
Notes
If
len(others) == 0
func should be able to accept None as its second positional argument.
- store_new_df(df)¶
Store df in self.cudf_dataframe_dict.
- Parameters
df (cudf.DataFrame) – The
cudf.DataFrame
to be added.- Returns
The key associated with added dataframe (will be a
ray.ObjectRef
in outside level).- Return type
int
PandasOnDask Frame Objects¶
This page describes the implementation of Base Frame Objects
specific for PandasOnDask
backend.
PandasOnDaskFrame¶
The class is the specific implementation of the dataframe algebra for the PandasOnDask
backend.
It serves as an intermediate level between pandas
query compiler and
PandasOnDaskFramePartitionManager
.
Public API¶
- class modin.engines.dask.pandas_on_dask.frame.data.PandasOnDaskFrame(partitions, index, columns, row_lengths=None, column_widths=None, dtypes=None)¶
The class implements the interface in
PandasFrame
.- Parameters
partitions (np.ndarray) – A 2D NumPy array of partitions.
index (sequence) – The index for the dataframe. Converted to a pandas.Index.
columns (sequence) – The columns object for the dataframe. Converted to a pandas.Index.
row_lengths (list, optional) – The length of each partition in the rows. The “height” of each of the block partitions. Is computed if not provided.
column_widths (list, optional) – The width of each partition in the columns. The “width” of each of the block partitions. Is computed if not provided.
dtypes (pandas.Series, optional) – The data types for the dataframe columns.
PandasOnDaskFramePartition¶
The class is the specific implementation of PandasFramePartition
,
providing the API to perform operations on a block partition, namely, pandas.DataFrame
, using Dask as the execution engine.
In addition to wrapping a pandas DataFrame, the class also holds the following metadata:
length
- length of pandas DataFrame wrappedwidth
- width of pandas DataFrame wrappedip
- node IP address that holds pandas DataFrame wrapped
An operation on a block partition can be performed in two modes:
asynchronously - via
apply()
lazily - via
add_to_apply_calls()
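A hedged sketch of the two modes, using a plain call queue in place of Dask futures (all names illustrative, not Modin's implementation): `apply` computes immediately, while `add_to_apply_calls` only records the function until `drain_call_queue` runs everything at once.

```python
class ToyPartition:
    """Toy partition with a call queue, in the spirit described above."""

    def __init__(self, data, call_queue=None):
        self.data = data
        self.call_queue = call_queue or []

    def apply(self, func, *args, **kwargs):
        # Eager: drain pending calls first so results stay consistent.
        self.drain_call_queue()
        return ToyPartition(func(self.data, *args, **kwargs))

    def add_to_apply_calls(self, func, *args, **kwargs):
        # Lazy: nothing executes yet; the queue is carried forward.
        return ToyPartition(self.data,
                            self.call_queue + [(func, args, kwargs)])

    def drain_call_queue(self):
        for func, args, kwargs in self.call_queue:
            self.data = func(self.data, *args, **kwargs)
        self.call_queue = []


p = ToyPartition([1, 2, 3])
lazy = p.add_to_apply_calls(lambda d: [x + 1 for x in d])
lazy = lazy.add_to_apply_calls(lambda d: [x * 2 for x in d])
lazy.drain_call_queue()
# lazy.data == [4, 6, 8]
```

Queuing lets several cheap transformations be fused into one remote task instead of one round-trip per call.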
Public API¶
- class modin.engines.dask.pandas_on_dask.frame.partition.PandasOnDaskFramePartition(future, length=None, width=None, ip=None, call_queue=None)¶
The class implements the interface in
PandasFramePartition
.- Parameters
future (distributed.Future) – A reference to pandas DataFrame that need to be wrapped with this class.
length (distributed.Future or int, optional) – Length or reference to it of wrapped pandas DataFrame.
width (distributed.Future or int, optional) – Width or reference to it of wrapped pandas DataFrame.
ip (distributed.Future or str, optional) – Node IP address or reference to it that holds wrapped pandas DataFrame.
call_queue (list, optional) – Call queue that needs to be executed on wrapped pandas DataFrame.
- add_to_apply_calls(func, *args, **kwargs)¶
Add a function to the call queue.
- Parameters
func (callable) – Function to be added to the call queue.
*args (iterable) – Additional positional arguments to be passed in func.
**kwargs (dict) – Additional keyword arguments to be passed in func.
- Returns
A new
PandasOnDaskFramePartition
object.- Return type
Notes
The keyword arguments are sent as a dictionary.
- apply(func, *args, **kwargs)¶
Apply a function to the object wrapped by this partition.
- Parameters
func (callable) – A function to apply.
*args (iterable) – Additional positional arguments to be passed in func.
**kwargs (dict) – Additional keyword arguments to be passed in func.
- Returns
A new
PandasOnDaskFramePartition
object.- Return type
Notes
The keyword arguments are sent as a dictionary.
- drain_call_queue()¶
Execute all operations stored in the call queue on the object wrapped by this partition.
- classmethod empty()¶
Create a new partition that wraps an empty pandas DataFrame.
- Returns
A new
PandasOnDaskFramePartition
object.- Return type
- get()¶
Get the object wrapped by this partition out of the distributed memory.
- Returns
The object from the distributed memory.
- Return type
- ip()¶
Get the node IP address of the object wrapped by this partition.
- Returns
IP address of the node that holds the data.
- Return type
str
- length()¶
Get the length of the object wrapped by this partition.
- Returns
The length of the object.
- Return type
int
- mask(row_indices, col_indices)¶
Lazily create a mask that extracts the indices provided.
- Parameters
row_indices (list-like, slice or label) – The indices for the rows to extract.
col_indices (list-like, slice or label) – The indices for the columns to extract.
- Returns
A new
PandasOnDaskFramePartition
object.- Return type
- classmethod preprocess_func(func)¶
Preprocess a function before an
apply
call.- Parameters
func (callable) – The function to preprocess.
- Returns
An object that can be accepted by
apply
.- Return type
callable
- classmethod put(obj)¶
Put an object into distributed memory and wrap it in a partition object.
- Parameters
obj (any) – An object to be put.
- Returns
A new
PandasOnDaskFramePartition
object.- Return type
- to_numpy(**kwargs)¶
Convert the object wrapped by this partition to a NumPy array.
- Parameters
**kwargs (dict) – Additional keyword arguments to be passed in
to_numpy
.- Returns
- Return type
np.ndarray
- to_pandas()¶
Convert the object wrapped by this partition to a pandas DataFrame.
- Returns
- Return type
- wait()¶
Wait for computations on the object wrapped by this partition to complete.
- width()¶
Get the width of the object wrapped by the partition.
- Returns
The width of the object.
- Return type
int
PandasOnDaskFrameAxisPartition¶
The class is the specific implementation of PandasFrameAxisPartition
,
providing the API to perform operations on an axis (column or row) partition using Dask as the execution engine.
The axis partition is a wrapper over a list of block partitions that are stored in this class.
Public API¶
- class modin.engines.dask.pandas_on_dask.frame.axis_partition.PandasOnDaskFrameAxisPartition(list_of_blocks, get_ip=False)¶
The class implements the interface in
PandasFrameAxisPartition
.- Parameters
list_of_blocks (list) – List of
PandasOnDaskFramePartition
objects.get_ip (bool, default: False) – Whether to get node IP addresses of corresponding partitions or not.
- classmethod deploy_axis_func(axis, func, num_splits, kwargs, maintain_partitioning, *partitions)¶
Deploy a function along a full axis.
- Parameters
axis ({0, 1}) – The axis to perform the function along.
func (callable) – The function to perform.
num_splits (int) – The number of splits to return (see split_result_of_axis_func_pandas).
kwargs (dict) – Additional keywords arguments to be passed in func.
maintain_partitioning (bool) – If True, keep the old partitioning if possible. If False, create a new partition layout.
*partitions (iterable) – All partitions that make up the full axis (row or column).
- Returns
A list of distributed.Future.
- Return type
list
- classmethod deploy_func_between_two_axis_partitions(axis, func, num_splits, len_of_left, other_shape, kwargs, *partitions)¶
Deploy a function along a full axis between two data sets.
- Parameters
axis ({0, 1}) – The axis to perform the function along.
func (callable) – The function to perform.
num_splits (int) – The number of splits to return (see split_result_of_axis_func_pandas).
len_of_left (int) – The number of values in partitions that belong to the left data set.
other_shape (np.ndarray) – The shape of right frame in terms of partitions, i.e. (other_shape[i-1], other_shape[i]) will indicate slice to restore i-1 axis partition.
kwargs (dict) – Additional keywords arguments to be passed in func.
*partitions (iterable) – All partitions that make up the full axis (row or column) for both data sets.
- Returns
A list of distributed.Future.
- Return type
list
- instance_type¶
alias of
distributed.client.Future
- partition_type¶
alias of
modin.engines.dask.pandas_on_dask.frame.partition.PandasOnDaskFramePartition
PandasOnDaskFrameColumnPartition¶
Public API¶
- class modin.engines.dask.pandas_on_dask.frame.axis_partition.PandasOnDaskFrameColumnPartition(list_of_blocks, get_ip=False)¶
The column partition implementation.
All of the implementation for this class is in the parent class, and this class defines the axis to perform the computation over.
- Parameters
list_of_blocks (list) – List of
PandasOnDaskFramePartition
objects.get_ip (bool, default: False) – Whether to get node IP addresses of corresponding partitions or not.
PandasOnDaskFrameRowPartition¶
Public API¶
- class modin.engines.dask.pandas_on_dask.frame.axis_partition.PandasOnDaskFrameRowPartition(list_of_blocks, get_ip=False)¶
The row partition implementation.
All of the implementation for this class is in the parent class, and this class defines the axis to perform the computation over.
- Parameters
list_of_blocks (list) – List of
PandasOnDaskFramePartition
objects.get_ip (bool, default: False) – Whether to get node IP addresses of corresponding partitions or not.
PandasOnDaskFramePartitionManager¶
This class is the specific implementation of PandasFramePartitionManager
using Dask as the execution engine. This class is responsible for partition manipulation and applying a function to
block/row/column partitions.
Public API¶
- class modin.engines.dask.pandas_on_dask.frame.partition_manager.PandasOnDaskFramePartitionManager¶
The class implements the interface in PandasFramePartitionManager.
- classmethod broadcast_apply(axis, apply_func, left, right, other_name='r')¶
Broadcast the right partitions to the left ones and apply the apply_func function.
- Parameters
axis ({0, 1}) – Axis to apply and broadcast over.
apply_func (callable) – Function to apply.
left (np.ndarray) – NumPy array of left partitions.
right (np.ndarray) – NumPy array of right partitions.
other_name (str, default: "r") – Name of key-value argument for apply_func that is used to pass right to apply_func.
- Returns
NumPy array of result partition objects.
- Return type
np.ndarray
- classmethod get_indices(axis, partitions, index_func)¶
Get the internal indices stored in the partitions.
- Parameters
axis ({0, 1}) – Axis to extract the labels over.
partitions (np.ndarray) – The array of partitions from which the labels need to be extracted.
index_func (callable) – The function to be used to extract the indices.
- Returns
A pandas Index object.
- Return type
pandas.Index
Notes
These are the global indices of the object. This is mostly useful when you have deleted rows/columns internally, but do not know which ones were deleted.
Experimental¶
modin.experimental
holds experimental functionality that is currently under development
and provides a limited set of features:
Scikit-learn module description¶
This module holds experimental scikit-learn-specific functionality for Modin.
API¶
Module holds model selection specific functionality.
- modin.experimental.sklearn.model_selection.train_test_split(df, **options)¶
Split input data to train and test data.
- Parameters
df (modin.pandas.DataFrame / modin.pandas.Series) – Data to split.
**options (dict) – Keyword arguments. If the train_size key isn’t provided, train_size defaults to 0.75.
- Returns
A pair of modin.pandas.DataFrame / modin.pandas.Series.
- Return type
tuple
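The default split semantics can be illustrated with a small stdlib-only sketch (an illustration of the documented 0.75 default, not Modin's implementation):

```python
def split_rows(rows, train_size=0.75):
    # Mirror the documented default: 75% of rows go to the train part
    # when train_size is not supplied.
    cut = int(len(rows) * train_size)
    return rows[:cut], rows[cut:]

train, test = split_rows(list(range(100)))
print(len(train), len(test))  # 75 25
```

The real function operates on modin.pandas.DataFrame / Series and returns a pair of the same type.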
Modin XGBoost module description¶
High-level Module Overview¶
This module holds classes, public interface and internal functions for distributed XGBoost in Modin.
Public classes Booster
, DMatrix
and function train()
provide the user with familiar XGBoost interfaces.
They are located in the modin.experimental.xgboost.xgboost
module.
The internal module modin.experimental.xgboost.xgboost_ray
contains the implementation of Modin XGBoost
for the Ray backend. This module mainly consists of the Ray actor-class ModinXGBoostActor
,
a function to distribute Modin’s partitions between actors _assign_row_partitions_to_actors()
,
an internal _train()
/_predict()
function used from the public interfaces, and additional util functions for computing cluster resources, actor creation, etc.
Public interfaces¶
DMatrix
inherits original class xgboost.DMatrix
and overrides
its constructor, which currently supports only data and label parameters. Both of the parameters must
be modin.pandas.DataFrame
, which will be internally unwrapped to lists of delayed objects of Modin’s
row partitions using the function unwrap_partitions()
.
- class modin.experimental.xgboost.DMatrix(data, label=None)¶
DMatrix holds references to partitions of Modin DataFrame.
At the init stage, unwrapping of the Modin DataFrame partitions is started.
- Parameters
data (modin.pandas.DataFrame) – Data source of DMatrix.
label (modin.pandas.DataFrame or modin.pandas.Series, optional) – Labels used for training.
Notes
Currently DMatrix supports only data and label parameters.
Booster
inherits original class xgboost.Booster
and
overrides its predict method. The difference from the original class interface is that the type of the data parameter changes to DMatrix
.
- class modin.experimental.xgboost.Booster(params=None, cache=(), model_file=None)¶
A Modin Booster of XGBoost.
Booster is the XGBoost model; it contains low-level routines for training, prediction and evaluation.
- Parameters
params (dict, optional) – Parameters for boosters.
cache (list, default: empty) – List of cache items.
model_file (string/os.PathLike/xgb.Booster/bytearray, optional) – Path to the model file if it’s a string or PathLike.
- predict(data: modin.experimental.xgboost.xgboost.DMatrix, **kwargs)¶
Run distributed prediction with a trained booster.
During execution it runs xgb.predict on each worker for its subset of data and creates a Modin DataFrame with the prediction results.
- Parameters
data (modin.experimental.xgboost.DMatrix) – Input data used for prediction.
**kwargs (dict) – Other parameters are the same as for
xgboost.Booster.predict
.
- Returns
Modin DataFrame with prediction results.
- Return type
modin.pandas.DataFrame
The train() function has 2 differences from the original train function: (1) the data type of the dtrain parameter is DMatrix and (2) there is a new parameter, num_actors.
- modin.experimental.xgboost.train(params: Dict, dtrain: modin.experimental.xgboost.xgboost.DMatrix, *args, evals=(), num_actors: Optional[int] = None, evals_result: Optional[Dict] = None, **kwargs)¶
Run distributed training of XGBoost model.
During execution it distributes dtrain evenly between workers according to the IP addresses of its partitions (if dtrain is not evenly distributed over the nodes, some partitions will be re-distributed between them), runs xgb.train on each worker for its subset of dtrain, and reduces the training results of each worker using Rabit Context.
- Parameters
params (dict) – Booster params.
dtrain (modin.experimental.xgboost.DMatrix) – Data to be trained against.
*args (iterable) – Other parameters for xgboost.train.
evals (list of pairs (modin.experimental.xgboost.DMatrix, str), default: empty) – List of validation sets for which metrics will be evaluated during training. Validation metrics help us track the performance of the model.
num_actors (int, optional) – Number of actors for training. If unspecified, this value will be computed automatically.
evals_result (dict, optional) – Dict to store evaluation results in.
**kwargs (dict) – Other parameters are the same as xgboost.train.
- Returns
A trained booster.
- Return type
modin.experimental.xgboost.Booster
Internal execution flow on Ray backend¶
Internal functions _train() and _predict() work similarly to their xgboost counterparts.
The data is passed to the _train() function as a DMatrix object. Using an iterator of DMatrix, lists of ray.ObjectRef with row partitions of the Modin DataFrame are extracted:
# Extract lists of row partitions from dtrain (DMatrix object)
X_row_parts, y_row_parts = dtrain
At this step the num_actors parameter is processed. The internal function _get_num_actors() examines the value provided by the user. If no value is provided, num_actors is computed under the condition that one actor should use at most 2 CPUs. This condition was chosen to maximize the number of parallel workers while still allowing multithreaded XGBoost training (2 threads per worker in this case).
Note
The num_actors parameter is exposed in the public train() function to allow fine-tuning for the best performance in specific use cases.
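The heuristic above can be sketched as follows (a simplified stand-in for the internal _get_num_actors(), with a num_cpus parameter added here for illustration; not Modin's actual code):

```python
import os

def get_num_actors(num_actors=None, num_cpus=None):
    # If the user supplied a value, use it; otherwise allocate one actor
    # per 2 CPUs so each actor can run XGBoost with 2 threads.
    if num_actors is not None:
        return num_actors
    if num_cpus is None:
        num_cpus = os.cpu_count() or 1
    return max(num_cpus // 2, 1)
```

For example, on an 8-CPU machine the default comes out to 4 actors, each training with 2 threads.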
ModinXGBoostActor objects are created.
The dtrain data is split evenly between the actors. The internal function _split_data_across_actors() assigns row partitions to actors using the internal function _assign_row_partitions_to_actors(). This function creates a dictionary in the form: {actor_rank: ([part_i0, part_i3, ..], [0, 3, ..]), ..}.
Note
_assign_row_partitions_to_actors()
takes into account IP
addresses of row partitions of dtrain data to minimize excess data transfer.
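The {actor_rank: (partitions, order)} structure can be sketched with a toy, IP-aware assignment (all names here are illustrative, not Modin's internal API):

```python
def assign_row_partitions_to_actors(actor_ips, partitions):
    # `partitions` is a list of (partition, ip) pairs. Each partition is
    # assigned to the actor on the same node when one exists, otherwise
    # it is round-robined; the original order index is recorded alongside.
    rank_of = {ip: rank for rank, ip in enumerate(actor_ips)}
    assignment = {rank: ([], []) for rank in range(len(actor_ips))}
    for order, (part, ip) in enumerate(partitions):
        rank = rank_of.get(ip, order % len(actor_ips))
        assignment[rank][0].append(part)
        assignment[rank][1].append(order)
    return assignment
```

The real function additionally re-balances actors that ended up with too many or too few partitions.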
For each ModinXGBoostActor object, the set_train_data method is called remotely. This method loads row partitions into the actor according to the partition-distribution dictionary from the previous step. When data is passed to the actor, the row partitions are automatically materialized (ray.ObjectRef -> pandas.DataFrame).
The train method of the ModinXGBoostActor class is called remotely. This method runs XGBoost training on the actor's local data, connects to the Rabit Tracker for sharing training state between actors, and returns a dictionary with the booster and evaluation results.
At the final stage, the results from the actors are returned. booster and evals_result are retrieved from the remote actors using the ray.get function.
The data is passed to the _predict() function as a DMatrix object.
The _map_predict() function is applied remotely to each partition of the data to make a partial prediction.
The resulting modin.pandas.DataFrame is created from the ray.ObjectRef objects obtained in the previous step.
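The map-then-concatenate prediction flow can be sketched engine-free (a thread pool stands in for remote Ray tasks; the names are illustrative, not Modin's internal API):

```python
from concurrent.futures import ThreadPoolExecutor

def map_predict(predict_fn, partitions):
    # Run the prediction function on each row partition "remotely"
    # (here: in a thread pool, preserving partition order) and
    # concatenate the partial results into one sequence.
    with ThreadPoolExecutor() as pool:
        partial_results = list(pool.map(predict_fn, partitions))
    combined = []
    for part in partial_results:
        combined.extend(part)
    return combined
```

In Modin, `predict_fn` wraps xgb.predict, the partitions are ray.ObjectRef objects, and the concatenation produces a modin.pandas.DataFrame.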
Internal API¶
- class modin.experimental.xgboost.xgboost_ray.ModinXGBoostActor(rank, nthread)¶
Ray actor class that runs training on a remote worker.
- Parameters
rank (int) – Rank of this actor.
nthread (int) – Number of threads used by XGBoost in this actor.
- _get_dmatrix(X_y)¶
Create xgboost.DMatrix from sequence of pandas.DataFrame objects.
The first half of X_y should contain objects for X, the second half for y.
- Parameters
X_y (list) – List of pandas.DataFrame objects.
- Returns
A XGBoost DMatrix.
- Return type
xgb.DMatrix
- add_eval_data(*X_y, eval_method)¶
Add evaluation data for actor.
- Parameters
*X_y (iterable) – Sequence of ray.ObjectRef objects. The first half of the sequence is X data, the second half is y. When passed into the actor, ray.ObjectRef -> pandas.DataFrame materialization happens automatically.
eval_method (str) – Name of the eval data.
- set_train_data(*X_y, add_as_eval_method=None)¶
Set train data for actor.
- Parameters
*X_y (iterable) – Sequence of ray.ObjectRef objects. The first half of the sequence is X data, the second half is y. When passed into the actor, ray.ObjectRef -> pandas.DataFrame materialization happens automatically.
add_as_eval_method (str, optional) – Name of the eval data. Used when the train data is also used for evaluation.
- train(rabit_args, params, *args, **kwargs)¶
Run local XGBoost training.
Connects to the Rabit Tracker environment to share training state between actors and trains the XGBoost booster using self._dtrain.
- Parameters
rabit_args (list) – List with environment variables for Rabit Tracker.
params (dict) – Booster params.
*args (iterable) – Other parameters for xgboost.train.
**kwargs (dict) – Other parameters for xgboost.train.
- Returns
A dictionary with trained booster and dict of evaluation results as {“booster”: xgb.Booster, “history”: dict}.
- Return type
dict
- modin.experimental.xgboost.xgboost_ray._assign_row_partitions_to_actors(actors: List, row_partitions, data_for_aligning=None)¶
Assign row_partitions to actors.
row_partitions will be assigned to actors according to their IPs. If the distribution isn’t even, partitions will be moved from actors with an excess of partitions to actors with a shortage of them.
- Parameters
actors (list) – List of used actors.
row_partitions (list) – Row partitions of data to assign.
data_for_aligning (dict, optional) – Data whose ordering determines how row_partitions is distributed. Used to align y with X.
- Returns
Dictionary of partitions assigned to actors, in the form {actor_rank: (partitions, order)}.
- Return type
dict
- modin.experimental.xgboost.xgboost_ray._train(dtrain, params: Dict, *args, num_actors=None, evals=(), **kwargs)¶
Run distributed training of XGBoost model on Ray backend.
During execution it distributes dtrain evenly between workers according to the IP addresses of its partitions (if dtrain is not evenly distributed over the nodes, some partitions will be re-distributed between them), runs xgb.train on each worker for its subset of dtrain, and reduces the training results of each worker using Rabit Context.
- Parameters
dtrain (modin.experimental.xgboost.DMatrix) – Data to be trained against.
params (dict) – Booster params.
*args (iterable) – Other parameters for xgboost.train.
num_actors (int, optional) – Number of actors for training. If unspecified, this value will be computed automatically.
evals (list of pairs (modin.experimental.xgboost.DMatrix, str), default: empty) – List of validation sets for which metrics will be evaluated during training. Validation metrics will help us track the performance of the model.
**kwargs (dict) – Other parameters are the same as xgboost.train.
- Returns
A dictionary with trained booster and dict of evaluation results as {“booster”: xgboost.Booster, “history”: dict}.
- Return type
dict
- modin.experimental.xgboost.xgboost_ray._predict(booster, data, **kwargs)¶
Run distributed prediction with a trained booster on Ray backend.
During execution it runs xgb.predict on each worker for its subset of data and creates a Modin DataFrame with the prediction results.
- Parameters
booster (xgboost.Booster) – A trained booster.
data (modin.experimental.xgboost.DMatrix) – Input data used for prediction.
**kwargs (dict) – Other parameters are the same as for
xgboost.Booster.predict
.
- Returns
Modin DataFrame with prediction results.
- Return type
modin.pandas.DataFrame
- modin.experimental.xgboost.xgboost_ray._get_num_actors(num_actors=None)¶
Get number of actors to create.
- Parameters
num_actors (int, optional) – Desired number of actors. If None, the number of actors is computed under the condition of 2 CPUs per actor.
- Returns
Number of actors to create.
- Return type
int
- modin.experimental.xgboost.xgboost_ray._split_data_across_actors(actors: List, set_func, X_parts, y_parts)¶
Split row partitions of data between actors.
- Parameters
actors (list) – List of used actors.
set_func (callable) – The function for setting data in actor.
X_parts (list) – Row partitions of X data.
y_parts (list) – Row partitions of y data.
- modin.experimental.xgboost.xgboost_ray._map_predict(booster, part, columns, **kwargs)¶
Run prediction on a remote worker.
- Parameters
booster (xgboost.Booster or ray.ObjectRef) – A trained booster.
part (pandas.DataFrame or ray.ObjectRef) – Partition of full data used for local prediction.
columns (list or ray.ObjectRef) – Columns for the result.
**kwargs (dict) – Other parameters are the same as for
xgboost.Booster.predict
.
- Returns
ray.ObjectRef with the partial prediction.
- Return type
ray.ObjectRef
Query Compiler¶
Base Query Compiler¶
Brief description¶
BaseQueryCompiler
is an abstract query compiler class that sets the common interface
every other query compiler implementation in Modin must follow. The base class contains basic
implementations for most of the interface methods, all of which
default to pandas.
Subclassing BaseQueryCompiler
¶
If you want to add a new type of query compiler to Modin, the new class needs to inherit
from BaseQueryCompiler
and implement the abstract methods:
from_pandas() – build a query compiler from a pandas DataFrame.
from_arrow() – build a query compiler from an Arrow Table.
to_pandas() – get the query compiler representation as a pandas DataFrame.
default_to_pandas() – fall back to pandas for the passed function.
finalize() – finalize object construction.
free() – trigger memory cleaning.
(Please refer to the code documentation to see the full documentation for these functions).
This is a minimum set of operations to ensure a new query compiler will function in the Modin architecture, and the rest of the API can safely default to the pandas implementation via the base class implementation. To add a backend-specific implementation for some of the query compiler operations, just override the corresponding method in your query compiler class.
Example¶
As an exercise, let’s define a new query compiler in Modin, just to see how easy it is. Usually, the query compiler routes formed queries to the underlying frame class, which submits operators to an execution engine. For the sake of simplicity and independence in this example, our execution engine will be pandas itself.
We need to inherit a new class from BaseQueryCompiler
and implement all of the abstract methods.
In this case, with pandas as an execution engine, it’s trivial:
from modin.backends import BaseQueryCompiler
class DefaultToPandasQueryCompiler(BaseQueryCompiler):
def __init__(self, pandas_df):
self._pandas_df = pandas_df
@classmethod
def from_pandas(cls, df, *args, **kwargs):
return cls(df)
@classmethod
def from_arrow(cls, at, *args, **kwargs):
return cls(at.to_pandas())
def to_pandas(self):
return self._pandas_df.copy()
def default_to_pandas(self, pandas_op, *args, **kwargs):
return type(self)(pandas_op(self.to_pandas(), *args, **kwargs))
def finalize(self):
pass
def free(self):
pass
All done! Now you’ve got a fully functional query compiler, which is ready for extensions and already can be used in Modin DataFrame:
import pandas
pandas_df = pandas.DataFrame({"col1": [1, 2, 2, 1], "col2": [10, 2, 3, 40]})
# Building our query compiler from pandas object
qc = DefaultToPandasQueryCompiler.from_pandas(pandas_df)
import modin.pandas as pd
# Building Modin DataFrame from newly created query compiler
modin_df = pd.DataFrame(query_compiler=qc)
# Got fully functional Modin DataFrame
>>> print(modin_df.groupby("col1").sum().reset_index())
col1 col2
0 1 50
1 2 5
To be able to select this query compiler as the default via modin.config
you also need
to define the combination of your query compiler and the pandas execution engine as a backend
by adding the corresponding factory. To find more information about factories,
visit the corresponding section of the flow documentation.
Query Compiler API¶
- class modin.backends.base.query_compiler.BaseQueryCompiler¶
Abstract class that handles the queries to Modin dataframes.
This class defines the common query compiler API; most of the methods are already implemented and default to pandas.
- lazy_execution¶
Whether the underlying execution engine is designed to be executed in lazy mode only. If True, such a QueryCompiler will be handled differently at the front end in order to trigger execution as rarely as possible.
- Type
bool
Notes
See the Abstract Methods and Fields section immediately below this for a list of requirements for subclassing this object.
- abs()¶
Get absolute numeric value of each element.
- Returns
QueryCompiler with absolute numeric value of each element.
- Return type
- add(other, **kwargs)¶
Perform element-wise addition (
self + other
).If axes are not equal, perform frames alignment first.
- Parameters
other (BaseQueryCompiler, scalar or array-like) – Other operand of the binary operation.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint that is passed from a high-level API.
level (int or label) – In case of MultiIndex match index values on the passed level.
axis ({0, 1}) – Axis to match indices along for 1D other (list or QueryCompiler that represents a Series). 0 is for index, while 1 is for columns.
fill_value (float or None) – Value to fill missing elements during frame alignment.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
Result of binary operation.
- Return type
- add_prefix(prefix, axis=1)¶
Add string prefix to the index labels along specified axis.
- Parameters
prefix (str) – The string to add before each label.
axis ({0, 1}, default: 1) – Axis to add prefix along. 0 is for index and 1 is for columns.
- Returns
New query compiler with updated labels.
- Return type
- add_suffix(suffix, axis=1)¶
Add string suffix to the index labels along specified axis.
- Parameters
suffix (str) – The string to add after each label.
axis ({0, 1}, default: 1) – Axis to add suffix along. 0 is for index and 1 is for columns.
- Returns
New query compiler with updated labels.
- Return type
- all(**kwargs)¶
Return whether all the elements are true, potentially over an axis.
- Parameters
axis ({0, 1}, optional) –
bool_only (bool, optional) –
skipna (bool) –
level (int or label) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
If axis was specified return one-column QueryCompiler with index labels of the specified axis, where each row contains boolean of whether all elements at the corresponding row or column are True. Otherwise return QueryCompiler with a single bool of whether all elements are True.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.all
for more information about parameters and output format.
- any(**kwargs)¶
Return whether any element is true, potentially over an axis.
- Parameters
axis ({0, 1}, optional) –
bool_only (bool, optional) –
skipna (bool) –
level (int or label) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
If axis was specified return one-column QueryCompiler with index labels of the specified axis, where each row contains boolean of whether any element at the corresponding row or column is True. Otherwise return QueryCompiler with a single bool of whether any element is True.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.any
for more information about parameters and output format.
- apply(func, axis, *args, **kwargs)¶
Apply passed function across given axis.
- Parameters
func (callable(pandas.Series) -> scalar, str, list or dict of such) – The function to apply to each column or row.
axis ({0, 1}) – Target axis to apply the function along. 0 is for index, 1 is for columns.
*args (iterable) – Positional arguments to pass to func.
**kwargs (dict) – Keyword arguments to pass to func.
- Returns
QueryCompiler that contains the results of execution and is built by the following rules:
Labels of specified axis are the passed functions names.
Labels of the opposite axis are preserved.
Each element is the result of execution of func against corresponding row/column.
- Return type
- applymap(func)¶
Apply passed function elementwise.
- Parameters
func (callable(scalar) -> scalar) – Function to apply to each element of the QueryCompiler.
- Returns
Transformed QueryCompiler.
- Return type
- astype(col_dtypes, **kwargs)¶
Convert columns dtypes to given dtypes.
- Parameters
col_dtypes (dict) – Map for column names and new dtypes.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler with updated dtypes.
- Return type
- cat_codes()¶
Convert underlying categories data into its codes.
- Returns
New QueryCompiler containing the integer codes of the underlying categories.
- Return type
Notes
Please refer to modin.pandas.Series.cat.codes for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- clip(lower, upper, **kwargs)¶
Trim values at input threshold.
- Parameters
lower (float or list-like) –
upper (float or list-like) –
axis ({0, 1}) –
inplace ({False}) – This parameter serves the compatibility purpose. Always has to be False.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
QueryCompiler with values limited by the specified thresholds.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.clip
for more information about parameters and output format.
- columnarize()¶
Transpose this QueryCompiler if it has a single row but multiple columns.
This method should be called for QueryCompilers representing a Series object, i.e. self.is_series_like() should be True.
- Returns
Transposed new QueryCompiler or self.
- Return type
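The rule is a one-liner in plain pandas terms (a sketch operating on a pandas.DataFrame rather than a query compiler):

```python
import pandas as pd

def columnarize(df):
    # A single-row, multi-column frame representing a Series is
    # transposed so the values lie along the index; otherwise the
    # frame is returned unchanged.
    if len(df.index) == 1 and len(df.columns) > 1:
        return df.T
    return df
```

This is why the method is only meaningful for Series-like objects: for a genuine multi-row frame it is a no-op.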
- combine(other, **kwargs)¶
Perform column-wise combine with another QueryCompiler with passed func.
If axes are not equal, perform frames alignment first.
- Parameters
other (BaseQueryCompiler) – Right operand of the binary operation.
func (callable(pandas.Series, pandas.Series) -> pandas.Series) – Function that takes two pandas.Series with aligned axes and returns one pandas.Series as the resulting combination.
fill_value (float or None) – Value to fill missing values with after frame alignment occurred.
overwrite (bool) – If True, columns in self that do not exist in other will be overwritten with NaNs.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
Result of combine.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.combine
for more information about parameters and output format.
- combine_first(other, **kwargs)¶
Fill null elements of self with value in the same location in other.
If axes are not equal, perform frames alignment first.
- Parameters
other (BaseQueryCompiler) – Provided frame to use to fill null values from.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
- Return type
Notes
Please refer to
modin.pandas.DataFrame.combine_first
for more information about parameters and output format.
- compare(other, align_axis, keep_shape, keep_equal)¶
Compare data of two QueryCompilers and highlight the difference.
- Parameters
other (BaseQueryCompiler) – Query compiler to compare with. Has to have the same shape and labeling as self.
align_axis ({0, 1}) –
keep_shape (bool) –
keep_equal (bool) –
- Returns
New QueryCompiler containing the differences between self and passed query compiler.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.compare
for more information about parameters and output format.
- concat(axis, other, **kwargs)¶
Concatenate self with passed query compilers along specified axis.
- Parameters
axis ({0, 1}) – Axis to concatenate along. 0 is for index and 1 is for columns.
other (BaseQueryCompiler or list of such) – Objects to concatenate with self.
join ({'outer', 'inner', 'right', 'left'}, default: 'outer') – Type of join that will be used if indices on the other axis are different. (note: if specified, has to be passed as join=value).
ignore_index (bool, default: False) – If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, …, n - 1. (note: if specified, has to be passed as ignore_index=value).
sort (bool, default: False) – Whether or not to sort the non-concatenation axis. (note: if specified, has to be passed as sort=value).
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
Concatenated objects.
- Return type
- conj(**kwargs)¶
Get the complex conjugate for every element of self.
- Parameters
**kwargs (dict) –
- Returns
QueryCompiler with conjugate applied element-wise.
- Return type
Notes
Please refer to
numpy.conj
for parameters description.
- copy()¶
Make a copy of this object.
- Returns
Copy of self.
- Return type
Notes
For copy, we don’t want a situation where we modify the metadata of the copies if we end up modifying something here. We copy all of the metadata to prevent that.
- corr(**kwargs)¶
Compute pairwise correlation of columns, excluding NA/null values.
- Parameters
method ({'pearson', 'kendall', 'spearman'} or callable(pandas.Series, pandas.Series) -> pandas.Series) – Correlation method.
min_periods (int) – Minimum number of observations required per pair of columns to have a valid result. If fewer than min_periods non-NA values are present the result will be NA.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
Correlation matrix.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.corr
for more information about parameters and output format.
- count(**kwargs)¶
Get the number of non-NaN values for each column or row.
- Parameters
axis ({0, 1}) –
level (None, default: None) – Serves the compatibility purpose. Always has to be None.
numeric_only (bool, optional) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
One-column QueryCompiler with index labels of the specified axis, where each row contains the number of non-NaN values for the corresponding row or column.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.count
for more information about parameters and output format.
- cov(**kwargs)¶
Compute pairwise covariance of columns, excluding NA/null values.
- Parameters
min_periods (int) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
Covariance matrix.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.cov
for more information about parameters and output format.
- cummax(**kwargs)¶
Get the cumulative maximum for every row or column.
- Parameters
axis ({0, 1}) –
skipna (bool) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
QueryCompiler of the same shape as self, where each element is the maximum of all the previous values in this row or column.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.cummax
for more information about parameters and output format.
- cummin(**kwargs)¶
Get the cumulative minimum for every row or column.
- Parameters
axis ({0, 1}) –
skipna (bool) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
QueryCompiler of the same shape as self, where each element is the minimum of all the previous values in this row or column.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.cummin
for more information about parameters and output format.
- cumprod(**kwargs)¶
Get the cumulative product for every row or column.
- Parameters
axis ({0, 1}) –
skipna (bool) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
QueryCompiler of the same shape as self, where each element is the product of all the previous values in this row or column.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.cumprod
for more information about parameters and output format.
- cumsum(**kwargs)¶
Get the cumulative sum for every row or column.
- Parameters
axis ({0, 1}) –
skipna (bool) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
QueryCompiler of the same shape as self, where each element is the sum of all the previous values in this row or column.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.cumsum
for more information about parameters and output format.
- abstract default_to_pandas(pandas_op, *args, **kwargs)¶
Do fallback to pandas for the passed function.
- Parameters
pandas_op (callable(pandas.DataFrame) -> object) – Function to apply to the frame after it is cast to pandas.
*args (iterable) – Positional arguments to pass to pandas_op.
**kwargs (dict) – Key-value arguments to pass to pandas_op.
- Returns
The result of the pandas_op, converted back to
BaseQueryCompiler
.- Return type
- delitem(key)¶
Drop key column.
- Parameters
key (label) – Column name to drop.
- Returns
New QueryCompiler without key column.
- Return type
- describe(**kwargs)¶
Generate descriptive statistics.
- Parameters
percentiles (list-like) –
include ("all" or list of dtypes, optional) –
exclude (list of dtypes, optional) –
datetime_is_numeric (bool) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
QueryCompiler object containing the descriptive statistics of the underlying data.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.describe
for more information about parameters and output format.
- df_update(other, **kwargs)¶
Update values of self using non-NA values of other at the corresponding positions.
If axes are not equal, perform frames alignment first.
- Parameters
other (BaseQueryCompiler) – Frame to grab replacement values from.
join ({"left"}) – Specify type of join to align frames if axes are not equal (note: currently only one type of join is implemented).
overwrite (bool) – Whether to overwrite every corresponding value of self, or only if it’s NaN.
filter_func (callable(pandas.Series, pandas.Series) -> numpy.ndarray<bool>) – Function that takes a column of self and returns a bool mask for values that should be overwritten in the self frame.
errors ({"raise", "ignore"}) – If “raise”, will raise a
ValueError
if self and other both contain non-NA data in the same place.**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler with updated values.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.update
for more information about parameters and output format.
- diff(**kwargs)¶
First discrete difference of element.
- Parameters
periods (int) –
axis ({0, 1}) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
QueryCompiler of the same shape as self, where each element is the difference between the corresponding value and the previous value in this row or column.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.diff
for more information about parameters and output format.
- dot(other, **kwargs)¶
Compute the matrix multiplication of self and other.
- Parameters
other (BaseQueryCompiler or NumPy array) – The other query compiler or NumPy array to matrix multiply with self.
squeeze_self (boolean) – If self is a one-column query compiler, indicates whether it represents Series object.
squeeze_other (boolean) – If other is a one-column query compiler, indicates whether it represents Series object.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
A new query compiler that contains result of the matrix multiply.
- Return type
BaseQueryCompiler
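The matrix-multiply semantics follow pandas DataFrame.dot; a minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]])
# Matrix product with a NumPy array; a 2x2 identity returns df unchanged.
result = df.dot(np.eye(2, dtype=int))
print(result.values.tolist())  # [[1, 2], [3, 4]]
```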
- drop(index=None, columns=None)¶
Drop specified rows or columns.
- Parameters
index (list of labels, optional) – Labels of rows to drop.
columns (list of labels, optional) – Labels of columns to drop.
- Returns
New QueryCompiler with removed data.
- Return type
BaseQueryCompiler
- dropna(**kwargs)¶
Remove missing values.
- Parameters
axis ({0, 1}) –
how ({"any", "all"}) –
thresh (int, optional) –
subset (list of labels) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler with null values dropped along given axis.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.DataFrame.dropna for more information about parameters and output format.
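A minimal pandas sketch of how the how parameter drives the drop decision (Modin mirrors pandas):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, np.nan], "b": [np.nan, 5.0, np.nan]})
# how="any" drops a row with at least one NaN; how="all" drops a row only
# when every value in it is NaN.
print(df.dropna(how="any").index.tolist())  # [1]
print(df.dropna(how="all").index.tolist())  # [0, 1]
```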
- dt_ceil(freq, ambiguous='raise', nonexistent='raise')¶
Perform ceil operation on the underlying time-series data to the specified freq.
- Parameters
freq (str) –
ambiguous ({"raise", "infer", "NaT"} or bool mask, default: "raise") –
nonexistent ({"raise", "shift_forward", "shift_backward", "NaT"} or timedelta, default: "raise") –
- Returns
New QueryCompiler with performed ceil operation on every element.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.ceil for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
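A minimal pandas sketch of the ceil semantics (the freq string "min" is one of the standard pandas offset aliases):

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2021-01-01 12:34:56"]))
# Ceil rounds each timestamp up to the next whole minute.
print(s.dt.ceil("min").iloc[0])  # 2021-01-01 12:35:00
```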
- dt_components()¶
Spread each date-time value into its components (days, hours, minutes…).
- Returns
New QueryCompiler with a column for each date-time component (days, hours, minutes, seconds, milliseconds, microseconds, nanoseconds).
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.components for more information about parameters and output format.
This method is supported only by one-column query compilers.
- dt_date()¶
Get the date without timezone information for each datetime value.
- Returns
New QueryCompiler with the same shape as self, where each element is the date without timezone information for the corresponding datetime value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.date for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_day()¶
Get day component for each datetime value.
- Returns
New QueryCompiler with the same shape as self, where each element is day component for the corresponding datetime value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.day for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_day_name(locale=None)¶
Get day name for each datetime value.
- Parameters
locale (str, optional) –
- Returns
New QueryCompiler with the same shape as self, where each element is day name for the corresponding datetime value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.day_name for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_dayofweek()¶
Get integer day of week for each datetime value.
- Returns
New QueryCompiler with the same shape as self, where each element is integer day of week for the corresponding datetime value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.dayofweek for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_dayofyear()¶
Get day of year for each datetime value.
- Returns
New QueryCompiler with the same shape as self, where each element is day of year for the corresponding datetime value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.dayofyear for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_days()¶
Get days for each interval value.
- Returns
New QueryCompiler with the same shape as self, where each element is days for the corresponding interval value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.days for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_days_in_month()¶
Get number of days in month for each datetime value.
- Returns
New QueryCompiler with the same shape as self, where each element is number of days in month for the corresponding datetime value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.days_in_month for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_daysinmonth()¶
Get number of days in month for each datetime value.
- Returns
New QueryCompiler with the same shape as self, where each element is number of days in month for the corresponding datetime value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.daysinmonth for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_end_time()¶
Get the timestamp of end time for each period value.
- Returns
New QueryCompiler with the same shape as self, where each element is the timestamp of end time for the corresponding period value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.end_time for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_floor(freq, ambiguous='raise', nonexistent='raise')¶
Perform floor operation on the underlying time-series data to the specified freq.
- Parameters
freq (str) –
ambiguous ({"raise", "infer", "NaT"} or bool mask, default: "raise") –
nonexistent ({"raise", "shift_forward", "shift_backward", "NaT"} or timedelta, default: "raise") –
- Returns
New QueryCompiler with performed floor operation on every element.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.floor for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_freq()¶
Get the time frequency of the underlying time-series data.
- Returns
QueryCompiler containing a single value, the frequency of the data.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.freq for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_hour()¶
Get hour for each datetime value.
- Returns
New QueryCompiler with the same shape as self, where each element is hour for the corresponding datetime value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.hour for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_is_leap_year()¶
Get the boolean of whether corresponding year is leap for each datetime value.
- Returns
New QueryCompiler with the same shape as self, where each element is the boolean of whether corresponding year is leap for the corresponding datetime value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.is_leap_year for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_is_month_end()¶
Get the boolean of whether the date is the last day of the month for each datetime value.
- Returns
New QueryCompiler with the same shape as self, where each element is the boolean of whether the date is the last day of the month for the corresponding datetime value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.is_month_end for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_is_month_start()¶
Get the boolean of whether the date is the first day of the month for each datetime value.
- Returns
New QueryCompiler with the same shape as self, where each element is the boolean of whether the date is the first day of the month for the corresponding datetime value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.is_month_start for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_is_quarter_end()¶
Get the boolean of whether the date is the last day of the quarter for each datetime value.
- Returns
New QueryCompiler with the same shape as self, where each element is the boolean of whether the date is the last day of the quarter for the corresponding datetime value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.is_quarter_end for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_is_quarter_start()¶
Get the boolean of whether the date is the first day of the quarter for each datetime value.
- Returns
New QueryCompiler with the same shape as self, where each element is the boolean of whether the date is the first day of the quarter for the corresponding datetime value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.is_quarter_start for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_is_year_end()¶
Get the boolean of whether the date is the last day of the year for each datetime value.
- Returns
New QueryCompiler with the same shape as self, where each element is the boolean of whether the date is the last day of the year for the corresponding datetime value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.is_year_end for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_is_year_start()¶
Get the boolean of whether the date is the first day of the year for each datetime value.
- Returns
New QueryCompiler with the same shape as self, where each element is the boolean of whether the date is the first day of the year for the corresponding datetime value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.is_year_start for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_microsecond()¶
Get microseconds component for each datetime value.
- Returns
New QueryCompiler with the same shape as self, where each element is microseconds component for the corresponding datetime value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.microsecond for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_microseconds()¶
Get microseconds component for each interval value.
- Returns
New QueryCompiler with the same shape as self, where each element is microseconds component for the corresponding interval value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.microseconds for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_minute()¶
Get minute component for each datetime value.
- Returns
New QueryCompiler with the same shape as self, where each element is minute component for the corresponding datetime value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.minute for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_month()¶
Get month component for each datetime value.
- Returns
New QueryCompiler with the same shape as self, where each element is month component for the corresponding datetime value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.month for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_month_name(locale=None)¶
Get the month name for each datetime value.
- Parameters
locale (str, optional) –
- Returns
New QueryCompiler with the same shape as self, where each element is the month name for the corresponding datetime value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.month_name for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_nanosecond()¶
Get nanoseconds component for each datetime value.
- Returns
New QueryCompiler with the same shape as self, where each element is nanoseconds component for the corresponding datetime value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.nanosecond for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_nanoseconds()¶
Get nanoseconds component for each interval value.
- Returns
New QueryCompiler with the same shape as self, where each element is nanoseconds component for the corresponding interval value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.nanoseconds for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_normalize()¶
Set the time component of each date-time value to midnight.
- Returns
New QueryCompiler containing date-time values with midnight time.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.normalize for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_quarter()¶
Get quarter component for each datetime value.
- Returns
New QueryCompiler with the same shape as self, where each element is quarter component for the corresponding datetime value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.quarter for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_qyear()¶
Get the fiscal year for each period value.
- Returns
New QueryCompiler with the same shape as self, where each element is the fiscal year for the corresponding period value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.qyear for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_round(freq, ambiguous='raise', nonexistent='raise')¶
Perform round operation on the underlying time-series data to the specified freq.
- Parameters
freq (str) –
ambiguous ({"raise", "infer", "NaT"} or bool mask, default: "raise") –
nonexistent ({"raise", "shift_forward", "shift_backward", "NaT"} or timedelta, default: "raise") –
- Returns
New QueryCompiler with performed round operation on every element.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.round for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_second()¶
Get seconds component for each datetime value.
- Returns
New QueryCompiler with the same shape as self, where each element is seconds component for the corresponding datetime value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.second for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_seconds()¶
Get seconds component for each interval value.
- Returns
New QueryCompiler with the same shape as self, where each element is seconds component for the corresponding interval value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.seconds for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_start_time()¶
Get the timestamp of start time for each period value.
- Returns
New QueryCompiler with the same shape as self, where each element is the timestamp of start time for the corresponding period value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.start_time for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_strftime(date_format)¶
Format underlying date-time data using specified format.
- Parameters
date_format (str) –
- Returns
New QueryCompiler containing formatted date-time values.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.strftime for more information about parameters and output format.
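A minimal pandas sketch (the format codes follow Python's strftime conventions):

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2021-03-01", "2021-12-31"]))
# Each datetime is rendered as a string using the given format codes.
print(s.dt.strftime("%Y/%m").tolist())  # ['2021/03', '2021/12']
```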
- dt_time()¶
Get time component for each datetime value.
- Returns
New QueryCompiler with the same shape as self, where each element is time component for the corresponding datetime value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.time for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_timetz()¶
Get time component with timezone information for each datetime value.
- Returns
New QueryCompiler with the same shape as self, where each element is time component with timezone information for the corresponding datetime value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.timetz for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_to_period(freq=None)¶
Convert underlying data to the period at a particular frequency.
- Parameters
freq (str, optional) –
- Returns
New QueryCompiler containing period data.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.to_period for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_to_pydatetime()¶
Convert underlying data to an array of python native datetime objects.
- Returns
New QueryCompiler containing a 1D array of datetime objects.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.to_pydatetime for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_to_pytimedelta()¶
Convert underlying data to an array of python native datetime.timedelta objects.
- Returns
New QueryCompiler containing a 1D array of datetime.timedelta objects.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.to_pytimedelta for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_to_timestamp()¶
Get the timestamp representation for each period value.
- Returns
New QueryCompiler with the same shape as self, where each element is the timestamp representation for the corresponding period value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.to_timestamp for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_total_seconds()¶
Get duration in seconds for each interval value.
- Returns
New QueryCompiler with the same shape as self, where each element is duration in seconds for the corresponding interval value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.total_seconds for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
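A minimal pandas sketch of the total_seconds semantics for interval data:

```python
import pandas as pd

s = pd.Series(pd.to_timedelta(["1 days", "2 hours", "90 seconds"]))
# Total duration of each interval expressed in (fractional) seconds.
print(s.dt.total_seconds().tolist())  # [86400.0, 7200.0, 90.0]
```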
- dt_tz()¶
Get the time-zone of the underlying time-series data.
- Returns
QueryCompiler containing a single value, time-zone of the data.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.tz for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_tz_convert(tz)¶
Convert time-series data to the specified time zone.
- Parameters
tz (str, pytz.timezone) –
- Returns
New QueryCompiler containing values with converted time zone.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.tz_convert for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_tz_localize(tz, ambiguous='raise', nonexistent='raise')¶
Localize tz-naive to tz-aware.
- Parameters
tz (str, pytz.timezone, optional) –
ambiguous ({"raise", "infer", "NaT"} or bool mask, default: "raise") –
nonexistent ({"raise", "shift_forward", "shift_backward", "NaT"} or pandas.timedelta, default: "raise") –
- Returns
New QueryCompiler containing values with localized time zone.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.tz_localize for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
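tz_localize and tz_convert compose as in pandas; a minimal sketch (the zone names assume the tz database bundled with pandas):

```python
import pandas as pd

s = pd.Series(pd.to_datetime(["2021-06-01 12:00:00"]))
# tz_localize attaches a zone to naive timestamps; tz_convert then
# translates the aware timestamps into another zone.
utc = s.dt.tz_localize("UTC")
eastern = utc.dt.tz_convert("US/Eastern")
print(eastern.iloc[0].hour)  # 8 (noon UTC is 08:00 EDT in June)
```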
- dt_week()¶
Get week component for each datetime value.
- Returns
New QueryCompiler with the same shape as self, where each element is week component for the corresponding datetime value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.week for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_weekday()¶
Get integer day of week for each datetime value.
- Returns
New QueryCompiler with the same shape as self, where each element is integer day of week for the corresponding datetime value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.weekday for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_weekofyear()¶
Get week of year for each datetime value.
- Returns
New QueryCompiler with the same shape as self, where each element is week of year for the corresponding datetime value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.weekofyear for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_year()¶
Get year component for each datetime value.
- Returns
New QueryCompiler with the same shape as self, where each element is year component for the corresponding datetime value.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.Series.dt.year for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- property dtypes¶
Get columns dtypes.
- Returns
Series with dtypes of each column.
- Return type
pandas.Series
- eq(other, **kwargs)¶
Perform element-wise equality comparison (self == other).
If axes are not equal, perform frames alignment first.
- Parameters
other (BaseQueryCompiler, scalar or array-like) – Other operand of the binary operation.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint that is passed from a high-level API.
level (int or label) – In case of MultiIndex match index values on the passed level.
axis ({0, 1}) – Axis to match indices along for 1D other (list or QueryCompiler that represents Series). 0 is for index, 1 is for columns.
fill_value (float or None) – Value to fill missing elements during frame alignment.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
Result of binary operation.
- Return type
BaseQueryCompiler
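A minimal pandas sketch of the broadcast and alignment behavior described above (Modin mirrors pandas):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [2, 2]})
# A scalar comparison is broadcast to every element.
print(df.eq(2).values.tolist())  # [[False, True], [True, True]]
# A Series compared with axis=1 is aligned against the column labels.
print(df.eq(pd.Series({"a": 1, "b": 2}), axis=1).values.tolist())
# [[True, True], [False, True]]
```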
- eval(expr, **kwargs)¶
Evaluate string expression on QueryCompiler columns.
- Parameters
expr (str) –
**kwargs (dict) –
- Returns
QueryCompiler containing the result of evaluation.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.DataFrame.eval for more information about parameters and output format.
- fillna(**kwargs)¶
Replace NaN values using provided method.
- Parameters
value (scalar or dict) –
method ({"backfill", "bfill", "pad", "ffill", None}) –
axis ({0, 1}) –
inplace ({False}) – This parameter serves the compatibility purpose. Always has to be False.
limit (int, optional) –
downcast (dict, optional) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler with all null values filled.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.DataFrame.fillna for more information about parameters and output format.
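A minimal pandas sketch of value- and method-based filling (the ffill call stands in for method="ffill"):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])
# A scalar value replaces every NaN.
print(s.fillna(0.0).tolist())  # [1.0, 0.0, 0.0, 4.0]
# A forward fill with limit=1 propagates each value at most one position,
# so the second consecutive NaN stays NaN.
print(s.ffill(limit=1).tolist())  # [1.0, 1.0, nan, 4.0]
```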
- abstract finalize()¶
Finalize constructing the dataframe by calling all deferred functions which were used to build it.
- first_valid_index()¶
Return index label of first non-NaN/NULL value.
- Returns
Index label of the first non-NaN/NULL value.
- Return type
scalar
- floordiv(other, **kwargs)¶
Perform element-wise integer division (self // other).
If axes are not equal, perform frames alignment first.
- Parameters
other (BaseQueryCompiler, scalar or array-like) – Other operand of the binary operation.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint that is passed from a high-level API.
level (int or label) – In case of MultiIndex match index values on the passed level.
axis ({0, 1}) – Axis to match indices along for 1D other (list or QueryCompiler that represents Series). 0 is for index, 1 is for columns.
fill_value (float or None) – Value to fill missing elements during frame alignment.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
Result of binary operation.
- Return type
BaseQueryCompiler
- abstract free()¶
Trigger a cleanup of this object.
- abstract classmethod from_arrow(at, data_cls)¶
Build QueryCompiler from Arrow Table.
- Parameters
at (Arrow Table) – The Arrow Table to convert from.
data_cls (type) – BasePandasFrame class (or its descendant) to convert to.
- Returns
QueryCompiler containing data from the Arrow Table.
- Return type
BaseQueryCompiler
- abstract classmethod from_pandas(df, data_cls)¶
Build QueryCompiler from pandas DataFrame.
- Parameters
df (pandas.DataFrame) – The pandas DataFrame to convert from.
data_cls (type) – BasePandasFrame class (or its descendant) to convert to.
- Returns
QueryCompiler containing data from the pandas DataFrame.
- Return type
BaseQueryCompiler
- ge(other, **kwargs)¶
Perform element-wise greater than or equal comparison (self >= other).
If axes are not equal, perform frames alignment first.
- Parameters
other (BaseQueryCompiler, scalar or array-like) – Other operand of the binary operation.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint that is passed from a high-level API.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
Result of binary operation.
- Return type
BaseQueryCompiler
- get_axis(axis)¶
Return index labels of the specified axis.
- Parameters
axis ({0, 1}) – Axis to return labels on. 0 is for index, 1 is for columns.
- Returns
- Return type
pandas.Index
- get_dummies(columns, **kwargs)¶
Convert categorical variables to dummy variables for certain columns.
- Parameters
columns (label or list of such) – Columns to convert.
prefix (str or list of such) –
prefix_sep (str) –
dummy_na (bool) –
drop_first (bool) –
dtype (dtype) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler with categorical variables converted to dummy.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.get_dummies for more information about parameters and output format.
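A minimal pandas sketch (the column and prefix names here are illustrative; Modin's pd.get_dummies takes the same arguments):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"], "n": [1, 2, 3]})
# Only the listed columns are expanded into indicator columns;
# other columns pass through unchanged.
dummies = pd.get_dummies(df, columns=["color"], prefix="c")
print(dummies.columns.tolist())               # ['n', 'c_blue', 'c_red']
print(dummies["c_red"].astype(int).tolist())  # [1, 0, 1]
```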
- get_index_name(axis=0)¶
Get index name of specified axis.
- Parameters
axis ({0, 1}, default: 0) – Axis to get index name on.
- Returns
Index name, None for MultiIndex.
- Return type
hashable
- get_index_names(axis=0)¶
Get index names of specified axis.
- Parameters
axis ({0, 1}, default: 0) – Axis to get index names on.
- Returns
Index names.
- Return type
list
- getitem_array(key)¶
Mask QueryCompiler with key.
- Parameters
key (BaseQueryCompiler, np.ndarray or list of column labels) – Boolean mask represented by QueryCompiler or np.ndarray of the same shape as self, or enumerable of columns to pick.
- Returns
New masked QueryCompiler.
- Return type
BaseQueryCompiler
- getitem_column_array(key, numeric=False)¶
Get column data for target labels.
- Parameters
key (list-like) – Target labels by which to retrieve data.
numeric (bool, default: False) – Whether or not the key passed in represents the numeric index or the named index.
- Returns
New QueryCompiler that contains specified columns.
- Return type
BaseQueryCompiler
- getitem_row_array(key)¶
Get row data for target indices.
- Parameters
key (list-like) – Numeric indices of the rows to pick.
- Returns
New QueryCompiler that contains specified rows.
- Return type
BaseQueryCompiler
- groupby_agg(by, is_multi_by, axis, agg_func, agg_args, agg_kwargs, groupby_kwargs, drop=False)¶
Group QueryCompiler data and apply passed aggregation function.
- Parameters
by (BaseQueryCompiler, column or index label, Grouper or list of such) – Object that determines groups.
is_multi_by (bool) – If by is a QueryCompiler or list of such indicates whether it’s grouping on multiple columns/rows.
axis ({0, 1}) – Axis to group and apply aggregation function along. 0 is for index, 1 is for columns.
agg_func (dict or callable(DataFrameGroupBy) -> DataFrame) – Function to apply to the GroupBy object.
agg_args (dict) – Positional arguments to pass to the agg_func.
agg_kwargs (dict) – Keyword arguments to pass to the agg_func.
groupby_kwargs (dict) – GroupBy parameters as expected by the modin.pandas.DataFrame.groupby signature.
drop (bool, default: False) – If by is a QueryCompiler, indicates whether or not by-data came from self.
- Returns
QueryCompiler containing the result of groupby aggregation.
- Return type
BaseQueryCompiler
Notes
Please refer to modin.pandas.GroupBy.aggregate for more information about parameters and output format.
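A minimal pandas sketch of the dict-style agg_func described above (Modin's frontend forwards the same arguments):

```python
import pandas as pd

df = pd.DataFrame({"g": ["x", "x", "y"], "v": [1, 2, 10]})
# agg_func may be a dict mapping column -> aggregation, as in pandas.
out = df.groupby("g").agg({"v": "sum"})
print(out["v"].tolist())  # [3, 10]
```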
- groupby_all(by, axis, groupby_args, map_args, reduce_args=None, numeric_only=True, drop=False)¶
Group QueryCompiler data and check whether all elements are True for every group.
- Parameters
by (BaseQueryCompiler, column or index label, Grouper or list of such) – Object that determines groups.
axis ({0, 1}) – Axis to group and apply reduction function along. 0 is for index, 1 is for columns.
groupby_args (dict) – GroupBy parameters as expected by the modin.pandas.DataFrame.groupby signature.
map_args (dict) – Keyword arguments to pass to the reduction function. If GroupBy is implemented via the MapReduce approach, this argument is passed at the map phase only.
reduce_args (dict, optional) – If GroupBy is implemented with MapReduce approach, specifies arguments to pass to the reduction function at the reduce phase, has no effect otherwise.
numeric_only (bool, default: True) – Whether or not to drop non-numeric columns before executing GroupBy.
drop (bool, default: False) – If by is a QueryCompiler, indicates whether or not by-data came from self.
- Returns
BaseQueryCompiler – QueryCompiler containing the result of groupby reduction built by the following rules:
Labels on the axis opposite to the specified one are preserved.
If groupby_args[“as_index”] is True then labels on the specified axis are the group names, otherwise labels would be default: 0, 1 … n.
If groupby_args[“as_index”] is False, then first N columns/rows of the frame contain group names, where N is the columns/rows to group on.
Each element of QueryCompiler is the boolean of whether all elements are True for the corresponding group and column/row.
Warning
map_args and reduce_args parameters are deprecated. They leaked here from PandasQueryCompiler.groupby_*: the pandas backend implements groupby via the MapReduce approach, but for other backends these parameters make no sense, so they will be removed in the future.
Notes
Please refer to
modin.pandas.GroupBy.all
for more information about parameters and output format.
- groupby_any(by, axis, groupby_args, map_args, reduce_args=None, numeric_only=True, drop=False)¶
Group QueryCompiler data and check whether any element is True for every group.
- Parameters
by (BaseQueryCompiler, column or index label, Grouper or list of such) – Object that determines groups.
axis ({0, 1}) – Axis to group and apply reduction function along. 0 is for index, 1 is for columns.
groupby_args (dict) – GroupBy parameters as expected by the modin.pandas.DataFrame.groupby signature.
map_args (dict) – Keyword arguments to pass to the reduction function. If GroupBy is implemented via the MapReduce approach, this argument is passed at the map phase only.
reduce_args (dict, optional) – If GroupBy is implemented with MapReduce approach, specifies arguments to pass to the reduction function at the reduce phase, has no effect otherwise.
numeric_only (bool, default: True) – Whether or not to drop non-numeric columns before executing GroupBy.
drop (bool, default: False) – If by is a QueryCompiler indicates whether or not by-data came from the self.
- Returns
BaseQueryCompiler – QueryCompiler containing the result of groupby reduction built by the following rules:
Labels on the axis opposite to the specified one are preserved.
If groupby_args[“as_index”] is True, then labels on the specified axis are the group names; otherwise labels are the default: 0, 1 … n.
If groupby_args[“as_index”] is False, then the first N columns/rows of the frame contain group names, where N is the number of columns/rows to group on.
Each element of QueryCompiler is the boolean of whether there is any element which is True for the corresponding group and column/row.
.. warning – The map_args and reduce_args parameters are deprecated. They leaked here from
PandasQueryCompiler.groupby_*
: the pandas backend implements groupby via a MapReduce approach, but for other backends these parameters make no sense, so they will be removed in the future.
Notes
Please refer to
modin.pandas.GroupBy.any
for more information about parameters and output format.
- groupby_count(by, axis, groupby_args, map_args, reduce_args=None, numeric_only=True, drop=False)¶
Group QueryCompiler data and count non-null values for every group.
- Parameters
by (BaseQueryCompiler, column or index label, Grouper or list of such) – Object that determines groups.
axis ({0, 1}) – Axis to group and apply the reduction function along. 0 is for index, while 1 is for columns.
groupby_args (dict) – GroupBy parameters as expected by the
modin.pandas.DataFrame.groupby
signature.
map_args (dict) – Keyword arguments to pass to the reduction function. If GroupBy is implemented via the MapReduce approach, this argument is passed at the map phase only.
reduce_args (dict, optional) – If GroupBy is implemented with the MapReduce approach, specifies arguments to pass to the reduction function at the reduce phase; has no effect otherwise.
numeric_only (bool, default: True) – Whether or not to drop non-numeric columns before executing GroupBy.
drop (bool, default: False) – If by is a QueryCompiler, indicates whether by-data came from self.
- Returns
BaseQueryCompiler – QueryCompiler containing the result of groupby reduction built by the following rules:
Labels on the axis opposite to the specified one are preserved.
If groupby_args[“as_index”] is True, then labels on the specified axis are the group names; otherwise labels are the default: 0, 1 … n.
If groupby_args[“as_index”] is False, then the first N columns/rows of the frame contain group names, where N is the number of columns/rows to group on.
Each element of QueryCompiler is the number of non-null values for the corresponding group and column/row.
.. warning – The map_args and reduce_args parameters are deprecated. They leaked here from
PandasQueryCompiler.groupby_*
: the pandas backend implements groupby via a MapReduce approach, but for other backends these parameters make no sense, so they will be removed in the future.
Notes
Please refer to
modin.pandas.GroupBy.count
for more information about parameters and output format.
- groupby_max(by, axis, groupby_args, map_args, reduce_args=None, numeric_only=True, drop=False)¶
Group QueryCompiler data and get the maximum value for every group.
- Parameters
by (BaseQueryCompiler, column or index label, Grouper or list of such) – Object that determines groups.
axis ({0, 1}) – Axis to group and apply the reduction function along. 0 is for index, while 1 is for columns.
groupby_args (dict) – GroupBy parameters as expected by the
modin.pandas.DataFrame.groupby
signature.
map_args (dict) – Keyword arguments to pass to the reduction function. If GroupBy is implemented via the MapReduce approach, this argument is passed at the map phase only.
reduce_args (dict, optional) – If GroupBy is implemented with the MapReduce approach, specifies arguments to pass to the reduction function at the reduce phase; has no effect otherwise.
numeric_only (bool, default: True) – Whether or not to drop non-numeric columns before executing GroupBy.
drop (bool, default: False) – If by is a QueryCompiler, indicates whether by-data came from self.
- Returns
BaseQueryCompiler – QueryCompiler containing the result of groupby reduction built by the following rules:
Labels on the axis opposite to the specified one are preserved.
If groupby_args[“as_index”] is True, then labels on the specified axis are the group names; otherwise labels are the default: 0, 1 … n.
If groupby_args[“as_index”] is False, then the first N columns/rows of the frame contain group names, where N is the number of columns/rows to group on.
Each element of QueryCompiler is the maximum value for the corresponding group and column/row.
.. warning – The map_args and reduce_args parameters are deprecated. They leaked here from
PandasQueryCompiler.groupby_*
: the pandas backend implements groupby via a MapReduce approach, but for other backends these parameters make no sense, so they will be removed in the future.
Notes
Please refer to
modin.pandas.GroupBy.max
for more information about parameters and output format.
- groupby_min(by, axis, groupby_args, map_args, reduce_args=None, numeric_only=True, drop=False)¶
Group QueryCompiler data and get the minimum value for every group.
- Parameters
by (BaseQueryCompiler, column or index label, Grouper or list of such) – Object that determines groups.
axis ({0, 1}) – Axis to group and apply the reduction function along. 0 is for index, while 1 is for columns.
groupby_args (dict) – GroupBy parameters as expected by the
modin.pandas.DataFrame.groupby
signature.
map_args (dict) – Keyword arguments to pass to the reduction function. If GroupBy is implemented via the MapReduce approach, this argument is passed at the map phase only.
reduce_args (dict, optional) – If GroupBy is implemented with the MapReduce approach, specifies arguments to pass to the reduction function at the reduce phase; has no effect otherwise.
numeric_only (bool, default: True) – Whether or not to drop non-numeric columns before executing GroupBy.
drop (bool, default: False) – If by is a QueryCompiler, indicates whether by-data came from self.
- Returns
BaseQueryCompiler – QueryCompiler containing the result of groupby reduction built by the following rules:
Labels on the axis opposite to the specified one are preserved.
If groupby_args[“as_index”] is True, then labels on the specified axis are the group names; otherwise labels are the default: 0, 1 … n.
If groupby_args[“as_index”] is False, then the first N columns/rows of the frame contain group names, where N is the number of columns/rows to group on.
Each element of QueryCompiler is the minimum value for the corresponding group and column/row.
.. warning – The map_args and reduce_args parameters are deprecated. They leaked here from
PandasQueryCompiler.groupby_*
: the pandas backend implements groupby via a MapReduce approach, but for other backends these parameters make no sense, so they will be removed in the future.
Notes
Please refer to
modin.pandas.GroupBy.min
for more information about parameters and output format.
- groupby_prod(by, axis, groupby_args, map_args, reduce_args=None, numeric_only=True, drop=False)¶
Group QueryCompiler data and compute product for every group.
- Parameters
by (BaseQueryCompiler, column or index label, Grouper or list of such) – Object that determines groups.
axis ({0, 1}) – Axis to group and apply the reduction function along. 0 is for index, while 1 is for columns.
groupby_args (dict) – GroupBy parameters as expected by the
modin.pandas.DataFrame.groupby
signature.
map_args (dict) – Keyword arguments to pass to the reduction function. If GroupBy is implemented via the MapReduce approach, this argument is passed at the map phase only.
reduce_args (dict, optional) – If GroupBy is implemented with the MapReduce approach, specifies arguments to pass to the reduction function at the reduce phase; has no effect otherwise.
numeric_only (bool, default: True) – Whether or not to drop non-numeric columns before executing GroupBy.
drop (bool, default: False) – If by is a QueryCompiler, indicates whether by-data came from self.
- Returns
BaseQueryCompiler – QueryCompiler containing the result of groupby reduction built by the following rules:
Labels on the axis opposite to the specified one are preserved.
If groupby_args[“as_index”] is True, then labels on the specified axis are the group names; otherwise labels are the default: 0, 1 … n.
If groupby_args[“as_index”] is False, then the first N columns/rows of the frame contain group names, where N is the number of columns/rows to group on.
Each element of QueryCompiler is the product for the corresponding group and column/row.
.. warning – The map_args and reduce_args parameters are deprecated. They leaked here from
PandasQueryCompiler.groupby_*
: the pandas backend implements groupby via a MapReduce approach, but for other backends these parameters make no sense, so they will be removed in the future.
Notes
Please refer to
modin.pandas.GroupBy.prod
for more information about parameters and output format.
- groupby_size(by, axis, groupby_args, map_args, reduce_args=None, numeric_only=True, drop=False)¶
Group QueryCompiler data and get the number of elements for every group.
- Parameters
by (BaseQueryCompiler, column or index label, Grouper or list of such) – Object that determines groups.
axis ({0, 1}) – Axis to group and apply the reduction function along. 0 is for index, while 1 is for columns.
groupby_args (dict) – GroupBy parameters as expected by the
modin.pandas.DataFrame.groupby
signature.
map_args (dict) – Keyword arguments to pass to the reduction function. If GroupBy is implemented via the MapReduce approach, this argument is passed at the map phase only.
reduce_args (dict, optional) – If GroupBy is implemented with the MapReduce approach, specifies arguments to pass to the reduction function at the reduce phase; has no effect otherwise.
numeric_only (bool, default: True) – Whether or not to drop non-numeric columns before executing GroupBy.
drop (bool, default: False) – If by is a QueryCompiler, indicates whether by-data came from self.
- Returns
BaseQueryCompiler – QueryCompiler containing the result of groupby reduction built by the following rules:
Labels on the axis opposite to the specified one are preserved.
If groupby_args[“as_index”] is True, then labels on the specified axis are the group names; otherwise labels are the default: 0, 1 … n.
If groupby_args[“as_index”] is False, then the first N columns/rows of the frame contain group names, where N is the number of columns/rows to group on.
Each element of QueryCompiler is the number of elements for the corresponding group and column/row.
.. warning – The map_args and reduce_args parameters are deprecated. They leaked here from
PandasQueryCompiler.groupby_*
: the pandas backend implements groupby via a MapReduce approach, but for other backends these parameters make no sense, so they will be removed in the future.
Notes
Please refer to
modin.pandas.GroupBy.size
for more information about parameters and output format.
- groupby_sum(by, axis, groupby_args, map_args, reduce_args=None, numeric_only=True, drop=False)¶
Group QueryCompiler data and compute sum for every group.
- Parameters
by (BaseQueryCompiler, column or index label, Grouper or list of such) – Object that determines groups.
axis ({0, 1}) – Axis to group and apply the reduction function along. 0 is for index, while 1 is for columns.
groupby_args (dict) – GroupBy parameters as expected by the
modin.pandas.DataFrame.groupby
signature.
map_args (dict) – Keyword arguments to pass to the reduction function. If GroupBy is implemented via the MapReduce approach, this argument is passed at the map phase only.
reduce_args (dict, optional) – If GroupBy is implemented with the MapReduce approach, specifies arguments to pass to the reduction function at the reduce phase; has no effect otherwise.
numeric_only (bool, default: True) – Whether or not to drop non-numeric columns before executing GroupBy.
drop (bool, default: False) – If by is a QueryCompiler, indicates whether by-data came from self.
- Returns
BaseQueryCompiler – QueryCompiler containing the result of groupby reduction built by the following rules:
Labels on the axis opposite to the specified one are preserved.
If groupby_args[“as_index”] is True, then labels on the specified axis are the group names; otherwise labels are the default: 0, 1 … n.
If groupby_args[“as_index”] is False, then the first N columns/rows of the frame contain group names, where N is the number of columns/rows to group on.
Each element of QueryCompiler is the sum for the corresponding group and column/row.
.. warning – The map_args and reduce_args parameters are deprecated. They leaked here from
PandasQueryCompiler.groupby_*
: the pandas backend implements groupby via a MapReduce approach, but for other backends these parameters make no sense, so they will be removed in the future.
Notes
Please refer to
modin.pandas.GroupBy.sum
for more information about parameters and output format.
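The as_index rule described in the groupby_* entries matches pandas behavior; a short sketch with plain pandas shows how group labels land either on the index or in the leading column:

```python
import pandas as pd

df = pd.DataFrame({"key": ["x", "x", "y"], "val": [1, 2, 3]})

# as_index=True (the default): group names become the index labels
with_index = df.groupby("key", as_index=True).sum()
# as_index=False: group names occupy the first column; index is 0, 1 ... n
flat = df.groupby("key", as_index=False).sum()

print(with_index.index.tolist())  # ['x', 'y']
print(flat["key"].tolist())       # ['x', 'y']
print(flat["val"].tolist())       # [3, 3]
```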
- gt(other, **kwargs)¶
Perform element-wise greater than comparison (
self > other
). If axes are not equal, perform frames alignment first.
- Parameters
other (BaseQueryCompiler, scalar or array-like) – Other operand of the binary operation.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint that is passed from a high-level API.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
Result of binary operation.
- Return type
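The alignment behavior for element-wise comparisons mirrors pandas: when the operands' axes differ, they are first aligned on the union of labels, and positions missing from one operand compare as False. A plain-pandas illustration:

```python
import pandas as pd

s = pd.Series([1, 5, 3], index=["a", "b", "c"])
other = pd.Series([2, 2], index=["a", "b"])

# Axes differ, so the operands are aligned on the union of labels first;
# label "c" is missing from `other`, so that comparison yields False
result = s.gt(other)
print(result.tolist())  # [False, True, False]
```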
- has_multiindex(axis=0)¶
Check if specified axis is indexed by MultiIndex.
- Parameters
axis ({0, 1}, default: 0) – The axis to check (0 - index, 1 - columns).
- Returns
True if index at specified axis is MultiIndex and False otherwise.
- Return type
bool
- idxmax(**kwargs)¶
Get position of the first occurrence of the maximum for each row or column.
- Parameters
axis ({0, 1}) –
skipna (bool) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
One-column QueryCompiler with index labels of the specified axis, where each row contains position of the maximum element for the corresponding row or column.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.idxmax
for more information about parameters and output format.
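In the mirrored pandas API, idxmax reports the index label (rather than a positional offset) of the first maximum along the chosen axis. A minimal plain-pandas sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 9, 9], "b": [7, 2, 3]}, index=["r0", "r1", "r2"])

# For each column, the label of the first occurrence of the maximum
labels = df.idxmax(axis=0)
print(labels.tolist())  # ['r1', 'r0']
```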
- idxmin(**kwargs)¶
Get position of the first occurrence of the minimum for each row or column.
- Parameters
axis ({0, 1}) –
skipna (bool) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
One-column QueryCompiler with index labels of the specified axis, where each row contains position of the minimum element for the corresponding row or column.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.idxmin
for more information about parameters and output format.
- insert(loc, column, value)¶
Insert new column.
- Parameters
loc (int) – Insertion position.
column (label) – Label of the new column.
value (One-column BaseQueryCompiler, 1D array or scalar) – Data to fill new column with.
- Returns
QueryCompiler with new column inserted.
- Return type
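The query-compiler insert returns a new compiler, while the mirrored pandas DataFrame.insert modifies the frame in place; the loc/column/value parameters behave the same. A plain-pandas sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "c": [5, 6]})

# Insert a new column "b" at position 1 (between "a" and "c");
# note pandas' DataFrame.insert operates in place
df.insert(loc=1, column="b", value=[3, 4])
print(df.columns.tolist())  # ['a', 'b', 'c']
```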
- insert_item(axis, loc, value, how='inner', replace=False)¶
Insert rows/columns defined by value at the specified position.
If frames are not aligned along specified axis, perform frames alignment first.
- Parameters
axis ({0, 1}) – Axis to insert along. 0 means insert rows, while 1 means insert columns.
loc (int) – Position to insert value.
value (BaseQueryCompiler) – Rows/columns to insert.
how ({"inner", "outer", "left", "right"}, default: "inner") – Type of join that will be used if frames are not aligned.
replace (bool, default: False) – Whether to insert item after column/row at loc-th position or to replace it by value.
- Returns
New QueryCompiler with inserted values.
- Return type
- invert()¶
Apply bitwise inversion to each element of the QueryCompiler.
- Returns
New QueryCompiler containing the bitwise inversion of each value.
- Return type
- is_monotonic_decreasing()¶
Return whether values in the object are monotonically decreasing.
- Returns
- Return type
bool
- is_monotonic_increasing()¶
Return whether values in the object are monotonically increasing.
- Returns
- Return type
bool
- is_series_like()¶
Check whether this QueryCompiler can represent
modin.pandas.Series
object.
- Returns
Return True if QueryCompiler has a single column or row, False otherwise.
- Return type
bool
- isin(**kwargs)¶
Check for each element of self whether it’s contained in passed values.
- Parameters
values (list-like, modin.pandas.Series, modin.pandas.DataFrame or dict) – Values to check elements of self in.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
Boolean mask for self of whether an element at the corresponding position is contained in values.
- Return type
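The membership check produces an element-wise boolean mask, as in the mirrored pandas API. A plain-pandas sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})

# Boolean mask: element-wise membership in the passed values
mask = df.isin([1, "z"])
print(mask["a"].tolist())  # [True, False, False]
print(mask["b"].tolist())  # [False, False, True]
```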
- isna()¶
Check for each element of self whether it’s NaN.
- Returns
Boolean mask for self of whether an element at the corresponding position is NaN.
- Return type
- join(right, **kwargs)¶
Join columns of another QueryCompiler.
- Parameters
right (BaseQueryCompiler) – QueryCompiler of the right frame to join with.
on (label or list of such) –
how ({"left", "right", "outer", "inner"}) –
lsuffix (str) –
rsuffix (str) –
sort (bool) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
QueryCompiler that contains result of the join.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.join
for more information about parameters and output format.
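The mirrored pandas join aligns on the index and uses the suffix parameters to disambiguate overlapping column names; a plain-pandas sketch:

```python
import pandas as pd

left = pd.DataFrame({"v": [1, 2]}, index=["a", "b"])
right = pd.DataFrame({"v": [10]}, index=["a"])

# Index-aligned left join; suffixes disambiguate the overlapping column "v"
joined = left.join(right, how="left", lsuffix="_l", rsuffix="_r")
print(joined.columns.tolist())  # ['v_l', 'v_r']
print(joined.loc["a", "v_r"])   # 10.0 (label "b" gets NaN in v_r)
```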
- kurt(axis, level=None, numeric_only=None, skipna=True, **kwargs)¶
Get the unbiased kurtosis for each column or row.
- Parameters
axis ({0, 1}) –
level (None, default: None) – Serves the compatibility purpose. Always has to be None.
numeric_only (bool, optional) –
skipna (bool, default: True) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
One-column QueryCompiler with index labels of the specified axis, where each row contains the unbiased kurtosis for the corresponding row or column.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.kurt
for more information about parameters and output format.
- last_valid_index()¶
Return index label of last non-NaN/NULL value.
- Returns
- Return type
scalar
- le(other, **kwargs)¶
Perform element-wise less than or equal comparison (
self <= other
). If axes are not equal, perform frames alignment first.
- Parameters
other (BaseQueryCompiler, scalar or array-like) – Other operand of the binary operation.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint that is passed from a high-level API.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
Result of binary operation.
- Return type
- lt(other, **kwargs)¶
Perform element-wise less than comparison (
self < other
). If axes are not equal, perform frames alignment first.
- Parameters
other (BaseQueryCompiler, scalar or array-like) – Other operand of the binary operation.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint that is passed from a high-level API.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
Result of binary operation.
- Return type
- mad(axis, skipna, level=None)¶
Get the mean absolute deviation for each column or row.
- Parameters
axis ({0, 1}) –
skipna (bool) –
level (None, default: None) – Serves the compatibility purpose. Always has to be None.
- Returns
One-column QueryCompiler with index labels of the specified axis, where each row contains the mean absolute deviation for the corresponding row or column.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.mad
for more information about parameters and output format.
- max(**kwargs)¶
Get the maximum value for each column or row.
- Parameters
axis ({0, 1}) –
level (None, default: None) – Serves the compatibility purpose. Always has to be None.
numeric_only (bool, optional) –
skipna (bool, default: True) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
One-column QueryCompiler with index labels of the specified axis, where each row contains the maximum value for the corresponding row or column.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.max
for more information about parameters and output format.
- mean(**kwargs)¶
Get the mean value for each column or row.
- Parameters
axis ({0, 1}) –
level (None, default: None) – Serves the compatibility purpose. Always has to be None.
numeric_only (bool, optional) –
skipna (bool, default: True) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
One-column QueryCompiler with index labels of the specified axis, where each row contains the mean value for the corresponding row or column.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.mean
for more information about parameters and output format.
- median(**kwargs)¶
Get the median value for each column or row.
- Parameters
axis ({0, 1}) –
level (None, default: None) – Serves the compatibility purpose. Always has to be None.
numeric_only (bool, optional) –
skipna (bool, default: True) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
One-column QueryCompiler with index labels of the specified axis, where each row contains the median value for the corresponding row or column.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.median
for more information about parameters and output format.
- melt(*args, **kwargs)¶
Unpivot QueryCompiler data from wide to long format.
- Parameters
id_vars (list of labels, optional) –
value_vars (list of labels, optional) –
var_name (label) –
value_name (label) –
col_level (int or label) –
ignore_index (bool) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler with unpivoted data.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.melt
for more information about parameters and output format.
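The mirrored pandas melt unpivots each id/value pair into its own row; a plain-pandas sketch of the wide-to-long reshape:

```python
import pandas as pd

wide = pd.DataFrame({"id": [1, 2], "x": [10, 20], "y": [30, 40]})

# Unpivot: one row per (id, variable) pair
longer = wide.melt(id_vars=["id"], value_vars=["x", "y"],
                   var_name="metric", value_name="reading")
print(longer.shape)  # (4, 3)
```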
- memory_usage(**kwargs)¶
Return the memory usage of each column in bytes.
- Parameters
index (bool) –
deep (bool) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
One-column QueryCompiler with index labels of self, where each row contains the memory usage for the corresponding column.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.memory_usage
for more information about parameters and output format.
- merge(right, **kwargs)¶
Merge QueryCompiler objects using a database-style join.
- Parameters
right (BaseQueryCompiler) – QueryCompiler of the right frame to merge with.
how ({"left", "right", "outer", "inner", "cross"}) –
on (label or list of such) –
left_on (label or list of such) –
right_on (label or list of such) –
left_index (bool) –
right_index (bool) –
sort (bool) –
suffixes (list-like) –
copy (bool) –
indicator (bool or str) –
validate (str) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
QueryCompiler that contains result of the merge.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.merge
for more information about parameters and output format.
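A database-style merge, sketched with plain pandas; the indicator parameter adds a provenance column showing which frame each key came from (with how="outer", pandas sorts the join keys):

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b"], "lv": [1, 2]})
right = pd.DataFrame({"key": ["b", "c"], "rv": [3, 4]})

# Outer join on "key"; indicator=True adds a "_merge" provenance column
merged = left.merge(right, on="key", how="outer", indicator=True)
provenance = merged["_merge"].astype(str).tolist()
print(provenance)  # ['left_only', 'both', 'right_only']
```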
- min(**kwargs)¶
Get the minimum value for each column or row.
- Parameters
axis ({0, 1}) –
level (None, default: None) – Serves the compatibility purpose. Always has to be None.
numeric_only (bool, optional) –
skipna (bool, default: True) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
One-column QueryCompiler with index labels of the specified axis, where each row contains the minimum value for the corresponding row or column.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.min
for more information about parameters and output format.
- mod(other, **kwargs)¶
Perform element-wise modulo (
self % other
). If axes are not equal, perform frames alignment first.
- Parameters
other (BaseQueryCompiler, scalar or array-like) – Other operand of the binary operation.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint that is passed from a high-level API.
level (int or label) – In case of MultiIndex match index values on the passed level.
axis ({0, 1}) – Axis to match indices along for 1D other (list or QueryCompiler that represents Series). 0 is for index, while 1 is for columns.
fill_value (float or None) – Value to fill missing elements during frame alignment.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
Result of binary operation.
- Return type
- mode(**kwargs)¶
Get the modes for every column or row.
- Parameters
axis ({0, 1}) –
numeric_only (bool) –
dropna (bool) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler with modes calculated along the given axis.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.mode
for more information about parameters and output format.
- mul(other, **kwargs)¶
Perform element-wise multiplication (
self * other
). If axes are not equal, perform frames alignment first.
- Parameters
other (BaseQueryCompiler, scalar or array-like) – Other operand of the binary operation.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint that is passed from a high-level API.
level (int or label) – In case of MultiIndex match index values on the passed level.
axis ({0, 1}) – Axis to match indices along for 1D other (list or QueryCompiler that represents Series). 0 is for index, while 1 is for columns.
fill_value (float or None) – Value to fill missing elements during frame alignment.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
Result of binary operation.
- Return type
- ne(other, **kwargs)¶
Perform element-wise not equal comparison (
self != other
). If axes are not equal, perform frames alignment first.
- Parameters
other (BaseQueryCompiler, scalar or array-like) – Other operand of the binary operation.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint that is passed from a high-level API.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
Result of binary operation.
- Return type
- negative(**kwargs)¶
Change the sign for every value of self.
- Parameters
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
- Return type
Notes
Be aware, that all QueryCompiler values have to be numeric.
- nlargest(n=5, columns=None, keep='first')¶
Return the first n rows ordered by columns in descending order.
- Parameters
n (int, default: 5) –
columns (list of labels, optional) – Column labels to order by (note: this parameter can be omitted only for a single-column query compiler representing a Series object; otherwise columns has to be specified).
keep ({"first", "last", "all"}, default: "first") –
- Returns
- Return type
Notes
Please refer to
modin.pandas.DataFrame.nlargest
for more information about parameters and output format.
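The mirrored pandas nlargest takes the first n rows ordered by the given columns in descending order; a plain-pandas sketch:

```python
import pandas as pd

df = pd.DataFrame({"score": [3, 1, 4, 1, 5], "name": list("abcde")})

# First 2 rows ordered by "score" descending; keep="first" breaks ties
top2 = df.nlargest(2, columns=["score"])
print(top2["name"].tolist())  # ['e', 'c']
```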
- notna()¶
Check for each element of self whether it’s an existing (non-missing) value.
- Returns
Boolean mask for self of whether an element at the corresponding position is not NaN.
- Return type
- nsmallest(n=5, columns=None, keep='first')¶
Return the first n rows ordered by columns in ascending order.
- Parameters
n (int, default: 5) –
columns (list of labels, optional) – Column labels to order by (note: this parameter can be omitted only for a single-column query compiler representing a Series object; otherwise columns has to be specified).
keep ({"first", "last", "all"}, default: "first") –
- Returns
- Return type
Notes
Please refer to
modin.pandas.DataFrame.nsmallest
for more information about parameters and output format.
- nunique(**kwargs)¶
Get the number of unique values for each column or row.
- Parameters
axis ({0, 1}) –
dropna (bool) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
One-column QueryCompiler with index labels of the specified axis, where each row contains the number of unique values for the corresponding row or column.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.nunique
for more information about parameters and output format.
- pivot(index, columns, values)¶
Produce pivot table based on column values.
- Parameters
index (label or list of such, pandas.Index, optional) –
columns (label or list of such) –
values (label or list of such, optional) –
- Returns
New QueryCompiler containing pivot table.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.pivot
for more information about parameters and output format.
- pivot_table(index, values, columns, aggfunc, fill_value, margins, dropna, margins_name, observed, sort)¶
Create a spreadsheet-style pivot table from underlying data.
- Parameters
index (label, pandas.Grouper, array or list of such) –
values (label, optional) –
columns (column, pandas.Grouper, array or list of such) –
aggfunc (callable(pandas.Series) -> scalar, dict of list of such) –
fill_value (scalar, optional) –
margins (bool) –
dropna (bool) –
margins_name (str) –
observed (bool) –
sort (bool) –
- Returns
- Return type
Notes
Please refer to
modin.pandas.DataFrame.pivot_table
for more information about parameters and output format.
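A spreadsheet-style pivot table, sketched with the mirrored pandas API: one cell per (index, columns) pair, filled by the aggregation function:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["n", "n", "s", "s"],
    "quarter": ["q1", "q2", "q1", "q2"],
    "amount": [10, 20, 30, 40],
})

# One cell per (region, quarter) combination, aggregated by sum
table = sales.pivot_table(index="region", columns="quarter",
                          values="amount", aggfunc="sum")
print(table.loc["s", "q2"])  # 40
```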
- pow(other, **kwargs)¶
Perform element-wise exponential power (
self ** other
). If axes are not equal, perform frames alignment first.
- Parameters
other (BaseQueryCompiler, scalar or array-like) – Other operand of the binary operation.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint that is passed from a high-level API.
level (int or label) – In case of MultiIndex match index values on the passed level.
axis ({0, 1}) – Axis to match indices along for 1D other (list or QueryCompiler that represents Series). 0 is for index, while 1 is for columns.
fill_value (float or None) – Value to fill missing elements during frame alignment.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
Result of binary operation.
- Return type
- prod(**kwargs)¶
Get the product for each column or row.
- Parameters
axis ({0, 1}) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
One-column QueryCompiler with index labels of the specified axis, where each row contains the product for the corresponding row or column.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.prod
for more information about parameters and output format.
- prod_min_count(**kwargs)¶
Get the product for each column or row.
- Parameters
axis ({0, 1}) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
One-column QueryCompiler with index labels of the specified axis, where each row contains the product for the corresponding row or column.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.prod
for more information about parameters and output format.
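A short sketch of both variants using plain pandas, which these methods mirror (the min_count threshold below is an arbitrary illustrative value):

```python
import math
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

col_prod = df.prod(axis=0)  # product down each column
row_prod = df.prod(axis=1)  # product across each row

# min_count: if a row/column has fewer non-NA values than min_count,
# its result is NaN instead of a product.
strict = df["a"].prod(min_count=5)
```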
- quantile_for_list_of_values(**kwargs)¶
Get the value at the given quantile for each column or row.
- Parameters
q (list-like) –
axis ({0, 1}) –
numeric_only (bool) –
interpolation ({"linear", "lower", "higher", "midpoint", "nearest"}) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
One-column QueryCompiler with index labels of the specified axis, where each row contains the value at the given quantile for the corresponding row or column.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.quantile
for more information about parameters and output format.
- quantile_for_single_value(**kwargs)¶
Get the value at the given quantile for each column or row.
- Parameters
q (float) –
axis ({0, 1}) –
numeric_only (bool) –
interpolation ({"linear", "lower", "higher", "midpoint", "nearest"}) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
One-column QueryCompiler with index labels of the specified axis, where each row contains the value at the given quantile for the corresponding row or column.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.quantile
for more information about parameters and output format.
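The two quantile variants correspond to the two call shapes of pandas quantile, sketched here with plain pandas (hypothetical data):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4]})

single = df.quantile(0.5)            # scalar q -> a Series
several = df.quantile([0.25, 0.75])  # list of q -> one row per quantile
```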
- query(expr, **kwargs)¶
Query columns of the QueryCompiler with a boolean expression.
- Parameters
expr (str) –
**kwargs (dict) –
- Returns
New QueryCompiler containing the rows where the boolean expression is satisfied.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.query
for more information about parameters and output format.
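A minimal sketch of the boolean-expression filtering with plain pandas, which this mirrors (column names are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

# Keep only the rows where the boolean expression holds.
filtered = df.query("a > 1 and b < 30")
```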
- rank(**kwargs)¶
Compute numerical rank along the specified axis.
By default, equal values are assigned a rank that is the average of the ranks of those values; this behaviour can be changed via the method parameter.
- Parameters
axis ({0, 1}) –
method ({"average", "min", "max", "first", "dense"}) –
numeric_only (bool) –
na_option ({"keep", "top", "bottom"}) –
ascending (bool) –
pct (bool) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
QueryCompiler of the same shape as self, where each element is the numerical rank of the corresponding value along row or column.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.rank
for more information about parameters and output format.
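The tie-handling behavior of the method parameter can be sketched with plain pandas, which this mirrors:

```python
import pandas as pd

s = pd.Series([7, 3, 3, 10])

avg_rank = s.rank()                  # ties share the average rank
dense_rank = s.rank(method="dense")  # ties share one rank, no gaps
```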
- reindex(axis, labels, **kwargs)¶
Align QueryCompiler data with a new index along specified axis.
- Parameters
axis ({0, 1}) – Axis to align labels along. 0 is for index, 1 is for columns.
labels (list-like) – Index-labels to align with.
method ({None, "backfill"/"bfill", "pad"/"ffill", "nearest"}) – Method to use for filling holes in reindexed frame.
fill_value (scalar) – Value to use for missing values in the resulted frame.
limit (int) –
tolerance (int) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
QueryCompiler with aligned axis.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.reindex
for more information about parameters and output format.
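A short sketch of label alignment with plain pandas, which this mirrors (hypothetical labels):

```python
import pandas as pd

s = pd.Series([10, 20], index=["a", "b"])

# Labels absent from the original index get fill_value.
r = s.reindex(["a", "b", "c"], fill_value=0)
```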
- repeat(repeats)¶
Repeat each element of a one-column QueryCompiler a given number of times.
- Parameters
repeats (int or array of ints) – The number of repetitions for each element. This should be a non-negative integer. Repeating 0 times will return an empty QueryCompiler.
- Returns
New QueryCompiler with repeated elements.
- Return type
Notes
Please refer to
modin.pandas.Series.repeat
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
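Sketched with a plain pandas Series, whose repeat this mirrors:

```python
import pandas as pd

s = pd.Series([1, 2])

doubled = s.repeat(2)  # every element twice
empty = s.repeat(0)    # zero repetitions -> empty result
```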
- replace(**kwargs)¶
Replace values given in to_replace by value.
- Parameters
to_replace (scalar, list-like, regex, modin.pandas.Series, or None) –
value (scalar, list-like, regex or dict) –
inplace ({False}) – This parameter serves the compatibility purpose. Always has to be False.
limit (int or None) –
regex (bool or same types as to_replace) –
method ({"pad", "ffill", "bfill", None}) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
QueryCompiler with all to_replace values replaced by value.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.replace
for more information about parameters and output format.
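A minimal sketch of value replacement with plain pandas, which this mirrors (the sentinel values below are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 999], "b": ["x", "y", "missing"]})

# A dict maps each to_replace value to its replacement.
cleaned = df.replace({999: 0, "missing": "unknown"})
```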
- resample_agg_df(resample_args, func, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and apply passed aggregation function for each group over the specified axis.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
func (str, dict, callable(pandas.Series) -> scalar, or list of such) –
*args (iterable) – Positional arguments to pass to the aggregation function.
**kwargs (dict) – Keyword arguments to pass to the aggregation function.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps).
Labels on the opposite of the specified axis are a MultiIndex, where the first level contains the preserved labels of this axis and the second level is the function names.
Each element of QueryCompiler is the result of corresponding function for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.agg
for more information about parameters and output format.
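The bin-then-aggregate behavior can be sketched with plain pandas, whose resample API this mirrors (hypothetical hourly data):

```python
import pandas as pd

idx = pd.date_range("2021-01-01", periods=4, freq="h")
s = pd.Series([1, 2, 3, 4], index=idx)

# Group the hourly points into 2-hour bins and aggregate each bin;
# passing several functions yields one column per function name.
agg = s.resample("2h").agg(["sum", "max"])
```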
- resample_agg_ser(resample_args, func, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and apply passed aggregation function in a one-column query compiler for each group over the specified axis.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
func (str, dict, callable(pandas.Series) -> scalar, or list of such) –
*args (iterable) – Positional arguments to pass to the aggregation function.
**kwargs (dict) – Keyword arguments to pass to the aggregation function.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps).
Labels on the opposite of the specified axis are a MultiIndex, where the first level contains the preserved labels of this axis and the second level is the function names.
Each element of QueryCompiler is the result of corresponding function for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.agg
for more information about parameters and output format.
Warning
This method duplicates logic of
resample_agg_df
and will be removed soon.
- resample_app_df(resample_args, func, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and apply passed aggregation function for each group over the specified axis.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
func (str, dict, callable(pandas.Series) -> scalar, or list of such) –
*args (iterable) – Positional arguments to pass to the aggregation function.
**kwargs (dict) – Keyword arguments to pass to the aggregation function.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps).
Labels on the opposite of the specified axis are a MultiIndex, where the first level contains the preserved labels of this axis and the second level is the function names.
Each element of QueryCompiler is the result of corresponding function for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.apply
for more information about parameters and output format.
Warning
This method duplicates logic of
resample_agg_df
and will be removed soon.
- resample_app_ser(resample_args, func, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and apply passed aggregation function in a one-column query compiler for each group over the specified axis.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
func (str, dict, callable(pandas.Series) -> scalar, or list of such) –
*args (iterable) – Positional arguments to pass to the aggregation function.
**kwargs (dict) – Keyword arguments to pass to the aggregation function.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps).
Labels on the opposite of the specified axis are a MultiIndex, where the first level contains the preserved labels of this axis and the second level is the function names.
Each element of QueryCompiler is the result of corresponding function for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.apply
for more information about parameters and output format.
Warning
This method duplicates logic of
resample_agg_df
and will be removed soon.
- resample_asfreq(resample_args, fill_value)¶
Resample time-series data and get the values at the new frequency.
Group data into intervals by time-series row/column with a specified frequency and get values at the new frequency.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
fill_value (scalar) –
- Returns
New QueryCompiler containing values at the specified frequency.
- Return type
- resample_backfill(resample_args, limit)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and fill missing values in each group independently using back-fill method.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
limit (int) –
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
QueryCompiler contains upsampled data with missing values filled.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.backfill
for more information about parameters and output format.
- resample_bfill(resample_args, limit)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and fill missing values in each group independently using back-fill method.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
limit (int) –
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
QueryCompiler contains upsampled data with missing values filled.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.bfill
for more information about parameters and output format.
- resample_count(resample_args)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute number of non-NA values for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps).
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the number of non-NA values for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.count
for more information about parameters and output format.
- resample_ffill(resample_args, limit)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and fill missing values in each group independently using forward-fill method.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
limit (int) –
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
QueryCompiler contains upsampled data with missing values filled.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.ffill
for more information about parameters and output format.
- resample_fillna(resample_args, method, limit)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and fill missing values in each group independently using specified method.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
method (str) –
limit (int) –
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
QueryCompiler contains upsampled data with missing values filled.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.fillna
for more information about parameters and output format.
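The fill-style resamplers above (backfill/bfill, ffill, pad, nearest, fillna) all close the gaps that upsampling creates; a short sketch with plain pandas, which these mirror:

```python
import pandas as pd

idx = pd.date_range("2021-01-01", periods=3, freq="2h")
s = pd.Series([1.0, 2.0, 3.0], index=idx)

# Upsampling to an hourly grid leaves gaps ...
gaps = s.resample("h").asfreq()
# ... which the fill-style resamplers close; ffill propagates the
# last observed value forward into each gap.
filled = s.resample("h").ffill()
```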
- resample_first(resample_args, _method, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute first element for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
_method (str) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps).
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the first element for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.first
for more information about parameters and output format.
- resample_get_group(resample_args, name, obj)¶
Resample time-series data and get the specified group.
Group data into intervals by time-series row/column with a specified frequency and get the values of the specified group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
name (object) –
obj (modin.pandas.DataFrame, optional) –
- Returns
New QueryCompiler containing the values from the specified group.
- Return type
- resample_interpolate(resample_args, method, axis, limit, inplace, limit_direction, limit_area, downcast, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and fill missing values in each group independently using specified interpolation method.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
method (str) –
axis ({0, 1}) –
limit (int) –
inplace ({False}) – This parameter serves the compatibility purpose. Always has to be False.
limit_direction ({"forward", "backward", "both"}) –
limit_area ({None, "inside", "outside"}) –
downcast (str, optional) –
**kwargs (dict) –
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
QueryCompiler contains upsampled data with missing values filled.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.interpolate
for more information about parameters and output format.
- resample_last(resample_args, _method, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute last element for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
_method (str) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps).
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the last element for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.last
for more information about parameters and output format.
- resample_max(resample_args, _method, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute maximum value for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
_method (str) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps).
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the maximum value for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.max
for more information about parameters and output format.
- resample_mean(resample_args, _method, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute mean value for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
_method (str) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps).
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the mean value for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.mean
for more information about parameters and output format.
- resample_median(resample_args, _method, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute median value for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
_method (str) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps).
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the median value for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.median
for more information about parameters and output format.
- resample_min(resample_args, _method, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute minimum value for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
_method (str) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps).
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the minimum value for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.min
for more information about parameters and output format.
- resample_nearest(resample_args, limit)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and fill missing values in each group independently using ‘nearest’ method.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
limit (int) –
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
QueryCompiler contains upsampled data with missing values filled.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.nearest
for more information about parameters and output format.
- resample_nunique(resample_args, _method, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute number of unique values for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
_method (str) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps).
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the number of unique values for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.nunique
for more information about parameters and output format.
- resample_ohlc_df(resample_args, _method, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute open, high, low and close values for each group over the specified axis.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
_method (str) –
*args (iterable) – Positional arguments to pass to the aggregation function.
**kwargs (dict) – Keyword arguments to pass to the aggregation function.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps).
Labels on the opposite of the specified axis are a MultiIndex, where the first level contains the preserved labels of this axis and the second level is the labels of columns containing the computed values.
Each element of QueryCompiler is the result of corresponding function for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.ohlc
for more information about parameters and output format.
- resample_ohlc_ser(resample_args, _method, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute open, high, low and close values for each group over the specified axis.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
_method (str) –
*args (iterable) – Positional arguments to pass to the aggregation function.
**kwargs (dict) – Keyword arguments to pass to the aggregation function.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps).
Labels on the opposite of the specified axis are a MultiIndex, where the first level contains the preserved labels of this axis and the second level is the labels of columns containing the computed values.
Each element of QueryCompiler is the result of corresponding function for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.ohlc
for more information about parameters and output format.
- resample_pad(resample_args, limit)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and fill missing values in each group independently using ‘pad’ method.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
limit (int) –
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
QueryCompiler contains upsampled data with missing values filled.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.pad
for more information about parameters and output format.
- resample_pipe(resample_args, func, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency, build an equivalent
pandas.Resampler
object, and apply the passed function to it.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
func (callable(pandas.Resampler) -> object or tuple(callable, str)) –
*args (iterable) – Positional arguments to pass to function.
**kwargs (dict) – Keyword arguments to pass to function.
- Returns
New QueryCompiler containing the result of passed function.
- Return type
Notes
Please refer to
modin.pandas.Resampler.pipe
for more information about parameters and output format.
- resample_prod(resample_args, _method, min_count, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute product for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
_method (str) –
min_count (int) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps).
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the product for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.prod
for more information about parameters and output format.
- resample_quantile(resample_args, q, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute quantile for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
q (float) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps).
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the quantile for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.quantile
for more information about parameters and output format.
- resample_sem(resample_args, ddof=1, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute the standard error of the mean for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
ddof (int, default: 1) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps).
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the standard error of the mean for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.sem
for more information about parameters and output format.
- resample_size(resample_args, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute number of elements in a group for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps).
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the number of elements in a group for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.size
for more information about parameters and output format.
- resample_std(resample_args, ddof, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute the standard deviation for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
ddof (int) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps).
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the standard deviation for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.std
for more information about parameters and output format.
- resample_sum(resample_args, _method, min_count, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute sum for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
_method (str) –
min_count (int) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps).
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the sum for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.sum
for more information about parameters and output format.
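The min_count parameter controls how many valid values a group needs before a sum is produced; a short pandas sketch of the behavior Modin mirrors:

```python
import pandas as pd
import numpy as np

idx = pd.date_range("2023-01-01", periods=4, freq="D")
s = pd.Series([1.0, np.nan, 3.0, 4.0], index=idx)

# With min_count=2, a bin with fewer than 2 valid values yields NaN
result = s.resample("2D").sum(min_count=2)
# First bin [1.0, NaN] -> NaN; second bin [3.0, 4.0] -> 7.0
```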
- resample_transform(resample_args, arg, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and call the passed function on each group. In contrast to
resample_app_df
, the function is applied to the whole group rather than to a single axis.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
arg (callable(pandas.DataFrame) -> pandas.Series) –
*args (iterable) – Positional arguments to pass to function.
**kwargs (dict) – Keyword arguments to pass to function.
- Returns
New QueryCompiler containing the result of passed function.
- Return type
- resample_var(resample_args, ddof, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute variance for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
ddof (int) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps)
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the variance for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.var
for more information about parameters and output format.
- reset_index(**kwargs)¶
Reset the index, or a level of it.
- Parameters
drop (bool) – Whether to drop the reset index or insert it at the beginning of the frame.
level (int or label, optional) – Level to remove from index. Removes all levels by default.
col_level (int or label) – If the columns have multiple levels, determines which level the labels are inserted into.
col_fill (label) – If the columns have multiple levels, determines how the other levels are named.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
QueryCompiler with reset index.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.reset_index
for more information about parameters and output format.
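The drop parameter decides whether the old index survives as a column; a minimal pandas sketch of the behavior Modin mirrors:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]}, index=pd.Index(["x", "y"], name="k"))

kept = df.reset_index()              # old index "k" becomes the first column
dropped = df.reset_index(drop=True)  # old index is discarded entirely
```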
- rfloordiv(other, **kwargs)¶
Perform element-wise integer division (
other // self
). If axes are not equal, perform frame alignment first.
- Parameters
other (BaseQueryCompiler, scalar or array-like) – Other operand of the binary operation.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint that is passed from a high-level API.
level (int or label) – In case of MultiIndex match index values on the passed level.
axis ({{0, 1}}) – Axis to match indices along for 1D other (list or QueryCompiler that represents Series). 0 is for index, 1 is for columns.
fill_value (float or None) – Value to fill missing elements during frame alignment.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
Result of binary operation.
- Return type
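The r-prefixed binary operations reverse the operand order; a small pandas sketch of the semantics Modin mirrors:

```python
import pandas as pd

s = pd.Series([2, 3, 4])
# rfloordiv computes other // self, not self // other
reflected = s.rfloordiv(10)   # 10 // 2, 10 // 3, 10 // 4
forward = s.floordiv(10)      # 2 // 10, 3 // 10, 4 // 10
```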
- rmod(other, **kwargs)¶
Perform element-wise modulo (
other % self
). If axes are not equal, perform frame alignment first.
- Parameters
other (BaseQueryCompiler, scalar or array-like) – Other operand of the binary operation.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint that is passed from a high-level API.
level (int or label) – In case of MultiIndex match index values on the passed level.
axis ({{0, 1}}) – Axis to match indices along for 1D other (list or QueryCompiler that represents Series). 0 is for index, 1 is for columns.
fill_value (float or None) – Value to fill missing elements during frame alignment.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
Result of binary operation.
- Return type
- rolling_aggregate(rolling_args, func, *args, **kwargs)¶
Create rolling window and apply specified functions for each window.
- Parameters
rolling_args (list) – Rolling window arguments with the same signature as
modin.pandas.DataFrame.rolling.
func (str, dict, callable(pandas.Series) -> scalar, or list of such) –
*args (iterable) –
**kwargs (dict) –
- Returns
New QueryCompiler containing the result of passed functions for each window, built by the following rules:
Labels on the specified axis are preserved.
Labels on the opposite of the specified axis are a MultiIndex, where the first level contains the preserved labels of this axis and the second level has the function names.
Each element of QueryCompiler is the result of corresponding function for the corresponding window and column/row.
- Return type
Notes
Please refer to
modin.pandas.Rolling.aggregate
for more information about parameters and output format.
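The MultiIndex rule above can be seen with plain pandas (whose rolling aggregation Modin mirrors):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4]})
out = df.rolling(window=2).aggregate(["sum", "mean"])
# Columns become a MultiIndex: ("a", "sum"), ("a", "mean");
# row labels are preserved, with NaN where the window is incomplete
```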
- rolling_apply(rolling_args, func, raw=False, engine=None, engine_kwargs=None, args=None, kwargs=None)¶
Create rolling window and apply specified function for each window.
- Parameters
rolling_args (list) – Rolling window arguments with the same signature as
modin.pandas.DataFrame.rolling.
func (callable(pandas.Series) -> scalar) –
raw (bool, default: False) –
engine (None, default: None) – This parameter serves the compatibility purpose. Always has to be None.
engine_kwargs (None, default: None) – This parameter serves the compatibility purpose. Always has to be None.
args (tuple, optional) –
kwargs (dict, optional) –
- Returns
New QueryCompiler containing the result of passed function for each window, built by the following rules:
Labels on the specified axis are preserved.
Labels on the opposite of the specified axis are a MultiIndex, where the first level contains the preserved labels of this axis and the second level has the function names.
Each element of QueryCompiler is the result of corresponding function for the corresponding window and column/row.
- Return type
Notes
Please refer to
modin.pandas.Rolling.apply
for more information about parameters and output format.
Warning
This method duplicates logic of
rolling_aggregate
and will be removed soon.
- rolling_corr(rolling_args, other=None, pairwise=None, *args, **kwargs)¶
Create rolling window and compute correlation for each window.
- Parameters
rolling_args (list) – Rolling window arguments with the same signature as
modin.pandas.DataFrame.rolling.
other (modin.pandas.Series, modin.pandas.DataFrame, list-like, optional) –
pairwise (bool, optional) –
*args (iterable) –
**kwargs (dict) –
- Returns
New QueryCompiler containing correlation for each window, built by the following rules:
Output QueryCompiler has the same shape and axes labels as the source.
Each element is the correlation for the corresponding window.
- Return type
Notes
Please refer to
modin.pandas.Rolling.corr
for more information about parameters and output format.
- rolling_count(rolling_args)¶
Create rolling window and compute number of non-NA values for each window.
- Parameters
rolling_args (list) – Rolling window arguments with the same signature as
modin.pandas.DataFrame.rolling.
- Returns
New QueryCompiler containing the number of non-NA values for each window, built by the following rules:
Output QueryCompiler has the same shape and axes labels as the source.
Each element is the number of non-NA values for the corresponding window.
- Return type
Notes
Please refer to
modin.pandas.Rolling.count
for more information about parameters and output format.
- rolling_cov(rolling_args, other=None, pairwise=None, ddof=1, **kwargs)¶
Create rolling window and compute covariance for each window.
- Parameters
rolling_args (list) – Rolling window arguments with the same signature as
modin.pandas.DataFrame.rolling.
other (modin.pandas.Series, modin.pandas.DataFrame, list-like, optional) –
pairwise (bool, optional) –
ddof (int, default: 1) –
**kwargs (dict) –
- Returns
New QueryCompiler containing covariance for each window, built by the following rules:
Output QueryCompiler has the same shape and axes labels as the source.
Each element is the covariance for the corresponding window.
- Return type
Notes
Please refer to
modin.pandas.Rolling.cov
for more information about parameters and output format.
- rolling_kurt(rolling_args, **kwargs)¶
Create rolling window and compute unbiased kurtosis for each window.
- Parameters
rolling_args (list) – Rolling window arguments with the same signature as
modin.pandas.DataFrame.rolling.
**kwargs (dict) –
- Returns
New QueryCompiler containing unbiased kurtosis for each window, built by the following rules:
Output QueryCompiler has the same shape and axes labels as the source.
Each element is the unbiased kurtosis for the corresponding window.
- Return type
Notes
Please refer to
modin.pandas.Rolling.kurt
for more information about parameters and output format.
- rolling_max(rolling_args, *args, **kwargs)¶
Create rolling window and compute maximum value for each window.
- Parameters
rolling_args (list) – Rolling window arguments with the same signature as
modin.pandas.DataFrame.rolling.
*args (iterable) –
**kwargs (dict) –
- Returns
New QueryCompiler containing maximum value for each window, built by the following rules:
Output QueryCompiler has the same shape and axes labels as the source.
Each element is the maximum value for the corresponding window.
- Return type
Notes
Please refer to
modin.pandas.Rolling.max
for more information about parameters and output format.
- rolling_mean(rolling_args, *args, **kwargs)¶
Create rolling window and compute mean value for each window.
- Parameters
rolling_args (list) – Rolling window arguments with the same signature as
modin.pandas.DataFrame.rolling.
*args (iterable) –
**kwargs (dict) –
- Returns
New QueryCompiler containing mean value for each window, built by the following rules:
Output QueryCompiler has the same shape and axes labels as the source.
Each element is the mean value for the corresponding window.
- Return type
Notes
Please refer to
modin.pandas.Rolling.mean
for more information about parameters and output format.
- rolling_median(rolling_args, **kwargs)¶
Create rolling window and compute median value for each window.
- Parameters
rolling_args (list) – Rolling window arguments with the same signature as
modin.pandas.DataFrame.rolling.
**kwargs (dict) –
- Returns
New QueryCompiler containing median value for each window, built by the following rules:
Output QueryCompiler has the same shape and axes labels as the source.
Each element is the median value for the corresponding window.
- Return type
Notes
Please refer to
modin.pandas.Rolling.median
for more information about parameters and output format.
- rolling_min(rolling_args, *args, **kwargs)¶
Create rolling window and compute minimum value for each window.
- Parameters
rolling_args (list) – Rolling window arguments with the same signature as
modin.pandas.DataFrame.rolling.
*args (iterable) –
**kwargs (dict) –
- Returns
New QueryCompiler containing minimum value for each window, built by the following rules:
Output QueryCompiler has the same shape and axes labels as the source.
Each element is the minimum value for the corresponding window.
- Return type
Notes
Please refer to
modin.pandas.Rolling.min
for more information about parameters and output format.
- rolling_quantile(rolling_args, quantile, interpolation='linear', **kwargs)¶
Create rolling window and compute quantile for each window.
- Parameters
rolling_args (list) – Rolling window arguments with the same signature as
modin.pandas.DataFrame.rolling.
quantile (float) –
interpolation ({'linear', 'lower', 'higher', 'midpoint', 'nearest'}, default: 'linear') –
**kwargs (dict) –
- Returns
New QueryCompiler containing quantile for each window, built by the following rules:
Output QueryCompiler has the same shape and axes labels as the source.
Each element is the quantile for the corresponding window.
- Return type
Notes
Please refer to
modin.pandas.Rolling.quantile
for more information about parameters and output format.
- rolling_skew(rolling_args, **kwargs)¶
Create rolling window and compute unbiased skewness for each window.
- Parameters
rolling_args (list) – Rolling window arguments with the same signature as
modin.pandas.DataFrame.rolling.
**kwargs (dict) –
- Returns
New QueryCompiler containing unbiased skewness for each window, built by the following rules:
Output QueryCompiler has the same shape and axes labels as the source.
Each element is the unbiased skewness for the corresponding window.
- Return type
Notes
Please refer to
modin.pandas.Rolling.skew
for more information about parameters and output format.
- rolling_std(rolling_args, ddof=1, *args, **kwargs)¶
Create rolling window and compute standard deviation for each window.
- Parameters
rolling_args (list) – Rolling window arguments with the same signature as
modin.pandas.DataFrame.rolling.
ddof (int, default: 1) –
*args (iterable) –
**kwargs (dict) –
- Returns
New QueryCompiler containing standard deviation for each window, built by the following rules:
Output QueryCompiler has the same shape and axes labels as the source.
Each element is the standard deviation for the corresponding window.
- Return type
Notes
Please refer to
modin.pandas.Rolling.std
for more information about parameters and output format.
- rolling_sum(rolling_args, *args, **kwargs)¶
Create rolling window and compute sum for each window.
- Parameters
rolling_args (list) – Rolling window arguments with the same signature as
modin.pandas.DataFrame.rolling.
*args (iterable) –
**kwargs (dict) –
- Returns
New QueryCompiler containing sum for each window, built by the following rules:
Output QueryCompiler has the same shape and axes labels as the source.
Each element is the sum for the corresponding window.
- Return type
Notes
Please refer to
modin.pandas.Rolling.sum
for more information about parameters and output format.
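The "same shape and axes labels" rule means rolling results are never shorter than the input; a small pandas sketch of the behavior Modin mirrors:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])
# Output has the same length as the input; min_periods=1 avoids leading NaNs
r = s.rolling(window=2, min_periods=1).sum()
```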
- rolling_var(rolling_args, ddof=1, *args, **kwargs)¶
Create rolling window and compute variance for each window.
- Parameters
rolling_args (list) – Rolling window arguments with the same signature as
modin.pandas.DataFrame.rolling.
ddof (int, default: 1) –
*args (iterable) –
**kwargs (dict) –
- Returns
New QueryCompiler containing variance for each window, built by the following rules:
Output QueryCompiler has the same shape and axes labels as the source.
Each element is the variance for the corresponding window.
- Return type
Notes
Please refer to
modin.pandas.Rolling.var
for more information about parameters and output format.
- round(**kwargs)¶
Round every numeric value to the specified number of decimals.
- Parameters
decimals (int or list-like) – Number of decimals to round each column to.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
QueryCompiler with rounded values.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.round
for more information about parameters and output format.
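The decimals argument may be a single int or a per-column mapping; a brief pandas sketch of the behavior Modin mirrors:

```python
import pandas as pd

df = pd.DataFrame({"a": [1.234, 5.678], "b": [0.555, 0.444]})
uniform = df.round(1)         # every column rounded to 1 decimal
per_col = df.round({"a": 2})  # dict: round only column "a", to 2 decimals
```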
- rpow(other, **kwargs)¶
Perform element-wise exponential power (
other ** self
). If axes are not equal, perform frame alignment first.
- Parameters
other (BaseQueryCompiler, scalar or array-like) – Other operand of the binary operation.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint that is passed from a high-level API.
level (int or label) – In case of MultiIndex match index values on the passed level.
axis ({{0, 1}}) – Axis to match indices along for 1D other (list or QueryCompiler that represents Series). 0 is for index, 1 is for columns.
fill_value (float or None) – Value to fill missing elements during frame alignment.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
Result of binary operation.
- Return type
- rsub(other, **kwargs)¶
Perform element-wise subtraction (
other - self
). If axes are not equal, perform frame alignment first.
- Parameters
other (BaseQueryCompiler, scalar or array-like) – Other operand of the binary operation.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint that is passed from a high-level API.
level (int or label) – In case of MultiIndex match index values on the passed level.
axis ({{0, 1}}) – Axis to match indices along for 1D other (list or QueryCompiler that represents Series). 0 is for index, 1 is for columns.
fill_value (float or None) – Value to fill missing elements during frame alignment.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
Result of binary operation.
- Return type
- rtruediv(other, **kwargs)¶
Perform element-wise division (
other / self
). If axes are not equal, perform frame alignment first.
- Parameters
other (BaseQueryCompiler, scalar or array-like) – Other operand of the binary operation.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint that is passed from a high-level API.
level (int or label) – In case of MultiIndex match index values on the passed level.
axis ({{0, 1}}) – Axis to match indices along for 1D other (list or QueryCompiler that represents Series). 0 is for index, 1 is for columns.
fill_value (float or None) – Value to fill missing elements during frame alignment.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
Result of binary operation.
- Return type
- searchsorted(**kwargs)¶
Find positions in a sorted self where value should be inserted to maintain order.
- Parameters
value (list-like) –
side ({"left", "right"}) –
sorter (list-like, optional) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
One-column QueryCompiler which contains indices to insert.
- Return type
Notes
Please refer to
modin.pandas.Series.searchsorted
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
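The side parameter decides where ties land; a short pandas sketch of the semantics Modin mirrors:

```python
import pandas as pd

s = pd.Series([1, 3, 3, 5])  # must already be sorted
left = s.searchsorted(3, side="left")    # first position where 3 fits
right = s.searchsorted(3, side="right")  # position after the existing 3s
```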
- sem(**kwargs)¶
Get the standard error of the mean for each column or row.
- Parameters
axis ({{0, 1}}) –
level (None, default: None) – Serves the compatibility purpose. Always has to be None.
numeric_only (bool, optional) –
skipna (bool, default: True) –
ddof (int) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
One-column QueryCompiler with index labels of the specified axis, where each row contains the standard error of the mean for the corresponding row or column.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.sem
for more information about parameters and output format.
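The standard error of the mean is the sample standard deviation divided by the square root of the number of observations; a quick pandas sanity check of the relation Modin mirrors:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
sem = s.sem()                              # standard error of the mean
manual = s.std(ddof=1) / len(s) ** 0.5     # std / sqrt(n)
```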
- series_update(other, **kwargs)¶
Update values of self using values of other at the corresponding indices.
- Parameters
other (BaseQueryCompiler) – One-column query compiler with updated values.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler with updated values.
- Return type
Notes
Please refer to
modin.pandas.Series.update
for more information about parameters and output format.
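A notable detail of update is that NA values in other never overwrite existing values; a minimal pandas sketch of the behavior Modin mirrors:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])
other = pd.Series([10.0, None], index=[0, 1])
s.update(other)  # in place; the NaN entry in `other` is skipped
# s is now [10.0, 2.0, 3.0]
```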
- series_view(**kwargs)¶
Reinterpret underlying data with new dtype.
- Parameters
dtype (dtype) – Data type to reinterpret underlying data with.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler of the same data in memory, with reinterpreted values.
- Return type
Notes
Be aware that if this method falls back to pandas, the newly created QueryCompiler will be a copy of the original data.
Please refer to
modin.pandas.Series.view
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- set_index_from_columns(keys: List[Hashable], drop: bool = True, append: bool = False)¶
Create new row labels from a list of columns.
- Parameters
keys (list of hashable) – The list of column names that will become the new index.
drop (bool, default: True) – Whether or not to drop the columns provided in the keys argument.
append (bool, default: False) – Whether or not to add the columns in keys as new levels appended to the existing index.
- Returns
A new QueryCompiler with updated index.
- Return type
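The drop flag controls whether the promoted columns also remain as data columns; a small pandas sketch of the behavior Modin mirrors:

```python
import pandas as pd

df = pd.DataFrame({"k": ["x", "y"], "a": [1, 2]})
as_index = df.set_index("k")          # "k" becomes the row labels
kept = df.set_index("k", drop=False)  # drop=False keeps "k" as a column too
```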
- set_index_name(name, axis=0)¶
Set index name for the specified axis.
- Parameters
name (hashable) – New index name.
axis ({0, 1}, default: 0) – Axis to set name along.
- set_index_names(names, axis=0)¶
Set index names for the specified axis.
- Parameters
names (list) – New index names.
axis ({0, 1}, default: 0) – Axis to set names along.
- setitem(axis, key, value)¶
Set the row/column defined by key to the value provided.
- Parameters
axis ({0, 1}) – Axis to set value along. 0 means set row, 1 means set column.
key (label) – Row/column label to set value in.
value (BaseQueryCompiler, list-like or scalar) – The new row/column value.
- Returns
New QueryCompiler with updated key value.
- Return type
- skew(**kwargs)¶
Get the unbiased skew for each column or row.
- Parameters
axis ({{0, 1}}) –
level (None, default: None) – Serves the compatibility purpose. Always has to be None.
numeric_only (bool, optional) –
skipna (bool, default: True) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
One-column QueryCompiler with index labels of the specified axis, where each row contains the unbiased skew for the corresponding row or column.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.skew
for more information about parameters and output format.
- sort_columns_by_row_values(rows, ascending=True, **kwargs)¶
Reorder the columns based on the lexicographic order of the given rows.
- Parameters
rows (label or list of labels) – The row or rows to sort by.
ascending (bool, default: True) – Sort in ascending order (True) or descending order (False).
kind ({"quicksort", "mergesort", "heapsort"}) –
na_position ({"first", "last"}) –
ignore_index (bool) –
key (callable(pandas.Index) -> pandas.Index, optional) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler that contains result of the sort.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.sort_values
for more information about parameters and output format.
- sort_index(**kwargs)¶
Sort data by index or column labels.
- Parameters
axis ({0, 1}) –
level (int, label or list of such) –
ascending (bool) –
inplace (bool) –
kind ({"quicksort", "mergesort", "heapsort"}) –
na_position ({"first", "last"}) –
sort_remaining (bool) –
ignore_index (bool) –
key (callable(pandas.Index) -> pandas.Index, optional) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the data sorted by columns or indices.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.sort_index
for more information about parameters and output format.
- sort_rows_by_column_values(columns, ascending=True, **kwargs)¶
Reorder the rows based on the lexicographic order of the given columns.
- Parameters
columns (label or list of labels) – The column or columns to sort by.
ascending (bool, default: True) – Sort in ascending order (True) or descending order (False).
kind ({"quicksort", "mergesort", "heapsort"}) –
na_position ({"first", "last"}) –
ignore_index (bool) –
key (callable(pandas.Index) -> pandas.Index, optional) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler that contains result of the sort.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.sort_values
for more information about parameters and output format.
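Both sort_rows_by_column_values and sort_columns_by_row_values correspond to pandas sort_values, differing only in the axis; a short pandas sketch:

```python
import pandas as pd

df = pd.DataFrame({"b": [3, 1], "a": [2, 4]})
by_rows = df.sort_values("b")           # reorder rows by column "b"
by_cols = df.sort_values(by=0, axis=1)  # reorder columns by the values in row 0
```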
- stack(level, dropna)¶
Stack the prescribed level(s) from columns to index.
- Parameters
level (int or label) –
dropna (bool) –
- Returns
- Return type
Notes
Please refer to
modin.pandas.DataFrame.stack
for more information about parameters and output format.
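Stacking moves column labels into the innermost index level; a minimal pandas sketch of the behavior Modin mirrors:

```python
import pandas as pd

df = pd.DataFrame({"x": [1], "y": [2]})
s = df.stack()  # columns move into the innermost index level
# s has a MultiIndex: (0, "x") -> 1, (0, "y") -> 2
```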
- std(**kwargs)¶
Get the standard deviation for each column or row.
- Parameters
axis ({{0, 1}}) –
level (None, default: None) – Serves the compatibility purpose. Always has to be None.
numeric_only (bool, optional) –
skipna (bool, default: True) –
ddof (int) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
One-column QueryCompiler with index labels of the specified axis, where each row contains the standard deviation for the corresponding row or column.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.std
for more information about parameters and output format.
- str___getitem__(key)¶
Apply “__getitem__” function to each string value in QueryCompiler.
- Parameters
key (object) –
- Returns
New QueryCompiler containing the result of execution of the “__getitem__” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.__getitem__
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_capitalize()¶
Apply “capitalize” function to each string value in QueryCompiler.
- Returns
New QueryCompiler containing the result of execution of the “capitalize” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.capitalize
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_center(width, fillchar=' ')¶
Apply “center” function to each string value in QueryCompiler.
- Parameters
width (int) –
fillchar (str, default: ' ') –
- Returns
New QueryCompiler containing the result of execution of the “center” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.center
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_contains(pat, case=True, flags=0, na=nan, regex=True)¶
Apply “contains” function to each string value in QueryCompiler.
- Parameters
pat (str) –
case (bool, default: True) –
flags (int, default: 0) –
na (object, default: np.NaN) –
regex (bool, default: True) –
- Returns
New QueryCompiler containing the result of execution of the “contains” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.contains
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
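The case and na parameters are the ones most often needed in practice; a small pandas sketch of the behavior Modin mirrors:

```python
import pandas as pd

s = pd.Series(["apple", "Banana", None])
mask = s.str.contains("an", case=False)
# "apple" -> False, "Banana" -> True, None propagates as NaN by default
```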
- str_count(pat, flags=0, **kwargs)¶
Apply “count” function to each string value in QueryCompiler.
- Parameters
pat (str) –
flags (int, default: 0) –
**kwargs (dict) –
- Returns
New QueryCompiler containing the result of execution of the “count” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.count
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_endswith(pat, na=nan)¶
Apply “endswith” function to each string value in QueryCompiler.
- Parameters
pat (str) –
na (object, default: np.NaN) –
- Returns
New QueryCompiler containing the result of execution of the “endswith” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.endswith
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_find(sub, start=0, end=None)¶
Apply “find” function to each string value in QueryCompiler.
- Parameters
sub (str) –
start (int, default: 0) –
end (int, optional) –
- Returns
New QueryCompiler containing the result of execution of the “find” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.find
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_findall(pat, flags=0, **kwargs)¶
Apply “findall” function to each string value in QueryCompiler.
- Parameters
pat (str) –
flags (int, default: 0) –
**kwargs (dict) –
- Returns
New QueryCompiler containing the result of execution of the “findall” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.findall
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_get(i)¶
Apply “get” function to each string value in QueryCompiler.
- Parameters
i (int) –
- Returns
New QueryCompiler containing the result of execution of the “get” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.get
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_index(sub, start=0, end=None)¶
Apply “index” function to each string value in QueryCompiler.
- Parameters
sub (str) –
start (int, default: 0) –
end (int, optional) –
- Returns
New QueryCompiler containing the result of execution of the “index” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.index
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_isalnum()¶
Apply “isalnum” function to each string value in QueryCompiler.
- Returns
New QueryCompiler containing the result of execution of the “isalnum” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.isalnum
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_isalpha()¶
Apply “isalpha” function to each string value in QueryCompiler.
- Returns
New QueryCompiler containing the result of execution of the “isalpha” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.isalpha
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_isdecimal()¶
Apply “isdecimal” function to each string value in QueryCompiler.
- Returns
New QueryCompiler containing the result of execution of the “isdecimal” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.isdecimal
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_isdigit()¶
Apply “isdigit” function to each string value in QueryCompiler.
- Returns
New QueryCompiler containing the result of execution of the “isdigit” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.isdigit
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_islower()¶
Apply “islower” function to each string value in QueryCompiler.
- Returns
New QueryCompiler containing the result of execution of the “islower” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.islower
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_isnumeric()¶
Apply “isnumeric” function to each string value in QueryCompiler.
- Returns
New QueryCompiler containing the result of execution of the “isnumeric” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.isnumeric
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_isspace()¶
Apply “isspace” function to each string value in QueryCompiler.
- Returns
New QueryCompiler containing the result of execution of the “isspace” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.isspace
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_istitle()¶
Apply “istitle” function to each string value in QueryCompiler.
- Returns
New QueryCompiler containing the result of execution of the “istitle” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.istitle
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_isupper()¶
Apply “isupper” function to each string value in QueryCompiler.
- Returns
New QueryCompiler containing the result of execution of the “isupper” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.isupper
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
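The `str_is*` family mirrors `pandas.Series.str.is*` predicates element-wise. A minimal sketch of the `isupper` semantics using plain pandas (not the query compiler API itself, which is internal):

```python
import pandas as pd

s = pd.Series(["ABC", "abc", "A1", ""])
# Like Python's str.isupper: True when all cased characters are uppercase
# and at least one cased character exists; empty strings report False.
print(s.str.isupper().tolist())  # [True, False, True, False]
```

The other predicates (`isalpha`, `isdigit`, `istitle`, …) follow the same per-element pattern.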
- str_join(sep)¶
Apply “join” function to each string value in QueryCompiler.
- Parameters
sep (str) –
- Returns
New QueryCompiler containing the result of execution of the “join” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.join
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_len()¶
Apply “len” function to each string value in QueryCompiler.
- Returns
New QueryCompiler containing the result of execution of the “len” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.len
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_ljust(width, fillchar=' ')¶
Apply “ljust” function to each string value in QueryCompiler.
- Parameters
width (int) –
fillchar (str, default: ' ') –
- Returns
New QueryCompiler containing the result of execution of the “ljust” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.ljust
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_lower()¶
Apply “lower” function to each string value in QueryCompiler.
- Returns
New QueryCompiler containing the result of execution of the “lower” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.lower
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_lstrip(to_strip=None)¶
Apply “lstrip” function to each string value in QueryCompiler.
- Parameters
to_strip (str, optional) –
- Returns
New QueryCompiler containing the result of execution of the “lstrip” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.lstrip
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_match(pat, case=True, flags=0, na=nan)¶
Apply “match” function to each string value in QueryCompiler.
- Parameters
pat (str) –
case (bool, default: True) –
flags (int, default: 0) –
na (object, default: np.NaN) –
- Returns
New QueryCompiler containing the result of execution of the “match” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.match
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_normalize(form)¶
Apply “normalize” function to each string value in QueryCompiler.
- Parameters
form ({'NFC', 'NFKC', 'NFD', 'NFKD'}) –
- Returns
New QueryCompiler containing the result of execution of the “normalize” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.normalize
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_pad(width, side='left', fillchar=' ')¶
Apply “pad” function to each string value in QueryCompiler.
- Parameters
width (int) –
side ({'left', 'right', 'both'}, default: 'left') –
fillchar (str, default: ' ') –
- Returns
New QueryCompiler containing the result of execution of the “pad” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.pad
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_partition(sep=' ', expand=True)¶
Apply “partition” function to each string value in QueryCompiler.
- Parameters
sep (str, default: ' ') –
expand (bool, default: True) –
- Returns
New QueryCompiler containing the result of execution of the “partition” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.partition
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_repeat(repeats)¶
Apply “repeat” function to each string value in QueryCompiler.
- Parameters
repeats (int) –
- Returns
New QueryCompiler containing the result of execution of the “repeat” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.repeat
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_replace(pat, repl, n=- 1, case=None, flags=0, regex=True)¶
Apply “replace” function to each string value in QueryCompiler.
- Parameters
pat (str) –
repl (str or callable) –
n (int, default: -1) –
case (bool, optional) –
flags (int, default: 0) –
regex (bool, default: True) –
- Returns
New QueryCompiler containing the result of execution of the “replace” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.replace
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
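Since the method defers to `pandas.Series.str.replace`, its semantics can be sketched with pandas directly; note how `regex` switches between literal and pattern replacement, and `repl` may be a callable:

```python
import pandas as pd

s = pd.Series(["foo-bar", "baz-qux"])
# Literal replacement (regex=False) swaps each "-" for "_".
literal = s.str.replace("-", "_", regex=False)
# Regex replacement with a callable repl upper-cases the first token.
pattern = s.str.replace(r"^\w+", lambda m: m.group(0).upper(), regex=True)
print(literal.tolist())  # ['foo_bar', 'baz_qux']
print(pattern.tolist())  # ['FOO-bar', 'BAZ-qux']
```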
- str_rfind(sub, start=0, end=None)¶
Apply “rfind” function to each string value in QueryCompiler.
- Parameters
sub (str) –
start (int, default: 0) –
end (int, optional) –
- Returns
New QueryCompiler containing the result of execution of the “rfind” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.rfind
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_rindex(sub, start=0, end=None)¶
Apply “rindex” function to each string value in QueryCompiler.
- Parameters
sub (str) –
start (int, default: 0) –
end (int, optional) –
- Returns
New QueryCompiler containing the result of execution of the “rindex” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.rindex
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_rjust(width, fillchar=' ')¶
Apply “rjust” function to each string value in QueryCompiler.
- Parameters
width (int) –
fillchar (str, default: ' ') –
- Returns
New QueryCompiler containing the result of execution of the “rjust” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.rjust
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_rpartition(sep=' ', expand=True)¶
Apply “rpartition” function to each string value in QueryCompiler.
- Parameters
sep (str, default: ' ') –
expand (bool, default: True) –
- Returns
New QueryCompiler containing the result of execution of the “rpartition” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.rpartition
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_rsplit(pat=None, n=- 1, expand=False)¶
Apply “rsplit” function to each string value in QueryCompiler.
- Parameters
pat (str, optional) –
n (int, default: -1) –
expand (bool, default: False) –
- Returns
New QueryCompiler containing the result of execution of the “rsplit” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.rsplit
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_rstrip(to_strip=None)¶
Apply “rstrip” function to each string value in QueryCompiler.
- Parameters
to_strip (str, optional) –
- Returns
New QueryCompiler containing the result of execution of the “rstrip” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.rstrip
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_slice(start=None, stop=None, step=None)¶
Apply “slice” function to each string value in QueryCompiler.
- Parameters
start (int, optional) –
stop (int, optional) –
step (int, optional) –
- Returns
New QueryCompiler containing the result of execution of the “slice” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.slice
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_slice_replace(start=None, stop=None, repl=None)¶
Apply “slice_replace” function to each string value in QueryCompiler.
- Parameters
start (int, optional) –
stop (int, optional) –
repl (str or callable, optional) –
- Returns
New QueryCompiler containing the result of execution of the “slice_replace” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.slice_replace
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_split(pat=None, n=- 1, expand=False)¶
Apply “split” function to each string value in QueryCompiler.
- Parameters
pat (str, optional) –
n (int, default: -1) –
expand (bool, default: False) –
- Returns
New QueryCompiler containing the result of execution of the “split” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.split
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
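The `expand` parameter is what lets the result have more columns than the one-column input; a plain-pandas sketch of the `split` semantics the query compiler mirrors:

```python
import pandas as pd

s = pd.Series(["a,b,c", "x,y"])
# expand=False keeps the pieces as lists in a single column ...
as_lists = s.str.split(",")
# ... while expand=True spreads them over multiple columns;
# shorter rows are padded with None.
as_frame = s.str.split(",", expand=True)
print(as_lists[0])     # ['a', 'b', 'c']
print(as_frame.shape)  # (2, 3)
```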
- str_startswith(pat, na=nan)¶
Apply “startswith” function to each string value in QueryCompiler.
- Parameters
pat (str) –
na (object, default: np.NaN) –
- Returns
New QueryCompiler containing the result of execution of the “startswith” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.startswith
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_strip(to_strip=None)¶
Apply “strip” function to each string value in QueryCompiler.
- Parameters
to_strip (str, optional) –
- Returns
New QueryCompiler containing the result of execution of the “strip” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.strip
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
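The `to_strip` parameter selects which characters are removed; a short plain-pandas sketch of the `strip` semantics:

```python
import pandas as pd

s = pd.Series(["  hi  ", "xxhixx"])
# With to_strip=None, surrounding whitespace is removed; passing a
# string strips any of its characters from both ends instead.
print(s.str.strip().tolist())     # ['hi', 'xxhixx']
print(s.str.strip("x").tolist())  # ['  hi  ', 'hi']
```

`str_lstrip` and `str_rstrip` behave the same way but act on one end only.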
- str_swapcase()¶
Apply “swapcase” function to each string value in QueryCompiler.
- Returns
New QueryCompiler containing the result of execution of the “swapcase” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.swapcase
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_title()¶
Apply “title” function to each string value in QueryCompiler.
- Returns
New QueryCompiler containing the result of execution of the “title” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.title
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_translate(table)¶
Apply “translate” function to each string value in QueryCompiler.
- Parameters
table (dict) –
- Returns
New QueryCompiler containing the result of execution of the “translate” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.translate
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_upper()¶
Apply “upper” function to each string value in QueryCompiler.
- Returns
New QueryCompiler containing the result of execution of the “upper” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.upper
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_wrap(width, **kwargs)¶
Apply “wrap” function to each string value in QueryCompiler.
- Parameters
width (int) –
**kwargs (dict) –
- Returns
New QueryCompiler containing the result of execution of the “wrap” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.wrap
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- str_zfill(width)¶
Apply “zfill” function to each string value in QueryCompiler.
- Parameters
width (int) –
- Returns
New QueryCompiler containing the result of execution of the “zfill” function against each string element.
- Return type
Notes
Please refer to
modin.pandas.Series.str.zfill
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- sub(other, **kwargs)¶
Perform element-wise subtraction (
self - other
).
If axes are not equal, perform frame alignment first.
- Parameters
other (BaseQueryCompiler, scalar or array-like) – Other operand of the binary operation.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint that is passed from a high-level API.
level (int or label) – In case of MultiIndex match index values on the passed level.
axis ({0, 1}) – Axis to match indices along for 1D other (a list or a QueryCompiler that represents a Series). 0 is for index, 1 is for columns.
fill_value (float or None) – Value to fill missing elements during frame alignment.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
Result of binary operation.
- Return type
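The alignment behavior described above matches `pandas.DataFrame.sub`; a plain-pandas sketch (not the query compiler API) of how unequal axes interact with `fill_value`:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]}, index=[0, 1])
b = pd.DataFrame({"x": [10, 20]}, index=[1, 2])
# Unequal indices are aligned first; labels present on only one side
# produce NaN unless fill_value supplies a default for the missing operand.
print(a.sub(b)["x"].tolist())                 # [nan, -8.0, nan]
print(a.sub(b, fill_value=0)["x"].tolist())   # [1.0, -8.0, -20.0]
```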
- sum(**kwargs)¶
Get the sum for each column or row.
- Parameters
axis ({0, 1}) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
One-column QueryCompiler with index labels of the specified axis, where each row contains the sum for the corresponding row or column.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.sum
for more information about parameters and output format.
- sum_min_count(**kwargs)¶
Get the sum for each column or row, returning NA when fewer than min_count non-NA values are present.
- Parameters
axis ({0, 1}) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
One-column QueryCompiler with index labels of the specified axis, where each row contains the sum for the corresponding row or column.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.sum
for more information about parameters and output format.
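The difference between the two sum variants follows pandas' `min_count` semantics; a plain-pandas sketch:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan])
# With the default min_count=0 the lone non-NA value still sums to 1.0;
# requiring at least 2 non-NA values turns the result into NaN.
print(s.sum())             # 1.0
print(s.sum(min_count=2))  # nan
```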
- to_datetime(*args, **kwargs)¶
Convert columns of the QueryCompiler to the datetime dtype.
- Parameters
*args (iterable) –
**kwargs (dict) –
- Returns
QueryCompiler with all columns converted to datetime dtype.
- Return type
Notes
Please refer to
modin.pandas.to_datetime
for more information about parameters and output format.
- to_numeric(*args, **kwargs)¶
Convert underlying data to numeric dtype.
- Parameters
errors ({"ignore", "raise", "coerce"}) –
downcast ({"integer", "signed", "unsigned", "float", None}) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler with converted to numeric values.
- Return type
Notes
Please refer to
modin.pandas.to_numeric
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- to_numpy(**kwargs)¶
Convert the underlying query compiler's data to a NumPy array.
- Parameters
dtype (dtype) – The dtype of the resulting array.
copy (bool) – Whether to ensure that the returned value is not a view on another array.
na_value (object) – The value to replace missing values with.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
The QueryCompiler converted to NumPy array.
- Return type
np.ndarray
- abstract to_pandas()¶
Convert the underlying query compiler's data to
pandas.DataFrame
.
- Returns
The QueryCompiler converted to pandas.
- Return type
- transpose(*args, **kwargs)¶
Transpose this QueryCompiler.
- Parameters
copy (bool) – Whether to copy the data after transposing.
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
Transposed new QueryCompiler.
- Return type
- truediv(other, **kwargs)¶
Perform element-wise division (
self / other
).
If axes are not equal, perform frame alignment first.
- Parameters
other (BaseQueryCompiler, scalar or array-like) – Other operand of the binary operation.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint that is passed from a high-level API.
level (int or label) – In case of MultiIndex match index values on the passed level.
axis ({0, 1}) – Axis to match indices along for 1D other (a list or a QueryCompiler that represents a Series). 0 is for index, 1 is for columns.
fill_value (float or None) – Value to fill missing elements during frame alignment.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
Result of binary operation.
- Return type
- unique(**kwargs)¶
Get unique values of self.
- Parameters
**kwargs (dict) – Serves compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler with unique values.
- Return type
Notes
Please refer to
modin.pandas.Series.unique
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- unstack(level, fill_value)¶
Pivot a level of the (necessarily hierarchical) index labels.
- Parameters
level (int or label) –
fill_value (scalar or dict) –
- Returns
- Return type
Notes
Please refer to
modin.pandas.DataFrame.unstack
for more information about parameters and output format.
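A plain-pandas sketch of the unstack semantics: the chosen index level moves into the columns, and holes created by missing label combinations are filled with fill_value:

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples([("a", 1), ("a", 2), ("b", 1)])
s = pd.Series([10, 20, 30], index=idx)
# Pivot the innermost index level into the columns; the absent
# ("b", 2) combination is filled with fill_value.
wide = s.unstack(level=-1, fill_value=0)
print(wide.loc["b", 2])  # 0
```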
- var(**kwargs)¶
Get the variance for each column or row.
- Parameters
axis ({0, 1}) –
level (None, default: None) – Serves the compatibility purpose. Always has to be None.
numeric_only (bool, optional) –
skipna (bool, default: True) –
ddof (int) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
One-column QueryCompiler with index labels of the specified axis, where each row contains the variance for the corresponding row or column.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.var
for more information about parameters and output format.
- view(index=None, columns=None)¶
Mask QueryCompiler with passed keys.
- Parameters
index (list of ints, optional) – Positional indices of rows to grab.
columns (list of ints, optional) – Positional indices of columns to grab.
- Returns
New masked QueryCompiler.
- Return type
- where(cond, other, **kwargs)¶
Update values of self using values from other at positions where cond is False.
- Parameters
cond (BaseQueryCompiler) – Boolean mask. True - keep the self value, False - replace by other value.
other (BaseQueryCompiler or pandas.Series) – Object to grab replacement values from.
axis ({0, 1}) – Axis to align frames along if axes of self, cond and other are not equal. 0 is for index, 1 is for columns.
level (int or label, optional) – Level of MultiIndex to align frames along if axes of self, cond and other are not equal. Currently level parameter is not implemented, so only None value is acceptable.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
QueryCompiler with updated data.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.where
for more information about parameters and output format.
- window_mean(window_args, *args, **kwargs)¶
Create window of the specified type and compute mean for each window.
- Parameters
window_args (list) – Rolling window arguments with the same signature as
modin.pandas.DataFrame.rolling
.
*args (iterable) –
**kwargs (dict) –
- Returns
New QueryCompiler containing mean for each window, built by the following rules:
Output QueryCompiler has the same shape and axes labels as the source.
Each element is the mean for the corresponding window.
- Return type
Notes
Please refer to
modin.pandas.Rolling.mean
for more information about parameters and output format.
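The shape-preserving rule above matches pandas rolling windows; a plain-pandas sketch of the window-mean semantics:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
# A rolling window of size 2: the output keeps the source shape, and
# positions without a full window yield NaN.
means = s.rolling(window=2).mean()
print(means.tolist())  # [nan, 1.5, 2.5, 3.5]
```

`window_std`, `window_sum`, and `window_var` follow the same pattern with their respective aggregations.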
- window_std(window_args, ddof=1, *args, **kwargs)¶
Create window of the specified type and compute standard deviation for each window.
- Parameters
window_args (list) – Rolling window arguments with the same signature as
modin.pandas.DataFrame.rolling
.
ddof (int, default: 1) –
*args (iterable) –
**kwargs (dict) –
- Returns
New QueryCompiler containing standard deviation for each window, built by the following rules:
Output QueryCompiler has the same shape and axes labels as the source.
Each element is the standard deviation for the corresponding window.
- Return type
Notes
Please refer to
modin.pandas.Rolling.std
for more information about parameters and output format.
- window_sum(window_args, *args, **kwargs)¶
Create window of the specified type and compute sum for each window.
- Parameters
window_args (list) – Rolling window arguments with the same signature as
modin.pandas.DataFrame.rolling
.
*args (iterable) –
**kwargs (dict) –
- Returns
New QueryCompiler containing sum for each window, built by the following rules:
Output QueryCompiler has the same shape and axes labels as the source.
Each element is the sum for the corresponding window.
- Return type
Notes
Please refer to
modin.pandas.Rolling.sum
for more information about parameters and output format.
- window_var(window_args, ddof=1, *args, **kwargs)¶
Create window of the specified type and compute variance for each window.
- Parameters
window_args (list) – Rolling window arguments with the same signature as
modin.pandas.DataFrame.rolling
.
ddof (int, default: 1) –
*args (iterable) –
**kwargs (dict) –
- Returns
New QueryCompiler containing variance for each window, built by the following rules:
Output QueryCompiler has the same shape and axes labels as the source.
Each element is the variance for the corresponding window.
- Return type
Notes
Please refer to
modin.pandas.Rolling.var
for more information about parameters and output format.
- write_items(row_numeric_index, col_numeric_index, broadcasted_items)¶
Update QueryCompiler elements at the specified positions by passed values.
In contrast to
setitem
, this method allows 2D assignments.
- Parameters
row_numeric_index (list of ints) – Row positions to write value.
col_numeric_index (list of ints) – Column positions to write value.
broadcasted_items (2D-array) – Values to write. Must be the same size as defined by row_numeric_index and col_numeric_index.
- Returns
New QueryCompiler with updated values.
- Return type
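At the pandas level, the 2D positional assignment that write_items performs corresponds to an `iloc` assignment with row and column position lists; a plain-pandas sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.zeros((3, 3)))
# A 2D positional assignment: the broadcasted block must match the
# shape implied by the row and column positions (here 2x2).
df.iloc[[0, 2], [0, 1]] = np.array([[1, 2], [3, 4]])
print(df.iloc[2, 1])  # 4.0
```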
Pandas backend¶
Pandas Query Compiler¶
PandasQueryCompiler
is responsible for compiling
a set of known predefined functions and pairing those with dataframe algebra operators in the
PandasFrame, specifically for dataframes backed by
pandas.DataFrame
objects.
Each PandasQueryCompiler
contains an instance of
PandasFrame
which it queries to get the result.
PandasQueryCompiler
supports methods built by the function module.
If you want to add an implementation for a query compiler method, visit the function module documentation
to see whether the new operation fits one of the existing function templates and can be easily implemented
with them.
PandasQueryCompiler
implements common query compilers API
defined by the BaseQueryCompiler
. Some functionality
is inherited from the base class; the following section presents only the overridden methods.
- class modin.backends.pandas.query_compiler.PandasQueryCompiler(modin_frame)¶
Query compiler for the pandas backend.
This class translates the common query compiler API into DataFrame Algebra queries that are executed by
PandasFrame
.- Parameters
modin_frame (PandasFrame) – Modin Frame to query with the compiled queries.
- abs(*args, **kwargs)¶
Execute Map function against passed query compiler.
- add(other, broadcast=False, *args, **kwargs)¶
Apply binary func to passed operands.
- Parameters
query_compiler (QueryCompiler) – Left operand of func.
other (QueryCompiler, list-like object or scalar) – Right operand of func.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint that is passed from a high-level API.
*args (args,) – Arguments that will be passed to func.
**kwargs (kwargs,) – Arguments that will be passed to func.
- Returns
Result of binary function.
- Return type
QueryCompiler
- add_prefix(prefix, axis=1)¶
Add string prefix to the index labels along specified axis.
- Parameters
prefix (str) – The string to add before each label.
axis ({0, 1}, default: 1) – Axis to add prefix along. 0 is for index and 1 is for columns.
- Returns
New query compiler with updated labels.
- Return type
- add_suffix(suffix, axis=1)¶
Add string suffix to the index labels along specified axis.
- Parameters
suffix (str) – The string to add after each label.
axis ({0, 1}, default: 1) – Axis to add suffix along. 0 is for index and 1 is for columns.
- Returns
New query compiler with updated labels.
- Return type
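A plain-pandas sketch of the prefix/suffix relabeling the two methods above perform (for a DataFrame the default target is the columns):

```python
import pandas as pd

df = pd.DataFrame({"a": [1], "b": [2]})
# add_prefix prepends to each column label, add_suffix appends.
print(df.add_prefix("col_").columns.tolist())  # ['col_a', 'col_b']
print(df.add_suffix("_x").columns.tolist())    # ['a_x', 'b_x']
```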
- all(*args, **kwargs)¶
Execute MapReduce function against passed query compiler.
- any(*args, **kwargs)¶
Execute MapReduce function against passed query compiler.
- apply(func, axis, *args, **kwargs)¶
Apply passed function across given axis.
- Parameters
func (callable(pandas.Series) -> scalar, str, list or dict of such) – The function to apply to each column or row.
axis ({0, 1}) – Target axis to apply the function along. 0 is for index, 1 is for columns.
*args (iterable) – Positional arguments to pass to func.
**kwargs (dict) – Keyword arguments to pass to func.
- Returns
QueryCompiler that contains the results of execution and is built by the following rules:
Labels of the specified axis are the passed function names.
Labels of the opposite axis are preserved.
Each element is the result of execution of func against corresponding row/column.
- Return type
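The axis convention above matches `pandas.DataFrame.apply`; a plain-pandas sketch:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
# axis=0 applies func down each column, axis=1 across each row.
col_sums = df.apply(sum, axis=0)
row_sums = df.apply(sum, axis=1)
print(col_sums.tolist())  # [3, 7]
print(row_sums.tolist())  # [4, 6]
```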
- applymap(*args, **kwargs)¶
Execute Map function against passed query compiler.
- astype(col_dtypes, **kwargs)¶
Convert columns dtypes to given dtypes.
- Parameters
col_dtypes (dict) – Map for column names and new dtypes.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler with updated dtypes.
- Return type
- cat_codes()¶
Convert the underlying categorical data into its codes.
- Returns
New QueryCompiler containing the integer codes of the underlying categories.
- Return type
Notes
Please refer to
modin.pandas.Series.cat.codes
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- clip(lower, upper, **kwargs)¶
Trim values at input threshold.
- Parameters
lower (float or list-like) –
upper (float or list-like) –
axis ({0, 1}) –
inplace ({False}) – This parameter serves the compatibility purpose. Always has to be False.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
QueryCompiler with values limited by the specified thresholds.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.clip
for more information about parameters and output format.
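The trimming behavior follows `pandas.DataFrame.clip`; a minimal plain-pandas sketch:

```python
import pandas as pd

s = pd.Series([-5, 0, 5, 10])
# Values outside [0, 6] are trimmed to the nearest threshold.
print(s.clip(lower=0, upper=6).tolist())  # [0, 0, 5, 6]
```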
- columnarize()¶
Transpose this QueryCompiler if it has a single row but multiple columns.
This method should be called for QueryCompilers representing a Series object, i.e.
self.is_series_like()
should be True.
- Returns
Transposed new QueryCompiler or self.
- Return type
- combine(other, broadcast=False, *args, **kwargs)¶
Apply binary func to passed operands.
- Parameters
query_compiler (QueryCompiler) – Left operand of func.
other (QueryCompiler, list-like object or scalar) – Right operand of func.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint that is passed from a high-level API.
*args (args,) – Arguments that will be passed to func.
**kwargs (kwargs,) – Arguments that will be passed to func.
- Returns
Result of binary function.
- Return type
QueryCompiler
- combine_first(other, broadcast=False, *args, **kwargs)¶
Apply binary func to passed operands.
- Parameters
query_compiler (QueryCompiler) – Left operand of func.
other (QueryCompiler, list-like object or scalar) – Right operand of func.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint that is passed from a high-level API.
*args (args,) – Arguments that will be passed to func.
**kwargs (kwargs,) – Arguments that will be passed to func.
- Returns
Result of binary function.
- Return type
QueryCompiler
- compare(other, **kwargs)¶
Compare data of two QueryCompilers and highlight the difference.
- Parameters
other (BaseQueryCompiler) – Query compiler to compare with. Must have the same shape and labeling as self.
align_axis ({0, 1}) –
keep_shape (bool) –
keep_equal (bool) –
- Returns
New QueryCompiler containing the differences between self and passed query compiler.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.compare
for more information about parameters and output format.
- concat(axis, other, **kwargs)¶
Concatenate self with passed query compilers along specified axis.
- Parameters
axis ({0, 1}) – Axis to concatenate along. 0 is for index and 1 is for columns.
other (BaseQueryCompiler or list of such) – Objects to concatenate with self.
join ({'outer', 'inner', 'right', 'left'}, default: 'outer') – Type of join that will be used if indices on the other axis are different. (note: if specified, has to be passed as
join=value
).
ignore_index (bool, default: False) – If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, …, n - 1. (note: if specified, has to be passed as
ignore_index=value
).
sort (bool, default: False) – Whether or not to sort the non-concatenation axis. (note: if specified, has to be passed as
sort=value
).
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
Concatenated objects.
- Return type
BaseQueryCompiler
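At the high-level API, this method backs modin.pandas.concat. Since Modin mirrors pandas semantics, the axis/ignore_index behavior can be sketched in plain pandas (an illustrative example, not Modin’s internal code path):

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3, 4]})

# axis=0 concatenates along the index; ignore_index=True relabels 0..n-1.
stacked = pd.concat([a, b], axis=0, ignore_index=True)
```

With ignore_index=False (the default), the original index labels 0, 1, 0, 1 would be kept instead.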
- conj(*args, **kwargs)¶
Execute Map function against passed query compiler.
- copy()¶
Make a copy of this object.
- Returns
Copy of self.
- Return type
BaseQueryCompiler
Notes
For copy, we don’t want a situation where we modify the metadata of the copies if we end up modifying something here. We copy all of the metadata to prevent that.
- corr(method='pearson', min_periods=1)¶
Compute pairwise correlation of columns, excluding NA/null values.
- Parameters
method ({'pearson', 'kendall', 'spearman'} or callable(pandas.Series, pandas.Series) -> pandas.Series) – Correlation method.
min_periods (int) – Minimum number of observations required per pair of columns to have a valid result. If fewer than min_periods non-NA values are present the result will be NA.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
Correlation matrix.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.corr
for more information about parameters and output format.
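Because Modin follows pandas semantics here, the behavior can be sketched with a plain-pandas example (illustrative data; not Modin’s internal implementation):

```python
import pandas as pd

# Two perfectly linearly related columns have a Pearson correlation of 1.0.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [2, 4, 6, 8]})
matrix = df.corr(method="pearson", min_periods=1)
```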
- count(*args, **kwargs)¶
Execute MapReduce function against passed query compiler.
- cov(min_periods=None)¶
Compute pairwise covariance of columns, excluding NA/null values.
- Parameters
min_periods (int) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
Covariance matrix.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.cov
for more information about parameters and output format.
- cummax(*args, **kwargs)¶
Execute Fold function against passed query compiler.
- cummin(*args, **kwargs)¶
Execute Fold function against passed query compiler.
- cumprod(*args, **kwargs)¶
Execute Fold function against passed query compiler.
- cumsum(*args, **kwargs)¶
Execute Fold function against passed query compiler.
- default_to_pandas(pandas_op, *args, **kwargs)¶
Do fallback to pandas for the passed function.
- Parameters
pandas_op (callable(pandas.DataFrame) -> object) – Function to apply to the frame after casting it to pandas.
*args (iterable) – Positional arguments to pass to pandas_op.
**kwargs (dict) – Key-value arguments to pass to pandas_op.
- Returns
The result of the pandas_op, converted back to
BaseQueryCompiler
.
- Return type
BaseQueryCompiler
- describe(**kwargs)¶
Generate descriptive statistics.
- Parameters
percentiles (list-like) –
include ("all" or list of dtypes, optional) –
exclude (list of dtypes, optional) –
datetime_is_numeric (bool) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
QueryCompiler object containing the descriptive statistics of the underlying data.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.describe
for more information about parameters and output format.
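As a plain-pandas sketch of the semantics this method backs (Modin mirrors pandas here; the data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3, 4]})
# The result is indexed by statistic name: count, mean, std, min, 50%, max.
stats = df.describe(percentiles=[0.5])
```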
- df_update(other, broadcast=False, *args, **kwargs)¶
Apply binary func to passed operands.
- Parameters
query_compiler (QueryCompiler) – Left operand of func.
other (QueryCompiler, list-like object or scalar) – Right operand of func.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently; however, we can’t distinguish them at the query compiler level, so this parameter is a hint passed from the high-level API.
*args (args,) – Arguments that will be passed to func.
**kwargs (kwargs,) – Arguments that will be passed to func.
- Returns
Result of binary function.
- Return type
QueryCompiler
- diff(*args, **kwargs)¶
Execute Fold function against passed query compiler.
- dot(other, squeeze_self=None, squeeze_other=None)¶
Compute the matrix multiplication of self and other.
- Parameters
other (BaseQueryCompiler or NumPy array) – The other query compiler or NumPy array to matrix multiply with self.
squeeze_self (bool) – If self is a one-column query compiler, indicates whether it represents a Series object.
squeeze_other (bool) – If other is a one-column query compiler, indicates whether it represents a Series object.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
A new query compiler that contains result of the matrix multiply.
- Return type
BaseQueryCompiler
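The pandas-level counterpart is DataFrame.dot. A minimal plain-pandas sketch of the matrix-multiply semantics (illustrative, assuming NumPy is available):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]])
identity = np.eye(2)  # multiplying by the identity matrix leaves values unchanged
result = df.dot(identity)
```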
- drop(index=None, columns=None)¶
Drop specified rows or columns.
- Parameters
index (list of labels, optional) – Labels of rows to drop.
columns (list of labels, optional) – Labels of columns to drop.
- Returns
New QueryCompiler with removed data.
- Return type
BaseQueryCompiler
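Row and column labels can be dropped in a single call. A plain-pandas sketch of the equivalent high-level behavior (illustrative data):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=["r0", "r1"])
# Drop row "r0" and column "b" at once.
trimmed = df.drop(index=["r0"], columns=["b"])
```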
- dropna(**kwargs)¶
Remove missing values.
- Parameters
axis ({0, 1}) –
how ({"any", "all"}) –
thresh (int, optional) –
subset (list of labels) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler with null values dropped along given axis.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.dropna
for more information about parameters and output format.
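A plain-pandas sketch of the how parameter, which Modin mirrors (illustrative data):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": [4.0, 5.0, np.nan]})
# how="any" removes a row if any of its values is NaN;
# how="all" would remove only rows that are entirely NaN.
clean = df.dropna(axis=0, how="any")
```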
- dt_ceil(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_date(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_day(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_day_name(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_dayofweek(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_dayofyear(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_days(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_days_in_month(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_daysinmonth(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_end_time(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_floor(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_freq()¶
Get the time frequency of the underlying time-series data.
- Returns
QueryCompiler containing a single value, the frequency of the data.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.Series.dt.freq
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_hour(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_is_leap_year(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_is_month_end(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_is_month_start(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_is_quarter_end(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_is_quarter_start(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_is_year_end(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_is_year_start(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_microsecond(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_microseconds(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_minute(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_month(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_month_name(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_nanosecond(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_nanoseconds(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_normalize(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_quarter(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_qyear(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_round(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_second(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_seconds(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_start_time(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_strftime(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_time(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_timetz(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_to_period(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_to_pydatetime(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_to_pytimedelta(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_to_timestamp(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_total_seconds(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_tz()¶
Get the time-zone of the underlying time-series data.
- Returns
QueryCompiler containing a single value, time-zone of the data.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.Series.dt.tz
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
- dt_tz_convert(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_tz_localize(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_week(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_weekday(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_weekofyear(*args, **kwargs)¶
Execute Map function against passed query compiler.
- dt_year(*args, **kwargs)¶
Execute Map function against passed query compiler.
- property dtypes¶
Get columns dtypes.
- Returns
Series with dtypes of each column.
- Return type
pandas.Series
- eq(other, broadcast=False, *args, **kwargs)¶
Apply binary func to passed operands.
- Parameters
query_compiler (QueryCompiler) – Left operand of func.
other (QueryCompiler, list-like object or scalar) – Right operand of func.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently; however, we can’t distinguish them at the query compiler level, so this parameter is a hint passed from the high-level API.
*args (args,) – Arguments that will be passed to func.
**kwargs (kwargs,) – Arguments that will be passed to func.
- Returns
Result of binary function.
- Return type
QueryCompiler
- eval(expr, **kwargs)¶
Evaluate string expression on QueryCompiler columns.
- Parameters
expr (str) –
**kwargs (dict) –
- Returns
QueryCompiler containing the result of evaluation.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.eval
for more information about parameters and output format.
- fillna(**kwargs)¶
Replace NaN values using provided method.
- Parameters
value (scalar or dict) –
method ({"backfill", "bfill", "pad", "ffill", None}) –
axis ({0, 1}) –
inplace ({False}) – This parameter serves the compatibility purpose. Always has to be False.
limit (int, optional) –
downcast (dict, optional) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler with all null values filled.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.fillna
for more information about parameters and output format.
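A plain-pandas sketch of the value fill versus method fill (Modin mirrors these semantics; df.ffill() is the current pandas spelling of method="ffill"):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0]})
filled = df.fillna(value=0.0)  # replace every NaN with a scalar
forward = df.ffill()           # propagate the last valid value forward
```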
- finalize()¶
Finalize constructing the dataframe by calling all deferred functions that were used to build it.
- first_valid_index()¶
Return index label of first non-NaN/NULL value.
- Returns
- Return type
scalar
- floordiv(other, broadcast=False, *args, **kwargs)¶
Apply binary func to passed operands.
- Parameters
query_compiler (QueryCompiler) – Left operand of func.
other (QueryCompiler, list-like object or scalar) – Right operand of func.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently; however, we can’t distinguish them at the query compiler level, so this parameter is a hint passed from the high-level API.
*args (args,) – Arguments that will be passed to func.
**kwargs (kwargs,) – Arguments that will be passed to func.
- Returns
Result of binary function.
- Return type
QueryCompiler
- free()¶
Trigger a cleanup of this object.
- classmethod from_arrow(at, data_cls)¶
Build QueryCompiler from Arrow Table.
- Parameters
at (Arrow Table) – The Arrow Table to convert from.
data_cls (type) –
BasePandasFrame
class (or its descendant) to convert to.
- Returns
QueryCompiler containing data from the Arrow Table.
- Return type
BaseQueryCompiler
- classmethod from_pandas(df, data_cls)¶
Build QueryCompiler from pandas DataFrame.
- Parameters
df (pandas.DataFrame) – The pandas DataFrame to convert from.
data_cls (type) –
BasePandasFrame
class (or its descendant) to convert to.
- Returns
QueryCompiler containing data from the pandas DataFrame.
- Return type
BaseQueryCompiler
- ge(other, broadcast=False, *args, **kwargs)¶
Apply binary func to passed operands.
- Parameters
query_compiler (QueryCompiler) – Left operand of func.
other (QueryCompiler, list-like object or scalar) – Right operand of func.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently; however, we can’t distinguish them at the query compiler level, so this parameter is a hint passed from the high-level API.
*args (args,) – Arguments that will be passed to func.
**kwargs (kwargs,) – Arguments that will be passed to func.
- Returns
Result of binary function.
- Return type
QueryCompiler
- get_dummies(columns, **kwargs)¶
Convert categorical variables to dummy variables for certain columns.
- Parameters
columns (label or list of such) – Columns to convert.
prefix (str or list of such) –
prefix_sep (str) –
dummy_na (bool) –
drop_first (bool) –
dtype (dtype) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler with categorical variables converted to dummy variables.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.get_dummies
for more information about parameters and output format.
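A plain-pandas sketch of the column naming the prefix and prefix_sep parameters produce (Modin mirrors pandas here; the data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"color": ["red", "blue", "red"]})
# One indicator column per category, named prefix + prefix_sep + category.
dummies = pd.get_dummies(df, columns=["color"], prefix="color", prefix_sep="_")
```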
- getitem_array(key)¶
Mask QueryCompiler with key.
- Parameters
key (BaseQueryCompiler, np.ndarray or list of column labels) – Boolean mask represented by a QueryCompiler or
np.ndarray
of the same shape as self, or an enumerable of columns to pick.
- Returns
New masked QueryCompiler.
- Return type
BaseQueryCompiler
- getitem_column_array(key, numeric=False)¶
Get column data for target labels.
- Parameters
key (list-like) – Target labels by which to retrieve data.
numeric (bool, default: False) – Whether or not the key passed in represents the numeric index or the named index.
- Returns
New QueryCompiler that contains specified columns.
- Return type
BaseQueryCompiler
- getitem_row_array(key)¶
Get row data for target indices.
- Parameters
key (list-like) – Numeric indices of the rows to pick.
- Returns
New QueryCompiler that contains specified rows.
- Return type
BaseQueryCompiler
- groupby_agg(by, is_multi_by, axis, agg_func, agg_args, agg_kwargs, groupby_kwargs, drop=False)¶
Group QueryCompiler data and apply passed aggregation function.
- Parameters
by (BaseQueryCompiler, column or index label, Grouper or list of such) – Object that determines groups.
is_multi_by (bool) – If by is a QueryCompiler or a list of such, indicates whether it’s grouping on multiple columns/rows.
axis ({0, 1}) – Axis to group and apply aggregation function along. 0 is for index, 1 is for columns.
agg_func (dict or callable(DataFrameGroupBy) -> DataFrame) – Function to apply to the GroupBy object.
agg_args (dict) – Positional arguments to pass to agg_func.
agg_kwargs (dict) – Keyword arguments to pass to agg_func.
groupby_kwargs (dict) – GroupBy parameters as expected by the
modin.pandas.DataFrame.groupby
signature.
drop (bool, default: False) – If by is a QueryCompiler, indicates whether the by-data came from self.
- Returns
QueryCompiler containing the result of groupby aggregation.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.GroupBy.aggregate
for more information about parameters and output format.
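A plain-pandas sketch of the high-level groupby aggregation this method backs (illustrative data; Modin mirrors these semantics):

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "a", "b"], "val": [1, 2, 10]})
# With as_index=True (the default) the group names become the result index.
agg = df.groupby("key").agg("sum")
```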
- groupby_all(**kwargs)¶
Group QueryCompiler data and check whether all elements are True for every group.
- Parameters
by (BaseQueryCompiler, column or index label, Grouper or list of such) – Object that determines groups.
axis ({0, 1}) – Axis to group and apply reduction function along. 0 is for index, 1 is for columns.
groupby_args (dict) – GroupBy parameters as expected by the
modin.pandas.DataFrame.groupby
signature.
map_args (dict) – Keyword arguments to pass to the reduction function. If GroupBy is implemented via the MapReduce approach, this argument is passed at the map phase only.
reduce_args (dict, optional) – If GroupBy is implemented with the MapReduce approach, specifies arguments to pass to the reduction function at the reduce phase; has no effect otherwise.
numeric_only (bool, default: True) – Whether or not to drop non-numeric columns before executing GroupBy.
drop (bool, default: False) – If by is a QueryCompiler, indicates whether the by-data came from self.
- Returns
BaseQueryCompiler – QueryCompiler containing the result of groupby reduction built by the following rules:
Labels on the axis opposite to the specified one are preserved.
If groupby_args[“as_index”] is True, then labels on the specified axis are the group names; otherwise labels are the defaults: 0, 1 … n.
If groupby_args[“as_index”] is False, then the first N columns/rows of the frame contain group names, where N is the number of columns/rows to group on.
Each element of the QueryCompiler is a boolean indicating whether all elements are True for the corresponding group and column/row.
Warning: the map_args and reduce_args parameters are deprecated. They leaked here from
PandasQueryCompiler.groupby_*
: the pandas backend implements groupby via the MapReduce approach, but for other backends these parameters make no sense, so they will be removed in the future.
Notes
Please refer to
modin.pandas.GroupBy.all
for more information about parameters and output format.
- groupby_any(**kwargs)¶
Group QueryCompiler data and check whether any element is True for every group.
- Parameters
by (BaseQueryCompiler, column or index label, Grouper or list of such) – Object that determines groups.
axis ({0, 1}) – Axis to group and apply reduction function along. 0 is for index, 1 is for columns.
groupby_args (dict) – GroupBy parameters as expected by the
modin.pandas.DataFrame.groupby
signature.
map_args (dict) – Keyword arguments to pass to the reduction function. If GroupBy is implemented via the MapReduce approach, this argument is passed at the map phase only.
reduce_args (dict, optional) – If GroupBy is implemented with the MapReduce approach, specifies arguments to pass to the reduction function at the reduce phase; has no effect otherwise.
numeric_only (bool, default: True) – Whether or not to drop non-numeric columns before executing GroupBy.
drop (bool, default: False) – If by is a QueryCompiler, indicates whether the by-data came from self.
- Returns
BaseQueryCompiler – QueryCompiler containing the result of groupby reduction built by the following rules:
Labels on the axis opposite to the specified one are preserved.
If groupby_args[“as_index”] is True, then labels on the specified axis are the group names; otherwise labels are the defaults: 0, 1 … n.
If groupby_args[“as_index”] is False, then the first N columns/rows of the frame contain group names, where N is the number of columns/rows to group on.
Each element of the QueryCompiler is a boolean indicating whether any element is True for the corresponding group and column/row.
Warning: the map_args and reduce_args parameters are deprecated. They leaked here from
PandasQueryCompiler.groupby_*
: the pandas backend implements groupby via the MapReduce approach, but for other backends these parameters make no sense, so they will be removed in the future.
Notes
Please refer to
modin.pandas.GroupBy.any
for more information about parameters and output format.
- groupby_count(**kwargs)¶
Group QueryCompiler data and count non-null values for every group.
- Parameters
by (BaseQueryCompiler, column or index label, Grouper or list of such) – Object that determines groups.
axis ({0, 1}) – Axis to group and apply reduction function along. 0 is for index, 1 is for columns.
groupby_args (dict) – GroupBy parameters as expected by the
modin.pandas.DataFrame.groupby
signature.
map_args (dict) – Keyword arguments to pass to the reduction function. If GroupBy is implemented via the MapReduce approach, this argument is passed at the map phase only.
reduce_args (dict, optional) – If GroupBy is implemented with the MapReduce approach, specifies arguments to pass to the reduction function at the reduce phase; has no effect otherwise.
numeric_only (bool, default: True) – Whether or not to drop non-numeric columns before executing GroupBy.
drop (bool, default: False) – If by is a QueryCompiler, indicates whether the by-data came from self.
- Returns
BaseQueryCompiler – QueryCompiler containing the result of groupby reduction built by the following rules:
Labels on the axis opposite to the specified one are preserved.
If groupby_args[“as_index”] is True, then labels on the specified axis are the group names; otherwise labels are the defaults: 0, 1 … n.
If groupby_args[“as_index”] is False, then the first N columns/rows of the frame contain group names, where N is the number of columns/rows to group on.
Each element of the QueryCompiler is the number of non-null values for the corresponding group and column/row.
Warning: the map_args and reduce_args parameters are deprecated. They leaked here from
PandasQueryCompiler.groupby_*
: the pandas backend implements groupby via the MapReduce approach, but for other backends these parameters make no sense, so they will be removed in the future.
Notes
Please refer to
modin.pandas.GroupBy.count
for more information about parameters and output format.
- groupby_max(**kwargs)¶
Group QueryCompiler data and get the maximum value for every group.
- Parameters
by (BaseQueryCompiler, column or index label, Grouper or list of such) – Object that determines groups.
axis ({0, 1}) – Axis to group and apply reduction function along. 0 is for index, 1 is for columns.
groupby_args (dict) – GroupBy parameters as expected by the
modin.pandas.DataFrame.groupby
signature.
map_args (dict) – Keyword arguments to pass to the reduction function. If GroupBy is implemented via the MapReduce approach, this argument is passed at the map phase only.
reduce_args (dict, optional) – If GroupBy is implemented with the MapReduce approach, specifies arguments to pass to the reduction function at the reduce phase; has no effect otherwise.
numeric_only (bool, default: True) – Whether or not to drop non-numeric columns before executing GroupBy.
drop (bool, default: False) – If by is a QueryCompiler, indicates whether the by-data came from self.
- Returns
BaseQueryCompiler – QueryCompiler containing the result of groupby reduction built by the following rules:
Labels on the axis opposite to the specified one are preserved.
If groupby_args[“as_index”] is True, then labels on the specified axis are the group names; otherwise labels are the defaults: 0, 1 … n.
If groupby_args[“as_index”] is False, then the first N columns/rows of the frame contain group names, where N is the number of columns/rows to group on.
Each element of the QueryCompiler is the maximum value for the corresponding group and column/row.
Warning: the map_args and reduce_args parameters are deprecated. They leaked here from
PandasQueryCompiler.groupby_*
: the pandas backend implements groupby via the MapReduce approach, but for other backends these parameters make no sense, so they will be removed in the future.
Notes
Please refer to
modin.pandas.GroupBy.max
for more information about parameters and output format.
- groupby_min(**kwargs)¶
Group QueryCompiler data and get the minimum value for every group.
- Parameters
by (BaseQueryCompiler, column or index label, Grouper or list of such) – Object that determines groups.
axis ({0, 1}) – Axis to group and apply reduction function along. 0 is for index, 1 is for columns.
groupby_args (dict) – GroupBy parameters as expected by the
modin.pandas.DataFrame.groupby
signature.
map_args (dict) – Keyword arguments to pass to the reduction function. If GroupBy is implemented via the MapReduce approach, this argument is passed at the map phase only.
reduce_args (dict, optional) – If GroupBy is implemented with the MapReduce approach, specifies arguments to pass to the reduction function at the reduce phase; has no effect otherwise.
numeric_only (bool, default: True) – Whether or not to drop non-numeric columns before executing GroupBy.
drop (bool, default: False) – If by is a QueryCompiler, indicates whether the by-data came from self.
- Returns
BaseQueryCompiler – QueryCompiler containing the result of groupby reduction built by the following rules:
Labels on the axis opposite to the specified one are preserved.
If groupby_args[“as_index”] is True, then labels on the specified axis are the group names; otherwise labels are the defaults: 0, 1 … n.
If groupby_args[“as_index”] is False, then the first N columns/rows of the frame contain group names, where N is the number of columns/rows to group on.
Each element of the QueryCompiler is the minimum value for the corresponding group and column/row.
Warning: the map_args and reduce_args parameters are deprecated. They leaked here from
PandasQueryCompiler.groupby_*
: the pandas backend implements groupby via the MapReduce approach, but for other backends these parameters make no sense, so they will be removed in the future.
Notes
Please refer to
modin.pandas.GroupBy.min
for more information about parameters and output format.
- groupby_prod(**kwargs)¶
Group QueryCompiler data and compute product for every group.
- Parameters
by (BaseQueryCompiler, column or index label, Grouper or list of such) – Object that determines groups.
axis ({0, 1}) – Axis to group and apply reduction function along. 0 is for index, 1 is for columns.
groupby_args (dict) – GroupBy parameters as expected by the
modin.pandas.DataFrame.groupby
signature.
map_args (dict) – Keyword arguments to pass to the reduction function. If GroupBy is implemented via the MapReduce approach, this argument is passed at the map phase only.
reduce_args (dict, optional) – If GroupBy is implemented with the MapReduce approach, specifies arguments to pass to the reduction function at the reduce phase; has no effect otherwise.
numeric_only (bool, default: True) – Whether or not to drop non-numeric columns before executing GroupBy.
drop (bool, default: False) – If by is a QueryCompiler, indicates whether the by-data came from self.
- Returns
BaseQueryCompiler – QueryCompiler containing the result of groupby reduction built by the following rules:
Labels on the axis opposite to the specified one are preserved.
If groupby_args[“as_index”] is True, then labels on the specified axis are the group names; otherwise labels are the defaults: 0, 1 … n.
If groupby_args[“as_index”] is False, then the first N columns/rows of the frame contain group names, where N is the number of columns/rows to group on.
Each element of the QueryCompiler is the product for the corresponding group and column/row.
Warning: the map_args and reduce_args parameters are deprecated. They leaked here from
PandasQueryCompiler.groupby_*
: the pandas backend implements groupby via the MapReduce approach, but for other backends these parameters make no sense, so they will be removed in the future.
Notes
Please refer to
modin.pandas.GroupBy.prod
for more information about parameters and output format.
- groupby_size(by, axis, groupby_args, map_args, reduce_args, numeric_only, drop)¶
Group QueryCompiler data and get the number of elements for every group.
- Parameters
by (BaseQueryCompiler, column or index label, Grouper or list of such) – Object that determines groups.
axis ({0, 1}) – Axis to group and apply reduction function along. 0 is for index, 1 is for columns.
groupby_args (dict) – GroupBy parameters as expected by the
modin.pandas.DataFrame.groupby
signature.
map_args (dict) – Keyword arguments to pass to the reduction function. If GroupBy is implemented via the MapReduce approach, this argument is passed at the map phase only.
reduce_args (dict, optional) – If GroupBy is implemented with the MapReduce approach, specifies arguments to pass to the reduction function at the reduce phase; has no effect otherwise.
numeric_only (bool, default: True) – Whether or not to drop non-numeric columns before executing GroupBy.
drop (bool, default: False) – If by is a QueryCompiler, indicates whether the by-data came from self.
- Returns
BaseQueryCompiler – QueryCompiler containing the result of groupby reduction built by the following rules:
Labels on the axis opposite to the specified one are preserved.
If groupby_args[“as_index”] is True, then labels on the specified axis are the group names; otherwise labels are the defaults: 0, 1 … n.
If groupby_args[“as_index”] is False, then the first N columns/rows of the frame contain group names, where N is the number of columns/rows to group on.
Each element of the QueryCompiler is the number of elements for the corresponding group and column/row.
Warning: the map_args and reduce_args parameters are deprecated. They leaked here from
PandasQueryCompiler.groupby_*
: the pandas backend implements groupby via the MapReduce approach, but for other backends these parameters make no sense, so they will be removed in the future.
Notes
Please refer to
modin.pandas.GroupBy.size
for more information about parameters and output format.
- groupby_sum(**kwargs)¶
Group QueryCompiler data and compute sum for every group.
- Parameters
by (BaseQueryCompiler, column or index label, Grouper or list of such) – Object that determines groups.
axis ({0, 1}) – Axis to group and apply reduction function along. 0 is for index, 1 is for columns.
groupby_args (dict) – GroupBy parameters as expected by the
modin.pandas.DataFrame.groupby
signature.
map_args (dict) – Keyword arguments to pass to the reduction function. If GroupBy is implemented via the MapReduce approach, this argument is passed at the map phase only.
reduce_args (dict, optional) – If GroupBy is implemented with the MapReduce approach, specifies arguments to pass to the reduction function at the reduce phase; has no effect otherwise.
numeric_only (bool, default: True) – Whether or not to drop non-numeric columns before executing GroupBy.
drop (bool, default: False) – If by is a QueryCompiler, indicates whether the by-data came from self.
- Returns
BaseQueryCompiler – QueryCompiler containing the result of groupby reduction built by the following rules:
Labels on the axis opposite to the specified one are preserved.
If groupby_args[“as_index”] is True, then labels on the specified axis are the group names; otherwise labels are the defaults: 0, 1 … n.
If groupby_args[“as_index”] is False, then the first N columns/rows of the frame contain group names, where N is the number of columns/rows to group on.
Each element of the QueryCompiler is the sum for the corresponding group and column/row.
Warning: the map_args and reduce_args parameters are deprecated. They leaked here from
PandasQueryCompiler.groupby_*
: the pandas backend implements groupby via the MapReduce approach, but for other backends these parameters make no sense, so they will be removed in the future.
Notes
Please refer to
modin.pandas.GroupBy.sum
for more information about parameters and output format.
- gt(other, broadcast=False, *args, **kwargs)¶
Apply binary func to passed operands.
- Parameters
query_compiler (QueryCompiler) – Left operand of func.
other (QueryCompiler, list-like object or scalar) – Right operand of func.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently; however, we can’t distinguish them at the query compiler level, so this parameter is a hint passed from the high-level API.
*args (args,) – Arguments that will be passed to func.
**kwargs (kwargs,) – Arguments that will be passed to func.
- Returns
Result of binary function.
- Return type
QueryCompiler
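At the user level these binary comparison methods (gt, le, lt, ne, …) surface as the familiar pandas operators, and modin.pandas mirrors the pandas API. A minimal sketch with made-up data, shown with plain pandas (with Modin you would import modin.pandas instead):

```python
import pandas as pd  # with Modin: import modin.pandas as pd

df = pd.DataFrame({"a": [1, 5, 3], "b": [4, 2, 6]})

# Scalar right operand: element-wise "greater than"
mask = df.gt(3)

# One-column right operand treated as a Series (the `broadcast` hint at the
# query compiler level): compare every column against column "a", row by row
row_mask = df.gt(df["a"], axis=0)
```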
- idxmax(*args, **kwargs)¶
Execute Reduction function against passed query compiler.
- idxmin(*args, **kwargs)¶
Execute Reduction function against passed query compiler.
- insert(loc, column, value)¶
Insert new column.
- Parameters
loc (int) – Insertion position.
column (label) – Label of the new column.
value (One-column BaseQueryCompiler, 1D array or scalar) – Data to fill new column with.
- Returns
QueryCompiler with new column inserted.
- Return type
BaseQueryCompiler
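This backs DataFrame.insert at the modin.pandas level. A small sketch with illustrative data:

```python
import pandas as pd  # with Modin: import modin.pandas as pd

df = pd.DataFrame({"a": [1, 2], "c": [5, 6]})
df.insert(loc=1, column="b", value=[3, 4])  # in-place, positional insert
```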
- invert(*args, **kwargs)¶
Execute Map function against passed query compiler.
- is_monotonic_decreasing()¶
Return boolean if values in the object are monotonically decreasing.
- Returns
- Return type
bool
- is_monotonic_increasing()¶
Return boolean if values in the object are monotonically increasing.
- Returns
- Return type
bool
- is_series_like()¶
Check whether this QueryCompiler can represent
modin.pandas.Series
object.
- Returns
Return True if QueryCompiler has a single column or row, False otherwise.
- Return type
bool
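For example, at the Series level (plain pandas shown; modin.pandas mirrors it), where equal neighbours still count as monotonic:

```python
import pandas as pd  # with Modin: import modin.pandas as pd

s = pd.Series([3, 2, 2, 1])
dec = s.is_monotonic_decreasing  # repeated 2 does not break monotonicity
inc = s.is_monotonic_increasing
```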
- isin(*args, **kwargs)¶
Execute Map function against passed query compiler.
- isna(*args, **kwargs)¶
Execute Map function against passed query compiler.
- join(right, **kwargs)¶
Join columns of another QueryCompiler.
- Parameters
right (BaseQueryCompiler) – QueryCompiler of the right frame to join with.
on (label or list of such) –
how ({"left", "right", "outer", "inner"}) –
lsuffix (str) –
rsuffix (str) –
sort (bool) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
QueryCompiler that contains result of the join.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.join
for more information about parameters and output format.
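A brief join sketch on two toy frames aligned by index label (plain pandas shown; modin.pandas mirrors the API):

```python
import pandas as pd  # with Modin: import modin.pandas as pd

left = pd.DataFrame({"x": [1, 2]}, index=["a", "b"])
right = pd.DataFrame({"y": [10, 30]}, index=["a", "c"])
joined = left.join(right, how="left")  # keep left labels; "b" gets NaN in y
```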
- kurt(*args, **kwargs)¶
Execute Reduction function against passed query compiler.
- last_valid_index()¶
Return index label of last non-NaN/NULL value.
- Returns
- Return type
scalar
- le(other, broadcast=False, *args, **kwargs)¶
Apply binary func to passed operands.
- Parameters
query_compiler (QueryCompiler) – Left operand of func.
other (QueryCompiler, list-like object or scalar) – Right operand of func.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint passed from a high-level API.
*args (args,) – Arguments that will be passed to func.
**kwargs (kwargs,) – Arguments that will be passed to func.
- Returns
Result of binary function.
- Return type
QueryCompiler
- lt(other, broadcast=False, *args, **kwargs)¶
Apply binary func to passed operands.
- Parameters
query_compiler (QueryCompiler) – Left operand of func.
other (QueryCompiler, list-like object or scalar) – Right operand of func.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint passed from a high-level API.
*args (args,) – Arguments that will be passed to func.
**kwargs (kwargs,) – Arguments that will be passed to func.
- Returns
Result of binary function.
- Return type
QueryCompiler
- mad(*args, **kwargs)¶
Execute Reduction function against passed query compiler.
- max(axis, **kwargs)¶
Get the maximum value for each column or row.
- Parameters
axis ({0, 1}) –
level (None, default: None) – Serves the compatibility purpose. Always has to be None.
numeric_only (bool, optional) –
skipna (bool, default: True) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
One-column QueryCompiler with index labels of the specified axis, where each row contains the maximum value for the corresponding row or column.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.max
for more information about parameters and output format.
- mean(axis, **kwargs)¶
Get the mean value for each column or row.
- Parameters
axis ({0, 1}) –
level (None, default: None) – Serves the compatibility purpose. Always has to be None.
numeric_only (bool, optional) –
skipna (bool, default: True) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
One-column QueryCompiler with index labels of the specified axis, where each row contains the mean value for the corresponding row or column.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.mean
for more information about parameters and output format.
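These axis reductions collapse one axis into a single row or column of values. For example, with illustrative data (plain pandas shown; modin.pandas mirrors it):

```python
import pandas as pd  # with Modin: import modin.pandas as pd

df = pd.DataFrame({"a": [1.0, 3.0], "b": [2.0, 4.0]})
col_max = df.max()          # axis=0: one value per column
row_mean = df.mean(axis=1)  # axis=1: one value per row
```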
- median(*args, **kwargs)¶
Execute Reduction function against passed query compiler.
- melt(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)¶
Unpivot QueryCompiler data from wide to long format.
- Parameters
id_vars (list of labels, optional) –
value_vars (list of labels, optional) –
var_name (label) –
value_name (label) –
col_level (int or label) –
ignore_index (bool) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler with unpivoted data.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.melt
for more information about parameters and output format.
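A short wide-to-long sketch; the column names here are invented for illustration (plain pandas shown; modin.pandas mirrors the API):

```python
import pandas as pd  # with Modin: import modin.pandas as pd

wide = pd.DataFrame({"id": [1, 2], "x": [10, 20], "y": [30, 40]})
long = wide.melt(id_vars=["id"], value_vars=["x", "y"],
                 var_name="metric", value_name="reading")
# One row per (id, metric) pair: 2 ids * 2 value columns = 4 rows
```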
- memory_usage(*args, **kwargs)¶
Execute MapReduce function against passed query compiler.
- merge(right, **kwargs)¶
Merge QueryCompiler objects using a database-style join.
- Parameters
right (BaseQueryCompiler) – QueryCompiler of the right frame to merge with.
how ({"left", "right", "outer", "inner", "cross"}) –
on (label or list of such) –
left_on (label or list of such) –
right_on (label or list of such) –
left_index (bool) –
right_index (bool) –
sort (bool) –
suffixes (list-like) –
copy (bool) –
indicator (bool or str) –
validate (str) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
QueryCompiler that contains result of the merge.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.merge
for more information about parameters and output format.
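A database-style merge sketch on two toy frames (plain pandas shown; modin.pandas mirrors the API):

```python
import pandas as pd  # with Modin: import modin.pandas as pd

left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["b", "c"], "y": [3, 4]})
inner = left.merge(right, on="key", how="inner")  # only "b" matches
outer = left.merge(right, on="key", how="outer",
                   indicator=True)  # adds a _merge provenance column
```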
- min(axis, **kwargs)¶
Get the minimum value for each column or row.
- Parameters
axis ({0, 1}) –
level (None, default: None) – Serves the compatibility purpose. Always has to be None.
numeric_only (bool, optional) –
skipna (bool, default: True) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
One-column QueryCompiler with index labels of the specified axis, where each row contains the minimum value for the corresponding row or column.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.min
for more information about parameters and output format.
- mod(other, broadcast=False, *args, **kwargs)¶
Apply binary func to passed operands.
- Parameters
query_compiler (QueryCompiler) – Left operand of func.
other (QueryCompiler, list-like object or scalar) – Right operand of func.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint passed from a high-level API.
*args (args,) – Arguments that will be passed to func.
**kwargs (kwargs,) – Arguments that will be passed to func.
- Returns
Result of binary function.
- Return type
QueryCompiler
- mode(**kwargs)¶
Get the modes for every column or row.
- Parameters
axis ({0, 1}) –
numeric_only (bool) –
dropna (bool) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler with modes calculated along the given axis.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.mode
for more information about parameters and output format.
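For example, with made-up data where each column has a single mode (plain pandas shown; modin.pandas mirrors it):

```python
import pandas as pd  # with Modin: import modin.pandas as pd

df = pd.DataFrame({"a": [1, 2, 2], "b": [3, 3, 4]})
modes = df.mode()  # one row here, since each column has exactly one mode
```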
- mul(other, broadcast=False, *args, **kwargs)¶
Apply binary func to passed operands.
- Parameters
query_compiler (QueryCompiler) – Left operand of func.
other (QueryCompiler, list-like object or scalar) – Right operand of func.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint passed from a high-level API.
*args (args,) – Arguments that will be passed to func.
**kwargs (kwargs,) – Arguments that will be passed to func.
- Returns
Result of binary function.
- Return type
QueryCompiler
- ne(other, broadcast=False, *args, **kwargs)¶
Apply binary func to passed operands.
- Parameters
query_compiler (QueryCompiler) – Left operand of func.
other (QueryCompiler, list-like object or scalar) – Right operand of func.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint passed from a high-level API.
*args (args,) – Arguments that will be passed to func.
**kwargs (kwargs,) – Arguments that will be passed to func.
- Returns
Result of binary function.
- Return type
QueryCompiler
- negative(*args, **kwargs)¶
Execute Map function against passed query compiler.
- nlargest(*args, **kwargs)¶
Return the first n rows ordered by columns in descending order.
- Parameters
n (int, default: 5) –
columns (list of labels, optional) – Column labels to order by. (note: this parameter can be omitted only for a single-column query compiler representing a Series object; otherwise columns has to be specified).
keep ({"first", "last", "all"}, default: "first") –
- Returns
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.nlargest
for more information about parameters and output format.
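A quick sketch with illustrative data; note that columns is required for a DataFrame (plain pandas shown; modin.pandas mirrors the API):

```python
import pandas as pd  # with Modin: import modin.pandas as pd

df = pd.DataFrame({"score": [5, 9, 7], "name": ["x", "y", "z"]})
top2 = df.nlargest(2, columns=["score"])  # rows ordered by score, descending
```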
- notna(*args, **kwargs)¶
Execute Map function against passed query compiler.
- nsmallest(*args, **kwargs)¶
Return the first n rows ordered by columns in ascending order.
- Parameters
n (int, default: 5) –
columns (list of labels, optional) – Column labels to order by. (note: this parameter can be omitted only for a single-column query compiler representing a Series object; otherwise columns has to be specified).
keep ({"first", "last", "all"}, default: "first") –
- Returns
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.nsmallest
for more information about parameters and output format.
- nunique(*args, **kwargs)¶
Execute Reduction function against passed query compiler.
- pivot(index, columns, values)¶
Produce pivot table based on column values.
- Parameters
index (label or list of such, pandas.Index, optional) –
columns (label or list of such) –
values (label or list of such, optional) –
- Returns
New QueryCompiler containing pivot table.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.pivot
for more information about parameters and output format.
- pivot_table(index, values, columns, aggfunc, fill_value, margins, dropna, margins_name, observed, sort)¶
Create a spreadsheet-style pivot table from underlying data.
- Parameters
index (label, pandas.Grouper, array or list of such) –
values (label, optional) –
columns (column, pandas.Grouper, array or list of such) –
aggfunc (callable(pandas.Series) -> scalar, dict of list of such) –
fill_value (scalar, optional) –
margins (bool) –
dropna (bool) –
margins_name (str) –
observed (bool) –
sort (bool) –
- Returns
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.pivot_table
for more information about parameters and output format.
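A minimal pivot-table sketch with invented data, aggregating by mean (plain pandas shown; modin.pandas mirrors the API):

```python
import pandas as pd  # with Modin: import modin.pandas as pd

df = pd.DataFrame({"k": ["a", "a", "b"], "v": [1.0, 3.0, 5.0]})
pt = df.pivot_table(index="k", values="v", aggfunc="mean")
# group "a" averages to 2.0, group "b" to 5.0
```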
- pow(other, broadcast=False, *args, **kwargs)¶
Apply binary func to passed operands.
- Parameters
query_compiler (QueryCompiler) – Left operand of func.
other (QueryCompiler, list-like object or scalar) – Right operand of func.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint passed from a high-level API.
*args (args,) – Arguments that will be passed to func.
**kwargs (kwargs,) – Arguments that will be passed to func.
- Returns
Result of binary function.
- Return type
QueryCompiler
- prod(*args, **kwargs)¶
Execute MapReduce function against passed query compiler.
- prod_min_count(*args, **kwargs)¶
Execute Reduction function against passed query compiler.
- quantile_for_list_of_values(**kwargs)¶
Get the value at the given quantile for each column or row.
- Parameters
q (list-like) –
axis ({0, 1}) –
numeric_only (bool) –
interpolation ({"linear", "lower", "higher", "midpoint", "nearest"}) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
One-column QueryCompiler with index labels of the specified axis, where each row contains the value at the given quantile for the corresponding row or column.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.quantile
for more information about parameters and output format.
- quantile_for_single_value(*args, **kwargs)¶
Execute Reduction function against passed query compiler.
- query(expr, **kwargs)¶
Query columns of the QueryCompiler with a boolean expression.
- Parameters
expr (str) –
**kwargs (dict) –
- Returns
New QueryCompiler containing the rows where the boolean expression is satisfied.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.query
for more information about parameters and output format.
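For example, filtering rows with a boolean expression over column names (plain pandas shown; modin.pandas mirrors it):

```python
import pandas as pd  # with Modin: import modin.pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})
out = df.query("a > 1 and b < 30")  # keep only rows where the expression holds
```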
- rank(**kwargs)¶
Compute numerical rank along the specified axis.
By default, equal values are assigned a rank that is the average of the ranks of those values; this behaviour can be changed via the method parameter.
- Parameters
axis ({0, 1}) –
method ({"average", "min", "max", "first", "dense"}) –
numeric_only (bool) –
na_option ({"keep", "top", "bottom"}) –
ascending (bool) –
pct (bool) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
QueryCompiler of the same shape as self, where each element is the numerical rank of the corresponding value along row or column.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.rank
for more information about parameters and output format.
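The tie-handling options can be seen on a small Series (plain pandas shown; modin.pandas mirrors the API):

```python
import pandas as pd  # with Modin: import modin.pandas as pd

s = pd.Series([7, 7, 3])
avg_rank = s.rank()                 # default "average": tied 7s share (2+3)/2
dense_rank = s.rank(method="dense")  # ranks increase by 1 between groups
```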
- reindex(axis, labels, **kwargs)¶
Align QueryCompiler data with a new index along specified axis.
- Parameters
axis ({0, 1}) – Axis to align labels along. 0 is for index, 1 is for columns.
labels (list-like) – Index-labels to align with.
method ({None, "backfill"/"bfill", "pad"/"ffill", "nearest"}) – Method to use for filling holes in reindexed frame.
fill_value (scalar) – Value to use for missing values in the resulting frame.
limit (int) –
tolerance (int) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
QueryCompiler with aligned axis.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.reindex
for more information about parameters and output format.
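A reindex sketch introducing a new label and filling the resulting hole (plain pandas shown; modin.pandas mirrors the API):

```python
import pandas as pd  # with Modin: import modin.pandas as pd

s = pd.Series([1.0, 2.0], index=["a", "b"])
aligned = s.reindex(["a", "b", "c"], fill_value=0.0)  # "c" is new, filled
```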
- replace(*args, **kwargs)¶
Execute Map function against passed query compiler.
- resample_agg_df(resample_args, func, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and apply passed aggregation function for each group over the specified axis.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
func (str, dict, callable(pandas.Series) -> scalar, or list of such) –
*args (iterable) – Positional arguments to pass to the aggregation function.
**kwargs (dict) – Keyword arguments to pass to the aggregation function.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps)
Labels on the opposite of the specified axis are a MultiIndex, where the first level contains the preserved labels of this axis and the second level is the function names.
Each element of QueryCompiler is the result of corresponding function for the corresponding group and column/row.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.Resampler.agg
for more information about parameters and output format.
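At the user level this corresponds to resample(...).agg(...). A sketch on an invented hourly series, aggregated into 2-hour bins (plain pandas shown; modin.pandas mirrors the API):

```python
import pandas as pd  # with Modin: import modin.pandas as pd

idx = pd.date_range("2021-01-01", periods=4, freq="h")
s = pd.Series([1, 2, 3, 4], index=idx)
agg = s.resample("2h").agg(["sum", "max"])
# bin 00:00 covers [1, 2]; bin 02:00 covers [3, 4]; one column per function
```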
- resample_agg_ser(resample_args, func, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and apply passed aggregation function in a one-column query compiler for each group over the specified axis.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
func (str, dict, callable(pandas.Series) -> scalar, or list of such) –
*args (iterable) – Positional arguments to pass to the aggregation function.
**kwargs (dict) – Keyword arguments to pass to the aggregation function.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps)
Labels on the opposite of the specified axis are a MultiIndex, where the first level contains the preserved labels of this axis and the second level is the function names.
Each element of QueryCompiler is the result of corresponding function for the corresponding group and column/row.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.Resampler.agg
for more information about parameters and output format.
Warning
This method duplicates logic of
resample_agg_df
and will be removed soon.
- resample_app_df(resample_args, func, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and apply passed aggregation function for each group over the specified axis.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
func (str, dict, callable(pandas.Series) -> scalar, or list of such) –
*args (iterable) – Positional arguments to pass to the aggregation function.
**kwargs (dict) – Keyword arguments to pass to the aggregation function.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps)
Labels on the opposite of the specified axis are a MultiIndex, where the first level contains the preserved labels of this axis and the second level is the function names.
Each element of QueryCompiler is the result of corresponding function for the corresponding group and column/row.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.Resampler.apply
for more information about parameters and output format.
Warning
This method duplicates logic of
resample_agg_df
and will be removed soon.
- resample_app_ser(resample_args, func, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and apply passed aggregation function in a one-column query compiler for each group over the specified axis.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
func (str, dict, callable(pandas.Series) -> scalar, or list of such) –
*args (iterable) – Positional arguments to pass to the aggregation function.
**kwargs (dict) – Keyword arguments to pass to the aggregation function.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps)
Labels on the opposite of the specified axis are a MultiIndex, where the first level contains the preserved labels of this axis and the second level is the function names.
Each element of QueryCompiler is the result of corresponding function for the corresponding group and column/row.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.Resampler.apply
for more information about parameters and output format.
Warning
This method duplicates logic of
resample_agg_df
and will be removed soon.
- resample_asfreq(resample_args, fill_value)¶
Resample time-series data and get the values at the new frequency.
Group data into intervals by time-series row/column with a specified frequency and get values at the new frequency.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
fill_value (scalar) –
- Returns
New QueryCompiler containing values at the specified frequency.
- Return type
BaseQueryCompiler
- resample_backfill(resample_args, limit)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and fill missing values in each group independently using back-fill method.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
limit (int) –
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
QueryCompiler contains upsampled data with missing values filled.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.Resampler.backfill
for more information about parameters and output format.
- resample_bfill(resample_args, limit)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and fill missing values in each group independently using back-fill method.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
limit (int) –
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
QueryCompiler contains upsampled data with missing values filled.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.Resampler.bfill
for more information about parameters and output format.
- resample_count(resample_args)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute number of non-NA values for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps)
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the number of non-NA values for the corresponding group and column/row.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.Resampler.count
for more information about parameters and output format.
- resample_ffill(resample_args, limit)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and fill missing values in each group independently using forward-fill method.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
limit (int) –
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
QueryCompiler contains upsampled data with missing values filled.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.Resampler.ffill
for more information about parameters and output format.
- resample_fillna(resample_args, method, limit)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and fill missing values in each group independently using specified method.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
method (str) –
limit (int) –
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
QueryCompiler contains upsampled data with missing values filled.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.Resampler.fillna
for more information about parameters and output format.
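The fill-method family (ffill/bfill/fillna) is most useful when upsampling to a finer frequency. A sketch with an invented 2-hourly series forward-filled to hourly (plain pandas shown; modin.pandas mirrors the API):

```python
import pandas as pd  # with Modin: import modin.pandas as pd

idx = pd.date_range("2021-01-01", periods=2, freq="2h")
s = pd.Series([1.0, 3.0], index=idx)
up = s.resample("h").ffill()  # 01:00 is a new slot, filled from 00:00
```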
- resample_first(resample_args, _method, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute first element for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
_method (str) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps)
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the first element for the corresponding group and column/row.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.Resampler.first
for more information about parameters and output format.
- resample_get_group(resample_args, name, obj)¶
Resample time-series data and get the specified group.
Group data into intervals by time-series row/column with a specified frequency and get the values of the specified group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
name (object) –
obj (modin.pandas.DataFrame, optional) –
- Returns
New QueryCompiler containing the values from the specified group.
- Return type
BaseQueryCompiler
- resample_interpolate(resample_args, method, axis, limit, inplace, limit_direction, limit_area, downcast, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and fill missing values in each group independently using specified interpolation method.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
method (str) –
axis ({0, 1}) –
limit (int) –
inplace ({False}) – This parameter serves the compatibility purpose. Always has to be False.
limit_direction ({"forward", "backward", "both"}) –
limit_area ({None, "inside", "outside"}) –
downcast (str, optional) –
**kwargs (dict) –
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
QueryCompiler contains upsampled data with missing values filled.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.Resampler.interpolate
for more information about parameters and output format.
- resample_last(resample_args, _method, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute last element for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
_method (str) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps)
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the last element for the corresponding group and column/row.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.Resampler.last
for more information about parameters and output format.
- resample_max(resample_args, _method, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute maximum value for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
_method (str) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps)
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the maximum value for the corresponding group and column/row.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.Resampler.max
for more information about parameters and output format.
- resample_mean(resample_args, _method, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute mean value for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
_method (str) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps)
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the mean value for the corresponding group and column/row.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.Resampler.mean
for more information about parameters and output format.
- resample_median(resample_args, _method, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute median value for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
_method (str) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps)
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the median value for the corresponding group and column/row.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.Resampler.median
for more information about parameters and output format.
- resample_min(resample_args, _method, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute minimum value for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
_method (str) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps)
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the minimum value for the corresponding group and column/row.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.Resampler.min
for more information about parameters and output format.
- resample_nearest(resample_args, limit)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and fill missing values in each group independently using ‘nearest’ method.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
limit (int) –
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
QueryCompiler contains upsampled data with missing values filled.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.nearest
for more information about parameters and output format.
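Since the docstring defers to pandas semantics, the behavior can be sketched at the pandas level (illustrative data; the modin.pandas API mirrors this):

```python
import pandas as pd

# Illustrative sketch of the pandas-equivalent behavior: upsample a
# 2-hourly series to hourly bins and fill the new slots with the value
# of the nearest original observation.
idx = pd.date_range("2023-01-01", periods=3, freq="2h")
s = pd.Series([10, 20, 30], index=idx)
up = s.resample("1h").nearest()  # 5 hourly rows, gaps filled from neighbors
```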
- resample_nunique(resample_args, _method, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute number of unique values for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
_method (str) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps)
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the number of unique values for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.nunique
for more information about parameters and output format.
- resample_ohlc_df(resample_args, _method, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute open, high, low and close values for each group over the specified axis.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
_method (str) –
*args (iterable) – Positional arguments to pass to the aggregation function.
**kwargs (dict) – Keyword arguments to pass to the aggregation function.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps)
Labels on the opposite of the specified axis are a MultiIndex, where the first level contains preserved labels of this axis and the second level is the labels of columns containing computed values.
Each element of QueryCompiler is the result of corresponding function for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.ohlc
for more information about parameters and output format.
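The MultiIndex rule above is easiest to see in a pandas-level sketch (illustrative data; the modin.pandas API mirrors pandas here):

```python
import pandas as pd

# Illustrative pandas-level sketch: ohlc() produces a MultiIndex on the
# columns - the first level is the original column label, the second
# level holds the computed value names (open/high/low/close).
idx = pd.date_range("2023-01-01", periods=6, freq="min")
df = pd.DataFrame({"price": [1, 3, 2, 5, 4, 6]}, index=idx)
agg = df.resample("3min").ohlc()
```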
- resample_ohlc_ser(resample_args, _method, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute open, high, low and close values for each group over the specified axis.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
_method (str) –
*args (iterable) – Positional arguments to pass to the aggregation function.
**kwargs (dict) – Keyword arguments to pass to the aggregation function.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps)
Labels on the opposite of the specified axis are a MultiIndex, where the first level contains preserved labels of this axis and the second level is the labels of columns containing computed values.
Each element of QueryCompiler is the result of corresponding function for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.ohlc
for more information about parameters and output format.
- resample_pad(resample_args, limit)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and fill missing values in each group independently using ‘pad’ method.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
limit (int) –
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
QueryCompiler contains upsampled data with missing values filled.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.pad
for more information about parameters and output format.
- resample_pipe(resample_args, func, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency, build an equivalent
pandas.Resampler
object, and apply the passed function to it.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
func (callable(pandas.Resampler) -> object or tuple(callable, str)) –
*args (iterable) – Positional arguments to pass to function.
**kwargs (dict) – Keyword arguments to pass to function.
- Returns
New QueryCompiler containing the result of passed function.
- Return type
Notes
Please refer to
modin.pandas.Resampler.pipe
for more information about parameters and output format.
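A pandas-level sketch of the pipe mechanism (illustrative data; the modin.pandas API mirrors pandas here):

```python
import pandas as pd

# Illustrative pandas-level sketch: pipe() receives the Resampler object
# itself, so the passed callable can combine several aggregations.
idx = pd.date_range("2023-01-01", periods=4, freq="D")
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)
spread = s.resample("2D").pipe(lambda r: r.max() - r.min())  # per-bin range
```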
- resample_prod(resample_args, _method, min_count, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute product for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
_method (str) –
min_count (int) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps)
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the product for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.prod
for more information about parameters and output format.
- resample_quantile(resample_args, q, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute quantile for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
q (float) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps)
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the quantile for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.quantile
for more information about parameters and output format.
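A pandas-level sketch of the per-group quantile rule (illustrative data; the modin.pandas API mirrors pandas here):

```python
import pandas as pd

# Illustrative pandas-level sketch: quantile(q) is computed per group,
# and the group labels (timestamps) become the axis labels of the result.
idx = pd.date_range("2023-01-01", periods=4, freq="h")
s = pd.Series([1.0, 3.0, 2.0, 4.0], index=idx)
med = s.resample("2h").quantile(0.5)  # median of each 2-hour bin
```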
- resample_sem(resample_args, _method, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute standard error of the mean for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
ddof (int, default: 1) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps)
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the standard error of the mean for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.sem
for more information about parameters and output format.
- resample_size(resample_args)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute number of elements in a group for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps)
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the number of elements in a group for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.size
for more information about parameters and output format.
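A pandas-level sketch of the group-size rule (illustrative data; the modin.pandas API mirrors pandas here):

```python
import pandas as pd

# Illustrative pandas-level sketch: size() counts the elements that fall
# into each time bin.
idx = pd.to_datetime(
    ["2023-01-01 00:00", "2023-01-01 00:30", "2023-01-01 01:15"]
)
s = pd.Series([1, 2, 3], index=idx)
sizes = s.resample("h").size()  # two rows in hour 0, one row in hour 1
```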
- resample_std(resample_args, ddof, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute standard deviation for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
ddof (int) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps)
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the standard deviation for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.std
for more information about parameters and output format.
- resample_sum(resample_args, _method, min_count, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute sum for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
_method (str) –
min_count (int) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps)
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the sum for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.sum
for more information about parameters and output format.
- resample_transform(resample_args, arg, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and call the passed function on each group. In contrast to
resample_app_df
, this applies the function to the whole group instead of a single axis.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
arg (callable(pandas.DataFrame) -> pandas.Series) –
*args (iterable) – Positional arguments to pass to function.
**kwargs (dict) – Keyword arguments to pass to function.
- Returns
New QueryCompiler containing the result of passed function.
- Return type
- resample_var(resample_args, ddof, *args, **kwargs)¶
Resample time-series data and apply aggregation on it.
Group data into intervals by time-series row/column with a specified frequency and compute variance for each group.
- Parameters
resample_args (list) – Resample parameters as expected by
modin.pandas.DataFrame.resample
signature.
ddof (int) –
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the result of resample aggregation built by the following rules:
Labels on the specified axis are the group names (time-stamps)
Labels on the opposite of the specified axis are preserved.
Each element of QueryCompiler is the variance for the corresponding group and column/row.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.Resampler.var
for more information about parameters and output format.
- reset_index(**kwargs)¶
Reset the index, or a level of it.
- Parameters
drop (bool) – Whether to drop the reset index or insert it at the beginning of the frame.
level (int or label, optional) – Level to remove from index. Removes all levels by default.
col_level (int or label) – If the columns have multiple levels, determines which level the labels are inserted into.
col_fill (label) – If the columns have multiple levels, determines how the other levels are named.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
QueryCompiler with reset index.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.reset_index
for more information about parameters and output format.
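A pandas-level sketch of the drop parameter (illustrative data; the modin.pandas API mirrors pandas here):

```python
import pandas as pd

# Illustrative pandas-level sketch: by default the old index is inserted
# as a leading column; drop=True discards it instead.
df = pd.DataFrame({"x": [1, 2]}, index=pd.Index(["a", "b"], name="key"))
kept = df.reset_index()              # "key" becomes the first column
dropped = df.reset_index(drop=True)  # index replaced by a RangeIndex
```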
- rfloordiv(other, broadcast=False, *args, **kwargs)¶
Apply binary func to passed operands.
- Parameters
query_compiler (QueryCompiler) – Left operand of func.
other (QueryCompiler, list-like object or scalar) – Right operand of func.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint passed from the high-level API.
*args (args,) – Arguments that will be passed to func.
**kwargs (kwargs,) – Arguments that will be passed to func.
- Returns
Result of binary function.
- Return type
QueryCompiler
- rmod(other, broadcast=False, *args, **kwargs)¶
Apply binary func to passed operands.
- Parameters
query_compiler (QueryCompiler) – Left operand of func.
other (QueryCompiler, list-like object or scalar) – Right operand of func.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint passed from the high-level API.
*args (args,) – Arguments that will be passed to func.
**kwargs (kwargs,) – Arguments that will be passed to func.
- Returns
Result of binary function.
- Return type
QueryCompiler
- rolling_aggregate(rolling_args, func, *args, **kwargs)¶
Create rolling window and apply specified functions for each window.
- Parameters
rolling_args (list) – Rolling windows arguments with the same signature as
modin.pandas.DataFrame.rolling
.
func (str, dict, callable(pandas.Series) -> scalar, or list of such) –
*args (iterable) –
**kwargs (dict) –
- Returns
New QueryCompiler containing the result of passed functions for each window, built by the following rules:
Labels on the specified axis are preserved.
Labels on the opposite of the specified axis are a MultiIndex, where the first level contains preserved labels of this axis and the second level has the function names.
Each element of QueryCompiler is the result of corresponding function for the corresponding window and column/row.
- Return type
Notes
Please refer to
modin.pandas.Rolling.aggregate
for more information about parameters and output format.
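The MultiIndex rule above can be sketched at the pandas level (illustrative data; the modin.pandas API mirrors pandas here):

```python
import pandas as pd

# Illustrative pandas-level sketch: aggregating with a list of functions
# yields MultiIndex labels - the first level is the source column, the
# second level the function names.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0]})
agg = df.rolling(2).aggregate(["min", "max"])  # first row is NaN (incomplete window)
```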
- rolling_apply(*args, **kwargs)¶
Execute Fold function against passed query compiler.
- rolling_corr(rolling_args, other, pairwise, *args, **kwargs)¶
Create rolling window and compute correlation for each window.
- Parameters
rolling_args (list) – Rolling windows arguments with the same signature as
modin.pandas.DataFrame.rolling
.
other (modin.pandas.Series, modin.pandas.DataFrame, list-like, optional) –
pairwise (bool, optional) –
*args (iterable) –
**kwargs (dict) –
- Returns
New QueryCompiler containing correlation for each window, built by the following rules:
Output QueryCompiler has the same shape and axes labels as the source.
Each element is the correlation for the corresponding window.
- Return type
Notes
Please refer to
modin.pandas.Rolling.corr
for more information about parameters and output format.
- rolling_count(*args, **kwargs)¶
Execute Fold function against passed query compiler.
- rolling_cov(rolling_args, other, pairwise, ddof, **kwargs)¶
Create rolling window and compute covariance for each window.
- Parameters
rolling_args (list) – Rolling windows arguments with the same signature as
modin.pandas.DataFrame.rolling
.
other (modin.pandas.Series, modin.pandas.DataFrame, list-like, optional) –
pairwise (bool, optional) –
ddof (int, default: 1) –
**kwargs (dict) –
- Returns
New QueryCompiler containing covariance for each window, built by the following rules:
Output QueryCompiler has the same shape and axes labels as the source.
Each element is the covariance for the corresponding window.
- Return type
Notes
Please refer to
modin.pandas.Rolling.cov
for more information about parameters and output format.
- rolling_kurt(*args, **kwargs)¶
Execute Fold function against passed query compiler.
- rolling_max(*args, **kwargs)¶
Execute Fold function against passed query compiler.
- rolling_mean(*args, **kwargs)¶
Execute Fold function against passed query compiler.
- rolling_median(*args, **kwargs)¶
Execute Fold function against passed query compiler.
- rolling_min(*args, **kwargs)¶
Execute Fold function against passed query compiler.
- rolling_quantile(*args, **kwargs)¶
Execute Fold function against passed query compiler.
- rolling_skew(*args, **kwargs)¶
Execute Fold function against passed query compiler.
- rolling_std(*args, **kwargs)¶
Execute Fold function against passed query compiler.
- rolling_sum(*args, **kwargs)¶
Execute Fold function against passed query compiler.
- rolling_var(*args, **kwargs)¶
Execute Fold function against passed query compiler.
- round(*args, **kwargs)¶
Execute Map function against passed query compiler.
- rpow(other, broadcast=False, *args, **kwargs)¶
Apply binary func to passed operands.
- Parameters
query_compiler (QueryCompiler) – Left operand of func.
other (QueryCompiler, list-like object or scalar) – Right operand of func.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint passed from the high-level API.
*args (args,) – Arguments that will be passed to func.
**kwargs (kwargs,) – Arguments that will be passed to func.
- Returns
Result of binary function.
- Return type
QueryCompiler
- rsub(other, broadcast=False, *args, **kwargs)¶
Apply binary func to passed operands.
- Parameters
query_compiler (QueryCompiler) – Left operand of func.
other (QueryCompiler, list-like object or scalar) – Right operand of func.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint passed from the high-level API.
*args (args,) – Arguments that will be passed to func.
**kwargs (kwargs,) – Arguments that will be passed to func.
- Returns
Result of binary function.
- Return type
QueryCompiler
- rtruediv(other, broadcast=False, *args, **kwargs)¶
Apply binary func to passed operands.
- Parameters
query_compiler (QueryCompiler) – Left operand of func.
other (QueryCompiler, list-like object or scalar) – Right operand of func.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint passed from the high-level API.
*args (args,) – Arguments that will be passed to func.
**kwargs (kwargs,) – Arguments that will be passed to func.
- Returns
Result of binary function.
- Return type
QueryCompiler
- searchsorted(**kwargs)¶
Find positions in a sorted self where value should be inserted to maintain order.
- Parameters
value (list-like) –
side ({"left", "right"}) –
sorter (list-like, optional) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
One-column QueryCompiler which contains indices to insert.
- Return type
Notes
Please refer to
modin.pandas.Series.searchsorted
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
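A pandas-level sketch of the insertion-position semantics (illustrative data; the modin.pandas API mirrors pandas here):

```python
import pandas as pd

# Illustrative pandas-level sketch: the values must already be sorted;
# the result is the insertion position(s) that keep the order intact.
s = pd.Series([1, 3, 5, 7])
single = s.searchsorted(4)        # position for one value
several = s.searchsorted([0, 8])  # positions for a list-like value
```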
- sem(*args, **kwargs)¶
Execute Reduction function against passed query compiler.
- series_update(other, broadcast=False, *args, **kwargs)¶
Apply binary func to passed operands.
- Parameters
query_compiler (QueryCompiler) – Left operand of func.
other (QueryCompiler, list-like object or scalar) – Right operand of func.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint passed from the high-level API.
*args (args,) – Arguments that will be passed to func.
**kwargs (kwargs,) – Arguments that will be passed to func.
- Returns
Result of binary function.
- Return type
QueryCompiler
- series_view(*args, **kwargs)¶
Execute Map function against passed query compiler.
- set_index_from_columns(keys: List[Hashable], drop: bool = True, append: bool = False)¶
Create new row labels from a list of columns.
- Parameters
keys (list of hashable) – The list of column names that will become the new index.
drop (bool, default: True) – Whether or not to drop the columns provided in the keys argument.
append (bool, default: False) – Whether or not to add the columns in keys as new levels appended to the existing index.
- Returns
A new QueryCompiler with updated index.
- Return type
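A pandas-level sketch of the keys/drop parameters via DataFrame.set_index, the pandas counterpart of this method (illustrative data):

```python
import pandas as pd

# Illustrative pandas-level sketch: the named column becomes the new row
# labels; drop controls whether it also remains a column.
df = pd.DataFrame({"k": ["a", "b"], "v": [1, 2]})
moved = df.set_index("k")             # "k" removed from the columns
kept = df.set_index("k", drop=False)  # "k" stays as a column too
```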
- setitem(axis, key, value)¶
Set the row/column defined by key to the value provided.
- Parameters
axis ({0, 1}) – Axis to set value along. 0 means set row, 1 means set column.
key (label) – Row/column label to set value in.
value (BaseQueryCompiler, list-like or scalar) – Define new row/column value.
- Returns
New QueryCompiler with updated key value.
- Return type
- skew(*args, **kwargs)¶
Execute Reduction function against passed query compiler.
- sort_columns_by_row_values(rows, ascending=True, **kwargs)¶
Reorder the columns based on the lexicographic order of the given rows.
- Parameters
rows (label or list of labels) – The row or rows to sort by.
ascending (bool, default: True) – Sort in ascending order (True) or descending order (False).
kind ({"quicksort", "mergesort", "heapsort"}) –
na_position ({"first", "last"}) –
ignore_index (bool) –
key (callable(pandas.Index) -> pandas.Index, optional) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler that contains result of the sort.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.sort_values
for more information about parameters and output format.
- sort_index(**kwargs)¶
Sort data by index or column labels.
- Parameters
axis ({0, 1}) –
level (int, label or list of such) –
ascending (bool) –
inplace (bool) –
kind ({"quicksort", "mergesort", "heapsort"}) –
na_position ({"first", "last"}) –
sort_remaining (bool) –
ignore_index (bool) –
key (callable(pandas.Index) -> pandas.Index, optional) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler containing the data sorted by columns or indices.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.sort_index
for more information about parameters and output format.
- sort_rows_by_column_values(columns, ascending=True, **kwargs)¶
Reorder the rows based on the lexicographic order of the given columns.
- Parameters
columns (label or list of labels) – The column or columns to sort by.
ascending (bool, default: True) – Sort in ascending order (True) or descending order (False).
kind ({"quicksort", "mergesort", "heapsort"}) –
na_position ({"first", "last"}) –
ignore_index (bool) –
key (callable(pandas.Index) -> pandas.Index, optional) –
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler that contains result of the sort.
- Return type
Notes
Please refer to
modin.pandas.DataFrame.sort_values
for more information about parameters and output format.
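A pandas-level sketch via DataFrame.sort_values, to which this method defers (illustrative data):

```python
import pandas as pd

# Illustrative pandas-level sketch: rows are ordered lexicographically by
# the given columns; ties in "a" are broken by "b".
df = pd.DataFrame({"a": [2, 1, 2], "b": [3, 2, 1]})
out = df.sort_values(["a", "b"], ascending=False)
```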
- stack(level, dropna)¶
Stack the prescribed level(s) from columns to index.
- Parameters
level (int or label) –
dropna (bool) –
- Returns
- Return type
Notes
Please refer to
modin.pandas.DataFrame.stack
for more information about parameters and output format.
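A pandas-level sketch of stacking (illustrative data; the modin.pandas API mirrors pandas here):

```python
import pandas as pd

# Illustrative pandas-level sketch: stacking moves the column labels into
# a new innermost index level, producing a longer, narrower object.
df = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["r0", "r1"])
stacked = df.stack()  # MultiIndex of (row label, column label)
```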
- std(*args, **kwargs)¶
Execute Reduction function against passed query compiler.
- str___getitem__(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_capitalize(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_center(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_contains(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_count(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_endswith(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_find(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_findall(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_get(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_index(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_isalnum(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_isalpha(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_isdecimal(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_isdigit(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_islower(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_isnumeric(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_isspace(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_istitle(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_isupper(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_join(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_len(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_ljust(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_lower(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_lstrip(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_match(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_normalize(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_pad(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_partition(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_repeat(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_replace(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_rfind(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_rindex(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_rjust(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_rpartition(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_rsplit(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_rstrip(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_slice(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_slice_replace(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_split(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_startswith(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_strip(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_swapcase(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_title(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_translate(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_upper(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_wrap(*args, **kwargs)¶
Execute Map function against passed query compiler.
- str_zfill(*args, **kwargs)¶
Execute Map function against passed query compiler.
- sub(other, broadcast=False, *args, **kwargs)¶
Apply binary func to passed operands.
- Parameters
query_compiler (QueryCompiler) – Left operand of func.
other (QueryCompiler, list-like object or scalar) – Right operand of func.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint passed from the high-level API.
*args (args,) – Arguments that will be passed to func.
**kwargs (kwargs,) – Arguments that will be passed to func.
- Returns
Result of binary function.
- Return type
QueryCompiler
- sum(*args, **kwargs)¶
Execute MapReduce function against passed query compiler.
- sum_min_count(*args, **kwargs)¶
Execute Reduction function against passed query compiler.
- to_datetime(*args, **kwargs)¶
Convert columns of the QueryCompiler to the datetime dtype.
- Parameters
*args (iterable) –
**kwargs (dict) –
- Returns
QueryCompiler with all columns converted to datetime dtype.
- Return type
Notes
Please refer to
modin.pandas.to_datetime
for more information about parameters and output format.
- to_numeric(*args, **kwargs)¶
Execute Map function against passed query compiler.
- to_numpy(**kwargs)¶
Convert the underlying query compiler's data to a NumPy array.
- Parameters
dtype (dtype) – The dtype of the resulted array.
copy (bool) – Whether to ensure that the returned value is not a view on another array.
na_value (object) – The value to replace missing values with.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
The QueryCompiler converted to NumPy array.
- Return type
np.ndarray
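A pandas-level sketch of the dtype parameter (illustrative data; the modin.pandas API mirrors pandas here):

```python
import pandas as pd

# Illustrative pandas-level sketch: mixed integer/float columns are
# materialized into a single ndarray of the requested dtype.
df = pd.DataFrame({"a": [1, 2], "b": [3.5, 4.5]})
arr = df.to_numpy(dtype=float)
```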
- to_pandas()¶
Convert the underlying query compiler's data to
pandas.DataFrame
.
- Returns
The QueryCompiler converted to pandas.
- Return type
- transpose(*args, **kwargs)¶
Transpose this QueryCompiler.
- Parameters
copy (bool) – Whether to copy the data after transposing.
*args (iterable) – Serves the compatibility purpose. Does not affect the result.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
Transposed new QueryCompiler.
- Return type
- truediv(other, broadcast=False, *args, **kwargs)¶
Apply binary func to passed operands.
- Parameters
query_compiler (QueryCompiler) – Left operand of func.
other (QueryCompiler, list-like object or scalar) – Right operand of func.
broadcast (bool, default: False) – If other is a one-column query compiler, indicates whether it is a Series or not. Frames and Series have to be processed differently, however we can’t distinguish them at the query compiler level, so this parameter is a hint passed from the high-level API.
*args (args,) – Arguments that will be passed to func.
**kwargs (kwargs,) – Arguments that will be passed to func.
- Returns
Result of binary function.
- Return type
QueryCompiler
- unique()¶
Get unique values of self.
- Parameters
**kwargs (dict) – Serves compatibility purpose. Does not affect the result.
- Returns
New QueryCompiler with unique values.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.Series.unique
for more information about parameters and output format.
Warning
This method is supported only by one-column query compilers.
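The semantics mirror modin.pandas.Series.unique, which follows pandas; a minimal plain-pandas illustration:

```python
import pandas as pd

s = pd.Series([2, 1, 3, 3, 2])

# Unique values in order of first appearance, returned as an ndarray.
u = s.unique()
```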
- unstack(level, fill_value)¶
Pivot a level of the (necessarily hierarchical) index labels.
- Parameters
level (int or label) –
fill_value (scalar or dict) –
- Returns
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.unstack
for more information about parameters and output format.
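A plain-pandas sketch of the pivot this method compiles (illustrative only, not the query-compiler call itself):

```python
import pandas as pd

idx = pd.MultiIndex.from_product([["a", "b"], ["x", "y"]])
s = pd.Series([1, 2, 3, 4], index=idx)

# Pivot the innermost index level into columns; holes produced by
# the pivot would be filled with fill_value.
wide = s.unstack(level=-1, fill_value=0)
```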
- var(*args, **kwargs)¶
Execute Reduction function against passed query compiler.
- view(index=None, columns=None)¶
Mask QueryCompiler with passed keys.
- Parameters
index (list of ints, optional) – Positional indices of rows to grab.
columns (list of ints, optional) – Positional indices of columns to grab.
- Returns
New masked QueryCompiler.
- Return type
BaseQueryCompiler
- where(cond, other, **kwargs)¶
Update values of self using values from other at positions where cond is False.
- Parameters
cond (BaseQueryCompiler) – Boolean mask. True - keep the self value, False - replace by other value.
other (BaseQueryCompiler or pandas.Series) – Object to grab replacement values from.
axis ({0, 1}) – Axis to align frames along if axes of self, cond and other are not equal. 0 is for index, 1 is for columns.
level (int or label, optional) – Level of MultiIndex to align frames along if axes of self, cond and other are not equal. The level parameter is not currently implemented, so only None is acceptable.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
QueryCompiler with updated data.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.where
for more information about parameters and output format.
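A plain-pandas sketch of the replacement rule (True keeps the value from self, False takes it from other):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, -2, 3]})
cond = df > 0

# Keep values where cond is True; replace the rest from `other`.
result = df.where(cond, other=0)
```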
- window_mean(*args, **kwargs)¶
Execute Fold function against passed query compiler.
- window_std(*args, **kwargs)¶
Execute Fold function against passed query compiler.
- window_sum(*args, **kwargs)¶
Execute Fold function against passed query compiler.
- window_var(*args, **kwargs)¶
Execute Fold function against passed query compiler.
- write_items(row_numeric_index, col_numeric_index, broadcasted_items)¶
Update QueryCompiler elements at the specified positions by passed values.
In contrast to
setitem
, this method allows 2D assignments.
- Parameters
row_numeric_index (list of ints) – Row positions to write value.
col_numeric_index (list of ints) – Column positions to write value.
broadcasted_items (2D-array) – Values to write. Must be the same size as defined by row_numeric_index and col_numeric_index.
- Returns
New QueryCompiler with updated values.
- Return type
BaseQueryCompiler
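The public-pandas analogue of such a 2D positional assignment is iloc; a hedged sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((3, 3)))
rows, cols = [0, 2], [1, 2]

# The values block must match the shape selected by the row and
# column positions: here 2 rows x 2 columns.
values = np.array([[10.0, 20.0], [30.0, 40.0]])
df.iloc[rows, cols] = values
```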
Pandas Parsers Module Description¶
This module houses parser classes (classes that are used for data parsing on the workers)
and util functions for handling parsing results. PandasParser
is the base class for parser
classes with the pandas backend and contains methods common to all child classes. The other
module classes implement a parse
function that parses data of a specific format
based on the chunk information computed in the modin.engines.base.io
module. After
chunk data parsing is completed, the resulting DataFrame
s will be split into smaller
DataFrame
s according to the num_splits
parameter, the data type, and the number of
rows/columns in the parsed chunk; these frames and some additional metadata will then
be returned.
Note
If you are interested in the data parsing mechanism implementation details, please refer to the source code documentation.
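The splitting step can be sketched as follows (split_parse_result is a hypothetical helper, not Modin's actual function; real parsers also account for data types when choosing split points):

```python
import numpy as np
import pandas as pd

def split_parse_result(df, num_splits):
    # Row boundaries for a near-even row-wise split.
    bounds = np.linspace(0, len(df), num_splits + 1, dtype=int)
    parts = [df.iloc[start:end] for start, end in zip(bounds[:-1], bounds[1:])]
    # Return the chunks plus metadata: row count and column dtypes.
    return parts + [len(df), df.dtypes]

parsed = pd.DataFrame({"a": range(10), "b": range(10)})
result = split_parse_result(parsed, num_splits=4)
```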
High-Level Module Overview¶
This module houses submodules which are responsible for communication between the query compiler level and execution engine level for pandas backend:
Query compiler is responsible for compiling efficient queries for PandasFrame.
Parsers are responsible for parsing data on workers during IO operations.
PyArrow backend¶
PyArrow Query Compiler¶
PyarrowQueryCompiler
is responsible for compiling efficient
DataFrame algebra queries for the PyarrowOnRayFrame,
the frames which are backed by pyarrow.Table
objects.
Each PyarrowQueryCompiler
contains an instance of
PyarrowOnRayFrame
which it queries to get the result.
PyarrowQueryCompiler
implements common query compilers API
defined by the BaseQueryCompiler
. Most functionality
is inherited from PandasQueryCompiler
; in the following
section, only overridden methods are presented.
- class modin.backends.pyarrow.query_compiler.PyarrowQueryCompiler(modin_frame)¶
Bases:
modin.backends.pandas.query_compiler.PandasQueryCompiler
Query compiler for the PyArrow backend.
This class translates the common query compiler API into DataFrame Algebra queries that are supposed to be executed by
PyarrowOnRayFrame
.
- Parameters
modin_frame (PyarrowOnRayFrame) – Modin Frame to query with the compiled queries.
- property dtypes¶
Get columns dtypes.
- Returns
Series with dtypes of each column.
- Return type
pandas.Series
- query(expr, **kwargs)¶
Query columns of the QueryCompiler with a boolean expression.
- Parameters
expr (str) –
**kwargs (dict) –
- Returns
New QueryCompiler containing the rows where the boolean expression is satisfied.
- Return type
BaseQueryCompiler
Notes
Please refer to
modin.pandas.DataFrame.query
for more information about parameters and output format.
- to_numpy(**kwargs)¶
Convert the underlying query compiler's data to a NumPy array.
- Parameters
dtype (dtype) – The dtype of the resulting array.
copy (bool) – Whether to ensure that the returned value is not a view on another array.
na_value (object) – The value to replace missing values with.
**kwargs (dict) – Serves the compatibility purpose. Does not affect the result.
- Returns
The QueryCompiler converted to NumPy array.
- Return type
np.ndarray
- to_pandas()¶
Convert the underlying query compiler's data to
pandas.DataFrame
.
- Returns
The QueryCompiler converted to pandas.
- Return type
pandas.DataFrame
PyArrow Parsers Module Description¶
This module houses parser classes that are responsible for data parsing on the workers for the PyArrow backend.
Parsers for the PyArrow backend follow the interface of the pandas backend parsers:
the parser class for every file format implements a parse
method, which parses the specified part
of the file and builds PyArrow tables from the parsed data, based on the specified chunk size and number of splits.
The resulting PyArrow tables are used as the partition payload in the PyarrowOnRayFrame
.
Module houses Modin parser classes, that are used for data parsing on the workers.
- class modin.backends.pyarrow.parsers.PyarrowCSVParser¶
Class for handling CSV files on the workers using PyArrow backend.
- parse(fname, num_splits, start, end, header, **kwargs)¶
Parse CSV file into PyArrow tables.
- Parameters
fname (str) – Name of the CSV file to parse.
num_splits (int) – Number of partitions to split the resulting PyArrow table into.
start (int) – Position in the specified file to start parsing from.
end (int) – Position in the specified file to end parsing at.
header (str) – Header line that will be interpreted as the first line of the parsed CSV file.
**kwargs (kwargs) – Serves the compatibility purpose. Does not affect the result.
- Returns
List with split parse results and their metadata:
First num_split elements are PyArrow tables, representing the corresponding chunk.
Next element is the number of rows in the parsed table.
Last element is the pandas Series, containing the data-types for each column of the parsed table.
- Return type
list
In general, the PyArrow backend follows the flow of the pandas backend: the query compiler contains an instance of a Modin Frame, which is internally split into partitions. The main difference is that partitions contain PyArrow tables instead of DataFrames, as in the pandas backend. To learn more about this approach, please visit the PyArrow execution engine section.
High-Level Module Overview¶
This module houses submodules which are responsible for communication between the query compiler level and execution engine level for PyArrow backend:
Query compiler is responsible for compiling efficient queries for PyarrowOnRayFrame.
Parsers are responsible for parsing data on workers during IO operations.
Note
Currently, the only available PyArrow backend factory is PyarrowOnRay
, which works
in experimental mode only.
Modin supports several execution backends. Calling any DataFrame API function will end up in some backend-specific method. The query compiler is a bridge between Modin Dataframe and the actual execution engine.
Query compilers of all backends implement a common API, which is used by the Modin Dataframe to support dataframe queries. The role of the query compiler is to translate its API into a pairing of known user-defined functions and dataframe algebra operators. Each query compiler instance contains a frame of the selected execution engine and queries it with the compiled queries to get the result. The query compiler object is immutable, so the result of every method is a new query compiler.
The query compilers API is defined by the BaseQueryCompiler
class
and may resemble the pandas API; however, the two are not equal. The query compiler API
is significantly reduced in comparison with pandas, since many corner cases, or even
whole methods, can be handled at the API layer with the existing API.
The query compiler is the level where Modin stops distinguishing DataFrame and Series (or column) objects.
A Series is represented by a 1xN query compiler, where the Series name is the column label.
If Series is unnamed, then the label is "__reduced__"
. The Dataframe API layer
interprets a one-column query compiler as Series or DataFrame depending on the operation context.
Note
Although we’re declaring that there is no difference between DataFrame and Series at the query compiler,
you may still find methods like method_ser
and method_df
which are implemented differently because they
emulate either Series or DataFrame logic, or you may find parameters that indicate whether a one-column
query compiler represents a Series or not. All of these are hacks, and we’re working on getting rid of them.
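The one-column representation can be illustrated with plain pandas, where an unnamed Series defaults to the column label 0 rather than Modin's "__reduced__":

```python
import pandas as pd

# A Series travels as a one-column frame; the Series name becomes
# the column label, with a fallback label when the Series is unnamed.
named = pd.Series([1, 2], name="x").to_frame()
unnamed = pd.Series([1, 2]).to_frame()
```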
High-level module overview¶
This module houses submodules of all of the stable query compilers:
Base module contains an abstract query compiler class which defines common API.
Pandas module contains query compiler and text parsers for pandas backend.
cuDF module contains query compiler and text parsers for cuDF backend.
Pyarrow module contains query compiler and text parsers for Pyarrow backend.
You can find more in the experimental section.
PandasOnPython Frame Objects¶
This page describes the implementation of Base Frame Objects
specific to the PandasOnPython
backend. Since the Python engine doesn’t allow parallel computation,
operations on partitions are performed sequentially. The absence of parallelization means there is no
performance speed-up, so PandasOnPython
is used for testing purposes only.
PandasOnPythonFrame¶
The class is a specific implementation of PandasFrame
for the PandasOnPython
backend. It serves as an intermediate level between
PandasQueryCompiler
and
PandasOnPythonFramePartitionManager
.
Public API¶
- class modin.engines.python.pandas_on_python.frame.data.PandasOnPythonFrame(partitions, index, columns, row_lengths=None, column_widths=None, dtypes=None)¶
Class for dataframes with pandas backend and Python engine.
PandasOnPythonFrame
doesn’t implement any specific interfaces; all functionality is inherited from the PandasFrame
class.
- Parameters
partitions (np.ndarray) – A 2D NumPy array of partitions.
index (sequence) – The index for the dataframe. Converted to a
pandas.Index
.
columns (sequence) – The columns object for the dataframe. Converted to a
pandas.Index
.
row_lengths (list, optional) – The length of each partition in the rows. The “height” of each of the block partitions. Is computed if not provided.
column_widths (list, optional) – The width of each partition in the columns. The “width” of each of the block partitions. Is computed if not provided.
dtypes (pandas.Series, optional) – The data types for the dataframe columns.
PandasOnPythonFramePartition¶
The class is a specific implementation of PandasFramePartition
,
providing the API to perform operations on a block partition using Python as the execution engine.
In addition to wrapping a pandas.DataFrame
, the class also holds the following metadata:
length
- length of the wrapped pandas.DataFrame
width
- width of the wrapped pandas.DataFrame
An operation on a block partition can be performed in two modes:
immediately via
apply()
- in this case the accumulated call queue and the new function are executed immediately.
lazily via
add_to_apply_calls()
- in this case the function is added to the call queue and no computations are done at the moment.
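The two modes can be sketched with a toy call-queue class (a hypothetical illustration, not Modin's implementation):

```python
class Partition:
    """Toy partition with an eager and a lazy execution mode."""

    def __init__(self, data, call_queue=None):
        self.data = data
        self.call_queue = call_queue or []

    def add_to_apply_calls(self, func, *args, **kwargs):
        # Lazy: record the call; nothing is computed yet.
        return Partition(self.data, self.call_queue + [(func, args, kwargs)])

    def apply(self, func, *args, **kwargs):
        # Eager: drain the accumulated queue, then run func immediately.
        result = self.data
        for f, a, kw in self.call_queue:
            result = f(result, *a, **kw)
        return Partition(func(result, *args, **kwargs))

p = Partition(10)
lazy = p.add_to_apply_calls(lambda x: x + 1)  # data still untouched
done = lazy.apply(lambda x: x * 2)            # (10 + 1) * 2
```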
Public API¶
- class modin.engines.python.pandas_on_python.frame.partition.PandasOnPythonFramePartition(data, length=None, width=None, call_queue=None)¶
Partition class with interface for pandas backend and Python engine.
Class holds the data and metadata for a single partition and implements methods of parent abstract class
PandasFramePartition
.- Parameters
data (pandas.DataFrame) –
pandas.DataFrame
that should be wrapped with this class.
length (int, optional) – Length of data (number of rows in the input dataframe).
width (int, optional) – Width of data (number of columns in the input dataframe).
call_queue (list, optional) – Call queue of the partition (list with entities that should be called before partition materialization).
Notes
Objects of this class are treated as immutable by partition manager subclasses. There is no logic for updating in-place.
- add_to_apply_calls(func, *args, **kwargs)¶
Add a function to the call queue.
- Parameters
func (callable) – Function to be added to the call queue.
*args (iterable) – Additional positional arguments to be passed in func.
**kwargs (dict) – Additional keyword arguments to be passed in func.
- Returns
New
PandasOnPythonFramePartition
object with extended call queue.
- Return type
PandasOnPythonFramePartition
- apply(func, *args, **kwargs)¶
Apply a function to the object wrapped by this partition.
- Parameters
func (callable) – Function to apply.
*args (iterable) – Additional positional arguments to be passed in func.
**kwargs (dict) – Additional keyword arguments to be passed in func.
- Returns
New
PandasOnPythonFramePartition
object.
- Return type
PandasOnPythonFramePartition
- drain_call_queue()¶
Execute all operations stored in the call queue on the object wrapped by this partition.
- classmethod empty()¶
Create a new partition that wraps an empty pandas DataFrame.
- Returns
New
PandasOnPythonFramePartition
object wrapping an empty pandas DataFrame.
- Return type
PandasOnPythonFramePartition
- get()¶
Flush the call_queue and return a copy of the data.
- Returns
Copy of DataFrame that was wrapped by this partition.
- Return type
pandas.DataFrame
Notes
Since this object is a simple wrapper, just return a copy of the data.
- length()¶
Get the length of the object wrapped by this partition.
- Returns
The length of the object.
- Return type
int
- classmethod preprocess_func(func)¶
Preprocess a function before an
apply
call.
- Parameters
func (callable) – Function to preprocess.
- Returns
An object that can be accepted by
apply
.
- Return type
callable
Notes
No special preprocessing action is required, so unmodified func will be returned.
- classmethod put(obj)¶
Create partition containing obj.
- Parameters
obj (pandas.DataFrame) – DataFrame to be put into the new partition.
- Returns
New
PandasOnPythonFramePartition
object.
- Return type
PandasOnPythonFramePartition
- to_numpy(**kwargs)¶
Return NumPy array representation of
pandas.DataFrame
stored in this partition.
- Parameters
**kwargs (dict) – Keyword arguments to pass into pandas.DataFrame.to_numpy function.
- Returns
- Return type
np.ndarray
- to_pandas()¶
Return a copy of the
pandas.DataFrame
stored in this partition.
- Returns
- Return type
pandas.DataFrame
Notes
Equivalent to
get
method for this class.
- wait()¶
Wait for completion of computations on the object wrapped by the partition.
Internally this is done by flushing the call queue.
- width()¶
Get the width of the object wrapped by the partition.
- Returns
The width of the object.
- Return type
int
PandasOnPythonFrameAxisPartition¶
The class is a specific implementation of PandasFrameAxisPartition
,
providing the API to perform operations on an axis partition using Python
as the execution engine. The axis partition is made up of a list of block
partitions that are stored in this class.
Public API¶
- class modin.engines.python.pandas_on_python.frame.axis_partition.PandasOnPythonFrameAxisPartition(list_of_blocks)¶
Class defines axis partition interface with pandas backend and Python engine.
Inherits functionality from
PandasFrameAxisPartition
class.- Parameters
list_of_blocks (list) – List with partition objects to create common axis partition from.
PandasOnPythonFrameColumnPartition¶
Public API¶
- class modin.engines.python.pandas_on_python.frame.axis_partition.PandasOnPythonFrameColumnPartition(list_of_blocks)¶
The column partition implementation for pandas backend and Python engine.
All of the implementation for this class is in the
PandasOnPythonFrameAxisPartition
parent class, and this class defines the axis to perform the computation over.
- Parameters
list_of_blocks (list) – List with partition objects to create common axis partition from.
PandasOnPythonFrameRowPartition¶
Public API¶
- class modin.engines.python.pandas_on_python.frame.axis_partition.PandasOnPythonFrameRowPartition(list_of_blocks)¶
The row partition implementation for pandas backend and Python engine.
All of the implementation for this class is in the
PandasOnPythonFrameAxisPartition
parent class, and this class defines the axis to perform the computation over.
- Parameters
list_of_blocks (list) – List with partition objects to create common axis partition from.
PythonFrameManager¶
The class is a specific implementation of PandasFramePartitionManager
using Python as the execution engine. This class is responsible for partition manipulation and applying
a function to block/row/column partitions.
Public API¶
- class modin.engines.python.pandas_on_python.frame.partition_manager.PandasOnPythonFramePartitionManager¶
Class for managing partitions with pandas backend and Python engine.
Inherits all functionality from
PandasFramePartitionManager
base class.
DataFrame Partitioning¶
The Modin DataFrame architecture follows in the footsteps of modern architectures for database and high performance matrix systems. We chose a partitioning schema that partitions along both columns and rows because it gives Modin flexibility and scalability in both the number of columns and the number of rows supported. The following figure illustrates this concept.

Currently, each partition’s memory format is a pandas DataFrame. In the future, we will support additional in-memory formats for the backend, namely Arrow tables.
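The row-and-column scheme can be sketched with NumPy (an illustration of the layout only, not Modin's partitioning code):

```python
import numpy as np

data = np.arange(36).reshape(6, 6)

# Split along both axes into a 2x2 grid of block partitions.
row_chunks = np.array_split(data, 2, axis=0)
grid = [np.array_split(chunk, 2, axis=1) for chunk in row_chunks]
```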
Index¶
We currently use the pandas.Index
object for both indexing columns and rows. In the
future, we will implement a distributed, pandas-compatible Index object in order to remove
this scaling limitation from the system. It does not start to become a problem until you
are operating on tens of billions of columns or rows, so most workloads will
not be affected by this scalability limit. Important note: If you are using the
default index (pandas.RangeIndex
) there is a fixed memory overhead (~200 bytes) and
there will be no scalability issues with the index.
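The fixed overhead can be illustrated with Python's built-in range, which, like pandas.RangeIndex, stores only start, stop, and step, so its size does not grow with length:

```python
import sys

# Both objects occupy the same fixed number of bytes, regardless
# of how many elements the range describes.
small = sys.getsizeof(range(10))
huge = sys.getsizeof(range(10**12))
```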
API¶
The API is the outermost layer that faces users. The majority of our current effort is spent implementing the components of the pandas API. We have implemented a toy example for a SQLite API as a proof of concept, but this isn’t ready for usage/testing. There are also plans to expose the Modin DataFrame API as a reduced API set that encompasses the entire pandas/dataframe API. See experimental features for more information.
BasePandasDataset¶
The class implements functionality that is common to Modin’s pandas API for both DataFrame
and Series
classes.
Public API¶
- class modin.pandas.base.BasePandasDataset
Implement most of the common code that exists in DataFrame/Series.
Since both objects share the same underlying representation, and the algorithms are the same, we use this object to define the general behavior of those objects and then use those objects to define the output type.
Notes
See pandas API documentation for pandas.DataFrame for more.
- abs()
Return a Series/DataFrame with absolute numeric value of each element.
This function only applies to elements that are all numeric.
- Returns
Series/DataFrame containing the absolute value of each element.
- Return type
abs
See also
numpy.absolute
Calculate the absolute value element-wise.
Notes
See pandas API documentation for pandas.DataFrame.abs for more. For
complex
inputs, 1.2 + 1j
, the absolute value is \(\sqrt{ a^2 + b^2 }\).
Examples
Absolute numeric values in a Series.
>>> s = pd.Series([-1.10, 2, -3.33, 4])
>>> s.abs()
0    1.10
1    2.00
2    3.33
3    4.00
dtype: float64
Absolute numeric values in a Series with complex numbers.
>>> s = pd.Series([1.2 + 1j])
>>> s.abs()
0    1.56205
dtype: float64
Absolute numeric values in a Series with a Timedelta element.
>>> s = pd.Series([pd.Timedelta('1 days')])
>>> s.abs()
0   1 days
dtype: timedelta64[ns]
Select rows with data closest to certain value using argsort (from StackOverflow).
>>> df = pd.DataFrame({
...     'a': [4, 5, 6, 7],
...     'b': [10, 20, 30, 40],
...     'c': [100, 50, -30, -50]
... })
>>> df
   a   b    c
0  4  10  100
1  5  20   50
2  6  30  -30
3  7  40  -50
>>> df.loc[(df.c - 43).abs().argsort()]
   a   b    c
1  5  20   50
0  4  10  100
2  6  30  -30
3  7  40  -50
- add(other, axis='columns', level=None, fill_value=None)
Get Addition of dataframe and other, element-wise (binary operator add).
Equivalent to
dataframe + other
, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, radd.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns
Result of the arithmetic operation.
- Return type
See also
DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.
Notes
See pandas API documentation for pandas.DataFrame.add for more. Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360
Add a scalar with the operator version, which returns the same results.
>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
Divide by constant with reverse version.
>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
- agg(func=None, axis=0, *args, **kwargs)
Aggregate using one or more operations over the specified axis.
- Parameters
func (function, str, list or dict) –
Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply.
Accepted combinations are:
function
string function name
list of functions and/or function names, e.g.
[np.sum, 'mean']
dict of axis labels -> functions, function names or list of such.
axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’: apply function to each column. If 1 or ‘columns’: apply function to each row.
*args – Positional arguments to pass to func.
**kwargs – Keyword arguments to pass to func.
- Returns
scalar, Series or DataFrame – The return can be:
scalar : when Series.agg is called with single function
Series : when DataFrame.agg is called with a single function
DataFrame : when DataFrame.agg is called with several functions
Return scalar, Series or DataFrame.
The aggregation operations are always performed over an axis, either the
index (default) or the column axis. This behavior is different from
numpy aggregation functions (mean, median, prod, sum, std,
var), where the default is to compute the aggregation of the flattened
array, e.g.,
numpy.mean(arr_2d)
as opposed to numpy.mean(arr_2d, axis=0)
.
agg is an alias for aggregate. Use the alias.
See also
DataFrame.apply
Perform any type of operations.
DataFrame.transform
Perform transformation type operations.
core.groupby.GroupBy
Perform operations over groups.
core.resample.Resampler
Perform operations over resampled bins.
core.window.Rolling
Perform operations over rolling window.
core.window.Expanding
Perform operations over expanding window.
core.window.ExponentialMovingWindow
Perform operation over exponential weighted window.
Notes
See pandas API documentation for pandas.DataFrame.aggregate for more. agg is an alias for aggregate. Use the alias.
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See gotchas.udf-mutation for more details.
A passed user-defined-function will be passed a Series for evaluation.
Examples
>>> df = pd.DataFrame([[1, 2, 3],
...                    [4, 5, 6],
...                    [7, 8, 9],
...                    [np.nan, np.nan, np.nan]],
...                   columns=['A', 'B', 'C'])
Aggregate these functions over the rows.
>>> df.agg(['sum', 'min'])
        A     B     C
sum  12.0  15.0  18.0
min   1.0   2.0   3.0
Different aggregations per column.
>>> df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']})
        A    B
sum  12.0  NaN
min   1.0  2.0
max   NaN  8.0
Aggregate different functions over the columns and rename the index of the resulting DataFrame.
>>> df.agg(x=('A', max), y=('B', 'min'), z=('C', np.mean))
     A    B    C
x  7.0  NaN  NaN
y  NaN  2.0  NaN
z  NaN  NaN  6.0
Aggregate over the columns.
>>> df.agg("mean", axis="columns")
0    2.0
1    5.0
2    8.0
3    NaN
dtype: float64
- aggregate(func=None, axis=0, *args, **kwargs)
Aggregate using one or more operations over the specified axis.
- Parameters
func (function, str, list or dict) –
Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply.
Accepted combinations are:
function
string function name
list of functions and/or function names, e.g.
[np.sum, 'mean']
dict of axis labels -> functions, function names or list of such.
axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’: apply function to each column. If 1 or ‘columns’: apply function to each row.
*args – Positional arguments to pass to func.
**kwargs – Keyword arguments to pass to func.
- Returns
scalar, Series or DataFrame – The return can be:
scalar : when Series.agg is called with single function
Series : when DataFrame.agg is called with a single function
DataFrame : when DataFrame.agg is called with several functions
Return scalar, Series or DataFrame.
The aggregation operations are always performed over an axis, either the
index (default) or the column axis. This behavior is different from
numpy aggregation functions (mean, median, prod, sum, std,
var), where the default is to compute the aggregation of the flattened
array, e.g.,
numpy.mean(arr_2d)
as opposed to numpy.mean(arr_2d, axis=0)
.
agg is an alias for aggregate. Use the alias.
See also
DataFrame.apply
Perform any type of operations.
DataFrame.transform
Perform transformation type operations.
core.groupby.GroupBy
Perform operations over groups.
core.resample.Resampler
Perform operations over resampled bins.
core.window.Rolling
Perform operations over rolling window.
core.window.Expanding
Perform operations over expanding window.
core.window.ExponentialMovingWindow
Perform operation over exponential weighted window.
Notes
See pandas API documentation for pandas.DataFrame.aggregate for more. agg is an alias for aggregate. Use the alias.
Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See gotchas.udf-mutation for more details.
A passed user-defined-function will be passed a Series for evaluation.
Examples
>>> df = pd.DataFrame([[1, 2, 3],
...                    [4, 5, 6],
...                    [7, 8, 9],
...                    [np.nan, np.nan, np.nan]],
...                   columns=['A', 'B', 'C'])
Aggregate these functions over the rows.
>>> df.agg(['sum', 'min'])
        A     B     C
sum  12.0  15.0  18.0
min   1.0   2.0   3.0
Different aggregations per column.
>>> df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']})
        A    B
sum  12.0  NaN
min   1.0  2.0
max   NaN  8.0
Aggregate different functions over the columns and rename the index of the resulting DataFrame.
>>> df.agg(x=('A', max), y=('B', 'min'), z=('C', np.mean))
     A    B    C
x  7.0  NaN  NaN
y  NaN  2.0  NaN
z  NaN  NaN  6.0
Aggregate over the columns.
>>> df.agg("mean", axis="columns")
0    2.0
1    5.0
2    8.0
3    NaN
dtype: float64
- align(other, join='outer', axis=None, level=None, copy=True, fill_value=None, method=None, limit=None, fill_axis=0, broadcast_axis=None)
Align two objects on their axes with the specified join method.
Join method is specified for each axis Index.
- Parameters
join ({'outer', 'inner', 'left', 'right'}, default 'outer') –
axis (allowed axis of the other object, default None) – Align on index (0), columns (1), or both (None).
level (int or level name, default None) – Broadcast across a level, matching Index values on the passed MultiIndex level.
copy (bool, default True) – Always returns new objects. If copy=False and no reindexing is required then original objects are returned.
fill_value (scalar, default np.NaN) – Value to use for missing values. Defaults to NaN, but can be any “compatible” value.
method ({'backfill', 'bfill', 'pad', 'ffill', None}, default None) –
Method to use for filling holes in reindexed Series:
pad / ffill: propagate last valid observation forward to next valid.
backfill / bfill: use NEXT valid observation to fill gap.
limit (int, default None) – If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.
fill_axis ({0 or 'index', 1 or 'columns'}, default 0) – Filling axis, method and limit.
broadcast_axis ({0 or 'index', 1 or 'columns'}, default None) – Broadcast values along this axis, if aligning two objects of different dimensions.
- Returns
(left, right) – Aligned objects.
- Return type
(DataFrame, type of other)
Notes
See pandas API documentation for pandas.DataFrame.align for more.
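The align entry above carries no worked example, so here is a minimal sketch of the default outer-join behavior. It uses a plain pandas import so it runs standalone; the same call is unchanged under import modin.pandas as pd, and the frame contents are made up for illustration.

```python
import pandas as pd  # with Modin: import modin.pandas as pd

# Two frames that only partially overlap in rows and columns.
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=[0, 1])
df2 = pd.DataFrame({"B": [5, 6], "C": [7, 8]}, index=[1, 2])

# join='outer' (the default) reindexes both frames to the union of
# row and column labels; positions missing from either original
# frame are filled with fill_value (NaN by default).
left, right = df1.align(df2, join="outer")
print(left.shape, right.shape)  # (3, 3) (3, 3)
```

Both results share the union index [0, 1, 2] and columns ['A', 'B', 'C']; for instance, column A of the right result is all NaN because df2 never had an A column.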
- all(axis=0, bool_only=None, skipna=True, level=None, **kwargs)
Return whether all elements are True, potentially over an axis.
Returns True unless there is at least one element within a series or along a Dataframe axis that is False or equivalent (e.g. zero or empty).
- Parameters
axis ({0 or 'index', 1 or 'columns', None}, default 0) –
Indicate which axis or axes should be reduced.
0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.
1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.
None : reduce all axes, return a scalar.
bool_only (bool, default None) – Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.
skipna (bool, default True) – Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be True, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.
level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
**kwargs (any, default None) – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns
If level is specified, a DataFrame is returned; otherwise, a Series is returned.
- Return type
Series or DataFrame
See also
Series.all
Return True if all elements are True.
DataFrame.any
Return True if one (or more) elements are True.
Examples
Series
>>> pd.Series([True, True]).all()
True
>>> pd.Series([True, False]).all()
False
>>> pd.Series([], dtype="float64").all()
True
>>> pd.Series([np.nan]).all()
True
>>> pd.Series([np.nan]).all(skipna=False)
True
DataFrames
Create a dataframe from a dictionary.
>>> df = pd.DataFrame({'col1': [True, True], 'col2': [True, False]})
>>> df
   col1   col2
0  True   True
1  True  False
Default behaviour checks if column-wise values all return True.
>>> df.all()
col1     True
col2    False
dtype: bool
Specify axis='columns' to check if row-wise values all return True.
>>> df.all(axis='columns')
0     True
1    False
dtype: bool
Or axis=None for whether every value is True.
>>> df.all(axis=None)
False
Notes
See pandas API documentation for pandas.DataFrame.all for more.
- any(axis=0, bool_only=None, skipna=True, level=None, **kwargs)
Return whether any element is True, potentially over an axis.
Returns False unless there is at least one element within a series or along a Dataframe axis that is True or equivalent (e.g. non-zero or non-empty).
- Parameters
axis ({0 or 'index', 1 or 'columns', None}, default 0) –
Indicate which axis or axes should be reduced.
0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.
1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.
None : reduce all axes, return a scalar.
bool_only (bool, default None) – Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.
skipna (bool, default True) – Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.
level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
**kwargs (any, default None) – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns
If level is specified, a DataFrame is returned; otherwise, a Series is returned.
- Return type
Series or DataFrame
See also
numpy.any
Numpy version of this method.
Series.any
Return whether any element is True.
Series.all
Return whether all elements are True.
DataFrame.any
Return whether any element is True over requested axis.
DataFrame.all
Return whether all elements are True over requested axis.
Examples
Series
For Series input, the output is a scalar indicating whether any element is True.
>>> pd.Series([False, False]).any()
False
>>> pd.Series([True, False]).any()
True
>>> pd.Series([], dtype="float64").any()
False
>>> pd.Series([np.nan]).any()
False
>>> pd.Series([np.nan]).any(skipna=False)
True
DataFrame
Whether each column contains at least one True element (the default).
>>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]})
>>> df
   A  B  C
0  1  0  0
1  2  2  0
>>> df.any()
A     True
B     True
C    False
dtype: bool
Aggregating over the columns.
>>> df = pd.DataFrame({"A": [True, False], "B": [1, 2]})
>>> df
       A  B
0   True  1
1  False  2
>>> df.any(axis='columns')
0    True
1    True
dtype: bool
>>> df = pd.DataFrame({"A": [True, False], "B": [1, 0]})
>>> df
       A  B
0   True  1
1  False  0
>>> df.any(axis='columns')
0     True
1    False
dtype: bool
Aggregating over the entire DataFrame with axis=None.
>>> df.any(axis=None)
True
any for an empty DataFrame is an empty Series.
>>> pd.DataFrame([]).any()
Series([], dtype: bool)
Notes
See pandas API documentation for pandas.DataFrame.any for more.
- apply(func, axis=0, broadcast=None, raw=False, reduce=None, result_type=None, convert_dtype=True, args=(), **kwds)
Apply a function along an axis of the DataFrame.
Objects passed to the function are Series objects whose index is either the DataFrame’s index (axis=0) or the DataFrame’s columns (axis=1). By default (result_type=None), the final return type is inferred from the return type of the applied function. Otherwise, it depends on the result_type argument.
- Parameters
func (function) – Function to apply to each column or row.
axis ({0 or 'index', 1 or 'columns'}, default 0) –
Axis along which the function is applied:
0 or ‘index’: apply function to each column.
1 or ‘columns’: apply function to each row.
raw (bool, default False) –
Determines if row or column is passed as a Series or ndarray object:
False : passes each row or column as a Series to the function.
True : the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function this will achieve much better performance.
result_type ({'expand', 'reduce', 'broadcast', None}, default None) –
These only act when axis=1 (columns):
’expand’ : list-like results will be turned into columns.
’reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’.
’broadcast’ : results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.
The default behaviour (None) depends on the return value of the applied function: list-like results will be returned as a Series of those. However if the apply function returns a Series these are expanded to columns.
args (tuple) – Positional arguments to pass to func in addition to the array/series.
**kwargs – Additional keyword arguments to pass as keywords arguments to func.
- Returns
Result of applying func along the given axis of the DataFrame.
- Return type
Series or DataFrame
See also
DataFrame.applymap
For elementwise operations.
DataFrame.aggregate
Only perform aggregating type operations.
DataFrame.transform
Only perform transforming type operations.
Notes
See pandas API documentation for pandas.DataFrame.apply for more. Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See gotchas.udf-mutation for more details.
Examples
>>> df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
>>> df
   A  B
0  4  9
1  4  9
2  4  9
Using a numpy universal function (in this case the same as np.sqrt(df)):
>>> df.apply(np.sqrt)
     A    B
0  2.0  3.0
1  2.0  3.0
2  2.0  3.0
Using a reducing function on either axis
>>> df.apply(np.sum, axis=0)
A    12
B    27
dtype: int64
>>> df.apply(np.sum, axis=1)
0    13
1    13
2    13
dtype: int64
Returning a list-like will result in a Series
>>> df.apply(lambda x: [1, 2], axis=1)
0    [1, 2]
1    [1, 2]
2    [1, 2]
dtype: object
Passing result_type='expand' will expand list-like results to columns of a Dataframe
>>> df.apply(lambda x: [1, 2], axis=1, result_type='expand')
   0  1
0  1  2
1  1  2
2  1  2
Returning a Series inside the function is similar to passing result_type='expand'. The resulting column names will be the Series index.
>>> df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)
   foo  bar
0    1    2
1    1    2
2    1    2
Passing result_type='broadcast' will ensure the same shape result, whether list-like or scalar is returned by the function, and broadcast it along the axis. The resulting column names will be the originals.
>>> df.apply(lambda x: [1, 2], axis=1, result_type='broadcast')
   A  B
0  1  2
1  1  2
2  1  2
- asfreq(freq, method=None, how=None, normalize=False, fill_value=None)
Convert time series to specified frequency.
Returns the original data conformed to a new index with the specified frequency.
If the index of this DataFrame is a PeriodIndex, the new index is the result of transforming the original index with PeriodIndex.asfreq (so the original index will map one-to-one to the new index).
Otherwise, the new index will be equivalent to pd.date_range(start, end, freq=freq) where start and end are, respectively, the first and last entries in the original index (see pandas.date_range()). The values corresponding to any timesteps in the new index which were not present in the original index will be null (NaN), unless a method for filling such unknowns is provided (see the method parameter below).
The resample() method is more appropriate if an operation on each group of timesteps (such as an aggregate) is necessary to represent the data at the new frequency.
- Parameters
freq (DateOffset or str) – Frequency DateOffset or string.
method ({'backfill'/'bfill', 'pad'/'ffill'}, default None) –
Method to use for filling holes in reindexed Series (note this does not fill NaNs that already were present):
’pad’ / ‘ffill’: propagate last valid observation forward to next valid
’backfill’ / ‘bfill’: use NEXT valid observation to fill.
how ({'start', 'end'}, default end) – For PeriodIndex only (see PeriodIndex.asfreq).
normalize (bool, default False) – Whether to reset output index to midnight.
fill_value (scalar, optional) – Value to use for missing values, applied during upsampling (note this does not fill NaNs that already were present).
- Returns
DataFrame object reindexed to the specified frequency.
- Return type
See also
reindex
Conform DataFrame to new index with optional filling logic.
Notes
See pandas API documentation for pandas.DataFrame.asfreq for more. To learn more about the frequency strings, please see this link.
Examples
Start by creating a series with 4 one minute timestamps.
>>> index = pd.date_range('1/1/2000', periods=4, freq='T')
>>> series = pd.Series([0.0, None, 2.0, 3.0], index=index)
>>> df = pd.DataFrame({'s': series})
>>> df
                       s
2000-01-01 00:00:00  0.0
2000-01-01 00:01:00  NaN
2000-01-01 00:02:00  2.0
2000-01-01 00:03:00  3.0
Upsample the series into 30 second bins.
>>> df.asfreq(freq='30S')
                       s
2000-01-01 00:00:00  0.0
2000-01-01 00:00:30  NaN
2000-01-01 00:01:00  NaN
2000-01-01 00:01:30  NaN
2000-01-01 00:02:00  2.0
2000-01-01 00:02:30  NaN
2000-01-01 00:03:00  3.0
Upsample again, providing a fill value.
>>> df.asfreq(freq='30S', fill_value=9.0)
                       s
2000-01-01 00:00:00  0.0
2000-01-01 00:00:30  9.0
2000-01-01 00:01:00  NaN
2000-01-01 00:01:30  9.0
2000-01-01 00:02:00  2.0
2000-01-01 00:02:30  9.0
2000-01-01 00:03:00  3.0
Upsample again, providing a method.
>>> df.asfreq(freq='30S', method='bfill')
                       s
2000-01-01 00:00:00  0.0
2000-01-01 00:00:30  NaN
2000-01-01 00:01:00  NaN
2000-01-01 00:01:30  2.0
2000-01-01 00:02:00  2.0
2000-01-01 00:02:30  3.0
2000-01-01 00:03:00  3.0
- asof(where, subset=None)
Return the last row(s) without any NaNs before where.
The last row (for each element in where, if list) without any NaN is taken. In case of a DataFrame, the last row without NaN considering only the subset of columns (if not None).
If there is no good value, NaN is returned for a Series or a Series of NaN values for a DataFrame.
- Parameters
where (date or array-like of dates) – Date(s) before which the last row(s) are returned.
subset (str or array-like of str, default None) – For DataFrame, if not None, only use these columns to check for NaNs.
- Returns
The return can be:
scalar : when self is a Series and where is a scalar
Series: when self is a Series and where is an array-like, or when self is a DataFrame and where is a scalar
DataFrame : when self is a DataFrame and where is an array-like
Return scalar, Series, or DataFrame.
- Return type
See also
merge_asof
Perform an asof merge. Similar to left join.
Notes
See pandas API documentation for pandas.DataFrame.asof for more. Dates are assumed to be sorted. Raises if this is not the case.
Examples
A Series and a scalar where.
>>> s = pd.Series([1, 2, np.nan, 4], index=[10, 20, 30, 40])
>>> s
10    1.0
20    2.0
30    NaN
40    4.0
dtype: float64
>>> s.asof(20)
2.0
For a sequence where, a Series is returned. The first value is NaN, because the first element of where is before the first index value.
>>> s.asof([5, 20])
5     NaN
20    2.0
dtype: float64
Missing values are not considered. The following is 2.0, not NaN, even though NaN is at the index location for 30.
>>> s.asof(30)
2.0
Take all columns into consideration
>>> df = pd.DataFrame({'a': [10, 20, 30, 40, 50],
...                    'b': [None, None, None, None, 500]},
...                   index=pd.DatetimeIndex(['2018-02-27 09:01:00',
...                                           '2018-02-27 09:02:00',
...                                           '2018-02-27 09:03:00',
...                                           '2018-02-27 09:04:00',
...                                           '2018-02-27 09:05:00']))
>>> df.asof(pd.DatetimeIndex(['2018-02-27 09:03:30',
...                           '2018-02-27 09:04:30']))
                      a   b
2018-02-27 09:03:30 NaN NaN
2018-02-27 09:04:30 NaN NaN
Take a single column into consideration
>>> df.asof(pd.DatetimeIndex(['2018-02-27 09:03:30',
...                           '2018-02-27 09:04:30']),
...         subset=['a'])
                        a   b
2018-02-27 09:03:30  30.0 NaN
2018-02-27 09:04:30  40.0 NaN
- astype(dtype, copy=True, errors='raise')
Cast a pandas object to a specified dtype dtype.
- Parameters
dtype (data type, or dict of column name -> data type) – Use a numpy.dtype or Python type to cast entire pandas object to the same type. Alternatively, use {col: dtype, …}, where col is a column label and dtype is a numpy.dtype or Python type to cast one or more of the DataFrame’s columns to column-specific types.
copy (bool, default True) – Return a copy when copy=True (be very careful setting copy=False as changes to values then may propagate to other pandas objects).
errors ({'raise', 'ignore'}, default 'raise') –
Control raising of exceptions on invalid data for provided dtype.
raise : allow exceptions to be raised.
ignore : suppress exceptions. On error return original object.
- Returns
casted
- Return type
same type as caller
See also
to_datetime
Convert argument to datetime.
to_timedelta
Convert argument to timedelta.
to_numeric
Convert argument to a numeric type.
numpy.ndarray.astype
Cast a numpy array to a specified type.
Notes
See pandas API documentation for pandas.DataFrame.astype for more.
Deprecated since version 1.3.0: Using astype to convert from timezone-naive dtype to timezone-aware dtype is deprecated and will raise in a future version. Use Series.dt.tz_localize() instead.
Examples
Create a DataFrame:
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df.dtypes
col1    int64
col2    int64
dtype: object
Cast all columns to int32:
>>> df.astype('int32').dtypes
col1    int32
col2    int32
dtype: object
Cast col1 to int32 using a dictionary:
>>> df.astype({'col1': 'int32'}).dtypes
col1    int32
col2    int64
dtype: object
Create a series:
>>> ser = pd.Series([1, 2], dtype='int32')
>>> ser
0    1
1    2
dtype: int32
>>> ser.astype('int64')
0    1
1    2
dtype: int64
Convert to categorical type:
>>> ser.astype('category')
0    1
1    2
dtype: category
Categories (2, int64): [1, 2]
Convert to ordered categorical type with custom ordering:
>>> from pandas.api.types import CategoricalDtype
>>> cat_dtype = CategoricalDtype(
...     categories=[2, 1], ordered=True)
>>> ser.astype(cat_dtype)
0    1
1    2
dtype: category
Categories (2, int64): [2 < 1]
Note that using copy=False and changing data on a new pandas object may propagate changes:
>>> s1 = pd.Series([1, 2])
>>> s2 = s1.astype('int64', copy=False)
>>> s2[0] = 10
>>> s1  # note that s1[0] has changed too
0    10
1     2
dtype: int64
Create a series of dates:
>>> ser_date = pd.Series(pd.date_range('20200101', periods=3))
>>> ser_date
0   2020-01-01
1   2020-01-02
2   2020-01-03
dtype: datetime64[ns]
- property at
Access a single value for a row/column label pair.
Similar to loc, in that both provide label-based lookups. Use at if you only need to get or set a single value in a DataFrame or Series.
- Raises
KeyError – If ‘label’ does not exist in DataFrame.
See also
DataFrame.iat
Access a single value for a row/column pair by integer position.
DataFrame.loc
Access a group of rows and columns by label(s).
Series.at
Access a single value using a label.
Examples
>>> df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
...                   index=[4, 5, 6], columns=['A', 'B', 'C'])
>>> df
    A   B   C
4   0   2   3
5   0   4   1
6  10  20  30
Get value at specified row/column pair
>>> df.at[4, 'B']
2
Set value at specified row/column pair
>>> df.at[4, 'B'] = 10
>>> df.at[4, 'B']
10
Get value within a Series
>>> df.loc[5].at['B']
4
Notes
See pandas API documentation for pandas.DataFrame.at for more.
- at_time(time, asof=False, axis=None)
Select values at particular time of day (e.g., 9:30AM).
- Parameters
time (datetime.time or str) –
axis ({0 or 'index', 1 or 'columns'}, default 0) –
- Returns
- Return type
- Raises
TypeError – If the index is not a DatetimeIndex.
See also
between_time
Select values between particular times of the day.
first
Select initial periods of time series based on a date offset.
last
Select final periods of time series based on a date offset.
DatetimeIndex.indexer_at_time
Get just the index locations for values at particular time of the day.
Examples
>>> i = pd.date_range('2018-04-09', periods=4, freq='12H')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts
                     A
2018-04-09 00:00:00  1
2018-04-09 12:00:00  2
2018-04-10 00:00:00  3
2018-04-10 12:00:00  4
>>> ts.at_time('12:00')
                     A
2018-04-09 12:00:00  2
2018-04-10 12:00:00  4
Notes
See pandas API documentation for pandas.DataFrame.at_time for more.
- backfill(axis=None, inplace=False, limit=None, downcast=None)
Synonym for DataFrame.fillna() with method='bfill'.
- Returns
Object with missing values filled or None if inplace=True.
- Return type
Series/DataFrame or None
Notes
See pandas API documentation for pandas.DataFrame.backfill for more.
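The fill direction of backfill/bfill can be sketched as follows; a plain pandas import is used so the snippet runs standalone, and the call is identical under import modin.pandas as pd.

```python
import numpy as np
import pandas as pd  # with Modin: import modin.pandas as pd

s = pd.Series([np.nan, 2.0, np.nan, 4.0])

# Each NaN is replaced by the next valid observation after it,
# i.e. the same result as s.fillna(method='bfill').
filled = s.bfill()
print(filled.tolist())  # [2.0, 2.0, 4.0, 4.0]
```

A trailing NaN with no later valid value would remain NaN, since there is nothing to fill backward from.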
- between_time(start_time, end_time, include_start=True, include_end=True, axis=None)
Select values between particular times of the day (e.g., 9:00-9:30 AM).
By setting start_time to be later than end_time, you can get the times that are not between the two times.
- Parameters
start_time (datetime.time or str) – Initial time as a time filter limit.
end_time (datetime.time or str) – End time as a time filter limit.
include_start (bool, default True) – Whether the start time needs to be included in the result.
include_end (bool, default True) – Whether the end time needs to be included in the result.
axis ({0 or 'index', 1 or 'columns'}, default 0) – Determine range time on index or columns value.
- Returns
Data from the original object filtered to the specified dates range.
- Return type
- Raises
TypeError – If the index is not a DatetimeIndex.
See also
at_time
Select values at a particular time of the day.
first
Select initial periods of time series based on a date offset.
last
Select final periods of time series based on a date offset.
DatetimeIndex.indexer_between_time
Get just the index locations for values between particular times of the day.
Examples
>>> i = pd.date_range('2018-04-09', periods=4, freq='1D20min')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts
                     A
2018-04-09 00:00:00  1
2018-04-10 00:20:00  2
2018-04-11 00:40:00  3
2018-04-12 01:00:00  4
>>> ts.between_time('0:15', '0:45')
                     A
2018-04-10 00:20:00  2
2018-04-11 00:40:00  3
You get the times that are not between two times by setting start_time later than end_time:
>>> ts.between_time('0:45', '0:15')
                     A
2018-04-09 00:00:00  1
2018-04-12 01:00:00  4
Notes
See pandas API documentation for pandas.DataFrame.between_time for more.
- bfill(axis=None, inplace=False, limit=None, downcast=None)
Synonym for DataFrame.fillna() with method='bfill'.
- Returns
Object with missing values filled or None if inplace=True.
- Return type
Series/DataFrame or None
Notes
See pandas API documentation for pandas.DataFrame.bfill for more.
- bool()
Return the bool of a single element Series or DataFrame.
This must be a boolean scalar value, either True or False. It will raise a ValueError if the Series or DataFrame does not have exactly 1 element, or that element is not boolean (integer values 0 and 1 will also raise an exception).
- Returns
The value in the Series or DataFrame.
- Return type
bool
See also
Series.astype
Change the data type of a Series, including to boolean.
DataFrame.astype
Change the data type of a DataFrame, including to boolean.
numpy.bool_
NumPy boolean data type, used by pandas for boolean values.
Examples
The method will only work for single element objects with a boolean value:
>>> pd.Series([True]).bool()
True
>>> pd.Series([False]).bool()
False
>>> pd.DataFrame({'col': [True]}).bool()
True
>>> pd.DataFrame({'col': [False]}).bool()
False
Notes
See pandas API documentation for pandas.DataFrame.bool for more.
- combine(other, func, fill_value=None, **kwargs)
Perform column-wise combine with another DataFrame.
Combines a DataFrame with other DataFrame using func to element-wise combine columns. The row and column indexes of the resulting DataFrame will be the union of the two.
- Parameters
other (DataFrame) – The DataFrame to merge column-wise.
func (function) – Function that takes two Series as inputs and returns a Series or a scalar. Used to merge the two dataframes column by column.
fill_value (scalar value, default None) – The value to fill NaNs with prior to passing any column to the merge func.
overwrite (bool, default True) – If True, columns in self that do not exist in other will be overwritten with NaNs.
- Returns
Combination of the provided DataFrames.
- Return type
See also
DataFrame.combine_first
Combine two DataFrame objects and default to non-null values in frame calling the method.
Examples
Combine using a simple function that chooses the smaller column.
>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> take_smaller = lambda s1, s2: s1 if s1.sum() < s2.sum() else s2
>>> df1.combine(df2, take_smaller)
   A  B
0  0  3
1  0  3
Example using a true element-wise combine function.
>>> df1 = pd.DataFrame({'A': [5, 0], 'B': [2, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine(df2, np.minimum)
   A  B
0  1  2
1  0  3
Using fill_value fills Nones prior to passing the column to the merge function.
>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine(df2, take_smaller, fill_value=-5)
   A    B
0  0 -5.0
1  0  4.0
However, if the same element in both dataframes is None, that None is preserved
>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [None, 3]})
>>> df1.combine(df2, take_smaller, fill_value=-5)
   A    B
0  0 -5.0
1  0  3.0
Example that demonstrates the use of overwrite and behavior when the axis differ between the dataframes.
>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [-10, 1], }, index=[1, 2])
>>> df1.combine(df2, take_smaller)
     A    B     C
0  NaN  NaN   NaN
1  NaN  3.0 -10.0
2  NaN  3.0   1.0
>>> df1.combine(df2, take_smaller, overwrite=False)
     A    B     C
0  0.0  NaN   NaN
1  0.0  3.0 -10.0
2  NaN  3.0   1.0
Demonstrating the preference of the passed in dataframe.
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1], }, index=[1, 2])
>>> df2.combine(df1, take_smaller)
     A    B   C
0  0.0  NaN NaN
1  0.0  3.0 NaN
2  NaN  3.0 NaN
>>> df2.combine(df1, take_smaller, overwrite=False)
     A    B    C
0  0.0  NaN  NaN
1  0.0  3.0  1.0
2  NaN  3.0  1.0
Notes
See pandas API documentation for pandas.DataFrame.combine for more.
- combine_first(other)
Update null elements with value in the same location in other.
Combine two DataFrame objects by filling null values in one DataFrame with non-null values from other DataFrame. The row and column indexes of the resulting DataFrame will be the union of the two.
- Parameters
other (DataFrame) – Provided DataFrame to use to fill null values.
- Returns
The result of combining the provided DataFrame with the other object.
- Return type
See also
DataFrame.combine
Perform series-wise operation on two DataFrames using a given function.
Examples
>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine_first(df2)
     A    B
0  1.0  3.0
1  0.0  4.0
Null values still persist if the location of that null value does not exist in other
>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [4, None]})
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1]}, index=[1, 2])
>>> df1.combine_first(df2)
     A    B    C
0  NaN  4.0  NaN
1  0.0  3.0  1.0
2  NaN  3.0  1.0
Notes
See pandas API documentation for pandas.DataFrame.combine_first for more.
- convert_dtypes(infer_objects: modin.pandas.base.BasePandasDataset.bool = True, convert_string: modin.pandas.base.BasePandasDataset.bool = True, convert_integer: modin.pandas.base.BasePandasDataset.bool = True, convert_boolean: modin.pandas.base.BasePandasDataset.bool = True, convert_floating: modin.pandas.base.BasePandasDataset.bool = True)
Convert columns to best possible dtypes using dtypes supporting pd.NA.
New in version 1.0.0.
- Parameters
infer_objects (bool, default True) – Whether object dtypes should be converted to the best possible types.
convert_string (bool, default True) – Whether object dtypes should be converted to StringDtype().
convert_integer (bool, default True) – Whether, if possible, conversion can be done to integer extension types.
convert_boolean (bool, default True) – Whether object dtypes should be converted to BooleanDtype().
convert_floating (bool, default True) –
Whether, if possible, conversion can be done to floating extension types. If convert_integer is also True, preference will be given to integer dtypes if the floats can be faithfully cast to integers.
New in version 1.2.0.
- Returns
Copy of input object with new dtype.
- Return type
See also
infer_objects
Infer dtypes of objects.
to_datetime
Convert argument to datetime.
to_timedelta
Convert argument to timedelta.
to_numeric
Convert argument to a numeric type.
Notes
See pandas API documentation for pandas.DataFrame.convert_dtypes for more. By default, convert_dtypes will attempt to convert a Series (or each Series in a DataFrame) to dtypes that support pd.NA. By using the options convert_string, convert_integer, convert_boolean and convert_floating, it is possible to turn off individual conversions to StringDtype, the integer extension types, BooleanDtype or floating extension types, respectively.
For object-dtyped columns, if infer_objects is True, use the inference rules as during normal Series/DataFrame construction. Then, if possible, convert to StringDtype, BooleanDtype or an appropriate integer or floating extension type, otherwise leave as object.
If the dtype is integer, convert to an appropriate integer extension type.
If the dtype is numeric, and consists of all integers, convert to an appropriate integer extension type. Otherwise, convert to an appropriate floating extension type.
Changed in version 1.2: Starting with pandas 1.2, this method also converts float columns to the nullable floating extension type.
In the future, as new dtypes are added that support pd.NA, the results of this method will change to support those new dtypes.
Examples
>>> df = pd.DataFrame(
...     {
...         "a": pd.Series([1, 2, 3], dtype=np.dtype("int32")),
...         "b": pd.Series(["x", "y", "z"], dtype=np.dtype("O")),
...         "c": pd.Series([True, False, np.nan], dtype=np.dtype("O")),
...         "d": pd.Series(["h", "i", np.nan], dtype=np.dtype("O")),
...         "e": pd.Series([10, np.nan, 20], dtype=np.dtype("float")),
...         "f": pd.Series([np.nan, 100.5, 200], dtype=np.dtype("float")),
...     }
... )
Start with a DataFrame with default dtypes.
>>> df
   a  b      c    d     e      f
0  1  x   True    h  10.0    NaN
1  2  y  False    i   NaN  100.5
2  3  z    NaN  NaN  20.0  200.0
>>> df.dtypes
a      int32
b     object
c     object
d     object
e    float64
f    float64
dtype: object
Convert the DataFrame to use best possible dtypes.
>>> dfn = df.convert_dtypes()
>>> dfn
   a  b      c     d     e      f
0  1  x   True     h    10   <NA>
1  2  y  False     i  <NA>  100.5
2  3  z   <NA>  <NA>    20  200.0
>>> dfn.dtypes
a      Int32
b     string
c    boolean
d     string
e      Int64
f    Float64
dtype: object
Start with a Series of strings and missing data represented by np.nan.
>>> s = pd.Series(["a", "b", np.nan])
>>> s
0      a
1      b
2    NaN
dtype: object
Obtain a Series with dtype StringDtype.
>>> s.convert_dtypes()
0       a
1       b
2    <NA>
dtype: string
- copy(deep=True)
Make a copy of this object’s indices and data.
When deep=True (default), a new object will be created with a copy of the calling object’s data and indices. Modifications to the data or indices of the copy will not be reflected in the original object (see notes below).
When deep=False, a new object will be created without copying the calling object’s data or index (only references to the data and index are copied). Any changes to the data of the original will be reflected in the shallow copy (and vice versa).
- Parameters
deep (bool, default True) – Make a deep copy, including a copy of the data and the indices. With deep=False neither the indices nor the data are copied.
- Returns
copy – Object type matches caller.
- Return type
Notes
See pandas API documentation for pandas.DataFrame.copy for more. When deep=True, data is copied but actual Python objects will not be copied recursively, only the reference to the object. This is in contrast to copy.deepcopy in the Standard Library, which recursively copies object data (see examples below).
While Index objects are copied when deep=True, the underlying numpy array is not copied for performance reasons. Since Index is immutable, the underlying data can be safely shared and a copy is not needed.
Examples
>>> s = pd.Series([1, 2], index=["a", "b"])
>>> s
a    1
b    2
dtype: int64
>>> s_copy = s.copy()
>>> s_copy
a    1
b    2
dtype: int64
Shallow copy versus default (deep) copy:
>>> s = pd.Series([1, 2], index=["a", "b"])
>>> deep = s.copy()
>>> shallow = s.copy(deep=False)
Shallow copy shares data and index with original.
>>> s is shallow
False
>>> s.values is shallow.values and s.index is shallow.index
True
Deep copy has own copy of data and index.
>>> s is deep
False
>>> s.values is deep.values or s.index is deep.index
False
Updates to the data shared by shallow copy and original are reflected in both; deep copy remains unchanged.
>>> s[0] = 3
>>> shallow[1] = 4
>>> s
a    3
b    4
dtype: int64
>>> shallow
a    3
b    4
dtype: int64
>>> deep
a    1
b    2
dtype: int64
Note that when copying an object containing Python objects, a deep copy will copy the data, but will not do so recursively. Updating a nested data object will be reflected in the deep copy.
>>> s = pd.Series([[1, 2], [3, 4]])
>>> deep = s.copy()
>>> s[0][0] = 10
>>> s
0    [10, 2]
1     [3, 4]
dtype: object
>>> deep
0    [10, 2]
1     [3, 4]
dtype: object
- count(axis=0, level=None, numeric_only=False)
Count non-NA cells for each column or row.
The values None, NaN, NaT, and optionally numpy.inf (depending on pandas.options.mode.use_inf_as_na) are considered NA.
- Parameters
axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’ counts are generated for each column. If 1 or ‘columns’ counts are generated for each row.
level (int or str, optional) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a DataFrame. A str specifies the level name.
numeric_only (bool, default False) – Include only float, int or boolean data.
- Returns
For each column/row the number of non-NA/null entries. If level is specified returns a DataFrame.
- Return type
Series or DataFrame
See also
Series.count
Number of non-NA elements in a Series.
DataFrame.value_counts
Count unique combinations of columns.
DataFrame.shape
Number of DataFrame rows and columns (including NA elements).
DataFrame.isna
Boolean same-sized DataFrame showing places of NA elements.
Examples
Constructing DataFrame from a dictionary:
>>> df = pd.DataFrame({"Person":
...                    ["John", "Myla", "Lewis", "John", "Myla"],
...                    "Age": [24., np.nan, 21., 33, 26],
...                    "Single": [False, True, True, True, False]})
>>> df
  Person   Age  Single
0   John  24.0   False
1   Myla   NaN    True
2  Lewis  21.0    True
3   John  33.0    True
4   Myla  26.0   False
Notice the uncounted NA values:
>>> df.count()
Person    5
Age       4
Single    5
dtype: int64
Counts for each row:
>>> df.count(axis='columns')
0    3
1    2
2    3
3    3
4    3
dtype: int64
Notes
See pandas API documentation for pandas.DataFrame.count for more.
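As a supplementary illustration (not part of the pandas docstring), the snippet below shows which values count() treats as NA by default. It uses plain pandas, since Modin mirrors this API.

```python
import numpy as np
import pandas as pd

# Only None, NaN, and NaT are treated as NA by default; inf is counted.
s = pd.Series([1.0, np.inf, np.nan, None])
print(s.count())  # 2 -- inf is counted, NaN and None are not
```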
- cummax(axis=None, skipna=True, *args, **kwargs)
Return cumulative maximum over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative maximum.
- Parameters
axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args – Additional keywords have no effect but might be accepted for compatibility with NumPy.
**kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns
Return cumulative maximum of Series or DataFrame.
- Return type
Series or DataFrame
See also
core.window.Expanding.max
Similar functionality but ignores NaN values.
DataFrame.max
Return the maximum over DataFrame axis.
DataFrame.cummax
Return cumulative maximum over DataFrame axis.
DataFrame.cummin
Return cumulative minimum over DataFrame axis.
DataFrame.cumsum
Return cumulative sum over DataFrame axis.
DataFrame.cumprod
Return cumulative product over DataFrame axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0])
>>> s
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64
By default, NA values are ignored.
>>> s.cummax()
0    2.0
1    NaN
2    5.0
3    5.0
4    5.0
dtype: float64
To include NA values in the operation, use skipna=False.
>>> s.cummax(skipna=False)
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0],
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                   columns=list('AB'))
>>> df
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0
By default, iterates over rows and finds the maximum in each column. This is equivalent to axis=None or axis='index'.
>>> df.cummax()
     A    B
0  2.0  1.0
1  3.0  NaN
2  3.0  1.0
To iterate over columns and find the maximum in each row, use axis=1.
>>> df.cummax(axis=1)
     A    B
0  2.0  2.0
1  3.0  NaN
2  1.0  1.0
Notes
See pandas API documentation for pandas.DataFrame.cummax for more.
- cummin(axis=None, skipna=True, *args, **kwargs)
Return cumulative minimum over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative minimum.
- Parameters
axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args – Additional keywords have no effect but might be accepted for compatibility with NumPy.
**kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns
Return cumulative minimum of Series or DataFrame.
- Return type
Series or DataFrame
See also
core.window.Expanding.min
Similar functionality but ignores NaN values.
DataFrame.min
Return the minimum over DataFrame axis.
DataFrame.cummax
Return cumulative maximum over DataFrame axis.
DataFrame.cummin
Return cumulative minimum over DataFrame axis.
DataFrame.cumsum
Return cumulative sum over DataFrame axis.
DataFrame.cumprod
Return cumulative product over DataFrame axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0])
>>> s
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64
By default, NA values are ignored.
>>> s.cummin()
0    2.0
1    NaN
2    2.0
3   -1.0
4   -1.0
dtype: float64
To include NA values in the operation, use skipna=False.
>>> s.cummin(skipna=False)
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0],
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                   columns=list('AB'))
>>> df
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0
By default, iterates over rows and finds the minimum in each column. This is equivalent to axis=None or axis='index'.
>>> df.cummin()
     A    B
0  2.0  1.0
1  2.0  NaN
2  1.0  0.0
To iterate over columns and find the minimum in each row, use axis=1.
>>> df.cummin(axis=1)
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0
Notes
See pandas API documentation for pandas.DataFrame.cummin for more.
- cumprod(axis=None, skipna=True, *args, **kwargs)
Return cumulative product over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative product.
- Parameters
axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args – Additional keywords have no effect but might be accepted for compatibility with NumPy.
**kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns
Return cumulative product of Series or DataFrame.
- Return type
Series or DataFrame
See also
core.window.Expanding.prod
Similar functionality but ignores NaN values.
DataFrame.prod
Return the product over DataFrame axis.
DataFrame.cummax
Return cumulative maximum over DataFrame axis.
DataFrame.cummin
Return cumulative minimum over DataFrame axis.
DataFrame.cumsum
Return cumulative sum over DataFrame axis.
DataFrame.cumprod
Return cumulative product over DataFrame axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0])
>>> s
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64
By default, NA values are ignored.
>>> s.cumprod()
0     2.0
1     NaN
2    10.0
3   -10.0
4    -0.0
dtype: float64
To include NA values in the operation, use skipna=False.
>>> s.cumprod(skipna=False)
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0],
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                   columns=list('AB'))
>>> df
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0
By default, iterates over rows and finds the product in each column. This is equivalent to axis=None or axis='index'.
>>> df.cumprod()
     A    B
0  2.0  1.0
1  6.0  NaN
2  6.0  0.0
To iterate over columns and find the product in each row, use axis=1.
>>> df.cumprod(axis=1)
     A    B
0  2.0  2.0
1  3.0  NaN
2  1.0  0.0
Notes
See pandas API documentation for pandas.DataFrame.cumprod for more.
- cumsum(axis=None, skipna=True, *args, **kwargs)
Return cumulative sum over a DataFrame or Series axis.
Returns a DataFrame or Series of the same size containing the cumulative sum.
- Parameters
axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
*args – Additional keywords have no effect but might be accepted for compatibility with NumPy.
**kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.
- Returns
Return cumulative sum of Series or DataFrame.
- Return type
Series or DataFrame
See also
core.window.Expanding.sum
Similar functionality but ignores NaN values.
DataFrame.sum
Return the sum over DataFrame axis.
DataFrame.cummax
Return cumulative maximum over DataFrame axis.
DataFrame.cummin
Return cumulative minimum over DataFrame axis.
DataFrame.cumsum
Return cumulative sum over DataFrame axis.
DataFrame.cumprod
Return cumulative product over DataFrame axis.
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0])
>>> s
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64
By default, NA values are ignored.
>>> s.cumsum()
0    2.0
1    NaN
2    7.0
3    6.0
4    6.0
dtype: float64
To include NA values in the operation, use skipna=False.
>>> s.cumsum(skipna=False)
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0],
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                   columns=list('AB'))
>>> df
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0
By default, iterates over rows and finds the sum in each column. This is equivalent to axis=None or axis='index'.
>>> df.cumsum()
     A    B
0  2.0  1.0
1  5.0  NaN
2  6.0  1.0
To iterate over columns and find the sum in each row, use axis=1.
>>> df.cumsum(axis=1)
     A    B
0  2.0  3.0
1  3.0  NaN
2  1.0  1.0
Notes
See pandas API documentation for pandas.DataFrame.cumsum for more.
- describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)
Generate descriptive statistics.
Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.
Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output will vary depending on what is provided. Refer to the notes below for more detail.
- Parameters
percentiles (list-like of numbers, optional) – The percentiles to include in the output. All should fall between 0 and 1. The default is [.25, .5, .75], which returns the 25th, 50th, and 75th percentiles.
include ('all', list-like of dtypes or None (default), optional) –
A white list of data types to include in the result. Ignored for Series. Here are the options:
'all' : All columns of the input will be included in the output.
A list-like of dtypes : Limits the results to the provided data types. To limit the result to numeric types submit numpy.number. To limit it instead to object columns submit the numpy.object data type. Strings can also be used in the style of select_dtypes (e.g. df.describe(include=['O'])). To select pandas categorical columns, use 'category'.
None (default) : The result will include all numeric columns.
exclude (list-like of dtypes or None (default), optional) –
A black list of data types to omit from the result. Ignored for Series. Here are the options:
A list-like of dtypes : Excludes the provided data types from the result. To exclude numeric types submit numpy.number. To exclude object columns submit the data type numpy.object. Strings can also be used in the style of select_dtypes (e.g. df.describe(exclude=['O'])). To exclude pandas categorical columns, use 'category'.
None (default) : The result will exclude nothing.
datetime_is_numeric (bool, default False) –
Whether to treat datetime dtypes as numeric. This affects statistics calculated for the column. For DataFrame input, this also controls whether datetime columns are included by default.
New in version 1.1.0.
- Returns
Summary statistics of the Series or Dataframe provided.
- Return type
Series or DataFrame
See also
DataFrame.count
Count number of non-NA/null observations.
DataFrame.max
Maximum of the values in the object.
DataFrame.min
Minimum of the values in the object.
DataFrame.mean
Mean of the values.
DataFrame.std
Standard deviation of the observations.
DataFrame.select_dtypes
Subset of a DataFrame including/excluding columns based on their dtype.
Notes
See pandas API documentation for pandas.DataFrame.describe for more. For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same as the median.
For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq. The top is the most common value. The freq is the most common value’s frequency. Timestamps also include the first and last items.
If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from among those with the highest count.
For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns. If the dataframe consists only of object and categorical data without any numeric columns, the default is to return an analysis of both the object and categorical columns. If include='all' is provided as an option, the result will include a union of attributes of each type.
The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the output. The parameters are ignored when analyzing a Series.
Examples
Describing a numeric Series.
>>> s = pd.Series([1, 2, 3])
>>> s.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
dtype: float64
Describing a categorical Series.
>>> s = pd.Series(['a', 'a', 'b', 'c'])
>>> s.describe()
count     4
unique    3
top       a
freq      2
dtype: object
Describing a timestamp Series.
>>> s = pd.Series([
...     np.datetime64("2000-01-01"),
...     np.datetime64("2010-01-01"),
...     np.datetime64("2010-01-01")
... ])
>>> s.describe(datetime_is_numeric=True)
count                      3
mean     2006-09-01 08:00:00
min      2000-01-01 00:00:00
25%      2004-12-31 12:00:00
50%      2010-01-01 00:00:00
75%      2010-01-01 00:00:00
max      2010-01-01 00:00:00
dtype: object
Describing a DataFrame. By default only numeric fields are returned.
>>> df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']),
...                    'numeric': [1, 2, 3],
...                    'object': ['a', 'b', 'c']
...                    })
>>> df.describe()
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0
Describing all columns of a DataFrame regardless of data type.
>>> df.describe(include='all')
       categorical  numeric object
count            3      3.0      3
unique           3      NaN      3
top              f      NaN      a
freq             1      NaN      1
mean           NaN      2.0    NaN
std            NaN      1.0    NaN
min            NaN      1.0    NaN
25%            NaN      1.5    NaN
50%            NaN      2.0    NaN
75%            NaN      2.5    NaN
max            NaN      3.0    NaN
Describing a column from a DataFrame by accessing it as an attribute.
>>> df.numeric.describe()
count    3.0
mean     2.0
std      1.0
min      1.0
25%      1.5
50%      2.0
75%      2.5
max      3.0
Name: numeric, dtype: float64
Including only numeric columns in a DataFrame description.
>>> df.describe(include=[np.number])
       numeric
count      3.0
mean       2.0
std        1.0
min        1.0
25%        1.5
50%        2.0
75%        2.5
max        3.0
Including only string columns in a DataFrame description.
>>> df.describe(include=[object])
       object
count       3
unique      3
top         a
freq        1
Including only categorical columns from a DataFrame description.
>>> df.describe(include=['category'])
       categorical
count            3
unique           3
top              d
freq             1
Excluding numeric columns from a DataFrame description.
>>> df.describe(exclude=[np.number])
       categorical object
count            3      3
unique           3      3
top              f      a
freq             1      1
Excluding object columns from a DataFrame description.
>>> df.describe(exclude=[object])
       categorical  numeric
count            3      3.0
unique           3      NaN
top              f      NaN
freq             1      NaN
mean           NaN      2.0
std            NaN      1.0
min            NaN      1.0
25%            NaN      1.5
50%            NaN      2.0
75%            NaN      2.5
max            NaN      3.0
- diff(periods=1, axis=0)
First discrete difference of element.
Calculates the difference of a Dataframe element compared with another element in the Dataframe (default is element in previous row).
- Parameters
periods (int, default 1) – Periods to shift for calculating difference, accepts negative values.
axis ({0 or 'index', 1 or 'columns'}, default 0) – Take difference over rows (0) or columns (1).
- Returns
First differences of the Series.
- Return type
Dataframe
See also
Dataframe.pct_change
Percent change over given number of periods.
Dataframe.shift
Shift index by desired number of periods with an optional time freq.
Series.diff
First discrete difference of object.
Notes
See pandas API documentation for pandas.DataFrame.diff for more. For boolean dtypes, this uses operator.xor() rather than operator.sub(). The result is calculated according to the current dtype in the Dataframe, however the dtype of the result is always float64.
Examples
Difference with previous row
>>> df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
...                    'b': [1, 1, 2, 3, 5, 8],
...                    'c': [1, 4, 9, 16, 25, 36]})
>>> df
   a  b   c
0  1  1   1
1  2  1   4
2  3  2   9
3  4  3  16
4  5  5  25
5  6  8  36
>>> df.diff()
     a    b     c
0  NaN  NaN   NaN
1  1.0  0.0   3.0
2  1.0  1.0   5.0
3  1.0  1.0   7.0
4  1.0  2.0   9.0
5  1.0  3.0  11.0
Difference with previous column
>>> df.diff(axis=1)
    a  b   c
0 NaN  0   0
1 NaN -1   3
2 NaN -1   7
3 NaN -1  13
4 NaN  0  20
5 NaN  2  28
Difference with 3rd previous row
>>> df.diff(periods=3)
     a    b     c
0  NaN  NaN   NaN
1  NaN  NaN   NaN
2  NaN  NaN   NaN
3  3.0  2.0  15.0
4  3.0  4.0  21.0
5  3.0  6.0  27.0
Difference with following row
>>> df.diff(periods=-1)
     a    b     c
0 -1.0  0.0  -3.0
1 -1.0 -1.0  -5.0
2 -1.0 -1.0  -7.0
3 -1.0 -2.0  -9.0
4 -1.0 -3.0 -11.0
5  NaN  NaN   NaN
Overflow in input dtype
>>> df = pd.DataFrame({'a': [1, 0]}, dtype=np.uint8)
>>> df.diff()
       a
0    NaN
1  255.0
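As a supplementary illustration (not part of the pandas docstring), the XOR behavior mentioned in the notes above can be seen on boolean data, using plain pandas since Modin mirrors this API:

```python
import pandas as pd

s = pd.Series([True, True, False, True])
# For boolean data, diff() applies XOR between consecutive elements
# rather than subtraction; the first element has no predecessor, so it is NA.
print(s.diff())
```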
- div(other, axis='columns', level=None, fill_value=None)
Get Floating division of dataframe and other, element-wise (binary operator truediv).
Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rtruediv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns
Result of the arithmetic operation.
- Return type
DataFrame
See also
DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.
Notes
See pandas API documentation for pandas.DataFrame.truediv for more. Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360
Add a scalar with the operator version, which returns the same results.
>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
Divide by constant with reverse version.
>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
- divide(other, axis='columns', level=None, fill_value=None)
Get Floating division of dataframe and other, element-wise (binary operator truediv).
Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rtruediv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns
Result of the arithmetic operation.
- Return type
DataFrame
See also
DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.
Notes
See pandas API documentation for pandas.DataFrame.truediv for more. Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4], ... 'degrees': [360, 180, 360]}, ... index=['circle', 'triangle', 'rectangle']) >>> df angles degrees circle 0 360 triangle 3 180 rectangle 4 360
Add a scalar with the operator version, which returns the same results.
>>> df + 1 angles degrees circle 1 361 triangle 4 181 rectangle 5 361
>>> df.add(1) angles degrees circle 1 361 triangle 4 181 rectangle 5 361
Divide by constant with reverse version.
>>> df.div(10) angles degrees circle 0.0 36.0 triangle 0.3 18.0 rectangle 0.4 36.0
>>> df.rdiv(10) angles degrees circle inf 0.027778 triangle 3.333333 0.055556 rectangle 2.500000 0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2] angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub([1, 2], axis='columns') angles degrees circle -1 358 triangle 2 178 rectangle 3 358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']), ... axis='index') angles degrees circle -1 359 triangle 2 179 rectangle 3 359
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]}, ... index=['circle', 'triangle', 'rectangle']) >>> other angles circle 0 triangle 3 rectangle 4
>>> df * other angles degrees circle 0 NaN triangle 9 NaN rectangle 16 NaN
>>> df.mul(other, fill_value=0) angles degrees circle 0 0.0 triangle 9 0.0 rectangle 16 0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6], ... 'degrees': [360, 180, 360, 360, 540, 720]}, ... index=[['A', 'A', 'A', 'B', 'B', 'B'], ... ['circle', 'triangle', 'rectangle', ... 'square', 'pentagon', 'hexagon']]) >>> df_multindex angles degrees A circle 0 360 triangle 3 180 rectangle 4 360 B square 4 360 pentagon 5 540 hexagon 6 720
>>> df.div(df_multindex, level=1, fill_value=0) angles degrees A circle NaN 1.0 triangle 1.0 1.0 rectangle 1.0 1.0 B square 0.0 0.0 pentagon 0.0 0.0 hexagon 0.0 0.0
- drop(labels=None, axis=0, index=None, columns=None, level=None, inplace=False, errors='raise')
Drop specified labels from rows or columns.
Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level. See the pandas user guide for more information about the now unused levels.
- Parameters
labels (single label or list-like) – Index or column labels to drop.
axis ({0 or 'index', 1 or 'columns'}, default 0) – Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).
index (single label or list-like) – Alternative to specifying axis (labels, axis=0 is equivalent to index=labels).
columns (single label or list-like) – Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).
level (int or level name, optional) – For MultiIndex, level from which the labels will be removed.
inplace (bool, default False) – If False, return a copy. Otherwise, do operation inplace and return None.
errors ({'ignore', 'raise'}, default 'raise') – If ‘ignore’, suppress error and only existing labels are dropped.
- Returns
DataFrame without the removed index or column labels, or None if inplace=True.
- Return type
DataFrame or None
- Raises
KeyError – If any of the labels is not found in the selected axis.
See also
DataFrame.loc
Label-location based indexer for selection by label.
DataFrame.dropna
Return DataFrame with labels on given axis omitted where (all or any) data are missing.
DataFrame.drop_duplicates
Return DataFrame with duplicate rows removed, optionally only considering certain columns.
Series.drop
Return Series with specified index labels removed.
Examples
>>> df = pd.DataFrame(np.arange(12).reshape(3, 4),
...                   columns=['A', 'B', 'C', 'D'])
>>> df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
Drop columns
>>> df.drop(['B', 'C'], axis=1)
   A   D
0  0   3
1  4   7
2  8  11
>>> df.drop(columns=['B', 'C'])
   A   D
0  0   3
1  4   7
2  8  11
Drop a row by index
>>> df.drop([0, 1])
   A  B   C   D
2  8  9  10  11
Drop columns and/or rows of MultiIndex DataFrame
>>> midx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'],
...                              ['speed', 'weight', 'length']],
...                      codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
...                             [0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> df = pd.DataFrame(index=midx, columns=['big', 'small'],
...                   data=[[45, 30], [200, 100], [1.5, 1], [30, 20],
...                         [250, 150], [1.5, 0.8], [320, 250],
...                         [1, 0.8], [0.3, 0.2]])
>>> df
                 big  small
lama   speed    45.0   30.0
       weight  200.0  100.0
       length    1.5    1.0
cow    speed    30.0   20.0
       weight  250.0  150.0
       length    1.5    0.8
falcon speed   320.0  250.0
       weight    1.0    0.8
       length    0.3    0.2
>>> df.drop(index='cow', columns='small')
                 big
lama   speed    45.0
       weight  200.0
       length    1.5
falcon speed   320.0
       weight    1.0
       length    0.3
>>> df.drop(index='length', level=1)
                 big  small
lama   speed    45.0   30.0
       weight  200.0  100.0
cow    speed    30.0   20.0
       weight  250.0  150.0
falcon speed   320.0  250.0
       weight    1.0    0.8
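As a supplementary illustration (not part of the pandas docstring), the errors parameter controls what happens when a label is missing, shown here with plain pandas since Modin mirrors this API:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
# 'C' does not exist; errors='ignore' skips it instead of raising KeyError,
# while the default errors='raise' would raise.
out = df.drop(columns=["B", "C"], errors="ignore")
print(out.columns.tolist())  # ['A']
```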
Notes
See pandas API documentation for pandas.DataFrame.drop for more.
- drop_duplicates(keep='first', inplace=False, **kwargs)
Return DataFrame with duplicate rows removed.
Considering certain columns is optional. Indexes, including time indexes are ignored.
- Parameters
subset (column label or sequence of labels, optional) – Only consider certain columns for identifying duplicates, by default use all of the columns.
keep ({'first', 'last', False}, default 'first') – Determines which duplicates (if any) to keep.
- first : Drop duplicates except for the first occurrence.
- last : Drop duplicates except for the last occurrence.
- False : Drop all duplicates.
inplace (bool, default False) – Whether to drop duplicates in place or to return a copy.
ignore_index (bool, default False) –
If True, the resulting axis will be labeled 0, 1, …, n - 1.
New in version 1.0.0.
- Returns
DataFrame with duplicates removed, or None if inplace=True.
- Return type
DataFrame or None
See also
DataFrame.value_counts
Count unique combinations of columns.
Examples
Consider dataset containing ramen rating.
>>> df = pd.DataFrame({
...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
...     'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
     brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0
By default, it removes duplicate rows based on all columns.
>>> df.drop_duplicates()
     brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0
To remove duplicates on specific column(s), use subset.
>>> df.drop_duplicates(subset=['brand'])
     brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
To remove duplicates and keep last occurrences, use keep.
>>> df.drop_duplicates(subset=['brand', 'style'], keep='last')
     brand style  rating
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
4  Indomie  pack     5.0
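As a supplementary illustration (not part of the pandas docstring), the ignore_index parameter relabels the surviving rows, shown here with plain pandas since Modin mirrors this API:

```python
import pandas as pd

df = pd.DataFrame({"brand": ["Yum Yum", "Yum Yum", "Indomie"],
                   "style": ["cup", "cup", "pack"]})
# Without ignore_index the surviving labels would be 0 and 2;
# with ignore_index=True the result is relabeled 0..n-1.
dedup = df.drop_duplicates(ignore_index=True)
print(dedup.index.tolist())  # [0, 1]
```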
Notes
See pandas API documentation for pandas.DataFrame.drop_duplicates for more.
- droplevel(level, axis=0)
Return Series/DataFrame with requested index / column level(s) removed.
- Parameters
level (int, str, or list-like) – If a string is given, it must be the name of a level. If list-like, elements must be names or positional indexes of levels.
axis ({0 or 'index', 1 or 'columns'}, default 0) –
Axis along which the level(s) is removed:
0 or ‘index’: remove the requested level(s) from the row index.
1 or ‘columns’: remove the requested level(s) from the column index.
- Returns
Series/DataFrame with requested index / column level(s) removed.
- Return type
Series/DataFrame
Examples
>>> df = pd.DataFrame([
...     [1, 2, 3, 4],
...     [5, 6, 7, 8],
...     [9, 10, 11, 12]
... ]).set_index([0, 1]).rename_axis(['a', 'b'])
>>> df.columns = pd.MultiIndex.from_tuples([
...     ('c', 'e'), ('d', 'f')
... ], names=['level_1', 'level_2'])
>>> df
level_1   c   d
level_2   e   f
a b
1 2       3   4
5 6       7   8
9 10     11  12
>>> df.droplevel('a')
level_1   c   d
level_2   e   f
b
2         3   4
6         7   8
10       11  12
>>> df.droplevel('level_2', axis=1)
level_1   c   d
a b
1 2       3   4
5 6       7   8
9 10     11  12
Notes
See pandas API documentation for pandas.DataFrame.droplevel for more.
- dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
Remove missing values.
See the User Guide for more on which values are considered missing, and how to work with missing data.
- Parameters
axis ({0 or 'index', 1 or 'columns'}, default 0) –
Determine if rows or columns which contain missing values are removed.
0, or ‘index’ : Drop rows which contain missing values.
1, or ‘columns’ : Drop columns which contain missing value.
Changed in version 1.0.0: Pass tuple or list to drop on multiple axes. Only a single axis is allowed.
how ({'any', 'all'}, default 'any') –
Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.
’any’ : If any NA values are present, drop that row or column.
’all’ : If all values are NA, drop that row or column.
thresh (int, optional) – Require that many non-NA values.
subset (array-like, optional) – Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.
inplace (bool, default False) – If True, do operation inplace and return None.
- Returns
DataFrame with NA entries dropped from it, or None if inplace=True.
- Return type
DataFrame or None
See also
DataFrame.isna
Indicate missing values.
DataFrame.notna
Indicate existing (non-missing) values.
DataFrame.fillna
Replace missing values.
Series.dropna
Drop missing values.
Index.dropna
Drop missing indices.
Examples
>>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
...                    "toy": [np.nan, 'Batmobile', 'Bullwhip'],
...                    "born": [pd.NaT, pd.Timestamp("1940-04-25"),
...                             pd.NaT]})
>>> df
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT
Drop the rows where at least one element is missing.
>>> df.dropna()
     name        toy       born
1  Batman  Batmobile 1940-04-25
Drop the columns where at least one element is missing.
>>> df.dropna(axis='columns')
       name
0    Alfred
1    Batman
2  Catwoman
Drop the rows where all elements are missing.
>>> df.dropna(how='all')
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT
Keep only the rows with at least 2 non-NA values.
>>> df.dropna(thresh=2)
       name        toy       born
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT
Define in which columns to look for missing values.
>>> df.dropna(subset=['name', 'toy'])
       name        toy       born
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT
Keep the DataFrame with valid entries in the same variable.
>>> df.dropna(inplace=True)
>>> df
     name        toy       born
1  Batman  Batmobile 1940-04-25
Notes
See pandas API documentation for pandas.DataFrame.dropna for more.
- eq(other, axis='columns', level=None)
Get Equal to of dataframe and other, element-wise (binary operator eq).
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
- Parameters
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- Returns
Result of the comparison.
- Return type
DataFrame of bool
See also
DataFrame.eq
Compare DataFrames for equality elementwise.
DataFrame.ne
Compare DataFrames for inequality elementwise.
DataFrame.le
Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt
Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge
Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt
Compare DataFrames for strictly greater than inequality elementwise.
Notes
See pandas API documentation for pandas.DataFrame.eq for more. Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300
Comparison with a scalar, using either the operator or method:
>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False
>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False
When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:
>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True
Use the method to control the broadcast axis:
>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True
When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:
>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False
Use the method to control the axis:
>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False
Compare to a DataFrame of different shape.
>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150
>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False
Compare to a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
- ewm(com=None, span=None, halflife=None, alpha=None, min_periods=0, adjust=True, ignore_na=False, axis=0, times=None)
Provide exponential weighted (EW) functions.
Available EW functions: mean(), var(), std(), corr(), cov().
Exactly one parameter: com, span, halflife, or alpha must be provided.
- Parameters
com (float, optional) – Specify decay in terms of center of mass, \(\alpha = 1 / (1 + com)\), for \(com \geq 0\).
span (float, optional) – Specify decay in terms of span, \(\alpha = 2 / (span + 1)\), for \(span \geq 1\).
halflife (float, str, timedelta, optional) –
Specify decay in terms of half-life, \(\alpha = 1 - \exp\left(-\ln(2) / halflife\right)\), for \(halflife > 0\).
If times is specified, the time unit (str or timedelta) over which an observation decays to half its value. Only applicable to mean(), and the halflife value will not apply to the other functions.
New in version 1.1.0.
alpha (float, optional) – Specify smoothing factor \(\alpha\) directly, \(0 < \alpha \leq 1\).
min_periods (int, default 0) – Minimum number of observations in window required to have a value (otherwise result is NA).
adjust (bool, default True) –
Divide by decaying adjustment factor in beginning periods to account for imbalance in relative weightings (viewing EWMA as a moving average).
When adjust=True (default), the EW function is calculated using weights \(w_i = (1 - \alpha)^i\). For example, the EW moving average of the series [\(x_0, x_1, ..., x_t\)] would be:
\[y_t = \frac{x_t + (1 - \alpha)x_{t-1} + (1 - \alpha)^2 x_{t-2} + ... + (1 - \alpha)^t x_0}{1 + (1 - \alpha) + (1 - \alpha)^2 + ... + (1 - \alpha)^t}\]
When adjust=False, the exponentially weighted function is calculated recursively:
\[\begin{split}y_0 &= x_0\\ y_t &= (1 - \alpha) y_{t-1} + \alpha x_t\end{split}\]
ignore_na (bool, default False) –
Ignore missing values when calculating weights; specify True to reproduce pre-0.15.0 behavior.
When ignore_na=False (default), weights are based on absolute positions. For example, the weights of \(x_0\) and \(x_2\) used in calculating the final weighted average of [\(x_0\), None, \(x_2\)] are \((1-\alpha)^2\) and \(1\) if adjust=True, and \((1-\alpha)^2\) and \(\alpha\) if adjust=False.
When ignore_na=True (reproducing pre-0.15.0 behavior), weights are based on relative positions. For example, the weights of \(x_0\) and \(x_2\) used in calculating the final weighted average of [\(x_0\), None, \(x_2\)] are \(1-\alpha\) and \(1\) if adjust=True, and \(1-\alpha\) and \(\alpha\) if adjust=False.
axis ({0, 1}, default 0) – The axis to use. The value 0 identifies the rows, and 1 identifies the columns.
times (str, np.ndarray, Series, default None) –
New in version 1.1.0.
Times corresponding to the observations. Must be monotonically increasing and of datetime64[ns] dtype.
If str, the name of the column in the DataFrame representing the times.
If 1-D array like, a sequence with the same shape as the observations.
Only applicable to mean().
- Returns
A Window sub-classed for the particular operation.
- Return type
See also
rolling
Provides rolling window calculations.
expanding
Provides expanding transformations.
Notes
See pandas API documentation for pandas.DataFrame.ewm for more.
More details can be found at: Exponentially weighted windows.
Examples
>>> df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]})
>>> df
     B
0  0.0
1  1.0
2  2.0
3  NaN
4  4.0
>>> df.ewm(com=0.5).mean()
          B
0  0.000000
1  0.750000
2  1.615385
3  1.615385
4  3.670213
Specifying times with a timedelta halflife when computing mean.
>>> times = ['2020-01-01', '2020-01-03', '2020-01-10', '2020-01-15', '2020-01-17']
>>> df.ewm(halflife='4 days', times=pd.DatetimeIndex(times)).mean()
          B
0  0.000000
1  0.585786
2  1.523889
3  1.523889
4  3.233686
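The adjust=True formula above can be checked by hand. The following sketch (plain NumPy, nothing Modin-specific) reproduces the third value of df.ewm(com=0.5).mean() from the example by applying the weights \(w_i = (1 - \alpha)^i\) directly:

```python
import numpy as np

# Reproduce row 2 of df.ewm(com=0.5).mean() from the example above
# using the adjust=True weights w_i = (1 - alpha)^i.
x = np.array([0.0, 1.0, 2.0])
com = 0.5
alpha = 1.0 / (1.0 + com)  # com=0.5 gives alpha = 2/3

# The oldest observation gets the largest exponent (most decay).
weights = (1.0 - alpha) ** np.arange(len(x) - 1, -1, -1)
y = (weights * x).sum() / weights.sum()
print(round(y, 6))  # 1.615385, matching the doctest output
```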
- expanding(min_periods=1, center=None, axis=0, method='single')
Provide expanding transformations.
- Parameters
min_periods (int, default 1) – Minimum number of observations in window required to have a value (otherwise result is NA).
center (bool, default False) – Set the labels at the center of the window.
axis (int or str, default 0) –
method (str {'single', 'table'}, default 'single') –
Execute the rolling operation per single column or row ('single') or over the entire object ('table').
This argument is only implemented when specifying engine='numba' in the method call.
New in version 1.3.0.
- Returns
- Return type
a Window sub-classed for the particular operation
See also
rolling
Provides rolling window calculations.
ewm
Provides exponential weighted functions.
Notes
See pandas API documentation for pandas.DataFrame.expanding for more. By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
Examples
>>> df = pd.DataFrame({"B": [0, 1, 2, np.nan, 4]})
>>> df
     B
0  0.0
1  1.0
2  2.0
3  NaN
4  4.0
>>> df.expanding(2).sum()
     B
0  NaN
1  1.0
2  3.0
3  3.0
4  7.0
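As a sanity check of how min_periods interacts with NaN, the sketch below uses plain pandas (modin.pandas is intended to behave identically): the result is NA until the expanding window has seen at least two non-NA observations, and NaN inputs are skipped rather than resetting the sum.

```python
import numpy as np
import pandas as pd  # with Modin: import modin.pandas as pd

s = pd.Series([0, 1, 2, np.nan, 4])

# expanding(2).sum(): cumulative sum, NA until >= 2 non-NA values seen.
result = s.expanding(2).sum()
assert np.isnan(result.iloc[0])                       # only 1 observation so far
assert result.iloc[1:].tolist() == [1.0, 3.0, 3.0, 7.0]
```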
- ffill(axis=None, inplace=False, limit=None, downcast=None)
Synonym for DataFrame.fillna() with method='ffill'.
- Returns
Object with missing values filled or None if inplace=True.
- Return type
Series/DataFrame or None
Notes
See pandas API documentation for pandas.DataFrame.pad for more.
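This entry has no example of its own, so here is a minimal sketch of the forward fill (plain pandas; modin.pandas is intended as a drop-in replacement). The limit parameter caps how many consecutive NaNs are filled from a single observation:

```python
import numpy as np
import pandas as pd  # with Modin: import modin.pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan, np.nan, 4.0, np.nan]})

# ffill copies the last valid observation forward.
print(df.ffill()["A"].tolist())         # [1.0, 1.0, 1.0, 4.0, 4.0]

# With limit=1, only one consecutive NaN is filled; the second stays NaN.
print(df.ffill(limit=1)["A"].tolist())  # [1.0, 1.0, nan, 4.0, 4.0]
```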
- filter(items=None, like=None, regex=None, axis=None)
Subset the dataframe rows or columns according to the specified index labels.
Note that this routine does not filter a dataframe on its contents. The filter is applied to the labels of the index.
- Parameters
items (list-like) – Keep labels from axis which are in items.
like (str) – Keep labels from axis for which “like in label == True”.
regex (str (regular expression)) – Keep labels from axis for which re.search(regex, label) == True.
axis ({0 or ‘index’, 1 or ‘columns’, None}, default None) – The axis to filter on, expressed either as an index (int) or axis name (str). By default this is the info axis, ‘index’ for Series, ‘columns’ for DataFrame.
- Returns
- Return type
same type as input object
See also
DataFrame.loc
Access a group of rows and columns by label(s) or a boolean array.
Notes
See pandas API documentation for pandas.DataFrame.filter for more. The items, like, and regex parameters are enforced to be mutually exclusive. axis defaults to the info axis that is used when indexing with [].
Examples
>>> df = pd.DataFrame(np.array(([1, 2, 3], [4, 5, 6])),
...                   index=['mouse', 'rabbit'],
...                   columns=['one', 'two', 'three'])
>>> df
        one  two  three
mouse     1    2      3
rabbit    4    5      6
>>> # select columns by name
>>> df.filter(items=['one', 'three'])
        one  three
mouse     1      3
rabbit    4      6
>>> # select columns by regular expression
>>> df.filter(regex='e$', axis=1)
        one  three
mouse     1      3
rabbit    4      6
>>> # select rows containing 'bbi'
>>> df.filter(like='bbi', axis=0)
        one  two  three
rabbit    4    5      6
- first(offset)
Select initial periods of time series data based on a date offset.
When having a DataFrame with dates as index, this function can select the first few rows based on a date offset.
- Parameters
offset (str, DateOffset or dateutil.relativedelta) – The offset length of the data that will be selected. For instance, ‘1M’ will display all the rows having their index within the first month.
- Returns
A subset of the caller.
- Return type
- Raises
TypeError – If the index is not a DatetimeIndex
See also
last
Select final periods of time series based on a date offset.
at_time
Select values at a particular time of the day.
between_time
Select values between particular times of the day.
Examples
>>> i = pd.date_range('2018-04-09', periods=4, freq='2D')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4
Get the rows for the first 3 days:
>>> ts.first('3D')
            A
2018-04-09  1
2018-04-11  2
Notice that the data for the first 3 calendar days were returned, not the first 3 days observed in the dataset, and therefore data for 2018-04-13 was not returned.
Notes
See pandas API documentation for pandas.DataFrame.first for more.
- first_valid_index()
Return index for first non-NA value or None, if no non-NA value is found.
- Returns
scalar
- Return type
type of index
Notes
See pandas API documentation for pandas.DataFrame.first_valid_index for more. If all elements are NA/null, returns None. Also returns None for an empty Series/DataFrame.
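This entry has no example, so here is a short illustration (plain pandas; the Modin version is intended to match):

```python
import numpy as np
import pandas as pd  # with Modin: import modin.pandas as pd

s = pd.Series([np.nan, np.nan, 3.0, np.nan], index=["a", "b", "c", "d"])
print(s.first_valid_index())  # 'c' -- label of the first non-NA value

# All-NA and empty objects return None instead of raising.
print(pd.Series([np.nan, np.nan]).first_valid_index())  # None
print(pd.Series(dtype=float).first_valid_index())       # None
```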
- property flags
Get the properties associated with this pandas object.
The available flags are
Flags.allows_duplicate_labels
See also
Flags
Flags that apply to pandas objects.
DataFrame.attrs
Global metadata applying to this dataset.
Notes
See pandas API documentation for pandas.DataFrame.flags for more. “Flags” differ from “metadata”. Flags reflect properties of the pandas object (the Series or DataFrame). Metadata refer to properties of the dataset, and should be stored in DataFrame.attrs.
Examples
>>> df = pd.DataFrame({"A": [1, 2]})
>>> df.flags
<Flags(allows_duplicate_labels=True)>
Flags can be get or set using attribute access:
>>> df.flags.allows_duplicate_labels
True
>>> df.flags.allows_duplicate_labels = False
Or by slicing with a key:
>>> df.flags["allows_duplicate_labels"]
False
>>> df.flags["allows_duplicate_labels"] = True
- floordiv(other, axis='columns', level=None, fill_value=None)
Get Integer division of dataframe and other, element-wise (binary operator floordiv).
Equivalent to dataframe // other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rfloordiv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns
Result of the arithmetic operation.
- Return type
See also
DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.
Notes
See pandas API documentation for pandas.DataFrame.floordiv for more. Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360
Add a scalar with the operator version, which returns the same results.
>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
Divide by constant with reverse version.
>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
- ge(other, axis='columns', level=None)
Get Greater than or equal to of dataframe and other, element-wise (binary operator ge).
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
- Parameters
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- Returns
Result of the comparison.
- Return type
DataFrame of bool
See also
DataFrame.eq
Compare DataFrames for equality elementwise.
DataFrame.ne
Compare DataFrames for inequality elementwise.
DataFrame.le
Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt
Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge
Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt
Compare DataFrames for strictly greater than inequality elementwise.
Notes
See pandas API documentation for pandas.DataFrame.ge for more. Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300
Comparison with a scalar, using either the operator or method:
>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False
>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False
When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:
>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True
Use the method to control the broadcast axis:
>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True
When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:
>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False
Use the method to control the axis:
>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False
Compare to a DataFrame of different shape.
>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150
>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False
Compare to a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
- get(key, default=None)
Get item from object for given key (ex: DataFrame column).
Returns default value if not found.
- Parameters
key (object) –
- Returns
value
- Return type
same type as items contained in object
Notes
See pandas API documentation for pandas.DataFrame.get for more.
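This entry has no example: get behaves like dict.get for columns, returning the default instead of raising KeyError. A minimal sketch (plain pandas; modin.pandas is intended as a drop-in replacement):

```python
import pandas as pd  # with Modin: import modin.pandas as pd

df = pd.DataFrame({"temp": [20, 21], "wind": [5, 7]})

# Present key: returns the column, like df["temp"].
print(df.get("temp").tolist())         # [20, 21]

# Missing key: returns the default rather than raising KeyError.
print(df.get("humidity", default=-1))  # -1
```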
- gt(other, axis='columns', level=None)
Get Greater than of dataframe and other, element-wise (binary operator gt).
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
- Parameters
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- Returns
Result of the comparison.
- Return type
DataFrame of bool
See also
DataFrame.eq
Compare DataFrames for equality elementwise.
DataFrame.ne
Compare DataFrames for inequality elementwise.
DataFrame.le
Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt
Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge
Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt
Compare DataFrames for strictly greater than inequality elementwise.
Notes
See pandas API documentation for pandas.DataFrame.gt for more. Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300
Comparison with a scalar, using either the operator or method:
>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False
>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False
When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:
>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True
Use the method to control the broadcast axis:
>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True
When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:
>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False
Use the method to control the axis:
>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False
Compare to a DataFrame of different shape.
>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150
>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False
Compare to a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
- head(n=5)
Return the first n rows.
This function returns the first n rows for the object based on position. It is useful for quickly testing if your object has the right type of data in it.
For negative values of n, this function returns all rows except the last n rows, equivalent to df[:-n].
- Parameters
n (int, default 5) – Number of rows to select.
- Returns
The first n rows of the caller object.
- Return type
same type as caller
See also
DataFrame.tail
Returns the last n rows.
Examples
>>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
...                    'monkey', 'parrot', 'shark', 'whale', 'zebra']})
>>> df
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey
5     parrot
6      shark
7      whale
8      zebra
Viewing the first 5 lines
>>> df.head()
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey
Viewing the first n lines (three in this case)
>>> df.head(3)
      animal
0  alligator
1        bee
2     falcon
For negative values of n
>>> df.head(-3)
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey
5     parrot
Notes
See pandas API documentation for pandas.DataFrame.head for more.
- property iat
Access a single value for a row/column pair by integer position.
Similar to iloc, in that both provide integer-based lookups. Use iat if you only need to get or set a single value in a DataFrame or Series.
- Raises
IndexError – When integer position is out of bounds.
See also
DataFrame.at
Access a single value for a row/column label pair.
DataFrame.loc
Access a group of rows and columns by label(s).
DataFrame.iloc
Access a group of rows and columns by integer position(s).
Examples
>>> df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
...                   columns=['A', 'B', 'C'])
>>> df
    A   B   C
0   0   2   3
1   0   4   1
2  10  20  30
Get value at specified row/column pair
>>> df.iat[1, 2]
1
Set value at specified row/column pair
>>> df.iat[1, 2] = 10
>>> df.iat[1, 2]
10
Get value within a series
>>> df.loc[0].iat[1]
2
Notes
See pandas API documentation for pandas.DataFrame.iat for more.
- idxmax(axis=0, skipna=True)
Return index of first occurrence of maximum over requested axis.
NA/null values are excluded.
- Parameters
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
- Returns
Indexes of maxima along the specified axis.
- Return type
- Raises
ValueError – If the row/column is empty
See also
Series.idxmax
Return index of the maximum element.
Notes
See pandas API documentation for pandas.DataFrame.idxmax for more. This method is the DataFrame version of ndarray.argmax.
Examples
Consider a dataset containing food consumption in Argentina.
>>> df = pd.DataFrame({'consumption': [10.51, 103.11, 55.48],
...                    'co2_emissions': [37.2, 19.66, 1712]},
...                   index=['Pork', 'Wheat Products', 'Beef'])
>>> df
                consumption  co2_emissions
Pork                  10.51          37.20
Wheat Products       103.11          19.66
Beef                  55.48        1712.00
By default, it returns the index for the maximum value in each column.
>>> df.idxmax()
consumption     Wheat Products
co2_emissions             Beef
dtype: object
To return the index for the maximum value in each row, use axis="columns".
>>> df.idxmax(axis="columns")
Pork              co2_emissions
Wheat Products      consumption
Beef              co2_emissions
dtype: object
- idxmin(axis=0, skipna=True)
Return index of first occurrence of minimum over requested axis.
NA/null values are excluded.
- Parameters
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
- Returns
Indexes of minima along the specified axis.
- Return type
- Raises
ValueError – If the row/column is empty
See also
Series.idxmin
Return index of the minimum element.
Notes
See pandas API documentation for pandas.DataFrame.idxmin for more. This method is the DataFrame version of ndarray.argmin.
Examples
Consider a dataset containing food consumption in Argentina.
>>> df = pd.DataFrame({'consumption': [10.51, 103.11, 55.48],
...                    'co2_emissions': [37.2, 19.66, 1712]},
...                   index=['Pork', 'Wheat Products', 'Beef'])
>>> df
                consumption  co2_emissions
Pork                  10.51          37.20
Wheat Products       103.11          19.66
Beef                  55.48        1712.00
By default, it returns the index for the minimum value in each column.
>>> df.idxmin()
consumption                Pork
co2_emissions    Wheat Products
dtype: object
To return the index for the minimum value in each row, use axis="columns".
>>> df.idxmin(axis="columns")
Pork                consumption
Wheat Products    co2_emissions
Beef                consumption
dtype: object
- property iloc
Purely integer-location based indexing for selection by position.
.iloc[] is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a boolean array.
Allowed inputs are:
An integer, e.g. 5.
A list or array of integers, e.g. [4, 3, 0].
A slice object with ints, e.g. 1:7.
A boolean array.
A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above). This is useful in method chains, when you don’t have a reference to the calling object, but would like to base your selection on some value.
.iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which allow out-of-bounds indexing (this conforms with python/numpy slice semantics).
See more at Selection by Position.
See also
DataFrame.iat
Fast integer location scalar accessor.
DataFrame.loc
Purely label-location based indexer for selection by label.
Series.iloc
Purely integer-location based indexing for selection by position.
Examples
>>> mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
...           {'a': 100, 'b': 200, 'c': 300, 'd': 400},
...           {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000}]
>>> df = pd.DataFrame(mydict)
>>> df
      a     b     c     d
0     1     2     3     4
1   100   200   300   400
2  1000  2000  3000  4000
Indexing just the rows
With a scalar integer.
>>> type(df.iloc[0])
<class 'pandas.core.series.Series'>
>>> df.iloc[0]
a    1
b    2
c    3
d    4
Name: 0, dtype: int64
With a list of integers.
>>> df.iloc[[0]]
   a  b  c  d
0  1  2  3  4
>>> type(df.iloc[[0]])
<class 'pandas.core.frame.DataFrame'>
>>> df.iloc[[0, 1]]
     a    b    c    d
0    1    2    3    4
1  100  200  300  400
With a slice object.
>>> df.iloc[:3]
      a     b     c     d
0     1     2     3     4
1   100   200   300   400
2  1000  2000  3000  4000
With a boolean mask the same length as the index.
>>> df.iloc[[True, False, True]]
      a     b     c     d
0     1     2     3     4
2  1000  2000  3000  4000
With a callable, useful in method chains. The x passed to the lambda is the DataFrame being sliced. This selects the rows whose index label is even.
>>> df.iloc[lambda x: x.index % 2 == 0]
      a     b     c     d
0     1     2     3     4
2  1000  2000  3000  4000
Indexing both axes
You can mix the indexer types for the index and columns. Use : to select the entire axis.
With scalar integers.
>>> df.iloc[0, 1]
2
With lists of integers.
>>> df.iloc[[0, 2], [1, 3]]
      b     d
0     2     4
2  2000  4000
With slice objects.
>>> df.iloc[1:3, 0:3]
      a     b     c
1   100   200   300
2  1000  2000  3000
With a boolean array whose length matches the columns.
>>> df.iloc[:, [True, False, True, False]]
      a     c
0     1     3
1   100   300
2  1000  3000
With a callable function that expects the Series or DataFrame.
>>> df.iloc[:, lambda df: [0, 2]]
      a     c
0     1     3
1   100   300
2  1000  3000
Notes
See pandas API documentation for pandas.DataFrame.iloc for more.
- property index
Get the index for this DataFrame.
- Returns
The union of all indexes across the partitions.
- Return type
pandas.Index
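A short illustration of this property (plain pandas shown; with Modin the same code runs after import modin.pandas as pd, where the index is assembled as the union of the partition indexes):

```python
import pandas as pd  # with Modin: import modin.pandas as pd

df = pd.DataFrame({"A": [1, 2, 3]}, index=["x", "y", "z"])

# Even when the data is partitioned across workers, df.index is exposed
# as one ordinary pandas.Index covering all partitions.
print(list(df.index))  # ['x', 'y', 'z']
```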
- infer_objects()
Attempt to infer better dtypes for object columns.
Attempts soft conversion of object-dtyped columns, leaving non-object and unconvertible columns unchanged. The inference rules are the same as during normal Series/DataFrame construction.
- Returns
converted
- Return type
same type as input object
See also
to_datetime
Convert argument to datetime.
to_timedelta
Convert argument to timedelta.
to_numeric
Convert argument to numeric type.
convert_dtypes
Convert argument to best possible dtype.
Examples
>>> df = pd.DataFrame({"A": ["a", 1, 2, 3]})
>>> df = df.iloc[1:]
>>> df
   A
1  1
2  2
3  3
>>> df.dtypes
A    object
dtype: object
>>> df.infer_objects().dtypes
A    int64
dtype: object
Notes
See pandas API documentation for pandas.DataFrame.infer_objects for more.
- isin(values)
Whether each element in the DataFrame is contained in values.
- Parameters
values (iterable, Series, DataFrame or dict) – The result will only be true at a location if all the labels match. If values is a Series, that’s the index. If values is a dict, the keys must be the column names, which must match. If values is a DataFrame, then both the index and column labels must match.
- Returns
DataFrame of booleans showing whether each element in the DataFrame is contained in values.
- Return type
See also
DataFrame.eq
Equality test for DataFrame.
Series.isin
Equivalent method on Series.
Series.str.contains
Test if pattern or regex is contained within a string of a Series or Index.
Examples
>>> df = pd.DataFrame({'num_legs': [2, 4], 'num_wings': [2, 0]},
...                   index=['falcon', 'dog'])
>>> df
        num_legs  num_wings
falcon         2          2
dog            4          0
When values is a list, check whether every value in the DataFrame is present in the list (which animals have 0 or 2 legs or wings):
>>> df.isin([0, 2])
        num_legs  num_wings
falcon      True       True
dog        False       True
When values is a dict, we can pass values to check for each column separately:
>>> df.isin({'num_wings': [0, 3]})
        num_legs  num_wings
falcon     False      False
dog        False       True
When values is a Series or DataFrame, the index and column must match. Note that 'falcon' does not match based on the number of legs in other.
>>> other = pd.DataFrame({'num_legs': [8, 2], 'num_wings': [0, 2]},
...                      index=['spider', 'falcon'])
>>> df.isin(other)
        num_legs  num_wings
falcon      True       True
dog        False      False
Notes
See pandas API documentation for pandas.DataFrame.isin for more.
- isna()
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True. Everything else gets mapped to False. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).
- Returns
Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.
- Return type
DataFrame
See also
DataFrame.isnull
Alias of isna.
DataFrame.notna
Boolean inverse of isna.
DataFrame.dropna
Omit axes labels with missing values.
isna
Top-level isna.
Examples
Show which entries in a DataFrame are NA.
>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False
Show which entries in a Series are NA.
>>> ser = pd.Series([5, 6, np.NaN])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.isna()
0    False
1    False
2     True
dtype: bool
Notes
See pandas API documentation for pandas.DataFrame.isna for more.
- isnull()
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True. Everything else gets mapped to False. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).
- Returns
Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.
- Return type
DataFrame
See also
DataFrame.isnull
Alias of isna.
DataFrame.notna
Boolean inverse of isna.
DataFrame.dropna
Omit axes labels with missing values.
isna
Top-level isna.
Examples
Show which entries in a DataFrame are NA.
>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False
Show which entries in a Series are NA.
>>> ser = pd.Series([5, 6, np.NaN])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.isna()
0    False
1    False
2     True
dtype: bool
Notes
See pandas API documentation for pandas.DataFrame.isna for more.
- kurt(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
Return unbiased kurtosis over requested axis.
Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.
- Parameters
axis ({index (0), columns (1)}) – Axis for the function to be applied on.
skipna (bool, default True) – Exclude NA/null values when computing the result.
level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Returns
- Return type
Series or DataFrame (if level specified)
Notes
See pandas API documentation for pandas.DataFrame.kurt for more.
- kurtosis(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
Return unbiased kurtosis over requested axis.
Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.
- Parameters
axis ({index (0), columns (1)}) – Axis for the function to be applied on.
skipna (bool, default True) – Exclude NA/null values when computing the result.
level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Returns
- Return type
Series or DataFrame (if level specified)
Notes
See pandas API documentation for pandas.DataFrame.kurt for more.
- last(offset)
Select final periods of time series data based on a date offset.
For a DataFrame with a sorted DatetimeIndex, this function selects the last few rows based on a date offset.
- Parameters
offset (str, DateOffset, dateutil.relativedelta) – The offset length of the data that will be selected. For instance, ‘3D’ will display all the rows having their index within the last 3 days.
- Returns
A subset of the caller.
- Return type
same type as caller
- Raises
TypeError – If the index is not a DatetimeIndex
See also
first
Select initial periods of time series based on a date offset.
at_time
Select values at a particular time of the day.
between_time
Select values between particular times of the day.
Examples
>>> i = pd.date_range('2018-04-09', periods=4, freq='2D')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4
Get the rows for the last 3 days:
>>> ts.last('3D')
            A
2018-04-13  3
2018-04-15  4
Notice that the data for the last 3 calendar days were returned, not the last 3 observed days in the dataset, and therefore data for 2018-04-11 was not returned.
Notes
See pandas API documentation for pandas.DataFrame.last for more.
- last_valid_index()
Return index for last non-NA value or None, if no non-NA value is found.
- Returns
scalar
- Return type
type of index
Notes
See pandas API documentation for pandas.DataFrame.last_valid_index for more. If all elements are NA/null, returns None. Also returns None for empty Series/DataFrame.
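Since `last_valid_index` ships without a worked example above, here is a minimal sketch (plain pandas; Modin behaves the same after `import modin.pandas as pd`):

```python
import numpy as np
import pandas as pd  # with Modin: import modin.pandas as pd

s = pd.Series([np.nan, 1.0, 2.0, np.nan])
print(s.last_valid_index())   # 2, the label of the last non-NA value
print(s.first_valid_index())  # 1

# With no non-NA values at all, None is returned.
print(pd.Series([np.nan, np.nan]).last_valid_index())  # None
```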
- le(other, axis='columns', level=None)
Get Less than or equal to of dataframe and other, element-wise (binary operator le).
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
- Parameters
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- Returns
Result of the comparison.
- Return type
DataFrame of bool
See also
DataFrame.eq
Compare DataFrames for equality elementwise.
DataFrame.ne
Compare DataFrames for inequality elementwise.
DataFrame.le
Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt
Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge
Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt
Compare DataFrames for strictly greater than inequality elementwise.
Notes
See pandas API documentation for pandas.DataFrame.le for more. Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300
Comparison with a scalar, using either the operator or method:
>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False
>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False
When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:
>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True
Use the method to control the broadcast axis:
>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True
When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:
>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False
Use the method to control the axis:
>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False
Compare to a DataFrame of different shape.
>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150
>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False
Compare to a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
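The two alignment notes above (mismatched indices are unioned; NaN never compares equal) can be checked with a small sketch (plain pandas; Modin mirrors it):

```python
import numpy as np
import pandas as pd  # with Modin: import modin.pandas as pd

a = pd.Series([1.0, np.nan], index=["x", "y"])
b = pd.Series([1.0, np.nan], index=["y", "z"])

# The result covers the union of labels x, y, z. Labels present in only
# one operand become NaN before comparing, and NaN <= NaN is False.
result = a.le(b)
print(result)
```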
- property loc
Access a group of rows and columns by label(s) or a boolean array.
.loc[] is primarily label based, but may also be used with a boolean array.
Allowed inputs are:
- A single label, e.g. 5 or 'a' (note that 5 is interpreted as a label of the index, and never as an integer position along the index).
- A list or array of labels, e.g. ['a', 'b', 'c'].
- A slice object with labels, e.g. 'a':'f'. Warning: note that contrary to usual python slices, both the start and the stop are included.
- A boolean array of the same length as the axis being sliced, e.g. [True, False, True].
- An alignable boolean Series. The index of the key will be aligned before masking.
- An alignable Index. The Index of the returned selection will be the input.
- A callable function with one argument (the calling Series or DataFrame) that returns valid output for indexing (one of the above).
See more at Selection by Label.
- Raises
KeyError – If any items are not found.
IndexingError – If an indexed key is passed and its index is unalignable to the frame index.
See also
DataFrame.at
Access a single value for a row/column label pair.
DataFrame.iloc
Access group of rows and columns by integer position(s).
DataFrame.xs
Returns a cross-section (row(s) or column(s)) from the Series/DataFrame.
Series.loc
Access group of values using labels.
Examples
Getting values
>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
...                   index=['cobra', 'viper', 'sidewinder'],
...                   columns=['max_speed', 'shield'])
>>> df
            max_speed  shield
cobra               1       2
viper               4       5
sidewinder          7       8
Single label. Note this returns the row as a Series.
>>> df.loc['viper']
max_speed    4
shield       5
Name: viper, dtype: int64
List of labels. Note using [[]] returns a DataFrame.
>>> df.loc[['viper', 'sidewinder']]
            max_speed  shield
viper               4       5
sidewinder          7       8
Single label for row and column
>>> df.loc['cobra', 'shield']
2
Slice with labels for row and single label for column. As mentioned above, note that both the start and stop of the slice are included.
>>> df.loc['cobra':'viper', 'max_speed']
cobra    1
viper    4
Name: max_speed, dtype: int64
Boolean list with the same length as the row axis
>>> df.loc[[False, False, True]]
            max_speed  shield
sidewinder          7       8
Alignable boolean Series:
>>> df.loc[pd.Series([False, True, False],
...                  index=['viper', 'sidewinder', 'cobra'])]
            max_speed  shield
sidewinder          7       8
Index (same behavior as df.reindex)
>>> df.loc[pd.Index(["cobra", "viper"], name="foo")]
       max_speed  shield
foo
cobra          1       2
viper          4       5
Conditional that returns a boolean Series
>>> df.loc[df['shield'] > 6]
            max_speed  shield
sidewinder          7       8
Conditional that returns a boolean Series with column labels specified
>>> df.loc[df['shield'] > 6, ['max_speed']]
            max_speed
sidewinder          7
Callable that returns a boolean Series
>>> df.loc[lambda df: df['shield'] == 8]
            max_speed  shield
sidewinder          7       8
Setting values
Set value for all items matching the list of labels
>>> df.loc[['viper', 'sidewinder'], ['shield']] = 50
>>> df
            max_speed  shield
cobra               1       2
viper               4      50
sidewinder          7      50
Set value for an entire row
>>> df.loc['cobra'] = 10
>>> df
            max_speed  shield
cobra              10      10
viper               4      50
sidewinder          7      50
Set value for an entire column
>>> df.loc[:, 'max_speed'] = 30
>>> df
            max_speed  shield
cobra              30      10
viper              30      50
sidewinder         30      50
Set value for rows matching callable condition
>>> df.loc[df['shield'] > 35] = 0
>>> df
            max_speed  shield
cobra              30      10
viper               0       0
sidewinder          0       0
Getting values on a DataFrame with an index that has integer labels
Another example using integers for the index
>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
...                   index=[7, 8, 9], columns=['max_speed', 'shield'])
>>> df
   max_speed  shield
7          1       2
8          4       5
9          7       8
Slice with integer labels for rows. As mentioned above, note that both the start and stop of the slice are included.
>>> df.loc[7:9]
   max_speed  shield
7          1       2
8          4       5
9          7       8
Getting values with a MultiIndex
A number of examples using a DataFrame with a MultiIndex
>>> tuples = [
...     ('cobra', 'mark i'), ('cobra', 'mark ii'),
...     ('sidewinder', 'mark i'), ('sidewinder', 'mark ii'),
...     ('viper', 'mark ii'), ('viper', 'mark iii')
... ]
>>> index = pd.MultiIndex.from_tuples(tuples)
>>> values = [[12, 2], [0, 4], [10, 20],
...           [1, 4], [7, 1], [16, 36]]
>>> df = pd.DataFrame(values, columns=['max_speed', 'shield'], index=index)
>>> df
                     max_speed  shield
cobra      mark i           12       2
           mark ii           0       4
sidewinder mark i           10      20
           mark ii           1       4
viper      mark ii           7       1
           mark iii         16      36
Single label. Note this returns a DataFrame with a single index.
>>> df.loc['cobra']
         max_speed  shield
mark i          12       2
mark ii          0       4
Single index tuple. Note this returns a Series.
>>> df.loc[('cobra', 'mark ii')]
max_speed    0
shield       4
Name: (cobra, mark ii), dtype: int64
Single label for row and column. Similar to passing in a tuple, this returns a Series.
>>> df.loc['cobra', 'mark i']
max_speed    12
shield        2
Name: (cobra, mark i), dtype: int64
Single tuple. Note using [[]] returns a DataFrame.
>>> df.loc[[('cobra', 'mark ii')]]
               max_speed  shield
cobra mark ii          0       4
Single tuple for the index with a single label for the column
>>> df.loc[('cobra', 'mark i'), 'shield']
2
Slice from index tuple to single label
>>> df.loc[('cobra', 'mark i'):'viper']
                     max_speed  shield
cobra      mark i           12       2
           mark ii           0       4
sidewinder mark i           10      20
           mark ii           1       4
viper      mark ii           7       1
           mark iii         16      36
Slice from index tuple to index tuple
>>> df.loc[('cobra', 'mark i'):('viper', 'mark ii')]
                    max_speed  shield
cobra      mark i          12       2
           mark ii          0       4
sidewinder mark i          10      20
           mark ii          1       4
viper      mark ii          7       1
Notes
See pandas API documentation for pandas.DataFrame.loc for more.
- lt(other, axis='columns', level=None)
Get Less than of dataframe and other, element-wise (binary operator lt).
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
- Parameters
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- Returns
Result of the comparison.
- Return type
DataFrame of bool
See also
DataFrame.eq
Compare DataFrames for equality elementwise.
DataFrame.ne
Compare DataFrames for inequality elementwise.
DataFrame.le
Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt
Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge
Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt
Compare DataFrames for strictly greater than inequality elementwise.
Notes
See pandas API documentation for pandas.DataFrame.lt for more. Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300
Comparison with a scalar, using either the operator or method:
>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False
>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False
When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:
>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True
Use the method to control the broadcast axis:
>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True
When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:
>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False
Use the method to control the axis:
>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False
Compare to a DataFrame of different shape.
>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150
>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False
Compare to a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
- mad(axis=None, skipna=None, level=None)
Return the mean absolute deviation of the values over the requested axis.
- Parameters
axis ({index (0), columns (1)}) – Axis for the function to be applied on.
skipna (bool, default None) – Exclude NA/null values when computing the result.
level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
- Returns
- Return type
Series or DataFrame (if level specified)
Notes
See pandas API documentation for pandas.DataFrame.mad for more.
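`mad` ships without a worked example above, so here is the definition spelled out by hand: a plain-pandas sketch that computes the same quantity for a single column. The deviation is computed explicitly rather than via `.mad()` itself, since that method was removed in pandas 2.x:

```python
import pandas as pd  # with Modin: import modin.pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 6.0]})

# Mean absolute deviation: the mean of |value - column mean|.
# The column mean is 3.0; deviations are 2, 1, 0, 3, so the result is 1.5.
manual_mad = (df["a"] - df["a"].mean()).abs().mean()
print(manual_mad)  # 1.5
```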
- max(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
Return the maximum of the values over the requested axis.
If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.
- Parameters
axis ({index (0), columns (1)}) – Axis for the function to be applied on.
skipna (bool, default True) – Exclude NA/null values when computing the result.
level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Returns
- Return type
Series or DataFrame (if level specified)
See also
Series.sum
Return the sum.
Series.min
Return the minimum.
Series.max
Return the maximum.
Series.idxmin
Return the index of the minimum.
Series.idxmax
Return the index of the maximum.
DataFrame.sum
Return the sum over the requested axis.
DataFrame.min
Return the minimum over the requested axis.
DataFrame.max
Return the maximum over the requested axis.
DataFrame.idxmin
Return the index of the minimum over the requested axis.
DataFrame.idxmax
Return the index of the maximum over the requested axis.
Examples
>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> s
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> s.max()
8
Notes
See pandas API documentation for pandas.DataFrame.max for more.
- mean(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
Return the mean of the values over the requested axis.
- Parameters
axis ({index (0), columns (1)}) – Axis for the function to be applied on.
skipna (bool, default True) – Exclude NA/null values when computing the result.
level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Returns
- Return type
Series or DataFrame (if level specified)
Notes
See pandas API documentation for pandas.DataFrame.mean for more.
- median(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
Return the median of the values over the requested axis.
- Parameters
axis ({index (0), columns (1)}) – Axis for the function to be applied on.
skipna (bool, default True) – Exclude NA/null values when computing the result.
level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Returns
- Return type
Series or DataFrame (if level specified)
Notes
See pandas API documentation for pandas.DataFrame.median for more.
- memory_usage(index=True, deep=False)
Return the memory usage of each column in bytes.
The memory usage can optionally include the contribution of the index and elements of object dtype.
This value is displayed in DataFrame.info by default. This can be suppressed by setting pandas.options.display.memory_usage to False.
- Parameters
index (bool, default True) – Specifies whether to include the memory usage of the DataFrame’s index in returned Series. If index=True, the memory usage of the index is the first item in the output.
deep (bool, default False) – If True, introspect the data deeply by interrogating object dtypes for system-level memory consumption, and include it in the returned values.
- Returns
A Series whose index is the original column names and whose values are the memory usage of each column in bytes.
- Return type
Series
See also
numpy.ndarray.nbytes
Total bytes consumed by the elements of an ndarray.
Series.memory_usage
Bytes consumed by a Series.
Categorical
Memory-efficient array for string values with many repeated values.
DataFrame.info
Concise summary of a DataFrame.
Examples
>>> dtypes = ['int64', 'float64', 'complex128', 'object', 'bool']
>>> data = dict([(t, np.ones(shape=5000, dtype=int).astype(t))
...              for t in dtypes])
>>> df = pd.DataFrame(data)
>>> df.head()
   int64  float64  complex128 object  bool
0      1      1.0    1.0+0.0j      1  True
1      1      1.0    1.0+0.0j      1  True
2      1      1.0    1.0+0.0j      1  True
3      1      1.0    1.0+0.0j      1  True
4      1      1.0    1.0+0.0j      1  True
>>> df.memory_usage()
Index           128
int64         40000
float64       40000
complex128    80000
object        40000
bool           5000
dtype: int64
>>> df.memory_usage(index=False)
int64         40000
float64       40000
complex128    80000
object        40000
bool           5000
dtype: int64
The memory footprint of object dtype columns is ignored by default:
>>> df.memory_usage(deep=True)
Index            128
int64          40000
float64        40000
complex128     80000
object        180000
bool            5000
dtype: int64
Use a Categorical for efficient storage of an object-dtype column with many repeated values.
>>> df['object'].astype('category').memory_usage(deep=True)
5244
Notes
See pandas API documentation for pandas.DataFrame.memory_usage for more.
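The shallow-versus-deep distinction above can be checked directly (plain pandas; Modin mirrors it after `import modin.pandas as pd`; the column names and sizes below are illustrative):

```python
import pandas as pd  # with Modin: import modin.pandas as pd

df = pd.DataFrame({"tag": ["alpha"] * 1000, "n": range(1000)})

shallow = df.memory_usage()        # object column counted as 8-byte references
deep = df.memory_usage(deep=True)  # also counts the Python strings themselves

print(deep["tag"] > shallow["tag"])  # deep accounting is strictly larger
print(deep["n"] == shallow["n"])     # numeric columns are unaffected by deep
```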
- min(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
Return the minimum of the values over the requested axis.
If you want the index of the minimum, use idxmin. This is the equivalent of the numpy.ndarray method argmin.
- Parameters
axis ({index (0), columns (1)}) – Axis for the function to be applied on.
skipna (bool, default True) – Exclude NA/null values when computing the result.
level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Returns
- Return type
Series or DataFrame (if level specified)
See also
Series.sum
Return the sum.
Series.min
Return the minimum.
Series.max
Return the maximum.
Series.idxmin
Return the index of the minimum.
Series.idxmax
Return the index of the maximum.
DataFrame.sum
Return the sum over the requested axis.
DataFrame.min
Return the minimum over the requested axis.
DataFrame.max
Return the maximum over the requested axis.
DataFrame.idxmin
Return the index of the minimum over the requested axis.
DataFrame.idxmax
Return the index of the maximum over the requested axis.
Examples
>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> s
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> s.min()
0
Notes
See pandas API documentation for pandas.DataFrame.min for more.
- mod(other, axis='columns', level=None, fill_value=None)
Get Modulo of dataframe and other, element-wise (binary operator mod).
Equivalent to dataframe % other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rmod.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns
Result of the arithmetic operation.
- Return type
DataFrame
See also
DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.
Notes
See pandas API documentation for pandas.DataFrame.mod for more. Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360
Add a scalar with the operator version, which returns the same results.
>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
Divide by constant with reverse version.
>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
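The fill_value semantics above apply to `mod` like any other flexible arithmetic wrapper; a minimal sketch (plain pandas; Modin mirrors it after `import modin.pandas as pd`):

```python
import numpy as np
import pandas as pd  # with Modin: import modin.pandas as pd

a = pd.Series([5.0, np.nan])
b = pd.Series([3.0, 2.0])

# fill_value substitutes 0 for the missing value in `a` before computing,
# so the second entry becomes 0 % 2 = 0.0 instead of NaN.
result = a.mod(b, fill_value=0)
print(result.tolist())  # [2.0, 0.0]
```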
- mode(axis=0, numeric_only=False, dropna=True)
Get the mode(s) of each element along the selected axis.
The mode of a set of values is the value that appears most often. It can be multiple values.
- Parameters
axis ({0 or 'index', 1 or 'columns'}, default 0) –
The axis to iterate over while searching for the mode:
0 or ‘index’ : get mode of each column
1 or ‘columns’ : get mode of each row.
numeric_only (bool, default False) – If True, only apply to numeric columns.
dropna (bool, default True) – Don’t consider counts of NaN/NaT.
- Returns
The modes of each column or row.
- Return type
DataFrame
See also
Series.mode
Return the highest frequency value in a Series.
Series.value_counts
Return the counts of values in a Series.
Examples
>>> df = pd.DataFrame([('bird', 2, 2),
...                    ('mammal', 4, np.nan),
...                    ('arthropod', 8, 0),
...                    ('bird', 2, np.nan)],
...                   index=('falcon', 'horse', 'spider', 'ostrich'),
...                   columns=('species', 'legs', 'wings'))
>>> df
           species  legs  wings
falcon        bird     2    2.0
horse       mammal     4    NaN
spider   arthropod     8    0.0
ostrich       bird     2    NaN
By default, missing values are not considered, and the modes of wings are both 0 and 2. Because the resulting DataFrame has two rows, the second row of species and legs contains NaN.
>>> df.mode()
  species  legs  wings
0    bird   2.0    0.0
1     NaN   NaN    2.0
Setting dropna=False, NaN values are considered and they can be the mode (like for wings).
>>> df.mode(dropna=False)
  species  legs  wings
0    bird     2    NaN
Setting numeric_only=True, only the mode of numeric columns is computed, and columns of other types are ignored.
>>> df.mode(numeric_only=True)
   legs  wings
0   2.0    0.0
1   NaN    2.0
To compute the mode over columns and not rows, use the axis parameter:
>>> df.mode(axis='columns', numeric_only=True)
           0    1
falcon   2.0  NaN
horse    4.0  NaN
spider   0.0  8.0
ostrich  2.0  NaN
Notes
See pandas API documentation for pandas.DataFrame.mode for more.
- mul(other, axis='columns', level=None, fill_value=None)
Get Multiplication of dataframe and other, element-wise (binary operator mul).
Equivalent to dataframe * other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rmul.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns
Result of the arithmetic operation.
- Return type
See also
DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.
Notes
See pandas API documentation for pandas.DataFrame.mul for more. Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360
Add a scalar with the operator version, which returns the same results.
>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
Divide by a constant with the reverse version.
>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778
Subtract a list and a Series by axis with the operator version.
>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359
Multiply a DataFrame of a different shape with the operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
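The reverse version mentioned above, rmul, swaps the two operands (other * dataframe); for a scalar operand the two directions coincide. A minimal sketch using plain pandas, whose API modin.pandas mirrors:

```python
import pandas as pd

df = pd.DataFrame({'angles': [0, 3, 4]},
                  index=['circle', 'triangle', 'rectangle'])

# df.rmul(2) computes 2 * df, while df.mul(2) computes df * 2;
# multiplication by a scalar is commutative, so both agree here.
a = df.mul(2)
b = df.rmul(2)
print(a.equals(b))  # -> True
print(a['angles'].tolist())  # -> [0, 6, 8]
```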
- multiply(other, axis='columns', level=None, fill_value=None)
Get Multiplication of dataframe and other, element-wise (binary operator mul).
Equivalent to dataframe * other, but with support to substitute a fill_value for missing data in either of the inputs. Its reverse version is rmul.
Among the flexible wrappers (add, sub, mul, div, mod, pow) to the arithmetic operators: +, -, *, /, //, %, **.
- Parameters
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns
Result of the arithmetic operation.
- Return type
See also
DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.
Notes
See pandas API documentation for pandas.DataFrame.mul for more. Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360
Add a scalar with the operator version, which returns the same results.
>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
Divide by a constant with the reverse version.
>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778
Subtract a list and a Series by axis with the operator version.
>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359
Multiply a DataFrame of a different shape with the operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
- ne(other, axis='columns', level=None)
Get Not equal to of dataframe and other, element-wise (binary operator ne).
Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.
Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.
- Parameters
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
- Returns
Result of the comparison.
- Return type
DataFrame of bool
See also
DataFrame.eq
Compare DataFrames for equality elementwise.
DataFrame.ne
Compare DataFrames for inequality elementwise.
DataFrame.le
Compare DataFrames for less than inequality or equality elementwise.
DataFrame.lt
Compare DataFrames for strictly less than inequality elementwise.
DataFrame.ge
Compare DataFrames for greater than inequality or equality elementwise.
DataFrame.gt
Compare DataFrames for strictly greater than inequality elementwise.
Notes
See pandas API documentation for pandas.DataFrame.ne for more. Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300
Comparison with a scalar, using either the operator or the method:
>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False
>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False
When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:
>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True
Use the method to control the broadcast axis:
>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True
When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:
>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False
Use the method to control the axis:
>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False
Compare to a DataFrame of a different shape.
>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150
>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False
Compare to a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
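As the Notes state, NaN values are considered different, so ne() reports True for a NaN cell even when a frame is compared against itself. A small illustration using plain pandas, whose API modin.pandas mirrors:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan]})

# NaN != NaN: the NaN cell compares unequal to itself
print(df.ne(df)['x'].tolist())  # -> [False, True]
```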
- notna()
Detect existing (non-missing) values.
Return a boolean same-sized object indicating whether the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False.
- Returns
Mask of bool values for each element in DataFrame that indicates whether an element is not an NA value.
- Return type
See also
DataFrame.notnull
Alias of notna.
DataFrame.isna
Boolean inverse of notna.
DataFrame.dropna
Omit axes labels with missing values.
notna
Top-level notna.
Examples
Show which entries in a DataFrame are not NA.
>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.notna()
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True
Show which entries in a Series are not NA.
>>> ser = pd.Series([5, 6, np.NaN])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.notna()
0     True
1     True
2    False
dtype: bool
Notes
See pandas API documentation for pandas.DataFrame.notna for more.
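The treatment of numpy.inf described above can be checked directly: by default it counts as a valid (non-NA) value, and only genuine NA markers map to False. A quick sketch with plain pandas, whose API modin.pandas mirrors:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, np.inf, np.nan])

# inf is a valid value by default; only NaN is treated as NA
print(ser.notna().tolist())  # -> [True, True, False]
```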
- notnull()
Detect existing (non-missing) values.
Return a boolean same-sized object indicating whether the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False.
- Returns
Mask of bool values for each element in DataFrame that indicates whether an element is not an NA value.
- Return type
See also
DataFrame.notnull
Alias of notna.
DataFrame.isna
Boolean inverse of notna.
DataFrame.dropna
Omit axes labels with missing values.
notna
Top-level notna.
Examples
Show which entries in a DataFrame are not NA.
>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.notna()
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True
Show which entries in a Series are not NA.
>>> ser = pd.Series([5, 6, np.NaN])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.notna()
0     True
1     True
2    False
dtype: bool
Notes
See pandas API documentation for pandas.DataFrame.notna for more.
- nunique(axis=0, dropna=True)
Count number of distinct elements in specified axis.
Return Series with number of distinct elements. Can ignore NaN values.
- Parameters
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
dropna (bool, default True) – Don’t include NaN in the counts.
- Returns
- Return type
See also
Series.nunique
Method nunique for Series.
DataFrame.count
Count non-NA cells for each column or row.
Examples
>>> df = pd.DataFrame({'A': [4, 5, 6], 'B': [4, 1, 1]})
>>> df.nunique()
A    3
B    2
dtype: int64
>>> df.nunique(axis=1)
0    1
1    2
2    2
dtype: int64
Notes
See pandas API documentation for pandas.DataFrame.nunique for more.
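The dropna parameter above controls whether NaN counts as its own distinct value. A quick sketch with plain pandas, whose API modin.pandas mirrors:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan]})

# By default NaN is ignored; with dropna=False it counts as one more value
print(df.nunique()['A'])              # -> 2
print(df.nunique(dropna=False)['A'])  # -> 3
```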
- pad(axis=None, inplace=False, limit=None, downcast=None)
Synonym for DataFrame.fillna() with method='ffill'.
- Returns
Object with missing values filled, or None if inplace=True.
- Return type
Series/DataFrame or None
Notes
See pandas API documentation for pandas.DataFrame.pad for more.
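Since pad is a synonym for forward filling, the behaviour is easy to check. The sketch below uses DataFrame.ffill(), the spelling recent pandas prefers for the same operation; plain pandas is shown, and modin.pandas mirrors the API:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.0, np.nan, 3.0, np.nan]})

# Forward fill: each NaN takes the last valid value above it,
# which is exactly what pad() / fillna(method='ffill') do.
filled = df.ffill()
print(filled['A'].tolist())  # -> [1.0, 1.0, 3.0, 3.0]
```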
- pct_change(periods=1, fill_method='pad', limit=None, freq=None, **kwargs)
Percentage change between the current and a prior element.
Computes the percentage change from the immediately previous row by default. This is useful in comparing the percentage of change in a time series of elements.
- Parameters
periods (int, default 1) – Periods to shift for forming percent change.
fill_method (str, default 'pad') – How to handle NAs before computing percent changes.
limit (int, default None) – The number of consecutive NAs to fill before stopping.
freq (DateOffset, timedelta, or str, optional) – Increment to use from time series API (e.g. ‘M’ or BDay()).
**kwargs – Additional keyword arguments are passed into DataFrame.shift or Series.shift.
- Returns
chg – The same type as the calling object.
- Return type
See also
Series.diff
Compute the difference of two elements in a Series.
DataFrame.diff
Compute the difference of two elements in a DataFrame.
Series.shift
Shift the index by some number of periods.
DataFrame.shift
Shift the index by some number of periods.
Examples
Series
>>> s = pd.Series([90, 91, 85])
>>> s
0    90
1    91
2    85
dtype: int64
>>> s.pct_change()
0         NaN
1    0.011111
2   -0.065934
dtype: float64
>>> s.pct_change(periods=2)
0         NaN
1         NaN
2   -0.055556
dtype: float64
See the percentage change in a Series where NAs are filled by carrying the last valid observation forward to the next valid one.
>>> s = pd.Series([90, 91, None, 85])
>>> s
0    90.0
1    91.0
2     NaN
3    85.0
dtype: float64
>>> s.pct_change(fill_method='ffill')
0         NaN
1    0.011111
2    0.000000
3   -0.065934
dtype: float64
DataFrame
Percentage change in French franc, Deutsche Mark, and Italian lira from 1980-01-01 to 1980-03-01.
>>> df = pd.DataFrame({
...     'FR': [4.0405, 4.0963, 4.3149],
...     'GR': [1.7246, 1.7482, 1.8519],
...     'IT': [804.74, 810.01, 860.13]},
...     index=['1980-01-01', '1980-02-01', '1980-03-01'])
>>> df
                FR      GR      IT
1980-01-01  4.0405  1.7246  804.74
1980-02-01  4.0963  1.7482  810.01
1980-03-01  4.3149  1.8519  860.13
>>> df.pct_change()
                  FR        GR        IT
1980-01-01       NaN       NaN       NaN
1980-02-01  0.013810  0.013684  0.006549
1980-03-01  0.053365  0.059318  0.061876
Percentage change in GOOG and APPL stock volume, computing the change between columns.
>>> df = pd.DataFrame({
...     '2016': [1769950, 30586265],
...     '2015': [1500923, 40912316],
...     '2014': [1371819, 41403351]},
...     index=['GOOG', 'APPL'])
>>> df
          2016      2015      2014
GOOG   1769950   1500923   1371819
APPL  30586265  40912316  41403351
>>> df.pct_change(axis='columns', periods=-1)
          2016      2015  2014
GOOG  0.179241  0.094112   NaN
APPL -0.252395 -0.011860   NaN
Notes
See pandas API documentation for pandas.DataFrame.pct_change for more.
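After any NA filling, pct_change is equivalent to dividing by the shifted values and subtracting 1. A sketch of that identity with plain pandas (no NAs, so no filling is involved); modin.pandas mirrors the API:

```python
import pandas as pd

s = pd.Series([90, 91, 85])

# Manual equivalent of pct_change(): value / previous value - 1
manual = s / s.shift(1) - 1
print(manual.round(6).tolist())  # first element is NaN, then 0.011111, -0.065934

# Element-wise agreement with the built-in method
print(s.pct_change().round(6).equals(manual.round(6)))  # -> True
```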
- pipe(func, *args, **kwargs)
Apply func(self, *args, **kwargs).
- Parameters
func (function) – Function to apply to the Series/DataFrame. args and kwargs are passed into func. Alternatively, a (callable, data_keyword) tuple where data_keyword is a string indicating the keyword of callable that expects the Series/DataFrame.
args (iterable, optional) – Positional arguments passed into func.
kwargs (mapping, optional) – A dictionary of keyword arguments passed into func.
- Returns
object
- Return type
the return type of func.
See also
DataFrame.apply
Apply a function along input axis of DataFrame.
DataFrame.applymap
Apply a function elementwise on a whole DataFrame.
Series.map
Apply a mapping correspondence on a Series.
Notes
See pandas API documentation for pandas.DataFrame.pipe for more. Use .pipe when chaining together functions that expect Series, DataFrames, or GroupBy objects. Instead of writing
>>> func(g(h(df), arg1=a), arg2=b, arg3=c)
you can write
>>> (df.pipe(h)
...    .pipe(g, arg1=a)
...    .pipe(func, arg2=b, arg3=c)
... )
If you have a function that takes the data as (say) the second argument, pass a tuple indicating which keyword expects the data. For example, suppose func takes its data as arg2:
>>> (df.pipe(h)
...    .pipe(g, arg1=a)
...    .pipe((func, 'arg2'), arg1=a, arg3=c)
... )
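The chaining pattern above uses placeholder functions. A self-contained version with concrete, hypothetical helpers (double and add_const stand in for the h, g, and func of the example):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

# Hypothetical helpers for illustration only
def double(frame):
    return frame * 2

def add_const(frame, amount=0):
    return frame + amount

# Each pipe() call passes the intermediate frame as the first argument
result = (df.pipe(double)
            .pipe(add_const, amount=10))
print(result['a'].tolist())  # -> [12, 14, 16]
```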
- pop(item)
Return item and drop from frame. Raise KeyError if not found.
- Parameters
item (label) – Label of column to be popped.
- Returns
- Return type
Examples
>>> df = pd.DataFrame([('falcon', 'bird', 389.0),
...                    ('parrot', 'bird', 24.0),
...                    ('lion', 'mammal', 80.5),
...                    ('monkey', 'mammal', np.nan)],
...                   columns=('name', 'class', 'max_speed'))
>>> df
     name   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN
>>> df.pop('class')
0      bird
1      bird
2    mammal
3    mammal
Name: class, dtype: object
>>> df
     name  max_speed
0  falcon      389.0
1  parrot       24.0
2    lion       80.5
3  monkey        NaN
Notes
See pandas API documentation for pandas.DataFrame.pop for more.
- pow(other, axis='columns', level=None, fill_value=None)
Get Exponential power of dataframe and other, element-wise (binary operator pow).
Equivalent to dataframe ** other, but with support to substitute a fill_value for missing data in either of the inputs. Its reverse version is rpow.
Among the flexible wrappers (add, sub, mul, div, mod, pow) to the arithmetic operators: +, -, *, /, //, %, **.
- Parameters
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns
Result of the arithmetic operation.
- Return type
See also
DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.
Notes
See pandas API documentation for pandas.DataFrame.pow for more. Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360
Add a scalar with the operator version, which returns the same results.
>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
Divide by a constant with the reverse version.
>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778
Subtract a list and a Series by axis with the operator version.
>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359
Multiply a DataFrame of a different shape with the operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
- quantile(q=0.5, axis=0, numeric_only=True, interpolation='linear')
Return values at the given quantile over requested axis.
- Parameters
q (float or array-like, default 0.5 (50% quantile)) – Value between 0 <= q <= 1, the quantile(s) to compute.
axis ({0, 1, 'index', 'columns'}, default 0) – Equals 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
numeric_only (bool, default True) – If False, the quantile of datetime and timedelta data will be computed as well.
interpolation ({'linear', 'lower', 'higher', 'midpoint', 'nearest'}) –
This optional parameter specifies the interpolation method to use, when the desired quantile lies between two data points i and j:
linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.
lower: i.
higher: j.
nearest: i or j whichever is nearest.
midpoint: (i + j) / 2.
- Returns
- If q is an array, a DataFrame will be returned where the index is q, the columns are the columns of self, and the values are the quantiles.
- If q is a float, a Series will be returned where the index is the columns of self and the values are the quantiles.
- Return type
See also
core.window.Rolling.quantile
Rolling quantile.
numpy.percentile
Numpy function to compute the percentile.
Examples
>>> df = pd.DataFrame(np.array([[1, 1], [2, 10], [3, 100], [4, 100]]),
...                   columns=['a', 'b'])
>>> df.quantile(.1)
a    1.3
b    3.7
Name: 0.1, dtype: float64
>>> df.quantile([.1, .5])
       a     b
0.1  1.3   3.7
0.5  2.5  55.0
Specifying numeric_only=False will also compute the quantile of datetime and timedelta data.
>>> df = pd.DataFrame({'A': [1, 2],
...                    'B': [pd.Timestamp('2010'),
...                          pd.Timestamp('2011')],
...                    'C': [pd.Timedelta('1 days'),
...                          pd.Timedelta('2 days')]})
>>> df.quantile(0.5, numeric_only=False)
A                    1.5
B    2010-07-02 12:00:00
C        1 days 12:00:00
Name: 0.5, dtype: object
Notes
See pandas API documentation for pandas.DataFrame.quantile for more.
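The interpolation choices listed above only differ when the requested quantile falls between two data points i and j. A quick comparison with plain pandas, whose API modin.pandas mirrors:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])

# The 0.5 quantile falls exactly between the data points 2 and 3
print(s.quantile(0.5, interpolation='linear'))    # -> 2.5
print(s.quantile(0.5, interpolation='lower'))     # -> 2.0  (takes i)
print(s.quantile(0.5, interpolation='higher'))    # -> 3.0  (takes j)
print(s.quantile(0.5, interpolation='midpoint'))  # -> 2.5  ((i + j) / 2)
```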
- radd(other, axis='columns', level=None, fill_value=None)
Get Addition of dataframe and other, element-wise (binary operator add).
Equivalent to dataframe + other, but with support to substitute a fill_value for missing data in either of the inputs. Its reverse version is radd.
Among the flexible wrappers (add, sub, mul, div, mod, pow) to the arithmetic operators: +, -, *, /, //, %, **.
- Parameters
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns
Result of the arithmetic operation.
- Return type
See also
DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.
Notes
See pandas API documentation for pandas.DataFrame.add for more. Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360
Add a scalar with the operator version, which returns the same results.
>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
Divide by a constant with the reverse version.
>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778
Subtract a list and a Series by axis with the operator version.
>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359
Multiply a DataFrame of a different shape with the operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
- rank(axis=0, method='average', numeric_only=None, na_option='keep', ascending=True, pct=False)
Compute numerical data ranks (1 through n) along axis.
By default, equal values are assigned a rank that is the average of the ranks of those values.
- Parameters
axis ({0 or 'index', 1 or 'columns'}, default 0) – Index to direct ranking.
method ({'average', 'min', 'max', 'first', 'dense'}, default 'average') –
How to rank the group of records that have the same value (i.e. ties):
average: average rank of the group
min: lowest rank in the group
max: highest rank in the group
first: ranks assigned in order they appear in the array
dense: like ‘min’, but rank always increases by 1 between groups.
numeric_only (bool, optional) – For DataFrame objects, rank only numeric columns if set to True.
na_option ({'keep', 'top', 'bottom'}, default 'keep') –
How to rank NaN values:
keep: assign NaN rank to NaN values
top: assign lowest rank to NaN values
bottom: assign highest rank to NaN values
ascending (bool, default True) – Whether or not the elements should be ranked in ascending order.
pct (bool, default False) – Whether or not to display the returned rankings in percentile form.
- Returns
Return a Series or DataFrame with data ranks as values.
- Return type
same type as caller
See also
core.groupby.GroupBy.rank
Rank of values within each group.
Examples
>>> df = pd.DataFrame(data={'Animal': ['cat', 'penguin', 'dog',
...                                    'spider', 'snake'],
...                         'Number_legs': [4, 2, 4, 8, np.nan]})
>>> df
    Animal  Number_legs
0      cat          4.0
1  penguin          2.0
2      dog          4.0
3   spider          8.0
4    snake          NaN
The following example shows how the method behaves with the above parameters:
default_rank: this is the default behaviour obtained without using any parameter.
max_rank: setting method = 'max', records that have the same values are ranked using the highest rank (e.g. since 'cat' and 'dog' are both in the 2nd and 3rd position, rank 3 is assigned).
NA_bottom: choosing na_option = 'bottom', records with NaN values are placed at the bottom of the ranking.
pct_rank: setting pct = True, the ranking is expressed as a percentile rank.
>>> df['default_rank'] = df['Number_legs'].rank()
>>> df['max_rank'] = df['Number_legs'].rank(method='max')
>>> df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
>>> df['pct_rank'] = df['Number_legs'].rank(pct=True)
>>> df
    Animal  Number_legs  default_rank  max_rank  NA_bottom  pct_rank
0      cat          4.0           2.5       3.0        2.5     0.625
1  penguin          2.0           1.0       1.0        1.0     0.250
2      dog          4.0           2.5       3.0        2.5     0.625
3   spider          8.0           4.0       4.0        4.0     1.000
4    snake          NaN           NaN       NaN        5.0       NaN
Notes
See pandas API documentation for pandas.DataFrame.rank for more.
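The tie-handling methods above can be compared side by side on a single Series with one tie. A compact sketch with plain pandas, whose API modin.pandas mirrors:

```python
import pandas as pd

s = pd.Series([7, 3, 7, 1])

# The two 7s tie for ranks 3 and 4
print(s.rank().tolist())                # average: [3.5, 2.0, 3.5, 1.0]
print(s.rank(method='min').tolist())    # min:     [3.0, 2.0, 3.0, 1.0]
print(s.rank(method='dense').tolist())  # dense:   [3.0, 2.0, 3.0, 1.0]
print(s.rank(method='first').tolist())  # first:   [3.0, 2.0, 4.0, 1.0]
```

Note that 'dense' only differs from 'min' when the tie is followed by more values: the next distinct value would get rank 4 under 'dense' but rank 5 under 'min'.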
- rdiv(other, axis='columns', level=None, fill_value=None)
Get Floating division of dataframe and other, element-wise (binary operator rtruediv).
Equivalent to other / dataframe, but with support to substitute a fill_value for missing data in either of the inputs. Its reverse version is truediv.
Among the flexible wrappers (add, sub, mul, div, mod, pow) to the arithmetic operators: +, -, *, /, //, %, **.
- Parameters
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns
Result of the arithmetic operation.
- Return type
See also
DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.
Notes
See pandas API documentation for pandas.DataFrame.rtruediv for more. Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with operator version which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
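The shared examples above exercise the reverse family only through rdiv; since rtruediv is the same operation under its canonical name, here is a minimal sketch (plain pandas shown; the modin.pandas call is identical):

```python
import numpy as np
import pandas as pd  # with Modin: import modin.pandas as pd

df = pd.DataFrame({'angles': [0, 3, 4]},
                  index=['circle', 'triangle', 'rectangle'])

# rtruediv swaps the operands: df.rtruediv(10) computes 10 / df
result = df.rtruediv(10)

# Division by zero yields inf rather than raising, as in the rdiv example
assert np.isinf(result.loc['circle', 'angles'])
assert result.equals(10 / df)
```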
- reindex(index=None, columns=None, copy=True, **kwargs)
Conform Series/DataFrame to new index with optional filling logic.
Places NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False.
- Parameters
keywords for axes (array-like, optional) – New labels / index to conform to, should be specified using keywords. Preferably an Index object to avoid duplicating data.
method ({None, 'backfill'/'bfill', 'pad'/'ffill', 'nearest'}) –
Method to use for filling holes in reindexed DataFrame. Please note: this is only applicable to DataFrames/Series with a monotonically increasing/decreasing index.
None (default): don’t fill gaps
pad / ffill: Propagate last valid observation forward to next valid.
backfill / bfill: Use next valid observation to fill gap.
nearest: Use nearest valid observations to fill gap.
copy (bool, default True) – Return a new object, even if the passed indexes are the same.
level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (scalar, default np.NaN) – Value to use for missing values. Defaults to NaN, but can be any “compatible” value.
limit (int, default None) – Maximum number of consecutive elements to forward or backward fill.
tolerance (optional) – Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations must satisfy the equation abs(index[indexer] - target) <= tolerance. Tolerance may be a scalar value, which applies the same tolerance to all values, or list-like, which applies variable tolerance per element. List-like includes list, tuple, array, Series, and must be the same size as the index and its dtype must exactly match the index's type.
- Returns
Series/DataFrame with changed index.
- Return type
Series/DataFrame
See also
DataFrame.set_index
Set row labels.
DataFrame.reset_index
Remove row labels or move them to new columns.
DataFrame.reindex_like
Change to same indices as other DataFrame.
Examples
DataFrame.reindex supports two calling conventions:
(index=index_labels, columns=column_labels, ...)
(labels, axis={'index', 'columns'}, ...)
We highly recommend using keyword arguments to clarify your intent.
Create a dataframe with some fictional data.
>>> index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
>>> df = pd.DataFrame({'http_status': [200, 200, 404, 404, 301],
...                    'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]},
...                   index=index)
>>> df
           http_status  response_time
Firefox            200           0.04
Chrome             200           0.02
Safari             404           0.07
IE10               404           0.08
Konqueror          301           1.00
Create a new index and reindex the dataframe. By default values in the new index that do not have corresponding records in the dataframe are assigned NaN.

>>> new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10',
...              'Chrome']
>>> df.reindex(new_index)
               http_status  response_time
Safari               404.0           0.07
Iceweasel              NaN            NaN
Comodo Dragon          NaN            NaN
IE10                 404.0           0.08
Chrome               200.0           0.02
We can fill in the missing values by passing a value to the keyword fill_value. Because the index is not monotonically increasing or decreasing, we cannot use arguments to the keyword method to fill the NaN values.

>>> df.reindex(new_index, fill_value=0)
               http_status  response_time
Safari                 404           0.07
Iceweasel                0           0.00
Comodo Dragon            0           0.00
IE10                   404           0.08
Chrome                 200           0.02

>>> df.reindex(new_index, fill_value='missing')
              http_status response_time
Safari                404          0.07
Iceweasel         missing       missing
Comodo Dragon     missing       missing
IE10                  404          0.08
Chrome                200          0.02
We can also reindex the columns.
>>> df.reindex(columns=['http_status', 'user_agent'])
           http_status  user_agent
Firefox            200         NaN
Chrome             200         NaN
Safari             404         NaN
IE10               404         NaN
Konqueror          301         NaN
Or we can use “axis-style” keyword arguments:

>>> df.reindex(['http_status', 'user_agent'], axis="columns")
           http_status  user_agent
Firefox            200         NaN
Chrome             200         NaN
Safari             404         NaN
IE10               404         NaN
Konqueror          301         NaN
To further illustrate the filling functionality in reindex, we will create a dataframe with a monotonically increasing index (for example, a sequence of dates).

>>> date_index = pd.date_range('1/1/2010', periods=6, freq='D')
>>> df2 = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]},
...                    index=date_index)
>>> df2
            prices
2010-01-01   100.0
2010-01-02   101.0
2010-01-03     NaN
2010-01-04   100.0
2010-01-05    89.0
2010-01-06    88.0
Suppose we decide to expand the dataframe to cover a wider date range.
>>> date_index2 = pd.date_range('12/29/2009', periods=10, freq='D')
>>> df2.reindex(date_index2)
            prices
2009-12-29     NaN
2009-12-30     NaN
2009-12-31     NaN
2010-01-01   100.0
2010-01-02   101.0
2010-01-03     NaN
2010-01-04   100.0
2010-01-05    89.0
2010-01-06    88.0
2010-01-07     NaN
The index entries that did not have a value in the original data frame (for example, ‘2009-12-29’) are by default filled with NaN. If desired, we can fill in the missing values using one of several options.

For example, to back-propagate the last valid value to fill the NaN values, pass bfill as an argument to the method keyword.

>>> df2.reindex(date_index2, method='bfill')
            prices
2009-12-29   100.0
2009-12-30   100.0
2009-12-31   100.0
2010-01-01   100.0
2010-01-02   101.0
2010-01-03     NaN
2010-01-04   100.0
2010-01-05    89.0
2010-01-06    88.0
2010-01-07     NaN
Please note that the NaN value present in the original dataframe (at index value 2010-01-03) will not be filled by any of the value propagation schemes. This is because filling while reindexing does not look at dataframe values, but only compares the original and desired indexes. If you do want to fill in the NaN values present in the original dataframe, use the fillna() method.
See the user guide for more.
Notes
See pandas API documentation for pandas.DataFrame.reindex for more.
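The method and tolerance parameters interact: with method='nearest', a new label is filled only when its closest existing label lies within tolerance. A minimal sketch (the frame and column name are made up for illustration):

```python
import math
import pandas as pd  # with Modin: import modin.pandas as pd

df = pd.DataFrame({'price': [1.0, 2.0, 3.0]}, index=[0, 5, 10])

# 0 matches exactly, 4 is within 1 of label 5, but 7 is 2 away from
# its nearest label (5), so it stays NaN
out = df.reindex([0, 4, 7], method='nearest', tolerance=1)

assert out.loc[4, 'price'] == 2.0
assert math.isnan(out.loc[7, 'price'])
```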
- reindex_like(other, method=None, copy=True, limit=None, tolerance=None)
Return an object with matching indices as other object.
Conform the object to the same index on all axes. Optional filling logic, placing NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False.
- Parameters
other (Object of the same data type) – Its row and column indices are used to define the new indices of this object.
method ({None, 'backfill'/'bfill', 'pad'/'ffill', 'nearest'}) –
Method to use for filling holes in reindexed DataFrame. Please note: this is only applicable to DataFrames/Series with a monotonically increasing/decreasing index.
None (default): don’t fill gaps
pad / ffill: propagate last valid observation forward to next valid
backfill / bfill: use next valid observation to fill gap
nearest: use nearest valid observations to fill gap.
copy (bool, default True) – Return a new object, even if the passed indexes are the same.
limit (int, default None) – Maximum number of consecutive labels to fill for inexact matches.
tolerance (optional) – Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations must satisfy the equation abs(index[indexer] - target) <= tolerance. Tolerance may be a scalar value, which applies the same tolerance to all values, or list-like, which applies variable tolerance per element. List-like includes list, tuple, array, Series, and must be the same size as the index and its dtype must exactly match the index's type.
- Returns
Same type as caller, but with changed indices on each axis.
- Return type
Series or DataFrame
See also
DataFrame.set_index
Set row labels.
DataFrame.reset_index
Remove row labels or move them to new columns.
DataFrame.reindex
Change to new indices or expand indices.
Notes
See pandas API documentation for pandas.DataFrame.reindex_like for more. Same as calling .reindex(index=other.index, columns=other.columns, ...).
Examples
>>> df1 = pd.DataFrame([[24.3, 75.7, 'high'],
...                     [31, 87.8, 'high'],
...                     [22, 71.6, 'medium'],
...                     [35, 95, 'medium']],
...                    columns=['temp_celsius', 'temp_fahrenheit',
...                             'windspeed'],
...                    index=pd.date_range(start='2014-02-12',
...                                        end='2014-02-15', freq='D'))
>>> df1
            temp_celsius  temp_fahrenheit windspeed
2014-02-12          24.3             75.7      high
2014-02-13          31.0             87.8      high
2014-02-14          22.0             71.6    medium
2014-02-15          35.0             95.0    medium

>>> df2 = pd.DataFrame([[28, 'low'],
...                     [30, 'low'],
...                     [35.1, 'medium']],
...                    columns=['temp_celsius', 'windspeed'],
...                    index=pd.DatetimeIndex(['2014-02-12', '2014-02-13',
...                                            '2014-02-15']))
>>> df2
            temp_celsius windspeed
2014-02-12          28.0       low
2014-02-13          30.0       low
2014-02-15          35.1    medium

>>> df2.reindex_like(df1)
            temp_celsius  temp_fahrenheit windspeed
2014-02-12          28.0              NaN       low
2014-02-13          30.0              NaN       low
2014-02-14           NaN              NaN       NaN
2014-02-15          35.1              NaN    medium
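The note above states that reindex_like is the same as reindexing on the other object's row and column labels; that equivalence can be checked directly (the two small frames here are made up):

```python
import pandas as pd  # with Modin: import modin.pandas as pd

a = pd.DataFrame({'x': [1, 2]}, index=['r1', 'r2'])
b = pd.DataFrame({'x': [9], 'y': [8]}, index=['r1'])

# Conform b to a's axes, two equivalent ways
via_like = b.reindex_like(a)
via_reindex = b.reindex(index=a.index, columns=a.columns)

assert via_like.equals(via_reindex)
```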
- rename_axis(mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False)
Set the name of the axis for the index or columns.
- Parameters
mapper (scalar, list-like, optional) – Value to set the axis name attribute.
index (scalar, list-like, dict-like or function, optional) – A scalar, list-like, dict-like or function transformations to apply to that axis' values. Note that the columns parameter is not allowed if the object is a Series. This parameter only applies for DataFrame type objects. Use either mapper and axis to specify the axis to target with mapper, or index and/or columns.
columns (scalar, list-like, dict-like or function, optional) – A scalar, list-like, dict-like or function transformations to apply to that axis' values. Note that the columns parameter is not allowed if the object is a Series. This parameter only applies for DataFrame type objects. Use either mapper and axis to specify the axis to target with mapper, or index and/or columns.
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to rename.
copy (bool, default True) – Also copy underlying data.
inplace (bool, default False) – Modifies the object directly, instead of creating a new Series or DataFrame.
- Returns
The same type as the caller or None if inplace=True.
- Return type
Series, DataFrame, or None
See also
Series.rename
Alter Series index labels or name.
DataFrame.rename
Alter DataFrame index labels or name.
Index.rename
Set new names on index.
Notes
See pandas API documentation for pandas.DataFrame.rename_axis for more.
DataFrame.rename_axis supports two calling conventions:
(index=index_mapper, columns=columns_mapper, ...)
(mapper, axis={'index', 'columns'}, ...)
The first calling convention will only modify the names of the index and/or the names of the Index object that is the columns. In this case, the parameter copy is ignored.
The second calling convention will modify the names of the corresponding index if mapper is a list or a scalar. However, if mapper is dict-like or a function, it will use the deprecated behavior of modifying the axis labels.
We highly recommend using keyword arguments to clarify your intent.
Examples
Series
>>> s = pd.Series(["dog", "cat", "monkey"])
>>> s
0       dog
1       cat
2    monkey
dtype: object
>>> s.rename_axis("animal")
animal
0       dog
1       cat
2    monkey
dtype: object
DataFrame
>>> df = pd.DataFrame({"num_legs": [4, 4, 2],
...                    "num_arms": [0, 0, 2]},
...                   ["dog", "cat", "monkey"])
>>> df
        num_legs  num_arms
dog            4         0
cat            4         0
monkey         2         2
>>> df = df.rename_axis("animal")
>>> df
        num_legs  num_arms
animal
dog            4         0
cat            4         0
monkey         2         2
>>> df = df.rename_axis("limbs", axis="columns")
>>> df
limbs   num_legs  num_arms
animal
dog            4         0
cat            4         0
monkey         2         2
MultiIndex
>>> df.index = pd.MultiIndex.from_product([['mammal'],
...                                        ['dog', 'cat', 'monkey']],
...                                       names=['type', 'name'])
>>> df
limbs          num_legs  num_arms
type   name
mammal dog            4         0
       cat            4         0
       monkey         2         2

>>> df.rename_axis(index={'type': 'class'})
limbs          num_legs  num_arms
class  name
mammal dog            4         0
       cat            4         0
       monkey         2         2

>>> df.rename_axis(columns=str.upper)
LIMBS          num_legs  num_arms
type   name
mammal dog            4         0
       cat            4         0
       monkey         2         2
- reorder_levels(order, axis=0)
Rearrange index levels using input order. May not drop or duplicate levels.
- Parameters
order (list of int or list of str) – List representing new level order. Reference level by number (position) or by key (label).
axis ({0 or 'index', 1 or 'columns'}, default 0) – Where to reorder levels.
- Returns
- Return type
DataFrame
Notes
See pandas API documentation for pandas.DataFrame.reorder_levels for more.
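This section carries no example; a minimal sketch with a hypothetical two-level index:

```python
import pandas as pd  # with Modin: import modin.pandas as pd

idx = pd.MultiIndex.from_tuples([('A', 'x'), ('A', 'y'), ('B', 'x')],
                                names=['outer', 'inner'])
df = pd.DataFrame({'val': [1, 2, 3]}, index=idx)

# Swap the two index levels by name; row order and values are untouched,
# only the order of the levels changes
swapped = df.reorder_levels(['inner', 'outer'])

assert list(swapped.index.names) == ['inner', 'outer']
assert swapped['val'].tolist() == [1, 2, 3]
```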
- resample(rule, axis=0, closed=None, label=None, convention='start', kind=None, loffset=None, base: Optional[int] = None, on=None, level=None, origin: Union[str, Timestamp, datetime.datetime, numpy.datetime64, int, numpy.int64, float] = 'start_day', offset: Optional[Union[Timedelta, datetime.timedelta, numpy.timedelta64, int, numpy.int64, float, str]] = None)
Resample time-series data.
Convenience method for frequency conversion and resampling of time series. The object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or the caller must pass the label of a datetime-like series/index to the on/level keyword parameter.
- Parameters
rule (DateOffset, Timedelta or str) – The offset string or object representing target conversion.
axis ({0 or 'index', 1 or 'columns'}, default 0) – Which axis to use for up- or down-sampling. For Series this will default to 0, i.e. along the rows. Must be DatetimeIndex, TimedeltaIndex or PeriodIndex.
closed ({'right', 'left'}, default None) – Which side of bin interval is closed. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.
label ({'right', 'left'}, default None) – Which bin edge label to label bucket with. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.
convention ({'start', 'end', 's', 'e'}, default 'start') – For PeriodIndex only, controls whether to use the start or end of rule.
kind ({'timestamp', 'period'}, optional, default None) – Pass ‘timestamp’ to convert the resulting index to a DateTimeIndex or ‘period’ to convert it to a PeriodIndex. By default the input representation is retained.
loffset (timedelta, default None) –
Adjust the resampled time labels.
Deprecated since version 1.1.0: You should add the loffset to the df.index after the resample. See below.
base (int, default 0) –
For frequencies that evenly subdivide 1 day, the “origin” of the aggregated intervals. For example, for ‘5min’ frequency, base could range from 0 through 4. Defaults to 0.
Deprecated since version 1.1.0: The new arguments that you should use are ‘offset’ or ‘origin’.
on (str, optional) – For a DataFrame, column to use instead of index for resampling. Column must be datetime-like.
level (str or int, optional) – For a MultiIndex, level (name or number) to use for resampling. level must be datetime-like.
origin ({'epoch', 'start', 'start_day', 'end', 'end_day'}, Timestamp or str, default 'start_day') –
The timestamp on which to adjust the grouping. The timezone of origin must match the timezone of the index. If a timestamp is not used, these values are also supported:
'epoch': origin is 1970-01-01
'start': origin is the first value of the timeseries
'start_day': origin is the first day at midnight of the timeseries
New in version 1.1.0.
'end': origin is the last value of the timeseries
'end_day': origin is the ceiling midnight of the last day
New in version 1.3.0.
offset (Timedelta or str, default is None) –
An offset timedelta added to the origin.
New in version 1.1.0.
- Returns
Resampler object.
- Return type
pandas.core.Resampler
See also
Series.resample
Resample a Series.
DataFrame.resample
Resample a DataFrame.
groupby
Group DataFrame by mapping, function, label, or list of labels.
asfreq
Reindex a DataFrame with the given frequency without grouping.
Notes
See pandas API documentation for pandas.DataFrame.resample for more. See the user guide for more.
To learn more about the offset strings, please see this link.
Examples
Start by creating a series with 9 one minute timestamps.
>>> index = pd.date_range('1/1/2000', periods=9, freq='T')
>>> series = pd.Series(range(9), index=index)
>>> series
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64
Downsample the series into 3 minute bins and sum the values of the timestamps falling into a bin.
>>> series.resample('3T').sum()
2000-01-01 00:00:00     3
2000-01-01 00:03:00    12
2000-01-01 00:06:00    21
Freq: 3T, dtype: int64
Downsample the series into 3 minute bins as above, but label each bin using the right edge instead of the left. Please note that the value in the bucket used as the label is not included in the bucket, which it labels. For example, in the original series the bucket 2000-01-01 00:03:00 contains the value 3, but the summed value in the resampled bucket with the label 2000-01-01 00:03:00 does not include 3 (if it did, the summed value would be 6, not 3). To include this value close the right side of the bin interval as illustrated in the example below this one.

>>> series.resample('3T', label='right').sum()
2000-01-01 00:03:00     3
2000-01-01 00:06:00    12
2000-01-01 00:09:00    21
Freq: 3T, dtype: int64
Downsample the series into 3 minute bins as above, but close the right side of the bin interval.
>>> series.resample('3T', label='right', closed='right').sum()
2000-01-01 00:00:00     0
2000-01-01 00:03:00     6
2000-01-01 00:06:00    15
2000-01-01 00:09:00    15
Freq: 3T, dtype: int64
Upsample the series into 30 second bins.
>>> series.resample('30S').asfreq()[0:5]   # Select first 5 rows
2000-01-01 00:00:00    0.0
2000-01-01 00:00:30    NaN
2000-01-01 00:01:00    1.0
2000-01-01 00:01:30    NaN
2000-01-01 00:02:00    2.0
Freq: 30S, dtype: float64
Upsample the series into 30 second bins and fill the NaN values using the pad method.

>>> series.resample('30S').pad()[0:5]
2000-01-01 00:00:00    0
2000-01-01 00:00:30    0
2000-01-01 00:01:00    1
2000-01-01 00:01:30    1
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64
Upsample the series into 30 second bins and fill the NaN values using the bfill method.

>>> series.resample('30S').bfill()[0:5]
2000-01-01 00:00:00    0
2000-01-01 00:00:30    1
2000-01-01 00:01:00    1
2000-01-01 00:01:30    2
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64
Pass a custom function via apply.

>>> def custom_resampler(arraylike):
...     return np.sum(arraylike) + 5
...
>>> series.resample('3T').apply(custom_resampler)
2000-01-01 00:00:00     8
2000-01-01 00:03:00    17
2000-01-01 00:06:00    26
Freq: 3T, dtype: int64
For a Series with a PeriodIndex, the keyword convention can be used to control whether to use the start or end of rule.
Resample a year by quarter using ‘start’ convention. Values are assigned to the first quarter of the period.
>>> s = pd.Series([1, 2], index=pd.period_range('2012-01-01',
...                                             freq='A',
...                                             periods=2))
>>> s
2012    1
2013    2
Freq: A-DEC, dtype: int64
>>> s.resample('Q', convention='start').asfreq()
2012Q1    1.0
2012Q2    NaN
2012Q3    NaN
2012Q4    NaN
2013Q1    2.0
2013Q2    NaN
2013Q3    NaN
2013Q4    NaN
Freq: Q-DEC, dtype: float64
Resample quarters by month using ‘end’ convention. Values are assigned to the last month of the period.
>>> q = pd.Series([1, 2, 3, 4], index=pd.period_range('2018-01-01',
...                                                   freq='Q',
...                                                   periods=4))
>>> q
2018Q1    1
2018Q2    2
2018Q3    3
2018Q4    4
Freq: Q-DEC, dtype: int64
>>> q.resample('M', convention='end').asfreq()
2018-03    1.0
2018-04    NaN
2018-05    NaN
2018-06    2.0
2018-07    NaN
2018-08    NaN
2018-09    3.0
2018-10    NaN
2018-11    NaN
2018-12    4.0
Freq: M, dtype: float64
For DataFrame objects, the keyword on can be used to specify the column instead of the index for resampling.
>>> d = {'price': [10, 11, 9, 13, 14, 18, 17, 19],
...      'volume': [50, 60, 40, 100, 50, 100, 40, 50]}
>>> df = pd.DataFrame(d)
>>> df['week_starting'] = pd.date_range('01/01/2018',
...                                     periods=8,
...                                     freq='W')
>>> df
   price  volume week_starting
0     10      50    2018-01-07
1     11      60    2018-01-14
2      9      40    2018-01-21
3     13     100    2018-01-28
4     14      50    2018-02-04
5     18     100    2018-02-11
6     17      40    2018-02-18
7     19      50    2018-02-25
>>> df.resample('M', on='week_starting').mean()
               price  volume
week_starting
2018-01-31     10.75    62.5
2018-02-28     17.00    60.0
For a DataFrame with MultiIndex, the keyword level can be used to specify on which level the resampling needs to take place.
>>> days = pd.date_range('1/1/2000', periods=4, freq='D')
>>> d2 = {'price': [10, 11, 9, 13, 14, 18, 17, 19],
...       'volume': [50, 60, 40, 100, 50, 100, 40, 50]}
>>> df2 = pd.DataFrame(
...     d2,
...     index=pd.MultiIndex.from_product(
...         [days, ['morning', 'afternoon']]
...     )
... )
>>> df2
                      price  volume
2000-01-01 morning       10      50
           afternoon     11      60
2000-01-02 morning        9      40
           afternoon     13     100
2000-01-03 morning       14      50
           afternoon     18     100
2000-01-04 morning       17      40
           afternoon     19      50
>>> df2.resample('D', level=0).sum()
            price  volume
2000-01-01     21     110
2000-01-02     22     140
2000-01-03     32     150
2000-01-04     36      90
If you want to adjust the start of the bins based on a fixed timestamp:
>>> start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'
>>> rng = pd.date_range(start, end, freq='7min')
>>> ts = pd.Series(np.arange(len(rng)) * 3, index=rng)
>>> ts
2000-10-01 23:30:00     0
2000-10-01 23:37:00     3
2000-10-01 23:44:00     6
2000-10-01 23:51:00     9
2000-10-01 23:58:00    12
2000-10-02 00:05:00    15
2000-10-02 00:12:00    18
2000-10-02 00:19:00    21
2000-10-02 00:26:00    24
Freq: 7T, dtype: int64
>>> ts.resample('17min').sum()
2000-10-01 23:14:00     0
2000-10-01 23:31:00     9
2000-10-01 23:48:00    21
2000-10-02 00:05:00    54
2000-10-02 00:22:00    24
Freq: 17T, dtype: int64

>>> ts.resample('17min', origin='epoch').sum()
2000-10-01 23:18:00     0
2000-10-01 23:35:00    18
2000-10-01 23:52:00    27
2000-10-02 00:09:00    39
2000-10-02 00:26:00    24
Freq: 17T, dtype: int64

>>> ts.resample('17min', origin='2000-01-01').sum()
2000-10-01 23:24:00     3
2000-10-01 23:41:00    15
2000-10-01 23:58:00    45
2000-10-02 00:15:00    45
Freq: 17T, dtype: int64
If you want to adjust the start of the bins with an offset Timedelta, the two following lines are equivalent:
>>> ts.resample('17min', origin='start').sum()
2000-10-01 23:30:00     9
2000-10-01 23:47:00    21
2000-10-02 00:04:00    54
2000-10-02 00:21:00    24
Freq: 17T, dtype: int64

>>> ts.resample('17min', offset='23h30min').sum()
2000-10-01 23:30:00     9
2000-10-01 23:47:00    21
2000-10-02 00:04:00    54
2000-10-02 00:21:00    24
Freq: 17T, dtype: int64
If you want to take the largest Timestamp as the end of the bins:
>>> ts.resample('17min', origin='end').sum()
2000-10-01 23:35:00     0
2000-10-01 23:52:00    18
2000-10-02 00:09:00    27
2000-10-02 00:26:00    63
Freq: 17T, dtype: int64
In contrast with the start_day, you can use end_day to take the ceiling midnight of the largest Timestamp as the end of the bins and drop the bins not containing data:
>>> ts.resample('17min', origin='end_day').sum()
2000-10-01 23:38:00     3
2000-10-01 23:55:00    15
2000-10-02 00:12:00    45
2000-10-02 00:29:00    45
Freq: 17T, dtype: int64
To replace the use of the deprecated base argument, you can now use offset, in this example it is equivalent to have base=2:
>>> ts.resample('17min', offset='2min').sum()
2000-10-01 23:16:00     0
2000-10-01 23:33:00     9
2000-10-01 23:50:00    36
2000-10-02 00:07:00    39
2000-10-02 00:24:00    24
Freq: 17T, dtype: int64
To replace the use of the deprecated loffset argument:
>>> from pandas.tseries.frequencies import to_offset
>>> loffset = '19min'
>>> ts_out = ts.resample('17min').sum()
>>> ts_out.index = ts_out.index + to_offset(loffset)
>>> ts_out
2000-10-01 23:33:00     0
2000-10-01 23:50:00     9
2000-10-02 00:07:00    21
2000-10-02 00:24:00    54
2000-10-02 00:41:00    24
Freq: 17T, dtype: int64
- reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')
Reset the index, or a level of it.
Reset the index of the DataFrame, and use the default one instead. If the DataFrame has a MultiIndex, this method can remove one or more levels.
- Parameters
level (int, str, tuple, or list, default None) – Only remove the given levels from the index. Removes all levels by default.
drop (bool, default False) – Do not try to insert index into dataframe columns. This resets the index to the default integer index.
inplace (bool, default False) – Modify the DataFrame in place (do not create a new object).
col_level (int or str, default 0) – If the columns have multiple levels, determines which level the labels are inserted into. By default it is inserted into the first level.
col_fill (object, default '') – If the columns have multiple levels, determines how the other levels are named. If None then the index name is repeated.
- Returns
DataFrame with the new index or None if inplace=True.
- Return type
DataFrame or None
See also
DataFrame.set_index
Opposite of reset_index.
DataFrame.reindex
Change to new indices or expand indices.
DataFrame.reindex_like
Change to same indices as other DataFrame.
Examples
>>> df = pd.DataFrame([('bird', 389.0),
...                    ('bird', 24.0),
...                    ('mammal', 80.5),
...                    ('mammal', np.nan)],
...                   index=['falcon', 'parrot', 'lion', 'monkey'],
...                   columns=('class', 'max_speed'))
>>> df
         class  max_speed
falcon    bird      389.0
parrot    bird       24.0
lion    mammal       80.5
monkey  mammal        NaN
When we reset the index, the old index is added as a column, and a new sequential index is used:
>>> df.reset_index()
    index   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN
We can use the drop parameter to avoid the old index being added as a column:
>>> df.reset_index(drop=True)
    class  max_speed
0    bird      389.0
1    bird       24.0
2  mammal       80.5
3  mammal        NaN
You can also use reset_index with MultiIndex.
>>> index = pd.MultiIndex.from_tuples([('bird', 'falcon'),
...                                    ('bird', 'parrot'),
...                                    ('mammal', 'lion'),
...                                    ('mammal', 'monkey')],
...                                   names=['class', 'name'])
>>> columns = pd.MultiIndex.from_tuples([('speed', 'max'),
...                                      ('species', 'type')])
>>> df = pd.DataFrame([(389.0, 'fly'),
...                    ( 24.0, 'fly'),
...                    ( 80.5, 'run'),
...                    (np.nan, 'jump')],
...                   index=index,
...                   columns=columns)
>>> df
               speed species
                 max    type
class  name
bird   falcon  389.0     fly
       parrot   24.0     fly
mammal lion     80.5     run
       monkey    NaN    jump
If the index has multiple levels, we can reset a subset of them:
>>> df.reset_index(level='class')
         class  speed species
                  max    type
name
falcon    bird  389.0     fly
parrot    bird   24.0     fly
lion    mammal   80.5     run
monkey  mammal    NaN    jump
If we are not dropping the index, by default, it is placed in the top level. We can place it in another level:
>>> df.reset_index(level='class', col_level=1)
                speed species
         class    max    type
name
falcon    bird  389.0     fly
parrot    bird   24.0     fly
lion    mammal   80.5     run
monkey  mammal    NaN    jump
When the index is inserted under another level, we can specify under which one with the parameter col_fill:
>>> df.reset_index(level='class', col_level=1, col_fill='species')
              species  speed species
                class    max    type
name
falcon           bird  389.0     fly
parrot           bird   24.0     fly
lion           mammal   80.5     run
monkey         mammal    NaN    jump
If we specify a nonexistent level for col_fill, it is created:
>>> df.reset_index(level='class', col_level=1, col_fill='genus')
                genus  speed species
                class    max    type
name
falcon           bird  389.0     fly
parrot           bird   24.0     fly
lion           mammal   80.5     run
monkey         mammal    NaN    jump
Notes
See pandas API documentation for pandas.DataFrame.reset_index for more.
- rfloordiv(other, axis='columns', level=None, fill_value=None)
Get Integer division of dataframe and other, element-wise (binary operator rfloordiv).
Equivalent to other // dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, floordiv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns
Result of the arithmetic operation.
- Return type
DataFrame
See also
DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.
Notes
See pandas API documentation for pandas.DataFrame.rfloordiv for more. Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with operator version which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0

>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358

>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4

>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN

>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720

>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
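The shared examples above never call rfloordiv directly; a minimal sketch (plain pandas shown; the modin.pandas call is identical):

```python
import pandas as pd  # with Modin: import modin.pandas as pd

df = pd.DataFrame({'angles': [0, 3, 4]},
                  index=['circle', 'triangle', 'rectangle'])

# rfloordiv swaps the operands: df.rfloordiv(10) computes 10 // df
out = df.rfloordiv(10)

assert out.loc['triangle', 'angles'] == 3.0   # 10 // 3
assert out.loc['rectangle', 'angles'] == 2.0  # 10 // 4
assert out.equals(10 // df)
```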
- rmod(other, axis='columns', level=None, fill_value=None)
Get Modulo of dataframe and other, element-wise (binary operator rmod).
Equivalent to other % dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, mod.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns
Result of the arithmetic operation.
- Return type
DataFrame
See also
DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.
Notes
See pandas API documentation for pandas.DataFrame.rmod for more. Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360
Add a scalar with the operator version, which returns the same results.
>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
Divide by a constant with the reverse version.
>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778
Subtract a list and a Series by axis with the operator version.
>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359
Multiply a DataFrame of a different shape with the operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
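The shared examples above exercise the generic arithmetic wrappers but never rmod itself. A minimal sketch of the reversed-operand behaviour, using plain pandas since modin.pandas mirrors the API (the column values here are illustrative, not from the examples above):

```python
import pandas as pd  # modin.pandas exposes the same interface

# df.rmod(other) computes other % df, element-wise.
df = pd.DataFrame({'angles': [3, 4, 5]},
                  index=['triangle', 'rectangle', 'pentagon'])

result = df.rmod(7)        # 7 % 3, 7 % 4, 7 % 5
same_as_operator = (7 % df).equals(result)  # rmod matches the operator form
```

Note that df.mod(7) (7 as the right operand) would give a different answer; the "r" prefix only swaps which side of the % the DataFrame sits on.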
- rmul(other, axis='columns', level=None, fill_value=None)
Get Multiplication of dataframe and other, element-wise (binary operator rmul).
Equivalent to other * dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, mul. Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns
Result of the arithmetic operation.
- Return type
DataFrame
See also
DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.
Notes
See pandas API documentation for pandas.DataFrame.rmul for more. Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360
Add a scalar with the operator version, which returns the same results.
>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
Divide by a constant with the reverse version.
>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778
Subtract a list and a Series by axis with the operator version.
>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359
Multiply a DataFrame of a different shape with the operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
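Because multiplication is commutative, rmul and mul always agree; the "r" variant exists for symmetry with the non-commutative operators. A quick sketch (plain pandas, same API as modin.pandas; values illustrative):

```python
import pandas as pd  # modin.pandas exposes the same interface

df = pd.DataFrame({'angles': [0, 3, 4]},
                  index=['circle', 'triangle', 'rectangle'])

reversed_mul = df.rmul(10)   # 10 * df
forward_mul = df.mul(10)     # df * 10
agree = reversed_mul.equals(forward_mul)  # identical results
```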
- rolling(window, min_periods=None, center=False, win_type=None, on=None, axis=0, closed=None, method='single')
Provide rolling window calculations.
- Parameters
window (int, offset, or BaseIndexer subclass) –
Size of the moving window. This is the number of observations used for calculating the statistic. Each window will be a fixed size.
If it is an offset then this will be the time period of each window. Each window will be a variable size based on the observations included in the time period. This is only valid for datetimelike indexes.
If a BaseIndexer subclass is passed, calculates the window boundaries based on the defined get_window_bounds method. Additional rolling keyword arguments, namely min_periods, center, and closed will be passed to get_window_bounds.
min_periods (int, default None) – Minimum number of observations in window required to have a value (otherwise result is NA). For a window that is specified by an offset, min_periods will default to 1. Otherwise, min_periods will default to the size of the window.
center (bool, default False) – Set the labels at the center of the window.
win_type (str, default None) – Provide a window type. If None, all points are evenly weighted. See the notes below for further information.
on (str, optional) – For a DataFrame, a datetime-like column or Index level on which to calculate the rolling window, rather than the DataFrame’s index. Provided integer column is ignored and excluded from result since an integer index is not used to calculate the rolling window.
axis (int or str, default 0) – If 0 or ‘index’, roll across the rows. If 1 or ‘columns’, roll across the columns.
closed (str, default None) – Make the interval closed on the ‘right’, ‘left’, ‘both’ or ‘neither’ endpoints. Defaults to ‘right’.
Changed in version 1.2.0: The closed parameter with fixed windows is now supported.
method (str {'single', 'table'}, default 'single') – Execute the rolling operation per single column or row ('single') or over the entire object ('table'). This argument is only implemented when specifying engine='numba' in the method call.
New in version 1.3.0.
- Returns
- Return type
a Window or Rolling sub-classed for the particular operation
See also
expanding
Provides expanding transformations.
ewm
Provides exponential weighted functions.
Notes
See pandas API documentation for pandas.DataFrame.rolling for more. By default, the result is set to the right edge of the window. This can be changed to the center of the window by setting center=True.
To learn more about the offsets & frequency strings, please see this link.
If win_type=None, all points are evenly weighted; otherwise, win_type can accept a string of any scipy.signal window function.
Certain Scipy window types require additional parameters to be passed in the aggregation function. The additional parameters must match the keywords specified in the Scipy window type method signature. Please see the third example below on how to add the additional parameters.
Examples
>>> df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]})
>>> df
     B
0  0.0
1  1.0
2  2.0
3  NaN
4  4.0
Rolling sum with a window length of 2, using the ‘triang’ window type.
>>> df.rolling(2, win_type='triang').sum()
     B
0  NaN
1  0.5
2  1.5
3  NaN
4  NaN
Rolling sum with a window length of 2, using the ‘gaussian’ window type (note how we need to specify std).
>>> df.rolling(2, win_type='gaussian').sum(std=3)
          B
0       NaN
1  0.986207
2  2.958621
3       NaN
4       NaN
Rolling sum with a window length of 2, min_periods defaults to the window length.
>>> df.rolling(2).sum()
     B
0  NaN
1  1.0
2  3.0
3  NaN
4  NaN
Same as above, but explicitly set the min_periods.
>>> df.rolling(2, min_periods=1).sum()
     B
0  0.0
1  1.0
2  3.0
3  2.0
4  4.0
Same as above, but with forward-looking windows.
>>> indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=2)
>>> df.rolling(window=indexer, min_periods=1).sum()
     B
0  1.0
1  3.0
2  2.0
3  4.0
4  4.0
A ragged (meaning not-a-regular frequency), time-indexed DataFrame.
>>> df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]},
...                   index=[pd.Timestamp('20130101 09:00:00'),
...                          pd.Timestamp('20130101 09:00:02'),
...                          pd.Timestamp('20130101 09:00:03'),
...                          pd.Timestamp('20130101 09:00:05'),
...                          pd.Timestamp('20130101 09:00:06')])
>>> df
                       B
2013-01-01 09:00:00  0.0
2013-01-01 09:00:02  1.0
2013-01-01 09:00:03  2.0
2013-01-01 09:00:05  NaN
2013-01-01 09:00:06  4.0
Contrasting to an integer rolling window, this will roll a variable length window corresponding to the time period. The default for min_periods is 1.
>>> df.rolling('2s').sum()
                       B
2013-01-01 09:00:00  0.0
2013-01-01 09:00:02  1.0
2013-01-01 09:00:03  3.0
2013-01-01 09:00:05  NaN
2013-01-01 09:00:06  4.0
- round(decimals=0, *args, **kwargs)
Round a DataFrame to a variable number of decimal places.
- Parameters
decimals (int, dict, Series) – Number of decimal places to round each column to. If an int is given, round each column to the same number of places. Otherwise dict and Series round to variable numbers of places. Column names should be in the keys if decimals is a dict-like, or in the index if decimals is a Series. Any columns not included in decimals will be left as is. Elements of decimals which are not columns of the input will be ignored.
*args – Additional keywords have no effect but might be accepted for compatibility with numpy.
**kwargs – Additional keywords have no effect but might be accepted for compatibility with numpy.
- Returns
A DataFrame with the affected columns rounded to the specified number of decimal places.
- Return type
DataFrame
See also
numpy.around
Round a numpy array to the given number of decimals.
Series.round
Round a Series to the given number of decimals.
Examples
>>> df = pd.DataFrame([(.21, .32), (.01, .67), (.66, .03), (.21, .18)],
...                   columns=['dogs', 'cats'])
>>> df
   dogs  cats
0  0.21  0.32
1  0.01  0.67
2  0.66  0.03
3  0.21  0.18
By providing an integer each column is rounded to the same number of decimal places.
>>> df.round(1)
   dogs  cats
0   0.2   0.3
1   0.0   0.7
2   0.7   0.0
3   0.2   0.2
With a dict, the number of places for specific columns can be specified with the column names as key and the number of decimal places as value.
>>> df.round({'dogs': 1, 'cats': 0})
   dogs  cats
0   0.2   0.0
1   0.0   1.0
2   0.7   0.0
3   0.2   0.0
Using a Series, the number of places for specific columns can be specified with the column names as index and the number of decimal places as value.
>>> decimals = pd.Series([0, 1], index=['cats', 'dogs'])
>>> df.round(decimals)
   dogs  cats
0   0.2   0.0
1   0.0   1.0
2   0.7   0.0
3   0.2   0.0
Notes
See pandas API documentation for pandas.DataFrame.round for more.
- rpow(other, axis='columns', level=None, fill_value=None)
Get Exponential power of dataframe and other, element-wise (binary operator rpow).
Equivalent to other ** dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, pow. Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns
Result of the arithmetic operation.
- Return type
DataFrame
See also
DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.
Notes
See pandas API documentation for pandas.DataFrame.rpow for more. Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360
Add a scalar with the operator version, which returns the same results.
>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
Divide by a constant with the reverse version.
>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778
Subtract a list and a Series by axis with the operator version.
>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359
Multiply a DataFrame of a different shape with the operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
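The shared examples above never call rpow directly. A minimal sketch of the reversed exponentiation (plain pandas, same API as modin.pandas; values illustrative):

```python
import pandas as pd  # modin.pandas exposes the same interface

df = pd.DataFrame({'angles': [0, 3, 4]},
                  index=['circle', 'triangle', 'rectangle'])

# df.rpow(2) computes 2 ** df: the DataFrame supplies the exponents.
result = df.rpow(2)                      # 2**0, 2**3, 2**4
same_as_operator = (2 ** df).equals(result)
```

Contrast with df.pow(2), which would square each element instead.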
- rsub(other, axis='columns', level=None, fill_value=None)
Get Subtraction of dataframe and other, element-wise (binary operator rsub).
Equivalent to other - dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, sub. Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns
Result of the arithmetic operation.
- Return type
DataFrame
See also
DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.
Notes
See pandas API documentation for pandas.DataFrame.rsub for more. Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360
Add a scalar with the operator version, which returns the same results.
>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
Divide by a constant with the reverse version.
>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778
Subtract a list and a Series by axis with the operator version.
>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359
Multiply a DataFrame of a different shape with the operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
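The shared examples above show df.sub but not rsub, whose operands are swapped. A minimal sketch (plain pandas, same API as modin.pandas; values illustrative):

```python
import pandas as pd  # modin.pandas exposes the same interface

df = pd.DataFrame({'angles': [0, 3, 4]},
                  index=['circle', 'triangle', 'rectangle'])

# df.rsub(10) computes 10 - df, not df - 10.
reversed_sub = df.rsub(10)   # 10-0, 10-3, 10-4
forward_sub = df.sub(10)     # 0-10, 3-10, 4-10
same_as_operator = (10 - df).equals(reversed_sub)
```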
- rtruediv(other, axis='columns', level=None, fill_value=None)
Get Floating division of dataframe and other, element-wise (binary operator rtruediv).
Equivalent to other / dataframe, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, truediv. Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns
Result of the arithmetic operation.
- Return type
DataFrame
See also
DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.
Notes
See pandas API documentation for pandas.DataFrame.rtruediv for more. Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360
Add a scalar with the operator version, which returns the same results.
>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
Divide by a constant with the reverse version.
>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778
Subtract a list and a Series by axis with the operator version.
>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359
Multiply a DataFrame of a different shape with the operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
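rtruediv is the method form of the rdiv call shown in the shared examples. A minimal sketch with zero-free values so the result stays finite (plain pandas, same API as modin.pandas; values illustrative):

```python
import pandas as pd  # modin.pandas exposes the same interface

df = pd.DataFrame({'angles': [2, 4, 5]},
                  index=['segment', 'rectangle', 'pentagon'])

# df.rtruediv(10) computes 10 / df, element-wise.
result = df.rtruediv(10)           # 10/2, 10/4, 10/5
same_as_rdiv = df.rdiv(10).equals(result)  # rdiv is an alias for rtruediv
```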
- sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)
Return a random sample of items from an axis of object.
You can use random_state for reproducibility.
- Parameters
n (int, optional) – Number of items from axis to return. Cannot be used with frac. Default = 1 if frac = None.
frac (float, optional) – Fraction of axis items to return. Cannot be used with n.
replace (bool, default False) – Allow or disallow sampling of the same row more than once.
weights (str or ndarray-like, optional) – Default ‘None’ results in equal probability weighting. If passed a Series, will align with target object on index. Index values in weights not found in sampled object will be ignored and index values in sampled object not in weights will be assigned weights of zero. If called on a DataFrame, will accept the name of a column when axis = 0. Unless weights are a Series, weights must be same length as axis being sampled. If weights do not sum to 1, they will be normalized to sum to 1. Missing values in the weights column will be treated as zero. Infinite values not allowed.
random_state (int, array-like, BitGenerator, np.random.RandomState, optional) –
If int, array-like, or BitGenerator (NumPy>=1.17), seed for random number generator. If np.random.RandomState, use as numpy RandomState object.
Changed in version 1.1.0: array-like and BitGenerator (for NumPy>=1.17) object now passed to np.random.RandomState() as seed.
axis ({0 or ‘index’, 1 or ‘columns’, None}, default None) – Axis to sample. Accepts axis number or name. Default is stat axis for given data type (0 for Series and DataFrames).
ignore_index (bool, default False) –
If True, the resulting index will be labeled 0, 1, …, n - 1.
New in version 1.3.0.
- Returns
A new object of same type as caller containing n items randomly sampled from the caller object.
- Return type
Series or DataFrame
See also
DataFrameGroupBy.sample
Generates random samples from each group of a DataFrame object.
SeriesGroupBy.sample
Generates random samples from each group of a Series object.
numpy.random.choice
Generates a random sample from a given 1-D numpy array.
Notes
See pandas API documentation for pandas.DataFrame.sample for more. If frac > 1, replacement should be set to True.
Examples
>>> df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
...                    'num_wings': [2, 0, 0, 0],
...                    'num_specimen_seen': [10, 2, 1, 8]},
...                   index=['falcon', 'dog', 'spider', 'fish'])
>>> df
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
dog            4          0                  2
spider         8          0                  1
fish           0          0                  8
Extract 3 random elements from the Series df['num_legs']. Note that we use random_state to ensure the reproducibility of the examples.
>>> df['num_legs'].sample(n=3, random_state=1)
fish      0
spider    8
falcon    2
Name: num_legs, dtype: int64
A random 50% sample of the DataFrame with replacement:
>>> df.sample(frac=0.5, replace=True, random_state=1)
      num_legs  num_wings  num_specimen_seen
dog          4          0                  2
fish         0          0                  8
An upsample sample of the DataFrame with replacement. Note that the replace parameter has to be True for frac > 1.
>>> df.sample(frac=2, replace=True, random_state=1)
        num_legs  num_wings  num_specimen_seen
dog            4          0                  2
fish           0          0                  8
falcon         2          2                 10
falcon         2          2                 10
fish           0          0                  8
dog            4          0                  2
fish           0          0                  8
dog            4          0                  2
Using a DataFrame column as weights. Rows with larger value in the num_specimen_seen column are more likely to be sampled.
>>> df.sample(n=2, weights='num_specimen_seen', random_state=1)
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
fish           0          0                  8
- sem(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
Return unbiased standard error of the mean over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument.
- Parameters
axis ({index (0), columns (1)}) –
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
- Returns
- Return type
Series or DataFrame (if level specified)
Notes
See pandas API documentation for pandas.DataFrame.sem for more. To have the same behaviour as numpy.std, use ddof=0 (instead of the default ddof=1)
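Since sem has no examples above, a quick numeric check of the definition (standard error = standard deviation / sqrt(N)) and of the ddof argument, using plain pandas since modin.pandas mirrors the API:

```python
import math
import pandas as pd  # modin.pandas exposes the same interface

df = pd.DataFrame({'a': [1, 2, 3, 4]})

# Default: sample std (ddof=1) divided by sqrt(N), here N = 4.
default_sem = df['a'].sem()
check_default = df['a'].std() / math.sqrt(4)

# ddof=0 switches both std and sem to the population formula.
population_sem = df['a'].sem(ddof=0)
check_population = df['a'].std(ddof=0) / math.sqrt(4)
```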
- set_axis(labels, axis=0, inplace=False)
Assign desired index to given axis.
Indexes for column or row labels can be changed by assigning a list-like or Index.
- Parameters
labels (list-like, Index) – The values for the new index.
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to update. The value 0 identifies the rows, and 1 identifies the columns.
inplace (bool, default False) – Whether to modify the DataFrame in place (if True) rather than returning a new instance.
- Returns
renamed – An object of type DataFrame or None if
inplace=True
.- Return type
DataFrame or None
See also
DataFrame.rename_axis
Alter the name of the index or columns.
Examples
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
Change the row labels.
>>> df.set_axis(['a', 'b', 'c'], axis='index')
   A  B
a  1  4
b  2  5
c  3  6
Change the column labels.
>>> df.set_axis(['I', 'II'], axis='columns')
   I  II
0  1   4
1  2   5
2  3   6
Now, update the labels inplace.
>>> df.set_axis(['i', 'ii'], axis='columns', inplace=True)
>>> df
   i  ii
0  1   4
1  2   5
2  3   6
Notes
See pandas API documentation for pandas.DataFrame.set_axis for more.
- set_flags(*, copy: modin.pandas.base.BasePandasDataset.bool = False, allows_duplicate_labels: Optional[modin.pandas.base.BasePandasDataset.bool] = None)
Return a new object with updated flags.
- Parameters
allows_duplicate_labels (bool, optional) – Whether the returned object allows duplicate labels.
- Returns
The same type as the caller.
- Return type
See also
DataFrame.attrs
Global metadata applying to this dataset.
DataFrame.flags
Global flags applying to this object.
Notes
See pandas API documentation for pandas.DataFrame.set_flags for more. This method returns a new object that’s a view on the same data as the input. Mutating the input or the output values will be reflected in the other.
This method is intended to be used in method chains.
“Flags” differ from “metadata”. Flags reflect properties of the pandas object (the Series or DataFrame). Metadata refer to properties of the dataset, and should be stored in
DataFrame.attrs
.Examples
>>> df = pd.DataFrame({"A": [1, 2]})
>>> df.flags.allows_duplicate_labels
True
>>> df2 = df.set_flags(allows_duplicate_labels=False)
>>> df2.flags.allows_duplicate_labels
False
- shift(periods=1, freq=None, axis=0, fill_value=NoDefault.no_default)
Shift index by desired number of periods with an optional time freq.
When freq is not passed, shift the index without realigning the data. If freq is passed (in this case, the index must be date or datetime, or it will raise a NotImplementedError), the index will be increased using the periods and the freq. freq can be inferred when specified as “infer” as long as either freq or inferred_freq attribute is set in the index.
- Parameters
periods (int) – Number of periods to shift. Can be positive or negative.
freq (DateOffset, tseries.offsets, timedelta, or str, optional) – Offset to use from the tseries module or time rule (e.g. ‘EOM’). If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data. If freq is specified as “infer” then it will be inferred from the freq or inferred_freq attributes of the index. If neither of those attributes exist, a ValueError is thrown.
axis ({0 or 'index', 1 or 'columns', None}, default None) – Shift direction.
fill_value (object, optional) – The scalar value to use for newly introduced missing values. The default depends on the dtype of self. For numeric data, np.nan is used. For datetime, timedelta, or period data, NaT is used. For extension dtypes, self.dtype.na_value is used.
Changed in version 1.1.0.
- Returns
Copy of input object, shifted.
- Return type
DataFrame
See also
Index.shift
Shift values of Index.
DatetimeIndex.shift
Shift values of DatetimeIndex.
PeriodIndex.shift
Shift values of PeriodIndex.
tshift
Shift the time index, using the index’s frequency if available.
Examples
>>> df = pd.DataFrame({"Col1": [10, 20, 15, 30, 45],
...                    "Col2": [13, 23, 18, 33, 48],
...                    "Col3": [17, 27, 22, 37, 52]},
...                   index=pd.date_range("2020-01-01", "2020-01-05"))
>>> df
            Col1  Col2  Col3
2020-01-01    10    13    17
2020-01-02    20    23    27
2020-01-03    15    18    22
2020-01-04    30    33    37
2020-01-05    45    48    52
>>> df.shift(periods=3)
            Col1  Col2  Col3
2020-01-01   NaN   NaN   NaN
2020-01-02   NaN   NaN   NaN
2020-01-03   NaN   NaN   NaN
2020-01-04  10.0  13.0  17.0
2020-01-05  20.0  23.0  27.0
>>> df.shift(periods=1, axis="columns")
            Col1  Col2  Col3
2020-01-01   NaN    10    13
2020-01-02   NaN    20    23
2020-01-03   NaN    15    18
2020-01-04   NaN    30    33
2020-01-05   NaN    45    48
>>> df.shift(periods=3, fill_value=0)
            Col1  Col2  Col3
2020-01-01     0     0     0
2020-01-02     0     0     0
2020-01-03     0     0     0
2020-01-04    10    13    17
2020-01-05    20    23    27
>>> df.shift(periods=3, freq="D")
            Col1  Col2  Col3
2020-01-04    10    13    17
2020-01-05    20    23    27
2020-01-06    15    18    22
2020-01-07    30    33    37
2020-01-08    45    48    52
>>> df.shift(periods=3, freq="infer")
            Col1  Col2  Col3
2020-01-04    10    13    17
2020-01-05    20    23    27
2020-01-06    15    18    22
2020-01-07    30    33    37
2020-01-08    45    48    52
Notes
See pandas API documentation for pandas.DataFrame.shift for more.
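A common use of shift is computing the change between consecutive rows. A minimal sketch (the data here is illustrative; with Modin the same code runs after `import modin.pandas as pd`):

```python
import pandas as pd  # with Modin: import modin.pandas as pd

s = pd.Series([10, 12, 15], index=pd.date_range("2021-01-01", periods=3))

# Lag the series by one row, then subtract to get the row-over-row change.
# The first element has no predecessor, so it becomes NaN.
change = s - s.shift(1)
```

This is equivalent to `s.diff(1)`; shift is the more general building block, since the lagged series can be reused in other expressions.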
- property size
Return an int representing the number of elements in this object.
Return the number of rows if Series. Otherwise return the number of rows times number of columns if DataFrame.
See also
ndarray.size
Number of elements in the array.
Examples
>>> s = pd.Series({'a': 1, 'b': 2, 'c': 3})
>>> s.size
3
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df.size
4
Notes
See pandas API documentation for pandas.DataFrame.size for more.
- skew(axis=None, skipna=None, level=None, numeric_only=None, **kwargs)
Return unbiased skew over requested axis.
Normalized by N-1.
- Parameters
axis ({index (0), columns (1)}) – Axis for the function to be applied on.
skipna (bool, default True) – Exclude NA/null values when computing the result.
level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
**kwargs – Additional keyword arguments to be passed to the function.
- Returns
- Return type
Series or DataFrame (if level specified)
Notes
See pandas API documentation for pandas.DataFrame.skew for more.
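Since this entry has no Examples section, here is a small sketch of skew on a toy frame (the column names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"sym": [1, 2, 3, 4, 5],      # symmetric values
                   "tail": [1, 2, 3, 4, 100]})  # one large outlier to the right

# "sym" is perfectly symmetric around its mean, so its skew is 0.0;
# "tail" has a long right tail, so its skew is positive.
sk = df.skew()
```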
- sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index: bool = False, key: Optional[Callable[[Index], Union[Index, ExtensionArray, numpy.ndarray, Series]]] = None)
Sort object by labels (along an axis).
Returns a new DataFrame sorted by label if inplace argument is False, otherwise updates the original DataFrame and returns None.
- Parameters
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis along which to sort. The value 0 identifies the rows, and 1 identifies the columns.
level (int or level name or list of ints or list of level names) – If not None, sort on values in specified index level(s).
ascending (bool or list-like of bools, default True) – Sort ascending vs. descending. When the index is a MultiIndex the sort direction can be controlled for each level individually.
inplace (bool, default False) – If True, perform operation in-place.
kind ({'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort') – Choice of sorting algorithm. See also numpy.sort() for more information. mergesort and stable are the only stable algorithms. For DataFrames, this option is only applied when sorting on a single column or label.
na_position ({'first', 'last'}, default 'last') – Puts NaNs at the beginning if first; last puts NaNs at the end. Not implemented for MultiIndex.
sort_remaining (bool, default True) – If True and sorting by level and index is multilevel, sort by other levels too (in order) after sorting by specified level.
ignore_index (bool, default False) –
If True, the resulting axis will be labeled 0, 1, …, n - 1.
New in version 1.0.0.
key (callable, optional) – If not None, apply the key function to the index values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect an Index and return an Index of the same shape. For MultiIndex inputs, the key is applied per level.
New in version 1.1.0.
- Returns
The original DataFrame sorted by the labels or None if inplace=True.
- Return type
DataFrame or None
See also
Series.sort_index
Sort Series by the index.
DataFrame.sort_values
Sort DataFrame by the value.
Series.sort_values
Sort Series by the value.
Examples
>>> df = pd.DataFrame([1, 2, 3, 4, 5], index=[100, 29, 234, 1, 150],
...                   columns=['A'])
>>> df.sort_index()
     A
1    4
29   2
100  1
150  5
234  3
By default, it sorts in ascending order; to sort in descending order, use ascending=False.
>>> df.sort_index(ascending=False)
     A
234  3
150  5
100  1
29   2
1    4
A key function can be specified which is applied to the index before sorting. For a MultiIndex this is applied to each level separately.
>>> df = pd.DataFrame({"a": [1, 2, 3, 4]}, index=['A', 'b', 'C', 'd'])
>>> df.sort_index(key=lambda x: x.str.lower())
   a
A  1
b  2
C  3
d  4
Notes
See pandas API documentation for pandas.DataFrame.sort_index for more.
- sort_values(by, axis=0, ascending=True, inplace: bool = False, kind='quicksort', na_position='last', ignore_index: bool = False, key: Optional[Callable[[Index], Union[Index, ExtensionArray, numpy.ndarray, Series]]] = None)
Sort by the values along either axis.
- Parameters
by (str or list of str) – Name or list of names to sort by.
If axis is 0 or 'index' then by may contain index levels and/or column labels.
If axis is 1 or 'columns' then by may contain column levels and/or index labels.
axis ({0 or 'index', 1 or 'columns'}, default 0) – Axis to be sorted.
ascending (bool or list of bool, default True) – Sort ascending vs. descending. Specify list for multiple sort orders. If this is a list of bools, must match the length of the by.
inplace (bool, default False) – If True, perform operation in-place.
kind ({'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort') – Choice of sorting algorithm. See also numpy.sort() for more information. mergesort and stable are the only stable algorithms. For DataFrames, this option is only applied when sorting on a single column or label.
na_position ({'first', 'last'}, default 'last') – Puts NaNs at the beginning if first; last puts NaNs at the end.
ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.
New in version 1.0.0.
key (callable, optional) – Apply the key function to the values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect a Series and return a Series with the same shape as the input. It will be applied to each column in by independently.
New in version 1.1.0.
- Returns
DataFrame with sorted values or None if inplace=True.
- Return type
DataFrame or None
See also
DataFrame.sort_index
Sort a DataFrame by the index.
Series.sort_values
Similar method for a Series.
Examples
>>> df = pd.DataFrame({
...     'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
...     'col2': [2, 1, 9, 8, 7, 4],
...     'col3': [0, 1, 9, 4, 2, 3],
...     'col4': ['a', 'B', 'c', 'D', 'e', 'F']
... })
>>> df
  col1  col2  col3 col4
0    A     2     0    a
1    A     1     1    B
2    B     9     9    c
3  NaN     8     4    D
4    D     7     2    e
5    C     4     3    F
Sort by col1
>>> df.sort_values(by=['col1'])
  col1  col2  col3 col4
0    A     2     0    a
1    A     1     1    B
2    B     9     9    c
5    C     4     3    F
4    D     7     2    e
3  NaN     8     4    D
Sort by multiple columns
>>> df.sort_values(by=['col1', 'col2'])
  col1  col2  col3 col4
1    A     1     1    B
0    A     2     0    a
2    B     9     9    c
5    C     4     3    F
4    D     7     2    e
3  NaN     8     4    D
Sort Descending
>>> df.sort_values(by='col1', ascending=False)
  col1  col2  col3 col4
4    D     7     2    e
5    C     4     3    F
2    B     9     9    c
0    A     2     0    a
1    A     1     1    B
3  NaN     8     4    D
Putting NAs first
>>> df.sort_values(by='col1', ascending=False, na_position='first')
  col1  col2  col3 col4
3  NaN     8     4    D
4    D     7     2    e
5    C     4     3    F
2    B     9     9    c
0    A     2     0    a
1    A     1     1    B
Sorting with a key function
>>> df.sort_values(by='col4', key=lambda col: col.str.lower())
  col1  col2  col3 col4
0    A     2     0    a
1    A     1     1    B
2    B     9     9    c
3  NaN     8     4    D
4    D     7     2    e
5    C     4     3    F
Natural sort with the key argument, using the natsort package (https://github.com/SethMMorton/natsort).
>>> df = pd.DataFrame({
...     "time": ['0hr', '128hr', '72hr', '48hr', '96hr'],
...     "value": [10, 20, 30, 40, 50]
... })
>>> df
    time  value
0    0hr     10
1  128hr     20
2   72hr     30
3   48hr     40
4   96hr     50
>>> from natsort import index_natsorted
>>> df.sort_values(
...     by="time",
...     key=lambda x: np.argsort(index_natsorted(df["time"]))
... )
    time  value
0    0hr     10
3   48hr     40
2   72hr     30
4   96hr     50
1  128hr     20
Notes
See pandas API documentation for pandas.DataFrame.sort_values for more.
- std(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
Return sample standard deviation over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument.
- Parameters
axis ({index (0), columns (1)}) –
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
- Returns
- Return type
Series or DataFrame (if level specified)
Notes
See pandas API documentation for pandas.DataFrame.std for more. To have the same behaviour as numpy.std, use ddof=0 (instead of the default ddof=1)
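The ddof note above can be checked directly; a small sketch comparing the pandas default (ddof=1) with numpy.std (ddof=0):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

sample_std = s.std()            # ddof=1: divides by N - 1 (Bessel's correction)
population_std = s.std(ddof=0)  # ddof=0: divides by N

# numpy.std defaults to ddof=0, so it agrees with s.std(ddof=0),
# and the sample estimate is always the larger of the two.
```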
- sub(other, axis='columns', level=None, fill_value=None)
Get Subtraction of dataframe and other, element-wise (binary operator sub).
Equivalent to dataframe - other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rsub.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns
Result of the arithmetic operation.
- Return type
DataFrame
See also
DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.
Notes
See pandas API documentation for pandas.DataFrame.sub for more. Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360
Add a scalar with operator version which returns the same results.
>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
Divide by constant with reverse version.
>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
- subtract(other, axis='columns', level=None, fill_value=None)
Get Subtraction of dataframe and other, element-wise (binary operator sub).
Equivalent to dataframe - other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rsub.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns
Result of the arithmetic operation.
- Return type
DataFrame
See also
DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.
Notes
See pandas API documentation for pandas.DataFrame.sub for more. Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360
Add a scalar with operator version which returns the same results.
>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
Divide by constant with reverse version.
>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
- swapaxes(axis1, axis2, copy=True)
Interchange axes and swap values axes appropriately.
- Returns
y
- Return type
same as input
Notes
See pandas API documentation for pandas.DataFrame.swapaxes for more.
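Since this entry has no Examples section: for a two-dimensional DataFrame, swapping axes 0 and 1 is the same as transposing. A minimal sketch (the labels are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["r0", "r1"])

# Interchange the row and column axes; for 2-D data this matches df.T.
swapped = df.swapaxes(0, 1)
```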
- swaplevel(i=-2, j=-1, axis=0)
Swap levels i and j in a MultiIndex.
Default is to swap the two innermost levels of the index.
- Parameters
i (int or str) – Levels of the indices to be swapped. Can pass level name as string.
j (int or str) – Levels of the indices to be swapped. Can pass level name as string.
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to swap levels on. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.
- Returns
DataFrame – DataFrame with levels swapped in MultiIndex.
Examples
>>> df = pd.DataFrame(
...     {"Grade": ["A", "B", "A", "C"]},
...     index=[
...         ["Final exam", "Final exam", "Coursework", "Coursework"],
...         ["History", "Geography", "History", "Geography"],
...         ["January", "February", "March", "April"],
...     ],
... )
>>> df
                                    Grade
Final exam  History    January          A
            Geography  February         B
Coursework  History    March            A
            Geography  April            C
In the following example, we will swap the levels of the indices. Here, we will swap the levels row-wise, but levels can be swapped column-wise in a similar manner. Note that row-wise (axis=0) is the default behaviour. By not supplying any arguments for i and j, we swap the last and second to last indices.
>>> df.swaplevel()
                                    Grade
Final exam  January    History          A
            February   Geography        B
Coursework  March      History          A
            April      Geography        C
By supplying one argument, we can choose which index to swap the last index with. We can for example swap the first index with the last one as follows.
>>> df.swaplevel(0)
                                    Grade
January    History    Final exam        A
February   Geography  Final exam        B
March      History    Coursework        A
April      Geography  Coursework        C
We can also define explicitly which indices we want to swap by supplying values for both i and j. Here, we for example swap the first and second indices.
>>> df.swaplevel(0, 1)
                                    Grade
History    Final exam  January          A
Geography  Final exam  February         B
History    Coursework  March            A
Geography  Coursework  April            C
Notes
See pandas API documentation for pandas.DataFrame.swaplevel for more.
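The examples above swap row index levels; the axis argument does the same for a column MultiIndex. A small sketch (the labels are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 2], [3, 4]],
    columns=pd.MultiIndex.from_tuples([("A", "x"), ("A", "y")]),
)

# Swap the two column levels instead of the index levels.
swapped = df.swaplevel(0, 1, axis=1)
```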
- tail(n=5)
Return the last n rows.
This function returns last n rows from the object based on position. It is useful for quickly verifying data, for example, after sorting or appending rows.
For negative values of n, this function returns all rows except the first n rows, equivalent to df[n:].
- Parameters
n (int, default 5) – Number of rows to select.
- Returns
The last n rows of the caller object.
- Return type
type of caller
See also
DataFrame.head
The first n rows of the caller object.
Examples
>>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
...                               'monkey', 'parrot', 'shark', 'whale', 'zebra']})
>>> df
      animal
0  alligator
1        bee
2     falcon
3       lion
4     monkey
5     parrot
6      shark
7      whale
8      zebra
Viewing the last 5 lines
>>> df.tail()
   animal
4  monkey
5  parrot
6   shark
7   whale
8   zebra
Viewing the last n lines (three in this case)
>>> df.tail(3)
  animal
6  shark
7  whale
8  zebra
For negative values of n
>>> df.tail(-3)
   animal
3    lion
4  monkey
5  parrot
6   shark
7   whale
8   zebra
Notes
See pandas API documentation for pandas.DataFrame.tail for more.
- take(indices, axis=0, is_copy=None, **kwargs)
Return the elements in the given positional indices along an axis.
This means that we are not indexing according to actual values in the index attribute of the object. We are indexing according to the actual position of the element in the object.
- Parameters
indices (array-like) – An array of ints indicating which positions to take.
axis ({0 or 'index', 1 or 'columns', None}, default 0) – The axis on which to select elements. 0 means that we are selecting rows, 1 means that we are selecting columns.
is_copy (bool) – Before pandas 1.0, is_copy=False could be specified to ensure that the return value is an actual copy. Starting with pandas 1.0, take always returns a copy, and the keyword is therefore deprecated.
Deprecated since version 1.0.0.
**kwargs – For compatibility with numpy.take(). Has no effect on the output.
- Returns
taken – An array-like containing the elements taken from the object.
- Return type
same type as caller
See also
DataFrame.loc
Select a subset of a DataFrame by labels.
DataFrame.iloc
Select a subset of a DataFrame by positions.
numpy.take
Take elements from an array along an axis.
Examples
>>> df = pd.DataFrame([('falcon', 'bird', 389.0),
...                    ('parrot', 'bird', 24.0),
...                    ('lion', 'mammal', 80.5),
...                    ('monkey', 'mammal', np.nan)],
...                   columns=['name', 'class', 'max_speed'],
...                   index=[0, 2, 3, 1])
>>> df
     name   class  max_speed
0  falcon    bird      389.0
2  parrot    bird       24.0
3    lion  mammal       80.5
1  monkey  mammal        NaN
Take elements at positions 0 and 3 along the axis 0 (default).
Note how the actual indices selected (0 and 1) do not correspond to our selected indices 0 and 3. That’s because we are selecting the 0th and 3rd rows, not rows whose indices equal 0 and 3.
>>> df.take([0, 3])
     name   class  max_speed
0  falcon    bird      389.0
1  monkey  mammal        NaN
Take elements at indices 1 and 2 along the axis 1 (column selection).
>>> df.take([1, 2], axis=1)
    class  max_speed
0    bird      389.0
2    bird       24.0
3  mammal       80.5
1  mammal        NaN
We may take elements using negative integers for positive indices, starting from the end of the object, just like with Python lists.
>>> df.take([-1, -2])
     name   class  max_speed
1  monkey  mammal        NaN
3    lion  mammal       80.5
Notes
See pandas API documentation for pandas.DataFrame.take for more.
- to_clipboard(excel=True, sep=None, **kwargs)
Copy object to the system clipboard.
Write a text representation of object to the system clipboard. This can be pasted into Excel, for example.
- Parameters
excel (bool, default True) –
Produce output in a csv format for easy pasting into Excel.
True, use the provided separator for csv pasting.
False, write a string representation of the object to the clipboard.
sep (str, default '\t') – Field delimiter.
**kwargs – These parameters will be passed to DataFrame.to_csv.
See also
DataFrame.to_csv
Write a DataFrame to a comma-separated values (csv) file.
read_clipboard
Read text from clipboard and pass to read_table.
Notes
See pandas API documentation for pandas.DataFrame.to_clipboard for more. Requirements for your platform.
Linux : xclip, or xsel (with PyQt4 modules)
Windows : none
OS X : none
Examples
Copy the contents of a DataFrame to the clipboard.
>>> df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])
>>> df.to_clipboard(sep=',')
... # Wrote the following to the system clipboard:
... # ,A,B,C
... # 0,1,2,3
... # 1,4,5,6
We can omit the index by passing the keyword index and setting it to False.
>>> df.to_clipboard(sep=',', index=False)
... # Wrote the following to the system clipboard:
... # A,B,C
... # 1,2,3
... # 4,5,6
- to_csv(path_or_buf=None, sep=',', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, mode='w', encoding=None, compression='infer', quoting=None, quotechar='"', line_terminator=None, chunksize=None, date_format=None, doublequote=True, escapechar=None, decimal='.', errors: str = 'strict', storage_options: Optional[Dict[str, Any]] = None)
Write object to a comma-separated values (csv) file.
- Parameters
path_or_buf (str or file handle, default None) –
File path or object, if None is provided the result is returned as a string. If a non-binary file object is passed, it should be opened with newline=’’, disabling universal newlines. If a binary file object is passed, mode might need to contain a ‘b’.
Changed in version 1.2.0: Support for binary file objects was introduced.
sep (str, default ',') – String of length 1. Field delimiter for the output file.
na_rep (str, default '') – Missing data representation.
float_format (str, default None) – Format string for floating point numbers.
columns (sequence, optional) – Columns to write.
header (bool or list of str, default True) – Write out the column names. If a list of strings is given it is assumed to be aliases for the column names.
index (bool, default True) – Write row names (index).
index_label (str or sequence, or False, default None) – Column label for index column(s) if desired. If None is given, and header and index are True, then the index names are used. A sequence should be given if the object uses MultiIndex. If False do not print fields for index names. Use index_label=False for easier importing in R.
mode (str) – Python write mode, default ‘w’.
encoding (str, optional) – A string representing the encoding to use in the output file, defaults to ‘utf-8’. encoding is not supported if path_or_buf is a non-binary file object.
compression (str or dict, default 'infer') –
If str, represents compression mode. If dict, value at ‘method’ is the compression mode. Compression mode may be any of the following possible values: {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}. If compression mode is ‘infer’ and path_or_buf is path-like, then detect compression mode from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’ or ‘.xz’. (otherwise no compression). If dict given and mode is one of {‘zip’, ‘gzip’, ‘bz2’}, or inferred as one of the above, other entries passed as additional compression options.
Changed in version 1.0.0: May now be a dict with key ‘method’ as compression mode and other entries as additional compression options if compression mode is ‘zip’.
Changed in version 1.1.0: Passing compression options as keys in dict is supported for compression modes ‘gzip’ and ‘bz2’ as well as ‘zip’.
Changed in version 1.2.0: Compression is supported for binary file objects.
Changed in version 1.2.0: Previous versions forwarded dict entries for ‘gzip’ to gzip.open instead of gzip.GzipFile which prevented setting mtime.
quoting (optional constant from csv module) – Defaults to csv.QUOTE_MINIMAL. If you have set a float_format then floats are converted to strings and thus csv.QUOTE_NONNUMERIC will treat them as non-numeric.
quotechar (str, default '"') – String of length 1. Character used to quote fields.
line_terminator (str, optional) – The newline character or character sequence to use in the output file. Defaults to os.linesep, which depends on the OS in which this method is called (e.g. '\n' for Linux, '\r\n' for Windows).
chunksize (int or None) – Rows to write at a time.
date_format (str, default None) – Format string for datetime objects.
doublequote (bool, default True) – Control quoting of quotechar inside a field.
escapechar (str, default None) – String of length 1. Character used to escape sep and quotechar when appropriate.
decimal (str, default '.') – Character recognized as decimal separator. E.g. use ‘,’ for European data.
errors (str, default 'strict') – Specifies how encoding and decoding errors are to be handled. See the errors argument for open() for a full list of options.
New in version 1.1.0.
storage_options (dict, optional) – Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec. Please see fsspec and urllib for more details.
New in version 1.2.0.
- Returns
If path_or_buf is None, returns the resulting csv format as a string. Otherwise returns None.
- Return type
None or str
See also
read_csv
Load a CSV file into a DataFrame.
to_excel
Write DataFrame to an Excel file.
Examples
>>> df = pd.DataFrame({'name': ['Raphael', 'Donatello'],
...                    'mask': ['red', 'purple'],
...                    'weapon': ['sai', 'bo staff']})
>>> df.to_csv(index=False)
'name,mask,weapon\nRaphael,red,sai\nDonatello,purple,bo staff\n'
Create ‘out.zip’ containing ‘out.csv’
>>> compression_opts = dict(method='zip',
...                         archive_name='out.csv')
>>> df.to_csv('out.zip', index=False,
...           compression=compression_opts)
Notes
See pandas API documentation for pandas.DataFrame.to_csv for more.
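Since path_or_buf also accepts file handles, writing to an in-memory buffer is a convenient way to inspect the exact output without touching disk; a minimal sketch:

```python
import io
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Write into a StringIO instead of a file path.
buf = io.StringIO()
df.to_csv(buf, index=False)
text = buf.getvalue()
# text now holds the CSV: a header line followed by one line per row.
```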
- to_dict(orient='dict', into=<class 'dict'>)
Convert the DataFrame to a dictionary.
The type of the key-value pairs can be customized with the parameters (see below).
- Parameters
orient (str {'dict', 'list', 'series', 'split', 'records', 'index'}) –
Determines the type of the values of the dictionary.
’dict’ (default) : dict like {column -> {index -> value}}
’list’ : dict like {column -> [values]}
’series’ : dict like {column -> Series(values)}
’split’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}
’records’ : list like [{column -> value}, … , {column -> value}]
’index’ : dict like {index -> {column -> value}}
Abbreviations are allowed. s indicates series and sp indicates split.
into (class, default dict) – The collections.abc.Mapping subclass used for all Mappings in the return value. Can be the actual class or an empty instance of the mapping type you want. If you want a collections.defaultdict, you must pass it initialized.
- Returns
Return a collections.abc.Mapping object representing the DataFrame. The resulting transformation depends on the orient parameter.
- Return type
dict, list or collections.abc.Mapping
See also
DataFrame.from_dict
Create a DataFrame from a dictionary.
DataFrame.to_json
Convert a DataFrame to JSON format.
Examples
>>> df = pd.DataFrame({'col1': [1, 2],
...                    'col2': [0.5, 0.75]},
...                   index=['row1', 'row2'])
>>> df
      col1  col2
row1     1  0.50
row2     2  0.75
>>> df.to_dict()
{'col1': {'row1': 1, 'row2': 2}, 'col2': {'row1': 0.5, 'row2': 0.75}}
You can specify the return orientation.
>>> df.to_dict('series')
{'col1': row1    1
row2    2
Name: col1, dtype: int64,
'col2': row1    0.50
row2    0.75
Name: col2, dtype: float64}
>>> df.to_dict('split')
{'index': ['row1', 'row2'], 'columns': ['col1', 'col2'],
 'data': [[1, 0.5], [2, 0.75]]}
>>> df.to_dict('records')
[{'col1': 1, 'col2': 0.5}, {'col1': 2, 'col2': 0.75}]
>>> df.to_dict('index')
{'row1': {'col1': 1, 'col2': 0.5}, 'row2': {'col1': 2, 'col2': 0.75}}
You can also specify the mapping type.
>>> from collections import OrderedDict, defaultdict
>>> df.to_dict(into=OrderedDict)
OrderedDict([('col1', OrderedDict([('row1', 1), ('row2', 2)])),
             ('col2', OrderedDict([('row1', 0.5), ('row2', 0.75)]))])
If you want a defaultdict, you need to initialize it:
>>> dd = defaultdict(list)
>>> df.to_dict('records', into=dd)
[defaultdict(<class 'list'>, {'col1': 1, 'col2': 0.5}),
 defaultdict(<class 'list'>, {'col1': 2, 'col2': 0.75})]
Notes
See pandas API documentation for pandas.DataFrame.to_dict for more.
- to_excel(excel_writer, sheet_name='Sheet1', na_rep='', float_format=None, columns=None, header=True, index=True, index_label=None, startrow=0, startcol=0, engine=None, merge_cells=True, encoding=None, inf_rep='inf', verbose=True, freeze_panes=None, storage_options: Optional[Dict[str, Any]] = None)
Write object to an Excel sheet.
To write a single object to an Excel .xlsx file it is only necessary to specify a target file name. To write to multiple sheets it is necessary to create an ExcelWriter object with a target file name, and specify a sheet in the file to write to.
Multiple sheets may be written to by specifying unique sheet_name. With all data written to the file it is necessary to save the changes. Note that creating an ExcelWriter object with a file name that already exists will result in the contents of the existing file being erased.
- Parameters
excel_writer (path-like, file-like, or ExcelWriter object) – File path or existing ExcelWriter.
sheet_name (str, default 'Sheet1') – Name of sheet which will contain DataFrame.
na_rep (str, default '') – Missing data representation.
float_format (str, optional) – Format string for floating point numbers. For example float_format="%.2f" will format 0.1234 to 0.12.
columns (sequence or list of str, optional) – Columns to write.
header (bool or list of str, default True) – Write out the column names. If a list of string is given it is assumed to be aliases for the column names.
index (bool, default True) – Write row names (index).
index_label (str or sequence, optional) – Column label for index column(s) if desired. If not specified, and header and index are True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex.
startrow (int, default 0) – Upper left cell row to dump data frame.
startcol (int, default 0) – Upper left cell column to dump data frame.
engine (str, optional) – Write engine to use, ‘openpyxl’ or ‘xlsxwriter’. You can also set this via the options io.excel.xlsx.writer, io.excel.xls.writer, and io.excel.xlsm.writer.
Deprecated since version 1.2.0: As the xlwt package is no longer maintained, the xlwt engine will be removed in a future version of pandas.
merge_cells (bool, default True) – Write MultiIndex and Hierarchical Rows as merged cells.
encoding (str, optional) – Encoding of the resulting excel file. Only necessary for xlwt, other writers support unicode natively.
inf_rep (str, default 'inf') – Representation for infinity (there is no native representation for infinity in Excel).
verbose (bool, default True) – Display more information in the error logs.
freeze_panes (tuple of int (length 2), optional) – Specifies the one-based bottommost row and rightmost column that is to be frozen.
storage_options (dict, optional) – Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec. Please see fsspec and urllib for more details.
New in version 1.2.0.
See also
to_csv
Write DataFrame to a comma-separated values (csv) file.
ExcelWriter
Class for writing DataFrame objects into excel sheets.
read_excel
Read an Excel file into a pandas DataFrame.
read_csv
Read a comma-separated values (csv) file into DataFrame.
Notes
See pandas API documentation for pandas.DataFrame.to_excel for more. For compatibility with to_csv(), to_excel serializes lists and dicts to strings before writing.
Once a workbook has been saved it is not possible to write further data without rewriting the whole workbook.
Examples
Create, write to and save a workbook:
>>> df1 = pd.DataFrame([['a', 'b'], ['c', 'd']],
...                    index=['row 1', 'row 2'],
...                    columns=['col 1', 'col 2'])
>>> df1.to_excel("output.xlsx")
To specify the sheet name:
>>> df1.to_excel("output.xlsx",
...              sheet_name='Sheet_name_1')
If you wish to write to more than one sheet in the workbook, it is necessary to specify an ExcelWriter object:
>>> df2 = df1.copy()
>>> with pd.ExcelWriter('output.xlsx') as writer:
...     df1.to_excel(writer, sheet_name='Sheet_name_1')
...     df2.to_excel(writer, sheet_name='Sheet_name_2')
ExcelWriter can also be used to append to an existing Excel file:
>>> with pd.ExcelWriter('output.xlsx',
...                     mode='a') as writer:
...     df.to_excel(writer, sheet_name='Sheet_name_3')
To set the library that is used to write the Excel file, you can pass the engine keyword (the default engine is automatically chosen depending on the file extension):
>>> df1.to_excel('output1.xlsx', engine='xlsxwriter')
- to_hdf(path_or_buf, key, format='table', **kwargs)
Write the contained data to an HDF5 file using HDFStore.
Hierarchical Data Format (HDF) is self-describing, allowing an application to interpret the structure and contents of a file with no outside information. One HDF file can hold a mix of related objects which can be accessed as a group or as individual objects.
In order to add another DataFrame or Series to an existing HDF file please use append mode and a different key.
Warning
One can store a subclass of DataFrame or Series to HDF5, but the type of the subclass is lost upon storing.
For more information see the user guide.
- Parameters
path_or_buf (str or pandas.HDFStore) – File path or HDFStore object.
key (str) – Identifier for the group in the store.
mode ({'a', 'w', 'r+'}, default 'a') –
Mode to open file:
’w’: write, a new file is created (an existing file with the same name would be deleted).
’a’: append, an existing file is opened for reading and writing, and if the file does not exist it is created.
’r+’: similar to ‘a’, but the file must already exist.
complevel ({0-9}, optional) – Specifies a compression level for data. A value of 0 disables compression.
complib ({'zlib', 'lzo', 'bzip2', 'blosc'}, default 'zlib') – Specifies the compression library to be used. As of v0.20.2 these additional compressors for Blosc are supported (default if no compressor specified: ‘blosc:blosclz’): {‘blosc:blosclz’, ‘blosc:lz4’, ‘blosc:lz4hc’, ‘blosc:snappy’, ‘blosc:zlib’, ‘blosc:zstd’}. Specifying a compression library which is not available issues a ValueError.
append (bool, default False) – For Table formats, append the input data to the existing.
format ({'fixed', 'table', None}, default 'fixed') –
Possible values:
’fixed’: Fixed format. Fast writing/reading. Not-appendable, nor searchable.
’table’: Table format. Write as a PyTables Table structure which may perform worse but allow more flexible operations like searching / selecting subsets of the data.
If None, pd.get_option(‘io.hdf.default_format’) is checked, followed by fallback to “fixed”.
errors (str, default 'strict') – Specifies how encoding and decoding errors are to be handled. See the errors argument for open() for a full list of options.
encoding (str, default "UTF-8") –
min_itemsize (dict or int, optional) – Map column names to minimum string sizes for columns.
nan_rep (Any, optional) – How to represent null values as str. Not allowed with append=True.
data_columns (list of columns or True, optional) – List of columns to create as indexed data columns for on-disk queries, or True to use all columns. By default only the axes of the object are indexed. See io.hdf5-query-data-columns. Applicable only to format=’table’.
See also
read_hdf
Read from HDF file.
DataFrame.to_parquet
Write a DataFrame to the binary parquet format.
DataFrame.to_sql
Write to a SQL table.
DataFrame.to_feather
Write out feather-format for DataFrames.
DataFrame.to_csv
Write out to a csv file.
Examples
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]},
...                   index=['a', 'b', 'c'])
>>> df.to_hdf('data.h5', key='df', mode='w')
We can add another object to the same file:
>>> s = pd.Series([1, 2, 3, 4])
>>> s.to_hdf('data.h5', key='s')
Reading from HDF file:
>>> pd.read_hdf('data.h5', 'df')
   A  B
a  1  4
b  2  5
c  3  6
>>> pd.read_hdf('data.h5', 's')
0    1
1    2
2    3
3    4
dtype: int64
Deleting file with data:
>>> import os
>>> os.remove('data.h5')
Notes
See pandas API documentation for pandas.DataFrame.to_hdf for more.
- to_json(path_or_buf=None, orient=None, date_format=None, double_precision=10, force_ascii=True, date_unit='ms', default_handler=None, lines=False, compression='infer', index=True, indent=None, storage_options: Optional[Dict[str, Any]] = None)
Convert the object to a JSON string.
Note NaN’s and None will be converted to null and datetime objects will be converted to UNIX timestamps.
- Parameters
path_or_buf (str or file handle, optional) – File path or object. If not specified, the result is returned as a string.
orient (str) –
Indication of expected JSON string format.
Series:
default is ‘index’
allowed values are: {‘split’, ‘records’, ‘index’, ‘table’}.
DataFrame:
default is ‘columns’
allowed values are: {‘split’, ‘records’, ‘index’, ‘columns’, ‘values’, ‘table’}.
The format of the JSON string:
’split’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}
’records’ : list like [{column -> value}, … , {column -> value}]
’index’ : dict like {index -> {column -> value}}
’columns’ : dict like {column -> {index -> value}}
’values’ : just the values array
’table’ : dict like {‘schema’: {schema}, ‘data’: {data}}
Describing the data, where data component is like orient='records'.
date_format ({None, 'epoch', 'iso'}) – Type of date conversion. ‘epoch’ = epoch milliseconds, ‘iso’ = ISO8601. The default depends on the orient. For orient='table', the default is ‘iso’. For all other orients, the default is ‘epoch’.
double_precision (int, default 10) – The number of decimal places to use when encoding floating point values.
force_ascii (bool, default True) – Force encoded string to be ASCII.
date_unit (str, default 'ms' (milliseconds)) – The time unit to encode to, governs timestamp and ISO8601 precision. One of ‘s’, ‘ms’, ‘us’, ‘ns’ for second, millisecond, microsecond, and nanosecond respectively.
default_handler (callable, default None) – Handler to call if object cannot otherwise be converted to a suitable format for JSON. Should receive a single argument which is the object to convert and return a serialisable object.
lines (bool, default False) – If ‘orient’ is ‘records’ write out line-delimited json format. Will throw ValueError if incorrect ‘orient’ since others are not list-like.
compression ({'infer', 'gzip', 'bz2', 'zip', 'xz', None}) – A string representing the compression to use in the output file, only used when the first argument is a filename. By default, the compression is inferred from the filename.
index (bool, default True) – Whether to include the index values in the JSON string. Not including the index (index=False) is only supported when orient is ‘split’ or ‘table’.
indent (int, optional) – Length of whitespace used to indent each record.
New in version 1.0.0.
storage_options (dict, optional) – Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec. Please see fsspec and urllib for more details.
New in version 1.2.0.
- Returns
If path_or_buf is None, returns the resulting json format as a string. Otherwise returns None.
- Return type
None or str
See also
read_json
Convert a JSON string to pandas object.
Notes
See pandas API documentation for pandas.DataFrame.to_json for more. The behavior of indent=0 varies from the stdlib, which does not indent the output but does insert newlines. Currently, indent=0 and the default indent=None are equivalent in pandas, though this may change in a future release.
orient='table' contains a ‘pandas_version’ field under ‘schema’. This stores the version of pandas used in the latest revision of the schema.
Examples
>>> import json
>>> df = pd.DataFrame(
...     [["a", "b"], ["c", "d"]],
...     index=["row 1", "row 2"],
...     columns=["col 1", "col 2"],
... )
>>> result = df.to_json(orient="split")
>>> parsed = json.loads(result)
>>> json.dumps(parsed, indent=4)
{
    "columns": [
        "col 1",
        "col 2"
    ],
    "index": [
        "row 1",
        "row 2"
    ],
    "data": [
        [
            "a",
            "b"
        ],
        [
            "c",
            "d"
        ]
    ]
}
Encoding/decoding a Dataframe using 'records' formatted JSON. Note that index labels are not preserved with this encoding.
>>> result = df.to_json(orient="records")
>>> parsed = json.loads(result)
>>> json.dumps(parsed, indent=4)
[
    {
        "col 1": "a",
        "col 2": "b"
    },
    {
        "col 1": "c",
        "col 2": "d"
    }
]
Encoding/decoding a Dataframe using 'index' formatted JSON:
>>> result = df.to_json(orient="index")
>>> parsed = json.loads(result)
>>> json.dumps(parsed, indent=4)
{
    "row 1": {
        "col 1": "a",
        "col 2": "b"
    },
    "row 2": {
        "col 1": "c",
        "col 2": "d"
    }
}
Encoding/decoding a Dataframe using 'columns' formatted JSON:
>>> result = df.to_json(orient="columns")
>>> parsed = json.loads(result)
>>> json.dumps(parsed, indent=4)
{
    "col 1": {
        "row 1": "a",
        "row 2": "c"
    },
    "col 2": {
        "row 1": "b",
        "row 2": "d"
    }
}
Encoding/decoding a Dataframe using 'values' formatted JSON:
>>> result = df.to_json(orient="values")
>>> parsed = json.loads(result)
>>> json.dumps(parsed, indent=4)
[
    [
        "a",
        "b"
    ],
    [
        "c",
        "d"
    ]
]
Encoding with Table Schema:
>>> result = df.to_json(orient="table")
>>> parsed = json.loads(result)
>>> json.dumps(parsed, indent=4)
{
    "schema": {
        "fields": [
            {
                "name": "index",
                "type": "string"
            },
            {
                "name": "col 1",
                "type": "string"
            },
            {
                "name": "col 2",
                "type": "string"
            }
        ],
        "primaryKey": [
            "index"
        ],
        "pandas_version": "0.20.0"
    },
    "data": [
        {
            "index": "row 1",
            "col 1": "a",
            "col 2": "b"
        },
        {
            "index": "row 2",
            "col 1": "c",
            "col 2": "d"
        }
    ]
}
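The lines parameter described above (line-delimited JSON with orient='records') can be sketched as follows. A minimal sketch using plain pandas; with Modin, import modin.pandas as pd is intended to behave the same.

```python
import json
import pandas as pd  # with Modin: import modin.pandas as pd

df = pd.DataFrame(
    [["a", "b"], ["c", "d"]],
    index=["row 1", "row 2"],
    columns=["col 1", "col 2"],
)

# lines=True requires orient='records'; each row becomes one JSON object
# per line, so every line is independently parseable.
ndjson = df.to_json(orient="records", lines=True)
records = [json.loads(line) for line in ndjson.splitlines() if line]
```

Note that, as with orient='records' generally, the index labels are not preserved.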
- to_latex(buf=None, columns=None, col_space=None, header=True, index=True, na_rep='NaN', formatters=None, float_format=None, sparsify=None, index_names=True, bold_rows=False, column_format=None, longtable=None, escape=None, encoding=None, decimal='.', multicolumn=None, multicolumn_format=None, multirow=None, caption=None, label=None, position=None)
Render object to a LaTeX tabular, longtable, or nested table/tabular.
Requires \usepackage{booktabs}. The output can be copy/pasted into a main LaTeX document or read from an external file with \input{table.tex}.
Changed in version 1.0.0: Added caption and label arguments.
Changed in version 1.2.0: Added position argument, changed meaning of caption argument.
- Parameters
buf (str, Path or StringIO-like, optional, default None) – Buffer to write to. If None, the output is returned as a string.
columns (list of label, optional) – The subset of columns to write. Writes all columns by default.
col_space (int, optional) – The minimum width of each column.
header (bool or list of str, default True) – Write out the column names. If a list of strings is given, it is assumed to be aliases for the column names.
index (bool, default True) – Write row names (index).
na_rep (str, default 'NaN') – Missing data representation.
formatters (list of functions or dict of {str: function}, optional) – Formatter functions to apply to columns’ elements by position or name. The result of each function must be a unicode string. List must be of length equal to the number of columns.
float_format (one-parameter function or str, optional, default None) – Formatter for floating point numbers. For example float_format="%.2f" and float_format="{:0.2f}".format will both result in 0.1234 being formatted as 0.12.
sparsify (bool, optional) – Set to False for a DataFrame with a hierarchical index to print every multiindex key at each row. By default, the value will be read from the config module.
index_names (bool, default True) – Prints the names of the indexes.
bold_rows (bool, default False) – Make the row labels bold in the output.
column_format (str, optional) – The columns format as specified in LaTeX table format e.g. ‘rcl’ for 3 columns. By default, ‘l’ will be used for all columns except columns of numbers, which default to ‘r’.
longtable (bool, optional) – By default, the value will be read from the pandas config module. Use a longtable environment instead of tabular. Requires adding a usepackage{longtable} to your LaTeX preamble.
escape (bool, optional) – By default, the value will be read from the pandas config module. When set to False prevents from escaping latex special characters in column names.
encoding (str, optional) – A string representing the encoding to use in the output file, defaults to ‘utf-8’.
decimal (str, default '.') – Character recognized as decimal separator, e.g. ‘,’ in Europe.
multicolumn (bool, default True) – Use multicolumn to enhance MultiIndex columns. The default will be read from the config module.
multicolumn_format (str, default 'l') – The alignment for multicolumns, similar to column_format The default will be read from the config module.
multirow (bool, default False) – Use multirow to enhance MultiIndex rows. Requires adding a usepackage{multirow} to your LaTeX preamble. Will print centered labels (instead of top-aligned) across the contained rows, separating groups via clines. The default will be read from the pandas config module.
caption (str or tuple, optional) – Tuple (full_caption, short_caption), which results in \caption[short_caption]{full_caption}; if a single string is passed, no short caption will be set.
New in version 1.0.0.
Changed in version 1.2.0: Optionally allow caption to be a tuple (full_caption, short_caption).
label (str, optional) – The LaTeX label to be placed inside \label{} in the output. This is used with \ref{} in the main .tex file.
New in version 1.0.0.
position (str, optional) – The LaTeX positional argument for tables, to be placed after \begin{} in the output.
New in version 1.2.0.
- Returns
If buf is None, returns the result as a string. Otherwise returns None.
- Return type
str or None
See also
DataFrame.to_string
Render a DataFrame to a console-friendly tabular output.
DataFrame.to_html
Render a DataFrame as an HTML table.
Examples
>>> df = pd.DataFrame(dict(name=['Raphael', 'Donatello'],
...                        mask=['red', 'purple'],
...                        weapon=['sai', 'bo staff']))
>>> print(df.to_latex(index=False))
\begin{tabular}{lll}
\toprule
     name &   mask &    weapon \\
\midrule
  Raphael &    red &       sai \\
Donatello & purple &  bo staff \\
\bottomrule
\end{tabular}
Notes
See pandas API documentation for pandas.DataFrame.to_latex for more.
- to_markdown(buf=None, mode: str = 'wt', index: modin.pandas.base.BasePandasDataset.bool = True, storage_options: Optional[Dict[str, Any]] = None, **kwargs)
Print DataFrame in Markdown-friendly format.
New in version 1.0.0.
- Parameters
buf (str, Path or StringIO-like, optional, default None) – Buffer to write to. If None, the output is returned as a string.
mode (str, optional) – Mode in which file is opened, “wt” by default.
index (bool, optional, default True) –
Add index (row) labels.
New in version 1.1.0.
storage_options (dict, optional) – Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec. Please see fsspec and urllib for more details.
New in version 1.2.0.
**kwargs – These parameters will be passed to tabulate.
- Returns
DataFrame in Markdown-friendly format.
- Return type
str
Notes
See pandas API documentation for pandas.DataFrame.to_markdown for more. Requires the tabulate package.
Examples
>>> s = pd.Series(["elk", "pig", "dog", "quetzal"], name="animal")
>>> print(s.to_markdown())
|    | animal   |
|---:|:---------|
|  0 | elk      |
|  1 | pig      |
|  2 | dog      |
|  3 | quetzal  |
Output markdown with a tabulate option.
>>> print(s.to_markdown(tablefmt="grid"))
+----+----------+
|    | animal   |
+====+==========+
|  0 | elk      |
+----+----------+
|  1 | pig      |
+----+----------+
|  2 | dog      |
+----+----------+
|  3 | quetzal  |
+----+----------+
- to_numpy(dtype=None, copy=False, na_value=NoDefault.no_default)
Convert the DataFrame to a NumPy array.
By default, the dtype of the returned array will be the common NumPy dtype of all types in the DataFrame. For example, if the dtypes are float16 and float32, the results dtype will be float32. This may require copying data and coercing values, which may be expensive.
- Parameters
dtype (str or numpy.dtype, optional) – The dtype to pass to numpy.asarray().
copy (bool, default False) – Whether to ensure that the returned value is not a view on another array. Note that copy=False does not ensure that to_numpy() is no-copy. Rather, copy=True ensures that a copy is made, even if not strictly necessary.
na_value (Any, optional) – The value to use for missing values. The default value depends on dtype and the dtypes of the DataFrame columns.
New in version 1.1.0.
- Returns
- Return type
numpy.ndarray
See also
Series.to_numpy
Similar method for Series.
Examples
>>> pd.DataFrame({"A": [1, 2], "B": [3, 4]}).to_numpy()
array([[1, 3],
       [2, 4]])
With heterogeneous data, the lowest common type will have to be used.
>>> df = pd.DataFrame({"A": [1, 2], "B": [3.0, 4.5]})
>>> df.to_numpy()
array([[1. , 3. ],
       [2. , 4.5]])
For a mix of numeric and non-numeric types, the output array will have object dtype.
>>> df['C'] = pd.date_range('2000', periods=2)
>>> df.to_numpy()
array([[1, 3.0, Timestamp('2000-01-01 00:00:00')],
       [2, 4.5, Timestamp('2000-01-02 00:00:00')]], dtype=object)
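The na_value parameter described above can be sketched as follows; a minimal example using plain pandas (with Modin, import modin.pandas as pd is intended to behave the same).

```python
import numpy as np
import pandas as pd  # with Modin: import modin.pandas as pd

df = pd.DataFrame({"A": [1.0, np.nan], "B": [3.0, 4.5]})

# na_value replaces missing values in the exported array only;
# the DataFrame itself still holds NaN.
arr = df.to_numpy(na_value=0.0)
```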
Notes
See pandas API documentation for pandas.DataFrame.to_numpy for more.
- to_period(freq=None, axis=0, copy=True)
Convert DataFrame from DatetimeIndex to PeriodIndex.
Convert DataFrame from DatetimeIndex to PeriodIndex with desired frequency (inferred from index if not passed).
- Parameters
freq (str, default) – Frequency of the PeriodIndex.
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to convert (the index by default).
copy (bool, default True) – If False then underlying input data is not copied.
- Returns
- Return type
DataFrame with PeriodIndex
Notes
See pandas API documentation for pandas.DataFrame.to_period for more.
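The section above has no example; a minimal sketch, assuming a DataFrame with a DatetimeIndex and using plain pandas (with Modin, import modin.pandas as pd is intended to behave the same):

```python
import pandas as pd  # with Modin: import modin.pandas as pd

# Daily timestamps collapsed to monthly periods.
idx = pd.to_datetime(["2021-01-15", "2021-02-15", "2021-03-15"])
df = pd.DataFrame({"sales": [10, 20, 30]}, index=idx)

# freq can also be omitted, in which case it is inferred from the index.
pdf = df.to_period(freq="M")
```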
- to_pickle(path: Union[PathLike[str], str, IO, io.RawIOBase, io.BufferedIOBase, io.TextIOBase, _io.TextIOWrapper, mmap.mmap], compression: Optional[Union[str, Dict[str, Any]]] = 'infer', protocol: int = 4, storage_options: Optional[Dict[str, Any]] = None)
Pickle (serialize) object to file.
- Parameters
path (str) – File path where the pickled object will be stored.
compression ({'infer', 'gzip', 'bz2', 'zip', 'xz', None}, default 'infer') – A string representing the compression to use in the output file. By default, infers from the file extension in specified path. Compression mode may be any of the following possible values: {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}. If compression mode is ‘infer’ and path_or_buf is path-like, then detect compression mode from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’ or ‘.xz’. (otherwise no compression). If dict given and mode is ‘zip’ or inferred as ‘zip’, other entries passed as additional compression options.
protocol (int) –
Int which indicates which protocol should be used by the pickler, default HIGHEST_PROTOCOL (see paragraph 12.1.2 of the Python pickle documentation). The possible values are 0, 1, 2, 3, 4, 5. A negative value for the protocol parameter is equivalent to setting its value to HIGHEST_PROTOCOL.
storage_options (dict, optional) – Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec. Please see fsspec and urllib for more details.
New in version 1.2.0.
See also
read_pickle
Load pickled pandas object (or any object) from file.
DataFrame.to_hdf
Write DataFrame to an HDF5 file.
DataFrame.to_sql
Write DataFrame to a SQL database.
DataFrame.to_parquet
Write a DataFrame to the binary parquet format.
Examples
>>> original_df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})
>>> original_df
   foo  bar
0    0    5
1    1    6
2    2    7
3    3    8
4    4    9
>>> original_df.to_pickle("./dummy.pkl")
>>> unpickled_df = pd.read_pickle("./dummy.pkl")
>>> unpickled_df
   foo  bar
0    0    5
1    1    6
2    2    7
3    3    8
4    4    9
>>> import os
>>> os.remove("./dummy.pkl")
Notes
See pandas API documentation for pandas.DataFrame.to_pickle for more.
- to_sql(name, con, schema=None, if_exists='fail', index=True, index_label=None, chunksize=None, dtype=None, method=None)
Write records stored in a DataFrame to a SQL database.
Databases supported by SQLAlchemy are supported. Tables can be newly created, appended to, or overwritten.
- Parameters
name (str) – Name of SQL table.
con (sqlalchemy.engine.(Engine or Connection) or sqlite3.Connection) – Using SQLAlchemy makes it possible to use any DB supported by that library. Legacy support is provided for sqlite3.Connection objects. The user is responsible for engine disposal and connection closure for the SQLAlchemy connectable See here.
schema (str, optional) – Specify the schema (if database flavor supports this). If None, use default schema.
if_exists ({'fail', 'replace', 'append'}, default 'fail') –
How to behave if the table already exists.
fail: Raise a ValueError.
replace: Drop the table before inserting new values.
append: Insert new values to the existing table.
index (bool, default True) – Write DataFrame index as a column. Uses index_label as the column name in the table.
index_label (str or sequence, default None) – Column label for index column(s). If None is given (default) and index is True, then the index names are used. A sequence should be given if the DataFrame uses MultiIndex.
chunksize (int, optional) – Specify the number of rows in each batch to be written at a time. By default, all rows will be written at once.
dtype (dict or scalar, optional) – Specifying the datatype for columns. If a dictionary is used, the keys should be the column names and the values should be the SQLAlchemy types or strings for the sqlite3 legacy mode. If a scalar is provided, it will be applied to all columns.
method ({None, 'multi', callable}, optional) –
Controls the SQL insertion clause used:
None : Uses standard SQL INSERT clause (one per row).
’multi’: Pass multiple values in a single INSERT clause.
callable with signature (pd_table, conn, keys, data_iter).
Details and a sample callable implementation can be found in the section insert method.
- Raises
ValueError – When the table already exists and if_exists is ‘fail’ (the default).
See also
read_sql
Read a DataFrame from a table.
Notes
See pandas API documentation for pandas.DataFrame.to_sql for more. Timezone aware datetime columns will be written as Timestamp with timezone type with SQLAlchemy if supported by the database. Otherwise, the datetimes will be stored as timezone unaware timestamps local to the original timezone.
Examples
Create an in-memory SQLite database.
>>> from sqlalchemy import create_engine
>>> engine = create_engine('sqlite://', echo=False)
Create a table from scratch with 3 rows.
>>> df = pd.DataFrame({'name' : ['User 1', 'User 2', 'User 3']})
>>> df
     name
0  User 1
1  User 2
2  User 3
>>> df.to_sql('users', con=engine)
>>> engine.execute("SELECT * FROM users").fetchall()
[(0, 'User 1'), (1, 'User 2'), (2, 'User 3')]
An sqlalchemy.engine.Connection can also be passed to con:
>>> with engine.begin() as connection:
...     df1 = pd.DataFrame({'name' : ['User 4', 'User 5']})
...     df1.to_sql('users', con=connection, if_exists='append')
This is allowed to support operations that require that the same DBAPI connection is used for the entire operation.
>>> df2 = pd.DataFrame({'name' : ['User 6', 'User 7']})
>>> df2.to_sql('users', con=engine, if_exists='append')
>>> engine.execute("SELECT * FROM users").fetchall()
[(0, 'User 1'), (1, 'User 2'), (2, 'User 3'),
 (0, 'User 4'), (1, 'User 5'), (0, 'User 6'), (1, 'User 7')]
Overwrite the table with just df2.
>>> df2.to_sql('users', con=engine, if_exists='replace',
...            index_label='id')
>>> engine.execute("SELECT * FROM users").fetchall()
[(0, 'User 6'), (1, 'User 7')]
Specify the dtype (especially useful for integers with missing values). Notice that while pandas is forced to store the data as floating point, the database supports nullable integers. When fetching the data with Python, we get back integer scalars.
>>> df = pd.DataFrame({"A": [1, None, 2]})
>>> df
     A
0  1.0
1  NaN
2  2.0
>>> from sqlalchemy.types import Integer
>>> df.to_sql('integers', con=engine, index=False,
...           dtype={"A": Integer()})
>>> engine.execute("SELECT * FROM integers").fetchall()
[(1,), (None,), (2,)]
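The examples above use a SQLAlchemy engine; the legacy sqlite3.Connection path mentioned in the con parameter can be sketched as follows, with chunksize batching the insert. A minimal sketch; the users table name mirrors the examples above, and plain pandas is shown (with Modin, import modin.pandas as pd is intended to behave the same).

```python
import sqlite3
import pandas as pd  # with Modin: import modin.pandas as pd

# Legacy support: a plain sqlite3 connection works without SQLAlchemy.
con = sqlite3.connect(":memory:")
df = pd.DataFrame({"name": ["User 1", "User 2", "User 3"]})

# chunksize=2 writes the rows in batches of two instead of all at once.
df.to_sql("users", con=con, index=False, chunksize=2)

rows = con.execute("SELECT name FROM users").fetchall()
```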
- to_string(buf=None, columns=None, col_space=None, header=True, index=True, na_rep='NaN', formatters=None, float_format=None, sparsify=None, index_names=True, justify=None, max_rows=None, min_rows=None, max_cols=None, show_dimensions=False, decimal='.', line_width=None, max_colwidth=None, encoding=None)
Render a DataFrame to a console-friendly tabular output.
- Parameters
buf (str, Path or StringIO-like, optional, default None) – Buffer to write to. If None, the output is returned as a string.
columns (sequence, optional, default None) – The subset of columns to write. Writes all columns by default.
col_space (int, list or dict of int, optional) – The minimum width of each column.
header (bool or sequence, optional) – Write out the column names. If a list of strings is given, it is assumed to be aliases for the column names.
index (bool, optional, default True) – Whether to print index (row) labels.
na_rep (str, optional, default 'NaN') – String representation of NaN to use.
formatters (list, tuple or dict of one-param. functions, optional) – Formatter functions to apply to columns’ elements by position or name. The result of each function must be a unicode string. List/tuple must be of length equal to the number of columns.
float_format (one-parameter function, optional, default None) – Formatter function to apply to columns’ elements if they are floats. This function must return a unicode string and will be applied only to the non-NaN elements, with NaN being handled by na_rep.
Changed in version 1.2.0.
sparsify (bool, optional, default True) – Set to False for a DataFrame with a hierarchical index to print every multiindex key at each row.
index_names (bool, optional, default True) – Prints the names of the indexes.
justify (str, default None) – How to justify the column labels. If None uses the option from the print configuration (controlled by set_option), ‘right’ out of the box. Valid values are left, right, center, justify, justify-all, start, end, inherit, match-parent, initial, unset.
max_rows (int, optional) – Maximum number of rows to display in the console.
min_rows (int, optional) – The number of rows to display in the console in a truncated repr (when number of rows is above max_rows).
max_cols (int, optional) – Maximum number of columns to display in the console.
show_dimensions (bool, default False) – Display DataFrame dimensions (number of rows by number of columns).
decimal (str, default '.') – Character recognized as decimal separator, e.g. ‘,’ in Europe.
line_width (int, optional) – Width to wrap a line in characters.
max_colwidth (int, optional) –
Max width to truncate each column in characters. By default, no limit.
New in version 1.0.0.
encoding (str, default "utf-8") –
Set character encoding.
New in version 1.0.
- Returns
If buf is None, returns the result as a string. Otherwise returns None.
- Return type
str or None
See also
to_html
Convert DataFrame to HTML.
Examples
>>> d = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}
>>> df = pd.DataFrame(d)
>>> print(df.to_string())
   col1  col2
0     1     4
1     2     5
2     3     6
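The float_format and formatters parameters described above can be sketched as follows; the '#' prefix is purely illustrative, and plain pandas is shown (with Modin, import modin.pandas as pd is intended to behave the same).

```python
import pandas as pd  # with Modin: import modin.pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [0.1234, 2.5, 3.75]})

# float_format applies to float columns; formatters targets columns
# by name (or position) and must return strings.
out = df.to_string(float_format='{:.2f}'.format,
                   formatters={'col1': lambda v: f'#{v}'})
```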
Notes
See pandas API documentation for pandas.DataFrame.to_string for more.
- to_timestamp(freq=None, how='start', axis=0, copy=True)
Cast to DatetimeIndex of timestamps, at beginning of period.
- Parameters
freq (str, default frequency of PeriodIndex) – Desired frequency.
how ({'s', 'e', 'start', 'end'}) – Convention for converting period to timestamp; start of period vs. end.
axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to convert (the index by default).
copy (bool, default True) – If False then underlying input data is not copied.
- Returns
- Return type
DataFrame with DatetimeIndex
Notes
See pandas API documentation for pandas.DataFrame.to_timestamp for more.
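The section above lacks an example; a minimal round-trip sketch from a PeriodIndex, using plain pandas (with Modin, import modin.pandas as pd is intended to behave the same):

```python
import pandas as pd  # with Modin: import modin.pandas as pd

pidx = pd.period_range("2021-01", periods=3, freq="M")
df = pd.DataFrame({"sales": [10, 20, 30]}, index=pidx)

# how='start' (the default) maps each period to its first timestamp;
# how='end' maps to the last instant of the period instead.
ts = df.to_timestamp(how="start")
```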
- to_xarray()
Return an xarray object from the pandas object.
- Returns
Data in the pandas structure converted to Dataset if the object is a DataFrame, or a DataArray if the object is a Series.
- Return type
xarray.DataArray or xarray.Dataset
See also
DataFrame.to_hdf
Write DataFrame to an HDF5 file.
DataFrame.to_parquet
Write a DataFrame to the binary parquet format.
Notes
See pandas API documentation for pandas.DataFrame.to_xarray for more. See the xarray docs for further information.
Examples
>>> df = pd.DataFrame([('falcon', 'bird', 389.0, 2),
...                    ('parrot', 'bird', 24.0, 2),
...                    ('lion', 'mammal', 80.5, 4),
...                    ('monkey', 'mammal', np.nan, 4)],
...                   columns=['name', 'class', 'max_speed',
...                            'num_legs'])
>>> df
     name   class  max_speed  num_legs
0  falcon    bird      389.0         2
1  parrot    bird       24.0         2
2    lion  mammal       80.5         4
3  monkey  mammal        NaN         4
>>> df.to_xarray() <xarray.Dataset> Dimensions: (index: 4) Coordinates: * index (index) int64 0 1 2 3 Data variables: name (index) object 'falcon' 'parrot' 'lion' 'monkey' class (index) object 'bird' 'bird' 'mammal' 'mammal' max_speed (index) float64 389.0 24.0 80.5 nan num_legs (index) int64 2 2 4 4
>>> df['max_speed'].to_xarray() <xarray.DataArray 'max_speed' (index: 4)> array([389. , 24. , 80.5, nan]) Coordinates: * index (index) int64 0 1 2 3
>>> dates = pd.to_datetime(['2018-01-01', '2018-01-01', ... '2018-01-02', '2018-01-02']) >>> df_multiindex = pd.DataFrame({'date': dates, ... 'animal': ['falcon', 'parrot', ... 'falcon', 'parrot'], ... 'speed': [350, 18, 361, 15]}) >>> df_multiindex = df_multiindex.set_index(['date', 'animal'])
>>> df_multiindex speed date animal 2018-01-01 falcon 350 parrot 18 2018-01-02 falcon 361 parrot 15
>>> df_multiindex.to_xarray() <xarray.Dataset> Dimensions: (animal: 2, date: 2) Coordinates: * date (date) datetime64[ns] 2018-01-01 2018-01-02 * animal (animal) object 'falcon' 'parrot' Data variables: speed (date, animal) int64 350 18 361 15
- transform(func, axis=0, *args, **kwargs)
Call func on self, producing a DataFrame with transformed values. The produced DataFrame will have the same axis length as self.
- Parameters
func (function, str, list-like or dict-like) –
Function to use for transforming the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply. If func is both list-like and dict-like, dict-like behavior takes precedence.
Accepted combinations are:
function
string function name
list-like of functions and/or function names, e.g.
[np.exp, 'sqrt']
dict-like of axis labels -> functions, function names or list-like of such.
axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’: apply function to each column. If 1 or ‘columns’: apply function to each row.
*args – Positional arguments to pass to func.
**kwargs – Keyword arguments to pass to func.
- Returns
A DataFrame that must have the same length as self.
- Return type
DataFrame
- Raises
ValueError – If the returned DataFrame has a different length than self.
See also
DataFrame.agg
Only perform aggregating type operations.
DataFrame.apply
Invoke function on a DataFrame.
Notes
See pandas API documentation for pandas.DataFrame.transform for more. Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See gotchas.udf-mutation for more details.
Examples
>>> df = pd.DataFrame({'A': range(3), 'B': range(1, 4)})
>>> df
   A  B
0  0  1
1  1  2
2  2  3
>>> df.transform(lambda x: x + 1)
   A  B
0  1  2
1  2  3
2  3  4
Even though the resulting DataFrame must have the same length as the input DataFrame, it is possible to provide several input functions:
>>> s = pd.Series(range(3))
>>> s
0    0
1    1
2    2
dtype: int64
>>> s.transform([np.sqrt, np.exp])
       sqrt        exp
0  0.000000   1.000000
1  1.000000   2.718282
2  1.414214   7.389056
You can call transform on a GroupBy object:
>>> df = pd.DataFrame({
...     "Date": [
...         "2015-05-08", "2015-05-07", "2015-05-06", "2015-05-05",
...         "2015-05-08", "2015-05-07", "2015-05-06", "2015-05-05"],
...     "Data": [5, 8, 6, 1, 50, 100, 60, 120],
... })
>>> df
         Date  Data
0  2015-05-08     5
1  2015-05-07     8
2  2015-05-06     6
3  2015-05-05     1
4  2015-05-08    50
5  2015-05-07   100
6  2015-05-06    60
7  2015-05-05   120
>>> df.groupby('Date')['Data'].transform('sum')
0     55
1    108
2     66
3    121
4     55
5    108
6     66
7    121
Name: Data, dtype: int64
>>> df = pd.DataFrame({
...     "c": [1, 1, 1, 2, 2, 2, 2],
...     "type": ["m", "n", "o", "m", "m", "n", "n"]
... })
>>> df
   c type
0  1    m
1  1    n
2  1    o
3  2    m
4  2    m
5  2    n
6  2    n
>>> df['size'] = df.groupby('c')['type'].transform(len)
>>> df
   c type  size
0  1    m     3
1  1    n     3
2  1    o     3
3  2    m     4
4  2    m     4
5  2    n     4
6  2    n     4
- truediv(other, axis='columns', level=None, fill_value=None)
Get floating division of dataframe and other, element-wise (binary operator truediv).
Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rtruediv.
Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.
- Parameters
other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.
axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.
fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.
- Returns
Result of the arithmetic operation.
- Return type
DataFrame
See also
DataFrame.add
Add DataFrames.
DataFrame.sub
Subtract DataFrames.
DataFrame.mul
Multiply DataFrames.
DataFrame.div
Divide DataFrames (float division).
DataFrame.truediv
Divide DataFrames (float division).
DataFrame.floordiv
Divide DataFrames (integer division).
DataFrame.mod
Calculate modulo (remainder after division).
DataFrame.pow
Calculate exponential power.
Notes
See pandas API documentation for pandas.DataFrame.truediv for more. Mismatched indices will be unioned together.
Examples
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360
Add a scalar with the operator version, which returns the same results.
>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
Divide by constant with reverse version.
>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778
Subtract a list and Series by axis with operator version.
>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359
Multiply a DataFrame of different shape with operator version.
>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0
Divide by a MultiIndex by level.
>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
- truncate(before=None, after=None, axis=None, copy=True)
Truncate a Series or DataFrame before and after some index value.
This is a useful shorthand for boolean indexing based on index values above or below certain thresholds.
- Parameters
before (date, str, int) – Truncate all rows before this index value.
after (date, str, int) – Truncate all rows after this index value.
axis ({0 or 'index', 1 or 'columns'}, optional) – Axis to truncate. Truncates the index (rows) by default.
copy (bool, default is True,) – Return a copy of the truncated section.
- Returns
The truncated Series or DataFrame.
- Return type
type of caller
See also
DataFrame.loc
Select a subset of a DataFrame by label.
DataFrame.iloc
Select a subset of a DataFrame by position.
Notes
See pandas API documentation for pandas.DataFrame.truncate for more. If the index being truncated contains only datetime values, before and after may be specified as strings instead of Timestamps.
Examples
>>> df = pd.DataFrame({'A': ['a', 'b', 'c', 'd', 'e'],
...                    'B': ['f', 'g', 'h', 'i', 'j'],
...                    'C': ['k', 'l', 'm', 'n', 'o']},
...                   index=[1, 2, 3, 4, 5])
>>> df
   A  B  C
1  a  f  k
2  b  g  l
3  c  h  m
4  d  i  n
5  e  j  o
>>> df.truncate(before=2, after=4)
   A  B  C
2  b  g  l
3  c  h  m
4  d  i  n
The columns of a DataFrame can be truncated.
>>> df.truncate(before="A", after="B", axis="columns")
   A  B
1  a  f
2  b  g
3  c  h
4  d  i
5  e  j
For Series, only rows can be truncated.
>>> df['A'].truncate(before=2, after=4)
2    b
3    c
4    d
Name: A, dtype: object
The index values in truncate can be datetimes or string dates.
>>> dates = pd.date_range('2016-01-01', '2016-02-01', freq='s')
>>> df = pd.DataFrame(index=dates, data={'A': 1})
>>> df.tail()
                     A
2016-01-31 23:59:56  1
2016-01-31 23:59:57  1
2016-01-31 23:59:58  1
2016-01-31 23:59:59  1
2016-02-01 00:00:00  1
>>> df.truncate(before=pd.Timestamp('2016-01-05'),
...             after=pd.Timestamp('2016-01-10')).tail()
                     A
2016-01-09 23:59:56  1
2016-01-09 23:59:57  1
2016-01-09 23:59:58  1
2016-01-09 23:59:59  1
2016-01-10 00:00:00  1
Because the index is a DatetimeIndex containing only dates, we can specify before and after as strings. They will be coerced to Timestamps before truncation.
>>> df.truncate('2016-01-05', '2016-01-10').tail()
                     A
2016-01-09 23:59:56  1
2016-01-09 23:59:57  1
2016-01-09 23:59:58  1
2016-01-09 23:59:59  1
2016-01-10 00:00:00  1
Note that truncate assumes a 0 value for any unspecified time component (midnight). This differs from partial string slicing, which returns any partially matching dates.
>>> df.loc['2016-01-05':'2016-01-10', :].tail()
                     A
2016-01-10 23:59:55  1
2016-01-10 23:59:56  1
2016-01-10 23:59:57  1
2016-01-10 23:59:58  1
2016-01-10 23:59:59  1
- tshift(periods=1, freq=None, axis=0)
Shift the time index, using the index’s frequency if available.
Deprecated since version 1.1.0: Use shift instead.
- Parameters
periods (int) – Number of periods to move, can be positive or negative.
freq (DateOffset, timedelta, or str, default None) – Increment to use from the tseries module or time rule expressed as a string (e.g. ‘EOM’).
axis ({0 or ‘index’, 1 or ‘columns’, None}, default 0) – Corresponds to the axis that contains the Index.
- Returns
shifted
- Return type
Series/DataFrame
Notes
See pandas API documentation for pandas.DataFrame.tshift for more. If freq is not specified then it tries to use the freq or inferred_freq attributes of the index. If neither of those attributes exists, a ValueError is raised.
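Since tshift is deprecated, the supported replacement is shift with the freq argument. A minimal sketch using plain pandas for illustration (the modin.pandas API is identical); the data here is made up:

```python
import pandas as pd

# Illustrative daily-indexed frame.
idx = pd.date_range("2021-01-01", periods=3, freq="D")
df = pd.DataFrame({"x": [1, 2, 3]}, index=idx)

# df.tshift(1) moved the *index* forward by one period. The
# non-deprecated equivalent passes freq to shift(), which shifts
# the index while leaving the data untouched.
shifted = df.shift(periods=1, freq="D")
print(shifted.index[0])  # 2021-01-02 00:00:00
```

Note that without freq, shift moves the data and introduces NaNs, which is different behavior.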
- tz_convert(tz, axis=0, level=None, copy=True)
Convert tz-aware axis to target time zone.
- Parameters
tz (str or tzinfo object) –
axis (the axis to convert) –
level (int, str, default None) – If axis is a MultiIndex, convert a specific level. Otherwise must be None.
copy (bool, default True) – Also make a copy of the underlying data.
- Returns
Object with time zone converted axis.
- Return type
DataFrame
- Raises
TypeError – If the axis is tz-naive.
Notes
See pandas API documentation for pandas.DataFrame.tz_convert for more.
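This docstring has no example, so here is a minimal sketch using plain pandas for illustration (the modin.pandas API is identical); the timestamps are made up:

```python
import pandas as pd

# A tz-aware index in UTC (illustrative data).
idx = pd.date_range("2021-06-01 12:00", periods=2, freq="h", tz="UTC")
s = pd.Series([1, 2], index=idx)

# tz_convert re-expresses the same instants in another time zone;
# the underlying moments in time are unchanged, only the zone moves.
eastern = s.tz_convert("US/Eastern")
print(eastern.index[0])  # 2021-06-01 08:00:00-04:00
```

Calling tz_convert on a tz-naive axis raises TypeError; use tz_localize first in that case.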
- tz_localize(tz, axis=0, level=None, copy=True, ambiguous='raise', nonexistent='raise')
Localize tz-naive index of a Series or DataFrame to target time zone.
This operation localizes the Index. To localize the values in a timezone-naive Series, use Series.dt.tz_localize().
- Parameters
tz (str or tzinfo) –
axis (the axis to localize) –
level (int, str, default None) – If axis is a MultiIndex, localize a specific level. Otherwise must be None.
copy (bool, default True) – Also make a copy of the underlying data.
ambiguous ('infer', bool-ndarray, 'NaT', default 'raise') –
When clocks moved backward due to DST, ambiguous times may arise. For example in Central European Time (UTC+01), when going from 03:00 DST to 02:00 non-DST, 02:30:00 local time occurs both at 00:30:00 UTC and at 01:30:00 UTC. In such a situation, the ambiguous parameter dictates how ambiguous times should be handled.
’infer’ will attempt to infer fall dst-transition hours based on order
bool-ndarray where True signifies a DST time, False designates a non-DST time (note that this flag is only applicable for ambiguous times)
’NaT’ will return NaT where there are ambiguous times
’raise’ will raise an AmbiguousTimeError if there are ambiguous times.
nonexistent (str, default 'raise') –
A nonexistent time does not exist in a particular timezone where clocks moved forward due to DST. Valid values are:
’shift_forward’ will shift the nonexistent time forward to the closest existing time
’shift_backward’ will shift the nonexistent time backward to the closest existing time
’NaT’ will return NaT where there are nonexistent times
timedelta objects will shift nonexistent times by the timedelta
’raise’ will raise an NonExistentTimeError if there are nonexistent times.
- Returns
Same type as the input.
- Return type
Series/DataFrame
- Raises
TypeError – If the TimeSeries is tz-aware and tz is not None.
Examples
Localize local times:
>>> s = pd.Series([1],
...               index=pd.DatetimeIndex(['2018-09-15 01:30:00']))
>>> s.tz_localize('CET')
2018-09-15 01:30:00+02:00    1
dtype: int64
Be careful with DST changes. When there is sequential data, pandas can infer the DST time:
>>> s = pd.Series(range(7),
...               index=pd.DatetimeIndex(['2018-10-28 01:30:00',
...                                       '2018-10-28 02:00:00',
...                                       '2018-10-28 02:30:00',
...                                       '2018-10-28 02:00:00',
...                                       '2018-10-28 02:30:00',
...                                       '2018-10-28 03:00:00',
...                                       '2018-10-28 03:30:00']))
>>> s.tz_localize('CET', ambiguous='infer')
2018-10-28 01:30:00+02:00    0
2018-10-28 02:00:00+02:00    1
2018-10-28 02:30:00+02:00    2
2018-10-28 02:00:00+01:00    3
2018-10-28 02:30:00+01:00    4
2018-10-28 03:00:00+01:00    5
2018-10-28 03:30:00+01:00    6
dtype: int64
In some cases, inferring the DST is impossible. In such cases, you can pass an ndarray to the ambiguous parameter to set the DST explicitly
>>> s = pd.Series(range(3),
...               index=pd.DatetimeIndex(['2018-10-28 01:20:00',
...                                       '2018-10-28 02:36:00',
...                                       '2018-10-28 03:46:00']))
>>> s.tz_localize('CET', ambiguous=np.array([True, True, False]))
2018-10-28 01:20:00+02:00    0
2018-10-28 02:36:00+02:00    1
2018-10-28 03:46:00+01:00    2
dtype: int64
If the DST transition causes nonexistent times, you can shift these dates forward or backward with a timedelta object or ‘shift_forward’ or ‘shift_backward’.
>>> s = pd.Series(range(2),
...               index=pd.DatetimeIndex(['2015-03-29 02:30:00',
...                                       '2015-03-29 03:30:00']))
>>> s.tz_localize('Europe/Warsaw', nonexistent='shift_forward')
2015-03-29 03:00:00+02:00    0
2015-03-29 03:30:00+02:00    1
dtype: int64
>>> s.tz_localize('Europe/Warsaw', nonexistent='shift_backward')
2015-03-29 01:59:59.999999999+01:00    0
2015-03-29 03:30:00+02:00              1
dtype: int64
>>> s.tz_localize('Europe/Warsaw', nonexistent=pd.Timedelta('1H'))
2015-03-29 03:30:00+02:00    0
2015-03-29 03:30:00+02:00    1
dtype: int64
Notes
See pandas API documentation for pandas.DataFrame.tz_localize for more.
- value_counts(subset: Optional[Sequence[Hashable]] = None, normalize: bool = False, sort: bool = True, ascending: bool = False, dropna: bool = True)
Return a Series containing counts of unique rows in the DataFrame.
New in version 1.1.0.
- Parameters
subset (list-like, optional) – Columns to use when counting unique combinations.
normalize (bool, default False) – Return proportions rather than frequencies.
sort (bool, default True) – Sort by frequencies.
ascending (bool, default False) – Sort in ascending order.
dropna (bool, default True) –
Don’t include counts of rows that contain NA values.
New in version 1.3.0.
- Returns
Counts of unique rows in the DataFrame.
- Return type
Series
See also
Series.value_counts
Equivalent method on Series.
Notes
See pandas API documentation for pandas.DataFrame.value_counts for more. The returned Series will have a MultiIndex with one level per input column. By default, rows that contain any NA values are omitted from the result. By default, the resulting Series will be in descending order so that the first element is the most frequently-occurring row.
Examples
>>> df = pd.DataFrame({'num_legs': [2, 4, 4, 6],
...                    'num_wings': [2, 0, 0, 0]},
...                   index=['falcon', 'dog', 'cat', 'ant'])
>>> df
        num_legs  num_wings
falcon         2          2
dog            4          0
cat            4          0
ant            6          0
>>> df.value_counts()
num_legs  num_wings
4         0            2
2         2            1
6         0            1
dtype: int64
>>> df.value_counts(sort=False)
num_legs  num_wings
2         2            1
4         0            2
6         0            1
dtype: int64
>>> df.value_counts(ascending=True)
num_legs  num_wings
2         2            1
6         0            1
4         0            2
dtype: int64
>>> df.value_counts(normalize=True)
num_legs  num_wings
4         0            0.50
2         2            0.25
6         0            0.25
dtype: float64
With dropna set to False we can also count rows with NA values.
>>> df = pd.DataFrame({'first_name': ['John', 'Anne', 'John', 'Beth'],
...                    'middle_name': ['Smith', pd.NA, pd.NA, 'Louise']})
>>> df
  first_name middle_name
0       John       Smith
1       Anne        <NA>
2       John        <NA>
3       Beth      Louise
>>> df.value_counts()
first_name  middle_name
Beth        Louise         1
John        Smith          1
dtype: int64
>>> df.value_counts(dropna=False)
first_name  middle_name
Anne        NaN            1
Beth        Louise         1
John        Smith          1
            NaN            1
dtype: int64
- property values
Return a Numpy representation of the DataFrame.
Warning
We recommend using DataFrame.to_numpy() instead.
Only the values in the DataFrame will be returned; the axes labels will be removed.
- Returns
The values of the DataFrame.
- Return type
numpy.ndarray
See also
DataFrame.to_numpy
Recommended alternative to this method.
DataFrame.index
Retrieve the index labels.
DataFrame.columns
Retrieve the column names.
Notes
See pandas API documentation for pandas.DataFrame.values for more. The dtype will be a lower-common-denominator dtype (implicit upcasting); that is to say if the dtypes (even of numeric types) are mixed, the one that accommodates all will be chosen. Use this with care if you are not dealing with the blocks.
e.g. If the dtypes are float16 and float32, dtype will be upcast to float32. If dtypes are int32 and uint8, dtype will be upcast to int32. By numpy.find_common_type() convention, mixing int64 and uint64 will result in a float64 dtype.
Examples
A DataFrame where all columns are the same type (e.g., int64) results in an array of the same type.
>>> df = pd.DataFrame({'age':    [ 3,  29],
...                    'height': [94, 170],
...                    'weight': [31, 115]})
>>> df
   age  height  weight
0    3      94      31
1   29     170     115
>>> df.dtypes
age       int64
height    int64
weight    int64
dtype: object
>>> df.values
array([[  3,  94,  31],
       [ 29, 170, 115]])
A DataFrame with mixed type columns (e.g., str/object, int64, float32) results in an ndarray of the broadest type that accommodates these mixed types (e.g., object).
>>> df2 = pd.DataFrame([('parrot', 24.0, 'second'),
...                     ('lion',   80.5, 1),
...                     ('monkey', np.nan, None)],
...                    columns=('name', 'max_speed', 'rank'))
>>> df2.dtypes
name          object
max_speed    float64
rank          object
dtype: object
>>> df2.values
array([['parrot', 24.0, 'second'],
       ['lion', 80.5, 1],
       ['monkey', nan, None]], dtype=object)
- var(axis=None, skipna=None, level=None, ddof=1, numeric_only=None, **kwargs)
Return unbiased variance over requested axis.
Normalized by N-1 by default. This can be changed using the ddof argument.
- Parameters
axis ({index (0), columns (1)}) –
skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.
level (int or level name, default None) – If the axis is a MultiIndex (hierarchical), count along a particular level, collapsing into a Series.
ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.
numeric_only (bool, default None) – Include only float, int, boolean columns. If None, will attempt to use everything, then use only numeric data. Not implemented for Series.
- Returns
- Return type
Series or DataFrame (if level specified)
Notes
See pandas API documentation for pandas.DataFrame.var for more. To have the same behaviour as numpy.std, use ddof=0 (instead of the default ddof=1)
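The ddof note above can be made concrete with a small worked example, using plain pandas for illustration (the modin.pandas API is identical); the data is made up:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4])  # mean = 2.5, sum of squared deviations = 5.0

# Default ddof=1 gives the unbiased (sample) variance: 5 / (4 - 1).
print(s.var())        # 1.666...
# ddof=0 matches numpy's default (population) variance: 5 / 4.
print(s.var(ddof=0))  # 1.25
```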
DataFrame Module Overview¶
Modin’s pandas.DataFrame API¶
Modin’s pandas.DataFrame API is backed by a distributed object providing an identical API to pandas. After the user calls some DataFrame function, this call is internally rewritten into a representation that can be processed in parallel by the partitions. These results can then be, e.g., reduced to a single output identical to that of the single-threaded pandas DataFrame method.
- class modin.pandas.dataframe.DataFrame(data=None, index=None, columns=None, dtype=None, copy=None, query_compiler=None)¶
Modin distributed representation of pandas.DataFrame.
Internally, the data can be divided into partitions along both columns and rows in order to parallelize computations and utilize the user’s hardware as much as possible.
Inherits functionality common to DataFrame and Series from the BasePandasDataset class.
- Parameters
data (DataFrame, Series, pandas.DataFrame, ndarray, Iterable or dict, optional) – Dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion order.
index (Index or array-like, optional) – Index to use for the resulting frame. Will default to RangeIndex if no indexing information is part of the input data and no index is provided.
columns (Index or array-like, optional) – Column labels to use for the resulting frame. Will default to RangeIndex if no column labels are provided.
dtype (str, np.dtype, or pandas.ExtensionDtype, optional) – Data type to force. Only a single dtype is allowed. If None, infer.
copy (bool, default: False) – Copy data from inputs. Only affects pandas.DataFrame / 2d ndarray input.
query_compiler (BaseQueryCompiler, optional) – A query compiler object to create the DataFrame from.
Notes
See pandas API documentation for pandas.DataFrame for more.
A DataFrame can be created either from passed data or from a query_compiler. If both parameters are provided, the data source is prioritized in the following order:
1. Modin DataFrame or Series passed with the data parameter.
2. Query compiler from the query_compiler parameter.
3. Various pandas/NumPy/Python data structures passed with the data parameter.
The last option is the least desirable since importing such data structures is very inefficient; please use previously created Modin structures from the first two options, or import data using the highly efficient Modin IO tools (for example pd.read_csv).
Usage Guide¶
The most efficient way to create a Modin DataFrame is to import data from external storage using the highly efficient Modin IO methods (for example using pd.read_csv; see details for Modin IO methods in the separate section), but even if the data does not originate from a file, any pandas-supported data type or pandas.DataFrame can be used. Internally, the DataFrame data is divided into partitions, the number of which along an axis usually corresponds to the number of the user’s hardware CPUs. If needed, the number of partitions can be changed by setting modin.config.NPartitions.
Let’s consider a simple example of creating and interacting with a Modin DataFrame:
import modin.config
# This explicitly sets the number of partitions
modin.config.NPartitions.put(4)
import modin.pandas as pd
import pandas
# Create Modin DataFrame from the external file
pd_dataframe = pd.read_csv("test_data.csv")
# Create Modin DataFrame from the python object
# data = {f'col{x}': [f'col{x}_{y}' for y in range(100, 356)] for x in range(4)}
# pd_dataframe = pd.DataFrame(data)
# Create Modin DataFrame from the pandas object
# pd_dataframe = pd.DataFrame(pandas.DataFrame(data))
# Show created DataFrame
print(pd_dataframe)
# List DataFrame partitions. Note that this internal API is intended
# for developers' needs and is used here for presentation purposes
# only.
partitions = pd_dataframe._query_compiler._modin_frame._partitions
print(partitions)
# Show the first DataFrame partition
print(partitions[0][0].get())
Output:
# created DataFrame
col0 col1 col2 col3
0 col0_100 col1_100 col2_100 col3_100
1 col0_101 col1_101 col2_101 col3_101
2 col0_102 col1_102 col2_102 col3_102
3 col0_103 col1_103 col2_103 col3_103
4 col0_104 col1_104 col2_104 col3_104
.. ... ... ... ...
251 col0_351 col1_351 col2_351 col3_351
252 col0_352 col1_352 col2_352 col3_352
253 col0_353 col1_353 col2_353 col3_353
254 col0_354 col1_354 col2_354 col3_354
255 col0_355 col1_355 col2_355 col3_355
[256 rows x 4 columns]
# List of DataFrame partitions
[[<modin.engines.ray.pandas_on_ray.frame.partition.PandasOnRayFramePartition object at 0x000002F4ABDFEB20>]
[<modin.engines.ray.pandas_on_ray.frame.partition.PandasOnRayFramePartition object at 0x000002F4ABDFEC10>]
[<modin.engines.ray.pandas_on_ray.frame.partition.PandasOnRayFramePartition object at 0x000002F4ABDFED00>]
[<modin.engines.ray.pandas_on_ray.frame.partition.PandasOnRayFramePartition object at 0x000002F4ABDFEDF0>]]
# The first DataFrame partition
col0 col1 col2 col3
0 col0_100 col1_100 col2_100 col3_100
1 col0_101 col1_101 col2_101 col3_101
2 col0_102 col1_102 col2_102 col3_102
3 col0_103 col1_103 col2_103 col3_103
4 col0_104 col1_104 col2_104 col3_104
.. ... ... ... ...
60 col0_160 col1_160 col2_160 col3_160
61 col0_161 col1_161 col2_161 col3_161
62 col0_162 col1_162 col2_162 col3_162
63 col0_163 col1_163 col2_163 col3_163
64 col0_164 col1_164 col2_164 col3_164
[65 rows x 4 columns]
As shown in the example above, a Modin DataFrame can be easily created, and it supports any input that a pandas DataFrame supports. Also note that tuning of the DataFrame partitioning can be done by setting a single config.
Series Module Overview¶
Modin’s pandas.Series API¶
Modin’s pandas.Series API is backed by a distributed object providing an identical API to pandas. After the user calls some Series function, this call is internally rewritten into a representation that can be processed in parallel by the partitions. These results can then be, e.g., reduced to a single output identical to that of the single-threaded pandas Series method.
- class modin.pandas.series.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False, query_compiler=None)¶
Modin distributed representation of pandas.Series.
Internally, the data can be divided into partitions in order to parallelize computations and utilize the user’s hardware as much as possible.
Inherit common for DataFrames and Series functionality from the BasePandasDataset class.
- Parameters
data (modin.pandas.Series, array-like, Iterable, dict, or scalar value, optional) – Contains data stored in Series. If data is a dict, argument order is maintained.
index (array-like or Index (1d), optional) – Values must be hashable and have the same length as data.
dtype (str, np.dtype, or pandas.ExtensionDtype, optional) – Data type for the output Series. If not specified, this will be inferred from data.
name (str, optional) – The name to give to the Series.
copy (bool, default: False) – Copy input data.
fastpath (bool, default: False) – pandas internal parameter.
query_compiler (BaseQueryCompiler, optional) – A query compiler object to create the Series from.
Notes
See pandas API documentation for pandas.Series for more.
Usage Guide¶
The most efficient way to create a Modin Series is to import data from external storage using the highly efficient Modin IO methods (for example using pd.read_csv; see details for Modin IO methods in the separate section), but even if the data does not originate from a file, any pandas-supported data type or pandas.Series can be used. Internally, the Series data is divided into partitions, the number of which along an axis usually corresponds to the number of the user’s hardware CPUs. If needed, the number of partitions can be changed by setting modin.config.NPartitions.
Let’s consider a simple example of creating and interacting with a Modin Series:
import modin.config
# This explicitly sets the number of partitions
modin.config.NPartitions.put(4)
import modin.pandas as pd
import pandas
# Create Modin Series from the external file
pd_series = pd.read_csv("test_data.csv", header=None).squeeze()
# Create Modin Series from the python object
# pd_series = pd.Series([x for x in range(256)])
# Create Modin Series from the pandas object
# pd_series = pd.Series(pandas.Series([x for x in range(256)]))
# Show created `Series`
print(pd_series)
# List `Series` partitions. Note that this internal API is intended
# for developers' needs and is used here for presentation purposes
# only.
partitions = pd_series._query_compiler._modin_frame._partitions
print(partitions)
# Show the first `Series` partition
print(partitions[0][0].get())
Output:
# created `Series`
0 100
1 101
2 102
3 103
4 104
...
251 351
252 352
253 353
254 354
255 355
Name: 0, Length: 256, dtype: int64
# List of `Series` partitions
[[<modin.engines.ray.pandas_on_ray.frame.partition.PandasOnRayFramePartition object at 0x000001E7CD11BD60>]
[<modin.engines.ray.pandas_on_ray.frame.partition.PandasOnRayFramePartition object at 0x000001E7CD11BE50>]
[<modin.engines.ray.pandas_on_ray.frame.partition.PandasOnRayFramePartition object at 0x000001E7CD11BF40>]
[<modin.engines.ray.pandas_on_ray.frame.partition.PandasOnRayFramePartition object at 0x000001E7CD13E070>]]
# The first `Series` partition
0
0 100
1 101
2 102
3 103
4 104
.. ...
60 160
61 161
62 162
63 163
64 164
[65 rows x 1 columns]
As shown in the example above, a Modin Series can be easily created, and it supports any input that a pandas Series supports. Also note that tuning of the Series partitioning can be done by setting a single config.
Query Compiler¶
The Query Compiler receives queries from the pandas API layer. The API layer’s responsibility is to ensure clean input to the Query Compiler. The Query Compiler must have knowledge of the compute kernels/in-memory format of the data in order to efficiently compile the queries.
The Query Compiler is responsible for sending the compiled query to the Modin DataFrame. In this design, the Query Compiler does not have information about where or when the query will be executed, and gives the control of the partition layout to the Modin DataFrame.
In the interest of reducing the pandas API, the Query Compiler layer closely follows the pandas API, but cuts out a large majority of the repetition.
Modin DataFrame¶
At this layer, operations can be performed lazily. Currently, Modin executes most operations eagerly in an attempt to behave as pandas does. Some operations, e.g. transpose, are expensive and create full copies of the data in memory. In these cases, we can wait until another operation triggers computation. In the future, we plan to add additional query planning and laziness to Modin to ensure that queries are performed efficiently.
The structure of the Modin DataFrame is extensible, such that any operation that could be better optimized for a given backend can be overridden and optimized in that way.
This layer has a significantly reduced API from the QueryCompiler and the user-facing API. Each of these APIs represents a single way of performing a given operation or behavior. Some of these are expanded for convenience/understanding. The API abstractions are as follows:
Modin DataFrame API¶
mask: Indexing/masking/selecting on the data (by label or by integer index).
copy: Create a copy of the data.
mapreduce: Reduce the dimension of the data.
foldreduce: Reduce the dimension of the data, but entire column/row information is needed.
map: Perform a map.
fold: Perform a fold.
apply_<type>: Apply a function that may or may not change the shape of the data.
full_axis: Apply a function that requires knowledge of the entire axis.
full_axis_select_indices: Apply a function performed on a subset of the data that requires knowledge of the entire axis.
select_indices: Apply a function to a subset of the data. This is mainly used for indexing.
binary_op: Perform a function between two dataframes.
concat: Append one or more dataframes to either axis of this dataframe.
transpose: Swap the axes (columns become rows, rows become columns).
groupby:
groupby_reduce: Perform a reduction on each group.
groupby_apply: Apply a function to each group.
- take functions
head
: Take the firstn
rows.tail
: Take the lastn
rows.front
: Take the firstn
columns.back
: Take the lastn
columns.
- import/export functions
from_pandas
: Convert a pandas dataframe to a Modin dataframe.to_pandas
: Convert a Modin dataframe to a pandas dataframe.to_numpy
: Convert a Modin dataframe to a numpy array.
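To make the map and mapreduce abstractions concrete, here is a minimal, hypothetical sketch over a 2-D grid of block partitions. The names and data layout are illustrative, not Modin's actual interfaces:

```python
# Hypothetical sketch of `map` and `mapreduce` over a 2-D grid of
# block partitions. Each block is a plain list of numbers here.

def map_partitions(blocks, func):
    # `map`: apply `func` independently to every block partition;
    # the grid shape is preserved.
    return [[func(b) for b in row] for row in blocks]

def mapreduce_cols(blocks, map_func, reduce_func):
    # `mapreduce`: map each block, then reduce along the row axis so
    # the result has one value per column partition (dimension drops).
    mapped = map_partitions(blocks, map_func)
    ncols = len(mapped[0])
    return [reduce_func([row[j] for row in mapped]) for j in range(ncols)]

# A 2x2 grid of block partitions.
blocks = [[[1, 2], [3]],
          [[4, 5], [6]]]

print(map_partitions(blocks, sum))       # [[3, 3], [9, 6]]
print(mapreduce_cols(blocks, sum, sum))  # [12, 9]
```

Because each block is processed independently in the map step, this pattern parallelizes naturally across partitions.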
More documentation can be found internally in the code. This API is not complete, but represents an overwhelming majority of operations and behaviors.
This API can be implemented by other distributed/parallel DataFrame libraries and plugged in to Modin as well. Create an issue or discuss on our Discourse for more information!
The Modin DataFrame is responsible for the data layout and shuffling, partitioning, and serializing the tasks that get sent to each partition. Other implementations of the Modin DataFrame interface will have to handle these as well.
Execution Engine/Framework¶
This layer is what Modin uses to perform computation on a partition of the data. The Modin DataFrame is designed to work with task parallel frameworks, but with some effort, a data parallel framework is possible.
Internal abstractions¶
These abstractions are not included in the above architecture, but are important to the internals of Modin.
Partition Manager¶
The Partition Manager can change the size and shape of the partitions based on the type of operation. For example, certain operations are complex and require access to an entire column or row. The Partition Manager can convert the block partitions to row partitions or column partitions. This gives Modin the flexibility to perform operations that are difficult in row-only or column-only partitioning schemas.
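As a minimal, hypothetical sketch of that conversion (the function name and data layout are invented for illustration, not Modin's internals), a grid of block partitions can be turned into row partitions by concatenating the blocks of each partition row side by side:

```python
# Hypothetical sketch of a Partition Manager converting a grid of
# block partitions into row partitions for an operation that needs
# access to entire rows.

def to_row_partitions(blocks):
    # `blocks` is a 2-D grid; each block is a list of rows.
    row_parts = []
    for block_row in blocks:
        nrows = len(block_row[0])
        merged = [[] for _ in range(nrows)]
        # Concatenate each block's rows horizontally.
        for block in block_row:
            for i, row in enumerate(block):
                merged[i].extend(row)
        row_parts.append(merged)
    return row_parts

# A frame split 2x2: each block holds one data row here.
blocks = [
    [[[1, 2]], [[3]]],
    [[[4, 5]], [[6]]],
]
print(to_row_partitions(blocks))  # [[[1, 2, 3]], [[4, 5, 6]]]
```

The reverse conversion (row partitions to column partitions) follows the same pattern along the other axis.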
Another important component of the Partition Manager is the serialization and shipment
of compiled queries to the Partitions. It maintains metadata for the length and width of
each partition, so when operations only need to operate on or extract a subset of the
data, it can ship those queries directly to the correct partition. This is particularly
important for some operations in pandas which can accept different arguments and
operations for different columns, e.g. fillna
with a dictionary.
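The metadata-based routing can be sketched as follows. This is a hypothetical illustration of the idea, not Modin's actual code: given the width of each column partition, a column index is mapped to the partition that owns it, so a per-column query can be shipped only there.

```python
# Hypothetical sketch: use partition-width metadata to route a
# per-column query (e.g. `fillna` with a dict) to the right partition.

import bisect

def column_to_partition(column_widths, col_idx):
    # Build cumulative column boundaries for each partition, then use
    # bisect to find which partition owns `col_idx`.
    bounds = []
    total = 0
    for w in column_widths:
        total += w
        bounds.append(total)
    return bisect.bisect_right(bounds, col_idx)

widths = [3, 4, 2]                     # three column partitions
print(column_to_partition(widths, 0))  # 0
print(column_to_partition(widths, 5))  # 1 (columns 3..6 live here)
print(column_to_partition(widths, 8))  # 2
```

With this lookup, only the partitions that actually contain the targeted columns receive work, instead of broadcasting the query everywhere.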
This abstraction separates the actual data movement and function application from the DataFrame layer to keep the DataFrame API small and separately optimize the data movement and metadata management.
Partition¶
Partitions are responsible for managing a subset of the DataFrame. As is mentioned above, the DataFrame is partitioned both row and column-wise. This gives Modin scalability in both directions and flexibility in data layout. There are a number of optimizations in Modin that are implemented in the partitions. Partitions are specific to the execution framework and in-memory format of the data. This allows Modin to exploit potential optimizations across both of these. These optimizations are explained further on the pages specific to the execution framework.
Supported Execution Frameworks and Memory Formats¶
This is the list of execution frameworks and memory formats supported in Modin. If you would like to contribute a new execution framework or memory format, please see the documentation page on contributing.
- Pandas on Ray
Uses the Ray execution framework.
The compute kernel/in-memory format is a pandas DataFrame.
- Pandas on Dask
Uses the Dask Futures execution framework.
The compute kernel/in-memory format is a pandas DataFrame.
- Omnisci
Uses OmniSciDB as an engine.
The compute kernel/in-memory format is a pyarrow Table or pandas DataFrame when defaulting to pandas.
- Pyarrow on Ray (experimental)
Uses the Ray execution framework.
The compute kernel/in-memory format is a pyarrow Table.
Module/Class View¶
Modin's module layout is shown below. To dive into Modin's internal implementation details, pick a module you are interested in (only some of the modules are covered by the documentation for now; the rest is coming soon…).
├───.github
├───asv_bench
├───ci
├───docker
├───docs
├───examples
├───modin
│   ├───backends
│   │   ├───base
│   │   │   └───query_compiler
│   │   ├───pandas
│   │   │   ├───parsers
│   │   │   └───query_compiler
│   │   └───pyarrow
│   │       ├───parsers
│   │       └───query_compiler
│   ├───config
│   ├───data_management
│   │   ├───factories
│   │   └───functions
│   ├───distributed
│   │   └───dataframe
│   │       └───pandas
│   ├───engines
│   │   ├───base
│   │   │   ├───frame
│   │   │   └───io
│   │   ├───dask
│   │   │   └───pandas_on_dask
│   │   │       └───frame
│   │   ├───python
│   │   │   └───pandas_on_python
│   │   │       └───frame
│   │   └───ray
│   │       ├───generic
│   │       ├───cudf_on_ray
│   │       │   ├───frame
│   │       │   └───io
│   │       └───pandas_on_ray
│   │           └───frame
│   ├───experimental
│   │   ├───backends
│   │   │   └───omnisci
│   │   │       └───query_compiler
│   │   ├───cloud
│   │   ├───engines
│   │   │   ├───omnisci_on_native
│   │   │   ├───pandas_on_ray
│   │   │   └───pyarrow_on_ray
│   │   ├───pandas
│   │   ├───sklearn
│   │   ├───sql
│   │   └───xgboost
│   ├───pandas
│   │   ├───dataframe
│   │   └───series
│   ├───spreadsheet
│   └───sql
├───requirements
├───scripts
└───stress_tests
Partition API in Modin¶
When you are working with a Modin Dataframe, you can unwrap its remote partitions
to get the raw futures objects compatible with the execution engine (e.g. ray.ObjectRef
for Ray).
In addition to unwrapping the remote partitions, Modin also provides an API to construct a modin.pandas.DataFrame
from raw futures objects.
Partition IPs¶
For finer-grained placement control, Modin also provides an API to get the IP addresses of the nodes that hold each partition. You can then pass the partitions with the needed IPs to your function, which helps minimize data movement between nodes.
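For instance, once you have (IP, partition) pairs, you can group partitions by the node that holds them before dispatching work. The sketch below is a hypothetical illustration with plain strings standing in for futures:

```python
# Hypothetical sketch: group (ip, partition) pairs so work can be
# dispatched to the node that already holds the data, avoiding
# cross-node transfers. Strings stand in for real futures here.

from collections import defaultdict

def group_by_ip(ip_partition_pairs):
    groups = defaultdict(list)
    for ip, part in ip_partition_pairs:
        groups[ip].append(part)
    return dict(groups)

pairs = [("10.0.0.1", "p0"), ("10.0.0.2", "p1"), ("10.0.0.1", "p2")]
print(group_by_ip(pairs))
# {'10.0.0.1': ['p0', 'p2'], '10.0.0.2': ['p1']}
```

In a real deployment, the pairs would come from unwrapping partitions with get_ip=True, as described below.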
Partition API implementations¶
By default, a Modin Dataframe stores underlying partitions as pandas.DataFrame
objects.
You can find the specific implementation of Modin’s Partition Interface in Pandas Partition API.
Pandas Partition API¶
This page contains a description of the API to extract partitions from and build Modin Dataframes.
unwrap_partitions¶
- modin.distributed.dataframe.pandas.unwrap_partitions(api_layer_object, axis=None, get_ip=False)¶
Unwrap partitions of the api_layer_object.
- Parameters
api_layer_object (DataFrame or Series) – The API layer object.
axis ({None, 0, 1}, default: None) – The axis to unwrap partitions for (0 - row partitions, 1 - column partitions). If axis is None, the partitions are unwrapped as they are currently stored.
get_ip (bool, default: False) – Whether to get the node IP address for each partition or not.
- Returns
A list of Ray.ObjectRef/Dask.Future to partitions of the api_layer_object if Ray/Dask is used as an engine.
- Return type
list
Notes
If get_ip=True, a list of tuples of Ray.ObjectRef/Dask.Future to node IP addresses and partitions of the api_layer_object, respectively, is returned if Ray/Dask is used as an engine (i.e. [(Ray.ObjectRef/Dask.Future, Ray.ObjectRef/Dask.Future), ...]).
from_partitions¶
- modin.distributed.dataframe.pandas.from_partitions(partitions, axis, index=None, columns=None, row_lengths=None, column_widths=None)¶
Create DataFrame from remote partitions.
- Parameters
partitions (list) – A list of Ray.ObjectRef/Dask.Future to partitions, depending on the engine used, or a list of tuples of Ray.ObjectRef/Dask.Future to node IP addresses and partitions (i.e. [(Ray.ObjectRef/Dask.Future, Ray.ObjectRef/Dask.Future), ...]).
axis ({None, 0 or 1}) – The axis parameter identifies what kind of partitions are passed. Set:
  - axis=0 to create a DataFrame from row partitions
  - axis=1 to create a DataFrame from column partitions
  - axis=None to create a DataFrame from a 2D list of partitions
index (sequence, optional) – The index for the DataFrame. Is computed if not provided.
columns (sequence, optional) – The columns for the DataFrame. Is computed if not provided.
row_lengths (list, optional) – The length of each partition in the rows. The “height” of each of the block partitions. Is computed if not provided.
column_widths (list, optional) – The width of each partition in the columns. The “width” of each of the block partitions. Is computed if not provided.
- Returns
DataFrame instance created from remote partitions.
- Return type
modin.pandas.DataFrame
Notes
Pass index, columns, row_lengths and column_widths to avoid triggering extra computations of the metadata when creating a DataFrame.
Example¶
import modin.pandas as pd
from modin.distributed.dataframe.pandas import unwrap_partitions, from_partitions
import numpy as np
data = np.random.randint(0, 100, size=(2 ** 10, 2 ** 8))
df = pd.DataFrame(data)
partitions = unwrap_partitions(df, axis=0, get_ip=True)
print(partitions)
new_df = from_partitions(partitions, axis=0)
print(new_df)
Ray engine¶
However, it is worth noting that for Modin on the Ray engine with the pandas backend,
the IPs of the remote partitions may not match their actual locations if the
partitions are smaller than 100 kB. By default, Ray saves such objects (<= 100 kB)
in the in-process store of the calling process (refer to the Ray documentation for
more information), and Modin cannot get IPs for such objects while maintaining good
performance. Keep this in mind when unwrapping remote partitions with their IPs.
Several options for handling this case are provided in the
How to handle Ray objects that are lower than 100 kB
section.
Dask engine¶
The issue mentioned above does not occur for Modin on the Dask
engine with the pandas
backend, because Dask
saves any objects
in the worker process that runs the function (refer to the Dask documentation for more information).
How to handle Ray objects that are lower than 100 kB¶
- If you are sure that each of the remote partitions being unwrapped is larger than 100 kB, you can simply import Modin or run ray.init() manually.
- If you don't know the partition sizes, you can pass the option _system_config={"max_direct_call_object_size": <nbytes>,} to ray.init(), where nbytes is the threshold for objects that will be stored in the in-process store.
- You can also start Ray as follows: ray start --head --system-config='{"max_direct_call_object_size":<nbytes>}'.
Note that specifying the threshold may change the performance of some Modin operations.
pandas on Ray¶
This section describes usage related documents for the pandas on Ray component of Modin.
Modin uses pandas as the primary in-memory format of the underlying partitions and optimizes queries ingested from the API layer specifically for this format. Thus, there is no need to choose it explicitly, but you can specify it anyway as shown below.
One of the execution engines that Modin uses is Ray. If you have Ray installed in your system, Modin also uses it by default to distribute computations.
If you want to be explicit, you could set the following environment variables:
export MODIN_ENGINE=ray
export MODIN_BACKEND=pandas
or turn it on in source code:
import modin.config as cfg
cfg.Engine.put('ray')
cfg.Backend.put('pandas')
Pandas on Dask¶
The Dask engine and documentation could use your help! Consider opening a pull request or an issue to contribute or ask clarifying questions.
OmniSci¶
This section describes usage related documents for the OmniSciDB-based engine of Modin.
This engine uses the analytical database OmniSciDB to obtain high single-node scalability for a specific set of dataframe operations. To enable this engine, you can set the following environment variables:
export MODIN_ENGINE=native
export MODIN_BACKEND=omnisci
export MODIN_EXPERIMENTAL=true
or turn it on in source code:
import modin.config as cfg
cfg.Engine.put('native')
cfg.Backend.put('omnisci')
cfg.IsExperimental.put(True)
Pyarrow on Ray¶
Coming Soon!
Troubleshooting¶
We hope your experience with Modin is bug-free, but there are some quirks about Modin that may require troubleshooting.
Frequently encountered issues¶
This is a list of the most frequently encountered issues when using Modin. Some of these are working as intended, while others are known bugs that are being actively worked on.
Error During execution: ArrowIOError: Broken Pipe
¶
One of the more frequently encountered issues is an ArrowIOError: Broken Pipe
. This
error can happen in a couple of different ways. One of the most common ways this is
encountered is from pressing CTRL + C sending a KeyboardInterrupt
to Modin. In
Ray, when a KeyboardInterrupt
is sent, Ray will shutdown. This causes the
ArrowIOError: Broken Pipe
because there is no longer an available plasma store for
working on remote tasks. This is working as intended, as it is not yet possible in Ray
to kill a task that has already started computation.
The other common way this Error
is encountered is by letting your computer go to sleep.
As an optimization, Ray will shut down whenever the computer goes to sleep. This
results in the same issue as above, because there is no longer a running instance of
the plasma store.
Solution
Restart your interpreter or notebook kernel.
Avoiding this Error
Avoid using KeyboardInterrupt
, and avoid leaving your notebook or terminal running while
your machine is asleep. If you do send a KeyboardInterrupt
, you must restart the kernel or
interpreter.
Error during execution: ArrowInvalid: Maximum size exceeded (2GB)
¶
Encountering this issue means that the limits of the Arrow plasma store have been exceeded by the partitions of your data. This can be encountered during shuffling data or operations that require multiple datasets. This will only affect extremely large DataFrames, and can potentially be worked around by setting the number of partitions. This error is being actively worked on and should be resolved in a future release.
Solution
import modin.pandas as pd
pd.DEFAULT_NPARTITIONS = 2 * pd.DEFAULT_NPARTITIONS
This will set the number of partitions to a higher count, and reduce the size in each. If this does not work for you, please open an issue.
Hanging on import modin.pandas as pd
¶
This can happen when Ray fails to start. It will keep retrying, but often it is faster to just restart the notebook or interpreter. Generally, this should not happen. Most commonly this is encountered when starting multiple notebooks or interpreters in quick succession.
Solution
Restart your interpreter or notebook kernel.
Avoiding this Error
Avoid starting many Modin notebooks or interpreters in quick succession. Wait 2-3 seconds before starting the next one.
Importing heterogeneous data by read_csv
¶
Since Modin read_csv
imports data in parallel, data read by different partitions can have different
types (this happens when a column contains heterogeneous data, i.e. column values
of different types), which are handled differently. An example of such behavior
is shown below.
import os
import pandas
import modin.pandas as pd
from modin.config import NPartitions
NPartitions.put(2)
test_filename = "test.csv"
# data with heterogeneous values in the first column
data = """one,2
3,4
5,6
7,8
9.0,10
"""
kwargs = {
# names of the columns to set; if the `names` parameter is set,
# header inferring from the first data row/rows will be disabled
"names": ["col1", "col2"],
# explicit setting of data type of column/columns with heterogeneous
# data will force partitions to read data with correct dtype
# "dtype": {"col1": str},
}
try:
with open(test_filename, "w") as f:
f.write(data)
pandas_df = pandas.read_csv(test_filename, **kwargs)
pd_df = pd.read_csv(test_filename, **kwargs)
print(pandas_df)
print(pd_df)
finally:
os.remove(test_filename)
Output:
pandas_df:
col1 col2
0 one 2
1 3 4
2 5 6
3 7 8
4 9.0 10
pd_df:
col1 col2
0 one 2
1 3 4
2 5 6
3 7.0 8
4 9.0 10
In this case, the DataFrame read by pandas contains only str
data in the column col1
because of the first string value (“one”), which forced pandas to handle the full
column as strings. In Modin, the first partition (the first three rows) read the
data similarly to pandas, but the second partition (the last two rows) contains no
strings in the first column, so its data is read as floats because of the last
value; as a result, the value 7 was read as 7.0, which differs from the pandas output.
The above example shows the mechanism by which discrepancies between pandas and Modin read_csv
outputs can occur during heterogeneous data import. Please note that similar
situations can occur with other data/parameter combinations.
Solution
If heterogeneous data is detected, a corresponding warning will be shown in
the user's console. Currently, Modin does not properly handle discrepancies
of this type; to avoid the issue, set the dtype
parameter of the read_csv
function manually to force the correct data types during data import by
partitions. Note that to avoid excessive performance degradation, the dtype
value should
be set as fine-grained as possible (specify the dtype
parameter only for columns with
heterogeneous data).
Setting the dtype
parameter works well for most cases, but, unfortunately, it is
ineffective if the data file contains a column which should be interpreted as the index
(the index_col
parameter is used), since the dtype
parameter is responsible only for data
fields. For example, if in the above example kwargs
is set as follows:
kwargs = {
"names": ["col1", "col2"],
"dtype": {"col1": str},
"index_col": "col1",
}
the resulting Modin DataFrame will contain incorrect values, just as in the case when dtype
is not set:
col1
one 2
3 4
5 6
7.0 8
9.0 10
In this case, the data should be imported without setting the index_col
parameter,
and only then should the index column be set as the index (using DataFrame.set_index
, for example), as shown in the example below:
pd_df = pd.read_csv(filename, dtype=data_dtype, index_col=None)
pd_df = pd_df.set_index(index_col_name)
pd_df.index.name = None
Contact¶
Mailing List¶
https://groups.google.com/forum/#!forum/modin-dev
General questions, potential contributors, and ideas should be directed to the developer mailing list. It is an open Google Group, so feel free to join anytime! If you are unsure about where to ask or post something, the mailing list is a good place to ask as well.
Issues¶
https://github.com/modin-project/modin/issues
Bug reports and feature requests should be directed to the issues page of the Modin GitHub repo.