20  Aggregate

import dask.dataframe as dd
import pandas as pd
import numpy as np
df = pd.DataFrame([('bird', 'Falconiformes', 389.0),
                   ('bird', 'Psittaciformes', 24.0),
                   ('mammal', 'Carnivora', 80.2),
                   ('mammal', 'Primates', np.nan),
                   ('mammal', 'Carnivora', 58)],
                  index=['falcon', 'parrot', 'lion', 'monkey', 'leopard'],
                  columns=('class', 'order', 'max_speed'))
df = dd.from_pandas(df, npartitions=1)
df
Dask DataFrame Structure:
                 class   order max_speed
npartitions=1
falcon          string  string   float64
parrot             ...     ...       ...
Dask Name: frompandas, 1 expression

20.0.1 groupby(): group by categorical

grouped = df.groupby('class')
grouped2 = df.groupby(['class', 'order'])
grouped.size().compute()
class
bird      2
mammal    3
dtype: int64
grouped2.mean().compute()
                       max_speed
class  order
bird   Falconiformes       389.0
       Psittaciformes       24.0
mammal Carnivora            69.1
       Primates              NaN

Or, chaining the grouping and the aggregation in a single step:

df.groupby(['class', 'order']).mean().compute()
                       max_speed
class  order
bird   Falconiformes       389.0
       Psittaciformes       24.0
mammal Carnivora            69.1
       Primates              NaN

20.1 Resources