import polars as pl
import re
22 Data I/O
22.1 Read CSV
iris = pl.read_csv("~/icloud/Data/iris.csv")
iris
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
f64 | f64 | f64 | f64 | str |
5.1 | 3.5 | 1.4 | 0.2 | "setosa" |
4.9 | 3.0 | 1.4 | 0.2 | "setosa" |
4.7 | 3.2 | 1.3 | 0.2 | "setosa" |
4.6 | 3.1 | 1.5 | 0.2 | "setosa" |
5.0 | 3.6 | 1.4 | 0.2 | "setosa" |
… | … | … | … | … |
6.7 | 3.0 | 5.2 | 2.3 | "virginica" |
6.3 | 2.5 | 5.0 | 1.9 | "virginica" |
6.5 | 3.0 | 5.2 | 2.0 | "virginica" |
6.2 | 3.4 | 5.4 | 2.3 | "virginica" |
5.9 | 3.0 | 5.1 | 1.8 | "virginica" |
22.2 Lazy Read CSV
iris = pl.scan_csv("~/icloud/Data/iris.csv")
type(iris)
polars.lazyframe.frame.LazyFrame
iris
NAIVE QUERY PLAN
run LazyFrame.show_graph() to see the optimized version
Collect the lazily read data into a DataFrame:
iris = iris.collect()
type(iris)
polars.dataframe.frame.DataFrame
iris
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
f64 | f64 | f64 | f64 | str |
5.1 | 3.5 | 1.4 | 0.2 | "setosa" |
4.9 | 3.0 | 1.4 | 0.2 | "setosa" |
4.7 | 3.2 | 1.3 | 0.2 | "setosa" |
4.6 | 3.1 | 1.5 | 0.2 | "setosa" |
5.0 | 3.6 | 1.4 | 0.2 | "setosa" |
… | … | … | … | … |
6.7 | 3.0 | 5.2 | 2.3 | "virginica" |
6.3 | 2.5 | 5.0 | 1.9 | "virginica" |
6.5 | 3.0 | 5.2 | 2.0 | "virginica" |
6.2 | 3.4 | 5.4 | 2.3 | "virginica" |
5.9 | 3.0 | 5.1 | 1.8 | "virginica" |
22.3 Column names
Get column names:
iris.columns
['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']
Set column names:
iris.columns = [re.sub(r"\.", "_", col) for col in iris.columns]
iris.columns
['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species']
22.4 Get column names from LazyFrame
You can get the column names from a file without reading the entire file into memory. This is useful if you have a large file and only want to know the column names, e.g. to then set the schema.
fpath = "~/Data/iris.csv"
# Get column names
columns = pl.scan_csv(fpath).collect_schema().names()
columns
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
22.5 Apply function to column names at read time
This requires pl.scan_csv(), whose with_column_names argument takes a function that receives the list of column names and returns the new names. Here, we convert all symbols to underscores using rtemis.utils.strng.clean_names().
from rtemis.utils.strng import clean_names
iris = pl.scan_csv(
    "~/icloud/Data/iris.csv",
    with_column_names=clean_names).collect()
iris
Sepal_Length | Sepal_Width | Petal_Length | Petal_Width | Species |
---|---|---|---|---|
f64 | f64 | f64 | f64 | str |
5.1 | 3.5 | 1.4 | 0.2 | "setosa" |
4.9 | 3.0 | 1.4 | 0.2 | "setosa" |
4.7 | 3.2 | 1.3 | 0.2 | "setosa" |
4.6 | 3.1 | 1.5 | 0.2 | "setosa" |
5.0 | 3.6 | 1.4 | 0.2 | "setosa" |
… | … | … | … | … |
6.7 | 3.0 | 5.2 | 2.3 | "virginica" |
6.3 | 2.5 | 5.0 | 1.9 | "virginica" |
6.5 | 3.0 | 5.2 | 2.0 | "virginica" |
6.2 | 3.4 | 5.4 | 2.3 | "virginica" |
5.9 | 3.0 | 5.1 | 1.8 | "virginica" |
22.6 Unique rows
iris = iris.unique()
iris.shape
(149, 5)
22.7 Types
22.7.1 Convert column to Categorical
iris = iris.with_columns(
    pl.col("Species").cast(pl.Categorical)
)
list(zip(iris.columns, iris.dtypes))
[('Sepal_Length', Float64),
('Sepal_Width', Float64),
('Petal_Length', Float64),
('Petal_Width', Float64),
('Species', Categorical(ordering='physical'))]
22.7.2 Specify data types at read time
Define the schema of the DataFrame at read time using the schema argument. This is useful if you want to specify the dtypes of all columns.
iris = pl.read_csv("~/icloud/Data/iris.csv",
    schema = {'Sepal_Length': pl.Float64,
              'Sepal_Width': pl.Float64,
              'Petal_Length': pl.Float64,
              'Petal_Width': pl.Float64,
              'Species': pl.Categorical}
)
iris
Sepal_Length | Sepal_Width | Petal_Length | Petal_Width | Species |
---|---|---|---|---|
f64 | f64 | f64 | f64 | cat |
5.1 | 3.5 | 1.4 | 0.2 | "setosa" |
4.9 | 3.0 | 1.4 | 0.2 | "setosa" |
4.7 | 3.2 | 1.3 | 0.2 | "setosa" |
4.6 | 3.1 | 1.5 | 0.2 | "setosa" |
5.0 | 3.6 | 1.4 | 0.2 | "setosa" |
… | … | … | … | … |
6.7 | 3.0 | 5.2 | 2.3 | "virginica" |
6.3 | 2.5 | 5.0 | 1.9 | "virginica" |
6.5 | 3.0 | 5.2 | 2.0 | "virginica" |
6.2 | 3.4 | 5.4 | 2.3 | "virginica" |
5.9 | 3.0 | 5.1 | 1.8 | "virginica" |
Override a subset of columns using the schema_overrides argument.
iris = pl.read_csv("~/icloud/Data/iris.csv",
    schema_overrides = {"Species": pl.Categorical})
iris
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
f64 | f64 | f64 | f64 | cat |
5.1 | 3.5 | 1.4 | 0.2 | "setosa" |
4.9 | 3.0 | 1.4 | 0.2 | "setosa" |
4.7 | 3.2 | 1.3 | 0.2 | "setosa" |
4.6 | 3.1 | 1.5 | 0.2 | "setosa" |
5.0 | 3.6 | 1.4 | 0.2 | "setosa" |
… | … | … | … | … |
6.7 | 3.0 | 5.2 | 2.3 | "virginica" |
6.3 | 2.5 | 5.0 | 1.9 | "virginica" |
6.5 | 3.0 | 5.2 | 2.0 | "virginica" |
6.2 | 3.4 | 5.4 | 2.3 | "virginica" |
5.9 | 3.0 | 5.1 | 1.8 | "virginica" |
22.7.3 Get all columns of type
Select Float64 columns:
iris.select(pl.col(pl.Float64))
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
---|---|---|---|
f64 | f64 | f64 | f64 |
5.1 | 3.5 | 1.4 | 0.2 |
4.9 | 3.0 | 1.4 | 0.2 |
4.7 | 3.2 | 1.3 | 0.2 |
4.6 | 3.1 | 1.5 | 0.2 |
5.0 | 3.6 | 1.4 | 0.2 |
… | … | … | … |
6.7 | 3.0 | 5.2 | 2.3 |
6.3 | 2.5 | 5.0 | 1.9 |
6.5 | 3.0 | 5.2 | 2.0 |
6.2 | 3.4 | 5.4 | 2.3 |
5.9 | 3.0 | 5.1 | 1.8 |
Get names of all Float64 columns:
iris.select(pl.col(pl.Float64)).columns
['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width']
Select Categorical columns:
iris.select(pl.col(pl.Categorical))
Species |
---|
cat |
"setosa" |
"setosa" |
"setosa" |
"setosa" |
"setosa" |
… |
"virginica" |
"virginica" |
"virginica" |
"virginica" |
"virginica" |
22.8 Write CSV
iris.write_csv("~/icloud/Data/iris_p.csv")
22.9 Write Arrow parquet
You can easily save a polars DataFrame as a parquet file:
iris.write_parquet("~/icloud/Data/iris.parquet")