22  Data I/O

import polars as pl
import re

22.1 Read CSV

iris = pl.read_csv("~/icloud/Data/iris.csv")
iris
shape: (150, 5)
Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
f64           f64          f64           f64          str
5.1           3.5          1.4           0.2          "setosa"
4.9           3.0          1.4           0.2          "setosa"
4.7           3.2          1.3           0.2          "setosa"
4.6           3.1          1.5           0.2          "setosa"
5.0           3.6          1.4           0.2          "setosa"
…             …            …             …            …
6.7           3.0          5.2           2.3          "virginica"
6.3           2.5          5.0           1.9          "virginica"
6.5           3.0          5.2           2.0          "virginica"
6.2           3.4          5.4           2.3          "virginica"
5.9           3.0          5.1           1.8          "virginica"

22.2 Lazy Read CSV

iris = pl.scan_csv("~/icloud/Data/iris.csv")
type(iris)
polars.lazyframe.frame.LazyFrame
iris

NAIVE QUERY PLAN

run LazyFrame.show_graph() to see the optimized version

polars_query p1 Csv SCAN [/Users/egenn/icloud/Data/iris.csv] π */5;

Collect the LazyFrame into a DataFrame:

iris = iris.collect()
type(iris)
polars.dataframe.frame.DataFrame
iris
shape: (150, 5)
Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
f64           f64          f64           f64          str
5.1           3.5          1.4           0.2          "setosa"
4.9           3.0          1.4           0.2          "setosa"
4.7           3.2          1.3           0.2          "setosa"
4.6           3.1          1.5           0.2          "setosa"
5.0           3.6          1.4           0.2          "setosa"
…             …            …             …            …
6.7           3.0          5.2           2.3          "virginica"
6.3           2.5          5.0           1.9          "virginica"
6.5           3.0          5.2           2.0          "virginica"
6.2           3.4          5.4           2.3          "virginica"
5.9           3.0          5.1           1.8          "virginica"

22.3 Column names

Get column names:

iris.columns
['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']

Set column names:

iris.columns = [re.sub(r"\.", "_", col) for col in iris.columns]
iris.columns
['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species']

22.4 Get column names from LazyFrame

You can get the column names from a file without reading the entire file into memory. This is useful if you have a large file and only want to know the column names, e.g. to then set the schema.

fpath = "~/Data/iris.csv"
# Get column names
columns = pl.scan_csv(fpath).collect_schema().names()
columns
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

22.5 Apply function to column names at read time

Use pl.scan_csv() rather than pl.read_csv() here: it accepts a with_column_names argument, a function that receives the list of column names and returns the modified list. Here, we replace all symbols with underscores using clean_names() from rtemis.utils.strng.

from rtemis.utils.strng import clean_names
iris = pl.scan_csv(
    "~/icloud/Data/iris.csv",
    with_column_names = clean_names).collect()
iris
shape: (150, 5)
Sepal_Length  Sepal_Width  Petal_Length  Petal_Width  Species
f64           f64          f64           f64          str
5.1           3.5          1.4           0.2          "setosa"
4.9           3.0          1.4           0.2          "setosa"
4.7           3.2          1.3           0.2          "setosa"
4.6           3.1          1.5           0.2          "setosa"
5.0           3.6          1.4           0.2          "setosa"
…             …            …             …            …
6.7           3.0          5.2           2.3          "virginica"
6.3           2.5          5.0           1.9          "virginica"
6.5           3.0          5.2           2.0          "virginica"
6.2           3.4          5.4           2.3          "virginica"
5.9           3.0          5.1           1.8          "virginica"

22.6 Unique rows

iris = iris.unique()
iris.shape
(149, 5)

22.7 Types

22.7.1 Convert column to Categorical

iris = iris.with_columns(
    pl.col("Species").cast(pl.Categorical)
)
list(zip(iris.columns, iris.dtypes))
[('Sepal_Length', Float64),
 ('Sepal_Width', Float64),
 ('Petal_Length', Float64),
 ('Petal_Width', Float64),
 ('Species', Categorical(ordering='physical'))]

22.7.2 Specify data types at read time

Define the full schema of the DataFrame at read time using the schema argument, which expects an entry for every column. Note that the names given in schema replace those in the file header, as the output below shows.

iris = pl.read_csv("~/icloud/Data/iris.csv",
    schema = {'Sepal_Length': pl.Float64,
              'Sepal_Width': pl.Float64,
              'Petal_Length': pl.Float64,
              'Petal_Width': pl.Float64,
              'Species': pl.Categorical}
    )
iris
shape: (150, 5)
Sepal_Length  Sepal_Width  Petal_Length  Petal_Width  Species
f64           f64          f64           f64          cat
5.1           3.5          1.4           0.2          "setosa"
4.9           3.0          1.4           0.2          "setosa"
4.7           3.2          1.3           0.2          "setosa"
4.6           3.1          1.5           0.2          "setosa"
5.0           3.6          1.4           0.2          "setosa"
…             …            …             …            …
6.7           3.0          5.2           2.3          "virginica"
6.3           2.5          5.0           1.9          "virginica"
6.5           3.0          5.2           2.0          "virginica"
6.2           3.4          5.4           2.3          "virginica"
5.9           3.0          5.1           1.8          "virginica"

To set the dtype of only a subset of columns, use the schema_overrides argument; the dtypes of the remaining columns are inferred.

iris = pl.read_csv("~/icloud/Data/iris.csv",
    schema_overrides = {"Species": pl.Categorical})
iris
shape: (150, 5)
Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
f64           f64          f64           f64          cat
5.1           3.5          1.4           0.2          "setosa"
4.9           3.0          1.4           0.2          "setosa"
4.7           3.2          1.3           0.2          "setosa"
4.6           3.1          1.5           0.2          "setosa"
5.0           3.6          1.4           0.2          "setosa"
…             …            …             …            …
6.7           3.0          5.2           2.3          "virginica"
6.3           2.5          5.0           1.9          "virginica"
6.5           3.0          5.2           2.0          "virginica"
6.2           3.4          5.4           2.3          "virginica"
5.9           3.0          5.1           1.8          "virginica"

22.7.3 Get all columns of type

Select Float64 columns:

iris.select(pl.col(pl.Float64))
shape: (150, 4)
Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
f64           f64          f64           f64
5.1           3.5          1.4           0.2
4.9           3.0          1.4           0.2
4.7           3.2          1.3           0.2
4.6           3.1          1.5           0.2
5.0           3.6          1.4           0.2
…             …            …             …
6.7           3.0          5.2           2.3
6.3           2.5          5.0           1.9
6.5           3.0          5.2           2.0
6.2           3.4          5.4           2.3
5.9           3.0          5.1           1.8

Get names of all Float64 columns:

iris.select(pl.col(pl.Float64)).columns
['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width']

Select Categorical columns:

iris.select(pl.col(pl.Categorical))
shape: (150, 1)
Species
cat
"setosa"
"setosa"
"setosa"
"setosa"
"setosa"
…
"virginica"
"virginica"
"virginica"
"virginica"
"virginica"

22.8 Write CSV

iris.write_csv("~/icloud/Data/iris_p.csv")

22.9 Write Arrow parquet

You can save a Polars DataFrame as a Parquet file using write_parquet():

iris.write_parquet("~/icloud/Data/iris.parquet")

22.10 Resources