22 Data I/O

22.1 Read CSV

import polars as pl

iris = pl.read_csv("~/icloud/Data/iris.csv")
iris

shape: (150, 5)

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
f64	f64	f64	f64	str
5.1	3.5	1.4	0.2	"setosa"
4.9	3.0	1.4	0.2	"setosa"
4.7	3.2	1.3	0.2	"setosa"
4.6	3.1	1.5	0.2	"setosa"
5.0	3.6	1.4	0.2	"setosa"
…	…	…	…	…
6.7	3.0	5.2	2.3	"virginica"
6.3	2.5	5.0	1.9	"virginica"
6.5	3.0	5.2	2.0	"virginica"
6.2	3.4	5.4	2.3	"virginica"
5.9	3.0	5.1	1.8	"virginica"

22.2 Lazy Read CSV

iris = pl.scan_csv("~/icloud/Data/iris.csv")
type(iris)

polars.lazyframe.frame.LazyFrame

iris

NAIVE QUERY PLAN

run LazyFrame.show_graph() to see the optimized version

Collect the LazyFrame into a DataFrame. LazyFrame.fetch() is deprecated; use LazyFrame.collect() instead, optionally after a call to head() to limit the number of rows:

iris = iris.collect()
type(iris)

polars.dataframe.frame.DataFrame

iris

shape: (150, 5)

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
f64	f64	f64	f64	str
5.1	3.5	1.4	0.2	"setosa"
4.9	3.0	1.4	0.2	"setosa"
4.7	3.2	1.3	0.2	"setosa"
4.6	3.1	1.5	0.2	"setosa"
5.0	3.6	1.4	0.2	"setosa"
…	…	…	…	…
6.7	3.0	5.2	2.3	"virginica"
6.3	2.5	5.0	1.9	"virginica"
6.5	3.0	5.2	2.0	"virginica"
6.2	3.4	5.4	2.3	"virginica"
5.9	3.0	5.1	1.8	"virginica"

22.3 Column names

Get column names:

iris.columns

['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']

Set column names:

import re

# Use a raw string so the regex escape "\." is not treated as a string escape
iris.columns = [re.sub(r"\.", "_", col) for col in iris.columns]
iris.columns

['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species']

22.4 Get column names from LazyFrame

You can get the column names from a file without reading the entire file into memory. This is useful if you have a large file and only want to know the column names, e.g. to then set the schema.

fpath = "~/Data/iris.csv"
# Get column names
columns = pl.scan_csv(fpath).collect_schema().names()
columns

['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']

22.5 Apply function to column names at read time

Use pl.scan_csv() for this: its with_column_names argument accepts a function that receives the list of column names and returns the modified list. Here, we convert all symbols to underscores using clean_names() from rtemis.utils.strng.

from rtemis.utils.strng import clean_names
iris = pl.scan_csv(
    "~/icloud/Data/iris.csv",
    with_column_names = clean_names).collect()
iris

shape: (150, 5)

Sepal_Length	Sepal_Width	Petal_Length	Petal_Width	Species
f64	f64	f64	f64	str
5.1	3.5	1.4	0.2	"setosa"
4.9	3.0	1.4	0.2	"setosa"
4.7	3.2	1.3	0.2	"setosa"
4.6	3.1	1.5	0.2	"setosa"
5.0	3.6	1.4	0.2	"setosa"
…	…	…	…	…
6.7	3.0	5.2	2.3	"virginica"
6.3	2.5	5.0	1.9	"virginica"
6.5	3.0	5.2	2.0	"virginica"
6.2	3.4	5.4	2.3	"virginica"
5.9	3.0	5.1	1.8	"virginica"

22.6 Unique rows

iris = iris.unique()
iris.shape

(149, 5)

22.7 Types

22.7.1 Convert column to Categorical

iris = iris.with_columns(
    pl.col("Species").cast(pl.Categorical)
)
list(zip(iris.columns, iris.dtypes))

[('Sepal_Length', Float64),
 ('Sepal_Width', Float64),
 ('Petal_Length', Float64),
 ('Petal_Width', Float64),
 ('Species', Categorical(ordering='physical'))]

22.7.2 Specify data types at read time

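Define the schema of the DataFrame at read time using the schema argument. The schema must list every column in file order, so it is useful when you want to set the dtypes of all columns. The names given in the schema replace those in the file header, which is why the columns below use underscores.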

iris = pl.read_csv("~/icloud/Data/iris.csv",
    schema = {'Sepal_Length': pl.Float64,
              'Sepal_Width': pl.Float64,
              'Petal_Length': pl.Float64,
              'Petal_Width': pl.Float64,
              'Species': pl.Categorical}
    )
iris

shape: (150, 5)

Sepal_Length	Sepal_Width	Petal_Length	Petal_Width	Species
f64	f64	f64	f64	cat
5.1	3.5	1.4	0.2	"setosa"
4.9	3.0	1.4	0.2	"setosa"
4.7	3.2	1.3	0.2	"setosa"
4.6	3.1	1.5	0.2	"setosa"
5.0	3.6	1.4	0.2	"setosa"
…	…	…	…	…
6.7	3.0	5.2	2.3	"virginica"
6.3	2.5	5.0	1.9	"virginica"
6.5	3.0	5.2	2.0	"virginica"
6.2	3.4	5.4	2.3	"virginica"
5.9	3.0	5.1	1.8	"virginica"

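Override the dtypes of a subset of columns using the schema_overrides argument. The remaining dtypes are inferred and the column names are kept as they appear in the file header.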

iris = pl.read_csv("~/icloud/Data/iris.csv",
    schema_overrides = {"Species": pl.Categorical})
iris

shape: (150, 5)

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
f64	f64	f64	f64	cat
5.1	3.5	1.4	0.2	"setosa"
4.9	3.0	1.4	0.2	"setosa"
4.7	3.2	1.3	0.2	"setosa"
4.6	3.1	1.5	0.2	"setosa"
5.0	3.6	1.4	0.2	"setosa"
…	…	…	…	…
6.7	3.0	5.2	2.3	"virginica"
6.3	2.5	5.0	1.9	"virginica"
6.5	3.0	5.2	2.0	"virginica"
6.2	3.4	5.4	2.3	"virginica"
5.9	3.0	5.1	1.8	"virginica"

22.7.3 Get all columns of type

Select Float64 columns:

iris.select(pl.col(pl.Float64))

shape: (150, 4)

Sepal.Length	Sepal.Width	Petal.Length	Petal.Width
f64	f64	f64	f64
5.1	3.5	1.4	0.2
4.9	3.0	1.4	0.2
4.7	3.2	1.3	0.2
4.6	3.1	1.5	0.2
5.0	3.6	1.4	0.2
…	…	…	…
6.7	3.0	5.2	2.3
6.3	2.5	5.0	1.9
6.5	3.0	5.2	2.0
6.2	3.4	5.4	2.3
5.9	3.0	5.1	1.8

Get names of all Float64 columns:

iris.select(pl.col(pl.Float64)).columns

['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width']

Select Categorical columns:

iris.select(pl.col(pl.Categorical))

shape: (150, 1)

Species
cat
"setosa"
"setosa"
"setosa"
"setosa"
"setosa"
…
"virginica"
"virginica"
"virginica"
"virginica"
"virginica"

22.8 Write CSV

iris.write_csv("~/icloud/Data/iris_p.csv")

22.9 Write Parquet

Save a Polars DataFrame as a Parquet file with write_parquet():

iris.write_parquet("~/icloud/Data/iris.parquet")

22.10 Resources

Polars IO