import polars as pl
import re
22 Data I/O
22.1 Read CSV
iris = pl.read_csv("~/icloud/Data/iris.csv")
iris
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| f64 | f64 | f64 | f64 | str |
| 5.1 | 3.5 | 1.4 | 0.2 | "setosa" |
| 4.9 | 3.0 | 1.4 | 0.2 | "setosa" |
| 4.7 | 3.2 | 1.3 | 0.2 | "setosa" |
| 4.6 | 3.1 | 1.5 | 0.2 | "setosa" |
| 5.0 | 3.6 | 1.4 | 0.2 | "setosa" |
| … | … | … | … | … |
| 6.7 | 3.0 | 5.2 | 2.3 | "virginica" |
| 6.3 | 2.5 | 5.0 | 1.9 | "virginica" |
| 6.5 | 3.0 | 5.2 | 2.0 | "virginica" |
| 6.2 | 3.4 | 5.4 | 2.3 | "virginica" |
| 5.9 | 3.0 | 5.1 | 1.8 | "virginica" |
22.2 Lazy Read CSV
iris = pl.scan_csv("~/icloud/Data/iris.csv")
type(iris)
polars.lazyframe.frame.LazyFrame
iris
NAIVE QUERY PLAN
run LazyFrame.show_graph() to see the optimized version
Collect the LazyFrame into a DataFrame. LazyFrame.fetch() is deprecated; use LazyFrame.collect() instead, combined with head() if you only want the first rows:
iris = iris.collect()
type(iris)
polars.dataframe.frame.DataFrame
iris
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| f64 | f64 | f64 | f64 | str |
| 5.1 | 3.5 | 1.4 | 0.2 | "setosa" |
| 4.9 | 3.0 | 1.4 | 0.2 | "setosa" |
| 4.7 | 3.2 | 1.3 | 0.2 | "setosa" |
| 4.6 | 3.1 | 1.5 | 0.2 | "setosa" |
| 5.0 | 3.6 | 1.4 | 0.2 | "setosa" |
| … | … | … | … | … |
| 6.7 | 3.0 | 5.2 | 2.3 | "virginica" |
| 6.3 | 2.5 | 5.0 | 1.9 | "virginica" |
| 6.5 | 3.0 | 5.2 | 2.0 | "virginica" |
| 6.2 | 3.4 | 5.4 | 2.3 | "virginica" |
| 5.9 | 3.0 | 5.1 | 1.8 | "virginica" |
22.3 Column names
Get column names:
iris.columns
['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']
Set column names:
iris.columns = [re.sub(r"\.", "_", col) for col in iris.columns]
iris.columns
['Sepal_Length', 'Sepal_Width', 'Petal_Length', 'Petal_Width', 'Species']
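In a regular Python string literal, `\.` is not a valid escape sequence, and recent Python versions flag it with a SyntaxWarning. A raw string passes the backslash through to the regex engine unchanged. The sketch below uses a plain list of names (an assumption, so that it runs without a DataFrame) and also shows `str.replace` as a regex-free alternative for a fixed character.

```python
import re

names = ["Sepal.Length", "Sepal.Width", "Species"]

# A raw string keeps the backslash intact for the regex engine,
# so the pattern matches a literal dot without a SyntaxWarning.
with_regex = [re.sub(r"\.", "_", name) for name in names]

# For a fixed character, str.replace does the same job without a regex.
with_replace = [name.replace(".", "_") for name in names]
```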
22.4 Get column names from LazyFrame
You can get the column names from a file without reading the entire file into memory. This is useful if you have a large file and only want to know the column names, e.g. to then set the schema.
fpath = "~/Data/iris.csv"
# Get column names
columns = pl.scan_csv(fpath).collect_schema().names()
columns
['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species']
22.5 Apply function to column names at read time
Use pl.scan_csv() for this: its with_column_names argument takes a function that receives the list of column names and returns the new list. Here, we convert all symbols to underscores using rtemis.utils.strng.clean_names().
from rtemis.utils.strng import clean_names
iris = pl.scan_csv(
    "~/icloud/Data/iris.csv",
    with_column_names=clean_names,
).collect()
iris
| Sepal_Length | Sepal_Width | Petal_Length | Petal_Width | Species |
|---|---|---|---|---|
| f64 | f64 | f64 | f64 | str |
| 5.1 | 3.5 | 1.4 | 0.2 | "setosa" |
| 4.9 | 3.0 | 1.4 | 0.2 | "setosa" |
| 4.7 | 3.2 | 1.3 | 0.2 | "setosa" |
| 4.6 | 3.1 | 1.5 | 0.2 | "setosa" |
| 5.0 | 3.6 | 1.4 | 0.2 | "setosa" |
| … | … | … | … | … |
| 6.7 | 3.0 | 5.2 | 2.3 | "virginica" |
| 6.3 | 2.5 | 5.0 | 1.9 | "virginica" |
| 6.5 | 3.0 | 5.2 | 2.0 | "virginica" |
| 6.2 | 3.4 | 5.4 | 2.3 | "virginica" |
| 5.9 | 3.0 | 5.1 | 1.8 | "virginica" |
22.6 Unique rows
iris = iris.unique()
iris.shape
(149, 5)
22.7 Types
22.7.1 Convert column to Categorical
iris = iris.with_columns(
    pl.col("Species").cast(pl.Categorical)
)
list(zip(iris.columns, iris.dtypes))
[('Sepal_Length', Float64),
('Sepal_Width', Float64),
('Petal_Length', Float64),
('Petal_Width', Float64),
('Species', Categorical(ordering='physical'))]
22.7.2 Specify data types at read time
Define the schema of the DataFrame at read time using the schema argument. It expects the dtypes of all columns, and its keys become the column names, replacing those in the file header (note the underscores in the output below).
iris = pl.read_csv(
    "~/icloud/Data/iris.csv",
    schema={
        'Sepal_Length': pl.Float64,
        'Sepal_Width': pl.Float64,
        'Petal_Length': pl.Float64,
        'Petal_Width': pl.Float64,
        'Species': pl.Categorical,
    },
)
iris
| Sepal_Length | Sepal_Width | Petal_Length | Petal_Width | Species |
|---|---|---|---|---|
| f64 | f64 | f64 | f64 | cat |
| 5.1 | 3.5 | 1.4 | 0.2 | "setosa" |
| 4.9 | 3.0 | 1.4 | 0.2 | "setosa" |
| 4.7 | 3.2 | 1.3 | 0.2 | "setosa" |
| 4.6 | 3.1 | 1.5 | 0.2 | "setosa" |
| 5.0 | 3.6 | 1.4 | 0.2 | "setosa" |
| … | … | … | … | … |
| 6.7 | 3.0 | 5.2 | 2.3 | "virginica" |
| 6.3 | 2.5 | 5.0 | 1.9 | "virginica" |
| 6.5 | 3.0 | 5.2 | 2.0 | "virginica" |
| 6.2 | 3.4 | 5.4 | 2.3 | "virginica" |
| 5.9 | 3.0 | 5.1 | 1.8 | "virginica" |
Override a subset of columns using the schema_overrides argument.
iris = pl.read_csv(
    "~/icloud/Data/iris.csv",
    schema_overrides={"Species": pl.Categorical},
)
iris
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| f64 | f64 | f64 | f64 | cat |
| 5.1 | 3.5 | 1.4 | 0.2 | "setosa" |
| 4.9 | 3.0 | 1.4 | 0.2 | "setosa" |
| 4.7 | 3.2 | 1.3 | 0.2 | "setosa" |
| 4.6 | 3.1 | 1.5 | 0.2 | "setosa" |
| 5.0 | 3.6 | 1.4 | 0.2 | "setosa" |
| … | … | … | … | … |
| 6.7 | 3.0 | 5.2 | 2.3 | "virginica" |
| 6.3 | 2.5 | 5.0 | 1.9 | "virginica" |
| 6.5 | 3.0 | 5.2 | 2.0 | "virginica" |
| 6.2 | 3.4 | 5.4 | 2.3 | "virginica" |
| 5.9 | 3.0 | 5.1 | 1.8 | "virginica" |
22.7.3 Get all columns of type
Select Float64 columns:
iris.select(pl.col(pl.Float64))
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|---|---|---|---|
| f64 | f64 | f64 | f64 |
| 5.1 | 3.5 | 1.4 | 0.2 |
| 4.9 | 3.0 | 1.4 | 0.2 |
| 4.7 | 3.2 | 1.3 | 0.2 |
| 4.6 | 3.1 | 1.5 | 0.2 |
| 5.0 | 3.6 | 1.4 | 0.2 |
| … | … | … | … |
| 6.7 | 3.0 | 5.2 | 2.3 |
| 6.3 | 2.5 | 5.0 | 1.9 |
| 6.5 | 3.0 | 5.2 | 2.0 |
| 6.2 | 3.4 | 5.4 | 2.3 |
| 5.9 | 3.0 | 5.1 | 1.8 |
Get names of all Float64 columns:
iris.select(pl.col(pl.Float64)).columns
['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width']
Select Categorical columns:
iris.select(pl.col(pl.Categorical))
| Species |
|---|
| cat |
| "setosa" |
| "setosa" |
| "setosa" |
| "setosa" |
| "setosa" |
| … |
| "virginica" |
| "virginica" |
| "virginica" |
| "virginica" |
| "virginica" |
22.8 Write CSV
iris.write_csv("~/icloud/Data/iris_p.csv")
22.9 Write Arrow parquet
Save a polars DataFrame as a Parquet file:
iris.write_parquet("~/icloud/Data/iris.parquet")