eToolBox Release Notes

0.4.1 (2025-XX-XX)

What’s New?
- Add colocation results example to eToolBox and R and simplify structure.
- `etb cloud init` walks you through setup if no arguments are provided. The Azure account name is set / stored rather than hard-coded.
- Update and clean up the readme.
- `read_patio_file()` is now `read_cloud_file()` and only takes a filename, which can represent any file in any of the account’s containers. It also supports reading all filetypes that `write_cloud_file()` does.
- `write_patio_econ_results()` is now `write_cloud_file()` and only takes a filename, which can represent any file in any of the account’s containers.
- Remove `remote_zip` as we never used it or actively maintained it. If that functionality is needed, use the original python-remotezip package.
- `etoolbox.utils.cloud` now allows multiple cloud accounts to have configs and caches, with a process for activating the desired one globally. The environment variable `ETB_AZURE_ACTIVE_ACCOUNT` can override this global setting.
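The account-activation precedence described above can be sketched as a simple rule: the environment variable wins over the globally stored setting. This is an illustrative sketch, not eToolBox’s implementation — the function name and config layout are hypothetical; only the `ETB_AZURE_ACTIVE_ACCOUNT` variable name comes from these notes.

```python
import os

def resolve_active_account(accounts: dict, global_default: str) -> str:
    """Pick the active cloud account.

    ``ETB_AZURE_ACTIVE_ACCOUNT``, if set, overrides the globally stored
    default. ``accounts`` stands in for eToolBox's actual per-account
    config storage, whose layout is not documented here.
    """
    override = os.environ.get("ETB_AZURE_ACTIVE_ACCOUNT")
    if override is not None:
        if override not in accounts:
            raise KeyError(f"no config for account {override!r}")
        return override
    return global_default
```

With no override set, `resolve_active_account({"prod": {}, "dev": {}}, "prod")` returns `"prod"`; exporting `ETB_AZURE_ACTIVE_ACCOUNT=dev` switches it to `"dev"`.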
Bug Fixes

- Fixed a bug in `read_patio_file()` where the fallback for a missing specified file extension was incorrect for csv and parquet files.
0.4.0 (2025-05-08)

What’s New?
- Use `pyarrow` directly in `pd_read_pudl()` to avoid having dates cast to objects.
- Compatibility with Python 3.13 tested and included in CI.
- Declare optional cloud dependencies on `pandas` and `polars` explicitly.
- Tools for working with data stored on Azure.
- `DataZip` now recognizes alternative methods for getting and setting object state so that an object can specify a serialization for `DataZip` that is different from that for `pickle`. These new methods are `_dzgetstate_` and `_dzsetstate_`.
- `storage_options()` to simplify reading from / writing to Azure using `pandas` or `polars`.
- `generator_ownership()` compiles ownership information for all generators using data from `pudl`.
- New CLI built off a single command, `rmi` or `etb`, with `cloud` and `pudl` subcommands for cleaning caches and configs, showing the contents of caches, and, in the cloud case, getting, putting, and listing files.
- `DataZip` will not append a `.zip` suffix to file paths passed to its init as strings.
- Added `simplify_strings()` to `pudl_helpers`.
- `SafeFormatter`, a subclass of `logging.Formatter` that can fill extra values with defaults when they are not provided in the logging call. See here for more info on the `extra` kwarg in logging calls.
- Option to disable `DataZip`’s use of ids to keep track of multiple references to the same object, via the `ids_for_dedup` kwarg.
- Instructions and additional helper functions to support using eToolBox from R, specifically `read_patio_resource_results()`, `read_patio_file()`, and `write_patio_econ_results()`; see eToolBox and R for details.
- Use azcopy under the hood in `get()` and `put()`, which is faster and more easily allows keeping directories in sync by only transferring the differences.
- `pl_scan_pudl()` now works with `use_polars=True`, which avoids using `fsspec` in favor of `polars`’ faster implementation that can avoid downloading whole parquets when using predicate pushdown. Unfortunately this means there is no local caching.
- `write_patio_econ_results()` now works with `str` and `bytes` for writing `.json`, `.csv`, `.txt`, etc.
- Added an `etb pudl list` command to the CLI for seeing pudl releases and the data in releases, as well as `etb pudl get` to download a table and save it as a csv.
- Improved CLI using `click` and new CLI documentation.
- Remove `get_pudl_sql_url()` and `PretendPudlTabl`.
- Migrate `tox` and GitHub Actions tooling to `uv`.
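The `SafeFormatter` idea described in this release — fill in `extra` fields with defaults when a logging call omits them — can be sketched with the standard library alone. This is an illustrative subclass, not eToolBox’s actual `SafeFormatter`; the `fallbacks` argument is an assumption for the example.

```python
import io
import logging

class DefaultingFormatter(logging.Formatter):
    """Sketch of a Formatter that fills missing ``extra`` fields with defaults.

    eToolBox's SafeFormatter may differ; ``fallbacks`` is illustrative.
    """

    def __init__(self, fmt=None, *, fallbacks=None, **kwargs):
        super().__init__(fmt, **kwargs)
        self.fallbacks = fallbacks or {}

    def format(self, record):
        # Fill any fields the logging call did not supply via ``extra``,
        # so %-style interpolation does not fail on a missing key.
        for key, value in self.fallbacks.items():
            if not hasattr(record, key):
                setattr(record, key, value)
        return super().format(record)

# Usage: ``user`` comes from ``extra`` when given, the fallback otherwise.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(
    DefaultingFormatter("%(user)s: %(message)s", fallbacks={"user": "anon"})
)
log = logging.getLogger("defaulting-demo")
log.addHandler(handler)
log.propagate = False
log.warning("no extra given")                     # -> "anon: no extra given"
log.warning("with extra", extra={"user": "kim"})  # -> "kim: with extra"
```

Without such a formatter, a format string that references `%(user)s` fails for every call that does not pass that key in `extra`, which is why filling defaults at format time is useful for structured logging.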
Bug Fixes

- Fixed a bug in the implementation of the alternative serialization methods that caused recursion or other errors when serializing an object whose class implemented `__getattr__`.
- Attempt to fix a doctest bug caused by pytest logging, see pytest#5908.
- Fixed a bug that meant only zips created with `DataZip.dump()` could be opened with `DataZip.load()`.
- Fixed a bug where certain `pandas.DataFrame` columns of dtype `object`, specifically columns with `bool` and `None` values, became lists rather than DataFrame columns when `read_patio_resource_results()` is called from R.
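The `__getattr__` recursion mentioned in the first fix is a general Python serialization pitfall, not specific to `DataZip`: deserializers look up methods such as `__setstate__` on an instance before its state exists, so a delegating `__getattr__` re-enters itself forever. A minimal stdlib reproduction (class names are illustrative):

```python
class BrokenProxy:
    """Delegates unknown attributes to ``target`` -- fine in normal use."""

    def __init__(self, target):
        self.target = target

    def __getattr__(self, name):
        # Before ``target`` is set (e.g. during deserialization),
        # ``self.target`` re-enters __getattr__ forever: RecursionError.
        return getattr(self.target, name)

class GuardedProxy:
    """Same delegation, but safe to deserialize."""

    def __init__(self, target):
        self.target = target

    def __getattr__(self, name):
        if name == "target":
            # Raise instead of recursing while state is not yet restored.
            raise AttributeError(name)
        return getattr(self.target, name)

def simulate_unpickle(cls):
    """Mimic what pickle.load does: make a bare instance, probe for
    __setstate__, then restore the instance state."""
    inst = cls.__new__(cls)
    getattr(inst, "__setstate__", None)  # recurses for BrokenProxy
    inst.__dict__["target"] = [1, 2]
    return inst
```

`simulate_unpickle(BrokenProxy)` raises `RecursionError`, while `GuardedProxy` survives the probe because the guard converts the self-referential lookup into a plain `AttributeError`, which `getattr` with a default swallows.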
0.3.0 (2024-10-07)

What’s New?
- New functions to read `pudl` tables from parquets in an open-access AWS bucket using `pd_read_pudl()`, `pl_read_pudl()`, and `pl_scan_pudl()`, which handle caching. The `polars` AWS client does not currently work, so `use_polars` must be set to `False`.
- New `pudl_list()` to show a list of releases or tables within a release.
- Restricting `platformdirs` to version >= 3.0, when the config location changed.
- Removed: `read_pudl_table()`, `get_pudl_tables_as_dz()`, `make_pudl_tabl()`, `lazy_import()`.
- Created `etoolbox.utils.logging_utils` with helpers to set up and format loggers in a more performant and structured way based on an mCoding suggestion. Also replaced module-level loggers with a library-wide logger and removed logger configuration from `etoolbox` because it is a library. This requires Python >= 3.12.
- Minor performance improvements to `DataZip.keys()` and `DataZip.__len__()`.
- Fixed links to docs for `polars`, `plotly`, `platformdirs`, `fsspec`, and `pudl`. At least in theory.
- Optimization in `DataZip.__getitem__()` for reading a single value from a nested structure without decoding all enclosing objects; we use `isinstance()` and `dict.get()` rather than try/except to handle non-dict objects and missing keys.
- New CLI utility `pudl-table-rename` that renames PUDL tables in a set of files to the new names used by PUDL.
- Allow older versions of `polars`; this is a convenience for some other projects that have not adapted to the >= 1.0 changes, but we do not test against older versions.
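The `DataZip.__getitem__()` optimization — `isinstance()` plus `dict.get()` instead of try/except — can be sketched in isolation. This is an illustrative helper, not DataZip’s actual code; `nested_get` is a hypothetical name.

```python
_MISSING = object()  # sentinel so stored ``None`` values are not mistaken for misses

def nested_get(data, *keys):
    """Look up ``keys`` recursively in nested dicts.

    Checking ``isinstance`` and using ``dict.get`` handles non-dict
    values and missing keys without raising and catching TypeError or
    KeyError at every level of the structure.
    """
    for key in keys:
        if not isinstance(data, dict):
            raise KeyError(key)
        data = data.get(key, _MISSING)
        if data is _MISSING:
            raise KeyError(key)
    return data
```

For example, `nested_get({"a": {"b": 1}}, "a", "b")` returns `1`, while a missing intermediate key raises `KeyError` without any exception handling on the happy path.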
Bug Fixes

- Fixed a bug where `etoolbox` could not be used if `tqdm` was not installed. As it is an optional dependency, `_optional` should be able to fully address that issue.
- Fixed a bug where the import of `typing.override()` in `etoolbox.utils.logging_utils` broke compatibility with Python 3.11, since the function was added in 3.12.
0.2.0 (2024-02-28)

Complete redesign of system internals and standardization of the data format. This resulted in a couple of key improvements:

- **Performance** Decoding is now lazy, so structures and objects are only rebuilt when they are retrieved, rather than when the file is opened. Encoding is only done once, rather than once to make sure it will work and then again when the data is written on close. Further, the correct encoder/decoder is selected using `dict` lookups rather than chains of `isinstance()`.
- **Data Format** Rather than a convoluted system to flatten the object hierarchy, we preserve the hierarchy in the `__attributes__.json` file. We also provide encoders and decoders that allow all Python builtins, as well as other types, to be stored in `json`. Any data that cannot be encoded to `json` is saved elsewhere, and the entry in `__attributes__.json` contains a pointer to where the data is actually stored. Further, rather than storing some metadata in `__attributes__.json` and some elsewhere, all metadata is now stored alongside the data or pointer in `__attributes__.json`.
- **Custom Classes** We no longer save custom objects as their own `DataZip`. Their location in the object hierarchy is preserved with a pointer and associated metadata. The object’s state is stored separately in a hidden key, `__state__`, in `__attributes__.json`.
- **References** The old format stored every object as many times as it was referenced. This meant that objects could be stored multiple times, and when the hierarchy was recreated, these objects would be copies. The new process for storing custom classes, `pandas.DataFrame`, `pandas.Series`, and `numpy.array` uses `id()` to make sure we only store data once and that these relationships are recreated when loading data from a `DataZip`.
- **API** `DataZip` behaves a little like a `dict`. It has `DataZip.get()`, `DataZip.items()`, and `DataZip.keys()`, which do what you would expect. It also implements dunder methods to allow membership checking using `in`, `len()`, and subscripts to get and set items (i.e. `obj[key] = value`); these all behave as you would expect, except that setting an item raises a `KeyError` if the key is already in use. One additional feature with lookups is that you can provide multiple keys, which are looked up recursively, allowing efficient access to data in nested structures. `DataZip.dump()` and `DataZip.load()` are static methods that allow you to directly save and load an object into a `DataZip`, similar to `pickle.dump()` and `pickle.load()` except that they handle opening and closing the file as well. Finally, `DataZip.replace()` is a little like `typing.NamedTuple._replace()`; it copies the contents of one `DataZip` into a new one, with select keys replaced.
- Added dtype metadata for `pandas` objects, as well as the ability to ignore that metadata to allow use of `pyarrow` dtypes.
- Switching to `ujson` rather than the standard library version for performance.
- Added optional support for `polars.DataFrame`, `polars.LazyFrame`, and `polars.Series` in `DataZip`.
- Added `PretendPudlTabl`; when passed as the `klass` argument to `DataZip.load()`, it allows accessing the dfs in a zipped `pudl.PudlTabl` as you would normally, but avoiding the `pudl` dependency.
- Code cleanup along with adoption of ruff and removal of bandit, flake8, isort, etc.
- Added `lazy_import()` to lazily import or proxy a module, inspired by `polars.dependencies.lazy_import`.
- Created tools for proxying `pudl.PudlTabl` to provide access to cached PUDL data without requiring that `pudl` is installed, or at least imported. The process of either loading a `PretendPudlTabl` from cache, or creating and then caching a `pudl.PudlTabl`, is handled by `make_pudl_tabl()`.
- Copied a number of helper functions that we often use from `pudl.helpers` to `pudl_helpers` so they can be used without installing or importing `pudl`.
- Added a very light adaptation of the python-remotezip package to access files within a zip archive without downloading the full archive.
- Updates to `DataZip` encoding and decoding of `pandas.DataFrame` so they work with `pandas` version 2.0.0.
- Updates to `make_pudl_tabl()` and associated functions and classes so that it works with new and changing aspects of `pudl.PudlTabl`, specifically those raised in catalyst#2503. Added testing for full `make_pudl_tabl()` functionality.
- Added `get_pudl_table()`, which reads a table from a `pudl.sqlite` that is stored where it is expected.
- Added support for `polars.DataFrame`, `polars.LazyFrame`, and `polars.Series` to `etoolbox.utils.testing.assert_equal()`.
- `plotly.Figure` objects are now stored as pickles so they can be recreated.
- Updates to `get_pudl_sql_url()` so that it doesn’t require PUDL environment variables or config files if the sqlite is at `pudl-work/output/pudl.sqlite`, and tells the user to put the sqlite there if it cannot be found another way.
- New `conform_pudl_dtypes()` function that casts PUDL columns to the dtypes used in `PudlTabl`, useful when loading tables from a sqlite that doesn’t preserve all dtype info.
- Added `ungzip()` to help with un-gzipping `pudl.sqlite.gz`; now using the gzipped version in tests.
- Switching two cases of `with suppress...` to `try - except - pass` in `DataZip` to take advantage of zero-cost exceptions.
- **Deprecations** These will be removed in the next release along with supporting infrastructure:
  - `lazy_import()` and the rest of the `lazy_import` module.
  - `PUDL_DTYPES`; use `conform_pudl_dtypes()` instead.
  - `make_pudl_tabl()`, `PretendPudlTabl`, `PretendPudlTablCore`; read tables directly from the sqlite:

    ```python
    import pandas as pd
    import sqlalchemy as sa

    from etoolbox.utils.pudl import get_pudl_sql_url, conform_pudl_dtypes

    pd.read_sql_table(table_name, sa.create_engine(get_pudl_sql_url())).pipe(
        conform_pudl_dtypes
    )
    ```

    or with `polars`:

    ```python
    import polars as pl

    from etoolbox.utils.pudl import get_pudl_sql_url

    pl.read_database("SELECT * FROM table_name", get_pudl_sql_url())
    ```
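The `id()`-based reference tracking described under *References* can be illustrated with a toy encoder: each shared object is stored once, later encounters become pointers, and decoding returns the same object for every pointer. This is illustrative only — lists stand in for the DataFrames and arrays DataZip deduplicates, and the `__ref__` layout is not DataZip’s actual format.

```python
def encode(obj, registry, store):
    """Encode ``obj``, storing each shared list only once.

    ``registry`` maps id() -> storage key; ``store`` holds the data.
    """
    if isinstance(obj, dict):
        return {k: encode(v, registry, store) for k, v in obj.items()}
    if isinstance(obj, list):
        key = registry.get(id(obj))
        if key is None:  # first encounter: store it and remember its id
            key = f"obj{len(store)}"
            registry[id(obj)] = key
            store[key] = obj
        return {"__ref__": key}
    return obj

def decode(node, store, cache):
    """Rebuild the hierarchy; every pointer to a key yields the same object."""
    if isinstance(node, dict):
        if "__ref__" in node:
            key = node["__ref__"]
            if key not in cache:
                cache[key] = list(store[key])
            return cache[key]
        return {k: decode(v, store, cache) for k, v in node.items()}
    return node
```

Encoding `{"a": shared, "b": shared}` writes `shared` to the store once, and after decoding, the two entries are the same object again rather than copies — which is exactly what the old one-copy-per-reference format failed to preserve.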
Bug Fixes

- Allow `typing.NamedTuple` to be used as keys in a `dict` and a `collections.defaultdict`.
- Fixed a bug in `make_pudl_tabl()` where creating and caching a new `pudl.PudlTabl` would fail to load the PUDL package.
- Fixed a bug where attempting to retrieve an empty `pandas.DataFrame` raised an `IndexError` when `ignore_pd_dtypes` is `False`.
- Updated the link for the PUDL database.
Known Issues

- Some legacy `DataZip` files cannot be fully read, especially those with nested structures and custom classes.
- `DataZip` ignores `functools.partial()` objects, at least in most dicts.
0.1.0 (2023-02-27)

What’s New?
- Migrating `DataZip` from rmi.dispatch, where it didn’t really belong. Also added additional functionality, including recursive writing and reading of `list`, `dict`, and `tuple` objects.
- Created `IOMixin` and `IOWrapper` to make it easier to add `DataZip` to other classes.
- Migrating `compare_dfs()` from the Hub.
- Updates to `DataZip`, `IOMixin`, and `IOWrapper` to better manage attributes missing from the original object or the file representation of the object, including the ability to use differently organized versions of `DataZip`.
- Clean up of `DataZip` internals, both within the object and in laying out files, particularly how metadata and attributes are stored. Added `DataZip.readm()` and `DataZip.writem()` to read and write additional metadata not core to `DataZip`.
- Added support for storing `numpy.array` objects in `DataZip` using `numpy.load()` and `numpy.save()`.
- `DataZip` now handles writing attributes and metadata using `DataZip.close()`, so `DataZip` can now be used with or without a context manager.
- Added `isclose()`, similar to `numpy.isclose()` but allowing comparison of arrays containing strings, especially useful with `pandas.Series`.
- Added a module, `etoolbox.utils.match`, containing the helpers Raymond Hettinger demonstrated in his talk at PyCon Italia for using Python’s `match`/`case` syntax.
- Added support for Python 3.11.
- Added support for storing `plotly` figures as `pdf` in `DataZip`.
- Added support for checking whether a file or attribute is stored in `DataZip` using `DataZip.__contains__()`, i.e. using Python’s `in`.
- Added support for subscript-based getting and setting of data in `DataZip`.
- Custom Python objects can be serialized with `DataZip` if they implement `__getstate__` and `__setstate__`, or can be serialized using the default logic described in `object.__getstate__()`. That default logic is now implemented in `DataZip.default_getstate()` and `DataZip.default_setstate()`. This replaces the use of `to_file` and `from_file` by `DataZip`. `IOMixin` has been updated accordingly.
- Added static methods `DataZip.dump()` and `DataZip.load()` for serializing a single Python object; these are designed to be similar to how `pickle.dump()` and `pickle.load()` work.
- Removing `IOWrapper`.
- Added a `DataZip.replace()` that copies the contents of an old `DataZip` into a new copy of it, after which you can add to it.
- Extended JSON encoding / decoding to process an expanded set of builtins, standard library, and other common objects, including `tuple`, `set`, `frozenset`, `complex`, `typing.NamedTuple`, `datetime.datetime`, `pathlib.Path`, and `pandas.Timestamp`.
- Adding centralized testing helpers.
- Added a subclass of `PudlTabl` that adds back `__getstate__` and `__setstate__` to enable caching; this caching will not work for tables that are not stored in the object, which will be an increasing portion of tables as discussed here.
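The default state logic mentioned above mirrors what `object.__getstate__()` does for ordinary instances: the state is the instance `__dict__`, and restoring means writing it back. A simplified sketch (slots handling omitted for brevity; this is not the exact `DataZip.default_getstate()` implementation):

```python
def default_getstate(obj):
    """Return a copy of the instance ``__dict__`` as the object's state."""
    return dict(getattr(obj, "__dict__", {}) or {})

def default_setstate(obj, state):
    """Restore state produced by ``default_getstate``."""
    obj.__dict__.update(state or {})

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

# Round-trip without calling __init__, as a deserializer would.
state = default_getstate(Point(1, 2))
restored = Point.__new__(Point)
default_setstate(restored, state)
```

Classes that need something other than their `__dict__` serialized are exactly the ones that define `__getstate__`/`__setstate__` themselves, which is why `DataZip` falls back to the default logic only when those hooks are absent.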
Bug Fixes

- Fixed an issue where a single-column `pandas.DataFrame` was recreated as a `pandas.Series`. Now this should be backwards compatible by applying `pandas.DataFrame.squeeze` if object metadata is not available.
- Fixed a bug that prevented certain kinds of objects from working properly under 3.11.
- Fixed an issue where the name for a `pandas.Series` might get mangled or changed.