# Workflow
## Current scope

The library supports:

- Scanning Zarr and OME-TIFF stores into canonical `ScanResult` objects
- Summarizing scan results into ergonomic `DatasetSummary` objects
- Publishing canonical `image_assets` rows to Iceberg with PyIceberg
- Publishing canonical `chunk_index` rows for chunked assets
- Ingesting one or more existing datasets into Cytotable-compatible warehouses
- Exporting Parquet Cytomining warehouse roots for `pycytominer`-style workflows
- Validating profile-table schemas against the microscopy join contract
- Joining scanned image metadata to profile tables through a high-level API
- Optional DuckDB query helpers over canonical metadata tables
- Optional OME-Arrow helpers for Arrow-native image payload workflows
- Catalog-facing adapters for loading canonical metadata into query workflows
DuckDB integration remains optional and outside the core metadata model. The library keeps execution concerns separate from scanning, validation, and publishing so other query engines can be added later without reshaping the core package.
Chunk index publishing is intentionally metadata-only. Assets without chunking metadata simply produce zero `chunk_index` rows.
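That "zero rows" behavior can be sketched in plain Python. This is an illustrative model, not the library's implementation: the `chunk_grid` key and the row fields are hypothetical stand-ins for whatever chunk metadata a scan produces.

```python
def chunk_index_rows(asset):
    """Derive chunk_index rows from an asset's scan metadata (sketch).

    Assets without chunk metadata yield zero rows, mirroring the
    metadata-only publishing contract described above.
    """
    chunks = asset.get("chunk_grid")  # hypothetical key for chunk metadata
    if not chunks:
        return []  # no chunking metadata -> no chunk_index rows
    return [
        {"image_id": asset["image_id"], "chunk_ordinal": i, "chunk_shape": shape}
        for i, shape in enumerate(chunks)
    ]

chunked = {"image_id": "img-1", "chunk_grid": [(256, 256), (256, 256)]}
unchunked = {"image_id": "img-2"}

print(len(chunk_index_rows(chunked)))    # 2
print(len(chunk_index_rows(unchunked)))  # 0
```

The point of the sketch is the contract, not the row schema: publishing never touches pixel data, only metadata already captured during scanning.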
For install/setup flow and a path chooser, see Getting Started.
## Zarr v2 and v3

The package intentionally keeps a single scan entry point for `.zarr` paths: `scan_store(...)`.
- Local Zarr v2 stores are scanned through the `zarr` Python package
- Local Zarr v3 stores are scanned from `zarr.json` metadata
- Summaries expose the detected storage variant, so users can inspect mixed catalogs without learning separate APIs
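The variant detection the bullets rely on can be sketched from the on-disk metadata conventions: Zarr v3 stores carry a top-level `zarr.json` document, while v2 stores use `.zgroup`/`.zarray` sidecar files. This is a minimal illustration of that distinction, not the library's actual detection code.

```python
from pathlib import Path

def detect_zarr_variant(store_path):
    """Classify a local Zarr store by its metadata files (sketch).

    Zarr v3 stores a consolidated metadata document named zarr.json;
    Zarr v2 marks groups and arrays with .zgroup / .zarray files.
    """
    root = Path(store_path)
    if (root / "zarr.json").exists():
        return "zarr-v3"
    if (root / ".zgroup").exists() or (root / ".zarray").exists():
        return "zarr-v2"
    return "unknown"
```

A summary can then surface the returned variant string alongside the rest of the scan metadata, which is what lets mixed v2/v3 catalogs be inspected through one API.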
The package dependency range permits both Zarr 2 and Zarr 3 runtimes.
Optional integrations such as OME-Arrow can coexist with the core scanner.
This keeps the package approachable while the wider ecosystem continues to settle around Zarr v3 runtime support.
## Microscopy join contract

Required columns:

- `dataset_id`
- `image_id`

Recommended columns:

- `plate_id`
- `well_id`
- `site_id`
This validator is intentionally schema-focused. It checks whether a profile table exposes the columns needed for stable joins against canonical image metadata without taking on execution-engine responsibilities.
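A schema-focused check like this reduces to set arithmetic over column names. The sketch below is a hypothetical stand-in for the library's validator (the function name and report shape are invented for illustration); only the required and recommended column lists come from the contract above.

```python
REQUIRED = {"dataset_id", "image_id"}
RECOMMENDED = {"plate_id", "well_id", "site_id"}

def check_join_contract(columns):
    """Report which contract columns a profile table is missing (sketch)."""
    cols = set(columns)
    return {
        "valid": REQUIRED <= cols,  # only required columns gate validity
        "missing_required": sorted(REQUIRED - cols),
        "missing_recommended": sorted(RECOMMENDED - cols),
    }

report = check_join_contract(["dataset_id", "image_id", "plate_id", "area"])
print(report["valid"])                # True
print(report["missing_recommended"])  # ['site_id', 'well_id']
```

Keeping validity tied only to the required columns means a table can pass the contract while still prompting users to add the recommended plate/well/site columns for richer joins.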
## CLI examples

```shell
iceberg-bioimage scan data/experiment.zarr
iceberg-bioimage summarize data/experiment.zarr
iceberg-bioimage register --catalog default --namespace bioimage.cytotable data/experiment.zarr
iceberg-bioimage ingest --catalog default --namespace bioimage.cytotable data/a.zarr data/b.zarr
iceberg-bioimage export-cytomining --warehouse-root warehouse-root data/experiment.zarr
iceberg-bioimage register --catalog default --namespace bioimage.cytotable --publish-chunks data/experiment.zarr
iceberg-bioimage publish-chunks --catalog default --namespace bioimage.cytotable data/experiment.zarr
iceberg-bioimage validate-contract data/cells.parquet
iceberg-bioimage join-profiles data/experiment.zarr data/cells.parquet --output joined.parquet
```
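Conceptually, the `join-profiles` step is an inner join of profile rows onto scanned image metadata, keyed by the contract's required columns. The sketch below models that with plain dictionaries; the field names other than `dataset_id` and `image_id` are invented for illustration, and this is not the CLI's implementation.

```python
# Hypothetical scanned image metadata and profile rows, keyed by the
# contract's required columns (dataset_id, image_id).
image_assets = [
    {"dataset_id": "exp1", "image_id": "img-1", "store_format": "zarr-v3"},
    {"dataset_id": "exp1", "image_id": "img-2", "store_format": "zarr-v3"},
]
profiles = [
    {"dataset_id": "exp1", "image_id": "img-1", "cell_count": 120},
    {"dataset_id": "exp1", "image_id": "img-3", "cell_count": 87},
]

# Index assets by join key, then keep only profile rows with a match.
assets_by_key = {(a["dataset_id"], a["image_id"]): a for a in image_assets}
joined = [
    {**assets_by_key[key], **p}
    for p in profiles
    if (key := (p["dataset_id"], p["image_id"])) in assets_by_key
]
print(len(joined))  # 1 -- only img-1 appears in both tables
```

This is why the contract validator insists on `dataset_id` and `image_id`: without both columns on the profile side, there is no stable key to join on.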
## Example workflows

- `examples/quickstart.py`: minimal registration and validation flow
- `examples/catalog_duckdb.py`: catalog-backed metadata joined to analysis rows
- `examples/synthetic_workflow.py`: self-contained local synthetic workflow
- `docs/src/examples/metadata-workflow.ipynb`: summary and join workflow in a notebook
- `docs/src/examples/basic-workflow.ipynb`: simple TIFF/Zarr warehouse ingestion and namespace demo for Cytomining-style metadata