Workflow#

Current scope#

The library supports:

  • Scanning Zarr and OME-TIFF stores into canonical ScanResult objects

  • Summarizing scan results into ergonomic DatasetSummary objects

  • Publishing canonical image_assets rows to Iceberg with PyIceberg

  • Publishing canonical chunk_index rows for chunked assets

  • Ingesting one or more existing datasets into Cytotable-compatible warehouses

  • Exporting Parquet Cytomining warehouse roots for pycytominer-style workflows

  • Validating profile-table schemas against the microscopy join contract

  • Joining scanned image metadata to profile tables through a high-level API

  • Optional DuckDB query helpers over canonical metadata tables

  • Optional OME-Arrow helpers for Arrow-native image payload workflows

  • Catalog-facing adapters for loading canonical metadata into query workflows

DuckDB integration remains optional and outside the core metadata model. The library keeps execution concerns separate from scanning, validation, and publishing so other query engines can be added later without reshaping the core package.

Chunk index publishing is intentionally metadata-only. Assets without chunking metadata simply produce zero chunk_index rows.

For install/setup flow and a path chooser, see Getting Started.

Zarr v2 and v3#

The package intentionally keeps one scan entry point for .zarr paths: scan_store(...).

  • Local Zarr v2 stores are scanned through the zarr Python package

  • Local Zarr v3 stores are scanned from zarr.json metadata

  • Summaries expose the detected storage variant so users can inspect mixed catalogs without learning separate APIs

  • The package dependency range permits both Zarr 2 and Zarr 3 runtimes.

  • Optional integrations such as OME-Arrow can coexist with the core scanner.

This keeps the package approachable while the wider ecosystem continues to settle around Zarr v3 runtime support.

Microscopy join contract#

Required columns:

  • dataset_id

  • image_id

Recommended columns:

  • plate_id

  • well_id

  • site_id

This validator is intentionally schema-focused. It checks whether a profile table exposes the columns needed for stable joins against canonical image metadata without taking on execution-engine responsibilities.

CLI examples#

iceberg-bioimage scan data/experiment.zarr
iceberg-bioimage summarize data/experiment.zarr
iceberg-bioimage register --catalog default --namespace bioimage.cytotable data/experiment.zarr
iceberg-bioimage ingest --catalog default --namespace bioimage.cytotable data/a.zarr data/b.zarr
iceberg-bioimage export-cytomining --warehouse-root warehouse-root data/experiment.zarr
iceberg-bioimage register --catalog default --namespace bioimage.cytotable --publish-chunks data/experiment.zarr
iceberg-bioimage publish-chunks --catalog default --namespace bioimage.cytotable data/experiment.zarr
iceberg-bioimage validate-contract data/cells.parquet
iceberg-bioimage join-profiles data/experiment.zarr data/cells.parquet --output joined.parquet

Example workflows#

  • examples/quickstart.py: minimal registration and validation flow

  • examples/catalog_duckdb.py: catalog-backed metadata joined to analysis rows

  • examples/synthetic_workflow.py: self-contained local synthetic workflow

  • docs/src/examples/metadata-workflow.ipynb: summary and join workflow in a notebook

  • docs/src/examples/basic-workflow.ipynb: simple TIFF/Zarr warehouse ingestion and namespace demo for Cytomining-style metadata