# Simplyblock 25.10.5 Release Notes

Simplyblock is happy to release the general availability release of Simplyblock 25.10.5.

## New Features

- The storage node addition tasks (creating partitions and internal devices) are now performed in parallel.
- The cluster statistics are moved from FoundationDB to Prometheus to increase performance and reduce the memory footprint.
- The command `sbctl` now supports the `--version` (`-v`) flag to display the version.
- The command `sbctl storage-node configure` now supports the `--force` flag to format NVMe devices when they are already formatted.
- The command `sbctl storage-node add-node` now supports the `--cores-percentage` flag to configure the percentage of cores to be used by the storage node (required on Oracle Linux).
- The command `sbctl storage-node add-node` now supports the `--nvme-names` flag to specify the NVMe device names.
- The command `sbctl storage-node add-node` now supports the `--format-4k` flag to format the NVMe devices with 4K alignment.
- The command `sbctl storage-node add-node` now supports the `--calculate-hp-only` flag to calculate the minimum required huge pages.
- The commands `sbctl cluster create` and `sbctl cluster add` now support the `--client-data-nic` flag to specify the client data network interface.
- The command `sbctl storage-node` now has additional subcommands to list snapshots and volumes by storage node.
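As a hedged illustration, several of the new `sbctl storage-node add-node` flags listed above could be combined as follows. The percentage, device names, and value syntax are illustrative placeholders, not values taken from this release:

```shell
# Sketch only: combine the new add-node flags from this release.
# The percentage and device names are placeholders; the exact value
# syntax for --nvme-names may differ.
sbctl storage-node add-node \
  --cores-percentage 80 \
  --nvme-names nvme0n1 nvme1n1 \
  --format-4k

# Compute only the minimum required huge pages without adding the node.
sbctl storage-node add-node --calculate-hp-only
```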

## Fixes

- Control Plane: Fixed an issue where remote objects weren't correctly removed from the node on receiving distrib events.
- Control Plane: Improved communication between the control plane and spdk-proxy to avoid race conditions.
- Control Plane: Improved the reliability of logical volume deletions when multiple delete requests arrive simultaneously.
- Control Plane: Improved firewall rules management.
- Control Plane: Fixed an issue where pip packages were installed on cluster update.
- Control Plane: Fixed an issue with the calculation of `total_mem` for multiple storage nodes on the same NUMA node.
- Control Plane: Improved the reliability of device connections when storage nodes connect or disconnect.
- Control Plane: Improved the reliability of the cluster migration tasks.
- Control Plane: Improved the reliability of the cluster health checks.
- Control Plane: Fixed an issue where core isolation wouldn't correctly work on some systems.
- Control Plane: Improved the reliability of primary and secondary leader changes.
- Control Plane: Fixed an issue where additional delete operations would interfere with an asynchronous delete operation already in progress.
- Control Plane: Fixed an issue where failed devices couldn't be added back to the cluster.
- Control Plane: Improved the reliability of the storage node status update process.
- Control Plane: Improved the graceful startup and shutdown of the storage plane.
- Kubernetes: Fixed an issue where the wrong value was used to configure the hugepages.

## Additional Improvements

- Both clients (via `nvme connect`) and storage nodes can now be force-bound to particular NICs on the host.
- Management nodes no longer require a route to the storage network (data NICs).
- Optional CSI feature that automatically deletes and restarts all pods that lost both I/O paths after a cluster suspension, once the cluster becomes operational again.
- If a device disappears on node restart directly from the NEW state, it is simply removed from the database.
- Support for variable port ranges.
- Faster cluster activation: the creation of distribs on cluster activate is parallelized.
- Faster node restart: connections to and from other nodes are established in parallel on node restart.
- Support for the Kubernetes topology manager.
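The client-side force-binding mentioned above can be sketched with standard `nvme-cli`, which selects the local source interface via `--host-traddr`. All addresses and the NQN below are placeholders, not values from this release:

```shell
# Sketch only: pin the NVMe/TCP connection to a specific local NIC by
# passing that NIC's IP address as --host-traddr.
# The target address, port, host address, and NQN are placeholders.
nvme connect \
  --transport tcp \
  --traddr 10.10.0.5 \
  --trsvcid 4420 \
  --host-traddr 10.10.0.2 \
  --nqn nqn.2023-01.io.simplyblock:lvol-example
```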

## Fixes (Summary)

- Fixed node affinity, I/O interrupt, and device format failures on 2+2 schemas after multiple consecutive 2-node outages (multiple issues with journal and placement fixed).
- Fixed an issue where consecutive outages without data migration in between could lead to I/O interrupt (changes to the placement algorithm).
- Fixed an issue where snapshot deletion failed while the primary node was in outage.
- Fixed an issue where DPDK failed to initialize.
- Fixed an issue where logical volume deletion failed when the secondary node was in a down state.
- Fixed an issue where the force option of the `remove-device` command didn't work.
- Fixed an issue where the `records` parameter of `get-io-stats` returned the same 20 values.

## Upgrade Considerations

It is possible to upgrade from 25.10.4 and 25.10.4.2.

## Known Issues

Using different erasure coding schemas per cluster is available but can cause I/O interrupt issues in some tests.
This feature is experimental and not GA.

- Data migration still retries unnecessarily while a node is down in +2 schemas (2+2, 4+2). Data migration should pause once all migratable chunks have been moved and resume only for the remaining part once all nodes are online. Currently, it retries without success until all nodes are online. A hotfix will be delivered as soon as possible.
- At the moment, sustaining full fault tolerance requires more nodes than the theoretical minimum. This is due to a missing feature in the placement algorithm; a hotfix will be delivered as soon as possible.
- Using different erasure coding schemas within the same cluster is not yet supported (planned for the next major release).

## Features to Expect with the Next Major Release

- Ability to use different erasure coding schemas in the same cluster.
- Remote snapshot replication (send snapshots to a remote cluster).
- Kubernetes: asynchronous replication (replicate volumes via snapshots at regular intervals and support fail-over in Kubernetes).
- Kubernetes Operator: use CRDs to specify, create, and track a cluster, storage nodes, volumes, snapshots, and replications.
- Significant performance optimizations during node outage (journal writes).
- Cluster-internal multi-pathing for both RDMA and TCP.
- Snapshot backups to S3.
- n+2 schemas: three paths from the client (two secondaries per primary).