declearn v2.4.0

Released: 18/03/2024

Important notice:
DecLearn 2.4 derogates to SemVer by revising some of the major DecLearn component APIs.

This is mitigated in two ways:

No changes relate to the main process setup code, meaning that end-users that do not use custom components (aggregator, optimodule, metric, etc.) should not see any difference, and their code will work as before (as an illustration, our examples' code remains unchanged).
Key methods that were deprecated in favor of new API ones are kept for two more minor versions, and are still tested to work as before.

Any end-user encountering issues due to the released or planned evolution of DecLearn is invited to contact us via GitLab, GitHub or e-mail so that we can provide with assistance and/or update our roadmap so that changes do not hinder the usability of DecLearn 2.x for research and applications.

New version policy (and future roadmap)

As noted above, v2.4 does not fully abide by SemVer rules. In the future, more partially-breaking changes and API revisions may be introduced, incrementally paving the way towards the next major release, while trying as much as possible not to break end-user code.

To avoid unforeseen incompatibilities and cryptic bugs from arsing, from this version onward, the server and clients are expected and verified to use the same major.minor version of DecLearn. This policy may be updated in the future, e.g. to specify that clients may have a newer minor version than the server (and most probably not the other way around).

To avoid unhappy surprises, we are starting to maintain a public roadmap on our GitLab. Although it may change, it should provide interested users (notably those that are interested in developing custom components or processes on top of DecLearn) with a way to anticipate changes, and voice any concerns or advice they might have.

Revise all aggregation APIs

Revise the overall design for aggregation and introduce `Aggregate` API

This release introduces the Aggregate API, which is based on an abstract base dataclass acting as a template for data structures that require sharing across peers and aggregation.

The declearn.utils.Aggregate ABC acts as a shared ancestor providing with a base API and shared backend code to define data structures that:

are serializable to and deserializable from JSON, and may therefore be preserved across network communications
are aggregatable into an instance of the same structure
use summation as the default aggregation rule for fields, which is overridable by redefining the default_aggregate method
can implement custom aggregate_<field.name> methods to override the default summation rule
implement a prepare_for_secagg method that
- enables defining which fields merely require sum-aggregation and need encryption when using SecAgg, and which fields are to be preserved in cleartext (and therefore go through the usual default or custom aggregation methods)
- can be made to raise a NotImplementedError when SecAgg cannot be achieved on a data structure

This new ABC currently has three main children:

AuxVar: replaces plain dict for Optimizer auxiliary variables
MetricState: replaces plain dict for Metric intermediate states
ModelUpdates: replaces sharing of updates as Vector and n_steps

Each of this is defined jointly with another (pre-existing, revised) API for components that (a) produce Aggregate data structures based on some input data and/or computations; (b) produce some output results based on a received Aggregate structure, meant to result from the aggregation of multiple peers' produced data.

Revise `Aggregator` API, introducing `ModelUpdates`

The Aggregator API was revised to make use of the new ModelUpdates data structure (inheriting Aggregate).

Aggregator.prepare_for_sharing pre-processes an input Vector containing raw model updates and an integer indicating the number of local SGD steps into a ModelUpdates structure.
Aggregator.finalize_updates receives a ModelUpdates resulting from the aggregation of peers' instances, and performs final computations to produce a Vector of aggregated model updates.
The legacy Aggregator.aggregate method is deprecated (but still works).

Revise auxiliary variables for `Optimizer`, introducing `AuxVar`

The OptiModule API (and, consequently, Optimizer) was revised as to the design and signature of auxiliary variables related methods, to make use of the new AuxVar data structure (inheriting Aggregate).

OptiModule.collect_aux_var now emits either None or an AuxVar instance (the precise type of which is module-dependent), instead of a mere dict.
OptiModule.process_aux_var now expects a proper-type AuxVar instance that already aggregates clients' data, externalizing the aggregation rules to the AuxVar subtypes, while keeping the finalization logic part of the OptiModule subclasses.
Optimizer.collect_aux_var therefore emits a {name: aux_var} dict.
Optimizer.process_aux_var therefore expects a {name: aux_var} dict, rather than having distinct signatures on the client and server sides.
It is now expected that server-side components will send the same data to all clients, rather than allow sending client-wise values.

The backend code of ScaffoldClientModule and ScaffoldServerModule was heavily revised to alter the distribution of information and computations:

Client-side modules are now the sole owners of their local state, and send sum-aggregatable updates to the server, that are therefore SecAgg-compatible.
The server consequently shares the same information with all clients, namely the current global state.
To keep track of the (possibly growing with time) number of unique client, clients generate a random uuid that is sent with their state updates and preserved in cleartext when SecAgg is used.
As a consequence, the server component knows which clients contributed to a given round, but receives an aggregate of local updates rather than the client-wise state values.

Revise `Metric` API, introducing `ModelState`

The Metric API was revised to make use of the new MetricState data structure (inheriting Aggregate).

Metric.build_initial_states generates a "zero-state" MetricState instance (it replaces the previously-private _build_states method that returned a dict).
Metric.get_states returns a (Metric-type-dependent) MetricState instance, instead of a mere dict.
Metric.set_states assigns an incoming MetricState into the instance, that may be finalized into results using the unchanged get_result method.
The legacy Metric.agg_states is deprecated, in favor of set_states (but it still works).

Revise backend communications and messaging APIs

This release introduces some important backend changes to the communication and messaging APIs of DecLearn, resulting in more robust code (that is also easier to test and maintain), more efficient message parsing (possibly-costly de-serialization is now delayed to a time posterior to validity verification) and the extensibility of application messages, enabling to easily define and use custom message structures in downstream applications.

The most important API change is that network communication endpoints now return SerializedMessage instances rather than Message ones.

New `declearn.communication.api.backend` submodule

Introduce a new ActionMessage minimal API under its actions submodule, that defines hard-coded, lightweight and easy-to-parse data structures designed to convey information and content across network communications agnostic to the content's nature.
Revise and expose the MessagesHandler util, that now builds on the ActionMessage API to model remote calls and answer them.
Move the declearn.communication.messaging.flags submodule to declearn.communication.api.backend.flags.

New `declearn.messaging` submodule

Revise the Message API to make it extendable, with automated type-registration of subclasses by default.
Introduce SerializedMessage as a wrapper for received messages, that parses the exact Lessage subtype (enabling logic tests and message filtering) but delays actual content de-serialization and Message object recovery (enabling to prevent undue resources use for unwanted messages that end up being discarded).
Move most existing Message subclasses to the new submodule, for retro-compatibility purposes. In DecLearn 3.0 these will probably be re-dispatched to make it clear that concrete messages only make sense in the context of specific multi-agent processes.
Drop backend-oriented Message subclasses that are replaced with the new ActionMessage backbone structures.
Deprecate the declearn.communication.messaging submodule, that is temporarily maintained, re-exporting moved contents as well as deprecated message types (which are bound to be rejected if sent).

Revise `NetworkClient` and `NetworkServer`

Have message-receiving methods return SerializedMessage instances rather than finalized de-serialized Message ones.
Quit sending and expecting 'data_info' with registration requests.
Rename NetworkClient.check_message into recv_message (keep the former as an alias, with a DeprecationWarning).
Improve the use of (optional) timeouts when sending or expecting messages and overall exceptions handling:
NetworkClient.recv_message may either raise a TimeoutError (in case of timeout) or RuntimeError (in case of rejection).
NeworkServer.send_messages and broadcast_message quietly stop waiting for clients to collect messages after the (opt.) timeout delay has passed. Messages may still be collected.
NetworkServer.wait_for_messages no longer accepts a timeout.
NetworkServer.wait_for_messages_with_timeout implements the possibility to setup a timeout. It returns both received client replies and a list of clients that failed to answer.
All timeouts can now be specified as float values (which is mostly useful for testing purposes or simulated environments).
Add a heartbeat instantiation parameter, with a default value of 1 second, that is passed to the underlying MessagesHandler. In simulated contexts (including tests), setting a low heartbeat can cut runtime down significantly.

New `declearn.communication.utils` submodule

Introduce the declearn.communication.utils submodule, and move existing declearn.communication utils to it. Keep re-exporting them from the parent module to preserve code compatibility.

Add verify_client_messages_validity and verify_server_message_validity as part of the new submodule, that refactor some backend code from orchestration classes related to the filtering and type-checking of exchanged messages.

Usability updates

A few minor changes were made in hope that they can improve DecLearn usability for end-users.

Record and save training losses

The Model API was updated so that Model.compute_batch_gradients now records the computed batch-averaged model loss as a float value in an internal buffer, and the newly-introduced Model.collect_training_losses method enables getting all stored values (and purging the buffer on the way).

The FederatedClient was consequently updated to collect and export training loss values at the end of each and every training round when a Checkpointer is attached to it (otherwise, values are purged from memory but not recorded to disk).

Add a verbosity option for `FederatedClient` and `TrainingManager`.

FederatedClient and TrainingManager now both accept a verbose: bool=True instantiation keyword argument, that changes:

(a) the default logger verbosity level: if logger=None is also passed, the default logger will have 'info'-level verbosity if verbose and declearn.utils.LOGGING_LEVEL_MAJOR-level if not verbose, so that only evaluation metrics and errors are logged.
(b) the optional display of a progressbar when conducting training or evaluation rounds; if not verbose, no progressbar is used.

Add 'TrainingManager.(train|evaluate)_under_constraints' to the API.

These public methods enable running training or evaluation rounds without relying on Message structures to specify parameters nor collect results.

Modularize 'run_as_processes' input specs.

The declearn.utils.run_as_processes util was modularized to that routines can be specified in various ways. Previously, they could only be passed as (func, args) tuples. Now, they can either be passed as (func, args), (func, kwargs) or (func, args, kwargs), where args is still a tuple of positional arguments, and kwargs is a dict of keyword ones.

Other changes

FederatedServer now keeps track of clients having received the latest global model weights, and avoids sending them redundantly with future training (or evaluation) requests. To achieve this, TrainRequest and EvaluationRequest now support setting their weights field to None.

Update TensorFlow supported versions

Older TensorFlow versions (v2.5 to 2.10 included) were improperly marked as supported in spite of TensorflowOptiModule requiring at least version 2.11 to work (due to changes of the Keras Optimizer API). This has been corrected.

The latest TensorFlow version (v2.16) introduces backward-breaking changes, due to the backend swap from Keras 2 to Keras 3. Our backend code was updated to both add support for this newer Keras backend, and preserve existing support.

Note that at the moment, the CI does not support TensorFlow above 2.13, due to newer versions not being compatible with Python 3.8. As such, our code will be tested to remain backward-compatible. Forward compatibility has been (and will keep being) tested locally with a newer Python version.

Deprecate `declearn.dataset.load_from_json`

As the save_to_json and from_to_json methods were removed from the declearn.dataset.Dataset API in DecLearn 2.3.0, there is no longer a guarantee that this function works (save with InMemoryDataset).

As a consequence, this function should have been deprecated, and has now been documented as such planned for removal in DecLearn 2.6 and/or 3.0.

declearn v2.4.0

New version policy (and future roadmap)

Revise all aggregation APIs

Revise the overall design for aggregation and introduce Aggregate API

Revise Aggregator API, introducing ModelUpdates

Revise auxiliary variables for Optimizer, introducing AuxVar

Revise Metric API, introducing ModelState