In the sections below, we outline Federated Learning (FL) use cases to which the Rhino Federated Computing Platform (FCP) can be applied. The Rhino FCP enables numerous FL workflows, including horizontal and vertical FL model training from scratch, model fine-tuning, and model inference. Rhino lets you adapt traditional machine learning and deep learning code built with frameworks such as PyTorch, TensorFlow, Scikit-learn, and XGBoost to a federated paradigm. The Rhino FCP helps build a container image, via the GUI or CLI, that encapsulates the adapted NVIDIA FLARE code along with its required dependencies, and then pushes the image to a private artifact registry, making the adapted training code available for model training and evaluation.
Here we describe a reference model training workflow, from code development to code execution, using Rhino FCP product features and AI code development best practices.
- Develop and test local training code - In a local development environment, develop the model training code and test it against a small dummy dataset that uses the same schema as your real data (a minimal local-training sketch follows this list).
- Federated Trainer Code - Adapt the model training code for federated training (a sketch of this adaptation using NVFlare's Client API follows this list). We will provide step-by-step guidance on how to code this step.
- Create Federated Training Configuration - Multiple federated training tasks can be started to train different variations of the model or to use different optimization hyperparameters. With the Rhino FCP, one can run multiple experiments with different model versions, perform hyperparameter tuning, store multiple checkpoints/parameters, halt model runs, or run validation using different model parameters and/or different validation datasets at sites (an illustrative configuration example also follows this list).
- Build the NVFlare container image - Use the Rhino FCP UI or SDK (auto-containerization), or use the Docker CLI, to build the container image and push it to a container registry.
- Run Model Training - Run the model training code on the selected clients by choosing training and validation datasets. Under the hood, the Rhino FCP automates and orchestrates the entire federated training process.
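
As a concrete illustration of the first step, below is a minimal local training sketch: a small PyTorch tabular classifier trained on randomly generated dummy data. The model architecture, feature count, and hyperparameters are illustrative assumptions and not part of the Rhino FCP; the only requirement is that the dummy data match the schema of your real dataset.

```python
# Minimal local training sketch (illustrative): a small tabular classifier
# trained on dummy data that mirrors the schema of the real dataset.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

N_FEATURES = 12  # assumed width of the real tabular schema (illustrative)
model = nn.Sequential(nn.Linear(N_FEATURES, 32), nn.ReLU(), nn.Linear(32, 2))

# Dummy dataset with the same shape and dtypes as the production data
X = torch.randn(256, N_FEATURES)
y = torch.randint(0, 2, (256,))
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def train_one_epoch():
    model.train()
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()

if __name__ == "__main__":
    for epoch in range(5):
        train_one_epoch()
```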
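For the "Federated Trainer Code" step, the sketch below shows one common adaptation pattern based on NVFlare's Client API: the local training loop stays the same, but each round the script receives global parameters from the server, trains locally, and sends the updated parameters back. It reuses `model` and `train_one_epoch()` from the sketch above. Treat this as an assumption-laden sketch; the exact adaptation pattern Rhino FCP prescribes may differ.

```python
# Sketch of adapting the local training loop with NVFlare's Client API
# (based on NVFlare documentation for recent releases; Rhino FCP's prescribed
# pattern may differ).
import nvflare.client as flare

flare.init()                                    # register this process as an FL client
while flare.is_running():
    input_model = flare.receive()               # global model parameters for this round
    model.load_state_dict(input_model.params)   # load the global weights locally

    train_one_epoch()                           # unchanged local training step

    # Send the locally updated parameters back to the FL server for aggregation
    flare.send(flare.FLModel(params=model.state_dict()))
```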
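To illustrate the configuration step, here is a purely hypothetical way to describe several training variants with different hyperparameters and aggregation methods. The field names (`run_name`, `learning_rate`, `num_rounds`, `aggregation`, `mu`) are invented for illustration and do not reflect Rhino FCP's actual configuration schema.

```python
# Hypothetical experiment matrix for multiple federated training runs.
# All keys are illustrative; consult the Rhino FCP documentation for the real schema.
experiments = [
    {"run_name": "baseline",     "learning_rate": 0.01,  "num_rounds": 20, "aggregation": "FedAvg"},
    {"run_name": "fedprox-mu01", "learning_rate": 0.01,  "num_rounds": 20, "aggregation": "FedProx", "mu": 0.1},
    {"run_name": "low-lr",       "learning_rate": 0.001, "num_rounds": 40, "aggregation": "FedAvg"},
]

for exp in experiments:
    print(f"Launching federated training run: {exp['run_name']}")
    # ... submit `exp` as a federated training task via the Rhino FCP UI or SDK ...
```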
During a federated training run, each client trains the model on its local training data, producing a new set of model parameters, and sends its update to the Rhino Cloud FL server. The FL server aggregates the updates from the FL clients using the selected federated aggregation method (e.g., FedAvg, FedProx) and produces a new global model. The global model is then evaluated against validation data. You can track and visualize the global model's metrics (e.g., loss and accuracy) on the validation data using TensorBoard. The process repeats until a specified number of federated rounds is completed or until a termination condition is met (e.g., convergence of the model weights).
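As a sketch of what the aggregation step computes, the snippet below implements a FedAvg-style weighted average in plain NumPy: each client's parameters are weighted by its share of the total training samples. This is only meant to make the math concrete; the actual aggregation is performed by the Rhino Cloud FL server using the selected NVFlare aggregation method.

```python
import numpy as np

def fed_avg(client_params, client_sizes):
    """FedAvg sketch: weight each client's parameters by its share of the total samples."""
    total = sum(client_sizes)
    aggregated = {}
    for name in client_params[0]:
        aggregated[name] = sum(
            (n / total) * params[name] for params, n in zip(client_params, client_sizes)
        )
    return aggregated

# Example: two clients with different amounts of local data
clients = [{"w": np.array([1.0, 2.0])}, {"w": np.array([3.0, 4.0])}]
sizes = [100, 300]
print(fed_avg(clients, sizes))  # {'w': array([2.5, 3.5])}
```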
For a hands-on example using XGBoost with tabular data, see Part 1: Federated Training Example with Tabular Data.