yProv4ML

GPLv3 License

This library is part of the yProv suite and provides a unified interface for logging and tracking provenance information in machine learning experiments, including distributed and large-scale executions.

It allows users to create provenance graphs from the logged information, and to save all metrics and parameters in JSON format.

Data Model

(Figure: yProv4ML data model.)

Example

(Figure: example provenance graph, generated from the example program provided in the example directory.)

Metrics Visualization

(Figures: loss and GPU usage over time; emission rate over time.)

Experiments and Runs

An experiment is a collection of runs. Each run is a single execution of a machine learning model. By changing the experiment_name parameter in the start_run function, the user can create a new experiment. All artifacts and metrics logged during the execution of the experiment will be saved in the directory specified by the experiment ID.

Several runs can be executed in the same experiment. All runs will be saved in the same directory (according to the specific experiment name and ID).

Documentation

For detailed information, please refer to the Documentation

Installation

Install from the repository:

git clone https://github.com/HPCI-Lab/yProvML.git
cd yProvML

pip install -r requirements.txt
# Install the package
pip install .
# Use apple extra if on a Mac
pip install .[apple]

# or install for specific arch
pip install .[nvidia] # or .[amd]

or simply:

pip install --no-cache-dir git+https://github.com/HPCI-Lab/yProvML


Setup

Before using the library, the user must set up the MLflow execution, as well as library-specific configurations:

prov4ml.start_run(
    prov_user_namespace: str,
    experiment_name: Optional[str] = None,
    provenance_save_dir: Optional[str] = None,
    collect_all_processes: Optional[bool] = False,
    save_after_n_logs: Optional[int] = 100,
    rank: Optional[int] = None,
)

The parameters are as follows:

| Parameter | Type | Description |
|---|---|---|
| prov_user_namespace | string | Required. User namespace for the provenance graph |
| experiment_name | string | Required. Name of the experiment |
| provenance_save_dir | string | Required. Directory to save the provenance graph |
| collect_all_processes | bool | Optional. Whether to collect all processes |
| save_after_n_logs | int | Optional. Save the graph after n logs |
| rank | int | Optional. Rank of the process |

At the end of the experiment, the user must end the run:

prov4ml.end_run(
    create_graph: Optional[bool] = False, 
    create_svg: Optional[bool] = False, 
)

| Parameter | Type | Description |
|---|---|---|
| create_graph | bool | Optional. Whether to create the graph |
| create_svg | bool | Optional. Whether to create the svg |

This call allows the library to save the provenance graph in the specified directory.
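For instance, a minimal run could be wrapped as follows (the namespace, experiment name, and output directory are placeholder values, not defaults of the library):

import prov4ml

prov4ml.start_run(
    prov_user_namespace="www.example.org",   # placeholder user namespace
    experiment_name="experiment_name",       # placeholder experiment name
    provenance_save_dir="prov",              # placeholder output directory
)

# ... training and logging calls go here ...

prov4ml.end_run(create_graph=True, create_svg=True)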


Provenance Graph Creation (GraphViz)

The standard way to generate the .dot file containing the provenance graph is to set the create_graph parameter of end_run to True.

If the user needs to turn a PROV-JSON file created with yProv4ML into a .dot file, the following command can be used:

python -m prov4ml.prov2dot --prov_json prov.json --output prov_graph.dot

Provenance Graph Image (SVG)

The standard way to generate the .svg image of the provenance graph is to set the create_svg parameter of end_run to True. In this case both create_graph and create_svg have to be set to True.

If the user needs to turn a .dot file into an .svg file, the following command can be used:

python -m prov4ml.dot2svg --dot prov_graph.dot --output prov_graph.svg

Alternatively, using the Graphviz suite directly:

dot -Tsvg -O prov_graph.dot


General Logging

When logging parameters and metrics, the user must specify the context of the information. The available contexts are:

  • TRAINING: adds the information to the training context
  • VALIDATION: adds the information to the validation context
  • TESTING: adds the information to the testing context

Log Parameters

To specify arbitrary training parameters used during the execution of the experiment, the user can call the following function.

prov4ml.log_param(
    key: str, 
    value: str, 
)

| Parameter | Type | Description |
|---|---|---|
| key | string | Required. Name of the parameter |
| value | string | Required. Value of the parameter |
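For example, to record a training hyperparameter (the key and value below are illustrative):

prov4ml.log_param("batch_size", "64")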

Log Metrics

To log metrics tracked during the execution of the experiment, the user can call the following function.

prov4ml.log_metric(
    key: str, 
    value: float, 
    context: Context, 
    step: Optional[int] = None, 
    source: LoggingItemKind = None, 
)

| Parameter | Type | Description |
|---|---|---|
| key | string | Required. Name of the metric |
| value | float | Required. Value of the metric |
| context | prov4ml.Context | Required. Context of the metric |
| step | int | Optional. Step of the metric |
| source | LoggingItemKind | Optional. Source of the metric |

The step parameter is optional and can be used to specify the current time step of the experiment, for example the current epoch. The source parameter is also optional and can be used to specify the source of the metric, for example which library the data comes from. If omitted, yProv4ML will try to determine the origin automatically.
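A typical call inside a training loop might look like the following (the metric name, loss variable, and epoch counter are illustrative):

prov4ml.log_metric("train_loss", loss.item(), context=prov4ml.Context.TRAINING, step=epoch)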

Log Artifacts

To log artifacts, the user can call the following function.

prov4ml.log_artifact(
    artifact_path : str, 
    context: Context,
    step: Optional[int] = None, 
    timestamp: Optional[int] = None
)

| Parameter | Type | Description |
|---|---|---|
| artifact_path | string | Required. Path to the artifact |
| context | prov4ml.Context | Required. Context of the artifact |
| step | int | Optional. Step of the artifact |
| timestamp | int | Optional. Timestamp of the artifact |

The function logs the artifact in the current experiment. The artifact can be a file or a directory. All logged artifacts are saved in the artifacts directory of the current experiment, while the related information is saved in the PROV-JSON file, along with a reference to the file.
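For example, to attach a configuration file produced by the run (the path shown is illustrative):

prov4ml.log_artifact("config.yaml", context=prov4ml.Context.TRAINING, step=epoch)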

Log Models

prov4ml.log_model(
    model: Union[torch.nn.Module, Any], 
    model_name: str = "default", 
    log_model_info: bool = True, 
    log_as_artifact=True, 
)

| Parameter | Type | Description |
|---|---|---|
| model | Union[torch.nn.Module, Any] | Required. The model to be logged |
| model_name | string | Optional. Name of the model |
| log_model_info | bool | Optional. Whether to log model information |
| log_as_artifact | bool | Optional. Whether to log the model as an artifact |

It sets the model for the current experiment. It can be called anywhere before the end of the experiment. The same call also logs some model information, such as the number of parameters and the memory footprint of the model architecture. The saving of this information can be toggled with the log_model_info = False parameter. The model can be saved as an artifact by setting the log_as_artifact = True parameter, which will save its parameters in the artifacts directory and reference the file in the PROV-JSON file.
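For example, to log the final model at the end of training (the model variable is illustrative):

prov4ml.log_model(mnist_model, "mnist_model_final")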

prov4ml.save_model_version(
    model: Union[torch.nn.Module, Any], 
    model_name: str, 
    context: Context, 
    step: Optional[int] = None, 
    timestamp: Optional[int] = None
)

The save_model_version function saves the state of a PyTorch model and logs it as an artifact, enabling version control and tracking within machine learning experiments.

| Parameter | Type | Description |
|---|---|---|
| model | torch.nn.Module | Required. The PyTorch model to be saved. |
| model_name | str | Required. The name under which to save the model. |
| context | Context | Required. The context in which the model is saved. |
| step | Optional[int] | Optional. The step or epoch number associated with the saved model. |
| timestamp | Optional[int] | Optional. The timestamp associated with the saved model. |

This function saves the model's state dictionary to a specified directory and logs the saved model file as an artifact for provenance tracking. It ensures that the directory for saving the model exists, creates it if necessary, and uses the torch.save method to save the model. It then logs the saved model file using log_artifact, associating it with the given context and optional step number.
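For example, to save an incremental checkpoint at every epoch (the model variable and epoch counter are illustrative):

prov4ml.save_model_version(mnist_model, f"mnist_model_version_{epoch}", prov4ml.Context.TRAINING, step=epoch)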

Log Datasets

yProv4ML offers helper functions to log information and stats on specific datasets.

prov4ml.log_dataset(
    dataset : Union[DataLoader, Subset, Dataset], 
    label : str
)

| Parameter | Type | Description |
|---|---|---|
| dataset | Union[DataLoader, Subset, Dataset] | Required. The dataset to be logged |
| label | string | Required. The label of the dataset |

The function logs the dataset in the current experiment. The dataset can be a DataLoader, a Subset, or a Dataset class from PyTorch.
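For example (the dataset and loader below are illustrative):

from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=64)  # train_dataset is any PyTorch Dataset
prov4ml.log_dataset(train_loader, "train_dataset")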


Provenance Collection Creation

The provenance collection functionality can be used to create a summary file linking all PROV-JSON files generated during a run. These files come from distributed execution, where each process generates its own log file, and the user may want to create a single file containing all the information.

The collection can be created with the following command:

python -m prov4ml.prov_collection --experiment_path experiment_path --output_dir output_dir

Where experiment_path is the path to the experiment directory containing all the PROV-JSON files, and output_dir is the directory where the collection file will be saved.


Carbon Metrics

The prov4ml.log_carbon_metrics function logs carbon-related system metrics during machine learning experiments. The information logged is related to the time between the last call to the function and the current call.

prov4ml.log_carbon_metrics(
    context: Context,
    step: Optional[int] = None,
)

| Parameter | Type | Description |
|---|---|---|
| context | prov4ml.Context | Required. Context of the metric |
| step | int | Optional. Step of the metric |

This function logs the following system metrics:

| Metric | Description | Unit |
|---|---|---|
| Emissions | Emissions of the system | gCO2eq |
| Emissions rate | Emissions rate of the system | gCO2eq/s |
| CPU power | Power usage of the CPU | W |
| GPU power | Power usage of the GPU | W |
| RAM power | Power usage of the RAM | W |
| CPU energy | Energy usage of the CPU | J |
| GPU energy | Energy usage of the GPU | J |
| RAM energy | Energy usage of the RAM | J |
| Energy consumed | Energy consumed by the system | J |
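A typical usage pattern is to call the function once per epoch inside the training loop (the epoch counter is illustrative):

prov4ml.log_carbon_metrics(prov4ml.Context.TRAINING, step=epoch)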

How is CO2Eq calculated?

The CO2 equivalent (CO2eq) is a metric used to express emissions on a common scale, as the amount of CO2 that would have the same warming effect.

  • CO2eq is calculated by multiplying the energy consumed by the system by the carbon intensity (see the worked example after this list).
  • Energy is calculated by multiplying the power usage by the time interval; this is done for each component (CPU, GPU, RAM).
  • Carbon intensity is the amount of CO2 emitted per unit of energy consumed. It can be obtained in three ways:
    • Using cloud providers' carbon intensity data (Google).
    • Using the carbon intensity of the grid where the system is running (per country).
    • Using the electricity mix of the grid where the system is running (renewables / gas / petroleum / coal).
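As a small worked example, with assumed values rather than measured ones (100 W of total power over a 600 s interval, and a grid carbon intensity of 400 gCO2eq/kWh):

# Illustrative values, not measurements produced by yProv4ML
power_w = 100.0            # total CPU + GPU + RAM power draw, in watts
interval_s = 600.0         # time between two logging calls, in seconds
carbon_intensity = 400.0   # grid carbon intensity, in gCO2eq per kWh

energy_j = power_w * interval_s              # 60000 J
energy_kwh = energy_j / 3.6e6                # ~0.0167 kWh
emissions_g = energy_kwh * carbon_intensity  # ~6.7 gCO2eq
emissions_rate = emissions_g / interval_s    # ~0.011 gCO2eq/s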

Why is it decreasing?

The emissions rate can decrease due to the following reasons:

  • Idle time: The system is not being used, so the power usage is low.
  • Energy efficiency: The system is using less power to perform the same tasks.
  • Startup time: The system is starting up, so the power usage is high at the beginning.

After plotting the metrics saved with codecarbon, we can see that the emissions rate decreases over time.

(Figure: emissions rate over time.)

This shows that the energy draw is mostly constant over time, while the emissions rate decreases. This is because the rate is the ratio between emissions and elapsed time, and that ratio decreases over the course of the run.


System Metrics

The prov4ml.log_system_metrics function logs critical system performance metrics during machine learning experiments. The information logged is related to the time between the last call to the function and the current call.

prov4ml.log_system_metrics(
    context: Context,
    step: Optional[int] = None,
)

| Parameter | Type | Description |
|---|---|---|
| context | prov4ml.Context | Required. Context of the metric |
| step | int | Optional. Step of the metric |
| synchronous | bool | Optional. Whether to log the metric synchronously |
| timestamp | int | Optional. Timestamp of the metric |

This function logs the following system metrics:

| Metric | Description | Unit |
|---|---|---|
| Memory usage | Memory usage of the system | % |
| Disk usage | Disk usage of the system | % |
| GPU memory usage | Memory usage of the GPU | % |
| GPU usage | Usage of the GPU | % |
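As with the carbon metrics, this function is typically called once per epoch (the epoch counter is illustrative):

prov4ml.log_system_metrics(prov4ml.Context.TRAINING, step=epoch)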

FLOPs per Epoch

The log_flops_per_epoch function logs the number of floating-point operations (FLOPs) performed per epoch for a given model and dataset.

prov4ml.log_flops_per_epoch(
    label: str, 
    model: Union[torch.nn.Module, Any],
    dataset: Union[torch.utils.data.Dataset, torch.utils.data.DataLoader, torch.utils.data.Subset], 
    context: Context, 
    step: Optional[int] = None
)

| Parameter | Type | Description |
|---|---|---|
| label | string | Required. Label of the FLOPs |
| model | Union[torch.nn.Module, Any] | Required. Model used for the FLOPs calculation |
| dataset | Union[torch.utils.data.Dataset, torch.utils.data.DataLoader, torch.utils.data.Subset] | Required. Dataset used for the FLOPs calculation |
| context | prov4ml.Context | Required. Context of the metric |
| step | int | Optional. Step of the metric |

FLOPs per Batch

The log_flops_per_batch function logs the number of floating-point operations (FLOPs) performed per batch for a given model and batch of data.

prov4ml.log_flops_per_batch(
    label: str, 
    model: Union[torch.nn.Module, Any],
    batch: Any, 
    context: Context, 
    step: Optional[int] = None, 
)

| Parameter | Type | Description |
|---|---|---|
| label | string | Required. Label of the FLOPs |
| model | Union[torch.nn.Module, Any] | Required. Model used for the FLOPs calculation |
| batch | Any | Required. Batch of data used for the FLOPs calculation |
| context | prov4ml.Context | Required. Context of the metric |
| step | int | Optional. Step of the metric |
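For instance, inside a PyTorch training loop (the model, dataset, batch, and epoch variables are illustrative):

prov4ml.log_flops_per_epoch("epoch_flops", mnist_model, train_dataset, prov4ml.Context.TRAINING, step=epoch)
prov4ml.log_flops_per_batch("batch_flops", mnist_model, batch, prov4ml.Context.TRAINING, step=epoch)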


Execution Time

prov4ml.log_current_execution_time(
    label: str, 
    context: Context, 
    step: Optional[int] = None
)

| Parameter | Type | Description |
|---|---|---|
| label | string | Required. Label of the code portion |
| context | prov4ml.Context | Required. Context of the metric |
| step | int | Optional. Step of the metric |

The log_current_execution_time function logs the current execution time of the code portion specified by the label.

prov4ml.log_execution_start_time()

The log_execution_start_time function logs the start time of the current execution. It is automatically called at the beginning of the experiment.

prov4ml.log_execution_end_time()

The log_execution_end_time function logs the end time of the current execution. It is automatically called at the end of the experiment.
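A sketch of how these calls fit together in a training loop (the training function and epoch counter are illustrative placeholders):

for epoch in range(EPOCHS):
    train_one_epoch()  # placeholder for the actual training code
    # log the time spent in this portion of code, under the label "train_epoch_time"
    prov4ml.log_current_execution_time("train_epoch_time", prov4ml.Context.TRAINING, step=epoch)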


Register Metrics for custom Operations

After collecting a specific metric, the user often wants to aggregate it by applying functions such as mean, standard deviation, or min/max.

yProv4ML allows the user to register a specific metric to be aggregated, using the following function:

prov4ml.register_final_metric(
    metric_name : str,
    initial_value : float,
    fold_operation : FoldOperation
) 

where fold_operation indicates the function to be applied to the data.

Several FoldOperations are already defined, such as MAX, MIN, ADD and SUBTRACT. In any case, the user can always define a custom function, either with the signature:

def custom_foldOperation(x, y): 
    return x // y

Or by passing a lambda function:

prov4ml.register_final_metric("my_metric", 0, lambda x, y: x // y) 

The output of the aggregated metric is saved in the PROV-JSON file, as an attribute of the current execution.


Example of usage with PyTorch

This section provides an example of how to use yProv4ML with PyTorch.

The following code snippet shows how to log metrics, system metrics, carbon metrics, and model versions in a PyTorch training loop.

Example

import torch.nn.functional as F
from tqdm import tqdm

import prov4ml

# mnist_model, optim, train_loader and EPOCHS are assumed to be defined beforehand
for epoch in tqdm(range(EPOCHS)):
    for i, (x, y) in enumerate(train_loader):
        optim.zero_grad()
        y_hat = mnist_model(x)
        loss = F.cross_entropy(y_hat, y)
        loss.backward()
        optim.step()
        # log the training loss at every step
        prov4ml.log_metric("Loss_train", loss.item(), context=prov4ml.Context.TRAINING, step=epoch)

    # log system and carbon metrics (once per epoch), as well as the execution time
    prov4ml.log_carbon_metrics(prov4ml.Context.TRAINING, step=epoch)
    prov4ml.log_system_metrics(prov4ml.Context.TRAINING, step=epoch)
    prov4ml.log_current_execution_time("train_epoch_time", prov4ml.Context.TRAINING, step=epoch)
    # save incremental model versions
    prov4ml.save_model_version(mnist_model, f"mnist_model_version_{epoch}", prov4ml.Context.TRAINING, epoch)


Example of usage with PyTorch Lightning

This section provides an example of how to use yProv4ML with PyTorch Lightning.

In any Lightning module, the training_step, validation_step, and test_step methods can be overridden to log the necessary information.

def training_step(self, batch, batch_idx):
    x, y = batch
    y_hat = self(x)
    loss = self.loss(y_hat, y)
    prov4ml.log_metric("MSE_train", loss, prov4ml.Context.TRAINING, step=self.current_epoch)
    prov4ml.log_flops_per_batch("train_flops", self, batch, prov4ml.Context.TRAINING, step=self.current_epoch)
    return loss

This will log the mean squared error and the number of FLOPs per batch for each training step.

Alternatively, the on_train_epoch_end method can be overridden to log information at the end of each epoch.

def on_train_epoch_end(self) -> None:
    prov4ml.log_metric("epoch", self.current_epoch, prov4ml.Context.TRAINING, step=self.current_epoch)
    prov4ml.save_model_version(self, f"model_version_{self.current_epoch}", prov4ml.Context.TRAINING, step=self.current_epoch)
    prov4ml.log_system_metrics(prov4ml.Context.TRAINING, step=self.current_epoch)
    prov4ml.log_carbon_metrics(prov4ml.Context.TRAINING, step=self.current_epoch)
    prov4ml.log_current_execution_time("train_epoch_time", prov4ml.Context.TRAINING, self.current_epoch)