Get Data from Provenance Files

yProv4ML offers a set of utility functions for extracting the information logged in the provenance.json file.

All of these functions expect the data argument to be a pandas.DataFrame. For a provenance JSON file produced by yProv4ML, the data can be obtained as shown in the example below.

Example:
import json
# path_to_prov_json is the path to the provenance.json file
with open(path_to_prov_json) as f:
    data = json.load(f)

Utility Functions

def get_metrics(data : pd.DataFrame, keyword : Optional[str] = None) -> List[str]

The get_metrics function returns the names of all metrics available in the provided provenance data. If a keyword is specified, only the metrics whose names contain that keyword are returned.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| data | pd.DataFrame | Required | The dataset containing metrics. |
| keyword | Optional[str] | None | If provided, filters the metrics to only those containing this keyword. |

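The keyword filter described above can be pictured as a substring match over metric names. The sketch below is not yProv4ML's implementation: the helper `filter_metric_names` and the metric names are hypothetical, and it assumes the match is a case-sensitive substring test.

```python
from typing import List, Optional


def filter_metric_names(names: List[str], keyword: Optional[str] = None) -> List[str]:
    """Hypothetical stand-in for the keyword filtering done by get_metrics."""
    if keyword is None:
        # No keyword: every metric name is returned.
        return list(names)
    # Assumption: a metric matches when the keyword occurs anywhere in its name.
    return [name for name in names if keyword in name]


# Hypothetical metric names, for illustration only.
metric_names = ["loss_training", "loss_validation", "accuracy_validation"]

all_metrics = filter_metric_names(metric_names)
loss_metrics = filter_metric_names(metric_names, keyword="loss")
```

With these made-up names, `loss_metrics` keeps the two loss entries and drops the accuracy one, while `all_metrics` is the unfiltered list.
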
def get_metric(
    data : pd.DataFrame, 
    metric : str, 
    time_in_sec : bool = False, 
    time_incremental : bool = False, 
    sort_by : Optional[str] = None, 
    start_at : Optional[int] = None,
    end_at : Optional[int] = None
) -> pd.DataFrame

The get_metric function extracts a specific metric from the dataset, with additional options for formatting and filtering:

  • It allows conversion of time-based metrics to seconds.
  • It can return time-incremental values instead of absolute values.
  • Sorting and range selection (start and end points) can be applied.

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| data | pd.DataFrame | Required | The dataset containing metrics. |
| metric | str | Required | The specific metric to retrieve. |
| time_in_sec | bool | False | If True, converts time-based metrics to seconds. |
| time_incremental | bool | False | If True, returns incremental values instead of absolute values. |
| sort_by | Optional[str] | None | Sorts the metric values by the specified column. |
| start_at | Optional[int] | None | Filters data to start at this index. |
| end_at | Optional[int] | None | Filters data to end at this index. |

The function returns a DataFrame with the following columns:

  • value: contains the metric items
  • epoch: contains the corresponding epochs
  • time: contains the corresponding time steps
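
Because the result is an ordinary DataFrame, it can be post-processed with standard pandas operations. The sketch below does not call get_metric: it builds a small frame by hand with the three documented columns (the numbers are made up, not real output) and then averages the metric per epoch.

```python
import pandas as pd

# Hand-built stand-in for a get_metric result; the numbers are illustrative only.
metric_df = pd.DataFrame(
    {
        "value": [0.9, 0.7, 0.5, 0.3],
        "epoch": [0, 0, 1, 1],
        "time": [10, 20, 30, 40],
    }
)

# Mean metric value for each epoch.
mean_per_epoch = metric_df.groupby("epoch")["value"].mean()

# Rows belonging to the last recorded epoch only.
last_epoch_rows = metric_df[metric_df["epoch"] == metric_df["epoch"].max()]
```

The same pattern should apply to a frame returned by get_metric, provided it carries the value, epoch and time columns listed above.
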
def get_param(data : pd.DataFrame, param : str) -> Any

Retrieves the single value corresponding to the given param. This function is useful when the parameter is expected to have a unique value and its label exactly matches the one in the provenance JSON file.

def get_params(data : pd.DataFrame, param : str) -> List[Any]

Retrieves a list of values for the given param. This is useful when the provenance JSON file holds multiple values for the parameter (for example, when its labels are marked with an incremental ID), so that they can be analyzed or aggregated together.

| Parameter | Type | Return Type | Description |
|-----------|------|-------------|-------------|
| data | pd.DataFrame | - | The dataset containing parameters. |
| param | str | - | The specific parameter to retrieve. |
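
The difference between get_param and get_params can be sketched with a plain dictionary. This is not the library's implementation: the parameter labels, the `_<n>` incremental-ID suffix convention, and both helper functions are assumptions made purely for illustration.

```python
from typing import Any, Dict, List

# Hypothetical parameter labels; "layer_size" is logged twice with an incremental ID.
logged_params: Dict[str, Any] = {
    "learning_rate": 0.01,
    "layer_size_0": 64,
    "layer_size_1": 128,
}


def lookup_param(params: Dict[str, Any], label: str) -> Any:
    """Exact-label lookup, in the spirit of get_param."""
    return params[label]


def lookup_params(params: Dict[str, Any], label: str) -> List[Any]:
    """Collect every value whose label is `label` or `label_<n>` (assumed convention)."""
    values = []
    for name, value in params.items():
        prefix, _, suffix = name.rpartition("_")
        if name == label or (prefix == label and suffix.isdigit()):
            values.append(value)
    return values


learning_rate = lookup_param(logged_params, "learning_rate")
layer_sizes = lookup_params(logged_params, "layer_size")
```

Here the exact lookup yields one scalar, while the list lookup gathers both layer_size entries in the order they were logged.
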


More utility functions are also available:

def get_avg_metric(data, metric) -> pd.DataFrame: ...
def get_sum_metric(data, metric) -> pd.DataFrame: ...
def get_metric_time(data, metric, time_in_sec=False) -> pd.DataFrame: ...