yProv4DV

yProv4DV (Data Visualization) is a Python utility that packages the code, inputs and outputs of data visualization scripts. Once integrated, it produces a zip file containing all the information needed to reproduce the current script, including a copy of the files used. The library is part of the yProv framework, so it can also produce W3C PROV-compliant files useful for interpretability and reproducibility.

Features

The library automatically collects the inputs, outputs and source code used during the program's execution. If a file is too large, the user can choose to save only the information needed to create the chart. A provenance graph of the program can also be created, along with its visual representation and the RO-Crate package for the script's reproducibility.

For an example, run:

python ./examples/customized.py

Example

The examples folder contains a simple data visualization script in Python. It is already integrated with the yProv4DV library and can be run with the command:

python ./examples/simple.py

This execution will create:

The prov directory (its name is customizable), which holds all the information for the current execution: inputs, outputs and source code (src), each in its own folder. In the same directory, the library also creates a set of provenance files describing the current execution (in json, dot and svg formats).
prov.zip: contains all the aforementioned information in a zipped RO-Crate.
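
To check what a run produced, the result can be inspected with the standard library alone. The sketch below assumes the names mentioned above (a prov directory and a prov.zip archive); summarize_run is a hypothetical helper written for this illustration and is not part of yProv4DV.

```python
import zipfile
from pathlib import Path


def summarize_run(prov_dir="prov", archive="prov.zip"):
    """List the files in the provenance directory and in the zip archive.

    The default names are assumptions taken from the text above; pass other
    paths if the provenance directory was customized.
    """
    prov_path = Path(prov_dir)
    # Relative paths of every file stored under the provenance directory
    on_disk = sorted(
        p.relative_to(prov_path).as_posix()
        for p in prov_path.rglob("*")
        if p.is_file()
    )
    # Names of the members stored in the zipped RO-Crate
    with zipfile.ZipFile(archive) as zf:
        in_zip = sorted(zf.namelist())
    return on_disk, in_zip
```

Calling summarize_run() from the folder where the script was executed returns two sorted lists, which makes it easy to confirm that the expected inputs, outputs and provenance files are present.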


Installation

Graphviz must be installed first; see the yProv4ML documentation page for instructions.

Then:

pip install yprov4dv

Or:

git clone https://github.com/HPCI-Lab/yProv4DV.git
cd yProv4DV
pip install -r requirements.txt

pip install .
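
After installing, a quick sanity check can confirm that both pieces are visible from Python. This is only a sketch: check_environment is a hypothetical helper, and it assumes that Graphviz is reachable through its dot executable on the PATH.

```python
import importlib.util
import shutil


def check_environment(module_name="yprov4dv", executable="dot"):
    """Report whether a Python package and a command-line tool can be found."""
    return {
        # True if the package can be imported from the current environment
        "package": importlib.util.find_spec(module_name) is not None,
        # True if the executable (assumed here: Graphviz's dot) is on the PATH
        "graphviz": shutil.which(executable) is not None,
    }
```

If either value in the returned dictionary is False, the corresponding installation step above most likely needs to be repeated.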


Usage

While the library attempts to catch all read and write operations performed by the Python script, operations made through unsupported libraries might go unnoticed. For these cases, the user can call the log_input and log_output functions after start_run to manually flag files as relevant to the execution.

import yprov4dv
yprov4dv.start_run()
# To track a file as input
yprov4dv.log_input(path_to_untracked_file)

# To track a file as output
yprov4dv.log_output(path_to_untracked_file)
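
When many untracked files have to be flagged, the calls above can be wrapped in a small loop. The helper below is a hypothetical convenience, not part of yProv4DV; it takes the logging function as an argument so that the same code works for both log_input and log_output.

```python
from glob import glob


def log_matching(pattern, log_fn):
    """Call log_fn on every path matching a glob pattern; return the paths."""
    paths = sorted(glob(pattern, recursive=True))
    for path in paths:
        # log_fn is expected to be yprov4dv.log_input or yprov4dv.log_output
        log_fn(path)
    return paths
```

For instance, log_matching("assets/*.bin", yprov4dv.log_input), called after start_run, would flag every matching file as an input (the assets/*.bin pattern is only an illustration).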

The behaviour of yProv4DV can be changed by passing parameters to the start_run function. All of its parameters are listed below:

provenance_directory: (str) changes where the inputs, outputs and code directory are stored;
prefix: (str) changes the prefix given to fields in the provenance document;
run_name: (str) changes the run name inside the provenance file;
create_json_file: (True or False) whether the json file is created or not;
create_dot_file: (True or False) whether the dot file is created or not; cannot be True if create_json_file is False;
create_svg_file: (True or False) whether the svg file is created or not; cannot be True if create_json_file or create_dot_file is False;
create_rocrate: (True or False) whether the ro-crate zip is created or not;
default_namespace: (str) changes the default namespace inside the provenance file;
save_input_files_full: (True or False) whether input files are saved in full;
save_input_files_subset: (True or False) whether inputs are saved as a subset (only the plotted data);
skip_files_larger_than: (int) size threshold in Mb; files larger than it will not be copied;
verbose: (True or False).
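
The dependency between the three file flags (dot needs json, svg needs both json and dot) can be checked before calling start_run. The function below is a hypothetical pre-check that only mirrors the constraints listed above; how yProv4DV itself reacts to an invalid combination is not covered here.

```python
def check_output_flags(create_json_file, create_dot_file, create_svg_file):
    """Raise ValueError if the flag combination violates the listed rules."""
    # The dot file can only be requested together with the json file
    if create_dot_file and not create_json_file:
        raise ValueError("create_dot_file requires create_json_file=True")
    # The svg file can only be requested together with both json and dot
    if create_svg_file and not (create_json_file and create_dot_file):
        raise ValueError(
            "create_svg_file requires create_json_file=True "
            "and create_dot_file=True"
        )
    return True
```

A combination such as check_output_flags(True, True, False) passes, while check_output_flags(False, True, False) raises a ValueError.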

Current Compatibilities

Currently, the yProv4DV library is able to track input files which are opened by the following libraries:

pandas (read_csv, read_parquet, read_excel, read_json)
xarray (open_dataset, open_mfdataset)
geopandas (read_file)
numpy (load)
torch (load)
rasterio (open)
As well as the standard Python calls (such as open())
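
To give an intuition of how read and write calls can be observed at all, the sketch below wraps the built-in open function and records every path passed to it. This is a generic illustration of the technique, under the assumption that wrapping is one reasonable way to do it; it is not taken from the yProv4DV source, and record_opened_files is a hypothetical name.

```python
import builtins
from contextlib import contextmanager


@contextmanager
def record_opened_files():
    """Temporarily wrap builtins.open and collect (path, mode) pairs."""
    seen = []
    original_open = builtins.open

    def tracking_open(file, mode="r", *args, **kwargs):
        # Remember which file was requested and how, then delegate
        seen.append((str(file), mode))
        return original_open(file, mode, *args, **kwargs)

    builtins.open = tracking_open
    try:
        yield seen
    finally:
        # Always restore the original function, even if an error occurred
        builtins.open = original_open
```

Files opened inside a "with record_opened_files() as seen:" block end up in the seen list; modes containing "r" could then be treated as inputs and modes containing "w" or "a" as outputs. Libraries that bypass the built-in open would not be noticed by this wrapper, which is the kind of situation where manual logging helps.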

Additionally, if data is plotted using:

matplotlib (plot, bar, ...)
seaborn (scatterplot, lineplot, barplot, histplot, boxplot)

then only the subset of data used for the visualization can be saved in a separate file (by setting the save_input_files_subset option to True).
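
The idea of keeping only the plotted data can be illustrated with the standard csv module. The function below is a hypothetical sketch that copies a chosen set of columns (and optionally only the last rows) of a CSV file; the format in which yProv4DV actually stores the subset is not assumed here.

```python
import csv


def write_plotted_subset(src, dst, columns, last_n=None):
    """Copy only the given columns (and optionally the last rows) of a CSV."""
    with open(src, newline="") as f:
        rows = list(csv.DictReader(f))
    # Keep only the trailing rows when a positive last_n is given
    if last_n:
        rows = rows[-last_n:]
    with open(dst, "w", newline="") as f:
        # extrasaction="ignore" drops every column not listed in `columns`
        writer = csv.DictWriter(f, fieldnames=columns, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

For a large input where only two columns and the last year of data are plotted, the resulting file can be far smaller than the original while still containing everything shown in the chart.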

Output files of any type generated during the execution of the program are also logged.


Example

Two examples are provided: a simple one and a customized one.


Simple Example

Example:

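import yprov4dv
import pandas as pd
import matplotlib.pyplot as plt

# elaborate is a helper imported from the example's lib module
from lib import elaborate

# Read the data and flag the file as an input of the run
data = pd.read_csv("./assets/results.csv")
yprov4dv.log_input("./assets/results.csv")

# Derive a second series from the "points" column
data["second_series"] = elaborate(data["points"])

# Plot the data and save the figure
data.plot()
plt.legend()
plt.savefig("tmp.png")

# Flag the saved figure as an output of the run
yprov4dv.log_output("tmp.png")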



Customized Example

Example:

import yprov4dv
yprov4dv.start_run(
    create_rocrate=False, 
    create_json_file=True, 
    create_dot_file=True, 
    create_svg_file=True, 
    save_input_files_subset=True, # Take only the data plotted
    skip_files_larger_than=1 # Larger than 1 Mb
)

import pandas as pd
import matplotlib.pyplot as plt

data_path = "assets/large.csv"
# This log is not necessary, the file will be tracked anyway
yprov4dv.log_input(data_path)
data = pd.read_csv(data_path)

data['time'] = pd.to_datetime(data['time'])
data = data.set_index('time')

recent_data = data.tail(365).copy()
recent_data["Price_Smoothing"] = recent_data["PriceUSD"].rolling(window=30).mean()

# This will capture ONLY the last 365 days of data into your PROV log
# (both "PriceUSD", "Price_Smoothing")
ax = recent_data[["PriceUSD", "Price_Smoothing"]].plot(
    figsize=(10, 6), 
    title="Bitcoin Price Trend (Last Year)",
    color=['#1f77b4', '#ff7f0e'],
    linewidth=2
)

plt.ylabel("Price (USD)")
plt.grid(True, linestyle='--', alpha=0.7)
plt.legend(["Daily Price", "30-Day Average"])

# Save and log the output
output_path = "btc_analysis.png"
plt.savefig(output_path, dpi=300)
# Not necessary, the file will be tracked anyway
yprov4dv.log_output(output_path)
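
The skip_files_larger_than=1 setting used in this example means that files above the threshold are not copied. The sketch below shows what such a size check could look like; split_by_size is a hypothetical helper, and the conversion of 1 Mb to 1024 * 1024 bytes is an assumption, since the exact unit used by the library is not stated here.

```python
import os

# Assumption: one "Mb" is treated as 1024 * 1024 bytes
BYTES_PER_MB = 1024 * 1024


def split_by_size(paths, limit_mb):
    """Split paths into (kept, skipped) according to a size limit in Mb."""
    kept, skipped = [], []
    for path in paths:
        if os.path.getsize(path) > limit_mb * BYTES_PER_MB:
            skipped.append(path)  # too large: would not be copied
        else:
            kept.append(path)
    return kept, skipped
```

Running split_by_size(["assets/large.csv"], limit_mb=1) before the script gives a quick indication of whether the input is likely to be copied in full or only as the plotted subset.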

