Usage
The library attempts to catch all read and write operations performed by the Python script, but file accesses made through unsupported libraries may go undetected. In that case, call the log_input and log_output directives after start_run to manually flag files as relevant to the execution.
import yprov4dv
yprov4dv.start_run()
# To track a file as input
yprov4dv.log_input(path_to_untracked_file)
# To track a file as output
yprov4dv.log_output(path_to_untracked_file)
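When several files are read through an unsupported library, it can be convenient to flag them in one pass. The following is a minimal sketch, not part of yProv4DV: the helper name log_existing_files is hypothetical, and it assumes that log_input and log_output accept a single file path, as in the snippet above. The logging function is passed in as an argument so the helper itself does not depend on the library.

```python
import os


def log_existing_files(paths, log_fn):
    """Call log_fn on every path that is an existing file.

    Hypothetical helper: log_fn is expected to be yprov4dv.log_input or
    yprov4dv.log_output. Returns the paths that were skipped because
    they do not point to an existing file.
    """
    skipped = []
    for path in paths:
        if os.path.isfile(path):
            log_fn(path)
        else:
            skipped.append(path)
    return skipped
```

After yprov4dv.start_run(), a call such as log_existing_files(my_paths, yprov4dv.log_input) would flag each existing file as an input and return the ones it could not find, where my_paths is a list of paths chosen by the user.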
The behaviour of yProv4DV can be changed by passing parameters to the start_run function.
All parameters accepted by start_run are listed below:
- provenance_directory: (str) changes where the inputs, outputs and code directory are stored;
- prefix: (str) changes the prefix given to fields in the provenance document;
- run_name: (str) changes the run name inside the provenance file;
- create_json_file: (True or False) whether the json file is created or not;
- create_dot_file: (True or False) whether the dot file is created or not, cannot be True if YPROV4DV_CREATE_JSON_FILE is False;
- create_svg_file: (True or False) whether the svg file is created or not, cannot be True if YPROV4DV_CREATE_JSON_FILE or YPROV4DV_CREATE_DOT_FILE are False;
- create_rocrate: (True or False) whether the ro-crate zip is created or not;
- default_namespace: (str) changes the default namespace inside the provenance file;
- save_input_files_full: (str) decides whether input files are saved in full;
- save_input_files_subset: (str) decides whether inputs are saved as a subset (only the plotted data);
- skip_files_larger_than: (int) in Mb, files larger than the threshold will not be copied;
- verbose: (True or False)
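The constraints between the three file-creation flags can be checked before a run is started. The sketch below is illustrative only: check_file_flags and start_checked_run are hypothetical helpers, not part of yProv4DV, and the sketch assumes that start_run accepts the parameters listed above as keyword arguments. How the library itself reacts to a conflicting combination is not covered here.

```python
def check_file_flags(create_json_file, create_dot_file, create_svg_file):
    """Return a list of conflicts between the file-creation flags.

    Mirrors the documented rules: the dot file needs the json file,
    and the svg file needs both the json file and the dot file.
    """
    problems = []
    if create_dot_file and not create_json_file:
        problems.append("create_dot_file requires create_json_file")
    if create_svg_file and not (create_json_file and create_dot_file):
        problems.append(
            "create_svg_file requires create_json_file and create_dot_file"
        )
    return problems


def start_checked_run(create_json_file, create_dot_file, create_svg_file,
                      **other_params):
    """Hypothetical wrapper: validate the flags, then call start_run."""
    conflicts = check_file_flags(
        create_json_file, create_dot_file, create_svg_file
    )
    if conflicts:
        raise ValueError("; ".join(conflicts))
    # Imported here so the check above can run without the library.
    import yprov4dv
    yprov4dv.start_run(
        create_json_file=create_json_file,
        create_dot_file=create_dot_file,
        create_svg_file=create_svg_file,
        **other_params,
    )
```

For example, start_checked_run(True, True, True, run_name="example_run") would pass the check and forward all four parameters to start_run, while start_checked_run(False, True, False) raises a ValueError before the library is touched.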
Current Compatibilities
Currently, the yProv4DV library is able to track input files which are opened by the following libraries:
- pandas (read_csv, read_parquet, read_excel, read_json)
- xarray (open_dataset, open_mfdataset)
- geopandas (read_file)
- numpy (load)
- torch (load)
- rasterio (open)
- standard Python calls (such as open())
Additionally, if data is plotted using:
- matplotlib (plot, bar, ...)
- seaborn (scatterplot, lineplot, barplot, histplot, boxplot)
then the subset of data used only for visualization can be saved in an isolated file (by setting the save_input_files_subset option to True).
Any output file generated during the execution of the program is also logged, independently of its file type.