hash_files
[source]
hash_files(file_list: List[str], block_size: int = 10485760, progressbar_min_size: int = 10737400000)
Takes a list of file paths and returns the SHA256 hash of the files in the list. If any of the entries is not a valid file, a ValueError is raised. Ignores any files named 'receipt.rst', as those are considered data intake files and not part of the work.
Parameters
file_list : List[str] List of strings denoting the files to be hashed. All strings must point to valid files or this method raises a ValueError.
block_size : int, optional Block size in bytes to read from disk. 10 MB is a good generic value: most files are smaller than 10 MB, so whole files can be loaded in one read when hashing. Defaults to 10_485_760 (10 MB).
progressbar_min_size : int, optional Minimum size in bytes a file needs to be to get its own progress bar during processing. The default was chosen to work well on an SSD reading at 400 MB/s. Defaults to 10_737_400_000 (~10 GB).
Returns
str A string representation of the SHA256 hash of the files provided in the file list.
Raises
ValueError Raised when a string in file_list does not refer to a valid file on the file system. Currently only Windows and POSIX filesystems are supported. This may change in future.
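To make the described behaviour concrete, here is a minimal sketch of the hashing logic: chunked SHA256 over every file, skipping 'receipt.rst' and raising ValueError for non-files. The name hash_files_sketch and the sorting of the input list (for a deterministic combined hash) are assumptions for illustration, not the library's actual implementation.

```python
import hashlib
from pathlib import Path
from typing import List

def hash_files_sketch(file_list: List[str], block_size: int = 10_485_760) -> str:
    """Sketch: SHA256 over the concatenated contents of the given files."""
    sha = hashlib.sha256()
    for name in sorted(file_list):  # sorted for determinism (assumption)
        path = Path(name)
        if path.name == "receipt.rst":
            continue  # data intake file, not part of the work
        if not path.is_file():
            raise ValueError(f"Not a valid file: {name}")
        with path.open("rb") as fh:
            # read in block_size chunks so large files never load whole
            while chunk := fh.read(block_size):
                sha.update(chunk)
    return sha.hexdigest()
```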
process_data_group
[source]
process_data_group(folder: Path, type: str, light: bool = False)
Return the system fields for a data group folder.
If the data group is a delivery type, only the data folder inside it is examined; for any other type the whole folder is examined.
Parameters
folder : Path The location to get metadata for.
type : str The type of the data group: 'delivery', 'raw_data', or 'dataset'.
light : bool, optional If set, skip the hashing. Defaults to False.
Returns
dict A dict of the following seven metadata elements calculated:
- name : Name of the folder of the data group
- type : The type of the data group processed
- last_update : The current date and time
- size : The size of the data on disk
- num_files : The number of data files.
- group_hash : A SHA256 hash of all the data in the folder
- group_last_modified : The most recent created-or-modified date across all files
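The metadata fields above can be sketched as follows. This is a simplified, hypothetical rendering (the name process_data_group_sketch is an assumption, and group_hash is omitted rather than computed), showing how the delivery/data-folder rule and the counts might be derived:

```python
from datetime import datetime
from pathlib import Path

def process_data_group_sketch(folder: Path, type: str, light: bool = False) -> dict:
    # 'delivery' groups: only the nested 'data' folder is scanned (per the
    # description above); every other type scans the whole folder.
    target = folder / "data" if type == "delivery" else folder
    files = [p for p in target.rglob("*") if p.is_file()]
    meta = {
        "name": folder.name,
        "type": type,
        "last_update": datetime.now(),
        "size": sum(p.stat().st_size for p in files),
        "num_files": len(files),
        # most recent created-or-modified timestamp across all files
        "group_last_modified": max(
            (max(p.stat().st_ctime, p.stat().st_mtime) for p in files),
            default=None,
        ),
    }
    if not light:
        # group_hash (SHA256 over all data in the folder) omitted in this sketch
        meta["group_hash"] = None
    return meta
```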
count_data_group_components
[source]
count_data_group_components(data_group: Path, data_extensions: list, report_extensions: list, script_extensions: list)
A utility method to analyze a folder, determine which data representations it contains, and check whether each has the three requisite elements: generation script, data, and report. It relies on certain conventions about the folder which must be followed:
- Each data representation is stored in its own folder; files in the root of the passed folder are ignored.
- Folders starting with "In_Progress" or "." are ignored.
- Each data representation folder should contain three entries; more won't cause an error but should be avoided.
- Report types have extensions: ['report','md','html','pptx','docx', ...], with the initial report extension applied to a folder containing report files if more than one report file is needed.
- Data types have extensions: ['data','parquet','hdf5', ...], with the initial data extension applied to folders, which allows the data to be spread over multiple files.
- Script types have extensions: ['script','ipynb','py','r','jl','sh', ...], where the first extension can be applied to a folder if more than one file was needed to process the data.
This analyzer will look only for the extensions listed and report how many of each of the types of files/folders exist in the root of the provided folder.
Parameters
data_group : Path A folder containing folders of data representations.
data_extensions : list Extensions counted as data files; the first entry may also be applied to a folder.
report_extensions : list Extensions counted as report files; the first entry may also be applied to a folder.
script_extensions : list Extensions counted as script files; the first entry may also be applied to a folder.
Returns
pd.DataFrame A table listing all the data representations which appear in the root of the folder.
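The folder-scanning conventions above can be sketched like this. The name count_data_group_components_sketch, the default extension tuples, and returning a list of dicts instead of a pd.DataFrame are simplifications for illustration, not the real signature:

```python
from pathlib import Path

def count_data_group_components_sketch(
    data_group: Path,
    data_extensions=("data", "parquet", "hdf5"),
    report_extensions=("report", "md", "html", "pptx", "docx"),
    script_extensions=("script", "ipynb", "py", "r", "jl", "sh"),
):
    rows = []
    for sub in sorted(data_group.iterdir()):
        # files in the root and In_Progress/hidden folders are ignored
        if not sub.is_dir() or sub.name.startswith(("In_Progress", ".")):
            continue
        counts = {"name": sub.name, "data": 0, "report": 0, "script": 0}
        for entry in sub.iterdir():
            # suffix-based matching works for files and extension-named folders
            ext = entry.suffix.lstrip(".").lower()
            if ext in data_extensions:
                counts["data"] += 1
            elif ext in report_extensions:
                counts["report"] += 1
            elif ext in script_extensions:
                counts["script"] += 1
        rows.append(counts)
    return rows  # the real method returns a pd.DataFrame
```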