Draft: Improve hdf5Reader with random sampling and type handling by varunviswapriyan · Pull Request #50 · idtlab/AIDRIN

varunviswapriyan · 2025-11-23T14:46:46Z

Enhanced the hdf5Reader class to include random sampling of rows while reading HDF5 files. Improved handling of numpy types and added error handling for various data processing steps. Also introduced chunked reading, multidimensional flattening, and structured type handling.
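For reviewers, here is a rough sketch of what the structured type handling and multidimensional flattening could look like. It is not the code in this PR: the helper names `to_python`, `structured_to_dicts`, and `flatten_rows` are hypothetical, and the sketch assumes that rows live on axis 0.

```python
import numpy as np


def to_python(value):
    """Convert NumPy scalars, bytes, and arrays to plain Python objects."""
    if isinstance(value, np.generic):
        value = value.item()  # e.g. np.int32 -> int, np.bytes_ -> bytes
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    if isinstance(value, np.ndarray):
        return value.tolist()
    return value


def structured_to_dicts(arr):
    """Expand a structured array into one dictionary per record."""
    names = arr.dtype.names
    if names is None:
        raise ValueError("expected an array with a structured dtype")
    return [{name: to_python(rec[name]) for name in names} for rec in arr.ravel()]


def flatten_rows(arr):
    """Keep axis 0 as rows and flatten every remaining axis into columns."""
    arr = np.asarray(arr)
    if arr.ndim <= 1:
        return arr.reshape(-1, 1)  # a 1D dataset becomes a single column
    return arr.reshape(arr.shape[0], -1)


# Hypothetical sample data, for illustration only.
records = np.array(
    [(1, b"ab", 2.5), (2, b"cd", 0.0)],
    dtype=[("id", "i4"), ("tag", "S2"), ("x", "f8")],
)
rows = structured_to_dicts(records)
cube = flatten_rows(np.zeros((3, 2, 4)))
```

With this layout a `(3, 2, 4)` dataset becomes three rows of eight columns, which maps directly onto a DataFrame.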

Related Issues / Pull Requests

None.
List all related issues and/or pull requests if there are any.

Description

Include a brief summary of the proposed changes.
Adds random sampling of 2000 rows to the HDF5 reader; together with chunked reading, this better supports massive datasets. The code also expands structured NumPy types into Python dictionaries, properly flattens multidimensional datasets, and handles 1D arrays more efficiently.
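To make the sampling and chunked reading idea concrete, a minimal sketch follows. It is an illustration under assumptions, not the reader's actual code: `sample_row_indices`, `read_sampled_rows`, and the `chunk_rows` default are hypothetical, and `dataset` stands for anything that supports `len()` and slicing (an `h5py` dataset does, and a plain list is used here so the snippet runs without a file). Only the sample size of 2000 comes from the description above.

```python
import random

SAMPLE_SIZE = 2000  # sample size quoted in the PR description


def sample_row_indices(n_rows, sample_size=SAMPLE_SIZE, seed=None):
    """Return sorted, unique row indices; every row if the dataset is small."""
    if n_rows <= sample_size:
        return list(range(n_rows))
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_rows), sample_size))


def read_sampled_rows(dataset, indices, chunk_rows=10_000):
    """Read the sampled rows while holding at most one chunk in memory.

    `indices` must be sorted in increasing order.
    """
    rows = []
    pos = 0  # position of the next unread entry in `indices`
    n_rows = len(dataset)
    for start in range(0, n_rows, chunk_rows):
        stop = min(start + chunk_rows, n_rows)
        local = []
        while pos < len(indices) and indices[pos] < stop:
            local.append(indices[pos] - start)
            pos += 1
        if not local:
            continue  # no sampled row falls in this chunk, so skip the read
        chunk = dataset[start:stop]  # one contiguous read per chunk
        rows.extend(chunk[i] for i in local)
    return rows


# A list stands in for an HDF5 dataset here.
data = list(range(100))
idx = sample_row_indices(len(data), sample_size=10, seed=0)
sampled = read_sampled_rows(data, idx, chunk_rows=7)
```

Sorting the indices keeps each chunk read contiguous and lets chunks without sampled rows be skipped entirely.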

What changes are proposed in this pull request?

New feature (non-breaking change which adds functionality)

Checklist:

My code modifies existing public API, or introduces new public API, and I updated or wrote documentation
I have commented my code
My code requires documentation updates, and I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Improve hdf5Reader with random sampling and type handling

5463d09

Enhanced the hdf5Reader class to include random sampling of rows while reading HDF5 files. Improved handling of numpy types and added error handling for various data processing steps.

jeanbez changed the title ~~Improve hdf5Reader with random sampling and type handling~~ Draft: Improve hdf5Reader with random sampling and type handling Nov 25, 2025

varunviswapriyan and others added 12 commits November 30, 2025 11:15

Update hdf5_reader.py

28f79f5

Refactor completeness function for chunk processing

b31e18d

Refactor completeness function to process DataFrame in chunks, calculate completeness scores, and generate a visualization. Added error handling for file reading and completeness calculation.
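A chunk-wise completeness score can be sketched as below. This is an assumption about the approach rather than the refactored function itself: `completeness_by_column` is a hypothetical name, completeness is taken here to mean the share of non-missing values per column, and the visualization step is left out.

```python
import pandas as pd


def completeness_by_column(chunks):
    """Accumulate non-null counts over DataFrame chunks and return ratios."""
    non_null = None
    total = 0
    for chunk in chunks:
        counts = chunk.notna().sum()  # non-null count per column in this chunk
        if non_null is None:
            non_null = counts
        else:
            non_null = non_null.add(counts, fill_value=0)
        total += len(chunk)
    if non_null is None or total == 0:
        return {}  # nothing was read, so no score can be computed
    return (non_null / total).to_dict()


# Hypothetical sample data split into two chunks.
df = pd.DataFrame({"a": [1, None, 3, 4], "b": [None, None, "x", "y"]})
scores = completeness_by_column([df.iloc[:2], df.iloc[2:]])
```

Because only running counts are kept, memory use depends on the chunk size rather than on the full dataset.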

Refactor outlier detection to use DataFrame chunks

51cb023

Refactor duplicity task to handle DataFrame chunks

366c6a3

Merge branch 'develop' into develop

3039ff6

Update hdf5_reader.py

0efb7c1

Remove MAX_TOTAL_ROWS and MAX_ROWS_PER_DATASET

21abf25

Removed unused constants for maximum rows.

Update outliers.py

8f5f55e

Remove docstring from outliers function

ea8cefa

Removed docstring explaining the outliers function.

Refactor completeness.py for improved readability

b512984

Refactor duplicity task for improved file handling

6bc418f

Add MAX_TOTAL_ROWS and MAX_ROWS_PER_DATASET variables

b8987e6

Aman-Cool mentioned this pull request Feb 25, 2026

fix: replace HDF5 fill values with NaN before DataFrame construction #61

Merged

9 tasks

Draft: Improve hdf5Reader with random sampling and type handling #50
varunviswapriyan wants to merge 13 commits into idtlab:develop from varunviswapriyan:develop

2 participants
