as of commit 5322088
Installation
- It'd be nice to have a "Hi there" section or pre-section text blurb that says what the package is for and what it is inspired by (e.g. loosely modeled on scikit-allel but with GPU acceleration and more features).
- Requirements should include pixi and not mention cupy/numpy/etc. Or, there should be instructions on how to install without pixi. I'm under the impression we're conditioning on a pixi install for the NVIDIA build tools; if so, it'd be good to say why ("we're not just being capricious, pixi makes the NVIDIA compilation toolchain portable and eliminates headaches from managing virtual environments!").
- There should be mention of how to run a script that uses pg_gpu here, e.g. `pixi run python script.py` -- this is shown for moments but should be in the first section.
- I'm not sure why we're showing the moments integration here specifically---I suppose just to illustrate using the sub-environment with pixi? Why do we have a separate moments environment (do the additional dependencies really add that much overhead, or are they just needed for tests/examples)? If the latter, maybe we should title this section "Example-specific environments".
- More generally, make it clear to people who are coming from pip or pip-in-conda land how to use the package, e.g.:
  - What if I want to run a script, but not from the root of the pg_gpu repo?
  - What if I want to run pg_gpu from a notebook?
  - What if I want to use pg_gpu as a dependency in my package?
- "Verify installation" should have the bash invocation to run the script (and put the script name at the top of the code blurb, `# is_pg_gpu_installed.py`). Also, why is this section needed? Could we simply check for CUDA on init and error out (so the import fails with an informative message)? (See the sketch after this list.)
Quick start
- Generally: use longer variable names to give more of a lead here, e.g. "se" ==> "stderr"; "T, B" ==> "f3_numer, f3_denom"; and don't reuse names in the same blurb, e.g. "est" ==> "jack_est" and "boot_est".
- Loading data: Maybe say upfront that this is coming from diploid VCF and is interpreted as phased. Most folks are probably coming to this with diploid data, either phased or unphased.
- Patterson: why are numerator and denominator returned separately? This needs an intro sentence or comment to explain (and feeds into the next section).
- Jackknife/bootstrap: "both estimators" ==> which estimators? estimators of what? Needs an intro sentence or two; unlike the other stats, it's not self-evident what's going on from the script blurb. (See the jackknife sketch after this list.)
- Bootstrap: where are `num_blocks` and `den_blocks` created? This workflow needs more hand-holding/context.
- Lostruct: why not wrap this workflow into a one-and-done like was done for Patterson's D?
- Lostruct: suggest moving the "windowed" part of this into the windowed analysis section.
- Diploid data: as per the earlier comment, maybe better to phrase this as "Phased to unphased", as presumably most folks are loading in a HaplotypeMatrix from phased or unphased diploid data.
- Move the theta estimators up closer to the top; these are fundamental statistics. Keep the "FrequencySpectrum" stuff in its own section.
- Missing data: show attaching a mask and computing the same thing on masked/unmasked data---this is fundamental. Maybe just move the relevant blurb from "Examples" to here. (See the masking sketch after this list.)
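On the Patterson / jackknife / bootstrap items: the reason the numerator and denominator come back separately is (I assume) that D is a ratio of sums, so block resampling has to re-form the ratio per replicate rather than average per-site ratios. A generic NumPy sketch of the standard delete-one-block jackknife for a ratio statistic, not pg_gpu's API; here `num_blocks` and `den_blocks` are per-block sums of the numerator and denominator:

```python
import numpy as np

def ratio_block_jackknife(num_blocks, den_blocks):
    """Unweighted delete-one-block jackknife for a ratio statistic (e.g. Patterson's D).

    num_blocks, den_blocks: per-block sums of the numerator and denominator,
    one entry per genomic block. Generic sketch, not pg_gpu's API.
    """
    num_blocks = np.asarray(num_blocks, dtype=float)
    den_blocks = np.asarray(den_blocks, dtype=float)
    n_blocks = num_blocks.size

    # The full estimate is a ratio of total sums, not a mean of per-block ratios.
    full_est = num_blocks.sum() / den_blocks.sum()

    # Leave-one-out estimates: drop one block and re-form the ratio.
    loo_est = (num_blocks.sum() - num_blocks) / (den_blocks.sum() - den_blocks)

    # Standard jackknife bias correction and standard error.
    jack_est = n_blocks * full_est - (n_blocks - 1) * loo_est.mean()
    jack_stderr = np.sqrt(
        (n_blocks - 1) / n_blocks * ((loo_est - loo_est.mean()) ** 2).sum()
    )
    return jack_est, jack_stderr
```

The block bootstrap is the same idea: resample block indices with replacement, sum the resampled num/den blocks, and re-form the ratio per replicate. A sentence to that effect in the docs would also answer the `num_blocks`/`den_blocks` question.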
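And on the missing-data item, even a toy comparison like this (plain NumPy, not pg_gpu's API; the convention that True means missing is my assumption) would make the point that the masked and unmasked answers differ:

```python
import numpy as np

# Toy data: (num_sites, num_haplotypes) 0/1 matrix plus a missingness mask.
rng = np.random.default_rng(0)
haplotypes = rng.integers(0, 2, size=(1000, 20))
missing = rng.random(haplotypes.shape) < 0.05  # True = missing call

# Unmasked: every entry counts toward the allele frequency.
freq_unmasked = haplotypes.mean(axis=1)

# Masked: drop missing entries from both the allele count and the sample size.
called = ~missing
freq_masked = (haplotypes * called).sum(axis=1) / called.sum(axis=1)

# The same per-site statistic (2p(1-p)) computed both ways gives different answers.
het_unmasked = (2 * freq_unmasked * (1 - freq_unmasked)).mean()
het_masked = (2 * freq_masked * (1 - freq_masked)).mean()
print(het_unmasked, het_masked)
```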
Missing data handling
- Supported statistics: what is "SSL" (in the selection scan)?
- Why is naive r2 the default if it is biased? Why not always choose the unbiased option? Does sigma_D2 have any limitations? This is the preference under "best practices" ... (See the r2 note after this list.)
- HaplotypeMatrix utilities: does this also work for GenotypeMatrix?
- Move the "accessible site masks" section above the "span normalization" section.
- SFS projection: give a sentence or two saying what hypergeometric projection is; I think we need more than a reference to a paper in which projection is a footnote. (See the projection sketch after this list.)
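For the r2 item, my understanding of the trade-off (worth stating in the docs, and worth someone checking me on): the naive estimator plugs sample frequencies straight into the definition, and for unlinked loci with n sampled haplotypes its expectation is roughly 1/n rather than 0, so it is biased upward, worst for small samples and distant pairs. Roughly:

```latex
\hat{r}^2 = \frac{\hat{D}^2}{\hat{p}(1-\hat{p})\,\hat{q}(1-\hat{q})},
\qquad
\mathbb{E}\!\left[\hat{r}^2 \mid r^2 = 0\right] \approx \frac{1}{n}
\quad\text{(since } n\hat{r}^2 \text{ is approximately } \chi^2_1\text{)}.
```

The sigma_D2-style statistics of Ragsdale and Gravel 2019, as I read them, instead take ratios of separately (and unbiasedly) estimable quantities such as E[D^2] and E[p(1-p)q(1-q)], which avoids that bias at the cost of no longer estimating E[r^2] itself. If that is the actual reasoning, one sentence under "best practices" would answer this.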
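For the SFS projection item, the sentence the docs need is roughly: projection downsamples an SFS observed in n haplotypes to a smaller size m by averaging over all ways of drawing m of the n alleles at each site, with hypergeometric weights. A generic sketch of that calculation (not pg_gpu's function):

```python
from math import comb

import numpy as np

def project_sfs(sfs, n_to):
    """Hypergeometric projection of an unfolded SFS to a smaller sample size.

    sfs[i] = number of sites where the derived allele was seen i times out of
    n_from = len(sfs) - 1 haplotypes. Generic sketch, not pg_gpu's API.
    """
    sfs = np.asarray(sfs, dtype=float)
    n_from = sfs.size - 1
    projected = np.zeros(n_to + 1)
    for i, count in enumerate(sfs):
        for j in range(n_to + 1):
            # Probability that a draw of n_to alleles out of n_from (i of them
            # derived) contains exactly j derived copies.
            weight = comb(i, j) * comb(n_from - i, n_to - j) / comb(n_from, n_to)
            projected[j] += count * weight
    return projected
```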
Examples
- Unlike the quick start, I think we want some more context per example here. It's good to be verbose in examples. This is where people go, rather than the API reference, to understand how things fit together.
- "Complete workflow" --- why is this the complete workflow? There are a lot of other features ... so what is this intended to show? Maybe "minimal workflow"?
- Two-population LD: what statistics are being computed here?
- Batch statistics: batch of what? Say something like "Fused operations may be used to compute multiple summary statistics in parallel, more efficiently than if they were computed independently:". Maybe show single-locus batched statistics too?
- Integration with moments: give some more context, like "These are statistics used in moments.LD model fitting described in Ragsdale and Gravel 2019. However, as they are pairwise they are costly to compute, which is exactly where GPU acceleration shines."
- LD pruning: needs more context---what is this example doing? Also point to the content in our examples/ folder. (See the sketch after this list.)
- Utility scripts: move this to its own "major section" --- this is something like "Workflows" rather than "Utility scripts". It would be good to illustrate usage from the CLI (if this is what is intended).
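To illustrate the kind of context the LD-pruning example needs (what pruning is doing, not how pg_gpu implements it): within a window, compute pairwise r2 and greedily drop one member of every pair above a threshold, so downstream analyses are not dominated by tightly linked sites. A deliberately naive, generic sketch:

```python
import numpy as np

def greedy_ld_prune(haplotypes, r2_threshold=0.2):
    """Greedy LD pruning of a (num_sites, num_haplotypes) 0/1 matrix.

    A site is kept only if its r^2 with every previously kept site is below
    the threshold. Purely illustrative; pg_gpu's version presumably works
    window-by-window on the GPU rather than all-pairs like this.
    """
    haplotypes = np.asarray(haplotypes, dtype=float)
    kept = []
    for site in range(haplotypes.shape[0]):
        candidate = haplotypes[site]
        if candidate.std() == 0:
            continue  # monomorphic sites carry no LD information
        in_ld = False
        for kept_site in kept:
            corr = np.corrcoef(candidate, haplotypes[kept_site])[0, 1]
            if corr * corr >= r2_threshold:
                in_ld = True
                break
        if not in_ld:
            kept.append(site)
    return np.array(kept, dtype=int)
```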
Moments integration
Does this need to be in its own section if it's already in "Examples"? Seems like it should be one or the other. Maybe it'd be clearer to split the "Examples" section into "Small examples" (code blurbs all on one page) and the longer we-wrote-a-script-for-this-thing cases into "Tutorials" (each with their own page)? The former with just a header to provide context, the latter providing some narrative explanation (and a mention of the packaged examples/ script at the top)?