Commit de13bd7

committed
Deploying to gh-pages from @ dd0dd18 🚀
0 parents  commit de13bd7

53 files changed

Lines changed: 20652 additions & 0 deletions

File tree

.DS_Store

8 KB
Binary file not shown.

.gitignore

Lines changed: 3 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,3 @@
1+
Manifest.toml
2+
/.vscode/
3+
/data/

LICENSE

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
1+
This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivs 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-nd/4.0/.

Project.toml

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
1+
authors = ["Karandeep Singh", "Christoph Scheuch"]
2+
3+
[deps]
4+
CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"
5+
FileIO = "5789e2e9-d7fb-5bc7-8068-2c6fae9b9549"
6+
Pluto = "c3e4b0f8-55cb-11ea-2926-15256bba5781"
7+
PlutoUI = "7f904dfe-b85e-4ff6-b463-dae2292396a8"
8+
Tidier = "f0413319-3358-4bb0-8e7c-0c83523a93bd"
9+
ZipFile = "a5390f91-8eb1-5f08-bee0-b1d1ffed6cea"
10+
11+
[compat]
12+
PlutoUI = "0.7.55"
13+
Tidier = "1.6.1"
14+
ZipFile = "0.10.1"
15+
julia = "1.11"

README.md

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
1+
# Tidier Course
2+
3+
<img src="https://raw.githubusercontent.com/TidierOrg/.github/main/profile/TidierOrg_logo.png" align="left" style="padding-right:10px;" width="150"/>
4+
5+
Welcome to the **Tidier Course**, an interactive course designed to introduce you to Julia and the Tidier.jl ecosystem for data analysis. The course consists of a series of Pluto Notebooks so that you can both learn and practice how to write Julia code through real data science examples.
6+
7+
This course assumes a basic level of familiarity with programming but does not assume any prior knowledge of Julia. This course emphasizes the parts of Julia required to read in, explore, and analyze data. Because this course is primarily oriented around data science, many important aspects of Julia will *not* be covered in this course.
8+
9+
This course is currently under construction. Check back for updated content.
10+
11+
[Click here to view the course](https://tidierorg.github.io/TidierCourse/)

data-pipelines.html

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
1+
<!DOCTYPE html><html lang="en"><head><meta name="viewport" content="width=device-width"><meta charset="utf-8"><meta property='og:type' content='article'>
2+
3+
<meta name="pluto-insertion-spot-meta">
4+
<meta name="theme-color" media="(prefers-color-scheme: light)" content="white"><meta name="theme-color" media="(prefers-color-scheme: dark)" content="#2a2928"><meta name="color-scheme" content="light dark"><link rel="icon" type="image/png" sizes="16x16" href="https://cdn.jsdelivr.net/gh/fonsp/Pluto.jl@0.19.47/frontend-dist/favicon-16x16.347d2855.png" integrity="sha384-3qsGeVLdddzV9oIkj3PhXXQX2CZCjOD/CiyrPQOX6InOWw3HAHClrsQhPfX9uRAj" crossorigin="anonymous"><link rel="icon" type="image/png" sizes="32x32" href="https://cdn.jsdelivr.net/gh/fonsp/Pluto.jl@0.19.47/frontend-dist/favicon-32x32.8789add4.png" integrity="sha384-cOe5vSoBIgKNgkUL27p9RpsGVY0uBg9PejLccDy+fR8ZD1Iv5dF1MGHjIZAIZwm6" crossorigin="anonymous"><link rel="icon" type="image/png" sizes="96x96" href="https://cdn.jsdelivr.net/gh/fonsp/Pluto.jl@0.19.47/frontend-dist/favicon-96x96.48689391.png" integrity="sha384-TN49cYb8GyNmrZT14bsYXXo4l1x1NJeJ/EHuVAauAKsNPopPHLojijs9jFT4Vs8c" crossorigin="anonymous"><link rel="pluto-logo-big" href="https://cdn.jsdelivr.net/gh/fonsp/Pluto.jl@0.19.47/frontend-dist/logo.004c1d7c.svg" integrity="sha384-GkQkODcGxsrSRJCkeakBXihum0GUM44cwBgKyutDimectXCbCgj6Vu3jlrueqEcN" crossorigin="anonymous"><link rel="pluto-logo-small" href="https://cdn.jsdelivr.net/gh/fonsp/Pluto.jl@0.19.47/frontend-dist/favicon_unsaturated.d1387b25.svg" integrity="sha384-omwjH+Qy3hpAVf5FYd/pkaDBuVAfsEDRN7eBxEA8Ek00OAWP+aiV+GpEYk3I7lyo" crossorigin="anonymous"><script type="module" src="https://cdn.jsdelivr.net/gh/fonsp/Pluto.jl@0.19.47/frontend-dist/editor.7330d793.js" integrity="sha384-+mLMSKQxWEYKJeUt5VTdKTDfzHvui0mdMSd+iIQKYybm+6crs+6FeCr73c8yxir6" crossorigin="anonymous"></script><link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/fonsp/Pluto.jl@0.19.47/frontend-dist/editor.c9b6b472.css" integrity="sha384-/r++eFqY+MX24zOPLVQ1SEXsNKaMgaiC42LUbooLnc1+zar5i0Ih+sKH5dM93WL4" crossorigin="anonymous"><script defer="">console.log("Pluto.jl, by Fons van der Plas (https://github.com/fonsp), Mikołaj Bochenski 
(https://github.com/malyvsen), Michiel Dral (https://github.com/dralletje) and friends 🌈");</script><script src="https://cdn.jsdelivr.net/gh/fonsp/Pluto.jl@0.19.47/frontend-dist/editor.b8733d72.js" defer="" integrity="sha384-84yPd6AGZ/1IUiaBlssipmMKMFz9WGFQ+u8vYZ9cWicH6bZm7ZOej+kLDXnIIAQJ" crossorigin="anonymous"></script><script src="https://cdn.jsdelivr.net/gh/fonsp/Pluto.jl@0.19.47/frontend-dist/editor.9f9dc874.js" defer="" integrity="sha384-tkFo1EK72I9JvoTmHFa199dfRzW8mkXPUkHb/N7UhYI+bxKzX3Kh8LNCZz1ltsFF" crossorigin="anonymous"></script><script src="https://cdn.jsdelivr.net/gh/fonsp/Pluto.jl@0.19.47/frontend-dist/editor.90ede145.js" defer="" integrity="sha384-CuNU9gQg6fa/yynNqNWjHWzPm4nj+d7O6+HXsNGSqClhs/bYQIbBC3Lw/kh8Ukui" crossorigin="anonymous"></script><script src="https://cdn.jsdelivr.net/gh/fonsp/Pluto.jl@0.19.47/frontend-dist/editor.dbeed08a.js" defer="" integrity="sha384-1BEdQwXfZi4ZpsNV8w1X8pQcVK1/DS/+/M8OTo3gol7mdEspSN7nT6llX57NQCSt" crossorigin="anonymous"></script><script id="iframe-resizer-content-window-script" src="https://cdn.jsdelivr.net/gh/fonsp/Pluto.jl@0.19.47/frontend-dist/editor.6386bd9d.js" crossorigin="anonymous" defer="" integrity="sha384-tgN2a0VDi/lCYwZuDqT7L+A/Y/9kpxf3HV7zv2BJ5Fu7zW0EClq0nM4crfK3TRPs"></script><link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/fonsp/Pluto.jl@0.19.47/frontend-dist/editor.1a986d5f.css" type="text/css" integrity="sha384-biEV7R+dtBt8r/kVXCVPv0QFmmMMFBF9n6MxBxScN5PULdIdz+5W/YaKFE7GFyJn" crossorigin="anonymous"><link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/fonsp/Pluto.jl@0.19.47/frontend-dist/editor.317d32de.css" type="text/css" media="all" data-pluto-file="hide-ui" integrity="sha384-rM7rRGvRYP65Tiqkdta+WSApQBfZCqeSEF7JwMX/lSAQUubDKjBejLjGlQBVyphe" crossorigin="anonymous"><link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/fonsp/Pluto.jl@0.19.47/frontend-dist/editor.d0a5b1f0.css" type="text/css" integrity="sha384-oUdA9RJhs9IlGgJOs6m3tNmyOqOLTPOfpCXeXLUex2W5KOLfSAdyT5HoVuwUEFDQ" 
crossorigin="anonymous"><link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/fonsp/Pluto.jl@0.19.47/frontend-dist/editor.e2e3dd3d.css" type="text/css" integrity="sha384-rFNNfBgG448S4mC8A/rtDd6eRIjB04OhJ640kkIF/t55EWPrv2ZT42x9lamXEFpR" crossorigin="anonymous"><link rel="stylesheet" href="https://cdn.jsdelivr.net/gh/fonsp/Pluto.jl@0.19.47/frontend-dist/editor.09b09a3f.css" type="text/css" integrity="sha384-dHB2VzrvTc7+CLgp62sndIQSbzeitJhO8vZnxV2zNlO4GHz83BZPqsY+0nTAF7WO" crossorigin="anonymous"><script data-pluto-file="launch-parameters">
5+
window.pluto_notebook_id = undefined;
6+
window.pluto_isolated_cell_ids = undefined;
7+
window.pluto_notebookfile = "data-pipelines.jl";
8+
window.pluto_disable_ui = true;
9+
window.pluto_slider_server_url = undefined;
10+
window.pluto_binder_url = "https://mybinder.org/v2/gh/fonsp/pluto-on-binder/v0.19.47";
11+
window.pluto_statefile = "data-pipelines.plutostate";
12+
window.pluto_preamble_html = undefined;
13+
</script>
14+
15+
<meta name="pluto-insertion-spot-parameters">
16+
<script src="https://cdn.jsdelivr.net/gh/fonsp/Pluto.jl@0.19.47/frontend-dist/editor.e5e13b39.js" type="module" defer="" integrity="sha384-aSQciUMYA0alIWQ4WkNgRf/hEKn/leIeBB/mSeGjvKDSc2DFz+jKgaIDKLhAMPtc" crossorigin="anonymous"></script><script src="https://cdn.jsdelivr.net/gh/fonsp/Pluto.jl@0.19.47/frontend-dist/editor.8a3292da.js" integrity="sha384-itp4oE2PRbSrrTHVpWh8sqAuVUsz7ja6L2Dgp/JRfMCD2AwVdTk56K96POF3oLmu" crossorigin="anonymous"></script><script type="text/javascript" id="MathJax-script" integrity="sha384-4kE/rQ11E8xT9QgrCBTyvenkuPfQo8rXYQvJZuMgxyPOoUfpatjQPlgdv6V5yhUK" crossorigin="" not-the-src-yet="https://cdn.jsdelivr.net/npm/mathjax@3.2.2/es5/tex-svg-full.js" async=""></script>
17+
<link rel="preload" as="fetch" href="data-pipelines.plutostate" crossorigin>
18+
19+
<meta name="pluto-insertion-spot-preload">
20+
</head><body class="loading no-MαθJax"> <div style="display:flex;min-height:100vh;"> <pluto-editor class="fullscreen"> <progress style="filter:grayscale(1)" class="delete-me-when-live statefile-fetch-progress" max="100"></progress> </pluto-editor> </div> </body></html>

data-pipelines.jl

Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
1+
### A Pluto.jl notebook ###
2+
# v0.19.36
3+
4+
using Markdown
5+
using InteractiveUtils
6+
7+
# ╔═╡ d6823989-bb85-400d-87ec-2a365260f5fb
8+
# ╠═╡ show_logs = false
9+
using Pkg; Pkg.activate(".."); Pkg.instantiate()
10+
11+
# ╔═╡ 51e24e5e-cfc7-4b02-978c-505e21e6df43
12+
using PlutoUI: TableOfContents
13+
14+
# ╔═╡ 2eec5998-bb36-11ee-2283-67ea47c4f5ed
15+
md"""
16+
# Tidier Course: Data Pipelines
17+
"""
18+
19+
# ╔═╡ a4baabcd-d425-449e-b7bb-f8b776582330
20+
html"""<img src="https://raw.githubusercontent.com/TidierOrg/.github/main/profile/TidierOrg_logo.png" align="left" style="padding-right:10px;" width="150"/>"""
21+
22+
# ╔═╡ c38f82c2-def3-4d1c-bda0-54e779e2583a
23+
md"""
24+
## The Structured Query Language (SQL)
25+
26+
Let's rewind to our benchmarks for data aggregation tasks: [https://duckdblabs.github.io/db-benchmark/](https://duckdblabs.github.io/db-benchmark/).
27+
"""
28+
29+
# ╔═╡ 99aa11d7-09f8-4ebb-9166-e248fc5af44f
30+
html"""<img src="https://raw.githubusercontent.com/TidierOrg/TidierCourse/main/why-julia/duckdb_benchmark.jpeg" style="width:50%"/>"""
31+
32+
# ╔═╡ 78d16051-d5d9-4f9c-9316-ce4ddee39dce
33+
md"""
34+
DuckDB and ClickHouse were two of the fastest tools, and while both are implemented in C++, their primary interface to users is SQL. SQL is the *lingua franca* of databases, and as a data scientist it is important to understand its syntax, which is both the source of its popularity and its primary limitation.
35+
36+
Let's say we have a dataset called `patients`, which has columns `diagnosis`, `takes_medications`, and `age`. Each row represents a unique patient, `diagnosis` is the primary diagnosis, `takes_medications` is a string indicating whether a patient takes any medications ("yes") or not ("no"), and `age` is their current age.
37+
38+
To compare the mean age among patients with diabetes who take medications versus those who do not take medications, we would write the following in SQL:
39+
40+
```sql
41+
SELECT takes_medications, AVG(age) AS mean_age
42+
FROM patients
43+
WHERE diagnosis = 'diabetes'
44+
GROUP BY takes_medications;
45+
```
46+
47+
The SQL syntax is fairly intuitive in that each verb (e.g., `SELECT`) has a clear purpose, and the full query itself reads a bit like a sentence that you could read aloud. However, hidden within this apparent simplicity is the fact that SQL queries don't actually run in the order in which they are written.
48+
49+
The *actual* order in which this query runs is:
50+
51+
1. `FROM patients`
52+
2. `WHERE diagnosis = 'diabetes'`
53+
3. `GROUP BY takes_medications`
54+
4. `SELECT takes_medications, AVG(age) AS mean_age`
55+
56+
If you think about this, this makes sense. You first need to start with the dataset (`FROM patients`), then you need to limit the dataset to only those rows where the primary diagnosis is diabetes (`WHERE diagnosis = 'diabetes'`). Then, after grouping by whether or not a patient takes medications, we need to calculate the mean age for each group.
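
To make this concrete, here is a minimal Python sketch (Python is used purely for illustration). It assumes a handful of made-up rows for `patients`, and the helper names `run_sql` and `run_in_logical_order` are hypothetical. It first lets SQLite (via the standard-library `sqlite3` module) evaluate the query as written, then recomputes the same answer by hand in the `FROM`, `WHERE`, `GROUP BY`, `SELECT` order. This mirrors the *logical* order only; a real database engine is free to plan the physical execution differently.

```python
import sqlite3
from collections import defaultdict
from statistics import mean

# Made-up rows for illustration: (diagnosis, takes_medications, age)
ROWS = [
    ("diabetes", "yes", 60),
    ("diabetes", "yes", 70),
    ("diabetes", "no", 50),
    ("asthma", "no", 30),
]

QUERY = (
    "SELECT takes_medications, AVG(age) AS mean_age "
    "FROM patients "
    "WHERE diagnosis = 'diabetes' "
    "GROUP BY takes_medications"
)


def run_sql(rows):
    # Let SQLite evaluate the query exactly as it is written.
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE patients (diagnosis TEXT, takes_medications TEXT, age REAL)"
    )
    con.executemany("INSERT INTO patients VALUES (?, ?, ?)", rows)
    result = dict(con.execute(QUERY).fetchall())
    con.close()
    return result


def run_in_logical_order(rows):
    # 1. FROM patients
    table = list(rows)
    # 2. WHERE diagnosis = 'diabetes'
    kept = [row for row in table if row[0] == "diabetes"]
    # 3. GROUP BY takes_medications
    groups = defaultdict(list)
    for _diagnosis, takes_medications, age in kept:
        groups[takes_medications].append(age)
    # 4. SELECT takes_medications, AVG(age) AS mean_age
    return {key: mean(ages) for key, ages in groups.items()}


sql_result = run_sql(ROWS)
manual_result = run_in_logical_order(ROWS)
print(sql_result == manual_result)
```

Both routes should agree on the mean age per group, even though the query text starts with `SELECT` and the hand-written version ends with it.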
57+
58+
The key lesson with SQL is:
59+
60+
> The order in which you write the verbs in SQL is different from the order in which the verbs are processed by SQL.
61+
62+
Much has been written about this issue (see: [https://jvns.ca/blog/2019/10/03/sql-queries-don-t-start-with-select/](https://jvns.ca/blog/2019/10/03/sql-queries-don-t-start-with-select/) and [https://www.flerlagetwins.com/2018/10/sql-part4.html](https://www.flerlagetwins.com/2018/10/sql-part4.html)).
63+
64+
In case you're curious, this is a more complete comparison of how SQL queries are written vs. how they are processed by SQL.
65+
66+
| What You Write in SQL | Order In Which It Runs |
67+
| ----------------------|------------------------|
68+
| SELECT | FROM |
69+
| DISTINCT | JOIN |
70+
| TOP | WHERE |
71+
| [AGGREGATION] | GROUP BY |
72+
| FROM | [AGGREGATION] |
73+
| JOIN | HAVING |
74+
| WHERE | SELECT |
75+
| GROUP BY | DISTINCT |
76+
| HAVING | ORDER BY |
77+
| ORDER BY | TOP / LIMIT |
78+
"""
79+
80+
# ╔═╡ 2686a0fb-15e1-44d8-9565-1abdee13ec5b
81+
md"""
82+
## Why not run SQL queries in the same order they are written?
83+
84+
While the fact that SQL queries form sentences that can be read aloud is convenient, this convenience comes at a cost. When queries get more complicated, they can no longer be read aloud, and the order of operations becomes much harder to keep track of. For more complex queries, it actually becomes cognitively less demanding to keep track of queries that are run in the same order that they are written.
85+
86+
This is the idea behind `PRQL` ([https://github.com/PRQL/prql](https://github.com/PRQL/prql)), which calls itself a "simple, powerful, pipelined, SQL replacement."
87+
88+
This same query in PRQL would be written as:
89+
90+
```
91+
from patients
92+
filter diagnosis == "diabetes"
93+
group {takes_medications} (
94+
  aggregate {mean_age = average age})
95+
```
96+
97+
The fact that the analytic steps are written in the same order as they are performed seems trivial, but this is the big idea behind data pipelines. A data pipeline starts with a dataset, and each function transforms the data in a specific way until the end result answers an analytical question.
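
The same idea can be sketched in a general-purpose language. The Python example below is only an illustration under stated assumptions: the `PATIENTS` records are made up, and `pipe`, `filter_rows`, `group_by`, and `aggregate` are hypothetical helpers written for this example, not functions from any library. Each step takes the output of the previous step, so the code reads in the same order in which it runs.

```python
from functools import reduce
from statistics import mean

# Made-up records for illustration only.
PATIENTS = [
    {"diagnosis": "diabetes", "takes_medications": "yes", "age": 60},
    {"diagnosis": "diabetes", "takes_medications": "yes", "age": 70},
    {"diagnosis": "diabetes", "takes_medications": "no", "age": 50},
    {"diagnosis": "asthma", "takes_medications": "no", "age": 30},
]


def pipe(data, *steps):
    # Apply each step to the output of the previous one, left to right.
    return reduce(lambda value, step: step(value), steps, data)


def filter_rows(predicate):
    # Keep only the rows for which predicate(row) is true.
    return lambda rows: [row for row in rows if predicate(row)]


def group_by(key):
    # Split the rows into a dict of lists, keyed by row[key].
    def step(rows):
        groups = {}
        for row in rows:
            groups.setdefault(row[key], []).append(row)
        return groups

    return step


def aggregate(name, column, func):
    # Collapse each group to one row holding func applied to the column.
    def step(groups):
        return [
            {"group": group, name: func([row[column] for row in rows])}
            for group, rows in groups.items()
        ]

    return step


result = pipe(
    PATIENTS,
    filter_rows(lambda row: row["diagnosis"] == "diabetes"),
    group_by("takes_medications"),
    aggregate("mean_age", "age", mean),
)
print(result)
```

Reading the call to `pipe` from top to bottom gives the same four steps as the PRQL query: start from the dataset, filter, group, then aggregate.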
98+
"""
99+
100+
# ╔═╡ 4fde78bb-3dc5-4849-ad24-29804a49740c
101+
md"""
102+
## Modern data pipelines
103+
104+
Data pipelines were popularized by the `dplyr` and `ggplot2` R packages, which are two of the core packages that make up the `tidyverse` ecosystem in R. In fact, the `dplyr` R package was a key inspiration behind `PRQL` (see [https://prql-lang.org/faq/](https://prql-lang.org/faq/)). While `PRQL` brings the idea of data pipelines to a `SQL` syntax, modern data pipelines are much more expansive in their capabilities.
105+
106+
While all data pipelines *start* with a dataset, they don't need to *end* with a dataset. Modern data pipelines often end with plots (as in `ggplot2` in R), statistical analyses, machine learning models, and more. These more advanced types of data pipelines are where SQL-like languages (like PRQL) show their limitations. While great for transforming data, SQL-like languages do not have facilities for plotting and machine learning.
107+
108+
Data pipelines implemented in a programming language like Python, R, or Julia are thus much more capable than in PRQL.
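
As a rough illustration of a pipeline that does not end with a dataset, the Python sketch below summarizes some made-up `(takes_medications, age)` pairs and then hands the result to a final step that draws a tiny text bar chart. The names `summarize` and `text_bar_chart` are hypothetical, and the text chart is only a stand-in for a real plotting library.

```python
from statistics import mean

# Made-up (takes_medications, age) pairs for illustration only.
AGES = [("yes", 60), ("yes", 70), ("no", 50)]


def summarize(pairs):
    # Transform step: dataset in, smaller dataset (group -> mean age) out.
    groups = {}
    for group, age in pairs:
        groups.setdefault(group, []).append(age)
    return {group: mean(ages) for group, ages in groups.items()}


def text_bar_chart(means, scale=10):
    # Final step: dataset in, "plot" out (one '#' per full `scale` years).
    return [
        f"{group:>3} | {'#' * int(value // scale)}"
        for group, value in means.items()
    ]


chart = text_bar_chart(summarize(AGES))
for line in chart:
    print(line)
```

The last step returns lines of a chart rather than rows of data, which is the kind of ending a purely SQL-like language has no way to express.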
109+
"""
110+
111+
# ╔═╡ 6a08598c-69bf-498c-9ac2-4e0a4b749598
112+
md"""
113+
## Summary
114+
115+
- The Structured Query Language (SQL) is a popular way of working with datasets
116+
- SQL's simple-to-read syntax introduces complexity because the order in which SQL queries are written is different from the order in which SQL queries are run
117+
- PRQL is a SQL-like language that implements data pipelines
118+
- Data pipelines refer to data analysis pathways that start with a dataset and then sequentially transform the dataset
119+
- While data pipelines start with a dataset, modern data pipelines often end with plots, statistical analyses, or machine learning models
120+
"""
121+
122+
# ╔═╡ 831bad3f-0e43-4226-a75c-7a7c4c569e53
123+
md"""
124+
# Appendix
125+
"""
126+
127+
# ╔═╡ 0ddc3de7-c4a8-44c7-8cd3-4d63de3334c7
128+
TableOfContents()
129+
130+
# ╔═╡ Cell order:
131+
# ╟─2eec5998-bb36-11ee-2283-67ea47c4f5ed
132+
# ╟─a4baabcd-d425-449e-b7bb-f8b776582330
133+
# ╟─c38f82c2-def3-4d1c-bda0-54e779e2583a
134+
# ╟─99aa11d7-09f8-4ebb-9166-e248fc5af44f
135+
# ╟─78d16051-d5d9-4f9c-9316-ce4ddee39dce
136+
# ╟─2686a0fb-15e1-44d8-9565-1abdee13ec5b
137+
# ╟─4fde78bb-3dc5-4849-ad24-29804a49740c
138+
# ╟─6a08598c-69bf-498c-9ac2-4e0a4b749598
139+
# ╟─831bad3f-0e43-4226-a75c-7a7c4c569e53
140+
# ╠═d6823989-bb85-400d-87ec-2a365260f5fb
141+
# ╠═51e24e5e-cfc7-4b02-978c-505e21e6df43
142+
# ╠═0ddc3de7-c4a8-44c7-8cd3-4d63de3334c7

data-pipelines.plutostate

25.2 KB
Binary file not shown.
