diff --git a/common-content/en/module/complexity/big-o/index.md b/common-content/en/module/complexity/big-o/index.md
index 3c78f7f76..2d6a53fdd 100644
--- a/common-content/en/module/complexity/big-o/index.md
+++ b/common-content/en/module/complexity/big-o/index.md
@@ -1,11 +1,11 @@
+++
title = "Big-O"
+time = 30
+emoji = "📈"
[build]
render = 'never'
list = 'local'
publishResources = false
-time = 30
-emoji= "📈"
[objectives]
1="Categorise algorithms as O(lg(n)), O(n), O(n^2), O(2^n)"
+++
@@ -29,13 +29,13 @@ line [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
Complete the coursework [Data Structures and Algorithms: Space and Time Complexity](https://www.wscubetech.com/resources/dsa/time-complexity).
-This is in your backlog and you do not need to do it now, but you might like to open it in a tab.
+This is in your backlog. You do not need to do it right now, though it might help. If not, you might like to open it in a tab.
{{}}
-☺️ **Constant:** The algorithm takes the same amount of time, regardless of the input size. +😀 **Constant:** The algorithm takes the same amount of time, regardless of the input size. @@ -48,6 +48,8 @@ y-axis "Computation Time" 0 --> 10 line [1, 1, 1, 1, 1, 1, 1, 1, 1, 1] ``` +An example is getting the first character of a string. No matter how long the string is, we know where the first character is, and we can get it. +
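+In code, a sketch of a constant-time operation might look like this (the function name is ours, just for illustration):
+
+```js
+// O(1): however long the string is, we do a single lookup in a known place.
+function firstCharacter(text) {
+  return text[0];
+}
+```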
@@ -62,8 +64,9 @@ line [2.3, 3.0, 3.4, 3.7, 3.9, 4.1, 4.2, 4.4, 4.5, 4.6]
-😐 **Logarithmic:** The runtime grows proportionally to the [logarithm](https://www.bbc.co.uk/bitesize/guides/zn3ty9q/revision/1) of the input size.
+☺️ **Logarithmic:** The runtime grows proportionally to the [logarithm](https://www.bbc.co.uk/bitesize/guides/zn3ty9q/revision/1) of the input size.
+An example is finding a string in a sorted list. Each time, we look at the element in the middle of the list: by comparing it with the string we want, we can rule out either the half before or the half after that element, halving the number of entries we need to consider next time.
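+A sketch of this idea in JavaScript (assuming the list is already sorted) might look like:
+
+```js
+// O(lg(n)): each comparison halves the number of entries left to consider.
+function binarySearch(sortedList, target) {
+  let low = 0;
+  let high = sortedList.length - 1;
+  while (low <= high) {
+    const middle = Math.floor((low + high) / 2);
+    if (sortedList[middle] === target) {
+      return middle; // Found it.
+    } else if (sortedList[middle] < target) {
+      low = middle + 1; // Only search the half after the middle.
+    } else {
+      high = middle - 1; // Only search the half before the middle.
+    }
+  }
+  return -1; // Not in the list.
+}
+```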
@@ -78,7 +81,11 @@ line [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
-😨 **Linear:** The runtime grows proportionally to the input size.
+🙂 **Linear:** The runtime grows proportionally to the input size.
+
+An example is finding an element by value in an unsorted list. To be sure we find the element, we may need to look through every element in the list and check if it's the one we're looking for.
+
+If we double the length of the list, we need to check twice as many elements.
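+As a sketch in JavaScript (again, the function name is just for illustration):
+
+```js
+// O(n): in the worst case we check every element once.
+function indexOfValue(list, target) {
+  for (let i = 0; i < list.length; i++) {
+    if (list[i] === target) {
+      return i; // We might get lucky and find it early...
+    }
+  }
+  return -1; // ...but to be sure it's absent, we had to check every element.
+}
+```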
@@ -108,9 +115,11 @@ line [100, 400, 900, 1600, 2500, 3600, 4900, 6400, 8100, 10000]
What does this mean? It means that the time is the square of the input size: n\*n.
+An example is finding which elements in an array are present more than once. For each element, we need to check every other element in the same array to see if they're equal. If we double the number of elements in the array, we _quadruple_ the number of checks we need to do.
+
-😰 **Quadratic:** The runtime grows proportionally to the square of the input size.
+😨 **Quadratic:** The runtime grows proportionally to the square of the input size.
@@ -144,6 +153,8 @@ Oh where have we seen this sequence of numbers before? ;)
😱 **Exponential:** The runtime grows exponentially with the input size.
+An example is making a list of every _combination_ of the elements in a list (so if we have `[1, 2, 3]`, the combinations are: `[]`, `[1]`, `[2]`, `[3]`, `[1, 2]`, `[1, 3]`, `[2, 3]`, `[1, 2, 3]`). There are short code sketches of the quadratic and exponential cases at the end of this page.
+
You will explore this theory in your backlog, but you will find that you already have a basic understanding of this idea. No really! Let's look at these algorithms in real life:
@@ -163,8 +174,12 @@ You will explore this theory in your backlog, but you will find that you already
[LABEL=Quadratic Time]
- Everyone at a party shaking hands with everyone else. If you double the number of people (n), the number of handshakes increases much faster (roughly n \* n). This is like nesting a loop inside a loop.
[LABEL=Exponential Time]
-- Trying every possible combination to unlock a password. Each extra character dramatically increases the possibilities. This is like naive recursion; we'll talk about this more later.
+- Trying every possible combination to unlock a password. Each extra character dramatically increases the possibilities.
{{< /label-items >}}
> [!TIP]
> Big-O notation also describes space complexity (how memory use grows). Sometimes an algorithm's time complexity is different from its space complexity. We have focused on time here, but you'll meet space complexity analysis in the assigned reading.
+
+Big-O notation is focused on the _trend_ of growth, not the exact growth.
+
+Think about strings: one character may take up one byte or four. If we double the length of the string, we don't stop to check which characters are in the string. We just think about the **trend**: the string will take _about_ twice as much space. If the string only has four-byte characters, and we add one-byte characters, the string is _still_ growing linearly, even though it may not take _exactly_ double the space.
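+To make the quadratic and exponential examples above concrete, here are two small sketches (the function names are ours, just for illustration):
+
+```js
+// O(n^2): we compare elements pair by pair, and the number of pairs grows
+// with the square of the array's length - doubling the array quadruples the checks.
+function hasDuplicates(items) {
+  for (let i = 0; i < items.length; i++) {
+    for (let j = i + 1; j < items.length; j++) {
+      if (items[i] === items[j]) {
+        return true;
+      }
+    }
+  }
+  return false; // In the worst case, we compared every pair.
+}
+```
+
+```js
+// O(2^n): each element doubles the number of combinations, because every
+// existing combination appears both with and without the new element.
+function allCombinations(items) {
+  let combinations = [[]];
+  for (const item of items) {
+    combinations = combinations.concat(
+      combinations.map((combination) => [...combination, item])
+    );
+  }
+  return combinations; // allCombinations([1, 2, 3]) has 2^3 = 8 entries.
+}
+```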
diff --git a/common-content/en/module/complexity/caching/index.md b/common-content/en/module/complexity/caching/index.md index d70a0f4b2..bfc7b068e 100644 --- a/common-content/en/module/complexity/caching/index.md +++ b/common-content/en/module/complexity/caching/index.md @@ -1,11 +1,11 @@ +++ title = "Caching" +time = 15 +emoji = "🛍️" [build] render = 'never' list = 'local' publishResources = false -time = 15 -emoji= "🛍️" [objectives] 1="Identify and explain how web browsers benefit from caching" 2="Demonstrate how caching can trade memory for CPU" diff --git a/common-content/en/module/complexity/invalidation/index.md b/common-content/en/module/complexity/invalidation/index.md index 380945f02..0d9bbfb1f 100644 --- a/common-content/en/module/complexity/invalidation/index.md +++ b/common-content/en/module/complexity/invalidation/index.md @@ -1,11 +1,11 @@ +++ title = "Cache Invalidation" +time = 15 +emoji = "⛓️‍💥" [build] render = 'never' list = 'local' publishResources = false -time = 15 -emoji= "⛓️‍💥" [objectives] 1="Identify and explain staleness risks with caching, and the difficulty of invalidation" +++ diff --git a/common-content/en/module/complexity/memoisation/index.md b/common-content/en/module/complexity/memoisation/index.md index 663ca9b64..878f962d8 100644 --- a/common-content/en/module/complexity/memoisation/index.md +++ b/common-content/en/module/complexity/memoisation/index.md @@ -1,11 +1,11 @@ +++ title = "Memoisation" +time = 15 +emoji = "📝" [build] render = 'never' list = 'local' publishResources = false -time = 15 -emoji= "📝" [objectives] 1="Define memoisation" +++ diff --git a/common-content/en/module/complexity/memory-consumption/index.md b/common-content/en/module/complexity/memory-consumption/index.md index cc4835586..34153d579 100644 --- a/common-content/en/module/complexity/memory-consumption/index.md +++ b/common-content/en/module/complexity/memory-consumption/index.md @@ -1,12 +1,12 @@ +++ title = "Memory consumption" description="Memory is finite" +time = 30 +emoji = "🥪" [build] render = 'never' list = 'local' publishResources = false -time = 30 -emoji= "🥪" [objectives] 1="Quantify the memory used by different arrays" +++ @@ -19,7 +19,7 @@ Think back to Chapter 7 of How Your Computer Really Works. ```mermaid graph LR -CPU -->|️ Fastest: Smallest| Cache -->|Fast: Small| RAM -->|Slow : Big| Disk -->|Slowest: Vast| Network +CPU -->|️ Fastest and Smallest| Cache -->|Fast and Small| RAM -->|Slow and Big| Disk -->|Slowest and Vast| Network ``` At each stage there are **limits** to **how fast** you can get the data and **how much** data you can store. Given this constraint, we need to consider how much memory our programs consume. @@ -34,11 +34,14 @@ const userRoles = ["Admin", "Editor", "Viewer"]; //An array of 3 short strings const userProfiles = [ {id: 1, name: "Farzaneh", role: "Admin", preferences: {...}}, {id: 2, name: "Cuneyt", role: "Editor", preferences: {...}} ]; // An array of 2 complex objects ``` -Different kinds of data have different memory footprints: +Different kinds of data have different memory footprints. All data is fundamentally stored as bytes. We can form intuition for how much memory a piece of data takes: -- Numbers or booleans use less memory than objects +- Numbers are typically stored as 8 bytes. In some languages, you can define numbers which take up less space (but can store a smaller range of values). +- Each character in an ASCII string takes 1 byte. More complex characters may take more bytes. The biggest characters take up to 4 bytes. 
- The longer a string, the more bytes it consumes. -- Objects and arrays need memory for their internal organisation _as well_ as the data itself. +- Objects and arrays are stored in different ways in different languages. But they need to store _at least_ the information contained within them. + - This means an array of 5 elements will use _at least_ as much memory as the 5 elements would on their own. + - And objects will use _at least_ as much memory as all of the _values_ inside the object (and in some languages, all of the keys as well). More complicated elements or more properties need more memory. It matters what things are made of. All of this data added up is how much _space_ our program takes. diff --git a/common-content/en/module/complexity/n+1/index.md b/common-content/en/module/complexity/n+1/index.md index 37dc41899..a9c6b4a22 100644 --- a/common-content/en/module/complexity/n+1/index.md +++ b/common-content/en/module/complexity/n+1/index.md @@ -1,11 +1,11 @@ +++ title = "N+1 Query Problem" +time = 60 +emoji = "🎟️" [build] render = 'never' list = 'local' publishResources = false -time = 60 -emoji= "🎟️" [objectives] 1="Define the n+1 query problem" 2="List effective strategies to reduce database queries" @@ -45,7 +45,14 @@ We've already seen that every query adds network delay and processing time. This The server has to handle each request individually, consuming resources (CPU, memory, connections). If many users trigger this N+1 pattern at once, the database can slow down for everyone, or even fall over entirely. -This N+1 problem can happen with any database interaction if you loop and query individually. Understanding this helps you write backend code that doesn't accidentally overload the database. +This N+1 problem can happen with any database interaction if you loop and query individually. Understanding this helps you write code that doesn't accidentally overload the database. + +{{< + multiple-choice + question="What is the N+1 Query Problem?" + answers="Fetching N items plus 1 extra backup item. | Making 1 query to get a list, then N separate queries to get details for each item in the list. | A query that is N times too complex. | Trying N+1 different network endpoints." + feedback="No, but flip this and try again? | Right! That's a clear description. | No, this is so vague it describes nothing. | No, it's not about network endpoints." + correct="1">}} ### 📦 What to do instead @@ -55,11 +62,8 @@ The real `/home `endpoint avoids these problems by using efficient strategies: **Caching**: Store results so you don't have to ask the network again. Ask for only new changes in future. **Pagination**: Ask for only the first page of results. Load more later if the user scrolls or clicks "next". -All these are ways to save the data we need, close to where we need it. But each strategy also has downsides. +All these are ways to save the data we need, close to where we need it. But each strategy also has downsides. -{{< - multiple-choice - question="What is the N+1 Query Problem?" - answers="Fetching N items plus 1 extra backup item. | Making 1 query to get a list, then N separate queries to get details for each item in the list. | A query that is N times too complex. | Trying N+1 different network endpoints." - feedback="No, but flip this and try again? | Right! That's a clear description. | No, this is so vague it describes nothing. | No, it's not about network endpoints." - correct="1">}} \ No newline at end of file +* Batching may reduce our responsiveness. 
It will probably take longer to fetch three users' blooms than one user's blooms. If we'd just asked for one user's blooms, then the next user's, we probably could have shown the user _some_ results sooner. Batching forces us to wait for _all_ of the results before we show anything.
+* Caching may result in stale results. If we store a user's most recent blooms, and don't ask the database again when they re-visit the page, we may return old results which are missing the newest blooms.
+* Pagination means the user doesn't have complete results up-front. If they want to ask a question like "has this user ever bloomed the word cheese", they may need to keep scrolling and searching to find the answer (with each scroll requiring a separate network call and database lookup).
diff --git a/common-content/en/module/complexity/network-as-a-bottleneck/index.md b/common-content/en/module/complexity/network-as-a-bottleneck/index.md
index 48960b946..99073680d 100644
--- a/common-content/en/module/complexity/network-as-a-bottleneck/index.md
+++ b/common-content/en/module/complexity/network-as-a-bottleneck/index.md
@@ -1,11 +1,11 @@
+++
title = "Network as a bottleneck"
+time = 15
+emoji = "⏳"
[build]
render = 'never'
list = 'local'
publishResources = false
-time = 15
-emoji= "⏳"
[objectives]
1="Explain limitations of needing to make network calls (e.g. from a backend to a database)"
+++
diff --git a/common-content/en/module/complexity/operations/index.md b/common-content/en/module/complexity/operations/index.md
index 1be8bb19d..c0d11581f 100644
--- a/common-content/en/module/complexity/operations/index.md
+++ b/common-content/en/module/complexity/operations/index.md
@@ -1,11 +1,11 @@
+++
title = '"Expensive" Operations'
+time = 30
+emoji = "🧮"
[build]
render = 'never'
list = 'local'
publishResources = false
-time = 30
-emoji= "🧮"
[objectives]
1="Explain what the significant/expensive operations for a particular algorithm are likely to be"
2="Quantify the number of significant operations taken by a particular algorithm"
@@ -15,7 +15,7 @@ Let's think about Purple Forest from the [Legacy Code](https://github.com/CodeYo
When we build the timeline of blooms on the homepage, we call an endpoint `/home`. This returns an array of objects, blooms, produced by people we follow, plus our own blooms, sorted by timestamp. We stuff this in our state object (and cache _that_ in our local storage).
-There are many different ways we could get and show this information. Some ways are {{}}
+There are many different ways we could get and show this information in our frontend. Some ways are {{}}
Here we are defining better as faster. We might at other times define better as _simpler_, _clearer_, or _safer_.
{{}} than others.
@@ -36,7 +36,7 @@ What if we had tried any of the following strategies:
1. Request our own blooms
1. Merge all the arrays
1. Sort by timestamp
-1. Display blooms!
+1. Display blooms
#### 3. Get ALL Blooms & People, then Loop & Filter
@@ -47,7 +47,7 @@ What if we had tried any of the following strategies:
1. Sort by timestamp
1. Display blooms
-Given what we've just thought about, how efficient are these programs? How could you make them more efficient? Write your ideas down in your notebook.
+Given what we've just thought about, how efficient are these programs? Which is going to be fastest or slowest? Which is going to use the most or least memory? How could you make them more efficient? Write your ideas down in your notebook.
Our end state is always to show the latest blooms that meet our criteria. How we produce that list determines how quickly our user gets their page. This is very, very important. After just **three seconds**, half of all your users have given up and left.
@@ -58,3 +58,5 @@ The Purple Forest application does not do most of this work on the front end, bu
1. Number of network calls
This is because some operations are more {{}}Expensive operations consume a lot of computational resources like CPU time, memory, or disk I/O.{{}} than others.
+
+The _order_ also matters. In all of the above strategies, we filter the blooms _before_ sorting them. Sorting isn't a constant-time operation, so it takes more time to sort more data. If in the first strategy we had sorted _all_ of the blooms before we filtered down to just the ones we cared about, we would have spent a lot more time sorting blooms we don't care about.
diff --git a/common-content/en/module/complexity/pre-computing/index.md b/common-content/en/module/complexity/pre-computing/index.md
index 8ad8c9b97..a7dd31c55 100644
--- a/common-content/en/module/complexity/pre-computing/index.md
+++ b/common-content/en/module/complexity/pre-computing/index.md
@@ -1,11 +1,11 @@
+++
title = "Pre-computing"
+time = 30
+emoji = "🔮"
[build]
render = 'never'
list = 'local'
publishResources = false
-time = 30
-emoji= "🔮"
[objectives]
1="Identify a pre-computation which will improve the complexity of an algorithm"
+++
diff --git a/common-content/en/module/complexity/trade-offs/index.md b/common-content/en/module/complexity/trade-offs/index.md
index 7405985ab..5a68d45c1 100644
--- a/common-content/en/module/complexity/trade-offs/index.md
+++ b/common-content/en/module/complexity/trade-offs/index.md
@@ -1,11 +1,11 @@
+++
title = "Trade-offs"
+time = 15
+emoji = "⚖️"
[build]
render = 'never'
list = 'local'
publishResources = false
-time = 15
-emoji= "⚖️"
[objectives]
1="Give examples of trading off memory for CPU"
2="Give examples of choosing where work is done in system design"
diff --git a/common-content/en/module/complexity/worked-example-duplicate-encoder/index.md b/common-content/en/module/complexity/worked-example-duplicate-encoder/index.md
new file mode 100644
index 000000000..c25ee041d
--- /dev/null
+++ b/common-content/en/module/complexity/worked-example-duplicate-encoder/index.md
@@ -0,0 +1,121 @@
++++
+title = "Worked example: Duplicate Encoder"
+time = 90
+emoji = "🧰"
+[build]
+ render = "never"
+ list = "local"
+ publishResources = false
+objectives = [
+ "Analyse the time complexity of a function.",
+ "Compare two algorithms for solving the same problem in terms of complexity.",
+ "Identify that speed isn't the only factor in choosing the best code.",
+]
++++
+
+Let's consider this problem:
+
+{{}}
+Convert a string to a new string. The new string should:
+* Replace each character which occurs exactly once in the original string with a `1` character.
+* Replace each character which occurs more than once in the original string with a `*` character.
+
+Ignore capitalisation when determining if a character is a duplicate.
+
+Examples
+Input | Output
+------------|-------
+`"din"` | `"111"`
+`"recede"` | `"1*1*1*"`
+`"Success"` | `"*1**1**"`
+`"11*2"` | `"**11"`
+```
+{{}}
+
+> [!WARNING]
+>
+> First, try solving this exercise yourself.
+
+Here are three sample solutions we will compare:
+
+```js {linenos=table}
+function duplicateEncode(word){
+  let result = ""
+  for (const char of word.toLowerCase()) {
+    if (word.indexOf(char) === word.lastIndexOf(char)) {
+      result += "1";
+    } else {
+      result += "*";
+    }
+  }
+  return result;
+}
+```
+
+```js {linenos=table}
+function duplicateEncode(word){
+  let result = ""
+  for (const char of word) {
+    const lowerCaseChar = char.toLowerCase()
+    if (word.indexOf(lowerCaseChar) === word.lastIndexOf(lowerCaseChar)) {
+      result += "1";
+    } else {
+      result += "*";
+    }
+  }
+  return result;
+}
+```
+
+```js {linenos=table}
+function duplicateEncode(word){
+  const occurrences = {};
+  for (const char of word) {
+    const normalisedChar = char.toLowerCase();
+    occurrences[normalisedChar] = (occurrences[normalisedChar] || 0) + 1;
+  }
+  let out = "";
+  for (const char of word) {
+    out += (occurrences[char.toLowerCase()] === 1 ? "1" : "*");
+  }
+  return out;
+}
+```
+
+### Comparing the approaches
+
+Approaches 1 and 2 are very similar. In terms of time and memory complexity, they are the same as each other. But the second uses less memory than the first. The first one keeps around an entire copy of `word` (converted to lower case) for the whole function. The second one only converts each character in `word` to lower case one at a time. It's important to remember that big-O complexity doesn't tell you how much time or memory an approach takes, only how fast they grow.
+
+Approaches 2 and 3 are quite different. Let's analyse each of them for time complexity:
+
+#### Approach 2
+
+* Approach 2 has a `for`-loop over each character in the word (line 3). If we say the size of the word is `n`, a `for`-loop on its own is `O(n)`.
+* Line 4 calls `char.toLowerCase()` - this is an `O(1)` operation - changing the case of one character takes constant time.
+* Line 5 calls `word.indexOf` - this is an `O(n)` operation - it may have to look through the whole string, comparing every character to see if it's the one we're looking for. Because we have an `O(n)` operation (the `word.indexOf` call) inside an `O(n)` operation (the for loop), this makes the function at least `O(n^2)`.
+* Line 5 also calls `word.lastIndexOf` - this is _also_ an `O(n)` operation for the same reason. But it doesn't change the complexity of our function - doing an `O(n)` operation _twice_ is still `O(n)` - we ignore constant factors. This is different from doing an `O(n)` operation inside a for loop (where we do an `O(n)` operation _for each_ of the `n` characters - making it `O(n^2)`).
+
+Because the worst thing we've seen is `O(n^2)` (an `O(n)` operation inside a `for` loop), approach 2 takes `O(n^2)` time.
+
+#### Approach 3
+
+* Approach 3 has a `for`-loop over each character in the word (line 3). This is `O(n)`.
+* Each operation inside the `for`-loop is `O(1)` (converting one character to lower case, looking up a value in an object, adding two numbers, inserting a value in an object). So this whole loop is `O(n)`.
+* We have a second `for`-loop over each character in the word - `O(n)`.
+* Each operation inside the `for`-loop is `O(1)` (converting one character to lower case, looking up a value in an object, a ternary operator, and appending one character to a string). So this whole loop is `O(n)`.
+
+Because the worst thing we've seen is `O(n)` (even though there were two `O(n)` loops), approach 3 takes `O(n)` time.
+
+This technique is called {{}}Precomputing is when we do some work in advance which we can use to avoid needing to do it later.

Here we counted the occurrences of each character _before_ the second loop (which was `O(n)`) rather than needing to search through the string for each character (which would've been `O(n^2)`).{{
}}. We will learn more about it later.
+
+### Which is better?
+
+Approach 3 has a better time complexity than approach 2. As longer input is given to the function, approach 3 will be much faster than approach 2.
+
+Speed is not the only concern, however!
+
+Approach 3 uses more memory than approach 2 (though not in terms of memory complexity), because it stores a whole extra object.
+
+Which approach do you think is easiest to read? Ease of reading is an important concern to consider in code.
+
+If we know our function will only take small strings as input, the time and memory use of the functions probably doesn't matter much, and we should prefer the easiest code to read, maintain, and modify. We can still make small choices to improve things (like choosing approach 2 over approach 1 to save a little memory). But we should only make our code harder to read if we know that we'll have a problem if we don't.
diff --git a/org-cyf-sdc/content/complexity/sprints/1/prep/index.md b/org-cyf-sdc/content/complexity/sprints/1/prep/index.md
index 71565d71a..bba764333 100644
--- a/org-cyf-sdc/content/complexity/sprints/1/prep/index.md
+++ b/org-cyf-sdc/content/complexity/sprints/1/prep/index.md
@@ -12,6 +12,9 @@ src = "module/complexity/memory-consumption"
name = "Time: Big-O"
src = "module/complexity/big-o"
[[blocks]]
+name = "Worked example"
+src = "module/complexity/worked-example-duplicate-encoder"
+[[blocks]]
name = '"Expensive" Operations'
src = "module/complexity/operations"
[[blocks]]
diff --git a/org-cyf-sdc/content/complexity/sprints/2/backlog/index.md b/org-cyf-sdc/content/complexity/sprints/2/backlog/index.md
index 44774220b..46d71d424 100644
--- a/org-cyf-sdc/content/complexity/sprints/2/backlog/index.md
+++ b/org-cyf-sdc/content/complexity/sprints/2/backlog/index.md
@@ -5,5 +5,5 @@ emoji= '🥞'
menu_level = ['sprint']
weight = 2
backlog= 'Module-Complexity'
-backlog_filter='📅 Sprint 1'
+backlog_filter='📅 Sprint 2'
+++