Concurrency and Parallelism
How computers do more things at once, and do them faster: thread-level concurrency, instruction-level parallelism, and SIMD.
Hi Friends,
Welcome to the 141st issue of the Polymathic Engineer newsletter.
This week we talk about how computers implement concurrency and parallelism at the OS and CPU levels. The outline will be as follows:
Concurrency vs parallelism
Thread-Level Concurrency
Instruction-Level Parallelism
SIMD Parallelism
CodeRabbit: Free AI Code Reviews in CLI
This issue is offered by CodeRabbit, the Free AI Code Reviews in CLI. CodeRabbit CLI is an AI code review tool that runs directly in your terminal. It offers intelligent code analysis, catches issues early, and integrates seamlessly with AI coding agents like Claude Code, Codex CLI, Cursor CLI, and Gemini to ensure your code is ready for production before it ships.
Enables pre-commit reviews of changes, creating a multi-layered review process.
Fits into existing Git workflows. Review uncommitted changes, staged files, specific commits, or entire branches without disrupting your current development process.
Reviews specific files, directories, uncommitted changes, staged changes, or entire commits based on your needs.
Supports programming languages including JavaScript, TypeScript, Python, Java, C#, C++, Ruby, Rust, Go, PHP, and more.
Offers free AI code reviews with rate limits so developers can experience senior-level reviews at no cost.
Flags hallucinations, code smells, security issues, and performance problems.
Supports guidelines for other AI generators, AST Grep rules, and path-based instructions.
Concurrency vs Parallelism
Two things have always been clear about software development: we want computers to do more things and do those things faster. Making computer processors able to do more than one thing at once can solve both problems.
Before jumping into how this works, it’s critical to understand the difference between two related yet distinct concepts: concurrency and parallelism.
Concurrency is the ability of a system to have more than one task in progress at the same time. It is like having a busy chef who can prepare vegetables while watching over several pots on the stove. Even though the chef isn’t doing everything at the same instant, they are managing more than one task concurrently.
This idea is taken a step further in parallelism, which leverages concurrency to make the system run faster. It is like having several chefs working on different dishes in the kitchen at the same time. The goal is not just to handle multiple tasks simultaneously, but also to complete the work more efficiently overall.
Parallelism and concurrency are used in computers at three different levels. At the top level, there is thread-level concurrency, where entire programs or parts of them run concurrently. In the middle level, instruction-level parallelism enables processors to run more than one instruction simultaneously. At the lower level, SIMD parallelism enables a single instruction to execute more than one operation at the same time.
Understanding these three levels is very beneficial whether you are designing systems, writing software, or just trying to understand why your programs behave the way they do. We will look at each of them in the following sections.
Thread-Level Concurrency
The process abstraction, managed by the OS, lets systems run more than one program concurrently. Thread-level concurrency takes it a step further by allowing more than one control flow to execute concurrently within the same process. The first computers that made it possible for multiple programs to run at the same time were time-sharing systems in the early 1960s.
These computers had a single processor that quickly switched between jobs, just like a juggler who keeps many balls in the air at once. For a short time slot, the processor works on one program and then switches to another. This made it look like everything was happening at the same time. Such a type of concurrency works well enough to let multiple users access the same web server or to enable a single user to browse the web while writing documents and playing music.
The limitation was clear: only one processor was doing the actual work, even while it was handling multiple tasks. These were known as uniprocessor systems.
Everything changed when engineers developed computers with more than one processor running under a single operating system. These multi-processor systems have been used for large-scale computing since the 1980s, but they became much more common with the advent of multi-core processors and hyperthreading.
Multi-core computers have multiple CPUs on the same chip. Each core has its own L1 and L2 caches, with the L1 cache split between instructions and data. The cores share the higher-level caches and the connection to main memory.
Hyperthreading works differently. The idea is that a single CPU runs multiple control flows by sharing some hardware components, like the floating-point arithmetic units, and duplicating others, like the program counters and register files. When one thread waits for data to load into the cache, the CPU can immediately switch to running another thread. For example, an Intel Core i7 processor can handle two threads per core, so a machine with four cores can run eight threads at the same time.
These improvements have two significant advantages. First, they reduce the need to simulate concurrency when handling multiple tasks. Second, they speed up programs, but only if those programs are designed to use multiple threads efficiently. The challenge is that applications must be explicitly written to take advantage of these parallel processing capabilities.
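To make that last point concrete, here is a minimal sketch, assuming a POSIX system and the pthreads library (neither is named above): the program asks the OS how many logical CPUs are available and then explicitly creates one worker thread per logical CPU.

```c
/* Minimal sketch (POSIX): one worker thread per logical CPU.
 * Compile with: gcc threads.c -pthread */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define MAX_THREADS 64

static void *worker(void *arg) {
    long id = (long)arg;
    /* Each thread is an independent control flow inside the same process. */
    printf("thread %ld running\n", id);
    return NULL;
}

int main(void) {
    /* Number of logical CPUs: with hyperthreading, this is typically
     * twice the number of physical cores. */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    if (n < 1) n = 1;
    if (n > MAX_THREADS) n = MAX_THREADS;

    pthread_t threads[MAX_THREADS];
    for (long i = 0; i < n; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (long i = 0; i < n; i++)
        pthread_join(threads[i], NULL);
    return 0;
}
```

On a four-core machine with hyperthreading, this would create eight threads, one per hardware thread.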
Instruction-Level Parallelism
Instruction-level parallelism is the capability of modern processors to run more than one instruction simultaneously. This works on a much more basic level than thread-level concurrency, focusing on single processor instructions instead of programs or threads.
Early microprocessors, such as the Intel 8086, required between 3 and 10 clock cycles to execute a single instruction. This meant that the processor had to finish one instruction completely before moving on to the next one. Today’s processors work very differently: they can execute two to four instructions every clock cycle, even though each individual instruction may take twenty cycles or more to finish.
The key insight is that processors don’t need to wait for one instruction to complete before starting another. They use techniques that let them work on several instructions simultaneously.
Pipelining breaks the execution of an instruction into separate steps and organizes the processor's hardware as a series of stages, each responsible for one of those steps. While one stage works on step 3 of instruction A, another can work on step 2 of instruction B, and a third on step 1 of instruction C. This lets the processor work on different parts of different instructions at the same time. A well-designed pipeline can sustain a rate of almost one instruction per clock cycle.
Superscalar processors go further and complete more than one instruction per cycle, which is faster than pipelining alone. They figure out which instructions are independent of each other and use multiple execution units to run them simultaneously.
The implications for software development are big. By knowing how these processors work, programmers can write code that achieves a higher degree of instruction-level parallelism. Most of this parallelism is handled by the processor, but the way code is written can make it easier or harder for the processor to identify instructions that can be parallelized.
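As a rough illustration (a sketch, not a benchmark), here are two C functions that compute the same sum. The first forms one long dependency chain, so every addition must wait for the previous one. The second uses four independent accumulators, which gives the processor several additions it can keep in flight at the same time.

```c
#include <stddef.h>

/* One long dependency chain: each addition depends on the previous one. */
double sum_single(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Four independent accumulators: the additions in one iteration do not
 * depend on each other, so the processor can overlap them. */
double sum_unrolled(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)   /* handle the leftover elements */
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}
```

Whether the second version is actually faster depends on the compiler and the CPU, but it shows how the structure of the code changes how much independent work the hardware can find.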
SIMD Parallelism
SIMD (Single-Instruction, Multiple-Data) parallelism works at the lowest possible level of abstraction. It enables a single instruction to perform multiple operations on different pieces of data simultaneously. This type of parallelism is different because it operates on a large number of data elements simultaneously and applies the identical operation to all of them.
Modern Intel and AMD processors have special instructions that can handle multiple data operations in parallel. For example, they can add up to 8 pairs of single-precision floating-point numbers in a single instruction. The processor handles all 8 additions at once instead of processing each pair one by one.
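To make this concrete, here is a minimal sketch using Intel's AVX intrinsics (this assumes an AVX-capable x86 processor and a compiler that ships immintrin.h; the text above does not name a specific instruction set). The _mm256_add_ps intrinsic performs the 8 single-precision additions with one instruction.

```c
/* Sketch of SIMD addition with AVX intrinsics.
 * Compile with: gcc -mavx simd_add.c */
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float c[8];

    __m256 va = _mm256_loadu_ps(a);    /* load 8 floats */
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb); /* 8 additions in one instruction */
    _mm256_storeu_ps(c, vc);

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);
    printf("\n");
    return 0;
}
```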
SIMD parallelism works well for applications that process large amounts of similar data. For example, image processing, sound editing, and video rendering all apply the same operations to many data points. SIMD instructions can speed up these tasks because they handle many pixels or audio samples at once.
Programmers can use SIMD parallelism in two main ways. Some compilers can automatically find parts of the code that would benefit from SIMD processing and generate the proper instructions. However, this automatic vectorization doesn’t always work well.
A more reliable approach is to write programs using specialized vector data types. Compilers like GCC support such types and let programmers state explicitly which operations should be carried out on whole vectors of data, giving more control over how the code uses the parallel processing capabilities.
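For example, here is a minimal sketch using GCC's vector extensions (a GCC-specific feature; the type name v8sf is just an illustrative choice):

```c
/* v8sf is a 256-bit vector holding 8 single-precision floats.
 * Ordinary operators like + on this type are compiled into SIMD
 * instructions when the target supports them. */
typedef float v8sf __attribute__((vector_size(32)));

v8sf add_vectors(v8sf a, v8sf b) {
    return a + b;   /* one vector addition instead of 8 scalar ones */
}
```

On a target that supports AVX, GCC can compile this addition into a single vector instruction, without the programmer writing intrinsics by hand.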
SIMD parallelism operates at the lowest level, but it can make a big difference in performance. The key is spotting the places where you apply the same operation to many pieces of data; once you find them, you can structure your code to use these instructions.