Getting Started with OpenCV: A Practical Introduction
Learn how images work under the hood and build your first computer vision application.
Hi Friends,
Welcome to the 162nd issue of the Polymathic Engineer newsletter.
As everybody knows, computer vision is everywhere. From the filters on your phone camera to self-driving cars, from medical imaging to security systems, the ability for machines to “see” and understand visual information has become integral to modern software.
If you want to get started with computer vision, OpenCV is the tool you should look at. This open-source library has been around since 2000, and it is used by everyone from hobbyists to large companies like Google and Intel. You can build many cool applications with it, such as face detection, object tracking, and augmented reality. But before you can do anything interesting, you need to understand the basics: how images are represented in memory, how to read and write them, and how to display results to users.
In this article, we will build that basic know-how. We won’t just look at the code, but we will also explain why things work the way they do. If you get a solid understanding of these fundamentals, it will be much easier to grasp everything else you learn about computer vision.
The outline is as follows:
How Images Are Represented in Memory
Reading and Writing Images
Working with Video and Camera Input
Displaying Results and Handling User Input
Project-based learning is the best way to develop technical skills. CodeCrafters is an excellent platform for tackling exciting projects, such as building your own Redis, Kafka, a DNS server, SQLite, or Git from scratch. Sign up, and become a better software engineer.
How Images Are Represented in Memory
To work with images in OpenCV, you first need to understand what an image actually is at the code level. In OpenCV's Python API, an image is a NumPy array: a grid of numbers where each number represents a pixel value.
Let’s begin with the simplest case: a grayscale image. Each pixel is a number between 0 and 255, where 0 is black and 255 is white. Any other number in between is a shade of gray. This is how you can make a 3x3 black picture from scratch:
import numpy as np
img = np.zeros((3, 3), dtype=np.uint8)
print(img)
"""Output:
[[0 0 0]
[0 0 0]
[0 0 0]]
"""Color images work the same way. However, each pixel has three values instead of one, corresponding to the three color channels. One thing that confuses many newbies is that OpenCV uses the BGR (Blue, Green, Red) format, not RGB. When you change our previous picture to color, the NumPy array gets a third dimension:
import cv2
img = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
print(img.shape)  # Output: (3, 3, 3)

Now each pixel is a list of three values: [Blue, Green, Red]. A pure blue pixel is [255, 0, 0], not [0, 0, 255] as you might expect from RGB. The middle image below has four colored quadrants: the blue quadrant (top-left) is created with [255, 0, 0], the green quadrant (top-right) with [0, 255, 0], the red quadrant (bottom-left) with [0, 0, 255], and the white quadrant (bottom-right) with [255, 255, 255].
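As a concrete sketch, here is how such a four-quadrant image could be built with plain NumPy slicing (the 200×200 size is an assumption for illustration):

```python
import numpy as np

# Build a 200x200 BGR image with four colored quadrants.
# Remember: OpenCV stores channels as [Blue, Green, Red].
img = np.zeros((200, 200, 3), dtype=np.uint8)
img[:100, :100] = [255, 0, 0]      # top-left: blue
img[:100, 100:] = [0, 255, 0]      # top-right: green
img[100:, :100] = [0, 0, 255]      # bottom-left: red
img[100:, 100:] = [255, 255, 255]  # bottom-right: white
```

Each assignment broadcasts one BGR triplet over a 100×100 region, which is exactly the kind of array math OpenCV images make possible.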
You can inspect any image’s structure using three properties:
shape: Returns the dimensions. For a color image, this is (height, width, channels). For grayscale, it’s just (height, width).
size: The total number of elements in the array. For a 100×100 BGR image, this would be 30,000 (100×100×3).
dtype: The data type of the pixels. Most images use uint8 (unsigned 8-bit integers), which gives you the 0-255 range.
Getting familiar with this layout is critical to do everything else in OpenCV. When you manipulate images, you are really just doing math on NumPy arrays.
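To make that concrete, here is a quick check of all three properties on a blank 100×100 BGR image:

```python
import numpy as np

# A blank 100x100 BGR image: three channels of unsigned 8-bit integers.
img = np.zeros((100, 100, 3), dtype=np.uint8)

print(img.shape)  # (100, 100, 3) -> (height, width, channels)
print(img.size)   # 30000 -> 100 * 100 * 3 elements
print(img.dtype)  # uint8
```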
Reading and Writing Images
Now that you know how images are stored in memory, let’s have a look at how to load them from files and save them back. To do this, OpenCV has two key methods: imread and imwrite. The first loads an image, while the second saves an image. Their basic usage is straightforward:
import cv2
img = cv2.imread('photo.png')
cv2.imwrite('photo.jpg', img)

Even if the original file is grayscale, imread loads images by default in the BGR color format. If you want to load an image as grayscale, you need to say so explicitly:

gray_img = cv2.imread('photo.png', cv2.IMREAD_GRAYSCALE)

Using the cv2.IMREAD_UNCHANGED option, you can also load the image exactly as it was saved, including the alpha (transparency) channel if it exists.
The cool thing is that once you have an image loaded, you can modify it directly since it is just a NumPy array. For example, if you want to change single pixels to white, you can just assign new values:
img[0, 0] = [255, 255, 255]  # Top-left pixel is now white

For bulk changes, NumPy's array slicing is much faster than looping through pixels one by one. For example, this is how you can remove all green from an image or copy regions of an image using slicing:
img[:, :, 1] = 0 # Set all green channel values to zero
# Copy a 100x100 region from the top-left corner
roi = img[0:100, 0:100]
# Paste it somewhere else
img[200:300, 200:300] = roi

Of course, when copying regions, you need to ensure the areas are the same size. If the shapes don't match, NumPy will raise an error. The following image shows the result of manipulating the four-quadrant image we saw in the previous section: the green channel is removed from the top half, the bottom-left is set to blue, and a region is copied from the bottom-right to the top-left.
Working with Video and Camera Input
A video is simply a sequence of images, and OpenCV processes them the same way it processes images: frame by frame. The VideoCapture class lets you read videos, and the VideoWriter lets you save them.
The following code snippet creates a VideoCapture object and then loops through the frames to read from a video file:
video = cv2.VideoCapture('input.avi')

while True:
    success, frame = video.read()
    if not success:
        break
    # Do something with the frame

The read method returns two values: a boolean flag indicating whether a frame was successfully read, and the frame itself. When the video ends, success is set to False, and the loop is exited.
The amazing part is that capturing from a webcam works almost the same way. You just pass a device index instead of a filename; the default camera's index is usually 0. To save a video, you need a VideoWriter object, which requires a few more parameters: the output filename, a codec, the frame rate, and the frame size:
import cv2

camera = cv2.VideoCapture(0)
fps = 30
width = int(camera.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(camera.get(cv2.CAP_PROP_FRAME_HEIGHT))

writer = cv2.VideoWriter(
    'output.avi',
    cv2.VideoWriter_fourcc('X', 'V', 'I', 'D'),
    fps,
    (width, height)
)

while True:
    success, frame = camera.read()
    if not success:
        break
    writer.write(frame)

The codec determines how the video is compressed. Common options include XVID for AVI files and MP4V for MP4 files. Of course, the available codecs depend on your system, so you might need to experiment to find one that works.
One gotcha with cameras: the get method does not always return accurate frame rate values. It often returns 0. If you need precise timing, you are better off measuring the actual frame rate yourself or assuming a reasonable value, such as 30 fps.
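One way to measure it yourself is to time how long a batch of frames takes to arrive. Here is a minimal sketch; measure_fps and its read_frame callable are illustrative helpers, not part of the OpenCV API:

```python
import time

def measure_fps(read_frame, num_frames=60):
    # Time how long it takes to grab num_frames frames and derive
    # the effective frames per second from the elapsed wall time.
    start = time.perf_counter()
    for _ in range(num_frames):
        read_frame()
    elapsed = time.perf_counter() - start
    return num_frames / elapsed
```

With a real camera, you would call something like measure_fps(lambda: camera.read()) after discarding a few warm-up frames, since the first reads are often slower while the driver initializes.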
Displaying Results and Handling User Input
To debug and build interactive applications, you need to be able to display images and videos. OpenCV provides a simple windowing system for this purpose. The imshow function displays an image in a window:
import cv2
img = cv2.imread('photo.png')
cv2.imshow('My Window', img)
cv2.waitKey()
cv2.destroyAllWindows()
The waitKey function often confuses beginners. Without it, the window would open and then close right away. It pauses execution and waits for a key to be pressed. You can give it a timeout in milliseconds as an argument: passing 0 makes it wait forever, while passing a positive number makes it wait that many milliseconds before continuing.
If the user doesn't press any keys before the timeout, waitKey returns -1. Otherwise, it returns the ASCII code of the key that was pressed. You can use this mechanism to make your application respond to keyboard input:
while True:
    cv2.imshow('My Window', frame)
    key = cv2.waitKey(1)
    if key == 27:  # ESC key
        break
For mouse input, you need to register a callback function using setMouseCallback. The callback gets the kind of event, the x and y coordinates, and some extra flags:
import cv2

def on_mouse(event, x, y, flags, param):
    if event == cv2.EVENT_LBUTTONUP:
        print(f'Clicked at ({x}, {y})')

cv2.namedWindow('My Window')
cv2.setMouseCallback('My Window', on_mouse)
img = cv2.imread('photo.png')

while True:
    cv2.imshow('My Window', img)
    if cv2.waitKey(1) == 27:
        break

cv2.destroyAllWindows()
The event types include EVENT_MOUSEMOVE for movement, EVENT_LBUTTONDOWN and EVENT_LBUTTONUP for the left button, and similar events for the right and middle buttons.
One important detail: OpenCV windows only update when waitKey is called. If you forget to call it inside your loop, you will see a frozen or blank window. In the same way, waitKey only captures keyboard input when an OpenCV window has focus.
Conclusion
We covered a lot of basic OpenCV ground in this article. You learned that images are just NumPy arrays, how to load and save them with imread and imwrite, how to work with video files and camera input using VideoCapture and VideoWriter, and how to display results and handle user input with OpenCV’s windowing system.
These are the building blocks for everything else you will do. If you want to go deeper, I suggest playing around with the code samples I used for this article, which I put in this GitHub repo. Try loading images in different formats, recording video from your webcam, or making a small app that responds to mouse clicks. The best way to internalize these concepts is to run code and see what happens.