1.2.5 – Compression

Compression is all about making things smaller so that they take up less space. The need to save space is certainly not unique to computing, and there are countless examples in history of clever techniques and innovations designed to compress data.

The goal of compression is to take a file, whether it is a picture, video, sound or even a text document, and reduce the amount of data needed to represent that information. This may sound impossible – how can we make something smaller without losing something? The answer is that if we look closely, we can find repetition, patterns or information that simply isn’t needed, all of which can be reduced to a simpler form.

There are two main forms of compression: one which preserves the original perfectly at the expense of larger files, and one where some information is lost but, in return, file sizes are much smaller.

What is Compression?

Compression is when we use an algorithm or utility to reduce the amount of data in a file. We can do this in two ways:

  • Reduce the amount of data without losing information. This is called Lossless compression – nothing is lost!
  • Reduce the amount of data by discarding an acceptable amount of information whilst still being able to reproduce a very close approximation of the original. This is called Lossy compression – because we lose things!

This can seem quite confusing at first – for example, how can we lose information from a sound file and still have something that sounds good when played back? How is it possible to lose information from a video and still see the pictures clearly?

The first step towards understanding the concept of compression is to differentiate between data and information. When referring to data in computing, we are talking about the binary 0s and 1s which make up a file. Those binary bits on their own are utterly meaningless – they are just a pile of numbers. A computer must interpret and process these bits in different ways depending on the type of information that the file represents (an image, a document or a video) in order to produce something with meaning.
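The difference between data and information can be shown with a small sketch. The two bytes used here are a made-up example, not taken from any real file: the data never changes, but the information we get out depends entirely on how we choose to interpret it.

```python
# The same two bytes of data, interpreted in three different ways.
data = bytes([0b01001000, 0b01101001])   # 16 bits: 01001000 01101001

as_text = data.decode("ascii")           # treat each byte as an ASCII character
as_number = int.from_bytes(data, "big")  # treat all 16 bits as one whole number
as_levels = list(data)                   # treat each byte as a brightness level (0-255)

print(as_text)    # Hi
print(as_number)  # 18537
print(as_levels)  # [72, 105]
```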

Next, how is it possible to get rid of some of these binary bits without destroying the file or its meaning?

A page of uncompressed data

Imagine you take a page from a book. It is covered in information from top to bottom – the words. Now, fold that paper in half and keep folding it in half until you can’t fold any more. That page now takes up far less physical space than it did before – this is a type of lossless compression. None of the words were lost, removed or changed, but the paper is much smaller and takes up less space.

Folding the paper makes it smaller – it has been compressed.
This is a totally reversible process – simply unfold to reveal the original information.

This is a completely non-destructive process. Just unfold the paper and you can return the page back to its original state, but this isn’t the only option to make the paper smaller. Can we improve the situation?

It turns out we can, especially if we scale up the amount of information. Books happen to contain lots of words and many of these are repeated. One method of reducing the amount of data would be to replace the words with a symbol or number. Not only is the amount of data stored instantly reduced, but every time a word is repeated, we save space. Again, this is completely non-destructive – reverse the process and you are presented with an exact copy of the original document.
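The word-replacement idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions (words are separated by single spaces, and the function names and the sentence are invented for the example), not how any particular compression tool works.

```python
def compress_words(text):
    """Replace each word with a number; return the word list and the numbers."""
    dictionary = []   # each distinct word is stored only once
    positions = {}    # word -> its index in the dictionary
    codes = []
    for word in text.split(" "):
        if word not in positions:
            positions[word] = len(dictionary)
            dictionary.append(word)
        codes.append(positions[word])
    return dictionary, codes


def decompress_words(dictionary, codes):
    """Reverse the process: look each number up and rebuild the text."""
    return " ".join(dictionary[code] for code in codes)


original = "the cat sat on the mat and the dog sat on the cat"
dictionary, codes = compress_words(original)

print(dictionary)  # ['the', 'cat', 'sat', 'on', 'mat', 'and', 'dog']
print(codes)       # [0, 1, 2, 3, 0, 4, 5, 0, 6, 2, 3, 0, 1]
print(decompress_words(dictionary, codes) == original)  # True
```

The sentence has 13 words but only 7 different ones, so each repeated word costs just one small number. The saving grows as the text gets longer and more words repeat.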

Can we do even better? Well… yes…!

Around the edge of each page is white space: paper without any words printed on it. Do we really need this? You could theoretically rip all of these bits off and save yet more space. Whilst we don’t lose any information (the words are still preserved), this process is destructive. We are physically throwing parts of the page away.

A more destructive method of compression removing unnecessary white space. We can still reconstruct the original document.

Even though we now have access to unimaginably large amounts of storage, there are still many reasons why compression is both necessary and useful. Although storage is getting ever larger and cheaper, we are simultaneously generating more and more data than ever before. Things like video editing in 4K create eye watering amounts of data and being able to reduce this in any way is desirable.
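To get a feel for just how much data uncompressed 4K video produces, here is a rough calculation. The figures are assumptions chosen for illustration (a 3840 × 2160 frame, 24-bit colour and 30 frames per second); real cameras and editing formats vary.

```python
# Assumed figures for illustration only.
WIDTH, HEIGHT = 3840, 2160   # 4K UHD resolution in pixels
BYTES_PER_PIXEL = 3          # 24-bit colour: one byte each for red, green and blue
FRAMES_PER_SECOND = 30

bytes_per_frame = WIDTH * HEIGHT * BYTES_PER_PIXEL
bytes_per_second = bytes_per_frame * FRAMES_PER_SECOND
bytes_per_minute = bytes_per_second * 60

print(bytes_per_frame)           # 24883200  (about 24.9 MB per frame)
print(bytes_per_second)          # 746496000 (about 746 MB per second)
print(bytes_per_minute / 10**9)  # 44.78976  (GB per minute)
```

Under these assumptions, a single minute of uncompressed footage needs nearly 45 GB, which is why reducing it in any way is so desirable.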

Uncompressed files are still occasionally necessary, but they are used less and less often as computing power continues to increase. Historically, it may have been necessary to work with uncompressed data simply because the CPU time taken to compress and decompress data would have had a noticeable impact on performance. Working with uncompressed data still places a huge demand on fast storage and data transmission methods.

Uncompressed file formats include:

  • Images – BMP or Bitmap
  • Sound – WAV
  • Text – TXT

The reasons for compressing file sizes are shown below.

  • Many tasks create huge amounts of data – reducing this makes more efficient use of the storage available
  • Cost – despite the availability of large, cheap storage devices, storage is not free
  • Transmission of data – compressed data can be transmitted more quickly and uses less bandwidth
  • Streaming – compression means higher quality audio and video can be streamed to a device without a loss of performance or increased cost

Lossy Compression

Lossy compression works on the principle that we don’t always need all of the information in a file – we do not always need the highest possible quality in an image, video or sound file. Images and sound files especially contain lots of information that you wouldn’t even notice if it were removed. For example, in a sound file it is possible to remove all data for frequencies above 20,000 Hz and below 20 Hz because whilst a computer can record, store and play these tones back – you cannot hear them!

In an image, there are many millions of colours that a computer can represent on screen – more than 16 million with standard 24-bit colour, even on a modest system. The human eye is often estimated to be able to differentiate between only around 7 million colours. Whilst that is very impressive, it does mean there is room to lose some data and not really notice it.
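One very simple way to throw away colour data is to keep fewer bits for each of the red, green and blue values. The sketch below does exactly that. It is a deliberately basic illustration of the lossy idea, not how JPG actually works, and the function names and the example colour are invented.

```python
def quantise(colour, bits=4):
    """Keep only the top `bits` bits of each 8-bit channel (lossy)."""
    shift = 8 - bits
    return tuple(channel >> shift for channel in colour)


def restore(small_colour, bits=4):
    """Scale back up to 0-255. The discarded low bits cannot be recovered."""
    shift = 8 - bits
    return tuple(channel << shift for channel in small_colour)


original = (200, 123, 37)     # 24 bits: one of 16,777,216 possible colours
stored = quantise(original)   # 12 bits: one of 4,096 possible colours
approximation = restore(stored)

print(stored)         # (12, 7, 2)
print(approximation)  # (192, 112, 32)
```

The stored colour needs half as many bits, and the restored colour is close to the original but not identical. That gap is the information that has been permanently lost.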

Some common file types which use lossy compression are listed below:

  • Images – JPG
  • Sound – MP3
  • Video – MP4, AVC

The advantage of lossy compression methods is that you can achieve a much greater reduction in file size simply because you do not need to keep all of the original quality. The obvious disadvantage is that once you use lossy compression you can never reproduce the original quality of the file. Therefore, lossy compression cannot be used for text documents – if you lose any of the original “quality” then you cannot reproduce the original text and that’s… a problem!

The image below has been saved as a JPG with default or “high” quality on the left and the lowest possible quality, and therefore smallest file size, on the right. Comparing the two shows the difference that a high level of lossy compression makes to an image.

To summarise:

  • Lossy compression achieves high rates of compression (file sizes are significantly reduced)
  • Information or quality is permanently lost
  • You cannot reproduce the original file once lossy compression has been used
  • Lossy compression gives a “good enough” level of quality
  • The data that is discarded is usually not useful – such as sound frequencies we cannot hear.

Lossy compression is used with the following data types:

  • Sound
  • Video
  • Images

It cannot be used with text files.

Lossless Compression

Lossless compression methods reduce the amount of data in a file whilst still maintaining the ability to reproduce the original exactly. In other words, you get the best of both worlds:

  • Your file is smaller – it takes up less storage
  • You still maintain the maximum possible quality – the same as the original source
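Both points can be checked with a short experiment. This sketch uses zlib, a general-purpose lossless compression module in Python's standard library. The choice of zlib and the repeated sample sentence are assumptions made for the example.

```python
import zlib

# A repetitive piece of text: 44 characters repeated 50 times.
original = ("the quick brown fox jumps over the lazy dog " * 50).encode("ascii")

compressed = zlib.compress(original)
restored = zlib.decompress(compressed)

print(len(original))                    # 2200
print(len(compressed) < len(original))  # True
print(restored == original)             # True
```

The compressed version takes up less space, and decompressing it gives back every single byte of the original.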

Lossless compression methods have been invented for all kinds of file types, some examples are listed below:

  • Sound – FLAC (Free Lossless Audio Codec) and ALAC (Apple Lossless Audio Codec)
  • Video – H.264 (lossless mode)
  • Images – PNG, WEBP
  • Text – Huffman Encoding
  • General data compression – Huffman Encoding, Run Length Encoding
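Run Length Encoding, the last method in the list above, is simple enough to sketch in a few lines. It replaces each run of repeated values with the value and a count. The row of “pixels” below is a made-up example, and the function names are invented for illustration.

```python
def rle_encode(text):
    """Collapse each run of a repeated character into a (character, count) pair."""
    runs = []
    for char in text:
        if runs and runs[-1][0] == char:
            runs[-1] = (char, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((char, 1))               # start a new run
    return runs


def rle_decode(runs):
    """Rebuild the original by repeating each character `count` times."""
    return "".join(char * count for char, count in runs)


# Hypothetical row of white (W) and black (B) pixels.
row = "W" * 8 + "B" * 3 + "W" * 5 + "B" * 2
encoded = rle_encode(row)

print(encoded)                     # [('W', 8), ('B', 3), ('W', 5), ('B', 2)]
print(rle_decode(encoded) == row)  # True
```

Eighteen characters become four pairs, and nothing is lost. Note that this only pays off when the data contains long runs. Data with few repeats can actually get bigger.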

If you want to understand how some of these methods work, the videos below give a fairly good overview of how Huffman and Run Length Encoding both work. You do not need to know this specifically for your exam, however it will help you to understand what’s going on and how it is possible to reduce data without losing quality or meaning.
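For the curious, here is a minimal sketch of Huffman Encoding. The idea is that characters which appear often are given short binary codes and rare characters are given longer ones. The function names and the sample message are assumptions for illustration, and real implementations also need to store the code table alongside the compressed bits.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Build a bit-string code per character; frequent characters get shorter codes."""
    # Each heap entry is [total frequency, tie-breaker, {character: code so far}].
    heap = [[freq, i, {char: ""}]
            for i, (char, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    if len(heap) == 1:                  # only one distinct character
        return {char: "0" for char in heap[0][2]}
    counter = len(heap)
    while len(heap) > 1:
        low = heapq.heappop(heap)       # the two least frequent groups
        high = heapq.heappop(heap)
        merged = {c: "0" + code for c, code in low[2].items()}
        merged.update({c: "1" + code for c, code in high[2].items()})
        heapq.heappush(heap, [low[0] + high[0], counter, merged])
        counter += 1
    return heap[0][2]


def encode(text, codes):
    return "".join(codes[c] for c in text)


def decode(bits, codes):
    lookup = {code: c for c, code in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in lookup:           # no code is the start of another code
            out.append(lookup[current])
            current = ""
    return "".join(out)


message = "aaaaaaaabbbbccd"
codes = huffman_codes(message)
bits = encode(message, codes)

print(len(message) * 8)                # 120 bits at 8 bits per character
print(len(bits))                       # 25 bits after Huffman encoding
print(decode(bits, codes) == message)  # True
```

Here “a” appears most often, so it gets a one-bit code, while the rare “c” and “d” each get three bits. The message can still be decoded exactly.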

Choosing a compression method, as with many things in computing, is a trade-off between quality and file size. Often, lossy compression methods will achieve smaller file sizes than lossless methods. The trade-off is that lossy methods lose quality whereas lossless methods do not.

In recent times, lossless compression has become more widespread as users demand ever higher quality video and audio content. Lossless compression is especially prevalent in the music industry, and the usual suspects such as Spotify and Apple Music offer lossless audio streaming. This is only possible because of improvements in mobile internet speeds (4G and 5G) and a general improvement in home broadband and wireless connection speeds.

To summarise:

  • Lossless compression makes file sizes smaller without losing any quality
  • Compressed files can still reproduce the original source exactly
  • Lossless compression may not achieve the same level of compression as lossy compression
  • Text files must only be compressed using lossless methods.