
Cryptography for laymen, part 2: encryption for data confidentiality

One of the most common uses of cryptography, and arguably the practice that led to the creation of the modern field, is to conceal sensitive messages from prying eyes. This is known as encryption. Even without the slightest affiliation to any IT-related field, if you're reading this article on a computer system[1], you're bound to have encountered the term. This article is meant to cover the core concepts behind it and, more importantly, their implications for our lives.

In line with pretty much every other paper on cryptography, I feel compelled to mention one of the earliest and best-known systems meant to protect communication: Caesar's cipher. When he wished to keep his correspondence private (for example, with his generals), Julius Caesar would obfuscate the words contained in said texts by replacing each letter with the one three positions earlier in the alphabet: D would become A, E would become B and so on, while the first three letters, A, B and C, would be translated to the last three, X, Y and Z respectively. The result would effectively look like gibberish (that's dfyybofpe after applying this transformation), unless you knew how it was generated, in which case it would become trivial to decrypt.
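For the curious, the transformation can be sketched in a few lines of Python. This is only an illustration: it assumes the modern 26-letter alphabet and lowercase input, and the function name caesar_shift is my own.

```python
import string

ALPHABET = string.ascii_lowercase  # assumption: modern 26-letter alphabet, lowercase only


def caesar_shift(text: str, shift: int) -> str:
    """Replace each letter with the one `shift` positions along, wrapping around."""
    result = []
    for ch in text:
        if ch in ALPHABET:
            position = (ALPHABET.index(ch) + shift) % len(ALPHABET)
            result.append(ALPHABET[position])
        else:
            result.append(ch)  # leave spaces and punctuation untouched
    return "".join(result)


ciphertext = caesar_shift("gibberish", -3)  # three positions earlier, as described above
recovered = caesar_shift(ciphertext, 3)     # shifting the other way decrypts
print(ciphertext, recovered)
```

Encrypting and decrypting are the same operation with the shift reversed, which is all there is to it.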

At this point I feel it's important to explain a term I introduced earlier. A cipher is an algorithm, a process, by which a message which needs protecting, commonly called a plaintext or cleartext[2] (the word gibberish in the previous example), is transformed into something which can't be understood, termed ciphertext (dfyybofpe). Nowadays these don't necessarily refer to text. The plaintext could be an image, a video, a piece of software, while most ciphertexts would be unintelligible sequences of bits. The terms stuck though.

And since I went down the clichéd path of mentioning the antique cipher that's absolutely useless today, I might as well continue with the oft-used analogy of locks and keys. We're all familiar with these inventions that provide a comforting level of security, even in modern times. With the right key, you can open and close the door, cabinet, chest or whatever else the lock is installed on. Without it, you are mostly out of luck, even with a key for the same lock model. In our analogy, the lock model corresponds to a cipher. There are only so many lock types in the world, just as there are only a few ciphers. It's the many variations of the key that make the difference, so it stands to reason a similar mechanism would be required for cryptography. It just so happens that that's exactly the case: a cipher is paired with a key to encrypt the plaintext to ciphertext. The format of the key, of course, depends on the cipher. So if we take Caesar's cipher, we can consider the algorithm to be the substitution of a letter with a corresponding letter in the alphabet a number of positions down, with this shift amount representing the key. However, not only did Caesar overuse the key "3", his cipher also has a serious weakness – it only has 22 possible keys[3]. While manufacturing 22 physical keys might take a while, trying them out in a cipher is arguably faster. In fact, the privacy of his messages relied on the secrecy of the cipher itself, which is something that we've stopped doing. The ciphers in use today are known to everyone and this presents an interesting advantage – many scientists around the world are actively trying to find weaknesses in them and when they do succeed[4], the information is made public[5]. In practice, the vast majority of applications use a handful of ciphers, with one being particularly common – AES. This effectively puts the entire burden on the key.
If one can discover the right key (which is nothing more than a number), an encrypted message can be decrypted. The trick is to ensure that there are too many of them to even consider trying to search for the right one. Nowadays it's common to use at least 128-bit keys, which provides 340.282.366.920.938.463.463.374.607.431.768.211.456 possibilities. Just for the sake of a completely irrelevant comparison, this is billions of times more than even the most generous estimates for the number of grains of sand on Earth. It won't always be sufficient – as computers get faster and the science advances, the search space will need to be increased.
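To put the two key spaces side by side, here's a rough sketch. It assumes the modern 26-letter alphabet (so 25 useful shifts rather than Caesar's 22), and the attacker speed of a trillion guesses per second is a number I made up purely for the arithmetic.

```python
import string

ALPHABET = string.ascii_lowercase  # assumption: 26 letters, hence 25 non-trivial shifts


def caesar_shift(text: str, shift: int) -> str:
    """Shift every lowercase letter by `shift` positions, wrapping around."""
    return "".join(
        ALPHABET[(ALPHABET.index(ch) + shift) % 26] if ch in ALPHABET else ch
        for ch in text
    )


# Brute force: try every possible key and let a human (or a dictionary) spot the winner.
candidates = [caesar_shift("dfyybofpe", shift) for shift in range(1, 26)]

# A 128-bit key is just a number between 0 and 2**128 - 1.
keyspace_128 = 2 ** 128

# Hypothetical attacker speed, chosen only for illustration.
GUESSES_PER_SECOND = 10 ** 12
SECONDS_PER_YEAR = 60 * 60 * 24 * 365
years_to_try_all = keyspace_128 // (GUESSES_PER_SECOND * SECONDS_PER_YEAR)

print(len(candidates), "gibberish" in candidates)
print(keyspace_128, years_to_try_all)
```

The first search finishes before you can blink; the second one, at that assumed speed, would take vastly longer than the age of the universe.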

Before we get to the fun part and in order to explain the titles of the following sections, it would be good to mention that, technically speaking, what we've referred to so far are symmetric-key ciphers, for symmetric encryption. As the name suggests, this simply means that the same key is used for both encryption and decryption; the process may or may not be identical, but this is not relevant to the classification. This is in contrast to asymmetric-key ciphers, which work differently and have different purposes; we'll keep this topic for a future date.

Properties of symmetric ciphers

We're not going to discuss a cipher's properties from a cryptanalyst's point of view (to whom these articles would be incredibly boring). But there are several things which are always important to remember when dealing with cryptography (or when reading a news piece). While some of these may, in fact, be common sense, it's their sometimes-overlooked implications which make them interesting. The risks that I'm trying to bring to focus generally stem from an unfounded and excessive trust of encryption as some form of core and sufficient means to ensure information security. Unless used correctly, encryption may not only be ineffective, but also counter-productive, if it creates a false sense of security. Of course, in the right contexts, it can be an invaluable tool.

Ciphertext should, ideally, be indistinguishable from random data. However, in most real-life applications, this is not quite true and forms the basis of cryptanalysis[6]. If not used properly or in certain circumstances, ciphertext can leak information. As obvious as it may be, it should be stated that, at the very least, intercepting an encrypted message can be proof that a message was exchanged between two parties – encryption is not an invisibility magic spell. Not to mention that an approximate length of the plaintext can be easily deduced (or, in some cases, the exact length). Imagine what happens if Alice and Bob can only communicate through a limited set of three possible messages of wildly different lengths (e.g. "stay home", "come over to my place" and "join me at our favourite restaurant"); if these are also known to Eve, who can eavesdrop on their conversation and is interested in finding out which of the three Alice sent Bob, simply applying encryption won't help them. All Eve has to do is deduce the message based on the length of the ciphertext. There's a simple way to prevent the issue, but I'll leave that as an exercise for you.
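Eve's side of this scenario fits in a few lines. The cipher below is a toy I put together from SHA-256 just to have something length-preserving to play with; it's an assumption for illustration, not a vetted algorithm and not something to reuse.

```python
import hashlib
import os


def toy_keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key, nonce and a counter. Illustration only."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def toy_encrypt(key: bytes, nonce: bytes, plaintext: bytes) -> bytes:
    """XOR the plaintext with the keystream; the output has the same length as the input."""
    keystream = toy_keystream(key, nonce, len(plaintext))
    return bytes(p ^ k for p, k in zip(plaintext, keystream))


MESSAGES = ["stay home", "come over to my place", "join me at our favourite restaurant"]

# Alice encrypts one of the three messages with a key Eve never sees.
key = os.urandom(16)
nonce = os.urandom(16)
intercepted = toy_encrypt(key, nonce, MESSAGES[1].encode("utf-8"))

# Eve only needs a table of lengths, not the key.
length_to_message = {len(m.encode("utf-8")): m for m in MESSAGES}
eve_guess = length_to_message[len(intercepted)]
print(eve_guess)
```

Nothing about the cipher was broken here: the length alone gave the message away.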

It gets even more fun: encrypting the same plaintext with the same key will yield… (drum roll) the same ciphertext. Intercepting two identical ciphertexts will point to the fact that the same message was exchanged twice – that's information. This can be relevant in many circumstances, but let's provide a simple example: a database containing the marital status of sufficiently many people. If the marital status values are encrypted with the same key, all married people will have the same ciphertext value associated; as will all singles; and all divorced. By using statistical information of the population, one can easily derive which is which, effectively bypassing the encryption. As before, all isn't lost: there are cryptographic methods to address the issue – they just need to be implemented properly.
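Here's a small sketch of the marital status example. The deterministic cipher is a toy of my own (one fixed keystream derived from the key, no IV) and the population split is invented; the values are padded to the same width so that only the repetition, not the length, leaks.

```python
import hashlib
import os
from collections import Counter


def toy_deterministic_encrypt(key: bytes, plaintext: bytes) -> bytes:
    """Toy cipher without an IV: same key and plaintext always give the same ciphertext."""
    keystream = hashlib.sha256(key).digest()  # 32 bytes, enough for these short values
    return bytes(p ^ k for p, k in zip(plaintext, keystream))


# Invented data set: 50 married, 35 single, 15 divorced.
records = ["married"] * 50 + ["single"] * 35 + ["divorced"] * 15

key = os.urandom(16)
encrypted_column = [
    toy_deterministic_encrypt(key, status.ljust(8).encode("utf-8"))  # pad to equal width
    for status in records
]

# The attacker knows (from public statistics, assumed here) the ranking of the statuses.
known_ranking = ["married", "single", "divorced"]
by_frequency = [ct for ct, _ in Counter(encrypted_column).most_common()]
mapping = dict(zip(by_frequency, known_ranking))

recovered = [mapping[ct] for ct in encrypted_column]
print(len(set(encrypted_column)), recovered == records)
```

A hundred encrypted rows collapse into three distinct ciphertexts, and counting them is enough to label every row without ever touching the key.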

Encryption by itself can, at most, provide confidentiality – if Alice sends an encrypted message to her bank and Eve intercepts it, she shouldn't be able to retrieve the contents. But its integrity isn't guaranteed, which means that if a different, more advanced attacker, let's call him Mallory, can modify the message before it reaches its destination, the bank will still be able to decrypt it, resulting in a different plaintext than what Alice sent. Ideally, the result should be completely different, garbled data. But this is not always the case. Imagine if Mallory knows the format of the message; he might just be able to change some important part of it, such as the amount for a transfer. To achieve integrity (or even authenticity), you need to throw in some other cryptographic primitives, better left for a future article.
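Mallory's trick is easiest to see with a cipher that XORs a keystream into the message. The sketch below uses a toy cipher built from SHA-256 (my own stand-in, for illustration only) and an invented message format; it assumes Mallory knows where the amount sits and what its first digit was.

```python
import hashlib
import os


def toy_keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key, nonce and a counter. Illustration only."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def toy_xor_cipher(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Encrypts or decrypts: XOR with the keystream is its own inverse."""
    return bytes(d ^ k for d, k in zip(data, toy_keystream(key, nonce, len(data))))


# Alice and the bank share the key; Mallory does not.
key = os.urandom(16)
nonce = os.urandom(16)
plaintext = b"TRANSFER 0000100 EUR TO ACCOUNT 12345"  # hypothetical message format
ciphertext = toy_xor_cipher(key, nonce, plaintext)

# Mallory flips the first digit of the amount (byte 9) from "0" to "9" without the key.
tampered = bytearray(ciphertext)
tampered[9] ^= ord("0") ^ ord("9")

# The bank decrypts happily and sees a very different amount.
received = toy_xor_cipher(key, nonce, bytes(tampered))
print(received)
```

The rest of the message decrypts untouched, so nothing tells the bank that anything happened.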

The most important property, which I simply can't stress enough, is that encryption is not some magic trick to protect sensitive data. Ciphertext is unusable for processing purposes and needs to be decrypted to be of any use; the decryption, of course, requires the key. Storing the key and ciphertext side-by-side is just as secure as leaving the key inside your front door's lock – you might as well not bother having the lock in the first place. A corollary to this is that losing the key to a well-generated ciphertext is tantamount to losing the data. This last property is, in fact, quite useful: you can "delete" a large amount of ciphertext quickly by simply discarding the key[7]. To summarise, encryption can only ever protect the data if the malicious entity can access the ciphertext, but not the key; the entities that actually need to use the data require the key, without which the ciphertext is useless.

Use for symmetric ciphers

By far the most common use for symmetric ciphers is to protect messages during transmission: encryption in transit. Imagine what happens when Alice wants to send a private letter to Bob. If there's a risk that it could be intercepted, they could make use of symmetric encryption to protect the contents. First, they must, somehow, come up with a key that they both have, without running the risk of anyone else having access to it – in this instance, it's commonly called a shared secret. This exchange is something symmetric ciphers simply can't help with. For this thought experiment, you can consider they agreed upon the key when they met face-to-face; in practice, other cryptographic functions, which of course we'll talk about later, are used to achieve this goal. Once this step is complete, they can encrypt and decrypt messages freely – but remember about the limitations presented earlier. Even if Eve can intercept their ciphertext, she shouldn't be able to recover the corresponding plaintext.

The Internet has made encryption in transit quite popular because of its architecture: communication between two parties transits multiple independent networks, under the administration of entities not necessarily known to either party, each of which could potentially intercept or eavesdrop on it. Furthermore, the originating network may fairly often not be trustworthy, whether it's a café's public Wi-Fi or a hotel's guest network. And how can we forget that communication lines can be rather easily tapped? So here comes cryptography to the rescue! A well-implemented encryption in transit scheme will fully mitigate these risks.

It's important to remember that encryption in transit is really only what its name suggests: the protection applies exclusively during the transfer of data. Both parties to the communication handle the decrypted data. Besides the obvious importance of correctly establishing the identity of the other peer, it also means that an attack mounted at either end of the transmission will not be in any way hindered by the encryption. For example, all Internet users should know they must only provide credit card details to a website using HTTPS – a secure connection; however, this only provides security during the transfer and the website must employ other means to ensure the secrecy of this very important data.

Lastly, it should be mentioned that without forward secrecy, a future attack that compromises one of the parties to the conversation could potentially disclose the contents of a prior message exchange, if the attackers kept the encrypted transfers in the hope of such an endeavor. This is something that applies to the way the key is generated when a secure channel is established, through methods covered in a forthcoming article, but it was worth mentioning here – you might not be out of the woods as soon as a data transfer finishes.

Encryption at rest is a whole different story, although the same ciphers are used and, after all, the aim is quite similar – the protection of information. What makes encryption in transit in the context of the Internet so great is that the contents of a message need not be known to relaying parties for it to get from point A to point B. With encryption at rest, the context is different and its usefulness is less clear-cut. The basic concept is always the same: you obtain a piece of data that requires storing; you first encrypt it with a key, before passing it on to a storage medium; when the data needs to be processed, the associated ciphertext is first retrieved from storage and then decrypted with the same key that was used earlier; the crucial aspect is ensuring the key is kept away from the data: an attacker that gains access to both would, effectively, overcome the encryption.

Consider a scenario where Alice wants to ask Bob to keep data on her behalf, but does not trust him enough to share the contents of said data. With encryption, all she has to do is keep the key private and only share the ciphertext with her storage provider, Bob. Ignoring the issue of availability, weaknesses in his solution would not have an impact on the secrecy of Alice's data. However, this would hinder any efforts that Bob might want to engage in to offer complementary services. A search index function? Infeasible. A pretty interface accessible over a web browser? Impractical. And how can Bob look Alice in the eyes as he tells her the data so securely kept by him is essentially worthless and, practically speaking, lost because she misplaced a key? As is commonly the case for real-world situations, security implies a compromise in functionality. All of this is due to the fact that, for all intents and purposes, Bob does not have any access to the original data and no means to obtain it. As eye-opening as this example might be, it's also quite removed from reality – remember the "hacked" celebrity iCloud accounts and leaked photographs? Would encryption have helped? Surely, but then the service would not have been as useful as customers would've expected and, ultimately, would've lacked any real traction and revenue. There are many ways to deploy encryption at rest and the main differentiator between the strategies is the key management procedures. With that in mind, let's talk about a few applications in order to shed some light on the practices and their strengths and weaknesses.

Disk encryption[8] is a common application and a very good example of encryption for data at rest. It's simple to achieve and, quite importantly, cheap to implement. Its only weakness is the very limited set of risks it helps mitigate. By definition, it can only protect data when the storage medium is physically stolen. The MacBook I'm writing this article on has an encrypted storage drive. But remember that ciphertext is useless unless decrypted. As soon as I type my password in, all protections effectively evaporate – all stored data is accessible to the operating system: otherwise, it simply couldn't do its job. No virus or remote hacker is going to feel in any way obstructed by my conscientious efforts. The only mitigated risks are those equivalent to a thief snatching the whole device from my bag on a train I accidentally fell asleep on; or a house burglar. A data centre environment doesn't change much, either; except that entry security is probably already ensured through other means[9]. In order to gain this confidentiality, however, the key must be kept private – ideally memorised by the system's user and manually provided at start-up. Storing the key within the system to provide a more seamless experience will, of course, limit its usefulness, or, even worse, render it superfluous.

Database encryption is often in the spotlight for a very good reason: people expect their data to be protected by the companies they provide it to; in turn, the companies want to be seen as taking security seriously. Too often, when a data breach occurs, the first question asked is "Was the data encrypted?". Unfortunately, it's too abstract of a concept to evaluate without intimate knowledge of the processes. For example, transparent data encryption (TDE) is a database encryption technology that protects the data when it's stored by the database engine. However, the rest of the system has access to the plaintext, which means a remotely exploited vulnerability will still easily access the data. In effect, it's quite similar to the disk encryption discussed earlier – except the key is generally also stored in the database system, so that it doesn't have to be provided manually by the system administrators – convenient, but less secure. It has its benefits and it does mitigate some risks, but few in respect to hacking attacks[10]. Would a spokesperson be able to answer "Yes" to the aforementioned question? Surely. But it's the question that's wrong, not the answer.

There are other methodologies that may provide some more consistent benefits, even in the face of the spooky hackers. But, remember that a company storing data on its customers' behalf will likely need access to it and ciphertext is a poor substitute. Generally speaking, if the encrypted data and the encryption key are kept separately, an attacker would require access to both in order to accomplish any real harm. However, the simple fact that data needs to be processed means there are points in the system where it’s decrypted to plaintext – making these components attractive targets. In the real world, encryption can be a useful tool to increase the number of hurdles a malicious actor would need to bypass. But, despite popular wisdom, it's far from an all-powerful cure.

This finally brings us to the trigger for this series of articles: the Grindr data scandal. The official statement mentioned that data was encrypted when it was shared with third-parties. Had only ciphertext been provided, no data sharing would actually have occurred and, crucially, the outside companies would not have been able to provide their services. Which, of course, only leaves the option that encryption in transit was employed: a security measure that only provides confidentiality against attackers that might intercept the transfers. In the end, the only conclusion I can reach, one that is by no means authoritative and is, notably, based on lack of disclosed information, is that the word "encryption" was used to soften the blow and serve as a PR exercise.

The rest of this article is going to delve into more specific concepts and, unless you're interested in the gritty technical details, generally only useful to the IT workforce, can be safely skipped. The main takeaway so far is that encryption is not some pixie dust[11] to be sprinkled onto a system to magically make it more secure. Security is based on a lot more than just cryptography.

Cipher types and modes

If you ever worked with encryption, you'd undoubtedly have come across a basic classification: block versus stream ciphers. This refers to a core characteristic of the algorithm and has consequences for its properties and uses.

A block cipher takes a key and a fixed-size plaintext (called a block) and applies its operation to yield the ciphertext, generally the same size as the input. If the data to be encrypted is shorter than the required block size, it has to be padded. If it's longer, it has to be split and padded so as to obtain a sequence of equal-sized blocks. 64 and 128-bit block sizes are common for the ciphers in use today, although some are flexible and support multiple configurations.
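As an illustration of the padding and splitting step, here's a minimal sketch of PKCS#7 padding, a common scheme for this job. The helper names are my own and the 16-byte block size is assumed to match AES.

```python
BLOCK_SIZE = 16  # bytes, i.e. 128 bits; assumed to match AES


def pkcs7_pad(data: bytes, block_size: int = BLOCK_SIZE) -> bytes:
    """Append N bytes of value N so the length becomes a multiple of block_size."""
    pad_len = block_size - len(data) % block_size  # 1..block_size; a full block if already aligned
    return data + bytes([pad_len]) * pad_len


def pkcs7_unpad(padded: bytes) -> bytes:
    """Check and strip the padding added by pkcs7_pad."""
    pad_len = padded[-1]
    if pad_len < 1 or pad_len > len(padded) or padded[-pad_len:] != bytes([pad_len]) * pad_len:
        raise ValueError("invalid padding")
    return padded[:-pad_len]


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list:
    """Cut already-padded data into equal-sized blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


message = b"attack at dawn, bring snacks"  # 28 bytes, so 4 bytes of padding are needed
blocks = split_blocks(pkcs7_pad(message))
print(blocks)
print(pkcs7_unpad(b"".join(blocks)))
```

Note that a message which already fills its last block still gets a whole extra block of padding; otherwise the receiver couldn't tell padding from data.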

A stream cipher, on the other hand, generally operates at the bit level and requires no padding. Given a key as input, it will generate an infinite sequence of bits, called the keystream, which should have pseudorandom properties. Each bit from the plaintext is then combined, through a simple operation, with the corresponding bit in the keystream to result in the ciphertext. The use of XOR for this purpose allows the encryption and decryption functions to be one and the same.
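The XOR property is easy to check for yourself. The keystream generator below is a toy I built from SHA-256 and a counter; it's an assumption made for illustration, works on bytes rather than single bits, and is no substitute for a real stream cipher.

```python
import hashlib
import os


def toy_keystream(key: bytes, length: int) -> bytes:
    """Toy keystream: hash the key with a running counter. Illustration only."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def toy_stream_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with the keystream; calling it twice with the same key returns the input."""
    return bytes(d ^ k for d, k in zip(data, toy_keystream(key, len(data))))


key = os.urandom(16)
plaintext = b"no padding needed"

ciphertext = toy_stream_cipher(key, plaintext)  # encrypt
recovered = toy_stream_cipher(key, ciphertext)  # decrypt with the very same function
print(len(plaintext), len(ciphertext), recovered)
```

The ciphertext is exactly as long as the plaintext, and a single function serves both directions.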

The most widely known and, by extension, the most commonly used cipher today is AES – the Advanced Encryption Standard. The name was adopted by NIST (National Institute of Standards and Technology), a US government agency, when it selected the Rijndael algorithm as the new standard for use within the US public institutions. A block cipher with variable block and key sizes, it was limited to 128-bit blocks and key sizes of 128, 192 or 256 bits during the standardisation process. It has superseded the previous standard, DES – Data Encryption Standard[12].

As block ciphers, by definition, operate on a sequence of fixed-sized blocks, the way these are encrypted or decrypted has evolved to be known as the mode of operation. This writing is not meant to serve as some kind of reference, so I'll restrict it, for the sake of conciseness, to a few common ones:

  • ECB: Electronic Codebook. The simplest and generally one to avoid. Each block is independently encrypted in exactly the same way with the same key. Identical plaintext blocks within the same message will end up as identical blocks of ciphertext – what more can a cryptanalyst ask for if they want to break the encryption?
  • CBC: Cipher Block Chaining. Old (from 1976), reasonably secure, but with a few peculiarities that have been exploited in recent attacks (such as POODLE) – which is not to say that it's necessarily insecure. Each ciphertext block is combined with the following plaintext block before its encryption to avoid ECB's weakness. An initialisation vector (commonly called IV) is used to seed the operation and, provided a different one is chosen for each encrypted message, ensures that identical messages will be encrypted to completely different ciphertexts, even when the same key is used. For this very reason, IVs are employed for other modes as well and even for stream ciphers.
  • GCM: Galois/Counter Mode. Uses an initialisation vector in a way that makes it even more important that different IVs are chosen for each message, but does not share CBC's peculiarities. Unlike the previous two, this is an authenticated encryption mode – it provides integrity in addition to confidentiality – the former are normally paired with other cryptographic primitives to achieve this goal.
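The difference between ECB and CBC can be sketched without pulling in a real cipher. The block function below is a toy four-round Feistel network that I'm using as a stand-in for AES (an assumption for illustration only, with no security claims); only the encryption direction is shown, since that's enough to see the pattern.

```python
import hashlib
import os

BLOCK = 16  # bytes, i.e. 128 bits like AES


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Toy 4-round Feistel network standing in for AES. Illustration only."""
    left, right = block[:8], block[8:]
    for round_no in range(4):
        f = hashlib.sha256(key + bytes([round_no]) + right).digest()[:8]
        left, right = right, xor(left, f)
    return left + right


def ecb_encrypt(key: bytes, blocks: list) -> list:
    """ECB: every block is encrypted independently with the same key."""
    return [toy_block_encrypt(key, b) for b in blocks]


def cbc_encrypt(key: bytes, iv: bytes, blocks: list) -> list:
    """CBC: each plaintext block is XORed with the previous ciphertext block (or the IV) first."""
    out, previous = [], iv
    for b in blocks:
        previous = toy_block_encrypt(key, xor(b, previous))
        out.append(previous)
    return out


key = os.urandom(16)
blocks = [b"SAME 16 BYTES!!!"] * 3  # three identical plaintext blocks

ecb = ecb_encrypt(key, blocks)
cbc_first = cbc_encrypt(key, os.urandom(BLOCK), blocks)
cbc_second = cbc_encrypt(key, os.urandom(BLOCK), blocks)  # same message, fresh IV

print(len(set(ecb)), len(set(cbc_first)), cbc_first != cbc_second)
```

In ECB the three blocks come out identical, while CBC hides the repetition, and a fresh IV makes the same message encrypt differently each time.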

All of this means that employing encryption is not just a matter of choosing your favourite cipher from the toolkit and throwing data at it. The type and, if we're talking about block ciphers, the mode of operation have important consequences to consider.

The practical bits

Before you consider applying the following in production code, I feel a duty to stress that using such low-level primitives comes with risks. Unless you know exactly what you're doing, using higher-level trusted and proven recipes, such as Fernet, is the safer way to go. That being said, the following code demonstrates the process of encrypting a message using AES128 (AES with 128-bit keys) in CBC mode, without offering any kind of integrity protection. It's meant to highlight the various steps which need to be taken.

>>> import os
>>> from cryptography.hazmat.primitives import padding
>>> from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes
>>> from cryptography.hazmat.backends import default_backend

>>> backend = default_backend()
>>> key = os.urandom(16)  # This needs to be kept secret!
>>> iv = os.urandom(16)  # Must be different for every message; it doesn't need to be secret

>>> cipher = Cipher(algorithms.AES(key), modes.CBC(iv), backend=backend)
>>> encryptor = cipher.encryptor()
>>> padder = padding.PKCS7(cipher.algorithm.block_size).padder()

>>> plaintext = "I promise to first consider using proven recipes before copy-pasting this code!"
>>> padded_data = padder.update(plaintext.encode('utf-8')) + padder.finalize()
>>> len(padded_data)*8 % cipher.algorithm.block_size
0
>>> ciphertext = encryptor.update(padded_data) + encryptor.finalize()
>>> len(ciphertext) == len(padded_data)
True
>>> ciphertext
b"w\xc3\xad\xe6&(\x0b\xf9>\x02~q\n\xac\x97DL\x99\xd9\xf2\xac1\x02\xa2\x9e4\xca\xf7dV\x98\x1dr\xd6\x06\x02\xc8\xfa\xceo\x8e\xfex\x04^q\x1c\xba`\x0e\x1b\xdc'\xce\xa8$1\x17\xf3\x86\x8d\xf7\xc6\x10\xf33L\xdb\x94q\xc4\x0e\x1b`\xfa\tk\x1f\xf4\xdd"

>>> # decrypting means applying the inverse operations in the reverse order
>>> decryptor = cipher.decryptor()
>>> unpadder = padding.PKCS7(cipher.algorithm.block_size).unpadder()
>>> (unpadder.update(decryptor.update(ciphertext) + decryptor.finalize()) + unpadder.finalize()).decode('utf-8')
'I promise to first consider using proven recipes before copy-pasting this code!'

There you have it – just a few easy steps and you end up with a sequence of undecipherable bytes which, ironically, don't really provide the best level of security: using a message authentication code is, for most applications, a must. However, this article was only ever meant to deal with symmetric encryption, so, to keep you hanging, we'll leave that subject for a subsequent piece.


  1. I use this term loosely, given the current proliferation of smartphones. ↩︎

  2. Yes, in one word, no hyphens. ↩︎

  3. The classical Latin alphabet had 23 letters. ↩︎

  4. It is generally a matter of when, not if. ↩︎

  5. It is of course common sense that the secret services might come across such vulnerabilities first and keep them to themselves, but it then only becomes a matter of time before the discoveries are reproduced. Plus, they'd have to surreptitiously get other government agencies using the weak cryptography to switch, which could be a challenge in its own right. ↩︎

  6. Incidentally, this is why the one-time pad is mathematically proven to be unbreakable if used correctly – it is random data. But I'm getting ahead of myself. ↩︎

  7. A prime example of this feature in action is the "Erase All Content" function of Apple's iOS. ↩︎

  8. Quite often called filesystem encryption. ↩︎

  9. Disk encryption does bring other benefits too – it makes it easier to safely discard failed storage media. ↩︎

  10. One benefit is that backups are automatically encrypted, which does mitigate against hacking attacks targeting the backups rather than the main database. ↩︎

  11. I can't take credit for coming up with the idea of calling encryption pixie dust, but I can't remember where I first encountered the expression used in this context so as to provide a reference. ↩︎

  12. The original cipher's name is Lucifer. ↩︎

Luci
