How security and privacy impacts the database design

How security and privacy impacts the database design

What I like about the banner picture above is that it is a very useful analogy to our problem at hand. A database is nothing else than a glorified electronic version of it. Lots of small storage boxes containing some data, and identified by a label or number, usually called "primary key".

In security, the main concern is protecting against breaches altogether. However, it is also a good idea to think about mitigating impact in case of breaches. Assuming we deal with some kind of sensitive user data, let's have some thoughts on what should be on the labels and the content of these wooden boxes.

What should be the user "identifier"?

Should it be the e-mail? Or a username? ...

TL;DR: No, it should be an anonymous ID.

How to identify users is something to be decided at the very beginning. It's crucial information that spreads everywhere: in the database, in links, in software logic... Soon, everything will be tied to it and too late to change. Doing so will be extremely difficult in the best case, or impossible in the worst case.

But why should it be an anonymous ID? ...It's not only about data breaches.

That ID is probably going to show up in extern systems too. In URLs, in logs, in databases, sent to third-party services... So, in the grand scheme of things, privacy-wise, it's better to be anonymized.

Security-wise, using a username/e-mail as the primary ID is not a vulnerability by itself. However, it slightly increases the "attack surface". For example, imagine you have a REST API with some endpoints accidentally not properly controlling access. If they accept a username/e-mail as parameter, it's trivial to invoke these endpoints with other user's username/e-mail as input, since these are fairly easy to find out. On the other hand, using anonymized IDs would already make exploiting such a vulnerability more difficult, since you don't know other's ID in the first place. This not only applies for REST API, but in all kinds of hacking, whether it's digging in a stolen file or eavesdropping traffic. It does not make your system invulnerable, but it's an additional layer of safety.

Imagine again the wooden boxes above, just don't place names on it for everyone to see but use some other identifier instead. For example, use the SHA256 of the username/e-mail with a salt. That's simple, deterministic and anonymous.

What about the stored data?

Most of the time, plain data is read and written directly to the database. The connection is typically protected by credentials to restrict access. This is all good and fine but the main access credentials become critical. If the access credentials are leaked or stolen, all the data can be extracted. Plain and simple.

Depending on the sensitivity of the data, like for example personal or financial data, another layer of protection is required: encrypting the data. Instead of placing the original text sheet in the wooden box, you encrypt it beforehand. Reading and writing encrypted data in the database makes it unreadable to others. Even database administrators would not be able to pry into the stored data, nor any other software not in possession of the encryption key. Likewise, this also increases privacy.

Note that although most databases also feature some encryption mechanisms, it typically refers to the raw data persisted on disk. To make the database storage files unreadable when stolen. These should not be mixed up and ideally, both should be used.

TL;DR: always encrypt all personal, financial or otherwise sensitive information when storing it in a database

Hash what you can!

Not only passwords but also recovery codes and other nonces.

As an anecdote, it reminds me of a story about a hacker who found an SQL injection vulnerability, enabling him to pry into the database. Of course, the user passwords were hashed. Even the sensitive data was encrypted. On the other hand, the recovery codes were stored "as is". The hacker now just had to start a recovery procedure for an arbitrary user account, and then look up the corresponding plain text recovery code. Bam! The hacker could now simply complete the recovery procedure and reset the password for any user account. That escalated quickly.

So if you don't really need the plain text data, don't even store it, just a hash of it.

TL;DR: hash not only passwords but also recovery codes, nonces, challenges and such.

Pairwise identifiers

User accounts are usually identified by some sort of ID. It might be a hash, a UUID, a username, an e-mail, whatever. When some third-party interacts with your service, it'll likely use this ID to identify the user or resource.

This also means that this ID is "universal" and that everyone out there will use this ID as an identifier. A user can be tracked that way, and attack attempts can be carried out that way using the known ID from another party.

A concept to push privacy and security further is called "pairwise identifiers". Each "consumer" of your service will be provided with different user identifiers.

Third-party service XYZ: can you please send me data related to user ABC123 ?

Provider: sure ...let me check ...for you XYZ, the account ABC123 is mapped to "Bob" ...here you go.

Of course, "Bob" is not the username, but should be the internal primary id.

It's about providing a mapping so that each third-party sees different IDs, so that they cannot correlate users among themselves. Also, it prevents leaked IDs from being used by others. As usual, these benefits both privacy and security.

TL;DR: it's best to provide third-party services anonymized IDs, unique third-party for each third-party. Both for privacy and security.

Good ol' cookies

This last advice is not directly related to the database but rather to how to maintain browser sessions. There are two camps: "use a good ol' session ID cookie" vs "use a JSON Web Token" (In an Authorization header or as cookie). The latter sounds more modern and trendy but it has drawbacks regarding security.

Let's review the good old way of handling sessions first. It's fairly straightforward: set a "Http-Only" cookie with some random session identifier when the user is authenticated. Voila, done. The browser will automatically send the cookie on each subsequent request and it cannot be read nor written by scripts. Server-side, you can retrieve the session data based on that id. Then, when the user signs out you can remove the cookie and clear your session data. Same if the user is inactive for too long. Simple, effective and there is not much that can go wrong.

Regarding the JWT tokens, there are two dangers with it. First, if the token which is typically stored browser side is stolen, the attacker can impersonate the user, even long after the user signed out... Except if you keep a database of revoked tokens, which not only loses the main benefit of JWT being "stateless" but also adds undesired complexity.

The second danger is the signing key being stolen, whether it's because of an accident, a vulnerability exploit or a malicious insider. Although this is unlikely because the key should be well protected, the potential consequences of a breach are catastrophic. Basically, your service would be doomed overnight because attackers would be able to impersonate any user at will, bypassing authentication altogether. This is a worst-case scenario that would not be possible for sessions identified by random IDs.

TL;DR: prefer random identifiers in a Http-Only cookie over using JWT tokens for user sessions

Doing it "later" is no good

It might be tempting to delay it, since it introduces complexity to the codebase. However, it's no good. The more you delay it, the more effort and time will be consumed in a later stage refactoring anyway. But even more importantly, while some changes can be delayed, others would break the API and data compatibility for all "consuming" software!

In particular, everything regarding hashed user IDs and pairwise identifiers will be breaking all software and integrations relying on these IDs. Changing these is a "hard cut" that should be done ASAP early on, ideally during the conception phase.

Other changes have a less critical impact. For example, hashing of short-lived recovery codes, nonces, challenges, etc. can be done in a version update. This will invalidate existing codes, nonces, and challenges and cause a small disturbance, but everything will work fine again afterwards.

The only change which is delayable with less concern is the data encryption. This is an internal database change, opaque to the API which remains unchanged and the consumers as well. It can be done in one fell swoop by encrypting all the data in the database at once using a background process.

TL;DR: don't delay it, do it ASAP since it's breaking changes

Side benefits

Following these recommendations will increase the security and privacy of your system, but that's not all. Doing this, although it sounds complex, has its benefits too regarding code structure. It forces the software to access user data through a streamlined access point instead of fetching it directly from the database. When all calls for user data go through this access point, it's also easier to monitor, control access and properly handle the data in this one place.

At least, that's what we experienced after our "endless refactoring" at Passwordless.ID. The refactoring is large, and the "version 2" breaks compatibility, requiring us to deploy a separate version and clear all user accounts. It is a very hard cut. However, we are pleased with the result. The system is now more secure, with better privacy, and better structured than ever before. Something I am proud of!

TL;DR: it will even make your codebase structure cleaner!

Thanks for reading! If you liked it, leave a comment!