I had initially planned to focus this sequel to the 1st part on discussing more use cases, but I’ve come to think it couldn’t hurt to insert a quick presentation of some certificate best practices, in order to make this little series more practical 😉
The following little pieces of advice address three main risks:
- (1) Service outages due to expiring certs or due to failing checks (as in: TLS handshake terminates as ‘something doesn’t match’)
- (2) Compromise of the private key
- (3) Violation of some security objective which a specific certificate is supposed to contribute to (I mean: you use them for an explicit purpose – which you fully & clearly understand, right?). Example: a certificate is used for user authentication, but the only check performed by an endpoint is whether the lifetime is valid.
It should be noted that measures mitigating (1) might increase the risks of (2) or (3), and the other way round. You’ll have to find the proper balance in your environment. This requires understanding of the trust relationships, the trade-offs, etc. => 1st post.
It should also be noted that guidance coming from the fine folks in your infosec department is usually very much centered around (2) and (3). They don’t have to operate the services which actually employ certificates for one well-defined reason or another. Just saying 😉
Finally: being a fan of a ’10 golden rules’ approach (see here for a similar post on IPv6 security) I’ll make it ten. Also some people using certificates occasionally refer to a ‘certificate lifecycle’ which could look like the following. This can help to understand the order of pieces.
Be prepared
Understand (and, even better, document, though depending on your role it’s ok to just reflect on this a bit) in which places in your environment which certificates are in use, which purposes they are used for (and which related checks are performed, see below on different types), which lifetimes they have, what happens when the latter end etc.
Fair chance that some of your services connect to external systems (via HTTPS, evidently), so include those in this exercise. Apply the principles laid out here as well, in talks (for bold minds: audits) with the parties responsible for them.
Reflect on failure scenarios and how you want to deal with them. Discuss those with the relevant stakeholders (discussing during an outage caused by an expiring cert if it’s ok to disable checking cert lifetimes on a specific system/service – and looking for someone who approves the PR implementing such a change – might not be the best moment…). Maybe even write down the results of these conversations (runbooks come to mind).
This should include documenting how to emergency revoke a cert in case of key compromise.
This overall exercise mainly addresses risk (1); the last point (emergency revocation) also addresses risk (2).
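To make the 'document it' part a bit more tangible, here is a minimal sketch of what such an inventory could look like in code. Everything in it is an assumption for illustration: the `CertRecord` fields, the `gaps` helper, the names, dates and runbook path are all made up, and a spreadsheet or a CMDB would do the job just as well.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CertRecord:
    name: str          # where the cert lives, e.g. a host or service name
    purpose: str       # what it is used for, e.g. "server auth"
    not_after: date    # end of lifetime
    renewal: str       # "auto" or "manual"
    owner: str = ""    # who acts when renewal is manual
    runbook: str = ""  # pointer to the documented procedure


def gaps(inventory):
    """Return names of manually renewed certs lacking an owner or a runbook."""
    return [
        rec.name
        for rec in inventory
        if rec.renewal == "manual" and not (rec.owner and rec.runbook)
    ]


# Hypothetical example entries.
inventory = [
    CertRecord("lb-frontend", "server auth", date(2030, 1, 31), "auto"),
    CertRecord("vpn-gateway", "user auth", date(2030, 3, 15), "manual", owner="netops"),
    CertRecord("partner-api", "client auth", date(2030, 2, 1), "manual",
               owner="platform", runbook="runbooks/partner-api-cert.md"),
]

# Soonest expiry first, so the next deadline is always at the top.
for rec in sorted(inventory, key=lambda r: r.not_after):
    print(rec.not_after, rec.name, rec.purpose, rec.renewal)

print("needs attention:", gaps(inventory))
```

The point is less the data structure than the question it forces: for every manually renewed cert, is there a name and a written procedure attached?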
Protect the privkeys
The private keys are the real deal to be protected. Do whatever is needed to protect them. This can include storing them in an encrypted manner, using an appropriate passphrase (which you don’t store together with the keys ;-), strictly limiting access rights to them, and limiting transfers. This is meant to protect against the above risk (2).
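As a small illustration of 'strictly limiting access rights', the following sketch checks whether a key file on a POSIX system is accessible to anyone besides its owner. The helper name `key_file_issues` is made up for this post, and the check covers only classic file mode bits (no ACLs, no parent directories), so treat it as a starting point rather than an audit tool.

```python
import os
import stat
import tempfile


def key_file_issues(path):
    """Return a list of permission problems for a private key file (POSIX mode bits only)."""
    mode = os.stat(path).st_mode
    issues = []
    if mode & stat.S_IRWXG:
        issues.append("group has access")
    if mode & stat.S_IRWXO:
        issues.append("others have access")
    return issues


# Demo on a throwaway file standing in for a key file.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    demo_path = handle.name

os.chmod(demo_path, 0o644)
print("0644:", key_file_issues(demo_path))
os.chmod(demo_path, 0o600)
print("0600:", key_file_issues(demo_path))
os.remove(demo_path)
```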
Memento Mori when installing a cert
At the very moment of installing a certificate on a system think hard and deep about that future moment when it expires. Make sure that proper auto-renewal mechanisms are in place. In case of manual renewal, know who will be in charge, which steps to perform etc. (did I already mention the value of runbooks?)
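One way to turn 'think about the expiry at install time' into something concrete is to compute, right then, the date from which renewal should be attempted. The sketch below assumes a policy of 'start renewing once two thirds of the lifetime have elapsed'. That fraction is my assumption for illustration, not a standard, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def renew_at(not_before, not_after, num=2, den=3):
    """Start of the renewal window: after num/den of the lifetime has elapsed.

    The 2/3 default is an assumed policy for this sketch, not a standard.
    """
    return not_before + (not_after - not_before) * num / den


def renewal_due(now, not_before, not_after):
    """True once 'now' has reached the start of the renewal window."""
    return now >= renew_at(not_before, not_after)


# Hypothetical 90-day certificate.
issued = datetime(2030, 1, 1, tzinfo=timezone.utc)
expires = issued + timedelta(days=90)

print("renew from:", renew_at(issued, expires).date())
print("due on day 45?", renewal_due(issued + timedelta(days=45), issued, expires))
print("due on day 75?", renewal_due(issued + timedelta(days=75), issued, expires))
```

Putting that date into a calendar (or a ticket, or the runbook) at install time is the low-tech version of the same idea.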
Align on use cases & objectives
A certificate is always used in a communication process (e.g. between a client and a server, or between a user/system and a network device granting Wi-Fi or VPN access). These parties might belong to different organizations, these might have different security objectives, and they might have a different understanding of what those security objectives imply for the types and strictness of checks to be performed. Aligning – via some type of communication – on those can have an impact both on avoiding and on dealing with failure scenarios. I’m aware that this may sound like a lot of overhead, but you know, a little conversation in advance can save you from quite some headaches later.
These conversations may involve infosec folks, maybe even on both sides. This can generally lead to interesting learnings, and to quite a few “oops, we thought it was ok to…” moments ;-). Remember the above example of performing cert-based user authentication and just looking at the validity period? Of course such a thing would never happen irl. never!
Automation is your friend
We all know that automating operational procedures is pretty much always a good idea, but there are probably not many domains where this is as true as when certificates come into play. This applies not only to renewal – where things have gotten significantly better in the last years – but also to initial deployment in distributed settings, e.g. on load balancers or on Wi-Fi controllers – where, in some spaces, things might not yet be fully there. Spend some significant energy on this, you will thank yourself later.
Understand which checks you really need
Generally four types of checks can be differentiated:
- Lifetime. This is the most basic check, and you might not even be able to disable it in a specific setting. You probably never want to ignore this one (right? ;-), but grace periods can save your service uptime here + there, and that’s totally ok as long as the implications & trade-offs (service availability vs. strict security objectives) are well understood.
- Identity. Again this is a basic check (‘am I connecting to the right server, represented by the certificate that it shows me?’), but this raises the question “how to define identity?”. Which identity does a wild card cert constitute? 😉 – those are not in use in your environment, you tell me? Well, at times developers *love* them (and Let’s Encrypt might happily hand them out once one has passed the initial domain validation). Ok, I get it, that’s only in dev, not in prod, you say? ;-).
Also it’s a common approach to use SANs (subject alternative names) in load-balanced settings, which can lead to interesting situations during troubleshooting. In short: identity things & checks might be more complex than they seem.
- Other checks on various fields of a certificate (e.g. parsing a piece from the distinguished name in order to determine some group membership which in turn leads to some security decision like authorizing access to a resource). In the context of this post I have just one piece of advice for you: don’t!
- Revocation checks. As I stated before, revocation checking usually opens a whole new can of worms, and it’s probably in this space where the objectives of operations personnel and infosec people most heavily differ. This brings me directly to the next point:
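A short aside on the identity check from the list above, before we get to revocation. To show why wildcards make 'identity' fuzzier than it looks, here is a deliberately simplified sketch of DNS name matching in which `*` may only stand for the entire left-most label. The function names are made up, the rules are a reduced version of what TLS libraries implement, and in real code you should leave this check to the library (Python's `ssl.create_default_context()` performs it for you).

```python
def hostname_matches(pattern, hostname):
    """Simplified SAN dNSName match: '*' may only replace the whole left-most label."""
    p_labels = pattern.lower().rstrip(".").split(".")
    h_labels = hostname.lower().rstrip(".").split(".")
    if len(p_labels) != len(h_labels):
        return False
    if p_labels[0] == "*":
        # The wildcard covers exactly one non-empty label, nothing deeper.
        return bool(h_labels[0]) and p_labels[1:] == h_labels[1:]
    return p_labels == h_labels


def cert_covers(san_dns_names, hostname):
    """True if any of the certificate's DNS names matches the hostname."""
    return any(hostname_matches(name, hostname) for name in san_dns_names)


# Hypothetical SAN list of a load-balanced service.
sans = ["*.example.com", "api.internal.example.com"]
for host in ["www.example.com", "example.com", "a.b.example.com",
             "api.internal.example.com"]:
    print(host, cert_covers(sans, host))
```

Note that a single wildcard entry vouches for every host one level below the domain, which is exactly the 'which identity is this, really?' question from the list.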
Be careful with revocation checking
Revocation checking brings new entities, roles & responsibilities, and processes to the picture. These can lead to all types of outage scenarios. On the other hand you have to deal with the capability-inherent issue of revocation (see 1st post). I know a number of environments explicitly foregoing revocation checking, for good operational reasons. Short lifetimes and proper renewal procedures can help to mitigate the related risks (“compensating controls” is favorable language then, when you talk to your infosec group or to ‘the auditors’).
Monitoring and alerting
Take care of proper monitoring and alerting, especially (but not only) in the context of expiring certs. Activities in this domain mostly address risk (1). I will cover approaches & tools in more detail in a future post. For the moment suffice it to say that, from an operations perspective, this can be considered the most important element of this little list, together with the next one.
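Until that future post, here is a minimal sketch of an expiry check using only Python's standard library. The thresholds (30 and 7 days) and the function names are assumptions for illustration. Also note that `fetch_not_after` uses a default, verifying TLS context, so an already expired or otherwise invalid certificate shows up as a handshake exception rather than as a date.

```python
import socket
import ssl
import time


def days_left(not_after, now=None):
    """Days until expiry, given a notAfter string as returned by getpeercert()."""
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def severity(days, warn=30, crit=7):
    """Map remaining days to an alert level (thresholds are assumed defaults)."""
    if days < 0:
        return "EXPIRED"
    if days <= crit:
        return "CRITICAL"
    if days <= warn:
        return "WARNING"
    return "OK"


def fetch_not_after(host, port=443, timeout=5):
    """Connect to host:port and return the served certificate's notAfter string."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


# Offline demo with fixed, made-up dates (no network needed).
demo_now = ssl.cert_time_to_seconds("Jan  1 00:00:00 2030 GMT")
remaining = days_left("Jan 11 00:00:00 2030 GMT", now=demo_now)
print(remaining, severity(remaining))
```

For a live check you would call something like `severity(days_left(fetch_not_after("your.host.example")))` from a scheduled job and route anything that is not `OK` into your alerting.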
Use auto-renewal wherever you can
This is simply based on the observation that certificate expiry is the most common outage reason. Automatic renewal (at least for the majority of certs) is a must in most environments, and supporting technologies like ACME exist these days. Two quick notes here:
- Think about: do you want to immediately revoke an old cert once a new one is generated? Doing so can avoid all types of interesting situations resulting from temporary co-existence, but doing so might also prevent you from undoing/rolling back changes in case that’s required.
- Keep in mind that just pushing the new cert(s) might not be enough. Very often services have to be restarted to use new certs.
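Regarding the second note: a simple way to verify that a service has actually picked up a renewed cert is to compare the fingerprint of the file on disk with the fingerprint of what the service presents on the wire. Below is a minimal sketch with standard-library calls. The helper names are made up, and on hosts serving several certificates the result may depend on how the connection is made (e.g. SNI), so take it as an illustration only.

```python
import hashlib
import ssl


def fingerprint(pem):
    """SHA-256 fingerprint (hex) of a PEM-encoded certificate."""
    der = ssl.PEM_cert_to_DER_cert(pem)
    return hashlib.sha256(der).hexdigest()


def service_uses_cert(pem_on_disk, host, port=443):
    """True if the certificate served at host:port equals the one on disk."""
    served_pem = ssl.get_server_certificate((host, port))
    return fingerprint(served_pem) == fingerprint(pem_on_disk)


# Offline demo with placeholder PEM bodies (not real certificates).
old_pem = "-----BEGIN CERTIFICATE-----\nAAAA\n-----END CERTIFICATE-----\n"
new_pem = "-----BEGIN CERTIFICATE-----\nAAAB\n-----END CERTIFICATE-----\n"
print("same cert?", fingerprint(old_pem) == fingerprint(new_pem))
```

Running `service_uses_cert` right after a renewal job, and restarting or reloading the service if it returns `False`, closes the gap between 'the file was replaced' and 'the service uses it'.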
Bonus: all of the elements of the certificate infrastructure itself, notably the CRL, should support IPv6 😉
tl;dr: to increase the maturity of certificate use within an environment, the following recommendations are worth considering:
- Be prepared
- Protect the privkeys
- Memento Mori when installing a cert
- Align on use cases & objectives
- Automation is your friend
- Understand which checks you need
- Be careful with revocation checking
- Monitoring & alerting
- Use auto-renewal wherever you can
I’m always happy to receive feedback or comments on practices in your lovely world of certificates. Thank you for reading this far, and stay tuned for the next post of the series.