Earlier this week I went to Atlanta for NANOG87. I hadn’t been at a NANOG meeting for a while – I even missed the legendary “World IPv6 Reunion Tour 2022” panel with my friend Jason Fesler + some other fine IPv6 folks at NANOG85 in Montréal. Many people mainly join NANOG meetings for the hallway track, that is connecting/socializing with peers (I had some very good conversations over dinner as well – you know who you are ;-), but there are talks as well. In the following I’ll discuss some of them. I will add links to the videos once those are available on the NANOG YouTube channel.
Day 1 Keynote: Elad Nafshi (Comcast) – The Future is Now: Delivering the Next Generation Brilliant Network
A bit marketing-heavy but still entertaining, and it included some interesting technical tidbits on Comcast’s approach to fibre fault isolation & repair, and on their future cable modem design.
Aliraza Bhimani (Comcast Cable): The Operational Impacts of Supporting a Disaggregated, Distributed, Cloud-based Network Architecture
A solid technical presentation. However, I’m not entirely convinced that this will gain much ground in organizations other than those already using whitebox networking (due to support strategies, due to the added complexity of the needed interconnects, etc.), but it was definitely an interesting talk.
Cat Gurinsky: Simplified Network Troubleshooting through API Scripting
This was one of the talks/workshops I was most looking forward to, as Cat is an expert in the field. Unfortunately she started a bit earlier than scheduled so I missed the majority of it.
Day 2 Keynote: Michael Bailey – A Security Practitioner’s Guide to Internet Measurement
Michael discussed the value of measurements & metrics, of critical thinking, and of an interdisciplinary approach to network security. Overall an excellent keynote. I particularly liked this slide 😉
Dr. Richard Clayton & John Kristoff: Assessing the Aftermath – Evaluating the effects of a global DDoS-for-hire service takedown
In December 2022 the FBI seized 49 domain names, taking roughly half of the booters active at the time (temporarily) out of business. Richard gave an overview of the booter landscape and the operation itself, while John looked at the number of observed DDoS attacks before and after the takedown, in order to assess its impact.
Gautam Akiwate (Stanford University): Retroactive Identification of Targeted Domain Hijacks
This was one of my favorite talks as I think that while the attack vector is not new, it’s still under-estimated in several organizations. Gautam discussed a case study of a registrar-based DNS takeover/attack against the French aerospace company Safran in 2014, which had triggered the researchers’ interest in the subject. He noted that TLS would not protect against such an attack (I discussed trust relationships & issues when using certificates here):
They subsequently developed a methodology to (retroactively) identify such attacks, based on the operational requirements from the attacker’s perspective:
This allowed them to identify a number of hijacks, incl. some potentially unknown ones:
This was serious fun. Here are the categories of the two rounds:
Agustín Speziale: World Cup 2022 – Analysis of the impact on the Internet traffic and utilization
Agustín started with an overview of the LATAM countries participating in the 2022 FIFA World Cup. He introduced each analyzed country, providing the respective population size, (estimated) number of Internet users, number of ASes, and some additional details on its Internet landscape, which in itself was quite interesting. He then presented traffic statistics during the period of the World Cup and matched traffic peaks to individual matches, plus some rationale on the importance of the respective matches (some pictures with graphs here). An entertaining & educational talk, with an eventual plea to the NANOG audience to keep these numbers & trends in mind with the World Cup 2026 looming. I for one expect that football, err, soccer will be(come) a huge thing in the US in the coming years anyway.
Later that day there was a talk on the benefits of using CG-NAT with 100-net (100.64.0.0/10) – which I did not attend as the topic is against my religion 😉 – and a talk on IPv6, which was a really bad talk (outdated stuff, “could have been a blogpost”, in the year 2015). Hence I’m not covering those two here.
I had initially planned to focus the sequel of the 1st part on discussing more use cases, but I meanwhile think it couldn’t hurt to insert a quick presentation of some certificate best practices, in order to make this little series more practical 😉
The following little pieces of advice address three main risks:
(1) Service outages due to expiring certs or due to failing checks (as in: TLS handshake terminates as ‘something doesn’t match’)
(2) Compromise of the private key
(3) Violation of some security objective which a specific certificate is supposed to contribute to (I mean: you use them for an explicit purpose – which you fully & clearly understand, right?). Example: a certificate is used for user authentication, but the only check performed by an endpoint is whether the lifetime is valid.
It should be noted that measures mitigating (1) might increase the risks of (2) or (3), and vice versa. You’ll have to find the proper balance in your environment. This requires an understanding of the trust relationships, the trade-offs, etc. => 1st post. It should also be noted that guidance coming from the fine folks in your infosec department is usually very much centered around (2) and (3). They don’t have to operate the services which actually employ certificates for one well-defined reason or another. Just saying 😉
Finally: being a fan of a ’10 golden rules’ approach (see here for a similar post on IPv6 security) I’ll make it ten. Also, people using certificates occasionally refer to a ‘certificate lifecycle’, which could look like the following; this can help with understanding the order of the pieces.
Inventory
Understand – or, potentially even better, document (though depending on your role it’s ok to just reflect on this a bit) – in which places in your environment which certificates are in use, which purposes they are used for (and which related checks are performed, see below on the different types), which lifetimes they have, what happens when the latter end, etc. Fair chance that some of your services connect to external systems (via HTTPS, evidently), so include those in this exercise. Apply the principles laid out here as well, in talks (for bold minds: audits) with the parties responsible for them.
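As a hedged illustration of the inventory idea, a few lines of stdlib Python can collect the expiry dates of the certificates presented by a list of endpoints. The hostnames below are made-up placeholders; a real inventory will obviously be larger and probably live somewhere more durable than a script.

```python
# Sketch: collect notAfter dates of certs presented by known endpoints.
# The endpoint list is a hypothetical example.
import socket
import ssl
from datetime import datetime, timezone

ENDPOINTS = [("www.example.com", 443), ("vpn.example.com", 443)]

def cert_not_after(host: str, port: int) -> datetime:
    """Return the notAfter timestamp of the certificate served by host:port."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    # cert["notAfter"] looks like 'May  9 00:00:00 2027 GMT';
    # ssl.cert_time_to_seconds() converts it to epoch seconds (UTC).
    return datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )

if __name__ == "__main__":
    for host, port in ENDPOINTS:
        try:
            print(f"{host}:{port} expires {cert_not_after(host, port):%Y-%m-%d}")
        except (OSError, ssl.SSLError) as exc:
            print(f"{host}:{port} check failed: {exc}")
```

Including the external systems mentioned above in such a list is exactly the point of the exercise.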
Be prepared
Reflect on failure scenarios and how you want to deal with them. Discuss those with the relevant stakeholders (the middle of an outage caused by an expiring cert is not the best moment to discuss whether it’s ok to disable checking cert lifetimes on a specific system/service – and to look for someone who approves the PR implementing such a change…). Maybe even write down the results of these conversations (runbooks come to mind). This should include documenting how to emergency-revoke a cert in case of key compromise. This overall exercise mainly addresses risk (1), but the revocation part also addresses risk (2).
Protect the privkeys
The private keys are the real assets to be protected. Do whatever is needed to protect them. This can include storing them in an encrypted manner, using an appropriate passphrase (which you don’t store together with the keys ;-), strictly limiting access rights to them, and limiting transfers. This is meant to protect against the above risk (2).
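One small, hedged building block for the “strictly limiting access rights” part, assuming keys are stored as files on disk (the path pattern is a made-up example): flag key files readable by anyone besides the owner.

```python
# Sketch: warn about private key files with overly permissive file modes.
# The glob pattern is a hypothetical example; adapt it to your environment.
import glob
import os
import stat

def world_or_group_readable(path: str) -> bool:
    """True if group or others can read the file."""
    mode = os.stat(path).st_mode
    return bool(mode & (stat.S_IRGRP | stat.S_IROTH))

for key_file in glob.glob("/etc/ssl/private/*.key"):
    if world_or_group_readable(key_file):
        print(f"WARNING: {key_file} is readable by group/others")
```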
Memento Mori when installing a cert
At the very moment of installing a certificate on a system think hard and deep about that future moment when it expires. Make sure that proper auto-renewal mechanisms are in place. In case of manual renewal, know who will be in charge, which steps to perform etc. (did I already mention the value of runbooks?)
Align on use cases & objectives
A certificate is always used in a communication process (e.g. between a client and a server, or between a user/system and a network device granting Wi-Fi or VPN access). These parties might belong to different organizations, which might have different security objectives, and they might have a different understanding of what those security objectives imply as for the types and strictness of the checks to be performed. Aligning – via some type of communication – on those can have an impact both on avoiding and on dealing with failure scenarios. I’m aware that this may sound like a lot of overhead, but you know, a little conversation in advance can save you from quite some headaches later. These conversations may involve infosec folks, maybe even on both sides. This can generally lead to interesting learnings, and to quite a few “oops, we thought it was ok to…” moments ;-). Remember the above example of performing cert-based user authentication and just looking at the validity period? Of course such a thing would never happen irl. Never!
Automation is your friend
We all know that automating operational procedures is pretty much always a good idea, but there are probably not many domains where this is as true as when certificates come into play. This applies not only to renewal – where things have gotten significantly better in recent years – but also to initial deployment in distributed settings, e.g. on load balancers or on Wi-Fi controllers, where, in some spaces, things might not yet be fully there. Spend some significant energy on this; you will thank yourself later.
Understand which checks you really need
Generally four types of checks can be differentiated:
Lifetime. This is the most basic check, and you might not even be able to disable it in a specific setting. You probably never want to ignore this one (right? ;-), but grace periods can save your service uptime here and there, and that’s totally ok as long as the implications & trade-offs (service availability vs. strict security objectives) are well understood.
Identity. Again this is a basic check (‘am I connecting to the right server, represented by the certificate that it shows me?’), but this raises the question “how to define identity?”. Which identity does a wild card cert constitute? 😉 – those are not in use in your environment, you tell me? Well, at times developers *love* them (and Let’s Encrypt might happily hand them out once one has passed the initial domain validation). Ok, I get it, that’s only in dev, not in prod, you say? ;-). Also it’s a common approach to use SANs (subject alternative names) in load-balanced settings, which can lead to interesting situations during troubleshooting. In short: identity things & checks might be more complex than they seem.
Other checks on various fields of a certificate (e.g. parsing a piece of the distinguished name in order to determine some group membership which in turn leads to some security decision like authorizing access to a resource). In the context of this post I have just one piece of advice for you: don’t!
Revocation checks. As I stated before, revocation checking usually opens a whole new can of worms, and it’s probably in this space where the objectives of operations personnel and infosec people most heavily differ. This brings me directly to the next point:
Be careful with revocation checking
Revocation checking brings new entities, roles & responsibilities, and processes into the picture. These can lead to all types of outage scenarios. On the other hand you have to deal with the capability-inherent issue of revocation (see 1st post). I know a number of environments that explicitly forego revocation checking, for good operational reasons. Short lifetimes and proper renewal procedures can help to mitigate the related risks (“compensating controls” is the favorable language then, when you talk to your infosec group or to ‘the auditors’).
Monitoring and alerting
Take care of proper monitoring and alerting, especially (but not only) in the context of expiring certs. Activities in this domain mostly address risk (1). I will cover approaches & tools in more detail in a future post. For the moment, suffice it to say that from an operations perspective this can be considered the most important element of this little list, together with the next one.
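The core alerting rule itself is simple to sketch in Python. The inventory shape (cert name mapped to its notAfter datetime) and the 30-day threshold are assumptions for illustration, not a recommendation:

```python
# Sketch of an expiry alerting rule; WARN_DAYS is an arbitrary example value.
from datetime import datetime, timedelta, timezone

WARN_DAYS = 30

def expiry_alerts(inventory, now=None):
    """inventory maps a cert name to its (tz-aware) notAfter datetime.
    Returns one alert line per cert already expired or expiring soon."""
    now = now or datetime.now(timezone.utc)
    alerts = []
    for name, not_after in sorted(inventory.items()):
        remaining = not_after - now
        if remaining <= timedelta(0):
            alerts.append(f"CRITICAL: {name} expired {-remaining.days} day(s) ago")
        elif remaining <= timedelta(days=WARN_DAYS):
            alerts.append(f"WARNING: {name} expires in {remaining.days} day(s)")
    return alerts
```

Feeding these lines into whatever alerting pipeline already exists in your environment is the part that actually matters.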
Use auto-renewal wherever you can
This is simply based on the observation that expiry is the most common cause of certificate-related outages. Automatic renewal (at least for the majority of certs) is a must in most environments, and supporting technologies like ACME exist these days. Two quick notes here:
Think about whether you want to immediately revoke the old cert once a new one is generated. Doing so can avoid all kinds of interesting situations resulting from temporary co-existence, but it might also prevent you from rolling back changes should that be required.
Keep in mind that just pushing the new cert(s) might not be enough. Very often services have to be restarted to use new certs.
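A hedged way to detect that “new cert pushed, but the service still presents the old one” situation, with stdlib Python only. The host and path below are hypothetical, and the on-disk file is assumed to contain a single PEM cert, not a chain:

```python
# Sketch: compare the cert on disk with the one a service actually serves,
# to catch a missing post-renewal restart. Host/path are hypothetical.
import hashlib
import socket
import ssl

def fingerprint_der(der: bytes) -> str:
    """SHA-256 fingerprint of a DER-encoded certificate."""
    return hashlib.sha256(der).hexdigest()

def served_fingerprint(host: str, port: int = 443) -> str:
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=5) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return fingerprint_der(tls.getpeercert(binary_form=True))

def disk_fingerprint(pem_path: str) -> str:
    with open(pem_path) as f:
        der = ssl.PEM_cert_to_DER_cert(f.read())  # expects exactly one cert
    return fingerprint_der(der)

# Usage (hypothetical host/path):
# if served_fingerprint("www.example.com") != disk_fingerprint("/etc/ssl/www.pem"):
#     print("reload needed: service still presents the old certificate")
```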
Bonus: all of the elements of the certificate infrastructure itself, notably the CRL distribution points, should support IPv6 😉
tl;dr: to increase the maturity of certificate use within an environment the following recommendations can be worthwhile to consider:
Inventory
Be prepared
Protect the privkeys
Memento Mori when installing a cert
Align on use cases & objectives
Automation is your friend
Understand which checks you need
Be careful with revocation checking
Monitoring & alerting
Use auto-renewal wherever you can
I’m always happy to receive feedback or comments on practices in your lovely world of certificates. Thank you for reading so far, and stay tuned for the next post of the series.
I’ve written a couple of posts on (X.509v3) certificates in the past, starting with this one in 2001. In the two decades since then a number of developments have taken place (to name a few: OCSP, ACME, Let’s Encrypt certificates and the general role of automation). On the other hand the fundamental mechanisms of certificates have stayed the same. In this post I argue that understanding the inherent (but often hidden) complexity, the trust relationships and the trade-offs of certificate use in a given environment can lead to better decision making and to more efficient operations.
The basic scheme (for the purposes of this post) usually involves a set of parties:
(1) A server (in the sense of an entity receiving a connection request, incl. network devices)
(2) A client (an entity that initiates a connection)
(3) A user who uses the client, and we can safely assume this is a human, so motivations & desires come into play (which can influence trust decisions)
(4) An operator being in charge of (1), or of (2), or of both. Here again we assume humans, so they have objectives (in particular “make the users happy by providing a service which is available, and which they can use with their present skill set”)
(5) CAs who issue certificates to be used on (1), or on (2), or on both. Evidently this involves (potentially complicated) relationships with the operators
(6) developers
(7) infosec people
Let’s start with some high-level concepts (yep, regular readers remember my love for those ;-).
Complexity
Working with certificates frequently induces a high level of complexity (definition of the term here), for a number of reasons:
multiple standards bodies have contributed to specifying what we have today, one of them (the ITU) being notorious for complex outcomes. The main IETF document, RFC 5280, has 151 pages.
using certificates often involves other, not necessarily simple, things like ASN.1 or DER.
most importantly there are all types of extensions which can be employed for nearly unlimited creative uses ;-). See this part of the table of contents of RFC 5280
Unfortunately one of the objectives of the ‘traditional’ certificate use case (that is: securely buying stuff on the Internet) was to hide this complexity from the users. At the same time, certificates being capabilities (see below) – which get deployed once and, seemingly, don’t have to be ‘operationally taken care of’, at least for a while – causes them & their complexity to be underestimated (& to be ‘invisible until something breaks’) in fast-moving environments. Realizing that certificates are complex beasts, and especially so when employed for certain use cases (=> below), might be the first step toward getting better at handling them ;-).
Trust
Trust (some definition of the term here) plays a huge role when certificates come into play – it is their very core value proposition. They’re exactly meant to contribute to trust between communication partners (by assuring the identity of one or more of them). In the classic use case this works as follows:
I can trust that this web site I’m visiting belongs to the organization holding the domain name I typed into my browser, because I see that little lock in the URL.
Behind the scenes this trust is established because another party (the CA) assured the binding of some cryptographic material to some identity information, based on some more or less rigorous checks. I might not know this other party but my browser does, and the CA’s mere presence in my browser’s (or OS’s) certificate store expresses this trust.
Alas, matters involving trust can be way more complex in today’s world. Imagine you operate an application which runs on several systems and at some point connects to a system operated by a 3rd party (called $ORG in the following), e.g. for querying a database. As smart & security-conscious people are involved, certificates are used everywhere, incl. on that one external system. When asked about the dimension of trust as for the certificate over there (in the following: $CERT), one might be tempted to respond “well, that one enables us to trust we’re connecting to the right system (and infosec told us to ubiquitously use certificates anyway)”. However, in reality:
you now inherently trust $ORG to have done a reasonable job when getting an appropriate certificate for the purpose.
you trust the respective CA to have done a proper job vetting $ORG (and to have issued an appropriate certificate for the purpose).
you now inherently trust that $ORG knows or monitors the expiry date of $CERT (and, evidently/subsequently, that related alerting capabilities are in place).
you inherently trust that some sufficiently qualified personnel will be available on the day $CERT expires, at the latest.
overall you inherently trust $ORG’s operational maturity to properly handle certificates ;-).
Looking closer you may also find out that $CERT is a wildcard cert covering the full domain of $ORG, so the initial assumption of trust (‘make sure we connect to the right system’) might be… debatable. In short, understanding the (hidden) trust relationships in an environment can generally be beneficial for prioritizing operational resources. Which brings me directly to the next point.
Trade-offs
The world of certificates is full of trade-offs (as, of course, are all settings with many different parties and their – differing – objectives). Here they are usually clustered around two main themes:
performing certificate validation at all ;-). This may sound strange at first glance – I mean, using certificates only makes sense once you validate them, right? – but many of us know situations of the “oops, that expired cert over there breaks our service delivery right now. what about temporarily [ed.: by some definition of temporary 😂] disabling cert validation for the TLS connections between those systems to quickly fix the issue?” type. You may also look at the Wi-Fi authentication use case below.
how to determine if a certificate is (still) valid. This can be time-based, or based on checks of the revocation status, or both. Such checks (and the concept of certificate lifetimes/validity periods as a whole) are related to a specific property of certificates (them being capabilities, see next section), and these checks can induce significant operational complexity (e.g. see the post I referenced at the beginning of this one). I will cover certificate revocation & checking in a later part of this series.
Finding the right balance between the objectives of the different parties – read: going with the right trade-offs – can greatly help to efficiently steer operational resources (in all directions; e.g. increasing cert lifetimes between systems which are all part of the same – your – operational domain can be a good idea when cert expiry is a frequent cause of issues; better yet, increase the level of automation for renewal then ;-). You may hence spend some intellectual cycles on understanding/questioning the trade-offs in your environment. As stated above, quite a few of the trade-offs are commonly related to the most important, yet at times least understood, point of my little theory discourse here, that is:
Certificates are capabilities
Imagine there’s a subject (a user/process) that wants to access an object, e.g. a resource (network, file etc.). The enforcement mechanism controlling the subject’s access to the object can then look
at an attribute of the object itself (we could call it sth like ‘access control list’). This attribute/list is then checked every time the subject shows up and asks for access, and it’s usually maintained by the object’s owner.
for an entitlement (not to be confused with, but similar to these) which at an earlier point of time was granted to the subject and which generally allows some access. Such a thing is sometimes called a capability, and certificates can be perfect examples of capabilities (strictly speaking & technically the private key corresponding to a cert’s pub key constitutes the actual capability, but let’s keep it simple).
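To make the distinction a bit more tangible, here’s a deliberately simplified sketch in Python. All names and the HMAC-based token construction are made up for illustration – this is not how X.509 works on the wire – but the structural point is the same: the ACL check consults state kept by the object’s owner, while the capability check only inspects the token itself.

```python
# Deliberately simplified sketch of ACL vs. capability; all names are made up.
import hashlib
import hmac
import time

# Object-based model: the object carries an ACL, checked on every access
# and maintained by the object's owner.
acl = {"report.pdf": {"alice", "bob"}}

def acl_allows(subject: str, obj: str) -> bool:
    return subject in acl.get(obj, set())

# Capability model: the subject was handed a signed token earlier; the check
# only validates the token (like checking a cert's signature & lifetime).
SECRET = b"issuer-key"  # stands in for the issuer's (CA's) signing key

def issue_capability(subject: str, obj: str, not_after: int) -> str:
    msg = f"{subject}:{obj}:{not_after}"
    sig = hmac.new(SECRET, msg.encode(), hashlib.sha256).hexdigest()
    return f"{msg}:{sig}"

def capability_allows(token: str, now=None) -> bool:
    now = now if now is not None else int(time.time())
    subject, obj, not_after, sig = token.split(":")
    good = hmac.new(SECRET, f"{subject}:{obj}:{not_after}".encode(),
                    hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, good) and now <= int(not_after)
    # Note: the issuer cannot invalidate an already-issued token without an
    # extra mechanism. That is exactly the revocation problem discussed below.
```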
I’m using the above terms a bit loosely here, and there’s a lot of theoretical discussion in OS security circles on these. In any case capabilities have two main challenges:
Delegation: how can you make sure that one subject does not transfer the capability to another subject after it has been granted?
Revocation: if circumstances change (e.g. a system/key material is compromised, or a user leaves an organization), how can you make sure that the once-granted entitlement can no longer be used?
Both are well-known in certificate circles, and various architectural or technical approaches exist for dealing with them, incl.:
Come up with a flag (‘non-exportable’) for private keys and hope that the OS environment properly enforces it.
Store the private key(s) in some extra-secure place. That’s the main reason why smart cards once gained a lot of popularity in some industry sectors (namely heavily regulated ones like banks), and why hardware security modules (HSMs) exist.
Implement an additional layer where, at the very moment of a certificate’s use, some extra check of the ‘ok, it is still within its validity period, but has it been revoked?’ type happens. Voilà the birth of certificate revocation checking, and welcome to a whole new space of complexity, trust relationships, and trade-offs (=> detailed discussion in next post).
It should be noted that
revocation checks significantly change the trust relationships (“ok, I see the cert that you present to me. It was meant to create trust between you and me, but I’m not convinced. let me reach out to somebody else to verify.”)
they kind-of move the needle towards an object-based security model which many people intuitively prefer as it gives them the notion of being in control (also this is better aligned with many compliance frameworks 😉 ).
—
Let’s now discuss some certificate use cases from the above perspectives. In the following I will look at five of them (the first two in this post, the others in the next):
E-commerce web server offering HTTPS
Authentication in enterprise Wi-Fi networks
Client/user authentication (e.g. for VPN access)
Client/user authorization (as in “enrich a certificate with additional information which is then parsed in order to take security decisions like controlling access to a specific resource”)
mTLS
Use case: e-commerce web server with HTTPS
This is probably the most classic use case, and it’s the one that paved the way for the widespread use of certificates. When e-commerce became a thing, there were two challenges to be solved from a user’s (buyer’s) perspective:
How do I know I’m connected to the right server (assuming that this one only uses my credit card data for the goods I want to purchase)?
How can I be sure that my payment data is not compromised when using the Internet for its transfer?
Both could be addressed by deploying a cert on the web server(s) and enabling HTTPS.
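In today’s terms, what the browser does behind that little lock can be sketched with stdlib Python (the shop hostname below is a placeholder):

```python
# Sketch of what "the little lock" boils down to on the client side:
# a TLS handshake with chain validation and hostname checking enabled.
import socket
import ssl

host = "www.example.com"              # hypothetical shop
ctx = ssl.create_default_context()    # trusted CA store, hostname checking on

try:
    with socket.create_connection((host, 443), timeout=5) as sock:
        # wrap_socket() only succeeds if the cert chain validates against
        # the CA store and the cert matches the hostname; afterwards the
        # channel is encrypted, addressing both questions above.
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            print("negotiated", tls.version())
except (OSError, ssl.SSLError) as exc:
    print("handshake failed:", exc)
```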
To note:
From a trust perspective this is a kind-of easy one. The user has a certain desire (e.g. to buy something, or to watch specific content) which generally highly influences trust decisions (otherwise Ponzi schemes wouldn’t work). The CAs were trusted as there were only a few of them, and their trustworthiness was rarely questioned or verified by the ppl requesting certificates (in the early days the latter sometimes even were part of a company’s marketing team, who usually have a more optimistic approach to life – than those ever-skeptical infosec folks – anyway).
From a company’s security objective perspective it was an easy one, too: none of the to-be-protected assets (users’ credit card data) were really of relevance (wrt protection need) for the owners of the web servers. This only changed when PCI came up.
From an operations perspective it wasn’t particularly difficult either: certs had comparably long lifetimes (usually two years), there were only a few of them, and while renewal was known to be somewhat inconvenient it was at least less cumbersome than the initial request.
Use case: authentication protocols used in enterprise Wi-Fi networks
Pretty much all extensible authentication protocols (EAPs, some overview here) used in enterprise Wi-Fi networks employ certificates, some of them only on the side of infrastructure elements (e.g. PEAP), others (EAP-TLS) also for clients. Especially the latter one brings high operational complexity (see for example this old setup guide which my fine buddy Chris Werny authored many yrs ago). With that come both heavily differing objectives of the involved parties and quite interesting failure scenarios. Let’s analyze some of the involved parties.
operators of the RADIUS servers. They might not be super-familiar with certificates – installing those may not be a daily task for them – so they’d be happy with generally longer cert lifetimes.
‘enterprise desktop team’ – they will strive for auto-enrollment & renewal, and again they will want to keep things simple (“why do they bother us with this certificate stuff, our life is already difficult”). This group/task could be outsourced (=> $CONTRACTOR1).
the users just want Wi-Fi to work, they (legitimately) don’t care about the underlying technologies, and they will happily click away any certificate-related warnings “as long as the damn corporate Wi-Fi works”.
the infosec people want to prevent the users from doing the latter, and they’d be happy if the lifetimes of the involved certificates were shorter rather than longer. Bonus if they come up with the idea of implementing some additional scheme where “Wi-Fi (security) profiles” are mapped to certain parts of the certificate (did I already mention that certificates have various types of fields which can be overloaded, err, populated with all types of information?)
the operators of the whole Wi-Fi infrastructure want to keep the users happy. Some chance here that operations of (some parts of) the network infrastructure might be outsourced/provided by contractors ($CONTRACTOR2).
The CA issuing the involved certificates might be in-house, or not. It’s a common scenario that this is another contracted service ($CONTRACTOR3). Bonus if the wireless infrastructure uses intermediary certificates from another CA ($CONTRACTOR4).
Let’s imagine at some point one of the following two things happens
Something breaks
One of the certificates expires, in particular at the infrastructure (RADIUS server, AP, wireless controller) level. High chance that the renewal requires human labor & skills and, evidently, requires touching availability-critical network infrastructure. Maybe the certificates in question are not monitored. Overall, quite some probability that cert expiry leads to “something breaks”.
How well do you think $CONTRACTOR{1..4} will interact in such a case? – Exactly 😉
It should be noted that most of the above parties do not have a deep familiarity with certificates in their daily life. The fact that those are mostly invisible until sth breaks doesn’t help either (=> incentives?). I can also tell you from practical experience (from my days as a network consultant in a US Fortune 10 company 15 years ago) that all of the above parties (except the infosec folks) will happily & immediately sacrifice all cert-related security properties once, say, 50K users can no longer use the corp Wi-Fi (due to expiring intermediate certs from a vendor with whom $CONTRACTOR2 had ended their contractual relationship). Then the following suggestions might show up on the table:
Can’t we just disable certificate validation as a whole, on certain $INFRASTRUCTURE_ELEMENTS?
What about publishing guidance in which we tell users to ignore certificate warnings?
Any chance of configuring some grace period, say 4–8 weeks, during which we still accept the expired certs? $VENDOR already promised us a custom image which somehow avoids the issue (don’t ask…).
If only some group of experts had reflected earlier on the certificate deployment in that environment, its operational complexities, its inherent trust relationships, and the trade-offs between the different parties & their incentives ;-). If this post makes you think about these aspects in your own world, I’m a happy man. Thank you for your time spent reading, and see you in a few weeks for the next part.
I spent the last two days in Paris to attend Hexacon 2022. As usual when I write here about conferences I’ll summarize some talks & observations. I don’t go to many offensive security-only events (it’s well-known that I have thoughts on a certain scene and its [non-] ethics, but on the other hand a periodic reality check of such sentiments shouldn’t hurt either). Hexacon had caught my attention due to the superb speaker line-up, and I could reasonably expect to meet some old friends there.
Having been a conference organizer for a while in my life I can say that I was seriously impressed by what Renaud & team have put together (fair chance that the team did the vast majority of the real work ;-). Very well organized event, excellent talks (not a single rly weak one) and good community spirit. Great job, folks! That said, let’s have a look at the talks. For time reasons I will only cover some of them, and slides/videos for quite a few have not been published yet (it’s announced though), hence let’s hope that my memory serves me correctly…
Luca started with an overview of the fundamental pieces of the iOS security model & of recent advances both in the space of attack vectors and when it comes to protections:
He repeatedly emphasized the value of Lockdown Mode, assuming it might have taken a couple of afternoons to implement 😂. He summarized that ‘Apple is finally winning’ which, according to Shahar Tal, was met with ‘crowd silence’ (I can confirm this). Luca then provided some conclusions on the future business of iOS-oriented offensive security research which at first glance can be summarized as follows:
– given the complexity of iOS and its security measures it’s very unlikely any individual can succeed alone. Some players will go out of business.
– to survive (in the business) access to significant amounts of private knowledge is needed, as public information is years behind the mitigations.
– exploit-based public jailbreaks (JBs) are most likely over.
So far, so good. However an alternative reading of those statements could be this one:
– groups who have what he calls ‘private knowledge’ will still make the deals $$$.
– JBs won’t be released to the public (I’m not following that scene closely, but I think this has already been the case for a while now). They will instead be sold to the highest bidders.
– ofc the speaker and their company belong to the privileged as per the 1st statement.
To – maybe – support this reading Luca pulled a final trick by ending with a short video (which then again, from my limited perception, has been a common turn of such talks in the last years) which – maybe – showed a working JB against iOS 16.1. There was a comment on Twitter that a photo of that demo ‘misses context for those not in attendance’. I can confirm that it also missed context for some in attendance (incl. myself ;-), but this might (srsly) be attributed to my non-familiarity with the space. Overall a solid technical talk (I learned quite a bit), together – maybe – with a pitch for services which – maybe – can only be offered by a privileged few.
The team from the Airbus Security Lab has been doing very interesting research for many years, together with the release of results and tools. In this talk they discussed their findings from performing an in-depth assessment of NetBackup. NetBackup here being a perfect example of a piece of 3rd party software which
– is found in many large enterprises.
– runs with high privileges and/or has access to highly sensitive data.
– is complex in itself, and may use old & complex standards (e.g. in this case CORBA).
I generally think it’s super-important to publicly discuss the results of such assessments (presumably well-funded actors look at enterprise tools, too, albeit without publishing the results…). Similar stuff from another research group can be found here or here.
The speakers started by laying out their methodology & research questions:
They then provided a detailed overview of the inner architecture of NetBackup, its daemons & processes, the ports those run on, and how those interact.
Finally, evidently, their findings were presented, incl. a very nice demo (pay close attention to the names of the phases of the demo, in the top left part of bottom pic):
Overall this was one of my favorite Hexacon talks: relevant research, extremely well-structured presentation, and a cool demo. Slides can be found here.
Thomas Chauchefoin: You’ve got mail! And I’m root on your Zimbra server
Another talk dissecting (and pwning) a piece of enterprise software, in this case an e-mail & collaboration suite called Zimbra. This one being a perfect example of a commercial product which
– uses many OSS components, loosely coupled together + masked behind web frontends.
– undertakes more or less successful attempts to filter/sanitize input, which is then processed in the chain of those various, loosely coupled, components.
– does not rly use sandboxing of components or stripping-down of privileges.
What could go wrong with such a piece? – Right… (btw, many parts of this talk reminded me of the days when Felix owned FireEye boxes)
Thomas discussed the inner workings of Zimbra and subsequently several vulnerabilities he found (incl. CVE-2022-27924), accompanied by some proper demos.
Overall an interesting presentation, and apparently quite timely as active exploitation of Zimbra seems to happen these days.
Ophir Harpaz & Stiv Kupchik: Exploring Ancient Ruins to Find Modern Bugs: Discovering a 0-Day in MS-RPC service
MS-RPC is a juicy target – it runs on every Windows machine, the endpoint mapper service listens on a fixed port (TCP 135), and vulnerabilities might be wormable (Blaster used it, back in 2003). After Ophir laid out the general architecture and core terminology, Stiv explained how the interaction of authentication and caching of access information can lead to bypass attacks.
He went on to detail the steps needed to find CVE-2022-38034, which was patched on this month’s Patch Tuesday (= three days before their Hexacon talk ;-). Overall another excellent technical presentation & very relevant research. Slides can be found here.
David Berard & Vincent Dehors: I feel a draft. Opening the doors and windows: 0-click RCE on the Tesla Model3
Certainly one of the most anticipated talks of Hexacon, and they did not disappoint. To own the car they focused on the infotainment system (synecdoche used deliberately here as I seem to have missed the part in which they discussed the strong isolation between infotainment and CAN bus which Tesla uses, or not):
It runs Linux with some COTS components for embedded systems like ConnMan which turned out to be the path for (attacker 😉 ) interaction:
Some remarks they made gave me the impression that Tesla was not always super-cooperative during their research (and, as far as I understood, they did not receive the pay-out which would have been appropriate for their findings, but I might recall that part incorrectly…). David & Vincent concluded their – excellent – talk with an important reminder of the value of persistence in the life of a security researcher:
This was one of the Hexacon talks I was most looking forward to as I closely worked with Felix for many years, and I know that he has vast skills both in exploiting stuff and in explaining how he did it ;-). While in enterprise settings the SAML Identity Providers (IdPs) can be considered trusted by the Service Providers (SPs), this picture completely changes in cloud environments where the cloud provider has to interact with many potentially untrustworthy IdPs. When analyzing the attack surface in the space of XML signatures (which are used by SAML), Felix identified several vulnerabilities (CVE-2022-34716: External Entity Injection during XML signature verification, CVE-2022-29824: heap buffer overflow in xmlBufAdd, CVE-2022-34169: Integer Truncation in XSLTC). To quote from the Google P0 blog, the latter “would allow for arbitrary code execution in software using Xalan-J for processing untrusted XSLT stylesheets. As Xalan-J is used for performing XSLT transformations during XML signature verification in OpenJDK, this bug potentially affects a large number of Java based SAML implementations”. These are his conclusions:
Great talk in which I learned a lot about cloud trust models & modern attack surfaces based on old complex standards. (it seems IPv6 is not an exception here 😉 ) Slides can be found here.
Slides of talks which I did not discuss:
Hara-Kirin: Dissecting the Privileged Components of Huawei Mobile Devices – slides here
A journey of fuzzing Nvidia graphic driver leading to LPE exploitation – slides here
Toner Deaf – Printing your next persistence – slides here
What a lovely week! An in-person RIPE meeting – Jan Žorž said to me over dinner “it immediately felt like home”, and I totally agree. Following some tradition I will summarize a few interesting, IPv6-related talks & other observations from last week in this post.
Constanze Bürger: Challenges and Chances of IPv6 Deployment in Public Authorities in Germany
Constanze serves as a state secretary (‘Staatssekretärin’) in the German Federal Ministry of the Interior and Community. She has been driving IPv6 in the public administration space for a long time, and for that reason she’s been present at pretty much all RIPE meetings over the last years. In her talk she spoke about the challenges of getting IPv6 traction in her world, due to the distributed nature of responsibilities and the high degree of siloization (sounds familiar to some of you large-enterprise folks? ;-). She included a very nice – positive – case study though: the German online tax system called ELSTER, which has had IPv6 enabled since 2020 (which seems not to be the case for similar systems in other countries). In October 2021, 52% of the connections to it happened over IPv6 (Antonios Atlasis suggested those filing over v6 should get a tax discount, which given the current prices of IPv4 addresses could be worth a discussion ;-), and I could imagine that number is even higher in the interim.
Carsten Strotmann: Frag-DNS. IP Fragmentation and Measures Against DNS-Cache-Poisoning
IP fragmentation attacks against DNS have been known for a while (research overview on the APNIC blog here, paper by Shulman et al. on DNS over TCP from 2021 here), but their Internet-scale impact was unclear, and members of the DNS operator community considered them theoretical (see discussion at RIPE78). This is why the German BSI decided to commission a study evaluating both the real-life impact and discussing mitigations. The results of this study were presented in this talk:
Wilhelm Boeddinghaus: IPv6 and the Windows 10 Firewall
In this talk Wilhelm spoke about the intricacies of the default rule set of the integrated firewall of Windows 10 when it comes to IPv6, namely in the space of ICMPv6. While I don’t share his perspective that these rules are overly risky (and I think for such types of security controls, very understandably, usability often wins over strictness, which in turn might even increase overall risk reduction as users do not disable the whole thing then), it was an interesting technical exercise nevertheless.
Paolo Volpato: IPv6 Deployment Status. Update and Remaining Challenges
This was similar to Paolo’s recent talk in the IETF v6ops working group, hence I refer to my comments in this blog post. Of note, in the subsequent Q&A some challenging questions were asked, which were not directly related to the talk.
Justin Iurman: Just Another Measurement of Extension Header Survivability (JAMES)
This was also presented at IETF 113 which is why I, again, point to this post. Tl;dr: IPv6 extension headers can be considered unusable for any Internet-level service.
Matthias Scheer (AVM): IPv6 Addressing Inside a VPN Tunnel Between Endpoints With Rotating Prefixes
The talk itself might not win the title of the most entertaining or most exciting technical presentation of the week ;-), but given the strong presence of AVM in the German market many practitioners incl. myself heavily welcomed the fact that the vendor sought interaction with IPv6 folks at a RIPE meeting. I mean this is not least what those meetings are for, and it’s a great move by AVM to work on their IPv6 capabilities based on feedback from the IPv6 community (at least the part represented in RIPE circles).
Several quick things to note as for the meeting network and IPv6:
During the whole week my iPhone was connected to the v6-only/NAT64 network, and everything worked smoothly.
In the terminal room there was a networked printer and connecting (thanks mDNS) to the printer over IPv6 and more importantly printing (over IPv6, ofc) worked like a charm.
Here’s a router advertisement from the main conference network. I know that as an IPv6 person one should generally be very careful with mentioning the principle of least astonishment (POLA) 😉, but I’m not fully sure I can follow the client provisioning approach taken here.
Finally let me mention that one could take the RIPE NCC IPv6 certifications for free at the venue (which I did for the IPv6 Security Expert, and I luckily passed 😅). Offering these on-site at the meetings is an excellent idea imho (those who ever tried to perform them on-line might have an idea why I state this).
Overall it was a great week with lots of technical learnings and, more importantly, lots of good hallway-track encounters. Hope to see some of you folks in Belgrade in October!
Last week I attended the IETF 113 meeting in Vienna. I primarily went there to reconnect in person with some old IPv6 fellows, but also to see what’s going on in the IPv6 standardization space which I hadn’t been following closely in recent times. In this post I’ll shortly summarize some contributions presented in the main v6-related working groups (wg), that is, v6ops and 6man.
v6ops
Video recording of the full session here. Individual comments here.
IPv6 Deployment Status Current draft here. Slides from wg session here. This is the abstract of the draft:
I for one am not sure if this draft/effort is really needed in 2022. There are many reasons why the global IPv6 deployment is not happening at the speed/scale that IPv6 proponents have been hoping for, and those reasons might be very diverse in nature on the one hand, and might not need another discussion/documentation on the other hand.
NAT64/DNS64 detection via SRV Records Current draft here. Slides from wg session here.
NAT64 currently gains ground & is actively discussed in many environments, but a number of operational aspects like the placement of the NAT64 function within the network, or which prefixes to use have to be considered. This is why I think this is an important draft. Martin presented existing methods (for one specific aspect ;-), why those might be insufficient, the goals of their suggested approach, and the layout of the planned SRV records. Also proof-of-concept code is now available.
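As an aside, one of the existing prefix-discovery methods in this space is RFC 7050-style detection via ipv4only.arpa: a host resolves that name over the NAT64’s DNS64 and extracts the prefix from the synthesized AAAA answer. A minimal sketch of the extraction step, assuming the common /96 prefix case (function name and example values are mine):

```python
import ipaddress

# the well-known IPv4 addresses behind ipv4only.arpa (RFC 7050)
WKA = (ipaddress.IPv4Address("192.0.0.170"), ipaddress.IPv4Address("192.0.0.171"))

def nat64_prefix(synthesized: str):
    """Given a synthesized AAAA answer for ipv4only.arpa, extract the
    NAT64 prefix -- /96 case only; RFC 6052 also allows other lengths."""
    packed = ipaddress.IPv6Address(synthesized).packed
    # with a /96 prefix the embedded IPv4 address sits in the last 4 bytes
    if ipaddress.IPv4Address(packed[12:16]) in WKA:
        return ipaddress.IPv6Network((packed[:12] + b"\x00" * 4, 96))
    return None

# e.g. an answer synthesized with the well-known prefix of RFC 6052:
print(nat64_prefix("64:ff9b::c000:aa"))  # 64:ff9b::/96
```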
Scalability of IPv6 Transition Technologies for IPv4aaS Current draft here. Slides from wg session here.
Again I think this is a relevant effort as, evidently, scalability considerations play a huge role once a transition technology gets deployed, but there’s not much existing work available (not in the methodology space, and not when it comes to real-life metrics & measurements). Their work/this draft might hence provide
– some indication re: what’s realistic.
– types of measurements to request from vendors.
Neighbor Discovery Protocol Deployment Guidelines Current draft here. Slides from wg session here.
Here I’m not certain if the IPv6 world needs this type of guidance/documentation, as the respective issues have already been extensively discussed for the last ten years, and several architectural or implementation-level approaches for dealing with ND (security) shortcomings have been developed (e.g. see ‘client isolation’ section in this post).
Requirements to Multi-domain IPv6-only Network Current draft here. Slides from wg session here.
This draft discusses some scenarios in multi-operator settings using v6-only, which I hadn’t thought about earlier. Interesting work to be followed.
Just Another Measurement of Extension header Survivability (JAMES) Current draft here. Slides from the wg session here.
Éric Vyncke supervises these measurements performed by Raphaël Léas and Justin Iurman from the University of Liège. This was also presented at the IEPG meeting covered by Geoff Huston in this blogpost. Important effort in general, and I always welcome IPv6 research work performed together with academia. Some may say the results are not too surprising 😉 – here’s a tweet with some data, and Geoff commented as follows:
6man
Video recording of the full session here. Individual comments here.
IPv6 Hop-by-Hop Options Processing Procedures Current draft here. Slides from wg session here.
Taking the results from the last presentation in v6ops into account (see above), there might be a bit of irony here, but I found especially the discussion after the presentation quite enlightening.
Source Address Selection for foreign ULAs Slides from wg session here.
In this one Ted Lemon spoke about an interesting scenario in a home network with multiple routers and multiple ULA prefixes, where certain destination hosts are not reachable from specific (source) hosts, due to a combination of factors (routers themselves ignoring RAs and hence not learning prefixes originated from other routers’ RAs, & the way source address selection works as of RFC 6724). This talk triggered a long & interesting discussion. Some people stated that a misconfiguration must be present in the scenario (I don’t think there is, and I know a bit about the background of the talk/scenario), others stated that the C[P]E router ‘violated RFCs’ (namely RFC 7084 Basic Requirements for IPv6 Customer Edge Routers), which I think is a ridiculous stance. Still, overall a very good discussion which was helpful for identifying approaches for dealing with such situations.
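For illustration, the rule-8 tie-breaker of RFC 6724 (‘use longest matching prefix’) can be sketched as follows – a toy example with made-up prefixes, deliberately ignoring the preceding rules (notably rule 6, which compares policy-table labels first and already separates ULA from GUA candidates):

```python
import ipaddress

def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two IPv6 addresses, in bits
    (the tie-breaker of RFC 6724 rule 8, 'use longest matching prefix')."""
    x = int(ipaddress.IPv6Address(a)) ^ int(ipaddress.IPv6Address(b))
    return 128 - x.bit_length()

# destination in a *foreign* ULA prefix; candidate sources: own ULA + GUA
dst = "fd11:2222::100"
own_ula, gua = "fdaa:bbbb::1", "2001:db8::1"
print(common_prefix_len(own_ula, dst), common_prefix_len(gua, dst))  # 8 0
```

So under rule 8 (as under rule 6) the local ULA source wins for a foreign ULA destination, even though no route between the two ULA prefixes may exist – which is exactly the kind of unreachability the talk described.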
I hope to be able to meet some of you, dear readers, at the upcoming RIPE meeting in Berlin. I even consider reviving the tradition of an ‘IPv6 Practitioners Dinner’ – let me know if you want to join.
Recently RFC 9099 Operational Security Considerations for IPv6 Networks was published. It was authored by Éric Vyncke, Kiran Kumar ‘KK’ Chittimaneni, Merike Kaeo and myself, and we plan to write a little series on its objectives & main recommendations on the APNIC Blog. To prepare for that let me provide a short overview of it in this post.
RFC 9099 was a long time in the making (nearly nine years! between the first Internet-Draft in the OPSEC working group and the final publication). As you’ll see in a second it covers many IPv6 areas which by themselves are in the centre of nearly religious debates (like filtering of extension headers, or ULAs + other addressing topics). Hence quite a few lengthy e-mail threads on the WG’s mailing list were created, which made reaching consensus not necessarily easier. Also at some point IETF procedures – this sounds better than ‘politics’, doesn’t it? 😉 – kicked in which led to additional delays (for those interested in this dimension of work within the IETF see Geoff Huston’s lucid Opinion: The making of an RFC in today’s IETF).
The document is focused on what we call ‘managed environments’ like service provider/operator networks or enterprise environments, and it is organized in several sections:
Addressing: evidently the addressing architecture chosen for a specific IPv6 deployment can have significant impact on a network’s security posture (when it comes to routing, traffic filtering or logging), so the various types of IPv6 addresses and their security implications are presented in detail in this section.
Extension headers: as those constitute one of the main technical differences between IPv4 and IPv6, and at the same time they have interesting (one could even write: ‘challenging’) security properties, they’re discussed in a dedicated section.
Link-layer security: examining the local communication mechanisms of IPv6 both from an offensive and from a defense point of view makes the main content of this section. Here all the stuff like NDP attacks, rogue router advertisements, and their related protection mechanisms are described. Again, this is an area where major differences between IPv4 and IPv6 exist.
Control plane security: very important topic from an infrastructure security perspective which is why it has an own section.
Routing security: same as for the previous section – overall very similar security best practices as in IPv4 networks have to be applied for IPv6 in this space as well, e.g. the excellent guidance provided in RFC 7454 BGP Operations and Security.
Logging/monitoring: some elements of the overall IPv6 architecture (like the ephemeral nature of IPv6 addresses, the fact that usually several of them co-exist on a given interface, or their general format) have significant impact on the way how logging and security monitoring are done in many organizations. These are looked at in detail in this segment.
Transition/Coexistence Technologies: from my experience various organizations underestimate the efforts for properly securing dual-stack deployments (which btw is another argument for going v6-only where you can). Furthermore the use of tunnel technologies traditionally creates headaches for security practitioners, so they merit respective considerations (at least we thought so. This section was heavily contested during the development of the RFC as people thought that the related security challenges do not stem from IPv6 itself but mostly from operational deficiencies in IPv4 networks, namely those not aware of the concurrent presence of IPv6 in their world).
General device hardening: a security guidance document wouldn’t be complete without this, right? 😉
Enterprise-specific security considerations: deploying IPv6 in enterprise environments needs some additional reflections (see also RFC 7381 Enterprise IPv6 Deployment Guidelines) which is why we cover the security side of things in a dedicated chapter, which in turn is split into two subsections on external and on internal security.
Service provider security considerations: obviously operator networks need proper IPv6 security. While many of the needed security controls are already covered in earlier parts of the RFC some operator-specific aspects like lawful intercept are discussed here.
This post was meant to make you aware of RFC 9099 in case you didn’t know it before, and to provide a quick overview of its content. Additional posts with technical details on its individual areas will be published on the APNIC blog.
At first I wish all readers a very happy new year and all the best for 2022! May the force be with you for your IPv6 efforts ;-).
In this post I’m going to discuss some characteristics of IPv6 in common organization-level (as opposed to home networks) Wi-Fi deployments. These characteristics have to be kept in mind both during design & implementation and in the course of troubleshooting. Many IPv6 practitioners learn(ed) about IPv6 fundamentals in Ethernet networks (quick hint on terminology: in this post the term ‘Ethernet’ always means ‘wired Ethernet’ as of IEEE 802.3 standards, and ‘Wi-Fi’ refers to technologies in the context of IEEE 802.11), and it’s probably a safe assumption that the designers of IPv6 (in the 90s) mostly had such networks in mind when core parts of IPv6 and its communication behavior on the local-link where specified. While IPv6 neighbor discovery (NDP) as of RFC 4861 strictly speaking supports many different link types (section 3.2), the protocol overview in section 3.3 heavily relies on multicast transport (which doesn’t make sense on certain link types). This is aligned with a mental model of IPv6 behavior that quite a few of us (practitioners) have, and which is based, among others, on the following assumptions:
(1) on the local link there are usually (at least) some neighbors, and if so, then interaction with them is possible by certain mechanisms like NS/NA messages.
(2) multicast is a somewhat reliable mechanism (otherwise NDP would be unreliable), and it has at least similar performance properties as broadcast (otherwise NDP would be slower than ARP in IPv4 which certainly wouldn’t have been an acceptable objective ;-).
(3) sniffing ICMPv6 messages (which encompasses all NDP packets incl. router advertisements) will provide an initial understanding of the local environment.
As we will see in the following, very often Enterprise-level Wi-Fi networks are implemented in a way that renders quite a few of these assumptions debatable. Again, it should be noted that the resulting differences do not apply to IPv6 in home networks, which hence can be expected to work in a way that aligns with the above assumptions. The mentioned differences mainly stem from two sources, which is why it can be helpful to understand those first.
Assumptions & Security Properties
Wi-Fi networks are often treated slightly different from a security perspective, based on certain assumptions incl. (but not limited to 😉 ) the following:
They are considered more hostile environments than ‘the trusted corporate LAN’ (based on thinking along the lines of “heard of those guys getting into our Wi-Fi network from the parking lot, via that compromised PSK?”). So more scrutiny is put onto basic network security measures (like just dropping certain packets, see 3rd point).
their traffic is expected to be primarily ‘eyeball traffic’ flowing from clients to servers either in the Internet or in the organization’s data centers, hence no need to communicate with other systems within the same Wi-Fi network/VLAN (as opposed to the Ethernet VLANs where a user/system might still need to reach that lab system under the desk, that printer over there, or the web interface of that building management system which is placed in the same VLAN, for ‘historical reasons’). In enterprise-grade Wi-Fi networks subsequently very often mechanisms to isolate clients from each other can be found (discussed in more detail below).
infrastructure systems like routers or DHCPv6 servers are expected to never reside in the Wi-Fi which is why packets supposed to originate from such systems (IPv6 router advertisements or DHCPv6 Advertise, Reply or Reconfigure messages) can be & actually get dropped by default. Please note that the presence of devices implementing Thread networking (like the HomePod mini) puts this approach into question, but that’s another discussion, and the respective filters might not even be (easily) configurable.
Now let’s look at some technologies in more detail, together with their impact on troubleshooting.
Client Isolation
This is a feature that blocks ‘direct’ connections between clients associated to the same WLC or the same AP. The actual technologies are vendor-specific (‘Peer-to-Peer Blocking’ in Cisco land or ‘Deny Inter-User Traffic’ in Aruba land) but the impact can essentially be broken down to: wireless clients can’t ‘see’/reach each other by means of unicast traffic nor by certain multicast traffic (which usually includes IPv6 NDP but *not* mDNS/LLMNR, so the latter commonly pass the boundary). It should further be noted that this feature is implemented on the WLC/AP level, so attackers might still be able to send packets directly to individual stations. Impact on behavior, in particular in the context of troubleshooting:
the actual implementations of different vendors might vary, so one should be extra careful with conclusions. This applies to both handling of specific multicast traffic and to traffic to/from the Ethernet side of things (commonly at least some of this is passed — think: physical router sends RAs to ff02::1 — but other stuff might be dropped, e.g. neighbor solicitations to SNMA of individual Wi-Fi clients. Some devices allow configuring some properties, e.g. look for ‘Forward-Upstream’).
keep this feature in mind when troubleshooting connection issues with colleagues (‘can you ping my MacBook?’ might not work as expected ;-).
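Since the filtering described above may hit neighbor solicitations sent to a client’s solicited-node multicast address (SNMA), it’s worth recalling how that group is derived from a unicast address – a short sketch (RFC 4291; the example address is made up):

```python
import ipaddress

def solicited_node(addr: str) -> ipaddress.IPv6Address:
    """Solicited-node multicast address (RFC 4291): ff02::1:ff plus the
    low-order 24 bits of the unicast address. NS packets for a station
    go to this group, so it's what a WLC/AP must pass (or answer on
    behalf of) for neighbor discovery towards Wi-Fi clients to work."""
    low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return ipaddress.IPv6Address(base | low24)

print(solicited_node("fe80::1c2a:33ff:fe44:5566"))  # ff02::1:ff44:5566
```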
Performance- or Security-oriented Optimizations of NDP Traffic
A number of mechanisms/configuration tweaks exist in the context of NDP (router advertisements and NS/NA packets). The most known ones are the following (the terminology is a bit Cisco-oriented, based on stuff we used to do at Troopers, but these features can be found, under one name or another, in most Enterprise-level Wi-Fi solutions):
RA Throttling: WLC/AP rate limits forwarding of RAs to Wi-Fi, based on certain thresholds & related timers. From an operator perspective one has to make sure that the Router Lifetime in the RAs exceeds the timers used here (see also section 4 of RFC 7772. Andrew Yourtchenko, one of its authors, used to use 9000s in one of his networks, see this post). Some years ago the default Router Lifetime on Junos was 180s which could lead to issues in networks using RA Throttling (wireless clients losing their default route as they did not receive a new RA before the default route generated from the last received RA timed out).
Unicast RAs: router advertisements sent in response to a RS are only sent to unicast address of requesting node (instead of sending them to the all-nodes multicast address/ff02::1. RFC 4861 states [in section 6.2.6] that a router ‘MAY’ do this, so it’s a valid, and commonly used, approach).
‘NDP proxy’: when using this feature the WLC responds to NS packets from the Ethernet side by sending NAs ‘on behalf’ of Wi-Fi stations. At this point it can also convert (for unknown MAC addresses) the multicast NS into a unicast packet sent to the MAC address of the wireless client, and some implementations have a dedicated mode for DAD. See also RFC 8929 for a technical description of an ‘ND proxy’.
RA Guard (I tested this some years ago with surprisingly solid results).
IP Source Guard: this is a security feature that checks MAC address-to-IP(v6) address bindings. From an operations perspective one may keep in mind that there’s a threshold of IPv6 addresses which can be associated with one MAC address (iirc, on Cisco devices it’s eight [8]), and subsequently apparent violations might occur once clients regularly generate privacy addresses after coming back from sleep mode or similar. While I’ve never seen this irl I’m not sure which risk is supposed to be mitigated by the feature anyway (connectionless spoofing of a station’s IP address by another? for which attack vector? who would ever do this?).
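To illustrate the RA Throttling/Router Lifetime interplay from the first bullet: the Router Lifetime is a 16-bit field early in the RA header, so checking it in a capture is straightforward. A minimal sketch (no checksum handling; the values are illustrative, not vendor defaults):

```python
import struct

def ra_router_lifetime(icmp6: bytes) -> int:
    """Extract the Router Lifetime (seconds) from a raw ICMPv6 Router
    Advertisement. Layout per RFC 4861 section 4.2: type(1) code(1)
    checksum(2) cur-hop-limit(1) flags(1) router-lifetime(2, big-endian),
    then reachable/retrans timers."""
    if icmp6[0] != 134:  # ICMPv6 type 134 = Router Advertisement
        raise ValueError("not a router advertisement")
    (lifetime,) = struct.unpack_from("!H", icmp6, 6)
    return lifetime

# a minimal (checksum-less) RA header with a 9000 s lifetime, the value
# mentioned above in the RA Throttling context:
ra = struct.pack("!BBHBBHII", 134, 0, 0, 64, 0, 9000, 0, 0)
print(ra_router_lifetime(ra))  # 9000
```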
Impact on behavior, in particular in the context of troubleshooting:
these features are vendor-specific. Their default settings, configuration approaches, and working modes might vary, even between devices from the same vendor (e.g. see this thread).
expected behavior re: link-local traffic might differ from observed behavior (certain NDP messages not seen on Wi-Fi due to controller interaction, RAs seemingly missing due to RA throttling etc.)
‘Mobility’ / Layer 2 Will Never Be the Same
In order to allow stations to physically move/to roam between areas covered by different APs, all modern controller-based Wi-Fi solutions implement techniques that span kind-of virtual Layer-2 domains across multiple APs or even across multiple controllers. Furthermore traffic can be tunneled between controllers over Ethernet over IP (EoIP) — this is often, but not only used for Wi-Fi guest networks — which then includes so-called anchor controllers providing a break-out point of the traffic towards certain parts of the corporate network or to the Internet. The main thing to keep in mind here is that a neighbor (in IPv6 terms) can actually be a system separated from a vantage point by many Layer-2 and Layer-3 devices/hops (this is the same in VXLAN environments, but from my experience in Wi-Fi space diagnosing errors might be more difficult due to lack of proper tooling or even proper CLI access/commands).
Impact on behavior, in particular in the context of troubleshooting:
from an operator perspective one should note that any type of tunneling can have an impact on the MTU, an area where IPv6 traditionally does not have a great reputation, so to say (e.g. see this Cisco bug).
while troubleshooting, carefully consider the impact of (all of) the above technologies. For example imagine you don’t see an NA after a station has sent an NS for the default router’s IPv6 address. The actual traffic flow could easily be the following:
for the NS packet: station sends multicast packet -> AP -> (over a tunnel protocol to) WLC -> Ethernet -> SVI.
for the corresponding NA: unicast packet all the way back, but potentially over a slightly different path (AP).
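The MTU point above boils down to simple arithmetic – here’s a sketch with an assumed (not vendor-accurate) overhead value; actual CAPWAP/EoIP overhead depends on the deployment:

```python
def inner_mtu(outer_mtu: int, overhead: int) -> int:
    """MTU left for station traffic once tunnel encapsulation overhead is
    subtracted. IPv6 requires every link to carry at least 1280 bytes
    (RFC 8200), so anything below that breaks IPv6 outright."""
    mtu = outer_mtu - overhead
    if mtu < 1280:
        raise ValueError(f"{mtu} is below the IPv6 minimum MTU of 1280")
    return mtu

# 54 bytes is an illustrative overhead figure only (outer headers, DTLS,
# VLAN tags etc. vary per deployment):
print(inner_mtu(1500, 54))  # 1446
```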
tl;dr: In Wi-Fi networks usually a number of techniques can be found which interact with IPv6 in various ways. As a network designer you should be familiar with those technologies. When doing troubleshooting in such networks it might be helpful to keep them in mind, too ;-).
Thank you for reading so far. I’m always happy to receive feedback, either here on the blog or via Twitter. Happy IPv6 networking to you all!
In large environments security controls based on packet filtering, such as firewalls and ACLs on network devices, often face an unfortunate dilemma: there’s a gap between the parties understanding the communication needs of an application (say: the application owners) and the parties implementing the actual security enforcement (e.g. the firewall ops team). Those also have different motivations: “it has to work” (see RFC 1925 rule 1 😉) for the former group versus “it has to be secure = fulfill certain security objectives” for the latter. This gap can manifest in many socio-technical ways, which is the reason why ‘firewall rule management’ has been the subject of many discussions over recent years. In another post which I wrote a few years ago I stated that going for the upper-right quadrant in the following diagram usually requires high operational effort (which can actually produce the opposite outcome due to added process complexity), a high level of automation, accepting trade-offs, or a combination of these.
That’s why several organizations are considering another approach or have already started deploying it. Here I’ll call it ‘self-service ACLs’, and it can be summarized as follows:
move (the enforcement function of) packet filters towards the hosts (e.g. via ip[6]tables running locally, or some rule set running on a network device ‘just in front’ of a group of hosts, e.g. a VPC).
provide a nice web-based management interface to these rules
store all rules in a centralized database
allow application teams to manage the rules themselves. Besides the technical decentralization (or, to put it in more familiar lingo from the networking space: ‘disaggregation’), this one constitutes the main paradigm shift.
The underlying idea is simple: “let the owners of an asset/a service handle what they need, in a flexible manner”, without all those organizational or process-induced gaps, and it seems like a good approach to solving the issues I laid out above.
Alas – as so often with simple ideas seemingly solving complex problems – there are some often-overlooked pitfalls. These are going to be the subject of this post. Quick disclaimer re: terminology: I’ll use the terms ‘rules’, ‘firewall rules’, and ‘ACLs’ interchangeably. Just think of a rule as part of a larger rule set, implementing packet filtering based on the traditional approach of sources, destinations and services, with the former represented by IP addresses/ranges and the latter by protocols or ports in some notation.
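To make that terminology concrete, here’s a minimal sketch of such a rule and a first-match evaluation against a rule set. The schema (field names, default-deny behavior) is purely illustrative, not any specific product’s:

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network

@dataclass
class Rule:
    # hypothetical minimal schema: source/destination ranges plus a service
    src: str       # source network in CIDR notation
    dst: str       # destination network in CIDR notation
    proto: str     # 'tcp' / 'udp'
    dport: int     # destination port
    action: str    # 'accept' / 'drop'

def first_match(rules, src, dst, proto, dport):
    """Return the action of the first matching rule; default-deny otherwise."""
    for r in rules:
        if (ip_address(src) in ip_network(r.src)
                and ip_address(dst) in ip_network(r.dst)
                and proto == r.proto and dport == r.dport):
            return r.action
    return "drop"

rules = [Rule("2001:db8:a::/64", "2001:db8:b::/64", "tcp", 443, "accept")]
print(first_match(rules, "2001:db8:a::10", "2001:db8:b::1", "tcp", 443))  # accept
print(first_match(rules, "2001:db8:c::10", "2001:db8:b::1", "tcp", 443))  # drop
```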
Let’s start with looking at the lifecycle/dimensions of such a rule:
(1) a ‘management’ step/function, like the creation of the rule (by some party) or the modification of the rule (by some party)
(2) the actual enforcement function of the rule
(3) logging (of certain enforcement events, e.g. dropping a packet)
(4) analysis of a rule (e.g. as an intellectual exercise performed in certain life situations 😉 or as a contributing element to metrics)
(5) troubleshooting network communication flows, which often involves functions (3) and (4)
(6) review of a rule (e.g. against corporate policy or compliance requirements)
In the ‘traditional model’ most of these were performed by the same party (‘firewall ops team’), but here the self-service model induces significant changes. The expected benefit is centered around moving (1) (into the hands of the app owners holding the contextual intelligence) and (2) (topographically towards the assets actually needing the protection). Sadly this also brings changes to the other functions, with some interesting effects. Let’s look at two affected functions/lifecycle elements: logging and review.
Logging
In infosec circles there’s an old adage ‘each security layer should provide logging’. Let’s assume the log files are still written to a central place (this is what most organizations do, for a variety of reasons, and I for one think that this makes sense). This can create interesting situations:
The new owners of the rules will, somewhat legitimately, think that they own the logs, too (“these are our rules, we manage them, and we should be able to see what’s happening”).
How do they then get access to the (centralized) log files?
More importantly: how do they get that access in a properly tenant-separated way? (you don’t want the database team to be able to see the log files of the authentication servers, do you?)
I’ve yet to see an organization which has solved this problem in a way that fulfills the requirements of the different parties. So one might have to accept some trade-off here (e.g. the loss of visibility into log files for one of the involved parties).
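To illustrate the tenant-separation problem, a minimal sketch. It assumes, hypothetically, that each log record carries the id of the rule that triggered it, and that the central rule database knows the owning team per rule id:

```python
# Hypothetical rule-id -> owning-team mapping from the central rule database
rule_owner = {"r-101": "db-team", "r-202": "auth-team"}

# Hypothetical centralized log records, each tagged with the triggering rule
logs = [
    {"rule": "r-101", "msg": "DROP tcp 2001:db8::5 -> 2001:db8:10::1 port 5432"},
    {"rule": "r-202", "msg": "DROP tcp 2001:db8::7 -> 2001:db8:20::1 port 636"},
]

def logs_for_team(team: str):
    """Return only those records belonging to rules owned by 'team'."""
    return [rec for rec in logs if rule_owner.get(rec["rule"]) == team]

print(logs_for_team("db-team"))  # only the r-101 record
```

Even this toy version shows the catch: the filtering only works if every enforcement point reliably tags its log records with a rule id, and if rule ownership is maintained as diligently as the rules themselves.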
Rule Review
A similar conflict of interest arises in the context of rule review. Can one reasonably expect the party whose main interest is essentially ‘it has to work’ to perform a review of rules based on corporate policy, PCI requirements or the like? Again, this is an inherent dilemma only solvable by a high degree of collaboration (while self-service ACLs are often supposed to reduce the needed amount of collaboration). On the other hand, rule review might be a bit of a dysfunctional process in some organizations anyway, as this recent Twitter poll seems to imply 😉
General Paradigm Shift
Finally one should keep in mind that the introduction of self-service ACLs can mean a cultural shift for application teams, from opening tickets (which includes a partial transfer of responsibility) to managing rules (= security controls) themselves (which in turn also requires developing security skills & practice). Not all app teams might be happy with that; especially those running core applications with high availability requirements might be a bit risk averse in this context ;-).
tl;dr: While self-service ACLs can address long-existing process-level deficiencies in some organizations, they might well introduce new ones. Understanding the demarcations of the individual functions within a rule’s lifecycle, and the incentives of the different involved parties, will be crucial for a successful deployment of the approach.
I know that some of the readers of this blog are IPv6 cheerleaders in their respective organizations, and as such they might occasionally face questions along the lines of “what’s the state of IPv6 in our company?” or “are we progressing IPv6-wise?” (the latter in particular when dedicated resources are spent on the IPv6 transition on a regular basis, as opposed to those 50-person-days “let’s get ready for IPv6 (by writing some documents)” projects once every few years). In this post I’ll discuss some approaches to make IPv6 progress within an environment visible.
For those interested in IPv6 progress in the global Internet or on a country level, an overview of the main sources for numbers can be found here. When undertaking a similar exercise for an individual organization let’s keep in mind that in general reporting in corporate life has three main aspects:
a question (to be answered, or: which message do we want to convey by means of the reporting effort?)
data (numbers which can actually be collected, ideally in an automated manner)
an audience (the recipients of the reporting, including the format they can digest best)
A main differentiator between other types of measurements commonly found in large enterprises like “number of systems being on the latest OS patch level” and IPv6 reporting is that in our case usually two dimensions are of interest:
(1) to what extent are we ready/prepared for IPv6?
(2) to what extent is IPv6 (traffic) really happening?
On the one hand those are fundamentally different dimensions (or: ‘questions’ to be answered for the audience of the respective reporting efforts), on the other hand there are relationships between them. Still, I think it’s important to understand the differences. I know quite a few companies where the question ‘do we support IPv6?’ asked of the ‘network infrastructure folks’ would be answered with a resounding ‘yes, of course we do!’. However, that does not necessarily mean that much real-life application traffic happens over IPv6 in those environments. As long as there are no applications or services actually being IPv6-enabled, that nice positive answer from the network people might not be of much practical value, unfortunately (don’t get me wrong, dear networking colleagues, of course I know it all starts with the network…). And enabling IPv6 for applications and services usually depends on certain core infrastructure services (those in the realm of authentication, provisioning, monitoring, or security are common examples; see also the ‘three dimensions of IP addresses’ discussed here). From that perspective displaying the current (IPv6-related) state of those (as I call them) dependencies might be more important – at least for a certain audience – than just coming up with numbers of ‘dual-stacked hosts’ (which might not use IPv6 at all as there’s nobody reachable to talk to over IPv6 ;-).
Let’s look at some typical parameters of the two above categories. In the space of (1) the following ones come to mind:
State of important dependencies, maybe even in a non-numeric way (like, per dependency: ‘no ongoing v6 efforts’, ‘has v6 in dev’, ‘has v6 in production’ etc.)
Number of dual-stacked hosts. Assuming that the default preference of OSs is to use IPv6 when possible (of course it’s more complicated considering Happy Eyeballs and Java options/preferences) this gives an idea of ‘which percentage of our systems would use IPv6 if they had the opportunity [read: when they connect to proper IPv6 endpoints]?’.
When many users work from home (like in quite a few organizations nowadays, in October 2021), looking at some stats related to VPN connections might be of interest. (hint: in such times bringing IPv6 to your VPN should be a high priority effort anyway if you want to drive IPv6 across your organization).
number of IPv6-enabled VIPs on your load balancers in case those are infrastructure elements being used in your organization. This is a particularly interesting one as one can assume that such a VIP is only created when other infrastructure requirements are already met (read: when an application team is serious about supporting IPv6 connectivity to their respective application, and some dependencies as for authentication, monitoring or logging are already solved).
subnets which have IPv6. Again, this one does not tell anything about actual IPv6 traffic, but about the ‘level of preparedness’ (which is the overall purpose of this category).
number of AAAA records in DNS. Depending on the point of time (within an application’s journey towards IPv6) when those are created this might be an interesting indicator. For example during LinkedIn’s IPv6 initiative they deliberately did this rather late during the infrastructure services transition (slides and video of their related talk at the UK IPv6 Council can be found here), so the mere existence of an AAAA record meant serious (IPv6) business.
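Several of the ‘preparedness’ numbers above can be derived from an address inventory. A minimal sketch of the dual-stack classification, assuming a simple hostname-to-addresses mapping (the inventory format is purely hypothetical):

```python
from ipaddress import ip_address

# Hypothetical inventory: hostname -> list of assigned addresses
inventory = {
    "web01": ["192.0.2.10", "2001:db8::10"],
    "db01":  ["192.0.2.20"],
    "app01": ["2001:db8::30"],
}

def classify(addrs):
    """Classify a host by the address families it has configured."""
    versions = {ip_address(a).version for a in addrs}
    if versions == {4, 6}:
        return "dual-stack"
    return "v6-only" if versions == {6} else "v4-only"

counts = {}
for host, addrs in inventory.items():
    label = classify(addrs)
    counts[label] = counts.get(label, 0) + 1
print(counts)  # {'dual-stack': 1, 'v4-only': 1, 'v6-only': 1}
```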
For some of the above values dashboards with ongoing data might be proper (display) instruments, but that would then depend on the ability to collect the respective numbers in an automated manner, and on the audience the reporting is intended for (not everybody likes clicking on a link to a dashboard and interpreting numbers on their own; some recipients might prefer to get a quarterly e-mail with some core numbers which can be digested in 10 minutes, on an iPad).
Typical category (2) parameters, to show that IPv6 is actually happening (as in: end-to-end IPv6 connections take place over the network), include the following:
all sorts of traffic statistics (e.g. from NetFlow) which usually show the ratio of IPv6 traffic to overall traffic. Here one should keep in mind that one single but traffic-heavy application getting IPv6 might create the impression of significant progress, while little may have happened in the space of infrastructure dependencies; this is exactly why both main categories (and being transparent with regard to their differences) are important.
similar stats for connections to major applications or infrastructure services (e.g. authentication servers). Bonus: measure associated RTT or latency from vantage points or something similar, as a follow-up question from the audience might be: “ok, and now tell us how application performance compares between IPv6 and IPv4”.
number of IPv6-only hosts. Assuming that those hosts establish or receive whatever type of network connections (and why else would they exist in the first place ;-), this one can really provide insight into IPv6 progress within an environment. To avoid misinterpretations in rapidly growing environments, ideally put this in relation to the overall number of hosts, in case you have that number at hand (see the poll at the beginning of this post on the APNIC blog for why I mention this ;-). Also please note that I use the term ‘host’ here in a loose manner which encompasses containers and other types of ephemeral entities having an IP address.
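The ‘one traffic-heavy application’ skew mentioned above can be made visible by complementing the overall ratio with a per-application breakdown. A minimal sketch over hypothetical flow records (field names are made up for illustration):

```python
# Hypothetical aggregated flow records: application, address family, byte count
flows = [
    {"app": "backup", "family": 6, "bytes": 9_000_000},
    {"app": "erp",    "family": 4, "bytes": 800_000},
    {"app": "mail",   "family": 4, "bytes": 200_000},
]

total = sum(f["bytes"] for f in flows)
v6 = sum(f["bytes"] for f in flows if f["family"] == 6)
print(f"IPv6 share of traffic: {v6 / total:.0%}")  # 90%

# ... but how broadly is IPv6 actually deployed across applications?
all_apps = {f["app"] for f in flows}
v6_apps = {f["app"] for f in flows if f["family"] == 6}
print(f"apps generating IPv6 traffic: {len(v6_apps)} of {len(all_apps)}")  # 1 of 3
```

Here a single backup application produces a 90% IPv6 traffic share while only one of three applications actually speaks IPv6, which is exactly the kind of nuance a single headline number would hide.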
I hope the above provides some inspiration for those dealing with the task of making IPv6 progress within an environment visible. I’m always happy to receive feedback of any type, either here or on Twitter. Thank you for reading so far, and good luck with your IPv6 efforts!
Reflecting on IP addresses, and on factors contributing to having a proper inventory of active ones, recently led me to put up a Twitter poll. Here are the results:
Looking at these numbers it seems that quite a few organizations struggle with maintaining a more or less accurate inventory of active addresses in their networks. At this point infosec purists may stress the importance of a thing called asset management (I mean there’s a reason why it’s the 1st function in the 1st category of the NIST Cybersecurity Framework, right? ;-), but I for one just felt I should reflect a bit more on the role of IP addresses for certain processes within an organization. Not least as related aspects and questions may become even more important once a whole new class of addresses enters the corporate infosec ring: enter IPv6. Let’s hence imagine there’s a certain number of systems in your organization’s network, and some of those systems are now getting IPv6. Which security processes could potentially be affected (read, in a sufficiently large organization: which teams should know about the ongoing IPv6 effort? ;-).
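As a minimal illustration of what such an inventory check can look like, here’s a sketch comparing documented addresses against addresses actually observed on the wire (e.g. from neighbor caches or flow exports; both sets here are hypothetical):

```python
# Documented addresses (the inventory / source of truth)
documented = {"2001:db8::10", "2001:db8::20", "192.0.2.10"}

# Addresses actually observed (e.g. from neighbor caches or flow exports)
observed = {"2001:db8::10", "2001:db8::99", "192.0.2.10"}

unknown_active = observed - documented   # active on the wire, but not documented
stale_entries = documented - observed    # documented, but never seen

print("unknown active addresses:", unknown_active)  # {'2001:db8::99'}
print("stale inventory entries:", stale_entries)    # {'2001:db8::20'}
```

Both result sets are interesting from a security perspective: the first may indicate rogue or forgotten systems, the second erodes trust in the inventory itself.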
From a simplified perspective security processes can be grouped into the following categories (those interested in other categories find some here):
preventative ones (‘avoid that a threat can materialize against an asset’). For our purpose here let’s take filtering of network traffic and patch management as examples.
detective ones: these can include the detection of deviations of a desired security state like the detection of vulnerabilities (=> vulnerability management) or the detection of security violations (e.g. by system-local mechanisms like log files or agents, or by means of network [security] telemetry).
reactive security processes, e.g. incident response.
Traffic filtering is usually mainly done in one of these two flavors:
gateway-based filtering (firewalls or router ACLs). When this is used, it may be of (security) interest whether there’s at least one active IP/system of a specific address family (here: IPv6) in use within a given subnet.
local packet filtering. When the respective rules are centrally managed, those in charge had better know about active IP speakers, I’d say. When they are not centrally managed (which is the case in many environments), first and foremost one might ask: who manages them at all? 🤔😂 Kidding aside, for those entities responsible for the configuration of local packet filters, knowledge of active addresses of all address families is evidently valuable.
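For the gateway-based flavor, the ‘at least one active IPv6 system in a given subnet’ question boils down to a membership test over the observed active addresses. A minimal sketch using the Python standard library:

```python
from ipaddress import ip_address, ip_network

def subnet_has_active_v6(subnet: str, active_addrs) -> bool:
    """True if at least one observed address is IPv6 and falls into
    'subnet' -- the condition that makes an IPv6 ACL on the gateway
    for this subnet relevant in practice."""
    net = ip_network(subnet)
    return any(
        ip_address(a).version == 6 and ip_address(a) in net
        for a in active_addrs
    )

# Hypothetical list of observed active addresses
active = ["192.0.2.10", "2001:db8:1::10", "2001:db8:2::20"]
print(subnet_has_active_v6("2001:db8:1::/64", active))  # True
print(subnet_has_active_v6("2001:db8:3::/64", active))  # False
```

Of course the hard part is not this check but obtaining a trustworthy list of active addresses in the first place, which is the whole point of this post.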
To properly perform patch management one usually has to know: which systems are out there? And how does one identify those systems? For example, in environments using MS Active Directory probably all domain members can be identified by some AD-inherent logic, and other systems might be identifiable by means of their FQDNs. Still it’s a safe assumption that at least some fraction of to-be-patched systems are identified by their IP address(es), so having a proper inventory can be of help here, too. Evidently in the course of patch management one might not only be interested in actively running systems, but also, in particular, in their OS and software components and their respective versions. Coming up with that type of information commonly falls into the realm of vulnerability management. The vast majority of vulnerability management tools & frameworks I’m aware of primarily use addresses as identifiers of systems. From that angle an accurate inventory of addresses certainly helps. The advent of IPv6 might bring some extra challenges here, as simply scanning IP subnets for active systems (in order to subsequently subject those to vuln scanning) doesn’t work any longer with IPv6, so the need for a proper source of truth becomes more pressing (I already discussed this in the IPv6 Security Best Practices post).
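The ‘scanning doesn’t work any longer’ point can be quantified with some simple arithmetic (the scan rate is an assumption, and an optimistic one at that):

```python
# Why brute-force sweeping an IPv6 /64 is not an option for vuln scanning
hosts_per_64 = 2 ** 64                # addresses in a single /64
probes_per_second = 1_000_000         # assumed (optimistic) scan rate

seconds = hosts_per_64 / probes_per_second
years = seconds / (3600 * 24 * 365)
print(f"~{years:,.0f} years to sweep one /64")  # ~584,942 years
```

Compare that to a legacy /24 with its 256 addresses, gone in a fraction of a second at the same rate; hence vuln scanning in IPv6 networks has to start from an address inventory rather than from exhaustive sweeps.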
Finally in the context of incident response generally situational awareness (what’s happening, which assets are affected, what’s the source of certain things going on etc.) is needed. I could imagine that the ability to map IP addresses to systems can be helpful here (for identification, evaluation, follow-up), so a proper IP address inventory might be of value, so to say.
tl;dr: having an understanding of active IP addresses within an organization affects (at least) the following security controls/processes:
(potentially) traffic filtering
(potentially) patch management
(definitely) vulnerability management
(definitely) detection of security violations
(definitely) incident response
So when deploying IPv6 in an environment, talking to the owners of these processes is necessary, in order to make sure that IPv6 does not lead to increased risk exposure.