Reflections on certificates, Part 1

I’ve written a couple of posts on (X.509v3) certificates in the past, starting with this one in 2001. In the two decades since then a number of developments have taken place (to name a few: OCSP, ACME, Let’s Encrypt certificates and the general role of automation). On the other hand the fundamental mechanisms of certificates have stayed the same. In this post I argue that understanding the inherent (but often hidden) complexity, the trust relationships and the trade-offs of certificate use in a given environment can lead to better decision making and to more efficient operations.

The basic scheme (for the purposes of this post) usually involves a set of parties:

  • (1) A server (in the sense of an entity receiving a connection request, incl. network devices)
  • (2) A client (an entity that initiates a connection)
  • (3) A user who uses the client, and we can safely assume this is a human, so motivations & desires come into play (which can influence trust decisions)
  • (4) An operator being in charge of (1), or of (2), or of both. Here again we assume humans, so they have objectives (in particular “make the users happy by providing a service which is available, and which they can use with their present skill set”)
  • (5) CAs who issue certificates to be used on (1), or on (2), or on both. Evidently this involves (potentially complicated) relationships with the operators
  • (6) Developers
  • (7) Infosec people


Let’s start with some high-level concepts (yep, regular readers remember my love for those ;-).

Complexity

Working with certificates frequently induces a high level of complexity (definition of the term here), for a number of reasons:

  • multiple standards bodies have contributed to specifying what we have today, one of them (ITU) being notorious for complex outcomes. The main IETF document, RFC 5280, alone has 151 pages.
  • using certificates often involves other, not necessarily simple, things like ASN.1 or DER.
  • most importantly there are all types of extensions which can be employed for nearly unlimited creative uses ;-). See this part of the table of contents of RFC 5280

Unfortunately one of the objectives of the ‘traditional’ certificate use case (that is: securely buying stuff on the Internet) was to hide this complexity from the users. At the same time, certificates being capabilities (see below) – which get deployed once and, seemingly, don’t have to be ‘operationally taken care of’, at least for a while – causes them & their complexity to be underestimated (and to be ‘invisible until something breaks’) in fast-moving environments.
Realizing that certificates are complex beasts, especially when employed for certain use cases (=> below), might be the first step towards getting better at handling them ;-).

Trust

Trust (some definition of the term here) is at the very core of certificates’ value proposition. They’re exactly meant to contribute to trust between communication partners (by assuring the identity of one or multiple of them). In the classic use case this works as follows:

  • I can trust that this web site I’m visiting belongs to the organization holding the domain name I typed into my browser, because I see that little lock next to the URL.
  • Behind the scenes this trust is established as another party (the CA) assured the binding of some cryptographic material to some identity information, based on some more or less rigorous checks. I might not know this other party but my browser does, and the CA’s mere presence in my browser’s (or OS’s) certificate store expresses this trust.

Alas, matters involving trust can be way more complex in today’s world. Imagine you operate an application which runs on several systems and at some point connects to a system operated by a 3rd party (called $ORG in the following), e.g. for querying a database. As smart & security-conscious people are involved, certificates are used everywhere incl. that one external system. When asked which dimension of trust the certificate over there (in the following: $CERT) provides, one might be tempted to respond “well, that one enables us to trust we’re connecting to the right system (and infosec told us to ubiquitously use certificates anyway)”.
However, in reality

  • you now inherently trust $ORG to have done a reasonable job when getting an appropriate certificate for the purpose.
  • you trust the respective CA to have done a proper job vetting $ORG (and to have issued an appropriate certificate for the purpose).
  • you now inherently trust that $ORG knows or monitors the expiry date of $CERT (and, evidently/subsequently, that related alerting capabilities are in place).
  • you inherently trust that some sufficiently qualified personnel will be available, at the latest, on the day $CERT expires.
  • overall you inherently trust $ORG’s operational maturity to properly handle certificates ;-).

Looking closer you may also find out that $CERT is a wildcard cert covering the full domain of $ORG, so the initial assumption of trust (‘make sure we connect to the right system’) might be… debatable.
In short, understanding the (hidden) trust relationships in an environment can generally be beneficial for prioritizing operational resources. Which brings me directly to the next point.

Trade-offs

The world of certificates is full of trade-offs (as, of course, are all settings with many different parties and their – differing – objectives). Here they are usually clustered around two main themes:

  • performing certificate validation at all ;-). This may sound strange at first glance – I mean, using certificates only makes sense once you validate them, right? – but many of us know situations of the “oops, that expired cert over there breaks our service delivery right now. what about temporarily [ed.: by some definition of temporary 😂] disabling cert validation for the TLS connections between those systems to quickly fix the issue?” type. You may also look at the Wi-Fi authentication use case below.
  • how to determine if a certificate is (still) valid. This can be time-based, or based on checks of the revocation status, or both. Such checks (and the concept of certificate lifetimes/validity periods as a whole) are related to a specific property of certificates (them being capabilities, see next section), and these checks can induce significant operational complexity (e.g. see the post I referenced at the beginning of this one). I will cover certificate revocation & checking in a later part of this series.
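The time-based part of such a check is conceptually simple. Here’s a minimal Python sketch (an illustration only – the function name and the dates are my own, and real TLS stacks additionally build & verify the whole chain):

```python
from datetime import datetime, timezone

def within_validity_period(not_before, not_after, now=None):
    """Time-based part of the check only -- says nothing about revocation."""
    now = now or datetime.now(timezone.utc)
    return not_before <= now <= not_after

# A hypothetical cert valid Jan 1 - Apr 1, checked one day before expiry
nb = datetime(2022, 1, 1, tzinfo=timezone.utc)
na = datetime(2022, 4, 1, tzinfo=timezone.utc)
print(within_validity_period(nb, na, now=datetime(2022, 3, 31, tzinfo=timezone.utc)))  # True
```

The revocation-based part is where the real operational complexity hides, as discussed throughout this post.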

Finding the right balance between the objectives of different parties, read: going with the right trade-offs, can greatly help to efficiently steer operational resources (in all directions; e.g. increasing cert lifetimes between systems which are all part of the same – your – operational domain can be a good idea when cert expiry is a frequent cause of issues. Better yet, increase the level of automation for renewal then ;-).
You may hence spend some intellectual cycles on understanding/questioning the trade-offs in your environment.
As stated above, quite a few of the trade-offs are commonly related to the most important, yet at times least understood, point of my little theory discourse here, that is:

Certificates are capabilities

Imagine there’s a subject (a user/process) that wants to access an object, e.g. a resource (network, file etc.). The enforcement mechanism controlling the subject’s access to the object can then look

  • at an attribute of the object itself (we could call it sth like ‘access control list’). This attribute/list is then checked every time the subject shows up and asks for access, and it’s usually maintained by the object’s owner.
  • for an entitlement (not to be confused with, but similar to these) which at an earlier point of time was granted to the subject and which generally allows some access. Such a thing is sometimes called a capability, and certificates can be perfect examples of capabilities (strictly speaking & technically the private key corresponding to a cert’s pub key constitutes the actual capability, but let’s keep it simple).

I’m using the above terms a bit loosely here, and there’s a lot of theoretical discussion in OS security circles on these. In any case capabilities have two main challenges:

  • Delegation: how can you make sure that one subject does not transfer the capability to another subject after it has been granted?
  • Revocation: if circumstances change (e.g. a system/key material is compromised or a user leaves an organization), how can you make sure that the once-granted entitlement can no longer be used?

Both are well-known in certificate circles, and various architectural or technical approaches exist for dealing with them, incl.:

  • Come up with a flag (‘non-exportable’) for private keys and hope that the OS environment properly enforces it.
  • Store the private key(s) in some extra-secure place. That’s the main reason why smart cards once gained a lot of popularity in some industry sectors (namely heavily regulated ones like banks), and why hardware security modules (HSMs) exist.
  • Implement an additional layer where, at the very moment of a certificate’s use, some extra check of the ‘ok, it is still within its validity period, but has it been revoked?’ type happens. Voilà the birth of certificate revocation checking, and welcome to a whole new space of complexity, trust relationships, and trade-offs (=> detailed discussion in next post).
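Continuing the toy model from above, that extra layer boils down to consulting a second source at the very moment of use (in reality a CRL fetch or an OCSP query, with all the availability implications that entails; the serial numbers below are made up):

```python
from datetime import datetime, timezone

# In reality this set would be populated from a CRL download or an OCSP responder
revoked_serials = {0x1A2B3C}

def cert_acceptable(serial, not_before, not_after, now=None):
    now = now or datetime.now(timezone.utc)
    within_validity = not_before <= now <= not_after
    return within_validity and serial not in revoked_serials

nb = datetime(2022, 1, 1, tzinfo=timezone.utc)
na = datetime(2023, 1, 1, tzinfo=timezone.utc)
mid = datetime(2022, 6, 1, tzinfo=timezone.utc)
print(cert_acceptable(0x4D5E6F, nb, na, now=mid))  # True: within lifetime & not revoked
print(cert_acceptable(0x1A2B3C, nb, na, now=mid))  # False: within lifetime, but revoked
```

The operational catch is, of course, keeping that second source fresh and reachable from every validator.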

It should be noted that

  • revocation checks significantly change the trust relationships (“ok, I see the cert that you present to me. It was meant to create trust between you and me, but I’m not convinced. let me reach out to somebody else to verify.”)
  • they kind-of move the needle towards an object-based security model which many people intuitively prefer as it gives them the notion of being in control (also this is better aligned with many compliance frameworks 😉 ).

Let’s now discuss some certificate use cases from the above perspectives. In the following I will look at five of them (the first two in this post, the others in the next):

  • E-commerce web server offering HTTPS
  • Authentication in enterprise Wi-Fi networks
  • Client/user authentication (e.g. for VPN access)
  • Client/user authorization (as in “enrich a certificate with additional information which is then parsed in order to take security decisions like controlling access to a specific resource”)
  • mTLS 

Use case: e-commerce web server with HTTPS

This is probably the most classic use case, and it’s the one that paved the way for widespread use of certificates. When e-commerce became a thing, there were two challenges to be solved from a user’s (buyer’s) perspective:

  • How do I know I’m connected to the right server (assuming that this one only uses my credit card data for the goods I want to purchase)? 
  • How can I be sure that my payment data is not compromised when using the Internet for its transfer?

Both could be addressed by deploying a cert on the web server(s) and enabling HTTPS.
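The ‘right server’ part of the first question hinges on the browser matching the typed hostname against the names in the certificate. A heavily simplified sketch of that matching (RFC 6125 specifies many more rules; real code should rely on the TLS library’s built-in verification, and the function name here is my own):

```python
def hostname_matches(hostname, san_dns_names):
    """Toy SAN matching: exact names plus single-label wildcards."""
    hostname = hostname.lower()
    for pattern in (p.lower() for p in san_dns_names):
        if pattern == hostname:
            return True
        if pattern.startswith("*."):
            suffix = pattern[1:]                       # e.g. ".example.com"
            prefix = hostname[: -len(suffix)] if hostname.endswith(suffix) else None
            # the wildcard may only cover exactly one left-most label
            if prefix and "." not in prefix:
                return True
    return False

print(hostname_matches("shop.example.com", ["*.example.com"]))   # True
print(hostname_matches("a.b.example.com", ["*.example.com"]))    # False
```

This also illustrates the wildcard point from the trust section above: ‘*.example.com’ happily matches any direct subdomain, whoever operates it.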

To note:

  • From a trust perspective this is a kind-of easy one. The user has a certain desire (e.g. to buy something, or to watch specific content) which generally highly influences trust decisions (otherwise Ponzi schemes wouldn’t work). The CAs were trusted as there were only a few of them, and their trustworthiness was rarely questioned or verified by the people requesting certificates (in the early days the latter sometimes even were part of a company’s marketing team, who usually have a more optimistic approach to life – than those ever-skeptical infosec folks – anyway).
  • From a company’s security objective perspective it was an easy one, too: none of the to-be-protected assets (the user’s credit card data) were really of relevance (wrt protection need) for the owners of the web servers. This only changed when PCI came up.
  • From an operations perspective it wasn’t particularly difficult either: certs had comparably long lifetimes (usually two years), there were only a few of them, and while renewal was known to be somewhat inconvenient it was at least less cumbersome than the initial request.

Use case: authentication protocols used in enterprise Wi-Fi networks

Pretty much all EAP (Extensible Authentication Protocol) methods (some overview here) used in enterprise Wi-Fi networks employ certificates, some of them only on the side of infrastructure elements (e.g. PEAP), others (EAP-TLS) also for clients. Especially the latter brings high operational complexity (see for example this old setup guide which my fine buddy Chris Werny authored many yrs ago). With that come both heavily differing objectives of the involved parties and quite interesting failure scenarios.
Let’s analyze some of the involved parties.

  • operators of the RADIUS servers. They might not be super-familiar with certificates, hence installing those may not be a daily task for them, so they’d be happy with generally longer cert lifetimes.
  • ‘enterprise desktop team’ – they will strive for auto-enrollment & renewal, and again they will want to keep things simple (“why do they bother us with this certificate stuff, our life is already difficult”). This group/task could be outsourced (=> $CONTRACTOR1).
  • the users just want Wi-Fi to work, they (legitimately) don’t care about the underlying technologies, and they will happily click away any certificate-related warnings “as long as the damn corporate Wi-Fi works”.
  • the infosec people want to prevent the users from doing the latter, and they’d be happy if the lifetimes of involved certificates were rather shorter than longer. Bonus if they come up with the idea of implementing some additional scheme where “Wi-Fi (security) profiles” are mapped to certain parts of the certificate (did I already mention that certificates have various types of fields which can be overloaded, read: populated, with all types of information?)
  •  the operators of the whole Wi-Fi infrastructure want to keep the users happy. Some chance here that operations of (some parts of) the network infrastructure might be outsourced/provided by contractors ($CONTRACTOR2).
  • The CA issuing the involved certificates might be in-house, or not. A common scenario is that this is another contracted service ($CONTRACTOR3). Bonus if the wireless infrastructure uses intermediary certificates from another CA ($CONTRACTOR4).

Let’s imagine at some point one of the following two things happens:

  • Something breaks
  • One of the certificates, in particular on the infrastructure (RADIUS server, AP, wireless controller) level, expires. High chance that the renewal requires human labor & skills and, evidently, touches availability-critical network infrastructure. Maybe the certificates in question are not monitored. Overall quite some probability that cert expiry leads to “something breaks”.

How well, do you think, will $CONTRACTOR{1–4} interact in such a case? – Exactly 😉

It should be noted that most of the above parties do not have a deep familiarity with certificates in their daily life. The fact that those are mostly invisible until sth breaks doesn’t help either (=> incentives?).
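A little expiry monitoring goes a long way here. Even a minimal sketch like the following beats finding out on expiry day (the threshold and the wiring to your alerting are placeholders; the date format is the one Python’s ssl module returns in getpeercert()’s notAfter field):

```python
import ssl
from datetime import datetime, timezone

WARN_DAYS = 30   # arbitrary threshold -- pick what fits your renewal process

def days_until_expiry(not_after, now=None):
    """not_after as in getpeercert()['notAfter'], e.g. 'Jun  9 12:00:00 2027 GMT'."""
    expiry = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expiry - now).days

def needs_alert(not_after, now=None):
    return days_until_expiry(not_after, now) < WARN_DAYS

ref = datetime(2027, 6, 1, tzinfo=timezone.utc)
print(needs_alert("Jun  9 12:00:00 2027 GMT", now=ref))   # True: ~8 days left
print(needs_alert("Dec 31 12:00:00 2027 GMT", now=ref))   # False
```

Feeding it from live endpoints is a few more lines with ssl.create_default_context() and getpeercert(); the point is that somebody explicitly owns this check.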
I can also tell you from practical experience (from my days as a network consultant in a US Fortune 10 company 15 years ago) that all of the above parties (except the infosec folks) will happily & immediately sacrifice all cert-related security properties once, say, 50K users might not be able to use the corp Wi-Fi anymore (due to expiring intermediate certs from a vendor with whom $CONTRACTOR2 had ended their contractual relationship). Then the following suggestions might show up on the table:

  • Can’t we just disable certificate validation as a whole, on certain $INFRASTRUCTURE_ELEMENTS?
  • What about publishing guidance in which we tell users to ignore certificate warnings?
  • Any chance of configuring some grace period, say 4–8 weeks, during which we still accept the expired certs? $VENDOR already promised us a custom image which somehow avoids the issue (don’t ask…).

If only some group of experts had reflected on the certificate deployment in that environment, its operational complexities, its inherent trust relationships, and the trade-offs between the different parties & their incentives earlier ;-).
If this post makes you think about these aspects in your own world, I’m a happy man. Thank you for your time spent reading, and see you in a few weeks for the next part.

Hexacon 2022

I spent the last two days in Paris to attend Hexacon 2022. As usual when I write here about conferences I’ll summarize some talks & observations. I don’t go to many offensive security-only events (it’s well-known that I have thoughts on a certain scene and its [non-] ethics, but on the other hand a periodic reality check of such sentiments shouldn’t hurt either). Hexacon had caught my attention due to the superb speaker line-up, and I could reasonably expect to meet some old friends there.

Having been a conference organizer for a while in my life I can say that I was seriously impressed by what Renaud & team have put together (fair chance that the team did the vast majority of the real work ;-). Very well organized event, excellent talks (not a single rly weak one) and good community spirit. Great job, folks!
That said, let’s have a look at the talks. For time reasons I will only cover some of them, and slides/videos for quite a few have not been published yet (it’s announced though), hence let’s hope that my memory serves me correctly…

Luca Todesco: Life and Death as an iOS attacker

Luca started with an overview of the fundamental pieces of the iOS security model & of recent advances both in the space of attack vectors and when it comes to protections:


He repeatedly emphasized the value of Lockdown Mode, assuming it might have taken a couple of afternoons to implement 😂. He summarized that ‘Apple is finally winning’ which, according to Shahar Tal, was met with ‘crowd silence‘ (I can confirm this).
Luca then provided some conclusions on the future business of iOS-oriented offensive security research which at first glance can be summarized as follows:

– given the complexity of iOS and its security measures it’s very unlikely any individual can succeed alone. Some players will go out of business.
– to survive (in the business) access to significant amounts of private knowledge is needed as public information is years behind the mitigations.
– exploit-based public jailbreaks (JBs) are most likely over.


So far, so good. However, an alternative reading of those statements could be this one:
– groups who have what he calls ‘private knowledge’ will still make the deals $$$.
– JBs won’t be released to the public (I’m not following that scene closely, but I think this has been already the case for a while now). They will instead be sold to the highest bidders.
– ofc the speaker and their company belong to the privileged as per the first statement.

To – maybe – support this reading Luca pulled a final trick by ending with a short video (which then again, from my limited perception, has been a common turn of such talks in the last years) which – maybe – showed a working JB against iOS 16.1. There was a comment on Twitter that a photo of that demo ‘misses context for those not in attendance’. I can confirm that it also missed context for some in attendance (incl. myself ;-), but this might (srsly) be attributed to my non-familiarity with the space.
Overall a solid technical talk (I learned quite a bit), together – maybe – with a pitch for services which – maybe – can only be offered by a privileged few.


Anaïs Gantet, Nicolas Devillers, Jean-Romain Garnier: The Unavoidable Pain Of Backups. Security Deep-Dive Into The Internals Of NetBackup

The team from the Airbus Security Lab has been doing very interesting research for many years, together with the release of results and tools. In this talk they discussed their findings from performing an in-depth assessment of NetBackup. NetBackup here being a perfect example of
– a piece of 3rd party software found in many large enterprises.
– which runs with high privileges and/or has access to highly sensitive data.
– which is complex in itself, and may use old & complex standards (e.g. in this case CORBA).

I generally think it’s super-important to publicly discuss the results of such assessments (presumably well-funded actors look at enterprise tools, too, albeit without publishing the results…). Similar stuff from another research group can be found here or here.

The speakers started by laying out their methodology & research questions:

They then provided a detailed overview of the inner architecture of NetBackup, its daemons & processes, the ports those run on, and how those interact.

Finally, evidently, their findings were presented, incl. a very nice demo (pay close attention to the names of the phases of the demo, in the top left part of bottom pic):

Overall this was one of my favorite Hexacon talks: relevant research, extremely well-structured presentation, and a cool demo.
Slides can be found here.

Thomas Chauchefoin: You’ve got mail! And I’m root on your Zimbra server

Another talk dissecting (and pwning) a piece of enterprise software, in this case an e-mail & collaboration suite called Zimbra. This one being a perfect example of a commercial product which
– uses many OSS components, loosely coupled together + masked behind web frontends.
– undertakes more or less successful attempts to filter/sanitize input, which is then processed in the chain of those various, loosely coupled, components.
– does not rly use sandboxing of components or stripping-down of privileges.

What could go wrong with such a piece? – Right…
(btw, many parts of this talk reminded me of the days when Felix owned FireEye boxes)
Thomas discussed the inner workings of Zimbra and subsequently several vulnerabilities he found (incl. CVE-2022-27924), accompanied by some proper demos.

Generally there seem to be a lot of Zimbra vulns in 2022 😱, looking at the list of their security advisories. This is his summary:

Overall an interesting presentation, and apparently quite timely as active exploitation of Zimbra seems to happen these days.

Ophir Harpaz & Stiv Kupchik: Exploring Ancient Ruins to Find Modern Bugs: Discovering a 0-Day in MS-RPC service

MS-RPC is a juicy target – it runs on every Windows machine, the endpoint mapper service listens on a fixed port (TCP 135), and vulnerabilities might be worm-able (Blaster used it, back in 2003). After Ophir laid out the general architecture and core terminology, Stiv explained how the interaction of authentication and caching of access information can lead to bypass attacks.


He went on to detail the steps needed to find CVE-2022-38034, which was patched on this month’s Patch Tuesday (= three days before their Hexacon talk ;-). Overall another excellent technical presentation & very relevant research.
Slides can be found here.

David Berard & Vincent Dehors: I feel a draft. Opening the doors and windows: 0-click RCE on the Tesla Model3

Certainly one of the most anticipated talks of Hexacon, and they did not disappoint. To own the car they focused on the infotainment system (synecdoche used deliberately here as I seem to have missed the part in which they discussed the strong isolation between infotainment and CAN bus which Tesla uses, or not):

It runs Linux with some COTS components for embedded systems like ConnMan which turned out to be the path for (attacker 😉 ) interaction:

Some remarks they made gave me the feeling that Tesla was not always super-cooperative during their research (and, as far as I understood, they did not receive the pay-out which would have been appropriate for their findings, but I might recall that part incorrectly…). David & Vincent concluded their – excellent – talk with an important reminder of the value of persistence in the life of a security researcher:

Slides can be found here.


Felix Wilhelm: Hacking the Cloud with SAML

This was one of the Hexacon talks I was most looking forward to as I closely worked with Felix for many years, and I know that he has vast skills both exploiting stuff and explaining how he did it ;-).
While in enterprise settings the SAML Identity Providers (IdPs) can be considered trusted by the Service Providers (SPs), this picture completely changes in cloud environments where the cloud provider has to interact with many potentially untrustworthy IdPs. When analyzing the attack surface in the space of XML signatures (which are used by SAML) Felix identified several vulnerabilities (CVE-2022-34716 External Entity Injection during XML signature verification, CVE-2022-29824 heap-buffer-overflow in xmlBufAdd, CVE-2022-34169 Integer Truncation in XSLTC). To quote from the Google P0 blog, the latter “would allow for arbitrary code execution in software using Xalan-J for processing untrusted XSLT stylesheets. As Xalan-J is used for performing XSLT transformations during XML signature verification in OpenJDK, this bug potentially affects a large number of Java based SAML implementations”.
These are his conclusions:


Great talk in which I learned a lot about cloud trust models & modern attack surfaces based on old complex standards. (it seems IPv6 is not an exception here 😉 )
Slides can be found here.

Slides of talks which I did not discuss:

Hara-Kirin: Dissecting the Privileged Components of Huawei Mobile Devices – slides here

A journey of fuzzing Nvidia graphic driver leading to LPE exploitation – slides here

Toner Deaf – Printing your next persistence – slides here

Attacking Safari in 2022 – slides here

RIPE 84

What a lovely week! An in-person RIPE meeting – Jan Žorž said to me over dinner “it immediately felt like home”, and I totally agree.
Following some tradition I will summarize a few interesting, IPv6-related talks & other observations from last week in this post.

Constanze Bürger: Challenges and Chances of IPv6 Deployment in Public Authorities in Germany

Constanze serves as a state secretary (‘Staatssekretärin’) in the German Federal Ministry of the Interior and Community. She has been driving IPv6 in the public administration space for a long time, and for that reason she’s been present at pretty much all RIPE meetings over the last years. In her talk she spoke about the challenges of getting IPv6 traction in her world, due to the distributed nature of responsibilities and to the high degree of siloization (sounds familiar to some of you large-enterprise folks? ;-). She included a very nice – positive – case study though: the German online tax system called ELSTER, which has had IPv6 enabled since 2020 (which seems not to be the case for similar systems in other countries).
In October 2021 52% of the connections to it happened over IPv6 (Antonios Atlasis suggested those filing over v6 should get a tax discount, which given the current prices of IPv4 addresses could be worth a discussion ;-), and I could imagine that number is even higher in the interim.

Slides: here
Video: here

Carsten Strotmann: Frag-DNS. IP Fragmentation and Measures Against DNS-Cache-Poisoning

IP fragmentation attacks against DNS have been known for a while (research overview on the APNIC blog here, paper by Shulman et al. on DNS over TCP from 2021 here), but their Internet-scale impact was unclear, and members of the DNS operator community considered them theoretical (see discussion at RIPE78). This is why the German BSI decided to commission a study evaluating both the real-life impact and discussing mitigations. The results of this study were presented in this talk:

Slides: here
Video: here

Wilhelm Boeddinghaus: IPv6 and the Windows 10 Firewall

In this talk Wilhelm spoke about the intricacies of the default rule set of the integrated firewall of Windows 10 when it comes to IPv6, namely in the space of ICMPv6. While I don’t share his perspective that these rules are overly risky (and I think for such types of security controls, very understandably, usability often wins over strictness, which in turn might even increase overall risk reduction as users do not disable the whole thing then), it was an interesting technical exercise nevertheless.

Slides: here
Video: here

Paolo Volpato: IPv6 Deployment Status. Update and Remaining Challenges

This was similar to Paolo’s recent talk in the IETF v6ops working group, hence I refer to my comments in this blog post. To note that in the subsequent Q&A some challenging questions were asked, which were not directly related to the talk.

Slides: here
Video: here

Justin Iurman: Just Another Measurement of Extension Header Survivability (JAMES)

This was also presented at IETF 113 which is why I, again, point to this post.
Tl;dr: IPv6 extension headers can be considered unusable for any Internet-level service.

Slides: here
Video: here

Matthias Scheer (AVM): IPv6 Addressing Inside a VPN Tunnel Between Endpoints With Rotating Prefixes

The talk itself might not win the title of the most entertaining or most exciting technical presentation of the week ;-), but given the strong presence of AVM in the German market many practitioners incl. myself heavily welcomed the fact that the vendor sought interaction with IPv6 folks at a RIPE meeting. I mean this is not least what those meetings are for, and it’s a great move by AVM to work on their IPv6 capabilities based on feedback from the IPv6 community (at least the part represented in RIPE circles).

Slides: here
Video: here

The meeting network & IPv6

Several quick things to note as for the meeting network and IPv6:

  • During the whole week my iPhone was connected to the v6-only/NAT64 network, and everything worked smoothly.
  • In the terminal room there was a networked printer, and connecting to it (thanks, mDNS) over IPv6 and, more importantly, printing (over IPv6, ofc) worked like a charm.
  • Here’s a router advertisement from the main conference network. I know that as an IPv6 person one should generally be very careful with mentioning the principle of least astonishment (POLA) 😉, but I’m not fully sure I can follow the client provisioning approach taken here.

Finally let me mention that one could take the RIPE NCC IPv6 certifications for free at the venue (which I did for the IPv6 Security Expert, and I luckily passed 😅). Offering these on-site at the meetings is an excellent idea imho (those who ever tried to perform them on-line might have an idea why I state this).

Overall it was a great week with lots of technical learnings and, more importantly, lots of good hallway-track encounters. Hope to see some of you folks in Belgrade in October!

IETF 113

Last week I attended the IETF 113 meeting in Vienna. I primarily went there to reconnect in person with some old IPv6 fellows, but also to see what’s going on in the IPv6 standardization space which I hadn’t been following closely in recent times.
In this post I’ll briefly summarize some contributions presented in the main v6-related working groups (WGs), namely v6ops and 6man.

v6ops

Video recording of the full session here.
Individual comments here.

IPv6 Deployment Status
Current draft here.
Slides from wg session here.
This is the abstract of the draft:

I for one am not sure if this draft/effort is really needed in 2022. There are many reasons why the global IPv6 deployment is not happening at the speed/scale that IPv6 proponents have been hoping for, and those reasons might be very diverse in nature on the one hand, and might not need another discussion/documentation on the other hand.

NAT64/DNS64 detection via SRV Records
Current draft here.
Slides from wg session here.

NAT64 is currently gaining ground & is actively discussed in many environments, but a number of operational aspects, like the placement of the NAT64 function within the network or which prefixes to use, have to be considered. This is why I think this is an important draft. Martin presented existing methods (for one specific aspect ;-), why those might be insufficient, the goals of their suggested approach, and the layout of the planned SRV records. Also, proof-of-concept code is now available.

Scalability of IPv6 Transition Technologies for IPv4aaS
Current draft here.
Slides from wg session here.

Again I think this is a relevant effort as, evidently, scalability considerations play a huge role once a transition technology gets deployed, but there’s not much existing work available (neither in the methodology space nor when it comes to real-life metrics & measurements). Their work/this draft might hence provide

  • some indication re: what’s realistic.
  • types of measurements to request from vendors.

Neighbor Discovery Protocol Deployment Guidelines
Current draft here.
Slides from wg session here.

Here I’m not certain if the IPv6 world needs this type of guidance/documentation, as the respective issues have already been extensively discussed over the last ten years, and several architectural or implementation-level approaches for dealing with ND (security) shortcomings have been developed (e.g. see the ‘client isolation’ section in this post).

Requirements to Multi-domain IPv6-only Network
Current draft here.
Slides from wg session here.

This draft discusses some scenarios in multi-operator settings using v6-only, which I hadn’t thought about earlier. Interesting work to be followed.

Just Another Measurement of Extension header Survivability (JAMES)
Current draft here.
Slides from the wg session here.

Éric Vyncke supervises these measurements performed by Raphaël Léas and Justin Iurman from the University of Liège. This was also presented at the IEPG meeting covered by Geoff Huston in this blogpost. Important effort in general, and I always welcome IPv6 research work performed together with academia. Some may say the results are not too surprising 😉 – here’s a tweet with some data, and Geoff commented as follows:

6man

Video recording of the full session here.
Individual comments here.

IPv6 Hop-by-Hop Options Processing Procedures
Current draft here.
Slides from wg session here.

Taking the results from the last presentation in v6ops into account (see above), there might be a bit of irony here, but I found especially the discussion after the presentation quite enlightening.

Source Address Selection for foreign ULAs
Slides from wg session here.

In this one Ted Lemon spoke about an interesting scenario in a home network with multiple routers and multiple ULA prefixes, where certain destination hosts are not reachable from specific (source) hosts. This is due to a combination of factors: the routers themselves ignore RAs and hence don’t learn prefixes originated in other routers’ RAs, plus the way source address selection works as of RFC 6724. The talk triggered a long & interesting discussion. Some people stated that a misconfiguration must be present in the scenario (I don’t think there is, and I know a bit about the background of the talk/scenario), others stated that the C[P]E router ‘violated RFCs’ (namely RFC 7084 Basic Requirements for IPv6 Customer Edge Routers), which I think is a ridiculous stance. Still, overall a very good discussion which was helpful for identifying approaches for dealing with such situations.
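For those who want a feeling for the source selection part: rule 8 of RFC 6724 prefers the candidate source address with the longest common prefix with the destination. A small illustrative sketch (the addresses are made up, and this deliberately ignores the other RFC 6724 rules):

```python
import ipaddress

def common_prefix_len(a: str, b: str) -> int:
    """Length of the common prefix of two IPv6 addresses, in bits."""
    x = int(ipaddress.IPv6Address(a)) ^ int(ipaddress.IPv6Address(b))
    return 128 - x.bit_length()

# Two ULA source candidates (from different routers' prefixes) and one
# ULA destination: rule 8 prefers the source with the longer match.
dst = "fd11:2222::10"
for src in ("fd11:2222::1", "fdaa:bbbb::1"):
    print(src, "->", common_prefix_len(src, dst), "matching bits")
```

If the host only ever learned one of the two prefixes (because its router ignored the other router’s RAs), the “better” candidate may simply not exist, which is part of what makes the scenario interesting.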


I hope to be able to meet some of you, dear readers, at the upcoming RIPE meeting in Berlin. I even consider reviving the tradition of an ‘IPv6 Practitioners Dinner’ – let me know if you want to join.

RFC 9099 / Intro & Overview

Recently RFC 9099 Operational Security Considerations for IPv6 Networks was published. It was authored by Éric Vyncke, Kiran Kumar ‘KK’ Chittimaneni, Merike Kaeo and myself, and we plan to write a little series on its objectives & main recommendations on the APNIC Blog. To prepare for that let me provide a short overview of it in this post.

RFC 9099 was a long time in the making (nearly nine years! between the first Internet-Draft in the OPSEC working group and the final publication). As you’ll see in a second it covers many IPv6 areas which by themselves are at the centre of nearly religious debates (like filtering of extension headers, or ULAs + other addressing topics). Hence quite a few lengthy e-mail threads were created on the WG’s mailing list, which made reaching consensus not necessarily easier. Also at some point IETF procedures – this sounds better than ‘politics’, doesn’t it? 😉 – kicked in, which led to additional delays (for those interested in this dimension of work within the IETF see Geoff Huston’s lucid Opinion: The making of an RFC in today’s IETF).

The document is focused on what we call ‘managed environments’ like service provider/operator networks or enterprise environments, and it is organized in several sections:

  • Addressing: evidently the addressing architecture chosen for a specific IPv6 deployment can have significant impact on a network’s security posture (when it comes to routing, traffic filtering or logging), so the various types of IPv6 addresses and their security implications are presented in detail in this section.
  • Extension headers: as those constitute one of the main technical differences between IPv4 and IPv6, and at the same time they have interesting (one could even write: ‘challenging’) security properties, they’re discussed in a dedicated section.
  • Link-layer security: examining the local communication mechanisms of IPv6 both from an offensive and from a defense point of view makes the main content of this section. Here all the stuff like NDP attacks, rogue router advertisements, and their related protection mechanisms are described. Again, this is an area where major differences between IPv4 and IPv6 exist.
  • Control plane security: a very important topic from an infrastructure security perspective, which is why it has its own section.
  • Routing security: same as for the previous section – overall very similar security best practices as in IPv4 networks have to be applied for IPv6 in this space as well, e.g. the excellent guidance provided in RFC 7454 BGP Operations and Security.
  • Logging/monitoring: some elements of the overall IPv6 architecture (like the ephemeral nature of IPv6 addresses, the fact that usually several of them co-exist on a given interface, or their general format) have significant impact on the way how logging and security monitoring are done in many organizations. These are looked at in detail in this segment.
  • Transition/Coexistence Technologies: from my experience various organizations underestimate the efforts for properly securing dual-stack deployments (which btw is another argument for going v6-only where you can). Furthermore the use of tunnel technologies traditionally creates headaches for security practitioners, so they merit respective considerations (at least we thought so; this section was heavily contested during the development of the RFC as people thought that the related security challenges do not stem from IPv6 itself but mostly from operational deficiencies in IPv4 networks, namely those not aware of the concurrent presence of IPv6 in their world).
  • General device hardening: a security guidance document wouldn’t be complete without this, right? 😉
  • Enterprise-specific security considerations: deploying IPv6 in enterprise environments needs some additional reflections (see also RFC 7381 Enterprise IPv6 Deployment Guidelines) which is why we cover the security side of things in a dedicated chapter, which in turn is split into two subsections on external and on internal security.
  • Service provider security considerations: obviously operator networks need proper IPv6 security. While many of the needed security controls are already covered in earlier parts of the RFC some operator-specific aspects like lawful intercept are discussed here.

This post was meant to make you aware of RFC 9099 in case you didn’t know it before, and to provide a quick overview of its content. Additional posts with technical details on its individual areas will be published on the APNIC blog.

Additional references

IPv6 in Enterprise Wi-Fi Networks

At first I wish all readers a very happy new year and all the best for 2022! May the force be with you for your IPv6 efforts ;-).

In this post I’m going to discuss some characteristics of IPv6 in common organization-level (as opposed to home networks) Wi-Fi deployments. These characteristics have to be kept in mind both during design & implementation and in the course of troubleshooting. Many IPv6 practitioners learn(ed) about IPv6 fundamentals in Ethernet networks (quick hint on terminology: in this post the term ‘Ethernet’ always means ‘wired Ethernet’ as of IEEE 802.3 standards, and ‘Wi-Fi’ refers to technologies in the context of IEEE 802.11), and it’s probably a safe assumption that the designers of IPv6 (in the 90s) mostly had such networks in mind when core parts of IPv6 and its communication behavior on the local link were specified. While IPv6 neighbor discovery (NDP) as of RFC 4861 strictly speaking supports many different link types (section 3.2), the protocol overview in section 3.3 heavily relies on multicast transport (which doesn’t make sense on certain link types). This is aligned with a mental model of IPv6 behavior that quite a few of us (practitioners) have, and which is based, among others, on the following assumptions:

  • (1) on the local link there are usually (at least) some neighbors, and if so, then interaction with them is possible by certain mechanisms like NS/NA messages.
  • (2) multicast is a somewhat reliable mechanism (otherwise NDP would be unreliable), and it has at least similar performance properties as broadcast (otherwise NDP would be slower than ARP in IPv4 which certainly wouldn’t have been an acceptable objective ;-).
  • (3) sniffing ICMPv6 messages (which encompasses all NDP packets incl. router advertisements) will provide an initial understanding of the local environment.

As we will see in the following, Enterprise-level Wi-Fi networks are very often implemented in a way that renders quite a few of these assumptions debatable. Again, it should be noted that the resulting differences do not apply to IPv6 in home networks, which hence can be expected to work in a way that aligns with the above assumptions.
The mentioned differences mainly stem from two sources which is why it can be helpful to understand those first.

Assumptions & Security Properties

Wi-Fi networks are often treated slightly differently from a security perspective, based on certain assumptions incl. (but not limited to 😉 ) the following:

  • They are considered more hostile environments than ‘the trusted corporate LAN’ (based on thinking along the lines of “heard of those guys getting into our Wi-Fi network from the parking lot, via that compromised PSK?”). So more scrutiny is put onto basic network security measures (like just dropping certain packets, see 3rd point).
  • their traffic is expected to be primarily ‘eyeball traffic’ flowing from clients to servers either in the Internet or in the organization’s data centers, hence no need to communicate with other systems within the same Wi-Fi network/VLAN (as opposed to the Ethernet VLANs where a user/system might still need to reach that lab system under the desk, that printer over there, or the web interface of that building management system which is placed in the same VLAN, for ‘historical reasons’). In enterprise-grade Wi-Fi networks subsequently very often mechanisms to isolate clients from each other can be found (discussed in more detail below).
  • infrastructure systems like routers or DHCPv6 servers are expected to never reside in the Wi-Fi which is why packets supposed to originate from such systems (IPv6 router advertisements or DHCPv6 Advertise, Reply or Reconfigure messages) can be & actually get dropped by default. Please note that the presence of devices implementing Thread networking (like the HomePod mini) puts this approach into question, but that’s another discussion, and the respective filters might not even be (easily) configurable.

Handling of Multicast Traffic

For reasons laid out in section 3 of RFC 9119 Multicast Considerations over IEEE 802 Wireless Media the use of multicast transport in Wi-Fi networks brings various challenges which can heavily degrade the network performance and/or the battery life of connected devices (for the latter see also RFC 7772 Reducing Energy Consumption of Router Advertisements, and for the former it can be a good idea to read the excellent ‘Wireless Means Radio‘ article). That’s the reason for a number of related optimizations commonly found in enterprise Wi-Fi networks (or, for that matter, in conference networks, see Chris Werny’s talk on IPv6 in the Troopers Wi-Fi).

Now let’s look at some technologies in more detail, together with their impact on troubleshooting.

Client Isolation

This is a feature that blocks ‘direct’ connections between clients associated with the same WLC or the same AP. The actual technologies are vendor-specific (‘Peer-to-Peer Blocking’ in Cisco land or ‘Deny Inter-User Traffic’ in Aruba land) but the impact can essentially be broken down to: wireless clients can’t ‘see’/reach each other by means of unicast traffic nor by certain multicast traffic (which usually includes IPv6 NDP but *not* mDNS/LLMNR, so the latter commonly pass the boundary). It should further be noted that this feature is implemented on the WLC/AP level, so attackers might still be able to send packets directly to individual stations.
Impact on behavior, in particular in the context of troubleshooting:

  • the actual implementations of different vendors might vary, so one should be extra careful with conclusions. This applies to both handling of specific multicast traffic and to traffic to/from the Ethernet side of things (commonly at least some of this is passed — think: physical router sends RAs to ff02::1 — but other stuff might be dropped, e.g. neighbor solicitations to SNMA of individual Wi-Fi clients. Some devices allow configuring some properties, e.g. look for ‘Forward-Upstream’).
  • keep this feature in mind when troubleshooting connection issues with colleagues (‘can you ping my MacBook?’ might not work as expected ;-).
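Side note for packet-level troubleshooting: to check whether NSes towards a client’s SNMA get dropped, one first needs to know what that address actually is. A small helper following the derivation rule of RFC 4291 (append the last 24 bits of the unicast address to ff02::1:ff00:0/104):

```python
import ipaddress

def solicited_node_multicast(addr: str) -> str:
    """Derive the solicited-node multicast address (ff02::1:ffXX:XXXX)
    from the last 24 bits of a unicast IPv6 address (RFC 4291)."""
    low24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return str(ipaddress.IPv6Address(base | low24))

print(solicited_node_multicast("2001:db8::aabb:ccdd"))  # → ff02::1:ffbb:ccdd
```

With that address in hand one can filter captures on both the Ethernet and the Wi-Fi side and see where the NS actually disappears.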

Performance- or Security-oriented Optimizations of NDP Traffic

A number of mechanisms/configuration tweaks exist in the context of NDP (router advertisements and NS/NA packets). The most known ones are the following (the terminology is a bit Cisco-oriented, based on stuff we used to do at Troopers, but these features can be found, under one name or another, in most Enterprise-level Wi-Fi solutions):

  • RA Throttling: WLC/AP rate limits forwarding of RAs to Wi-Fi, based on certain thresholds & related timers. From an operator perspective one has to make sure that the Router Lifetime in the RAs exceeds the timers used here (see also section 4 of RFC 7772. Andrew Yourtchenko, one of its authors, used to use 9000s in one of his networks, see this post). Some years ago the default Router Lifetime on Junos was 180s which could lead to issues in networks using RA Throttling (wireless clients losing their default route as they did not receive a new RA before the default route generated from the last received RA timed out).
  • Unicast RAs: router advertisements sent in response to a RS are only sent to unicast address of requesting node (instead of sending them to the all-nodes multicast address/ff02::1. RFC 4861 states [in section 6.2.6] that a router ‘MAY’ do this, so it’s a valid, and commonly used, approach).
  • ‘NDP proxy’: when using this feature the WLC responds to NS packets from the Ethernet side by sending NAs ‘on behalf’ of Wi-Fi stations. At this point it can also convert (for unknown MAC addresses) the multicast NS into a unicast packet sent to the MAC address of the wireless client, and some implementations have a dedicated mode for DAD. See also RFC 8929 for a technical description of an ‘ND proxy’.
  • RA Guard (I tested this some years ago with surprisingly solid results).
  • IP Source Guard: this is a security feature that checks MAC address-to-IP(v6) address bindings. From an operations perspective one may keep in mind that there’s a threshold of IPv6 addresses which can be associated with one MAC address (iirc, on Cisco devices it’s eight [8]), and subsequently apparent violations might occur once clients regularly generate privacy addresses after coming back from sleep mode or similar. While I’ve never seen this irl I’m not sure which risk is supposed to be mitigated by the feature anyway (connectionless spoofing of a station’s IP address by another? for which attack vector? who would ever do this?).
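The RA Throttling point above boils down to simple arithmetic; here’s a toy check (the function and parameter names are mine, and real throttle behavior is vendor-specific, so treat this as an illustration of the failure mode only):

```python
def default_route_at_risk(router_lifetime_s: int, ra_interval_s: int) -> bool:
    """A client loses its default route if the Router Lifetime expires
    before the next (throttled) RA arrives, so the lifetime must exceed
    the effective RA interval (see RFC 7772, section 4)."""
    return router_lifetime_s <= ra_interval_s

# Junos' old 180s default vs. an aggressive 10-minute throttle window:
print(default_route_at_risk(180, 600))   # → True (clients lose the route)
print(default_route_at_risk(9000, 600))  # → False (plenty of headroom)
```

In practice one would add margin for lost frames, which is why values like 9000s are used rather than something just above the interval.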

Impact on behavior, in particular in the context of troubleshooting:

  • these features are vendor-specific. Their default settings, configuration approaches, and working modes might vary, even between devices from the same vendor (e.g. see this thread).
  • expected behavior re: link-local traffic might differ from observed behavior (certain NDP messages not seen on Wi-Fi due to controller interaction, RAs seemingly missing due to RA throttling etc.)

‘Mobility’ / Layer 2 Will Never Be the Same

In order to allow stations to physically move/to roam between areas covered by different APs, all modern controller-based Wi-Fi solutions implement techniques that span kind-of virtual Layer-2 domains across multiple APs or even across multiple controllers. Furthermore traffic can be tunneled between controllers over Ethernet over IP (EoIP) — this is often, but not only used for Wi-Fi guest networks — which then includes so-called anchor controllers providing a break-out point of the traffic towards certain parts of the corporate network or to the Internet. The main thing to keep in mind here is that a neighbor (in IPv6 terms) can actually be a system separated from a vantage point by many Layer-2 and Layer-3 devices/hops (this is the same in VXLAN environments, but from my experience in Wi-Fi space diagnosing errors might be more difficult due to lack of proper tooling or even proper CLI access/commands).

Impact on behavior, in particular in the context of troubleshooting:

  • from an operator perspective one should note that any type of tunneling can have an impact on the MTU, an area where IPv6 traditionally does not have a great reputation, so to say (e.g. see this Cisco bug).
  • while troubleshooting carefully consider the impact of (all of) the above technologies. For example imagine you don’t see an NA after a station has sent an NS for the default router’s IPv6 address. The actual traffic flow could easily be the following:
    for the NS packet: station sends multicast packet -> AP -> (over a tunnel protocol to) WLC -> Ethernet -> SVI
    for the corresponding NA: unicast packet all the way back, but potentially slightly different path (AP).

tl;dr: In Wi-Fi networks a number of techniques can usually be found which interact with IPv6 in various ways. As a network designer you should be familiar with those technologies. When doing troubleshooting in such networks it might be helpful to keep them in mind, too ;-).

Thank you for reading so far. I’m always happy to receive feedback, either here on the blog or via Twitter. Happy IPv6 networking to you all!

Disaggregated Security Enforcement / Self-service ACLs

In large environments security controls based on packet filtering, such as firewalls and ACLs on network devices, often face an unfortunate dilemma: there’s a gap between the parties understanding the communication needs of an application (say: the application owners) and the parties implementing the actual security enforcement (e.g. the firewall ops team). Those also have different motivations: “it has to work” (see RFC 1925 rule 1 😉) for the former group versus “it has to be secure = fulfill certain security objectives” for the latter. This gap can manifest in many socio-technical ways, which is the reason why ‘firewall rule management’ has been the subject of many discussions over recent years. In another post which I wrote a few years ago I stated that going for the upper-right quadrant in the following diagram usually requires high operational effort (which can actually produce the opposite outcome due to added process complexity), a high level of automation, accepting trade-offs, or a combination of these.

That’s why several organizations are considering another approach or have already started deploying it. Here I’ll call it ‘self-service ACLs’, and it can be summarized as follows:

  • move the enforcement function of packet filters towards the hosts (e.g. via ip[6]tables running locally, or a rule set running on a network device ‘just in front’ of a group of hosts, e.g. a VPC).
  • provide a nice web-based management interface to these rules
  • store all rules in a centralized database
  • allow application teams themselves to manage the rules. Besides the technical decentralization (or, to put it into more familiar lingo from the networking space: ‘disaggregation’), this one constitutes the main paradigm shift.

The underlying idea is simple: “let the owners of an asset/a service handle what they need, in a flexible manner”, without all those organizational or process-induced gaps, and it seems like a good idea to solve the issues I laid out above.
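To make the model a bit more tangible, here’s a minimal sketch of what such a centrally stored rule and its host-level rendering could look like (all names and the rule format are hypothetical; a real deployment would add validation, direction, owner metadata, and logging options):

```python
from dataclasses import dataclass

@dataclass
class Rule:
    """A centrally stored rule record, as an app team might create it
    via the self-service web interface (hypothetical schema)."""
    src: str      # source prefix allowed to connect
    dport: int    # destination port on the protected host
    proto: str    # 'tcp' or 'udp'

def render_ip6tables(rule: Rule) -> str:
    """Render a stored rule as the ip6tables command enforced locally
    on the host (the disaggregated enforcement function)."""
    return (f"ip6tables -A INPUT -s {rule.src} "
            f"-p {rule.proto} --dport {rule.dport} -j ACCEPT")

# e.g. the database team allowing a backend prefix to reach PostgreSQL:
print(render_ip6tables(Rule("2001:db8:10::/64", 5432, "tcp")))
```

The point of the central database is exactly that the same record can be rendered for different enforcement targets (local ip6tables, a cloud security group, etc.).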

Alas – as so often with simple ideas seemingly solving complex problems – there are some often overlooked pitfalls. These are going to be the subject of this post.
Quick disclaimer re: terminology: I’ll use the terms ‘rules’, ‘firewall rules’, and ‘ACLs’ interchangeably. Just think of rules being part of larger rule set, implementing packet filtering based on a traditional approach of sources, destinations and services with the former represented by IP addresses/ranges and the latter by protocols or ports in some notation.

Let’s start with looking at the lifecycle/dimensions of such a rule:

  • (1) a ‘management’ step/function, like the creation of the rule (by some party) or the modification of the rule (by some party)
  • (2) the actual enforcement function of the rule
  • (3) logging (of certain enforcement events, e.g. dropping a packet)
  • (4) analysis of a rule (e.g. as an intellectual exercise performed in certain life situations 😉 or as a contributing element to metrics)
  • (5) troubleshooting network communication flows which often involves the functions (3) and (4).
  • (6) review of a rule (e.g. against corporate policy or compliance requirements)

In the ‘traditional model’ most of these were performed by the same party (‘firewall ops team’), but here the self-service model induces significant changes. The expected benefit is centered around moving (1) (into the hands of the app owners holding the contextual intelligence) and (2) (topologically towards the assets actually needing the protection). Sadly this also brings changes to the other functions, with some interesting effects. Let’s look at two affected functions/lifecycle elements: logging and review.

Logging

In infosec circles there’s an old adage ‘each security layer should provide logging’. Let’s assume the log files are still written to a central place (this is what most organizations do, for a variety of reasons, and I for one think that this makes sense). This can create interesting situations:

  • The new owners of the rules will, somewhat legitimately, think that they own the logs, too. (“These are our rules, we manage them, and we should be able to see what’s happening”).
  • How do they then get access to the (centralized) log files?
  • More importantly: get access in a tenant-proper way? (you don’t want the database team to be able to see the log files of the authentication servers, do you?)

I’ve yet to see an organization which has solved this problem in a way that fulfills the requirements of the different parties. So one might have to accept some trade-off here (e.g. the loss of visibility into log files for one of the involved parties).

Rule Review

A similar conflict of interests arises in the context of rule review. Can one reasonably expect the party whose main interest is essentially ‘it has to work’ to perform a review of rules based on corporate policy, PCI requirements or the like? Again, this would be an inherent dilemma only solvable by a high degree of collaboration (while self-service ACLs are often supposed to reduce needed collaboration). On the other hand rule review might be a bit of a dysfunctional process in some organizations anyway, as this recent Twitter poll seems to imply 😉

General Paradigm Shift

Finally one should keep in mind that the introduction of self-service ACLs can mean a cultural shift for application teams, from opening tickets (which includes a partial transfer of responsibility) to managing rules (= security controls) in their own hands (which in turn also requires developing security skills & practice). Not all app teams might be happy with that; especially those running core applications with high availability requirements might be a bit risk averse in this context ;-).

tl;dr: While self-service ACLs can address long-existing process-level deficiencies in some organizations, they might well introduce new ones. Understanding the demarcations of the individual functions within a rule’s lifecycle, and the incentives of the different involved parties, will be crucial for a successful deployment of the approach.

IPv6 Reporting

I know that some of the readers of this blog are IPv6 cheerleaders in their respective organizations, and as such they might occasionally face questions along the lines of “what’s the state of IPv6 in our company?” or “are we progressing IPv6-wise?” (the latter in particular when dedicated resources are spent on the IPv6 transition on a regular basis, as opposed to those 50-person-days “let’s get ready for IPv6 (by writing some documents)” projects once every few years).
In this post I’ll discuss some approaches to make IPv6 progress within an environment visible.

For those interested in IPv6 progress in the global Internet or on a country level, an overview of the main sources for numbers can be found here. When undertaking a similar exercise for an individual organization let’s keep in mind that in general reporting in corporate life has three main aspects:

  • a question (to be answered, or: which message do we want to convey by means of the reporting effort?)
  • audience (who’s the recipient of the message?)
  • method (which approach/tooling/communication channel etc.)

A main differentiator between other types of measurements commonly found in large enterprises like “number of systems being on the latest OS patch level” and IPv6 reporting is that in our case usually two dimensions are of interest:

  • (1) to what extent are we ready/prepared for IPv6?
  • (2) to what extent is IPv6 (traffic) really happening?

On the one hand those are fundamentally different dimensions (or: ‘questions’ to be answered to the audience of the respective reporting efforts), on the other hand there are relationships between them. Still I think it’s important to understand the differences. I know quite a few companies where the question ‘do we support IPv6?’ asked to the ‘network infrastructure folks’ would be answered with a resounding ‘yes, of course, we do!’. However that does not necessarily mean that much real-life application traffic happens over IPv6 in those environments. As long as there are no applications or services actually being IPv6-enabled, that nice positive answer from the network people might not be of much practical value, unfortunately (don’t get me wrong dear networking colleagues, of course I know it all starts with the network…). And enabling IPv6 for applications and services usually depends on certain core infrastructure services (those in the realm of authentication, provisioning, monitoring, or security are common examples; see also the ‘three dimensions of IP addresses’ discussed here). From that perspective displaying the current (IPv6-related) state of those (as I call them) dependencies might be more important – at least for a certain audience – than just coming up with numbers of ‘dual-stacked hosts’ (which might not use IPv6 at all as there’s nobody reachable to talk to over IPv6 ;-).

Let’s look at some typical parameters of the two above categories.
In the space of (1) the following ones come to mind:

  • State of important dependencies, maybe even in a non-numeric way (like, per dependency: ‘no ongoing v6 efforts’, ‘has v6 in dev’, ‘has v6 in production’ etc.)
  • Number of dual-stacked hosts. Assuming that the default preference of OSs is to use IPv6 when possible (of course it’s more complicated considering Happy Eyeballs and Java options/preferences) this gives an idea of ‘which percentage of our systems would use IPv6 if they had the opportunity [read: when they connect to proper IPv6 endpoints]?’.
  • When many users work from home (like in quite a few organizations nowadays, in October 2021), looking at some stats related to VPN connections might be of interest. (hint: in such times bringing IPv6 to your VPN should be a high priority effort anyway if you want to drive IPv6 across your organization).
  • number of IPv6-enabled VIPs on your load balancers in case those are infrastructure elements being used in your organization. This is a particularly interesting one as one can assume that such a VIP is only created when other infrastructure requirements are already met (read: when an application team is serious about supporting IPv6 connectivity to their respective application, and some dependencies as for authentication, monitoring or logging are already solved).
  • subnets which have IPv6. Again, this one does not tell anything about actual IPv6 traffic, but about the ‘level of preparedness’ (which is the overall purpose of this category).
  • number of AAAA records in DNS. Depending on the point of time (within an application’s journey towards IPv6) when those are created this might be an interesting indicator. For example during LinkedIn’s IPv6 initiative they deliberately did this rather late during the infrastructure services transition (slides and video of their related talk at the UK IPv6 Council can be found here), so the mere existence of an AAAA record meant serious (IPv6) business.
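As a sketch of the AAAA metric (assuming one can get hold of a simplified zone export; the record format and names below are made up for illustration):

```python
# Count AAAA records in a (simplified) zone export — a rough proxy for
# how many services are IPv6-reachable by name.
# Assumed record format: "name TTL IN TYPE rdata", one per line.
zone = """\
app1.example.com 300 IN A    192.0.2.10
app1.example.com 300 IN AAAA 2001:db8::10
app2.example.com 300 IN A    192.0.2.20
"""

records = [line.split() for line in zone.splitlines() if line.strip()]
aaaa = sum(1 for r in records if r[3] == "AAAA")
total_names = len({r[0] for r in records})
print(f"{aaaa} of {total_names} names have an AAAA record")  # → 1 of 2
```

Tracking that ratio over time (per quarter, say) is exactly the kind of number that survives translation into a management-friendly report.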

For some of the above values dashboards with ongoing data might be proper (display) instruments, but that would then depend on the ability to collect the respective numbers in an automated manner, and on the audience the reporting is intended for (not everybody likes clicking on a link to a dashboard and interpreting numbers on their own, but some recipients might prefer to get a quarterly e-mail with some core numbers which can be digested in 10 minutes, on an iPad).

Typical category (2) parameters, to show that IPv6 is actually happening (as in: end-to-end IPv6 connections take place over the network), include the following:

  • all sorts of traffic statistics (e.g. from NetFlow) which usually show the ratio of IPv6 traffic to overall traffic. Here one should keep in mind that one single but traffic-heavy application getting IPv6 might create the idea that significant progress happened which might not simultaneously be the case in the space of infrastructure dependencies, which is exactly why both main categories (and being transparent with regard to their differences) are important.
  • similar stats for connections to major applications or infrastructure services (e.g. authentication servers). Bonus: measure associated RTT or latency from vantage points or sth similar, as a follow-up question from the audience might be: “ok, and now tell us how application performance compares between IPv6 and IPv4”.
  • number of IPv6-only hosts. Assuming that those hosts establish or receive whatever type of network connections (and why else would they exist in the first place ;-), this one can really provide insight into IPv6 progress within an environment. To avoid misinterpretations in rapidly growing environments, ideally put this in relation to overall number of hosts, in case you have that number at hand (see the poll at the beginning of this post on the APNIC blog why I mention this ;-). Also please note that I use the term ‘host’ here in a loose manner which encompasses containers and other types of ephemeral entities having an IP address.
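The first bullet is easy to prototype once flow data is available in some structured form; here’s a toy computation (the record format is a made-up stand-in for real NetFlow/IPFIX exports):

```python
# Compute the IPv6 share of total traffic from (simplified) flow records.
# Each record: (ip_version, byte_count).
flows = [(4, 800_000), (6, 150_000), (4, 50_000), (6, 250_000)]

v6_bytes = sum(b for version, b in flows if version == 6)
total = sum(b for _, b in flows)
print(f"IPv6 share: {100 * v6_bytes / total:.1f}%")  # → IPv6 share: 32.0%
```

Per the caveat above, one heavy dual-stacked application can move this number a lot, so it should always be read alongside the category (1) readiness metrics.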

I hope the above provides some inspiration for those dealing with the task of making IPv6 progress within an environment visible. I’m always happy to receive feedback of any type, either here or on Twitter. Thank you for reading so far, and good luck with your IPv6 efforts.

The Role of IP Addresses in Security Processes

Reflecting on IP addresses, and on the factors contributing to having a proper inventory of active ones, recently led me to put up a Twitter poll. Here are the results:

Looking at these numbers it seems that quite a few organizations struggle with maintaining a more or less accurate inventory of active addresses in their networks. At this point infosec purists may stress the importance of a thing called asset management (I mean there’s a reason why it’s the 1st function in the 1st category of the NIST Cybersecurity Framework, right? ;-), but I for one just felt I should reflect a bit more on the role of IP addresses for certain processes within an organization. Not least as related aspects and questions may become even more important once a whole new class of addresses enters the corporate infosec ring: enter IPv6.
Let’s hence imagine there’s a certain number of systems in your organization’s network, and some of those systems are now getting IPv6. Which security processes could potentially be affected (read, in a sufficiently large organization: which teams should know about the ongoing IPv6 effort? ;-).

From a simplified perspective security processes can be grouped into the following categories (those interested in other categories find some here):

  • preventative ones (‘avoid that a threat can materialize against an asset’). For our purpose here let’s take filtering of network traffic and patch management as examples.
  • detective ones: these can include the detection of deviations of a desired security state like the detection of vulnerabilities (=> vulnerability management) or the detection of security violations (e.g. by system-local mechanisms like log files or agents, or by means of network [security] telemetry).
  • reactive security processes, e.g. incident response.

Traffic filtering is usually done in one of these two flavors:

  • gateway-based filtering (firewalls or router ACLs). If this is used, it may be of (security) interest whether there’s at least one active IP/system of a specific address family (here: IPv6) in use within a given subnet.
  • local packet filtering. If the respective rules are centrally managed, those in charge had better know about active IP speakers, I’d say. If they are not centrally managed (which is the case in many environments), first and foremost one might ask: who manages them at all? 🤔😂
    Kidding aside, for those entities responsible for the configuration of local packet filters, knowledge of active addresses of all address families is evidently valuable.
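To make the filtering point a bit more concrete, here's a toy sketch (all data structures and names are hypothetical) that flags hosts which actively speak IPv6 but whose local packet filter has no IPv6 ruleset yet:

```python
def hosts_missing_v6_rules(hosts):
    """Given a mapping host -> {'v6_active': bool, 'v6_ruleset': bool},
    return the hosts that actively speak IPv6 but have no IPv6 filter rules."""
    return sorted(name for name, info in hosts.items()
                  if info["v6_active"] and not info["v6_ruleset"])

# Hypothetical inventory data
inventory = {
    "web01": {"v6_active": True, "v6_ruleset": False},
    "db01": {"v6_active": True, "v6_ruleset": True},
    "legacy01": {"v6_active": False, "v6_ruleset": False},
}
print(hosts_missing_v6_rules(inventory))  # → ['web01']
```

The interesting (and hard) part in practice is of course populating the `v6_active` flag, which is exactly where the address inventory discussed in this post comes in.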

To properly perform patch management one usually has to know: which systems are out there? And how does one identify those systems? For example, in environments using MS Active Directory probably all domain members can be identified by some AD-inherent logic, and other systems might be identifiable by means of their FQDNs. Still it’s a safe assumption that at least some fraction of to-be-patched systems are identified by their IP address(es), so having a proper inventory can be of help here, too.
Evidently, in the course of patch management one might not only be interested in actively running systems, but also (and namely) in their OS and software components and their respective versions. Coming up with that type of information commonly falls into the realm of vulnerability management. The vast majority of vulnerability management tools & frameworks I’m aware of primarily use addresses as identifiers of systems. From that angle an accurate inventory of addresses certainly helps. The advent of IPv6 might bring some extra challenges here, as simply scanning IP subnets for active systems (in order to subsequently subject those to vuln scanning) no longer works with IPv6, so the need for a proper source of truth becomes more crucial (I already discussed this in the IPv6 Security Best Practices post).
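To illustrate the ‘source of truth instead of subnet sweeping’ point, here's a hypothetical sketch that derives a scanner target list from a plain-text address inventory (the inventory format is made up; real tooling will differ):

```python
import ipaddress

def scan_targets(inventory_lines):
    """Filter an address inventory down to non-link-local IPv6 addresses.
    Each line is expected to hold one address; malformed lines are skipped."""
    targets = []
    for line in inventory_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip comments and blank lines
        try:
            addr = ipaddress.ip_address(line)
        except ValueError:
            continue  # skip malformed entries instead of failing the run
        if addr.version == 6 and not addr.is_link_local:
            targets.append(str(addr))
    return targets

inventory = ["2001:db8::10", "fe80::1", "192.0.2.5", "not-an-address"]
print(scan_targets(inventory))  # → ['2001:db8::10']
```

The very same inventory can then feed the vuln scanner, instead of hoping that a brute-force sweep of a /64 terminates before the heat death of the universe.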

Finally, in the context of incident response, situational awareness (what’s happening, which assets are affected, what’s the source of certain things going on etc.) is generally needed. I could imagine that the ability to map IP addresses to systems can be helpful here (for identification, evaluation, follow-up), so a proper IP address inventory might be of value.

tl;dr: having an understanding of active IP addresses within an organization affects (at least) the following security controls/processes:

  • (potentially) traffic filtering
  • (potentially) patch management
  • (definitely) vulnerability management
  • (definitely) detection of security violations
  • (definitely) incident response

So when deploying IPv6 in an environment, talk to the owners of these processes in order to make sure that IPv6 does not lead to increased risk exposure.

Quick Intro to IPv6

This post strives to provide an overview of where (and why) IPv6 is different from IPv4. The intended audience are folks with a solid understanding of IPv4 but not too much exposure to IPv6 so far (I hear such an audience still exists ;-), and the post is intentionally kept short (regular readers of this blog may imagine that I’d love to extensively rant on several of the below items; some of them would deserve full posts of their own). Also I won’t go into too much technical detail.
In a nutshell the post tries to summarize why, under the hood, IPv6 is quite different from IPv4, and what those differences are.

Design Objectives

In order to understand certain elements of IPv6, it’s helpful to keep in mind that it was mainly developed in the mid-90s. It hence tried to solve some of the issues & challenges found in networking at the time, besides introducing genuinely new ideas.
For the purpose of this post the following objectives are of interest:

  • autonomy: hosts should be able to come up with the configuration of basic IP parameters on their own, without the need for human intervention/administration or the need for additional services (like DHCP).
  • (restoration of the) end-to-end principle: hosts should be able to communicate with each other without ‘the network’ providing functions besides simple packet forwarding.
  • optimization: come up with some changes that ‘make network communications more efficient and hence faster’ such as replacing broadcast by multicast, or simplifying the IP header.

And, yes, the one main reason why IPv6 gets deployed in many environments today, that is larger address space, played a role, too.

The Unpleasant Reality

In case you’ve already been working a bit with IPv6 and scratched your head while reading the above section, thinking something along the lines of “wait a second, those ideas haven’t really worked out, they’ve only been implemented half-baked, or they have even added a lot of complexity and operational pain”, you’re fully right.
Several factors have contributed to the mess we have today, like:

  • Dynamics within standards bodies based on (seemingly) voluntary work like the IETF, including the composition of working groups, their politics, plus the associated way of finding compromises etc.
  • Over-engineering in general, further fueled by certain incentives within those working groups
  • Lack of ‘feedback from the field’: during the first 15 years or so after the initial specification not much deployment happened, so nobody told those well-intentioned and smart – seriously, no irony here: they are, but the ecosystem is complex in itself – engineers that what many networks needed was just more address space, and that all the other shiny enhancements and features primarily introduced complexity and operational efforts. And things got worse with every year that passed, with regard to protocol complexity, and with regard to the inability to make fundamental changes of the design.

I’m aware that some of these things and developments are hard to understand from a technical perspective or from a 2021 point-of-view, but protocol development doesn’t happen in a vacuum, and of course hindsight is always 20/20.
Point is: from a deployment perspective, accept IPv6 as it is (the ship for significant changes has long sailed), and draw the right conclusions for your operational decisions.

Technical Elements & Changes

In this section I’ll list some of the main technical elements of IPv6 which are new for those coming from IPv4, and which play a huge role both in the way an IPv6 stack works (by default) and in how ‘an IPv6-enabled network’ behaves in general:

  • Router advertisements (‘RAs’): these are packets sent out by routers to their adjacent networks which carry information that enables hosts to perform autoconfiguration (remember the above autonomy objective). Understanding these packets, and their operational implications is crucial for smooth operations of the vast majority of IPv6 networks. I might add here that, based on some of the factors of the 2nd section, RAs are super-complex packets themselves, so they are somewhat metaphorical for the state of IPv6 ;-).
  • The link-local address (‘LLA’): in contrast to IPv4 where one and the same address is usually used both for communication within a subnet and with remote hosts, IPv6 strictly differentiates between local communication and non-local communication (the latter happening through a router/’the default gateway’). This differentiation includes a special address only used for local purposes. It uses the prefix fe80::/10.
  • Multicast: the approach of ‘general broadcasting’ when communication with multiple or ‘unknown’ hosts in the local subnet is needed, was replaced by using multicast groups (their addresses start with ‘ff’) for these types of communication. Combined with new/additional interactions (like RAs), at least in the local network (the ‘local link’ in IPv6 terms) one will usually see a lot of multicast traffic with different addresses, and for different purposes. Evidently this has a number of operational implications (which, again, are outside the scope of this post).
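To make the LLA concept more tangible, here's a small sketch deriving a classic modified-EUI-64 link-local address from a MAC address, as specified in RFC 4291 (note that many modern stacks use random IIDs instead, so this is illustrative only):

```python
import ipaddress

def mac_to_link_local(mac: str) -> str:
    """Build an fe80::/64 link-local address from a MAC via modified EUI-64:
    insert ff:fe in the middle and flip the universal/local bit."""
    octets = [int(b, 16) for b in mac.split(":")]
    octets[0] ^= 0x02  # flip the U/L bit of the first octet
    eui64 = octets[:3] + [0xFF, 0xFE] + octets[3:]
    # combine pairs of octets into the four 16-bit groups of the IID
    groups = [f"{eui64[i] << 8 | eui64[i+1]:x}" for i in range(0, 8, 2)]
    return str(ipaddress.IPv6Address("fe80::" + ":".join(groups)))

print(mac_to_link_local("00:0c:29:3b:5e:01"))  # → fe80::20c:29ff:fe3b:5e01
```

The recognizable `ff:fe` in the middle of an IID is exactly this construction at work, which is also why EUI-64-based addresses are trivially mappable back to MAC addresses (one of the reasons Privacy Extensions exist).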

To a lesser extent one could add IPv6 extension headers (EHs) to this list, but – luckily – there’s a fair chance that many of you joining the IPv6 world in 2021 won’t ever see them in operational practice (besides security filters dropping them), so no need to discuss them further here.

As one can see, quite a few architectural changes have happened between IPv4 and IPv6. Understanding them can help to make well-informed decisions during the deployment of IPv6.

IPv6 Duplicate Address Detection

In this post I’ll take a closer look at IPv6 Duplicate Address Detection (aka ‘DAD’, which evidently invites all types of jokes and wordplays). While the general mechanism should be roughly familiar to everybody working with IPv6, there are some interesting intricacies under the hood, some of which might even have operational implications.

DAD was already part of the initial specification of SLAAC in RFC 1971 (dating from 1996), which was then obsoleted by RFC 2462. RFC 4429 describes a modification called ‘Optimistic Duplicate Address Detection’. Neighbor discovery and SLAAC, incl. DAD, were later updated/specified in the RFCs 4861 and 4862 which are considered the main standards as of today. Finally DAD was enhanced in RFC 7527 but that’s of minor relevance here.

DAD’s goal is to avoid address conflicts (within the scope of the respective address). To do so, a host is supposed to perform a specific verification procedure (‘ask a certain question’) and subsequently act on the result of that procedure. However, as we will see, especially the latter can depend on a number of circumstances, in particular on the type of the address/IID.

How to ask the question?

Generally speaking a host is expected to perform the following (for a given unicast address):

  • send a Neighbor Solicitation (ICMPv6 type 135) message.
  • use the unspecified address (::, see RFC 4291, section 2.5.2) as source address, the requested unicast address’s Solicited-Node multicast address (SNMA, see RFC 4291, section 2.7.1) as destination address, and put the to-be-used unicast address as target address into the ICMPv6 payload.

This can look like this (ref. RFC 2464 for the ’33:33′ in the Ethernet multicast address):

It should be noted that RFC 4862 states that “Duplicate Address Detection MUST be performed on all unicast addresses prior to assigning them to an interface, regardless of whether they are obtained through stateless autoconfiguration, DHCPv6, or manual configuration”, but in practice this can be turned off on the OS level (and there might even exist situations where this could be desirable, see below). Still, the general verification procedure is mostly identical on the vast majority of operating systems.
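The addressing rules described above can be reproduced in a few lines; here's a sketch (helper names are mine) computing the solicited-node multicast address and, per RFC 2464, the corresponding ‘33:33’ Ethernet multicast MAC:

```python
import ipaddress

def solicited_node(addr: str) -> str:
    """Return the solicited-node multicast address (ff02::1:ffXX:XXXX)
    for a unicast IPv6 address, per RFC 4291 section 2.7.1."""
    last24 = int(ipaddress.IPv6Address(addr)) & 0xFFFFFF
    base = int(ipaddress.IPv6Address("ff02::1:ff00:0"))
    return str(ipaddress.IPv6Address(base | last24))

def multicast_mac(group: str) -> str:
    """Map an IPv6 multicast address to its 33:33 Ethernet MAC (RFC 2464):
    33:33 followed by the last 32 bits of the group address."""
    last32 = int(ipaddress.IPv6Address(group)) & 0xFFFFFFFF
    return "33:33:" + ":".join(f"{(last32 >> s) & 0xFF:02x}" for s in (24, 16, 8, 0))

sn = solicited_node("2001:db8:320:104::9")
print(sn, multicast_mac(sn))  # → ff02::1:ff00:9 33:33:ff:00:00:09
```

This is also why DAD scales: only hosts whose addresses share the same last 24 bits subscribe to a given SNMA, so the probe doesn't bother the whole link.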

Shall we wait for a response?

This is where the differences between scenarios start. As stated above, RFC 4429 describes a thing called ‘Optimistic DAD’. The idea here is to put an address into an ‘optimistic’ state right after sending out the NS, thereby making the address operational pretty much immediately (with some minor restrictions, like not sending certain packets with said address in the Source Link-Layer Address Option [SLLAO]). This optimization is supposed to be used when – as of RFC 4429 section 3.1 – “the address is based on a most likely unique interface identifier”, such as an EUI-64 generated one, a randomly generated one (Privacy Extensions, RFC 4941, more info here), a Cryptographically Generated Address (as for example used by Apple devices, see here) or a DHCPv6 address (note that the concept of ‘stable’ addresses as of RFC 7217 did not exist at the time). Optimistic DAD explicitly “SHOULD NOT be used for manually entered addresses”.
As of today it’s a fair assumption that all ‘client operating systems’ use Optimistic DAD, as can be observed in the above example, but this does not apply to servers using static addresses. This is how it looks on macOS Big Sur (note that the router solicitation is sent only two milliseconds after the DAD neighbor solicitation):

What if the response indicates a conflict?

This is where things (differences) become really interesting. While RFC 4429 has a dedicated section on the ‘Collision Case’ (sect. 4.2), it remains relatively vague, includes terms like ‘hopefully’ 😉, and states that an address collision “may incur some penalty to the ON [optimistic node], in the form of broken connections, and some penalty to the rightful owner of the address” (which doesn’t sound right to me…).
RFC 4862 mandates (in “5.4.5.  When Duplicate Address Detection Fails”) that in case of a collision of an EUI-64 generated address the IPv6 operation of the respective interface “SHOULD be disabled”, but “MAY be continued” in other (address generation) scenarios. Furthermore “the node SHOULD log a system management error”.
An interface with a static address where DAD failed could look like this:

inet6 2001:db8:320:104::9/64 scope global tentative dadfailed 
valid_lft forever preferred_lft forever
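
In case you want to check for this state programmatically, here's a hedged sketch that extracts ‘dadfailed’ addresses from `ip -6 addr` style output (the parsing logic is mine and kept deliberately simple):

```python
def dad_failed_addrs(ip_addr_output: str):
    """Extract the IPv6 addresses flagged 'dadfailed' from `ip -6 addr` output.
    Returns the address/prefix strings following the 'inet6' token."""
    failed = []
    for line in ip_addr_output.splitlines():
        tokens = line.split()
        if "inet6" in tokens and "dadfailed" in tokens:
            failed.append(tokens[tokens.index("inet6") + 1])
    return failed

sample = ("inet6 2001:db8:320:104::9/64 scope global tentative dadfailed\n"
          "inet6 fe80::1/64 scope link")
print(dad_failed_addrs(sample))  # → ['2001:db8:320:104::9/64']
```

Something like this could feed a monitoring check, given that (as noted above) RFC 4862 merely says the node “SHOULD log a system management error”, which in practice is easy to miss.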

So, overall, no guidance is provided here on how to proceed in case of a detected conflict for addresses based on RFC 3972 (CGAs), RFC 4941 (Privacy Extensions) or RFC 7217 (‘Stable IIDs’), but this may be specified in other places (see below) and/or might be left to the implementors of individual OS stacks. Many years ago Christopher Werny and I performed some testing for Windows and Linux, creating various scenarios with address collisions, and from the top of my head I recall that their behavior was both quite different and not necessarily intuitive (sorry, I don’t remember the details).

CGAs have a dedicated Collision Count parameter which can be “incremented during CGA generation to recover from an address collision detected by duplicate address detection” (RFC 3972, section 3).

RFC 4941 includes a corresponding retry provision (with TEMP_IDGEN_RETRIES defaulting to the value 3):

RFC 8415 on DHCPv6 specifies the following (with the DEC_MAX_RC parameter indicating the number of client-side retries of getting a new address; it defaults to the value 4):

Furthermore the DHCPv6 server “SHOULD mark the addresses declined by the client so that those addresses are not assigned to other clients”.
I’m not sure about the exact sequence of things when the client uses optimistic DAD (which in turn should be the default for DHCPv6 addresses).

tl;dr of this section: the exact behavior when reacting to an address collision might not always be the same, and it might depend on several circumstances.

Operational Implications (1): Service Bindings

As laid out above, Optimistic DAD is not supposed to be performed when static IPv6 addresses are used. This can create issues when, during system boot, a service is to be bound to an address which is still in ‘tentative’ state (during DAD), as discussed in this thread (see also the interesting comment at the bottom there, on the differences re: DAD between FreeBSD and NetBSD).
This could look like this:

2020/09/26 10:08:22 [emerg] 11298#11298: bind() to [2001:db8:104:1700::12]:80 failed (99: Cannot assign requested address)

Apparently this may be fixed by touching the following sysctl, but I don’t fully understand its mechanism, so this might only work in certain scenarios:

sysctl net.ipv6.ip_nonlocal_bind=1

In any case the delay induced by DAD (with static addresses) should be considered for service bindings during startup.
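An alternative (or complement) to the sysctl approach is to simply retry the bind until DAD has completed; here's a sketch under the assumption that the kernel returns EADDRNOTAVAIL for tentative addresses (which matches the ‘99: Cannot assign requested address’ error above):

```python
import errno
import socket
import time

def bind_with_retry(addr, port, attempts=10, delay=0.5):
    """Try to bind to an IPv6 address, retrying while the kernel still
    refuses it (e.g. because the address is in 'tentative' DAD state)."""
    s = socket.socket(socket.AF_INET6, socket.SOCK_STREAM)
    for attempt in range(attempts):
        try:
            s.bind((addr, port))
            return s
        except OSError as e:
            # EADDRNOTAVAIL is errno 99 on Linux, as seen in the log line above
            if e.errno != errno.EADDRNOTAVAIL or attempt == attempts - 1:
                raise
            time.sleep(delay)  # give DAD time to finish

# Loopback is never tentative, so this succeeds immediately; port 0 = ephemeral
sock = bind_with_retry("::1", 0)
print(sock.getsockname()[1])
```

Whether retrying or flipping `ip_nonlocal_bind` is the better option depends on the service; the retry at least keeps the kernel’s address validation intact.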

Operational Implications (2): cni0 interface stuck in DAD

I once heard of a case where the cni0 bridge interface on Kubernetes clusters was stuck in DAD when initialized by standard CentOS initscripts (which in turn was difficult to troubleshoot as it only had veth members and wasn’t bound to any physical interface). This could presumably only be solved by disabling DAD as a whole. That might be a debatable approach (I for one think this is perfectly doable even in other settings, once one has sufficient control over the [static] address assignment mechanisms), but for completeness’ sake here’s the relevant sysctl (from the current Linux kernel documentation):

Suffice it to say that DAD might kick in in various ways and in the context of different dependencies, so one has to be aware of its inner workings and of its role during interface initialization.
To contribute to such an understanding was the exact point of this post ;-). Thank you for reading so far, and as always I’m happy to receive feedback on any channel, incl. Twitter.