Provenance for Data, AI Model and Software Artifacts – Combining OIDC and short-lived private keys

Provenance (see Wikipedia) is an important concept in information technology: it essentially establishes that a digital artifact, such as a dataset, an AI model or a piece of software, meets the expectations of the artifact consumer.

These expectations can be of different kinds: they can describe how the artifact was generated, that it has been subject to certain automated tests, or that it was produced by a trusted entity, for example a specific organization or person(s) (see also in-toto).

The Supply-chain Levels for Software Artifacts (SLSA) standard includes provenance as one core aspect of secure software supply chains.

In this blog post I will focus on one important aspect of provenance: the producer cryptographically signs the digital artifact, so that the consumer can verify in an automated fashion that it indeed comes from that producer and possibly also meets certain criteria.

The next section investigates how signing has been done for the past couple of decades and what the weaknesses of this approach are. I will then explain an emerging standard that addresses these weaknesses and is based on a combination of OpenID Connect (OIDC) and short-lived private keys for signing artifacts. Finally, I will look at how the software, data and AI model ecosystems are adopting both the established and the novel practice.

Signing Artifacts – Current State – Asymmetric cryptography and the web of trust

Provenance works by putting digital signatures on digital artifacts. To do this, one needs to employ asymmetric cryptography.
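
To make this concrete, here is a minimal sketch of signing and verifying an artifact with asymmetric cryptography, using the widely used Python package "cryptography" and an Ed25519 key pair. The artifact bytes are placeholders; real systems add standard formats and key distribution on top of this primitive:

```python
# Minimal sketch of asymmetric signing/verification with the Python
# "cryptography" package; the artifact content is a placeholder.
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric import ed25519

# Producer side: generate a key pair and sign the artifact.
private_key = ed25519.Ed25519PrivateKey.generate()  # known only to the producer
public_key = private_key.public_key()               # shared with consumers

artifact = b"contents of a dataset, an AI model or a software package"
signature = private_key.sign(artifact)

# Consumer side: verify the signature with the producer's public key.
try:
    public_key.verify(signature, artifact)
    print("Signature valid - artifact was signed by the private key holder")
except InvalidSignature:
    print("Signature invalid - do not use this artifact")
```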

Currently this is most commonly done for software artifacts and much less for data artifacts or AI models.

For software artifacts, the following steps are performed:

  • The software providers sign the artifact using their private key that only they know. Each private key is associated with a public key that is shared publicly, often stored in a standard format (see Cryptographic Message Syntax (CMS)). Providers typically use software such as the GNU Privacy Guard (GPG) or Java's Bouncy Castle for this.
  • The public key is shared through a different channel than the signed digital artifact (e.g. hosted on another server).

The software consumers do the following:

  • They fetch the public key from somewhere and make sure that it is indeed the one belonging to the software provider.
  • They verify the signature with the public key, again using any software that supports the corresponding standards (a minimal sketch follows after this list).
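
The following sketch shows how these provider and consumer steps typically look with GPG, here driven from Python via the gpg command-line tool. The file names and the key id alice@example.org are placeholders:

```python
# Sketch of the traditional GPG-based signing flow; file names and the
# key id "alice@example.org" are placeholders.
import subprocess

# Provider side: create a detached, ASCII-armored signature of the artifact
# with the provider's (long-lived) private key.
subprocess.run(
    ["gpg", "--local-user", "alice@example.org",
     "--armor", "--detach-sign", "artifact.tar.gz"],
    check=True,
)  # writes artifact.tar.gz.asc

# Consumer side: import the provider's public key (obtained via a separate,
# trusted channel!) and verify the detached signature against the artifact.
subprocess.run(["gpg", "--import", "provider-public-key.asc"], check=True)
result = subprocess.run(["gpg", "--verify", "artifact.tar.gz.asc", "artifact.tar.gz"])
print("Signature valid" if result.returncode == 0 else "Do not use this artifact")
```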

If the software consumer has obtained the public key in a secure manner from the software provider and trusts that provider, then they can use the software after further due diligence (e.g. checking that it has been tested properly). If either the signature verification fails or the public key is not trustworthy, then the software does not come from the trusted software provider and should not be used.

The latter aspect refers to the Web of Trust: software consumers can sign the public keys of software providers, and these signatures can then be used to establish a "web of trust" among software providers and consumers.

For other digital artifacts, such as data or AI models, it works similarly.

Issues with the current approach

Providing provenance for digital artifacts has existed for decades. Examples are the package repositories of operating systems, such as Linux distributions (e.g. here for openSUSE). It is not exactly clear when popular package managers introduced support for signatures. It might be related to the introduction of GnuPG in 1999/2000, or of PGP in 1991. According to Wikipedia, the first software was signed in 1995.

Signing started to be widely used after attacks on some repositories in which packages were replaced by malicious ones. These malicious packages could then be installed by all users of the distribution without them noticing, as they could not verify the artifacts.

In most Linux distributions, all packages are signed with one key of the distributor. In this way, users need to "just" trust the distributor who builds the packages from the source code. This reduces complexity for the user, as one does not need to trust the author(s) of each individual package.

This is often not the case when building software in various programming language ecosystems. Here, a medium-sized enterprise application often includes 50 or more dependencies (e.g. in Java, TypeScript/JavaScript, Python, .NET, Rust etc.) from various vendors and individuals.

However, the first problem is often that these packages do not have any signatures at all. While signing is enforced in Java (when using the most popular package repository, Maven Central), it is not required in, for example, the Python ecosystem, where less than 10% of packages are signed (according to this source). PyPI – the most popular package repository for Python – even deprecated signatures in 2018.

Even if packages have signatures, they are not verified by default by build tools such as Maven or pip when they are integrated into an application. This may be surprising, but there is some logic to it – how should the build tool know that the entity that signed the library is trustworthy? For that, the developers would need to obtain the public keys from somewhere and configure the build tools to use them for verification…

Even if the build tools did verify the signatures, one would still need to trust the signatory. What if the private key leaked and no one noticed? This is not uncommon, especially in modern CI/CD pipelines: they are often widely visible, and if someone misconfigures the permissions of a repository, keys might be stolen.

This is especially likely with the "traditional" digital artifact signing approach based on GnuPG and the like. Here, software providers often use very long-lived private keys, as it would otherwise be very difficult for software consumers to keep their list of trusted public keys up to date. The longer a private key lives, the more likely it is that it gets stolen and abused to make malware look legitimate, because the malware is signed with a stolen but trusted private key.

This is not a theoretical attack – it happens very often in practice (cf. this article in Computerworld, the theft of MSI UEFI signing keys, the theft of NVIDIA code signing keys, the breach of a cloud CI tool that led to the exposure of the private signing keys of thousands of organizations, etc.).

Such attacks are in fact easy: Continuous Integration (CI)/Continuous Delivery (CD) pipelines are frequent targets, and they are weakly protected because many organizations do not consider them as relevant to protect compared to applications that process confidential data in production. This is a terrible mistake, because those pipelines build the very applications that later run on confidential data in production!

The longevity of private keys has another issue: the keys were often created with cryptographic algorithms that were state of the art a long time ago, but that can nowadays be easily broken due to faster computers or weaknesses found in the algorithms in the meantime.

It gets worse: even if the software provider notices that their signing key was stolen, there is no easy way for them to communicate this to the software consumers! That means a stolen private key can be used to sign malware for a very long time, even after the theft is discovered, and software consumers will happily accept these signatures because they have no way of knowing that the provider's private key was stolen.

Signing individual contributions to a source code repository relies on the same mechanisms, and there the situation is even worse: individual developers often keep their private signing keys on their laptops, where they can easily be stolen using targeted phishing attacks. Hardware keys, such as the OpenPGP smartcard, can prevent this, but they obviously do not address all the previous issues. I do not have statistics on how many developers on popular open source platforms, such as codeberg.org or github.com, actually sign their commits, but I assume it is few – mostly very popular projects, or some enterprises that want to enforce high security on certain internal projects.

We can conclude that signing is currently not used often, and even where it is used, the current approach has too many flaws to be truly useful.

Novel approach: Linking OIDC identities with short-lived private keys

The new approach (e.g. as proposed by Sigstore) works as follows on the software provider side:

  • Using OpenID Connect, a person (via their OpenID provider) or an automated process (via Workload Identity Federation or SPIFFE Verifiable Identity Documents – more on that in a future blog post) authenticates to a certificate authority
  • The certificate authority provides a short-lived signing certificate, which is used to sign the given software artifact, dataset or AI model
  • The signing metadata is added to one or more tamper-proof transparency logs, i.e. logs that cannot be modified afterwards (a sketch of the signing step follows below)
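
As an illustration, the following sketch triggers this "keyless" signing flow with Sigstore's cosign command-line tool, driven from Python (flags as in cosign 2.x; the artifact name model.onnx is a placeholder). Run interactively, cosign opens a browser for the OIDC login; in a CI/CD pipeline it can pick up an ambient OIDC identity token instead:

```python
# Sketch of keyless signing with Sigstore's cosign CLI, driven from Python;
# assumes cosign 2.x is installed and "model.onnx" is the artifact to sign.
import subprocess

subprocess.run(
    ["cosign", "sign-blob", "model.onnx",
     "--output-signature", "model.onnx.sig",    # the signature itself
     "--output-certificate", "model.onnx.pem",  # the short-lived certificate
     "--yes"],  # acknowledge uploading the metadata to the transparency log
    check=True,
)
```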

On the consumer side, the process is as follows (see here):

  • They take the signing metadata associated with a binary
  • They verify that the signature is correct and was made with a short-lived signing certificate confirming that a given identity (e.g. the developer), authenticated by a given OIDC provider (e.g. github.com), requested it – this relies on checking that the certificate was indeed issued by the trusted certificate authority (see the verification sketch below)
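
A consumer-side sketch with cosign could look as follows (again cosign 2.x flags; the identity and issuer are placeholders). Note how the expected identity and the OIDC provider that authenticated it are pinned explicitly:

```python
# Sketch of consumer-side verification with cosign; the expected identity
# "alice@example.org" and the GitHub OIDC issuer URL are placeholders.
import subprocess

result = subprocess.run(
    ["cosign", "verify-blob", "model.onnx",
     "--signature", "model.onnx.sig",
     "--certificate", "model.onnx.pem",
     "--certificate-identity", "alice@example.org",  # who must have signed
     "--certificate-oidc-issuer", "https://github.com/login/oauth"],  # which OIDC provider
)
print("Trusted artifact" if result.returncode == 0 else "Verification failed")
```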

This also requires that the software consumer is sure that a given identity is indeed the true package maintainer. If the package is open source and published in a public repository, such as GitHub, then one can look at the code and see that the identity is indeed contributing to it. It also implies that the OIDC providers themselves, such as GitHub, are trusted.

Nevertheless, this approach has the following advantages compared to the previous approach:

  • There is no need to manage private keys any more – they are short-lived and can be deleted after the signing process. Thus they cannot leak – and even if they did, they have only a short lifetime after which they can no longer be used for signing.
  • All other issues with long-lived private keys, such as outdated cryptographic algorithms, do not matter for short-lived keys
  • One does not need to inform others about key leakage, and they do not need to react – a key is used just once, for one artifact, at a specific time; everything signed afterwards is simply not valid
  • One can provide provenance for the whole process that produced a digital artifact (see in-toto)
  • It is much easier to integrate into automation – also password-less (see my next blog post on this)

Of course, it does not save you from regularly doing manual due diligence on your software and its dependencies (e.g. quality, security etc.), but it does save you from – possibly unnoticed – attacks on central artifact distributors, such as Maven Central or PyPI, or attacks on user accounts of those central artifact distributors.

Furthermore, it provides a simple mechanism for allowing only dependencies from people/organizations that you regularly check and trust centrally in your enterprise. Everything that comes from unknown sources has to go through due diligence before you allow it – especially for highly confidential applications.

While it does not fully solve all the issues of the web of trust, it simplifies them a lot and makes them much more manageable.

State of the Ecosystem

At the moment, the novel approach is provided by Sigstore, which allows you to run the necessary infrastructure yourself inside your enterprise, or to use the public Sigstore infrastructure for open source packages for free.

You can use any OIDC provider – including your enterprise-internal one.

The architecture is illustrated in the following diagram, which describes how the Sigstore implementation works at a high level:

It contains the following components:

  • Fulcio: Root Certificate Authority – hands out short-lived code signing certificates
  • Cosign: signs/verifies digital artifacts. It is a command-line tool, but it also provides libraries for various programming languages. It is used by the software provider, or by an automated process such as a CI/CD pipeline, to sign the digital artifact
  • Rekor: tamper-proof transparency log for signed artifacts
  • OIDC provider: any OIDC provider can be used to verify that an identity belongs to a user or to an automated process (e.g. using Workload Identity Federation). Examples are GitHub, GitLab, Google, AWS, Microsoft Azure, Keycloak, your own enterprise-internal one etc.
  • The software consumer, or an application acting on their behalf, verifies the signature and that it was generated with a short-lived certificate issued by the certificate authority – optionally, they may also verify the entry in the transparency log (a sketch of querying the log follows after this list)
  • Optional: policy engines where you can define trust policies that are enforced at build time and at runtime. These may not only require that an artifact was created by a certain provider, but also ensure that it followed a certain quality process
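
To illustrate the role of the transparency log, here is a small sketch that looks up the Rekor entries for an artifact using the rekor-cli tool (assuming rekor-cli is installed; the artifact name is a placeholder):

```python
# Sketch: search the public Rekor transparency log for entries matching the
# SHA-256 digest of an artifact ("model.onnx" is a placeholder).
import hashlib
import subprocess

with open("model.onnx", "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()

# Prints the UUIDs of all log entries recorded for this digest; an individual
# entry can then be inspected with: rekor-cli get --uuid <uuid>
subprocess.run(["rekor-cli", "search", "--sha", digest], check=True)
```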

Open source package registries, such as npm, are already working on integrating this approach into their processes and build tools.

Gitsign allows you to apply this approach to signing git commits.
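
Setting this up is essentially a matter of git configuration. A small sketch follows, assuming the gitsign binary is installed and on the PATH (configuration keys as documented by the Gitsign project):

```python
# Sketch: configure the current git repository to sign commits and tags
# via Gitsign (assumes the gitsign binary is on the PATH).
import subprocess

for key, value in [
    ("commit.gpgsign", "true"),       # sign every commit
    ("tag.gpgsign", "true"),          # sign every tag
    ("gpg.format", "x509"),           # use X.509 certificates instead of OpenPGP
    ("gpg.x509.program", "gitsign"),  # delegate signing to gitsign
]:
    subprocess.run(["git", "config", "--local", key, value], check=True)

# A subsequent "git commit" triggers the OIDC login and signs the commit with
# a short-lived certificate - analogous to cosign for artifacts.
```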

For AI models, there are also proposals for the Open Neural Network Exchange (ONNX) format to support this novel way of ensuring provenance.

In my next blog post, I will explain how this integrates with the identities of automated processes using OIDC (e.g. based on Workload Identity Federation) without the need for passwords or other mechanisms that may leak credentials.

