Semantic Versioning for Artificial Intelligence (AI) 1.0.0

Artificial Intelligence (AI) is becoming part of more and more applications that cater to the needs of many people. While AI is part of software products, it evolves at a very different velocity and its need for change is less predictable. This is especially true for open-ended domains, such as natural language processing (NLP), where the content can change or the type of content varies greatly over time; think, for example, of news articles, scientific articles or encyclopaedia entries.

Versioning is a core concept in enterprise software development and has the following benefits:

  • Track different versions of the same software and where they are deployed
  • Enable collaboration on the development of a software system
  • See which deployed software, in which version, has defects and security issues fixed or unfixed
  • Create a given instance of a system in a reproducible manner using the versions of its modules, services, interfaces and dependencies
  • Assess more easily the impact of deploying a new version of the software, provided semantic versioning is followed
  • Express compatibility with previous versions
  • Guarantee to consumers of the software that certain features are deployed or are on a roadmap to be deployed in the future
  • Provide different user groups with different versions of the software and track which issues occurred for which user with which version of the software
  • Identify outdated software that does not correspond to the latest regulations and compliance standards
  • Identify business logic that does not correspond to the latest ethical standards (cf. Trustworthy AI) or business practices

To sum it up: software versioning ensures delivery of the business value of deployed software while enabling management of the associated risks.

A given piece of software can consist of many different modules, libraries etc. that may all have their own versioning.

While versioning in software can refer to an established standard called semantic versioning (see below), no such concept exists for AI models, and the differences between software and AI models make the software standards inapplicable. Simply applying software versioning standards to AI models will lead to confusion, defects and security issues when deploying AI models. Surprisingly, many cloud providers do not version their AI model artifacts at all, introducing all the issues described here.

In this post I propose an approach for semantic versioning of AI models. The approach is itself versioned and starts with 1.0.0. It covers all facets of AI, such as supervised learning, unsupervised learning, natural language processing, reinforcement learning etc. After explaining semantic versioning for software, I describe the differences between software and AI model artifacts. From those differences I then derive a semantic versioning concept for AI models.

Nevertheless, many open research challenges still exist in AI model versioning that may require years or decades of significant research work to solve, if they are solvable at all.

Semantic Versioning for Software

Versioning has existed in the history of software for a long time. Although versioning of hardware or products dates back even longer, versioning of software demonstrated its usefulness early on due to the rapid change that is possible when developing software.

Historically, many different version schemes have been developed. Semantic versioning attempts to standardize a version scheme to deal with the growing complexity of software containing many different artifacts, with the aim of reducing the risk that something breaks and of always being able to return to a safe, known-working version of such a complex software construct.

The semantic versioning standard follows its own versioning conventions.

The rules are as follows (quoted from the semantic versioning standard 2.0.0):

Given a version number MAJOR.MINOR.PATCH, increment the:

1) MAJOR version when you make incompatible API changes

2) MINOR version when you add functionality in a backwards compatible manner, and

3) PATCH version when you make backwards compatible bug fixes.

Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.
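As a minimal illustration (a sketch of this post, not part of the standard), the precedence of MAJOR.MINOR.PATCH versions can be expressed in Python by comparing the numeric components as tuples:

```python
def parse_semver(version: str) -> tuple[int, int, int]:
    """Split off any pre-release/build suffix and return the numeric core.

    Note: this sketch ignores the standard's full precedence rules for
    pre-release identifiers; it only compares MAJOR.MINOR.PATCH.
    """
    core = version.split("+")[0].split("-")[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return (major, minor, patch)

# Tuples compare element-wise, which matches MAJOR > MINOR > PATCH precedence.
print(parse_semver("2.1.0") > parse_semver("2.0.9"))  # True
```

The same element-wise comparison is reused implicitly throughout this post whenever versions are ordered.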

Albeit written for software, it can also apply to hardware, especially nowadays when nearly every piece of hardware contains some software.

Nevertheless, semantic versioning cannot be applied to everything. For example, it makes less sense for documents, except maybe technical documentation accompanying a software product. It is very difficult to translate the rules above to a news article, a tweet, a video, a piece of art, a financial report or a Wikipedia article, each for different reasons. It also does not apply to corpora.

What is artificial intelligence?

We take here the definition of artificial intelligence within the scope of trustworthy AI by the High-Level Expert Group on AI coordinated by the European Commission:

Artificial intelligence (AI) systems are software (and possibly also hardware) systems designed by humans that, given a complex goal, act in the physical or digital dimension by perceiving their environment through data acquisition, interpreting the collected structured or unstructured data, reasoning on the knowledge, or processing the information, derived from this data and deciding the best action(s) to take to achieve the given goal. AI systems can either use symbolic rules or learn a numeric model, and they can also adapt their behaviour by analysing how the environment is affected by their previous actions.

As a scientific discipline, AI includes several approaches and techniques, such as machine learning (of which deep learning and reinforcement learning are specific examples), machine reasoning (which includes planning, scheduling, knowledge representation and reasoning, search, and optimization), and robotics (which includes control, perception, sensors and actuators, as well as the integration of all other techniques into cyber-physical systems).

Differences between Software and AI Model Artifacts

Before I discuss the differences between software and AI model artifacts, I want to briefly explain what is meant by AI model artifacts:

  • Data: Data is used to train, validate, test and monitor a model. In the case of supervised learning this includes labeled data from which the AI algorithm can learn. The data might have been extracted/transformed and only be represented as versioned feature vectors. Data can be of any type, e.g. structured data, text, images, audio or video. In an online machine learning setting the data changes continuously. Note: entities in the data itself may be versioned. I refer here to a full dataset that may or may not contain versioned entities.
  • Initial State: The initial state comprises all settings/configurations related to the training of an AI model. This includes, but is not limited to, the initial random seed and/or the initial values of the hyperparameters. Different initial states may, ceteris paribus, lead to a differently trained model and/or different behaviour during inference.
  • Final State: Training metrics, such as accuracy, are additional datasets that are generated after the model training. Note: the trained model itself is a dedicated artifact (see below).
  • Pre-trained artifacts: Pre-trained artifacts, such as word embeddings, foundation models or pre-trained model weights, that have been partly or fully trained outside of the current model.
  • Model: The model represents the final learned representation of the data. Additionally, it may contain handcrafted rules/relationships between data entities, such as knowledge graphs, ontologies or symbolic learning rules. It can come in various, usually binary, formats.
  • Source Code/Application Binaries: The source code of the application that preprocesses data and trains and evaluates a model. Additionally, it contains functionality to use the trained model for making predictions on new data. Application binaries are relevant because the same source code can be compiled for different target platforms, such as different CPUs, GPUs etc., which in many cases leads to different variations of the same AI model with possibly different behaviours.

Out of scope of this definition is an AI System, which is a collection of data artifacts, AI Model artifacts, software artifacts and hardware artifacts.
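To make the artifact list above concrete, the following sketch models an AI model as a bundle of versioned artifacts. The class and field names are illustrative assumptions of this post, not a standard:

```python
from dataclasses import dataclass, field

@dataclass
class AIModelArtifacts:
    """Illustrative bundle of the versioned AI model artifacts listed above.

    Each entry holds a 'name-MAJOR.MINOR.PATCH' identifier; the field names
    are assumptions of this sketch, not a standard.
    """
    data: list            # e.g. ["product-data-1.0.1"]
    initial_state: str    # e.g. "product-initialstate-1.0.0"
    final_state: str      # e.g. "product-finalstate-1.1.0"
    pretrained: list = field(default_factory=list)
    model: str = ""
    source_code: str = ""

classifier = AIModelArtifacts(
    data=["product-data-1.0.1"],
    initial_state="product-initialstate-1.0.0",
    final_state="product-finalstate-1.1.0",
)
```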

There are major differences between software and AI artifacts:

  • Software is usually not as sensitive to data or environmental changes as AI artifacts.
  • Small changes may lead to big changes in the behavior of an AI. For example, a small change in input may lead to a completely different prediction.
  • Depending on its usage context, an AI usually becomes outdated much more quickly and needs to be retrained („updated“), and the need for retraining may be less predictable than for software
  • The quality of an AI cannot be assessed as „success“ or „failure“ as in software tests; instead one needs to estimate the risk and impact of an AI being wrong
  • Even more than in software, traceability of why an AI produced a certain output is needed to ensure trustworthy AI
  • AI model artifacts are much more difficult to reproduce, because they depend on many more factors than software. However, reproducibility is key for trustworthy AI.
  • A software update can easily be rolled back to an older version in case of failures. AI artifacts usually cannot, as the need to replace an AI usually arises from the fact that the current version no longer serves the business process. There are exceptions, such as replacing an AI model to make it more robust against attacks, but those are generally rare.
  • AI models introduce new security attack vectors (see here)

The main reason for those differences is that software is usually defined by more or less experienced people who take into account the whole context of the software, such as business processes or common knowledge. Thus, they also incorporate knowledge about the potential future, e.g. by making interfaces more general to cater for future needs. AI still has a lot of difficulties with this.

Hence, I propose other conventions for versioning AI model artifacts.

Semantic Versioning for AI Model Artifacts

Related work

As described before, there is a lot of related work on software versioning, which largely consists of variations of semantic versioning. I do not provide an extensive literature review of the very large state of the art on data/schema/ontology versioning here, but give some examples. Very little work exists on versioning of AI model artifacts.

Schema Versioning

While there is a lot of work on how to implement data versioning (e.g. here), there is little work on what the semantic meaning of a version should be in order to address all of the requirements for versioning described previously. Often the approaches refer to versioning of single data entities, e.g. in the context of slowly changing dimensions in data warehouses, or to general schema versioning describing the change of data structures, but not of the data itself. While those approaches have been and still are very relevant in the context of data processing software, they were never designed with the specifics of AI model artifacts in mind.

Ontology Versioning

There is also some work on ontology versioning. Ontology versioning has some overlap with parts of AI, especially reasoning and planning, as well as with schema versioning. It aims at taking into account end users' interpretation of concepts (cf. here or here) and their change, as well as how to apply a change (e.g. as of now, or retroactively to previous instances). An example given by the reference is an ontology about traffic connections in a city for bicycles. If you add water transport, then the meaning of a bridge can be rather different: in the first case it accelerates transport by bike significantly, but in the second it is an obstacle for a ship.

As said before, ontology versioning has some overlap with AI, because an AI takes as input data with underlying concepts and outputs concepts, or actions based on concepts. For example, think about an outdoor shop that provides ski equipment and recommends related products: it should not recommend swimsuits just because some users bought them together, for example because they travel a lot between extremely different vacations. I have also explained before that symbolic approaches and deep learning/machine learning will need to intersect to build better AI.

Others (e.g. here) aim at the ability to compare ontologies and assess the differences between them. This, however, poses semantic challenges: ontologies with different structures can represent the same concepts, and differences between ontologies can go beyond syntactic differences (e.g. a bank in a park vs a bank managing money).

Most ontologies see their context only at the level of the ontology itself, not at the level of the software module they are used in.

Ontology versioning introduces backward compatibility as a criterion for comparing versions. If you look at the definition of semantic versioning, this also exists in software. Although they are realized differently, they have the same objective: assessing the risk of switching to a newer version, or of moving back to recover from an incorrectly working new version.

Issues with current state of the art

Obviously, it would be best if we could reuse approaches for software versioning/schema versioning/ontology versioning.

Unfortunately, we cannot. The assumptions underlying those approaches are not the same. Mainly, there are the following issues:

(1) One cannot compare AI models the way one can compare software/schemas/ontologies. We already saw that ontologies can be difficult to compare at the semantic level. While we could perhaps compare differences in data, source code and trained hyperparameters of the same algorithm, we cannot assess their impact. Even if we used the same sets of training and test data for the different models, this would not be very meaningful; one has to observe the behaviour of an AI model in production, facing different data over time.

(2) There is little work on describing inputs and outputs of an AI for training or inference semantically. Assume you have the same source code and the same hyperparameters, but the only thing you vary is the training data. For example, you want to train a product classification tool on new products that simply did not exist when the original classifier was developed. As we have seen for ontology versioning, this is an unsolved problem.

(3) One cannot assess the impact of a change of behaviour of an AI between different versions. Obviously, we can say that, for example, it still solves the same task (e.g. classification), but we cannot say if, or to what extent, it does so differently. This is especially difficult for large-scale pre-trained embeddings/models, where even for very good AI engineers/data scientists it is already challenging to assess their usefulness and plan training for an individual version. At the moment, no work exists that allows estimating their impact (if any) when the version changes.

(4) It is currently difficult to assess how changes in the underlying components of a model, such as data, initial state, pre-trained artifacts, the trained model or source code/application binaries, affect the AI over time. This is even more challenging in more complex scenarios, such as online machine learning.

Each issue individually already highlights that syntactic comparisons („diffs“) are not adequate for versioning AI model artifacts. However, we also do not know, at the moment, how changes to AI model artifacts impact the AI as a whole, let alone a system consisting of one or more AI model artifacts.

AI Model Artifacts Semantic Versioning

Version the individual AI Model artifacts

An important aspect is that all artifacts used to build the AI model are versioned themselves. They can be reused in various contexts and possibly in different models. Thus, one key aspect is to version all the AI model artifacts of which an AI model consists. In the following subsections I describe, for each of the AI model artifacts, a versioning approach inspired by semantic versioning for software as explained above. Later I also provide some examples.

Data

Given a version number MAJOR.MINOR.PATCH, increment the:

1) MAJOR version when the schema of the data changes, when the semantic concepts underlying the data change, or when observations of the previous version are removed or changed,

2) MINOR version when you add new observations such that the distribution of the observations is expected to change, and

3) PATCH version when you add new observations that are expected to belong to the same distribution as the previous version.

Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.

Metadata can be rather complex and not everything can be described by simply extending the version. Thus, metadata should mostly be described in an additional machine-readable file. I suggest only adding labels that indicate the period for which the data is valid (e.g. the data consists of Wikipedia pages from May 2019), the license (e.g. a Creative Commons license) and the information privacy regime (e.g. gdpr-public if it contains no personal information according to the General Data Protection Regulation (GDPR)); in essence, any information that is useful to quickly determine whether the data can be used at all.
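Under the rules above, a described data change maps directly to a version component to bump. A minimal sketch, where the change labels are illustrative assumptions of this post:

```python
def data_version_bump(change: str) -> str:
    """Map a data change to the version component to bump, following the
    data versioning rules above. The change labels are assumptions of
    this sketch, not a standard vocabulary."""
    if change in {"schema_changed", "concepts_changed",
                  "observations_removed", "observations_changed"}:
        return "major"
    if change == "observations_added_new_distribution":
        return "minor"
    if change == "observations_added_same_distribution":
        return "patch"
    raise ValueError(f"unknown change: {change}")

print(data_version_bump("observations_added_same_distribution"))  # patch
```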

Initial and Final State

Given a version number MAJOR.MINOR.PATCH, increment the:

1) MAJOR version when you add or remove hyperparameters, when evaluation metrics on the same test dataset change by more than ±3%, when the test dataset changes, or when evaluation metrics are added/removed,

2) MINOR version when you change the initial settings of the hyperparameters, when evaluation metrics on the same test dataset change between ±1% and ±3%, or when the calculation of evaluation metrics is changed, and

3) PATCH version when you change the random seed or evaluation metrics on the same test dataset change by less than ±1%.

Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.
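The metric thresholds above can be expressed directly in code; a minimal sketch, assuming the metric change is measured in percentage points on the same test dataset:

```python
def final_state_bump(metric_change_pct: float) -> str:
    """Version bump for the final state based on the absolute change of an
    evaluation metric on the same test dataset, per the thresholds above."""
    delta = abs(metric_change_pct)
    if delta > 3.0:
        return "major"
    if delta >= 1.0:
        return "minor"   # between +-1% and +-3%
    return "patch"       # less than +-1%

print(final_state_bump(2.9))  # minor  (e.g. accuracy 90.1% -> 93.0%)
print(final_state_bump(0.5))  # patch
```

Note that this sketch only covers the metric-delta rules; the other triggers (hyperparameters added/removed, test dataset changed, metric calculation changed) would need to be checked separately.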

Pretrained Artifacts and (Binary) Model

Given a version number MAJOR.MINOR.PATCH, increment the:

1) MAJOR version when any of the dependent artifacts changes its major version, when semantic concepts underlying the model are added/removed/changed, when rules are added/removed, or when artifacts are added/removed,

2) MINOR version when at least one of the dependent artifacts changes its minor version, when existing rules change, or when the binary format of the model/pre-trained artifact changes, and

3) PATCH version when at least one dependent artifact changes its patch version.

Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.

Source Code/Application Binaries

Source Code/Application Binaries: use standard semantic versioning as described before. It should be seen in the context of the data, as explained in the corresponding section on data versioning.

Version the combination of different AI model artifacts

We have seen how individual AI model artifacts are versioned. Once this is done, the combination of artifacts that delivers an AI model should be versioned as follows:

Given a version number MAJOR.MINOR.PATCH, increment the:

1) MAJOR version when any of the dependent artifacts changes its major version or when artifacts are added/removed,

2) MINOR version when at least one of the dependent artifacts changes its minor version or the binary format of the model/pre-trained artifact changes, and

3) PATCH version when at least one dependent artifact changes its patch version.

Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format. You may recall what I mentioned for data versioning: keep only key metadata as an extension to the version. All other metadata should be described in a machine-readable format in a dedicated file that is itself part of the versioning.
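The combination rule largely boils down to inheriting the most severe bump among the artifacts, with added or removed artifacts forcing a major bump. A simplified sketch (it omits the binary-format trigger for minor bumps):

```python
# Ordering of bump severities, "none" meaning no artifact changed.
SEVERITY = {"none": 0, "patch": 1, "minor": 2, "major": 3}

def combined_bump(artifact_bumps, artifacts_added_or_removed=False):
    """Derive the version bump of the combined AI model from the bumps of
    its dependent artifacts, per the combination rules above."""
    if artifacts_added_or_removed:
        return "major"
    # The model inherits the most severe bump among its artifacts.
    return max(artifact_bumps, key=SEVERITY.__getitem__, default="none")

print(combined_bump(["patch", "minor"]))                   # minor
print(combined_bump([], artifacts_added_or_removed=True))  # major
```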

I also strongly recommend adding a versioned model card as descriptive information to the combination of AI model artifacts.

Changing versions

The main motivation for the above definitions is to enable users of AI model artifacts to manage and assess the risk of changing a version. Compared to software artifacts, it is more difficult to assess the risk of a change: small changes can lead to very different behaviour of an AI, which can be positive or negative. Nevertheless, the main guideline I followed when defining the versioning approach was that, with high likelihood, the impact of changes can be assessed correctly.

Finally, one needs to consider that not only the AI model changes, but also its context of use. This should always be carefully assessed and monitored on a continuous basis. For example, if you have a classifier for products, then the underlying product set for which inference is done may change over time, thus requiring a new version of the AI model. Another example is that of pre-trained large-scale NLP models: they are usually based on old data, e.g. Wikipedia from before 2019, because they cost so many resources (time, money, people, hardware) to create. While their version does not change, their usefulness will likely degrade over time.

Implementation

I will describe here some simple approaches to how versioning can be implemented and used from a technical perspective.

Declaring version metadata

Declaring version metadata for an AI model and the artifacts it is composed of can be inspired by approaches for software. There, build tools such as cargo, gradle, sbt, maven, bazel etc. use build files that contain the version of the software and of each individual dependency. The build tool makes sure that dependencies are fetched from a known and trusted repository when needed (see next section).

Here is how such a build file for versioned AI model artifacts could look in Tom's Obvious, Minimal Language (TOML) format:

name = "example-model"
version = "1.0.0-gdpr-public-cc4"
author = "Jörn Franke"
modelcard = ""
repositories = ["","","",""]

Here is another example in YAML Ain't Markup Language (YAML) format:

name: example-model
version: 1.0.0-gdpr-public-cc4
author: Jörn Franke
artifacts:
  - example-data-wiki201905-1.0.0-gdpr-public-cc4.parquet
  - example-initialstate-1.0.0.cbor
  - example-finalstate-1.0.0.cbor
  - example-pretrained-nlp-1.0.0.bin
  - example-algorithm-1.0.0
  - tensorflow==2.7.0
Artifacts should be identifiable through a unique uniform resource locator (URL), e.g. provided by an AI Model Artifact Repository. You can see that per artifact type multiple artifacts can be defined, e.g. multiple datasets and multiple software dependencies.

A build tool for AI model artifacts would use efficient resolver/reasoning mechanisms (e.g. libsolv) to resolve the optimal subset of versions and dependencies for building an AI Model.
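A real resolver such as libsolv handles complex constraint systems; the core idea can be sketched naively by picking the highest available version compatible with a requested major version. The „MAJOR.*“ constraint syntax is an assumption of this sketch:

```python
def resolve(available: list, constraint: str):
    """Pick the highest available 'MAJOR.MINOR.PATCH' version whose major
    component matches a 'MAJOR.*' constraint; return None if none matches.

    A much-simplified stand-in for a real resolver such as libsolv.
    """
    wanted_major = int(constraint.split(".")[0])
    candidates = [v for v in available
                  if int(v.split(".")[0]) == wanted_major]
    if not candidates:
        return None
    # Compare numeric components tuple-wise to find the highest version.
    return max(candidates, key=lambda v: tuple(int(x) for x in v.split(".")))

print(resolve(["1.0.0", "1.2.0", "2.0.0"], "1.*"))  # 1.2.0
```

Restricting to the same major version mirrors the semantic-versioning promise that only minor and patch changes are safe to take automatically.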

Unfortunately, such a build tool for AI model artifacts does not exist at the moment. Nevertheless, an important prerequisite is to start versioning at all, and there are tools such as Data Version Control (DVC) that can be used to version all AI model artifacts as presented here.

AI Model Artifact Repository

Often enterprises host their own repository for security, reliability and compliance requirements, i.e. they make sure that only trusted artifacts are used and that the model can be reproduced in a reliable manner independent of whether the original repository is currently reachable or not.

Such an AI model artifact repository does not exist at the moment. While there are some AI model repositories, they usually cover only a few, but not all, of the artifacts explained here. Furthermore, they do not store them very efficiently, as AI model artifacts are very different in nature, e.g. from binary to text and from very small to very large.

For this reason, the tool DVC described before in fact supports different storage backends, depending on the type of AI model artifact. For example, large binary models can be stored on an object storage, while source code is stored in a source code version management system, such as Git.

Finally, an AI model artifact repository can provide a uniform resource identifier (URI), making it possible to identify an AI model artifact uniquely. Instead of a version, one may also specify „latest“ to always use the latest version, but this should be done with even more care than for software artifacts, as it can imply difficult-to-anticipate shifts in the behaviour of the AI application. For example, a security software based on AI may no longer be able to identify security issues because of even minor changes to the AI model artifacts.

Examples

I will present here some examples for various types of AI algorithms. In reality, there can also be combinations of those types. It is not always straightforward, and there are, given the nature of AI, sometimes debatable grey areas regarding which version to increase. Nevertheless, in those cases one should keep in mind the purpose of versioning as described at the beginning of this post.

Supervised Learning

Supervised learning essentially learns from example data inputs and outputs. During inference it is presented with new inputs for which it should provide the correct outputs. A very common use case is the one of classification. For example, given a product description, an AI model predicts the product category.

Let us assume now we have an initial version „product-classifier-1.0.0“.

We find out that for very few categories we had very few product examples, leading to poor accuracy for those categories. Hence, we add more product examples for those problematic categories. The underlying data changes, but we only add more product examples, so that only the patch version of the data changes, from „product-data-1.0.0“ to „product-data-1.0.1“. The only metric we have, „overall accuracy“ (by the way, not a good one for this specific case!), changes from 90.1% to 93%. This means the final state increases its minor version from „product-finalstate-1.0.0“ to „product-finalstate-1.1.0“.

In total, we thus need to increase the version of the complete AI model to „product-classifier-1.1.0“.
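The bookkeeping of this example can be replayed in a short sketch; the bump helper is an illustrative assumption of this post:

```python
def bump_version(version: str, level: str) -> str:
    """Increment the given component of a MAJOR.MINOR.PATCH string,
    resetting the lower components as in semantic versioning."""
    major, minor, patch = (int(p) for p in version.split("."))
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"

# data: patch bump (observations added, assumed same distribution)
print(bump_version("1.0.0", "patch"))  # 1.0.1 -> product-data-1.0.1
# final state: minor bump (accuracy 90.1% -> 93%, a 2.9 point change)
# -> the combined model inherits the most severe bump, i.e. minor:
print(bump_version("1.0.0", "minor"))  # 1.1.0 -> product-classifier-1.1.0
```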

Unsupervised Learning

Unsupervised learning does not have training data; instead it discovers patterns in the data it is applied to.

Let us assume in this example we have a clustering algorithm that clusters users into personas that watch certain types of movies.

The initial version of the clustering algorithm is „movie-cluster-1.0.0“. We assess the algorithm based on the Dunn index. The mean distance per cluster is 5. We are not happy with the clustering and thus change the number of clusters from 4 to 8. This means the initial state changes its minor version, e.g. from „movie-initialstate-1.0.0“ to „movie-initialstate-1.1.0“. The change led to an improvement of our evaluation metric by 5%. This means the final state changes its major version, i.e. from „movie-finalstate-1.0.0“ to „movie-finalstate-2.0.0“.

This means the AI model changes its major version from „movie-cluster-1.0.0“ to „movie-cluster-2.0.0“.

Note: unsupervised learning does not have training data per se, but it is applied to data that changes over time. In particular, evaluation metrics as part of the final state must in all cases use some data that needs to be versioned; in this case the data is only used for testing and/or direct use, but not for training.

Reinforcement Learning

Reinforcement learning is about „intelligent“ agents that take actions in a (usually simulated) environment to maximize their reward function. Similarly to unsupervised learning, we do not need training data. However, a simulated environment is needed, where the actions and their reward are usually based on one or more externally defined probability distributions used in the simulation.

Let us assume for simplicity that the evaluation metric is the maximum reward. The goal for our use case is for a virtual car to finish a race track as fast as possible (inspired by the AWS DeepRacer).

We have an initial version of our AI model: „racer-rl-1.0.0“. We are not happy with the finish time of our virtual car and thus change the reward function. The reward function is a rule-based software artifact. We change it and include several additional new ways of calculating reward. The software therefore increases its minor version from „racer-software-1.0.0“ to „racer-software-1.1.0“. The maximum reward improves only marginally, by 0.5%. Thus, the final state changes its patch version, i.e. „racer-finalstate-1.0.0“ becomes „racer-finalstate-1.0.1“.

Thus the complete AI model changes its minor version from „racer-rl-1.0.0“ to „racer-rl-1.1.0“.

Forecasting/Time series

Forecasting is about making predictions based on past/present data. For example, stock market predictions.

Let us assume for this case that we want to make predictions for the STOXX Europe 600 index. Our evaluation metric in this example is the mean squared error (MSE).

We have an initial model version „stoxx600-forecasting-1.0.0“. We are not happy with our evaluation metric and thus change the underlying data to a higher frequency (daily instead of weekly). The data changes its major version from „stoxx600-data-1.0.0“ to „stoxx600-data-2.0.0“: the underlying data may not have changed its distribution, but the semantic meaning of the data changes, from weekly to daily. Hence, a major version change is required. Since the software was cleverly implemented, it did not require a change.

The MSE improves by 2%, and thus the final state changes its minor version from „stoxx600-finalstate-1.0.0“ to „stoxx600-finalstate-1.1.0“.

The overall version of the AI model artifact changes its major version from „stoxx600-forecasting-1.0.0“ to „stoxx600-forecasting-2.0.0“.

Symbolic Artificial Intelligence

Symbolic Artificial Intelligence usually employs a set of probabilistic and/or logic rules to make decisions.

Let us assume for our example that we use an AI model for finding medical treatments based on certain conditions of the patient. The model is versioned „treatment-logic-1.0.0“.

We adapt several existing rules so that the same set of treatments can be determined, but from additional inputs, i.e. there is no semantic concept change and no rules are added or removed; only existing rules are adapted. The model therefore changes its minor version from „treatment-model-1.0.0“ to „treatment-model-1.1.0“.

This aspect requires careful assessment. For instance, if those additional inputs led to additional treatments, or if existing recommendations for an unchanged set of treatments changed, i.e. the semantic concepts changed, then the major version would have to be increased.

Nevertheless, since this is not the case, the AI model changes its minor version to „treatment-logic-1.1.0“.

Conclusions

Others try to implement some versioning, especially in AI model artifact registries (e.g. the NVIDIA Model Registry or the HuggingFace Model Hub), but the differences between versions are unclear and sometimes not even consistent. Thus, one will often need to change to a newer version, but cannot assess the impact directly and has to implement an expensive ModelOps process in production, where the impact is assessed over months with different user groups. Nevertheless, those registries already demonstrate the value of having some type of versioning.

With semantic versioning for AI model artifacts, one gets a version that is certain and does not change by itself, i.e. one can refer to it in software, use it, and the behaviour is manageable. If the version is correctly generated, it can be reproduced independently, provided all aspects of the version are defined. The impact depends on human judgement: it is currently impossible to assess objectively, purely based on metrics, what the impact is; it is even very difficult to assess whether it is small or large, because this depends on the context in which the AI model artifacts are used.

Finally, there is still a lot of research to be done on model versioning, not only conceptually, but also empirically with AI model artifacts used in different contexts in a production-grade service serving many users. This will require a lot of effort and investment, but it is needed to bring AI to the next level in enterprises. Based on this future research, I expect further improvements to AI model artifact versioning to leverage the benefits of versioning. Additionally, I hope that eventually an AI model artifact build tool will be available that can be used to implement the approach described here and enable reproducible AI model artifacts.

