I will address in this post the issue of maintaining large pretrained embeddings within Artificial Intelligence (AI) services. While this issue has some links to ethical aspects (see for example the European Commission’s guidelines on trustworthy AI or here), the focus here is on the maintainability of those embeddings as part of MLOps. Software maintenance is very relevant for enterprise services for reasons of business continuity. Hence, pre-trained embeddings that are part of enterprise AI services must be included in maintenance. The main objective of this post is to trigger thinking about how pretrained embeddings affect the maintenance of AI services.
Pretrained embeddings became very popular around 2012/2013, especially text embeddings such as Word2Vec. In fact, new embeddings (e.g. variants of BERT, XLNet, GPT-3, T5 etc.) are published nearly weekly by large IT players and others. They are very popular because they are generic and can be fine-tuned on a very specific dataset, usually requiring only a few training examples instead of thousands to perform a specific task. Thus they save a lot of manual labour and cost when using AI in any context. Furthermore, they are extremely easy to use: they are usually downloaded in the background and only a few lines of code are needed to develop an AI task.
However, in the context of AI services that are supposed to be reliable, secure, maintainable and trustworthy, the question of maintenance arises very quickly. Maintenance is challenging in several ways:
- The underlying data of those embeddings is usually not curated. A large amount of data is needed to train them, so curation is too costly, even for the largest IT firms. Thus they must be updated if questions arise about their trustworthiness, especially in the context of bias. Furthermore, the „library of babel“ effect may come into play.
- Pre-trained embeddings are subject to data privacy regulations, such as the General Data Protection Regulation (GDPR). The rights of data subjects to erasure, data portability, rectification etc. pose challenges for pre-trained embeddings, as they are based on public data that contains private data. Hence, they require frequent maintenance.
- As mentioned, new types of models underlying the embeddings, or new ways of training them, appear nearly weekly. However, at some point an AI engineer needs to decide on a given embedding. The organisations behind those pretrained embeddings usually have little interest in maintaining an embedding once a newer, more „modern“ type is available. Yet older embeddings deserve serious consideration, as they are usually more computationally efficient with little to no difference in accuracy.
I will analyse in this blog post whether large pretrained embeddings are maintained and how frequently they are updated. Furthermore, I will discuss what good update frequencies for embeddings in trustworthy AI services might be. Keep in mind that this is not about criticising existing large pretrained embeddings, but merely about triggering more research in this area to enable trustworthy AI services. Additionally, several of the issues raised here also apply to trained models in repositories of pre-trained models (also called „model zoos“).
What are embeddings?
Embeddings in Machine Learning / Natural Language Processing are used in AI pipelines to significantly reduce the dimensionality of input data in order to improve training results. The original data is mapped to a compact form, similar to embeddings in the mathematical sense. They are one solution to the curse of dimensionality. This reduction should preserve the relevant information (especially semantics and concepts), while dropping all irrelevant information („noise“) that is not needed for downstream tasks.
Embeddings are most commonly available for natural language processing tasks, but also exist for other media types, such as images. Additionally, machine learning tasks on graphs can be made feasible using graph embeddings.
As you may note, in order to perform a reduction while preserving the most relevant information, an embedding needs to be trained to do so. This is very similar to training a machine learning model. However, the objective of the training is quite ambiguous here: how would you tell a machine learning model what the relevant information in a picture or a text is? Many details and special cases would need to be labelled. Luckily, it is possible to „easily“ apply a semi-supervised approach (often called self-supervised) by using existing texts and images. For example, you can get billions of images from Flickr, and Wikipedia had more than 50 million articles across all its language editions in 2020. There are various ways to use semi-supervised learning on such data. A simple approach is to mask selected parts of the image or the text and predict what is behind the masked part. In that way the embedding learns some inherent statistical relations in the provided dataset.
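To make the masking idea more concrete, here is a minimal sketch of how training examples could be derived from plain text without any manual labelling. The function name `make_masked_example`, the `[MASK]` placeholder and the masking ratio are assumptions chosen for illustration; real embeddings use more elaborate tokenisation and masking rules.

```python
import random

MASK = "[MASK]"  # placeholder token, chosen here for illustration


def make_masked_example(tokens, mask_ratio=0.15, seed=0):
    """Hide a random subset of tokens.

    Returns (masked_tokens, targets), where targets maps each masked
    position to the original token the model would have to predict.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    # Mask at least one token so every sentence yields a training signal.
    n_to_mask = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_to_mask))
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK
    return masked, targets


sentence = "the bank raised its interest rate last week".split()
masked, targets = make_masked_example(sentence, mask_ratio=0.25)
print(masked)   # the sentence with two tokens replaced by [MASK]
print(targets)  # the hidden tokens, i.e. the labels derived from the text itself
```

The labels come from the text itself, which is why such an approach scales to very large uncurated corpora, and also why the embedding inherits whatever the corpus contains.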
There are the following things to mention here:
- A pretrained embedding, regardless of how big it is or which contemporary deep learning techniques you use, can only discover statistically significant relations within the data. It is most likely unable to understand concepts or infer new concepts not explicitly mentioned in the data.
- A pretrained embedding depends heavily on its input data and needs a lot of it. However, more data does not necessarily make the embedding better or more universal. Unfortunately, most of our currently available data is based on different, potentially conflicting social constructs. That often implies that it is imperfect, biased, wrong, contradictory and likely to change over time as society changes. I will explain more on this below.
- Pretrained embeddings are difficult to reproduce without a costly and risky investment of resources. Usually, one needs to train them multiple times to get a good one. Additionally, there exists no good evaluation metric aside from perplexity, and evaluation is done using different graph, NLP or image tasks which may or may not be of practical relevance, i.e. a company using the pretrained embedding cannot be sure how well it will work.
Theoretically, any semi-supervised approach can be used for developing a pretrained embedding. In contrast to a model, a pretrained embedding can be used as part of various models addressing different tasks. For example, an embedding trained with the above-mentioned masking approach for text can be used for solving question answering, classification or natural language inference tasks. This is referred to as transfer learning. For example, the BERT language embeddings are trained using a masking approach and next sentence prediction.
Find here a table of some existing algorithms for embeddings:
|Bag-of-words / Skip-gram
|Bag-of-words is one of the oldest types of embeddings and is still used in many NLP applications today due to its speed and predictable behaviour. Skip-gram is similar, but usually less efficient. Both try to predict the context of a given word. As early as the 1950s, linguists described an approach called bag of words to describe the structure of language.
|Bidirectional Encoder Representations from Transformers (BERT) has led to many different subtypes, usually based on different underlying data and training concepts. It leverages the transformer architecture based on several attention heads that take into account multiple contexts of the same word at the same time (e.g. words nearby and object/verb relations). The underlying training problem of BERT is based on predicting masked words and next sentences. In its original version BERT was trained on BooksCorpus (800M words) and Wikipedia (2500M words). Lists, tables and headers were ignored for the latter. It is not described which version of Wikipedia was used, but since the initial submission of the paper was in October 2018, I assume it must have been from before that date. Training and prediction using BERT are usually computationally expensive. BERT also provides a pretrained classification task using the embedding that can be fine-tuned.
|graph (network of relations between financial entities)
|A graph embedding reduces patterns in a domain-specific graph into a vector embedding. For example, DeepTrax focuses on relations between financial entities in graphs. Common patterns between financial entities are encoded into a lower-dimensional vector space. This can be used for various downstream tasks, such as detecting fraudulent transactions or anomalies. It is not clear what data has been used, but since the authors of the paper work for a commercial bank, one can assume that it is an internal transaction database.
|ELMo (Embeddings from Language Models) represents words through their context (i.e. the words around them) by employing two LSTM models which learn the representation to the left and to the right of the word respectively. It learns by trying to correctly predict the previous tokens and the following tokens related to a given token. The original embedding is trained on a corpus of 30 million sentences from the One Billion Word Benchmark from 2013. Training and prediction using ELMo are usually computationally expensive.
|images / faces
|This is one of the few image embeddings focusing on faces. The original embedding minimizes the distance between faces of the same identity and at the same time maximizes the distance between faces of different identities. Faces are represented as vectors of this embedding and can be matched by employing nearest-neighbour search or similar.
|Fasttext is an embedding that can potentially also be used for NLP tasks involving words not in the corpora used for training the embedding. It learns by representing words as character n-grams and by predicting the context of those words using positive and negative examples (the latter are randomly chosen from the dictionary). As the name indicates, fasttext is usually fast for NLP tasks. It has been extended to a multilingual setting by providing the MUSE embeddings.
|Flair is a framework supporting multiple different text embeddings (including those mentioned here). Furthermore, it is an embedding in its own right. This embedding is basically learned by employing two models: one that predicts the previous characters of a given character and one that predicts the next characters. A word is represented as a combination of those two models. Furthermore, a traditional word embedding may be integrated in addition.
|GPT-3 is a large embedding and model pretrained for various NLP tasks. NLP tasks are performed by providing an initial text and letting the model complete the rest. Example:
„translate English to German: how are you doing?“ and GPT is expected to output „Wie geht es Dir?“.
GPT-3 also leverages the transformer architecture. The approach for training is similar to its predecessors: predicting the next word from the preceding ones. As the Github page states „GPT-3 was trained on arbitrary data from the web, so may contain offensive content and language“, the content has apparently not been curated. The training data comprises Common Crawl from 2016 to 2019, WebText, two corpora with different books and Wikipedia. GPT-3 is only available behind an API, i.e. not the model itself, but only its predictions.
|A graph embedding reduces patterns in a domain-specific graph into a vector embedding. Node2Vec learns typical neighbourhoods in a graph for the embedding. This can then be used to predict connections between nodes in a graph or for node classification tasks. Node2Vec does not provide a pretrained embedding itself, but the approach can be used to create embeddings for domain-specific graphs (e.g. biomedical embeddings). Node2Vec can also be applied to natural language processing tasks (e.g. as knowledge graphs or for language-context modelling).
|T5 is similar to GPT-3 and is an approach to explore the limitations of transformers. It is a research project that is designed to be reproducible (in theory). CommonCrawl is used, but cleaned according to rules defined by the authors, resulting in 750 GB of text. Beyond that, the text is not specifically curated. While T5 is documented, the costs of reproducing it are very high. T5 is very similar to BERT, but it is pretrained on much more and different data, and the model based on the embedding is trained on various NLP tasks.
|Word2Vec represents words as composable vectors, i.e. it is possible to do linear operations on them if the text used for pre-training contains sufficiently many examples. For example, V(„Germany“)+V(„capital“) results in a vector close to V(„Berlin“). Of course, this is an idealized situation and does not necessarily hold for all operations of this kind, for various reasons. Other embeddings can exhibit this property as well. Word2Vec is relatively fast to use.
|XLNet is, similar to BERT, based on a transformer architecture. However, it is an autoregressive model and not an autoencoding one, and the training objective differs: instead of masking certain words in the sentence, XLNet considers all possible permutations of a sentence and predicts the most likely one (* with some optimisations). The reason is that masking is seen as artificial, as it does not occur in downstream tasks. Predicting the next sentence was not included as a training objective as it did not lead to improvements for XLNet. Similarly to BERT, it was trained on BooksCorpus and Wikipedia. Additionally, it includes Giga5, ClueWeb 2012-B and CommonCrawl. In total, 32.89B subword tokens were used for pretraining the embedding.
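The vector composition mentioned for Word2Vec in the table above can be illustrated with a toy example. The three-dimensional vectors below are written by hand purely for illustration and are an assumption of this sketch; real Word2Vec vectors have hundreds of dimensions, are learned from text and behave far less cleanly.

```python
import math

# Hand-made toy vectors (NOT taken from any real embedding).
TOY_VECTORS = {
    "Germany": [1.0, 0.0, 0.0],
    "France":  [0.0, 1.0, 0.0],
    "capital": [0.0, 0.0, 1.0],
    "Berlin":  [0.9, 0.0, 0.9],
    "Paris":   [0.0, 0.9, 0.9],
}


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest(query, vectors, exclude=()):
    """Return the word whose vector is most similar to the query vector."""
    candidates = {w: v for w, v in vectors.items() if w not in exclude}
    return max(candidates, key=lambda w: cosine(query, candidates[w]))


# V("Germany") + V("capital"), then look up the closest remaining word.
query = [g + c for g, c in zip(TOY_VECTORS["Germany"], TOY_VECTORS["capital"])]
print(nearest(query, TOY_VECTORS, exclude=("Germany", "capital")))
```

The input words are excluded from the search, as is common practice for such analogy queries; otherwise the query vector tends to be closest to one of its own summands.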
I do not describe here pretrained models based on those algorithms as this will be investigated later.
Furthermore, there are nowadays many different variants of those algorithms, differing especially in training data and learning approaches, but also in the type of data (e.g. word vs sentence vs paragraph or single-modal vs multi-modal). Since they are computationally very expensive, there are various efforts to reduce their size, which is usually described as compression or distillation.
Some embeddings are simpler or older, but one does not necessarily need the latest embeddings for the best results. It really depends on your requirements and how you engineer AI around them. Simpler models usually require much less computational resources and development effort. Thus, they are more cost-efficient, especially for large-scale datasets. Furthermore, they can be more robust and more predictable for humans to manage. Additionally, research usually stops at benchmarking specific KPIs, such as accuracy on a test dataset, for a specific open dataset at a specific point in time. AI services in production do not have those characteristics, and a model only becomes predictable by taking into account all functional and non-functional requirements as well as its behaviour over time on real data. Hence, a good AI service usually has multiple models of different types competing with each other or joining forces.
What is the current state of embeddings?
It is currently an open research question whether the type of approach used for learning the pretrained embedding has an impact on the success in solving downstream tasks. This is also inherently difficult to prove, and most deep learning models are benchmarked in research against open datasets using performance indicators that have nothing to do with real-life AI services in industry. Results for the latter are usually not published in a transparent, reproducible manner.
One limitation at the moment is that embeddings can only be used for one type of data. For example, an embedding for images can mostly be used for image-related AI tasks and an embedding for text for text-based tasks. There are also combined embeddings, but they can mostly be used for AI tasks related to combinations, e.g. predicting the caption of an image.
At the moment, we can also say little about the preselection of data for the embedding. For example, you may choose only financial texts for the embedding if your downstream tasks are also related to finance. There seems to be evidence that this can help a bit according to „academic“ performance indicators, and it is plausible that such specialised embeddings contain less noise. For example, if you talk about a bank in a financial text, then you very likely do not refer to a bank in a park. However, this may not hold for real-life AI services in real-world organisations. For example, when evaluating the risk of a firm in a financial text, you nowadays also want to include aspects of climate change, which are mentioned only rarely in the contemporary financial texts used for training the embedding. Thus, too narrow a selection may not be good for real-world usage either, especially for handling future tasks with a different scope. Generally, all existing pretrained embeddings are based on a narrow excerpt of reality at a given point in time despite being trained on large datasets.
There are also further types of embeddings that do not require any training data, but are based on logic and/or probabilistic inference, which still requires human input. Those are also related to knowledge graphs/ontologies. They are potentially suitable for other tasks, but they are also more complex (see my previous post).
How are embeddings used?
Embeddings exist predominantly in the natural language space, but as mentioned also in other spaces.
Their main use case is to easily solve downstream AI tasks, such as image/text classification, text summarisation or natural language inference. By „easy“ I mean the following:
- Simple to develop and start with for virtually everyone with and without a computer science background
- Requiring zero to few training examples for downstream tasks: Labelling data is daunting, difficult to do in high quality and takes a lot of resources. Furthermore, in a production-grade AI service, labelling needs to take place on a continuous basis to validate that the model still works and to collect training samples for automated retraining of the model. Thus, this effort needs to be reduced significantly.
Note: I mention those aspects in relation to the modelling part. For a real production-grade AI service many more aspects come into play, as I will explain later.
Finding production-grade AI services that use recent large-scale embeddings is difficult, especially ones that use them reliably.
Google claims, for example, to use BERT in the processing of search queries, but at the same time the results are ranked by RankBrain. Details and test cases are scarce and the performance indicators for success are not communicated. Since an ensemble of different AI models is used in Google Search, it is also difficult to figure out how much any single one of them contributes. At the same time, the key performance metric for Google is to sell as much advertising as possible and to have users click on advertisements. This is not necessarily a good indicator of search result quality. In any case, it remains unclear whether BERT helped with that objective or not.
Interestingly, the Google announcement triggered a lot of feedback in the SEO (Search Engine Optimisation) industry on how to adapt content to be ranked higher than other content given the changes introduced by Google (cf. e.g. here or here).
What are the potential issues related to trustworthy AI services?
Despite the big research success of large embeddings in recent years, such as BERT or XLNet, they are not widely used in production AI services, as we have seen. While they are extremely easy to use, a huge body of literature exists and virtually everyone can rapidly write a program solving an NLP task using them, they also have several shortcomings:
- A lot of computational power is needed not only to generate them (which is not an issue for users, since they are pre-trained), but also to fine-tune them on downstream AI tasks and to use them on new input data (e.g. „prediction“).
- They are trained on pre-selected datasets that may not be relevant for a highly specialised domain (e.g. medicine) or have inherent disadvantages (cf. GPT-3, which is trained on „problematic“ content) with serious implications for the use in a production AI service (cf. e.g. here). Unfortunately, there is no good way of predicting whether a large pretrained embedding will work well in a production-grade AI service.
- Creating one's own pre-trained embedding requires a huge investment of cost and resources, which is not available to everyone and often simply not justified for the use case.
- As written before, embeddings have to be seen in the training context of one or more tasks that they should be able to perform. For example, from the embedding itself one cannot expect things such as reasoning (e.g. what do I know and what not?), understanding math (e.g. what is 1+1?) or time (what have I done yesterday?). Nevertheless, they can be used in the context of a model where some of those aspects might be addressed. The bottom line is that although they have been fed with tons of information, they cannot leverage it adequately.
- Many pretrained embeddings are based on data that is public but still subject to data privacy regulations. For example, the CommonCrawl corpus may violate the General Data Protection Regulation (GDPR) as well as other privacy regulations, and embeddings based on it are thus a risk to use, especially in an enterprise setting (cf. here, here, here or here). So far there has been no legal analysis of whether the use of existing data (e.g. Wikipedia or CommonCrawl) in models is problematic, but this data does contain private data, and from a legal perspective it does not matter whether data is public or not for it to be subject to data privacy laws.
- Providing the AI model is usually only 5% of the service; 95% of the effort is spent on AI/software engineering.
- While large-scale pre-trained embeddings look good in academic papers, the success has not been reproduced for AI services in production and very little is documented on that aspect, especially performance indicators and the reproducibility of results when using them in production. AI services in production need to be constantly monitored, and metrics cross-validating the outputs with other models and datasets need to be employed.
- Similar to a software program, pre-trained embeddings need to be maintained, because (1) they contain undesired bias, which is bad for an organisation providing or consuming AI services; (2) the world changes: every year there are millions of edits on Wikipedia, edit wars get resolved and new ones start, and new information (e.g. books, news etc.) gets published; (3) the training objective is subject to active research and will probably be adapted in the future to cater for production AI services; (4) there are legal obligations (e.g. data subjects demanding to modify or remove private data); and (5) AI models including embeddings need to be protected from attackers who try to influence the decisions of AI models to cause harm to others.
Surprisingly the last point has not been heavily investigated in the past: Are large pre-trained embeddings maintained? If so – how often are they updated? Does it really matter?
How much does the world change and what does it imply for embeddings?
I will investigate in this section how much the underlying data for pretrained embeddings changes. I will focus mostly on text-based embeddings, as many more of them are available, and a lot of pretrained ones are provided publicly and used by many services. However, the conclusions can be applied analogously to other types of embeddings.
Similarly to machine learning models in general, one big issue of pretrained embeddings is that they are created at a certain point in time based on the data available at that time. This is especially problematic for text embeddings, because new concepts appear all the time. While grammar does not change much (although this is debatable in times of microblogging and messaging), new facts and relations appear all the time. If you think about the Covid-19 pandemic, it becomes evident that an embedding from the pre-Covid era will not be very good for many NLP tasks (e.g. classification of the economic impact of news), as terms such as „lock-down“ are not represented very well. However, even more minor or subtle changes can quickly lead to poorly performing AI services.
Find here some data sources for text and their update frequencies:
|The BookCorpus contains various free books of different domains (fiction, non-fiction, fantasy etc.). BookCorpus is problematic for training embeddings, as they cannot distinguish between fiction and non-fiction. It contains various languages, but English dominates. Note: contemporary bestsellers are usually not free and thus not included.
|There is no standard Book Corpus and thus no update frequency.
|CommonCrawl crawls web pages and content of any type on a monthly basis. The exact set can be derived from the crawl. Each crawl may contain pages shared with other crawls, but also pages unique to that crawl. A website is not guaranteed to be included in the crawl. It contains various languages, but English dominates. Contains private data.
|Monthly since 2014, seasonal before that. The earliest crawl is from 2013.
|Wikipedia is a living encyclopaedia; as such it is updated virtually every second. Anyone may edit Wikipedia. In 2020 it saw more than 600 million edits. Contains various languages. Contains private data.
|Wikipedia offers complete dumps roughly twice a month.
However, recentness is also relevant for any other type of embedding:
- Graphs: Graphs represent concepts at a given point in time. For example, the aforementioned financial transaction graph embedding is based on financial entities and their relations. These change constantly as business models are created and changed. New type
- Images: Images preserve reality at a certain point in time. For example, fashion and looks change rapidly. Climate change can lead to fast changes of landscapes. Natural disasters change what a location looks like.
The bottom line is that all embeddings capture the current Zeitgeist through their data. Obviously, this can change fast, and AI services based on those embeddings are expected to move at the same speed. Thus, it is plausible to update the embeddings frequently, especially if your AI service has to deal with recent information (e.g. recent news).
Targeted disinformation in embeddings
Additionally, the data sources underlying embeddings are subject to targeted disinformation (e.g. here). While such disinformation is usually only temporarily present in the original data source, it may end up in an embedding, and one can assume that any embedding based on public data sources contains disinformation that may heavily impact your AI model even though it is small in volume. In fact, this type of attack is common and was identified long ago.
At least, one should verify that embeddings based on „old“ data can handle new aspects properly according to business objectives.
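One very simple check in this direction is to measure how many tokens of recent text are unknown to the vocabulary of an „old“ embedding. The following is a minimal sketch under strong assumptions: whitespace tokenisation, a plain set as vocabulary, and a hypothetical function name `out_of_vocabulary_report`. Embeddings based on subword units rarely have strictly unknown words, so for those a task-level evaluation on recent data would be needed instead.

```python
from collections import Counter


def out_of_vocabulary_report(documents, vocabulary, top_n=5):
    """Return the share of unknown tokens and the most frequent unknown tokens."""
    counts = Counter(
        token
        for doc in documents
        for token in doc.lower().split()
    )
    total = sum(counts.values())
    unknown = Counter({t: c for t, c in counts.items() if t not in vocabulary})
    rate = sum(unknown.values()) / total if total else 0.0
    return rate, unknown.most_common(top_n)


# Toy vocabulary standing in for an embedding created before the pandemic.
old_vocabulary = {
    "the", "city", "announced", "a", "new", "in", "shops", "closed", "during",
}
recent_news = [
    "The city announced a new lockdown",
    "Shops closed during the lockdown",
]
rate, top_unknown = out_of_vocabulary_report(recent_news, old_vocabulary)
print(f"unknown token rate: {rate:.0%}", top_unknown)
```

Tracking such a rate over time against business-relevant text gives an early, cheap signal that an embedding is drifting away from the data the AI service actually sees.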
New embedding types are published frequently
Obviously, not only the underlying data changes. The methods to train such an embedding change as well. However, one does not always need to go with the latest. While newer embeddings show improvements in accuracy on specific datasets, this cannot be generalised to the specific dataset that you fine-tune the embedding on. Furthermore, accuracy on a test dataset does not mean the AI service will achieve the same accuracy on real new data; often it is lower. Newer methods also require heavy, costly compute power, which is often not justified for a marginal improvement or makes some AI services prohibitively expensive.
Finally, changing the embedding type frequently may also lead to less transparency and trustworthiness.
What is the state of current publicly available embeddings?
I investigate in this section some existing embeddings along the dimensions of maintenance, reproducibility and computational effort. Those dimensions should not be understood as criticism: at the moment there is no good understanding of, and little empirical evidence on, the long-term large-scale application of embeddings. Furthermore, this highly depends on your business objectives/requirements, and those are currently out of scope for many research approaches.
Find here a list of pre-trained embeddings and when they were last updated (state: 21.02.2021). Note: This is about the pre-trained embeddings and not the source code/library using them.
|BERT (Google Original)
|Derived from download URLs, e.g.
|Derived from model card
|ca. 2018, data based on 2011
|The website states the embeddings were created around 2018 based on data from 2011 (WMT2011/One Billion Words Benchmark)
|Derived from model filename available under the website
|Not fully clear, some references on webpage indicate data from 2017 („Wikipedia 2017“)
|Derived from dates in download folder,
|latest state is undocumented/not transparent
original paper: training data from 2016-2019
|The original paper
|Derived from model card
|Derived from website and links to embedding files on Google Docs
|According to statement on its Github project
|Derived from model card
One interesting aspect is that there are usually no previous versions of embeddings available, either because they never existed or because they were overwritten. This also implies that the users of those embeddings must take care of backups and versioning themselves.
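A minimal sketch of what such do-it-yourself versioning could look like: every downloaded embedding file is recorded with its checksum in a local manifest, so that a silently overwritten upstream file is detected on the next download. The manifest layout and the function names are assumptions made for illustration, not an established standard.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a (possibly very large) file in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_embedding_version(embedding_path, source_url, manifest_path):
    """Append an entry to a JSON manifest if the file content has changed."""
    embedding_path = Path(embedding_path)
    manifest_path = Path(manifest_path)
    entries = json.loads(manifest_path.read_text()) if manifest_path.exists() else []
    entry = {
        "file": embedding_path.name,
        "sha256": sha256_of_file(embedding_path),
        "size_bytes": embedding_path.stat().st_size,
        "source_url": source_url,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
    # Only record a new version when the content differs from the last entry.
    if not entries or entries[-1]["sha256"] != entry["sha256"]:
        entries.append(entry)
        manifest_path.write_text(json.dumps(entries, indent=2))
    return entry
```

Calling `record_embedding_version("vectors.bin", "https://example.org/vectors.bin", "manifest.json")` after each download (file name and URL are placeholders) keeps a history of which exact file an AI service was built on. The file itself still needs to be archived separately, since the manifest only stores metadata.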
Discussion and way forward?
Embeddings are not updated
We have seen that there are several pre-trained embeddings, and we expect more in the future, because they have proven useful.
I argued here that they need to be updated regularly, because they represent contemporary social constructs, which change rapidly: think about news or new world-wide developments, such as climate change or digitalisation. Furthermore, there is another driver: security and constant disinformation. While disinformation is only marginally present in the data sources, we have to assume that any embedding will always contain some disinformation. It may be very targeted and can thus heavily influence your model based on the embedding.
Nevertheless, we have also seen that pre-trained embeddings are not updated regularly or at all. While it is still an open research question how often a pre-trained embedding and the model based on it should be updated, we also see that the underlying data sources (e.g. Wikipedia) change frequently. Hence, one may orient oneself on typical IT security update frequencies of one month; when a big issue related to misinformation is detected in the data, even an immediate update and retraining of the model based on it is needed. Of course, metrics need to be researched to determine whether the new version of the pre-trained embedding behaves similarly to the old version.
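As an illustration of what such a metric might look like, the following sketch compares two versions of an embedding by the overlap of each word's nearest neighbours. This is one possible heuristic and an assumption of this post, not an established standard; the toy vectors are hand-made. Comparing neighbourhoods instead of raw vectors has the advantage that it does not matter if the new version is rotated or otherwise arranged differently in vector space.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k_neighbours(word, vectors, k):
    """Return the k words most similar to the given word."""
    others = [w for w in vectors if w != word]
    others.sort(key=lambda w: cosine(vectors[word], vectors[w]), reverse=True)
    return set(others[:k])


def neighbour_overlap(old, new, k=1):
    """Mean Jaccard overlap of top-k neighbours over the shared vocabulary."""
    shared = sorted(set(old) & set(new))
    if len(shared) < 2:
        raise ValueError("need at least two shared words to compare")
    old_shared = {w: old[w] for w in shared}
    new_shared = {w: new[w] for w in shared}
    scores = []
    for word in shared:
        a = top_k_neighbours(word, old_shared, k)
        b = top_k_neighbours(word, new_shared, k)
        scores.append(len(a & b) / len(a | b))
    return sum(scores) / len(scores)


# Hand-made toy versions: in the new one, "bank" moved from finance to nature.
old_version = {"bank": [1.0, 0.0], "loan": [0.9, 0.1],
               "river": [0.0, 1.0], "park": [0.1, 0.9]}
new_version = {"bank": [0.0, 1.0], "loan": [0.9, 0.1],
               "river": [0.1, 0.9], "park": [1.0, 0.0]}

print(neighbour_overlap(old_version, old_version))  # identical versions
print(neighbour_overlap(old_version, new_version))  # changed neighbourhoods
```

A low overlap does not say whether the new version is better or worse, only that downstream models should be re-validated before switching.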
Trends are followed and approaches are not maintained
More concerning is that „older“ modelling approaches are more likely not to be updated at all, even though the approach is still very useful (e.g. much faster prediction than more complex newer models). This means one is forced to change to a new embedding approach, often with completely different underlying data sources, which makes reliable operation of an AI service very challenging.
Costs of maintaining embeddings are prohibitive
Obviously, maintaining pre-trained embeddings is cost- and resource-intensive. Many organisations cannot even afford, or make the business case for, training their own embedding, so they rely on existing ones. Often the existing ones are created by researchers and do not follow AI engineering principles for production AI services. Usually, maintenance is not done by simply re-running the training pipeline: new data needs to be taken into account and several runs need to be executed, e.g. due to errors or low quality (e.g. high perplexity).
Data from far in the past may be needed
Another open question is how much data from the past is needed, e.g. 1 year, 2 years or 10 years. Obviously, it depends on the task to solve, but since embeddings should be generic, it can be useful to use as much of the past as possible. Potentially, models will need to recognize when the data used for prediction was created in order to use the correct part of the pretrained embedding.
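One conceivable way to approach this, sketched below under loud assumptions, is to keep several dated snapshots of an embedding and select one based on the creation date of the input data. The snapshot names and dates are hypothetical, and the selection rule (the first snapshot created on or after the document date, so that it has seen the vocabulary of that time) is just one possible design choice.

```python
from bisect import bisect_left
from datetime import date

# Hypothetical registry of embedding snapshots, sorted by creation date.
SNAPSHOTS = [
    (date(2018, 10, 1), "embedding-2018-10"),
    (date(2019, 12, 1), "embedding-2019-12"),
    (date(2021, 2, 1), "embedding-2021-02"),
]


def snapshot_for(document_date, snapshots=SNAPSHOTS):
    """Pick the first snapshot created on or after the document date.

    Falls back to the newest snapshot for documents that are more recent
    than every available snapshot.
    """
    dates = [created for created, _ in snapshots]
    index = bisect_left(dates, document_date)
    index = min(index, len(snapshots) - 1)
    return snapshots[index][1]


print(snapshot_for(date(2020, 4, 1)))
```

Whether such routing actually helps is exactly the kind of question that would need empirical research; it also multiplies the maintenance effort, since every snapshot has to be kept and validated.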
Private data needs to be changed or deleted frequently
According to many data privacy regulations, such as the GDPR, data subjects have the right to change or delete private information about themselves, even if it has been published. Thus, they may for example request that a website which is part of the CommonCrawl dataset removes their data. This data then obviously needs to be removed from any pre-trained embedding based on CommonCrawl as well.
Research on maintenance requirements for pre-trained embeddings is needed
I hope that in the future more AI researchers and practitioners will pick up the research questions mentioned here and come up with reliable data and facts on this topic from both the practical and the theoretical side. Trustworthy AI is not only a vision, it must be realised.