Machine Learning and Artificial Intelligence (ML/AI) are now at the center of a digital world that is increasingly data- and analytics-driven. However, technological developments bring not only opportunities but also new risks and challenges to the security posture of applications and systems.
Adversarial ML/AI methods and techniques have grown rapidly in recent years, and the field remains a very active area of research [1]. Traditionally, ML/AI systems have been built to optimize inference performance, with security a secondary concern, if it was considered at all. The literature documents many attacks in which the data, the learning models, or the system setup are manipulated. An adversary may force your system to learn the wrong pattern, change its output, or leak its data or structure, including the model and its parameters.
How do you build resilient models and data pipelines?
The answer is to be aware of potential risks at each step as the system is created, from raw data collection to model deployment and use. This is the most sensible approach because the existence and pervasiveness of adversarial ML/AI are still not well understood and remain an active area of research. At IriusRisk, we have opted for the ‘generic ML/AI system’ approach used in the BIML framework [2], which represents typical ML/AI data pipelines and components.
Every ML/AI system goes through the following steps, all of which have their own security and privacy issues, in addition to the typical performance and operational issues:
1. Raw data collection
2. Data pre-processing
3. Data preparation for training, test, and validation
4. Learning algorithm selection
5. Performance evaluation
6. Preparing a deployment environment
7. Deploying the trained model
8. Receiving users' runtime inputs
9. Generating model outputs
Steps 1 to 3 represent the ‘data pipeline’, steps 4 and 5 represent ‘model building’, and steps 6 to 9 represent ‘model deployment and operational use’. Each of these steps and meta-steps is represented as a component in IriusRisk, which provides the related threats and countermeasures, along with references and further learning and implementation resources.
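The grouping of steps into meta-steps can be sketched in a few lines of Python. This is only an illustration of the mapping described above: the names `Phase`, `STEPS`, `phase_of`, and `steps_by_phase` are hypothetical and are not part of IriusRisk or the BIML framework.

```python
from enum import Enum


class Phase(Enum):
    """The three meta-steps named in the text (hypothetical type)."""
    DATA_PIPELINE = "data pipeline"
    MODEL_BUILDING = "model building"
    DEPLOYMENT_AND_USE = "model deployment and operational use"


# The nine steps, in the order they are listed in the text.
STEPS = [
    "Raw data collection",
    "Data pre-processing",
    "Data preparation for training, test, and validation",
    "Learning algorithm selection",
    "Performance evaluation",
    "Preparing a deployment environment",
    "Deploying the trained model",
    "Receiving users' runtime inputs",
    "Generating model outputs",
]


def phase_of(step_number: int) -> Phase:
    """Return the meta-step for a 1-based step number."""
    if not 1 <= step_number <= len(STEPS):
        raise ValueError(f"step must be between 1 and {len(STEPS)}")
    if step_number <= 3:
        return Phase.DATA_PIPELINE
    if step_number <= 5:
        return Phase.MODEL_BUILDING
    return Phase.DEPLOYMENT_AND_USE


def steps_by_phase() -> dict:
    """Group step names under their meta-step, preserving step order."""
    grouped = {phase: [] for phase in Phase}
    for number, name in enumerate(STEPS, start=1):
        grouped[phase_of(number)].append(name)
    return grouped
```

Calling `steps_by_phase()` returns three groups of three, two, and four steps, which is a convenient starting point for attaching threats and countermeasures to each step when reviewing a pipeline.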
This content helps data practitioners and engineers, as well as security professionals, understand how adversaries target different parts of the ML/AI lifecycle and how this might affect broader systems. Data collection, for instance, is subject to manipulation, usually termed poisoning. Poisoning attacks, carried out through the training process or through input manipulation after deployment, cause a model to learn a behavior that an adversary can exploit later, effectively corrupting the model. This is akin to planting a backdoor in software or a network that can be triggered at any time.
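To make poisoning concrete, here is a minimal sketch, assuming a toy one-feature dataset and a nearest-centroid classifier written for this example. The data, labels, and function names are hypothetical. Real attacks and real models are far more complex, but the mechanism is the same: a few mislabeled records injected at collection time shift what the model learns.

```python
from statistics import mean


def train_centroids(samples):
    """Fit a nearest-centroid classifier: one mean feature value per label."""
    by_label = {}
    for value, label in samples:
        by_label.setdefault(label, []).append(value)
    return {label: mean(values) for label, values in by_label.items()}


def predict(centroids, value):
    """Return the label whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


# Hypothetical data: small values are "benign", large values are "malicious".
clean = [
    (1.0, "benign"), (2.0, "benign"), (3.0, "benign"),
    (8.0, "malicious"), (9.0, "malicious"), (10.0, "malicious"),
]

# The adversary injects mislabeled records during data collection.
poison = [(9.0, "benign"), (10.0, "benign"), (11.0, "benign")]

clean_model = train_centroids(clean)
poisoned_model = train_centroids(clean + poison)

# The same input is classified differently once the training set is poisoned.
probe = 7.0
clean_prediction = predict(clean_model, probe)        # "malicious"
poisoned_prediction = predict(poisoned_model, probe)  # "benign"
```

Three injected records are enough to drag the "benign" centroid from 2.0 to 6.0, so an input the clean model flags as malicious passes as benign. This is the kind of behavior an adversary could exploit later.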
As more organizations embrace a data-driven culture, ever larger amounts of data are being collected and used in modeling. Large language models, for instance, which have been very much in the news lately, are trained on several dozen terabytes of text data. Inevitably, data at this scale becomes a very attractive target for cybercriminals. Data and algorithmic leakage might allow adversaries to uncover aspects of the model or the dataset that you did not intend to reveal. Consequently, privacy, alongside security, will be one of the main concerns and challenges of the data-driven world going forward. With this content, we aim to help practitioners navigate these issues and take special care in designing and deploying their systems and controls.
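As an illustration of data leakage, the sketch below shows how a model that memorizes its training records can reveal whether a given record was in the training set. It is a deliberately simplified toy under stated assumptions: the records, the confidence score, and the threshold are all hypothetical, and real membership inference attacks are statistical and far less clear-cut.

```python
def train_memorizing_model(records):
    """A deliberately overfit 'model': it simply stores its training records."""
    return list(records)


def confidence(model, query):
    """Score in (0, 1]; it reaches 1.0 only when the query matches a stored record."""
    nearest = min(
        sum((a - b) ** 2 for a, b in zip(record, query)) ** 0.5
        for record in model
    )
    return 1.0 / (1.0 + nearest)


def infer_membership(model, query, threshold=0.99):
    """Attacker's guess: very high confidence suggests the record was used in training."""
    return confidence(model, query) >= threshold


# Hypothetical (age, salary) records used for training.
training_records = [(34.0, 52000.0), (45.0, 61000.0), (29.0, 48000.0)]
model = train_memorizing_model(training_records)

# The attacker only needs query access to the confidence score.
in_training = infer_membership(model, (45.0, 61000.0))      # True
not_in_training = infer_membership(model, (45.0, 60000.0))  # False
```

The attacker never sees the dataset itself, yet the model's output is enough to confirm that a specific person's record was part of it. Limiting the precision of the scores a model exposes is one commonly discussed way to reduce this kind of signal.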
An example threat model
Modeling is used across thousands of systems and domains to better understand and improve operations. Data-driven decision-making, for instance, is increasingly used in finance. The following example shows an algorithmic trading system that uses an ML model to evaluate scenarios and support trade execution. The model learns market patterns and behavior from historical trading data in order to make trading suggestions. Such models are typically ensembles that combine a number of algorithms. The outputs are then used in a simulator environment to replicate market trading based on model predictions and to guide final trading decisions.
The threat model in Figure 1 shows the components of the end-to-end system, including the data pipeline and model creation, the deployment setup, the simulator environment, and the external dependencies and components. IriusRisk generates relevant threats and countermeasures for each component in the system, including all the steps involved in training and generating a model. Figure 2 shows a few examples of threats and countermeasures applicable to the data pipeline.
Note that this content does not delve into the specifics of the hundreds of existing algorithms. Instead, it provides a generic baseline and the set of current risks to consider, with a particular effort to give both data and security practitioners guidelines, implementation examples, and tools drawn from the literature, the industry, and the broader community.
Figure 1. An example threat model using the ML/AI library.
Figure 2. Example threats and countermeasures applicable to the data pipeline.
Looking for more?
Take a look at our dedicated Machine Learning and Artificial Intelligence landing page for the most up-to-date news and information.