Machine Learning and Security

Chairs

Prof. Mauro Conti

University of Padova, Italy
Director of the 2019 International Summer Schools

Welcome Message: Introducing the School and the SPRITZ Security and Privacy Research Group together with its activities in the area of machine learning and security.
Final Remarks

Prof. Gabriele Tolomei

Sapienza University of Rome, Italy
Executive Director of the Machine Learning and Security School

Welcome Message
Session Chairing
Final Remarks

Speakers

Dr. Lamberto Ballan

University of Padova, Italy

Talk title: Deep Learning Under the Hood: Algorithmic Bias and Fairness

Abstract: Many are claiming that data is the "new oil", and the machine-learning paradigm is changing how we address many different problems. Self-driving cars, robot caregivers and chatbot platforms are really happening, while they were only popular sci-fi topics until a few years ago. Deep Neural Network architectures – trained on very large datasets (e.g. ImageNet) on fast dedicated hardware (e.g. GPUs) – are the most popular approach and the main reason why machines can now recognize objects in an image and translate speech in real time. However, despite the impressive achievements of these technologies, big data and AI are often used as, and referred to as, “black boxes”. The aim of this lecture is to give a gentle introduction to the key concepts that have led to the recent success of these techniques, and also to highlight the main challenges, open problems and weaknesses of deep-learning-based systems.

Dr. Battista Biggio

University of Cagliari, Italy

Talk title: Wild Patterns: Ten Years after the Rise of Adversarial Machine Learning

Abstract: Data-driven AI and machine-learning technologies have become pervasive, and are even able to outperform humans on specific tasks. However, it has been shown that they suffer from hallucinations known as adversarial examples, i.e., imperceptible, adversarial perturbations to images, text and audio that fool these systems into perceiving things that are not there. This has severely questioned their suitability for mission-critical applications, including self-driving cars and autonomous vehicles. This phenomenon is even more evident in the context of cybersecurity domains with a clearer adversarial nature, like malware and spam detection, in which data is purposely manipulated by cybercriminals to undermine the outcome of automatic analyses. As current data-driven AI and machine-learning methods have not been designed to deal with the intrinsic, adversarial nature of these problems, they exhibit specific vulnerabilities that attackers can exploit either to mislead learning or to evade detection. Identifying these vulnerabilities and analyzing the impact of the corresponding attacks on learning algorithms has thus been one of the main open issues in the research field of adversarial machine learning, along with the design of more secure and explainable learning algorithms.
In this talk, I review previous work on evasion attacks, where malicious samples are manipulated at test time to evade detection, and poisoning attacks, which can mislead learning by manipulating even a small fraction of the training data. I discuss some defense mechanisms against both attacks in the context of real-world applications, including computer vision, biometric identity recognition and computer security. Finally, I briefly discuss our ongoing work on attacks against deep-learning algorithms, and sketch some promising future research directions.

Dr. Stefano Calzavara

Ca' Foscari University of Venice, Italy

Talk title: Machine Learning for Web Vulnerability Detection

Abstract: Cross-Site Request Forgery (CSRF) is one of the oldest and simplest attacks on the Web, yet it is still effective on many websites and it can lead to severe consequences, such as economic losses and account takeovers. Unfortunately, tools and techniques proposed so far to identify CSRF vulnerabilities either need manual reviewing by human experts or assume the availability of the source code of the web application. In this paper we present Mitch, the first machine learning solution for the black-box detection of CSRF vulnerabilities. At the core of Mitch there is an automated detector of sensitive HTTP requests, i.e., requests which require protection against CSRF for security reasons. We trained the detector using supervised learning techniques on a dataset of 5,828 HTTP requests collected on popular websites, which we make available to other security researchers. Our solution outperforms existing detection heuristics proposed in the literature, allowing us to identify 35 new CSRF vulnerabilities on 20 major websites and 3 previously undetected CSRF vulnerabilities on production software already analyzed using a state-of-the-art tool.

Dr. Claude Castelluccia

INRIA, France

Talk title: From Machine Learning Security to Cognitive Security

Abstract: Online services, devices or secret services are constantly collecting data and meta-data from their users. This data collection is mostly used to target users or customise their services. However, as illustrated by the Cambridge Analytica case, data and technologies are increasingly used to manipulate, influence or shape people's opinions online, i.e. to "hack" our brains. In this context, it is urgent to develop the field of "Cognitive security" in order to better comprehend these attacks and provide solutions.
This talk will introduce the concept of "Cognitive security". We will explore the different types of cognitive attacks and discuss possible solutions. We will show how cognitive security can benefit from the field of Machine Learning Security.

Prof. Pavel Laskov

University of Liechtenstein, Liechtenstein

Talk title: Machine Learning for Malware Detection and Analysis

Abstract: The problem of malware detection has been the Holy Grail of computer security for several decades. The vulnerability of computing systems to exploitation via malware enables attackers to circumvent almost any established security mechanism. Running malicious software on compromised devices can be used by attackers in various ways to earn profits. It can also inflict irrevocable damage on affected systems. Timely detection of malicious software is therefore a crucial task of security mechanisms.
In this talk, I will present an overview of existing approaches to applying machine learning for detection and analysis of malware. The first problem for which the effectiveness of machine learning for this task has been demonstrated is the detection of malicious PDF documents. The discovery of vulnerabilities in Adobe PDF parsers in 2008 caught the entire security industry off-guard. None of the then-existing signature-based detection methods could cope with the vast structural complexity of the PDF format. The early scientific approaches based on machine learning, developed in 2011-2012, demonstrated the power of automatic analysis and exceeded the detection accuracy of the best antivirus engines. Detection of malicious PDF documents has also served as an important use-case for the investigation of attacks against machine learning algorithms and the respective defenses against such attacks.
In the second part of my presentation, I will present techniques for detection and classification of malicious executables, which represent the operational functionality of attacks. From the technical point of view, this problem can be addressed in two different ways. In the static approach, decisions are made based on the content of executable files. In the dynamic approach, executable files are run in a special environment which makes it possible to track all actions carried out during execution. For both approaches, using machine learning on top of suitable features obtained from the basic analysis enables one to accurately discriminate between malicious and benign executables as well as to categorize malware into meaningful families.

Dr. Marco Morana

University of Palermo, Italy

Talk title: Online Social Networks: Opportunities and Security Challenges

Abstract: In recent years, the widespread diffusion of online social networks (OSNs) has not only enabled new forms of interaction within virtual communities, but has also created a new paradigm for sharing information in a pervasive way. Social networks spread information faster than any other media! Unfortunately, this also makes OSNs a target for users interested in performing malicious activities, such as spamming, spreading malware, and performing security attacks.
In this talk, the adoption of AI and ML algorithms for social network analysis and protection will be discussed, while also highlighting the open challenges that still need to be addressed.

Dr. Stjepan Picek

Delft University of Technology, Netherlands

Talk title: Machine Learning and Implementation Attacks

Abstract: Recent years have shown that machine learning techniques can be a powerful paradigm for implementation attacks, especially profiling side-channel attacks (SCAs). Still, we are limited in our understanding of when and how to select appropriate machine learning techniques. Additionally, the results we can obtain are empirical and difficult to generalize. In this talk, we discuss several well-known machine learning techniques, the results obtained, and their limitations. We especially concentrate on deep learning techniques and the potential benefits such techniques can bring to SCA, with an emphasis on real-world scenarios. Next, we examine how various AI techniques are used for fault injection attacks.
In the last part of the talk, we discuss how SCAs are used to attack deep learning implementations.

Dr. Fabrizio Silvestri

Facebook Inc., UK

Talk title: Misspelling Oblivious Embeddings

Abstract: We present a method to learn word embeddings that are resilient to misspellings. Existing word embeddings have limited applicability to malformed texts, which contain a non-negligible amount of out-of-vocabulary words. We propose a method combining FastText with subwords and a supervised task of learning misspelling patterns. In our method, misspellings of each word are embedded close to their correct variants. We train these embeddings on a new dataset we are releasing publicly. Finally, we experimentally show the advantages of this approach on both intrinsic and extrinsic NLP tasks using public test sets.

Prof. V.S. Subrahmanian

Dartmouth College, USA

Talk title: Predictive Analysis of Android Malware

Abstract: The talk will contain a comprehensive overview of the use of machine learning techniques for Android malware classification and prediction problems. I will start with a problem raised by Symantec: to predict the family to which an Android malware sample belongs. I will describe the EC2 algorithm that mixes clustering and classification to solve a 156-way classification problem. The second problem I will discuss is that of detecting Android Banking Trojans (ABTs), and I will show that using new classes of features based on the novel family of "Triadic Suspicion Graphs" both aids efficient detection and is resilient to some types of adversarial attack. Third, I will discuss recent work on detecting Android rooting malware that uses feature transformations, together with a mix of clustering and classification, in order to improve predictive accuracy and also be more resistant to adversarial attack. If time permits, I will also discuss some of our work on predicting whether an Android app will turn malicious in the future, even if it is currently benign.

Dimitris Tsipras

Massachusetts Institute of Technology, USA

Talk title: Robust Machine Learning: Progress, Challenges, and Applications

Abstract: Recent progress has made the deployment of machine learning (ML) systems in the real world an imminent possibility. But are our current ML systems really up to the task?
In this talk, I will discuss the pervasive brittleness of existing ML tools and offer a new perspective on how it arises. I will then describe a conceptual framework that aims to deliver models that are more reliable, and robust to adversarial manipulation. Finally, I will outline how this framework constitutes a new learning paradigm, how it differs from the classic perspective, and what benefits it provides, beyond robustness itself.

Dr. Sicco Verwer

Delft University of Technology, Netherlands

Talk title: Learning State Machines or Automatic Reverse Engineering of Communication Protocols

Abstract: In this talk, I describe key algorithms for learning state machines from black-box software components. I discuss their theoretical guarantees, the tools that implement them, and how we use them in practice. Essentially, these algorithms aim to turn a black-box component, such as a communication protocol, into a white-box model, in this case a state machine.

Prof. Shanchieh Jay Yang

Rochester Institute of Technology, USA

Talk title: Anticipatory Cyber Defense: Extracting, Synthesizing, and Predicting Attack Behaviors

Abstract: Critical and sophisticated cyberattacks now use a multitude of reconnaissance, exploitation, and obfuscation techniques to penetrate well-protected enterprise networks. The discovery and detection of attacks, though needing continuous efforts, is no longer sufficient. Imagine a system that automatically extracts the temporal, spatial and contextual patterns of attacker actions, and generates empirical models that can be used for in-depth analysis or even to predict the next attack actions. What if we can synthesize plausible novel attack scenarios that have not been seen before? Such advances will be key to providing actionable predictive intelligence for anticipatory cyber defense.
This lecture will challenge the students to explore techniques to extract, learn, and synthesize cyberattack models. Students will be presented with a suite of research advances: ASSERT employs information-theoretic Bayesian learning to generate and refine attack models based on observed malicious activities; CASCADES simulates how attackers of different capabilities and preferences gradually accumulate knowledge to penetrate a network; CAPTURE overcomes limitations of imbalanced, insufficient, and insignificant data to forecast cyberattacks before they happen using unconventional signals in the public domain. During the lecture, students will learn and brainstorm mathematical treatments for selected problems.

Dr. Fabiana Zollo

Ca' Foscari University of Venice, Italy

Talk title: From Confirmation Bias to Echo-chambers: A Data-driven Approach

Abstract: The advent of the Internet and web technologies has radically changed the paradigm of news consumption, leading to a new scenario where people actively participate not only in the diffusion of content, but also in its production. In this context, social media have become central not only to our social lives, but also to the political and civic world, rapidly establishing themselves as the main information source for many of their users. However, social media are riddled with unsubstantiated and often untruthful rumors that can influence public opinion negatively. Since 2013 the World Economic Forum has indeed been placing the global danger of massive digital misinformation at the core of other technological and geopolitical risks. Understanding the main determinants behind content consumption and the emergence of narratives online is thus crucial. In this talk, we address such a challenge by analyzing massive data from online social media, such as Facebook and Twitter. We provide empirical evidence of the existence of the so-called echo chambers, polarized groups of like-minded people where users reinforce their pre-existing opinions. We show the role of confirmation bias in content consumption, address the emotional dynamics inside and between different narratives, and investigate users’ response to both confirmatory and contrasting information (fact-checking). Our findings reveal that similar patterns also hold for political (Brexit, the Italian Constitutional Referendum) and public (Climate Change, Vaccines) debates. Our results provide interesting insights about the determinants of polarization and the evolution of core narratives on online debating platforms, thus highlighting the crucial role of data science techniques in mapping the information space.