I am a Principal Research Scientist in the Security Research team at SAP. I am based in Sophia-Antipolis, in Southern France, and I have been with SAP since 2010.
My current focus is on applying artificial intelligence to source code analysis for software security, with a particular emphasis on open-source software.
I was part of the team that invented and developed Eclipse Steady, the tool that SAP has used since 2015 to scan the dependencies of its Java products. In February 2019 my colleagues and I released the vulnerability dataset that fuels Steady at SAP; an extended version of that dataset is now available through project KB.
I am the principal investigator of the EU-funded Sec4AI4Sec project (2023-2026), and I was previously the principal investigator and technical leader of the EU-funded AssureMOSS project (2020-2023).
Since the end of 2021, I have served as a (co-)editor of the Building Security In department of the IEEE Security & Privacy magazine.
Before joining SAP, I was a post-doctoral fellow and then a full-time researcher at the National Research Council (CNR) in Pisa, Italy, where I spent four years overall.
During my PhD, in 2005 and 2006, I spent a total of seven months as a visiting researcher at Carleton University in Ottawa.
I received my PhD in Computer Science and Engineering from the University of Rome ‘Tor Vergata’ (Italy) in 2007.
You may find additional information about me on LinkedIn and on Google Scholar.
To get in touch with me, just click here and write me a message.
PhD in Computer Science and Automation Engineering, 2007
University of Rome ‘Tor Vergata’
Master's in Computer Science/Engineering, 2003
University of Rome ‘Tor Vergata’
During my second visit to Carleton University, I worked with Dorina Petriu on combining concepts from the aspect-oriented paradigm with graph grammars to represent crosscutting concerns (performance, security) in software models.
I used graph grammars as a way to abstract information from software models, as a preliminary step for further model transformation into a different target representation.
PhD thesis on model-driven methods to automate the performance analysis of software systems
Selected publications (the most cited)
Eclipse Steady supports software development organizations in the secure use of open-source components during application development. The tool analyzes Java and Python applications in order to: detect whether they depend on open-source components with known vulnerabilities, collect evidence regarding the execution of vulnerable code in a given application context (through the combination of static and dynamic analysis techniques), and support developers in the mitigation of such dependencies.
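To make the first of those steps concrete, here is a minimal sketch of the core idea behind dependency scanning: matching an application's resolved dependencies against a dataset of known-vulnerable library versions. All coordinates, versions, and advisory identifiers below are invented for illustration; Steady itself goes much further, checking whether the vulnerable code is actually reachable.

```python
# Hypothetical vulnerability dataset: (group:artifact, version) -> advisory id.
# Real tools such as Eclipse Steady use curated data and reachability analysis.
KNOWN_VULNERABLE = {
    ("org.example:http-client", "2.3.1"): "CVE-2017-0001",
    ("org.example:xml-parser", "1.0.0"): "CVE-2018-0002",
}

def scan(dependencies):
    """Return the advisories matching the given (coordinate, version) pairs."""
    return [
        (coordinate, version, KNOWN_VULNERABLE[(coordinate, version)])
        for coordinate, version in dependencies
        if (coordinate, version) in KNOWN_VULNERABLE
    ]

if __name__ == "__main__":
    app_dependencies = [
        ("org.example:http-client", "2.3.1"),  # matches a known advisory
        ("org.example:json-mapper", "4.2.0"),  # no known advisory
    ]
    for coordinate, version, advisory in scan(app_dependencies):
        print(f"{coordinate}:{version} is affected by {advisory}")
```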
The mission of AssureMOSS is to produce a coherent set of automated, lightweight techniques that allow software companies to assess, manage, and re-certify the security and privacy risks associated with the fast-paced development and continuous deployment of multi-party open software and services (for which we introduce the MOSS acronym).
Software reuse may result in software bloat when significant portions of application dependencies are effectively unused. Several tools exist to remove unused (byte)code from an application or its dependencies, thus producing smaller artifacts and, potentially, reducing the overall attack surface. In this paper we evaluate the ability of three debloating tools to distinguish which dependency classes are necessary for an application to function correctly from those that could be safely removed. To do so, we conduct a case study on a real-world commercial Java application. Our study shows that the tools we used were able to correctly identify a considerable amount of redundant code, which could be removed without altering the results of the existing application tests. One of the redundant classes turned out to be (formerly) vulnerable, confirming that this technique has the potential to be applied for hardening purposes. However, by manually reviewing the results of our experiments, we observed that none of the tools can handle a widely used default mechanism for dynamic class loading.
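As a rough illustration of the measurement behind such a study (not of the tools themselves), one can compare the set of classes shipped in a dependency with the set of classes observed as loaded while the test suite runs, for example from a JVM -verbose:class log. Classes never observed are debloating candidates. The file names below are hypothetical, and this naive set difference is blind to exactly the dynamic class loading issue mentioned above.

```python
import zipfile

def classes_in_jar(jar_path):
    """All fully qualified class names packaged in a JAR (a JAR is a zip)."""
    with zipfile.ZipFile(jar_path) as jar:
        return {
            name[: -len(".class")].replace("/", ".")
            for name in jar.namelist()
            if name.endswith(".class")
        }

def loaded_classes(log_path):
    """Class names extracted from a class-loading log (one name per line here)."""
    with open(log_path) as log:
        return {line.strip() for line in log if line.strip()}

if __name__ == "__main__":
    shipped = classes_in_jar("dependency.jar")       # hypothetical input
    used = loaded_classes("loaded_classes.txt")      # hypothetical input
    # Never-loaded classes are candidates for removal; classes reached only
    # through dynamic loading (e.g., Class.forName) would wrongly appear here.
    for candidate in sorted(shipped - used):
        print("debloating candidate:", candidate)
```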
Open-source packages have their source code available in repositories for inspection (e.g., on GitHub), but developers use pre-built packages directly from package repositories (such as npm for JavaScript, PyPI for Python, or RubyGems for Ruby). This convenient practice assumes that there are no discrepancies between the source code and the packages. Such discrepancies pose both operational risks (e.g., making dependent projects unable to compile) and security risks (e.g., deploying malicious code during package installation) in the software supply chain. Our empirical assessment of 2438 popular packages in PyPI, covering around 10M lines of code, shows several differences in the wild: modifications cannot simply be attributed to malicious injections. Yet, re-scanning all of these ‘most likely good but modified’ packages in their entirety is hard to manage for FOSS downstream users. We propose LastPyMile, a methodology for identifying the differences between the build artifacts of software packages and the respective source code repository. We show how it can be used to extend current package-scanning practices for malware injection (which cover less than 1% of the code of deployed packages).
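A toy version of the underlying comparison (not LastPyMile itself) can be sketched in a few lines: hash every file in the distributed package and in a checkout of its source repository, and report artifact files whose content has no counterpart in the repository. The paths are hypothetical.

```python
import hashlib
from pathlib import Path

def file_hashes(root):
    """Map SHA-256 digest -> relative path for every file under root."""
    hashes = {}
    for path in Path(root).rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            hashes[digest] = path.relative_to(root)
    return hashes

def phantom_files(package_dir, repo_dir):
    """Files in the built package whose content does not appear in the repo."""
    repo_digests = set(file_hashes(repo_dir))
    return [
        path
        for digest, path in file_hashes(package_dir).items()
        if digest not in repo_digests
    ]

if __name__ == "__main__":
    # Hypothetical locations of an extracted package and a repo checkout.
    for path in phantom_files("extracted_wheel/", "source_checkout/"):
        print("not traceable to the repository:", path)
```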
Modern software applications, including commercial ones, extensively use Open-Source Software (OSS) components, which account for 90% of software products on the market. This has serious security implications, mainly because developers rely on non-updated versions of libraries affected by software vulnerabilities. Several tools have been developed to help developers detect these vulnerable libraries and assess and mitigate their impact. The most advanced tools apply sophisticated reachability analyses to achieve high accuracy; however, they need additional data (in particular, concrete execution traces, such as those obtained by running a test suite) that is not always readily available. In this work, we propose SIEGE, a novel automatic exploit-generation approach based on genetic algorithms, which generates test cases that execute the methods in a library known to contain a vulnerability. These test cases represent precious, concrete evidence that the vulnerable code can indeed be reached; they are also useful for security researchers to better understand how the vulnerability could be exploited in practice. This technique has been implemented as an extension of EvoSuite and applied to a set of 11 vulnerabilities exhibited by widely used OSS Java libraries. Our initial findings show promising results that deserve to be assessed further in larger-scale empirical studies.
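SIEGE builds on EvoSuite and operates on Java bytecode; the self-contained toy below only illustrates the underlying search idea: a genetic algorithm evolves inputs whose fitness rewards getting closer to satisfying the branch that guards a call into the vulnerable method. The target condition is invented for the example.

```python
import random

SECRET = 4242  # hypothetical branch condition guarding the vulnerable call

def branch_distance(x):
    """Fitness to minimize: 0 means the guarded (vulnerable) code is reached."""
    return abs(x - SECRET)

def evolve(pop_size=50, generations=200, mutation_range=100):
    population = [random.randint(0, 10_000) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=branch_distance)
        if branch_distance(population[0]) == 0:
            return population[0]              # input reaching the target found
        parents = population[: pop_size // 2]  # truncation selection
        children = [
            (random.choice(parents) + random.choice(parents)) // 2  # crossover
            + random.randint(-mutation_range, mutation_range)       # mutation
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return None

if __name__ == "__main__":
    print("input reaching the vulnerable branch:", evolve())
```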
Deep learning methods, which have found successful applications in fields like image classification and natural language processing, have recently been applied to source code analysis as well, due to the enormous amount of freely available source code (e.g., from open-source software repositories). In this work, we elaborate upon a state-of-the-art approach to the representation of source code that uses information about its syntactic structure, and we adapt it to represent source code changes (i.e., commits). We use this representation to classify security-relevant commits. Because our method uses transfer learning (that is, we train a network on a “pretext task” for which abundant labeled data is available, and then we use that network for the target task of commit classification, for which fewer labeled instances are available), we studied the impact of pre-training the network using two different pretext tasks versus a randomly initialized model. Our results indicate that representations that leverage the structural information obtained through code syntax outperform token-based representations. Furthermore, the performance metrics obtained when pre-training on a loosely related pretext task with a very large dataset (> $10^6$ samples) were surpassed when pre-training on a smaller dataset (> $10^4$ samples) but for a pretext task that is more closely related to the target task.
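The transfer-learning setup can be sketched schematically as follows, with random tensors standing in for real commit representations: pre-train an encoder on a pretext task for which labels are abundant, then reuse that encoder with a fresh head for the target task. Dimensions, data, and tasks here are placeholders, not the paper's actual setup.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU())  # shared representation
pretext_head = nn.Linear(64, 10)  # pretext task: abundant labels (placeholder)
target_head = nn.Linear(64, 2)    # target task: security-relevant or not

def train(model, inputs, labels, epochs=5):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), labels)
        loss.backward()
        optimizer.step()

# 1) Pre-training on a large synthetic "pretext" dataset.
x_pretext = torch.randn(1000, 128)
y_pretext = torch.randint(0, 10, (1000,))
train(nn.Sequential(encoder, pretext_head), x_pretext, y_pretext)

# 2) Fine-tuning on a small synthetic "target" dataset, reusing the encoder.
x_target = torch.randn(100, 128)
y_target = torch.randint(0, 2, (100,))
train(nn.Sequential(encoder, target_head), x_target, y_target)
```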
Vulnerable dependencies are a known problem in today’s free and open-source software ecosystems, because FOSS libraries are highly interconnected and developers do not always update their dependencies. Our paper proposes Vuln4Real, a methodology for counting actually vulnerable dependencies that addresses the over-inflation problem of academic and industrial approaches for reporting vulnerable dependencies in FOSS software, and therefore caters to the needs of industrial practice for the correct allocation of development and audit resources. To understand the industrial impact of a more precise methodology, we considered the 500 most popular FOSS Java libraries used by SAP in its own software. Our analysis included 25,767 distinct library instances in Maven. We found that the proposed methodology has a visible impact on both the ecosystem view and the individual library developer's view of the state of software dependencies: Vuln4Real significantly reduces the number of false alerts for deployed code (dependencies wrongly flagged as vulnerable), provides meaningful insights on a library's exposure to third parties (and hence to vulnerabilities), and automatically predicts when dependency maintenance starts lagging, so that the dependency may not receive updates for arising issues.
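One source of over-inflation that such a methodology addresses is the counting of dependencies that never ship with deployed code, such as Maven 'test'-scoped dependencies. The sketch below applies just that one filter; Vuln4Real's full methodology covers considerably more cases, and all data here is invented.

```python
DEPLOYED_SCOPES = {"compile", "runtime"}  # scopes that ship with deployed code

# (group:artifact, version, scope) triples, as resolved for an application.
dependencies = [
    ("org.example:web-core", "1.4.0", "compile"),
    ("org.example:mock-server", "2.0.1", "test"),  # never deployed
]

# Hypothetical set of known-vulnerable (coordinate, version) pairs.
vulnerable_versions = {
    ("org.example:web-core", "1.4.0"),
    ("org.example:mock-server", "2.0.1"),
}

# Alert only on vulnerable dependencies that actually reach deployment.
alerts = [
    (coordinate, version)
    for coordinate, version, scope in dependencies
    if scope in DEPLOYED_SCOPES and (coordinate, version) in vulnerable_versions
]
print(alerts)  # only the deployed, vulnerable dependency is reported
```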
Advancing our understanding of software vulnerabilities, automating their identification, the analysis of their impact, and ultimately their mitigation is necessary to enable the development of software that is more secure. While operating a vulnerability assessment tool that we developed and that is currently used by hundreds of development units at SAP, we manually collected and curated a dataset of vulnerabilities of open-source software and the commits fixing them. The data was obtained both from the National Vulnerability Database (NVD) and from project-specific Web resources that we monitor on a continuous basis. From that data, we extracted a dataset that maps 624 publicly disclosed vulnerabilities affecting 205 distinct open-source Java projects, used in SAP products or internal tools, onto the 1282 commits that fix them. Out of the 624 vulnerabilities, 29 do not have a CVE identifier at all, and 46, which do have a CVE identifier assigned by a numbering authority, are not yet available in the NVD. The dataset is released under an open-source license, together with supporting scripts that allow researchers to automatically retrieve the actual content of the commits from the corresponding repositories and to augment the attributes available for each instance. These scripts also make it possible to complement the dataset with additional instances that are not security fixes (which is useful, for example, in machine learning applications). Our dataset has been successfully used to train classifiers that can automatically identify security-relevant commits in code repositories. Releasing this dataset and the supporting code as open source will allow future research to be based on data of industrial relevance; it also represents a concrete step towards making the maintenance of this dataset a shared effort involving open-source communities, academia, and industry.
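The released scripts retrieve commit contents from the original repositories; a bare-bones equivalent using plain git is sketched below. The CSV layout (vulnerability id, repository URL, commit SHA) mirrors the kind of mapping the dataset provides, but the file name and format are simplified for the example.

```python
import csv
import subprocess
import tempfile

def fetch_fix_commit(repo_url, commit_sha):
    """Clone the repository and return the diff of the fix commit."""
    with tempfile.TemporaryDirectory() as workdir:
        subprocess.run(["git", "clone", "--quiet", repo_url, workdir], check=True)
        result = subprocess.run(
            ["git", "-C", workdir, "show", commit_sha],
            check=True, capture_output=True, text=True,
        )
        return result.stdout

if __name__ == "__main__":
    # Hypothetical mapping file: vulnerability_id, repo_url, commit_sha per row.
    with open("fix_commits.csv") as mapping:
        for vuln_id, repo_url, commit_sha in csv.reader(mapping):
            diff = fetch_fix_commit(repo_url, commit_sha)
            print(vuln_id, "fix touches", diff.count("diff --git"), "files")
```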