The importance of snippet
matching for software provenance analysis (aka “software audit”) is an ongoing discussion topic in the domain of open source compliance. This document explains nexB’s perspective based on our decade of experience.
Definition of Snippet Matching
A snippet is a small piece of code.
Snippet matching (or snippet-level matching) is about finding similar small pieces of code where the formatting, case, spacing and other lesser attributes of the source code are ignored.
At a technical level, snippet detection is similar to approximate whole file detection with the addition of a content-defined file-chunking algorithm where chunks are fingerprinted with checksums and the checksums are indexed.
File chunking is the main part of any snippet scan approach to break down a file into a set of “snippets”. Some ways to do this are:
- The Winnowing approach used by some proprietary vendors.
- The Hailstorm approach which we prefer.
Practical Issues with Snippet Matching
The underlying assumption is that when a snippet of code in a source file matches a snippet of open source or other third-party code, then the code snippet may have been copied from that open source or other third-party code.
This match assumption may not be true in many cases. For instance:
- Two code snippets may have been generated by the same code generator, code editor or IDE.
- Two code snippets may be the same or similar because there is only one way to call an API and therefore independent developers create similar code.
- A code snippet may be derived from an origin that is not reported by the snippet-matching tool used.
- A code snippet may be boilerplate such as standard header comments, API comments, common lists of package imports or includes, list of numbers such as zero, etc.
These are some examples of “false positive” matches and snippet matching tools return a very large number of false positive matches.
The key issue when using snippet matching for software provenance analysis is that you need to apply significant resources (money and people) to get snippet detection data, review all the possible matches, identify the false positive matches and then investigate the remaining snippet matches.
The primary under-estimated cost is the effort to research and resolve false positive snippet matches. This cost is increasing almost daily as more copies of popular open source package proliferate across public repositories like GitHub. More copies mean more snippet matches to analyze.
Another important issue is the definition of how large a “snippet match” needs to be in order to be considered “significant”. Some companies define a snippet match by lines of code, but this can be problematic because the number of lines of code should be defined for each major language relative to the overall verbosity of the language. The effective content of 10 lines of C code could quite different from 10 lines of Java or Python code.
You also need to carefully consider how concepts for copyrightable code apply to snippets. Most organizations consider a snippet match where the matched lines are only standard definitions to be a false positive. It can be quite complex to define effective policies for what constitutes a true snippet match.
Overall, implementing practical policies and techniques for snippet matching is very difficult. The cost of snippet matching is very high, but the benefits are less clear.
Our Experience
After hundreds of software provenance analysis projects, we have found very few true and unattributed code snippets.
From audit projects where we have used both nexB’s ScanCode Toolkit and a prominent proprietary snippet-matching tool we have found that:
- ScanCode finds many snippets based on developer comments in the code (e.g. a Stack Overflow URL) that were not detected by the snippet-matching tool. it is often the case that these developer comments are missing the attribution notice (copyright and license), but such comments are usually clear evidence of the snippet origin.
- The snippet-matching tool reports false positive to true snippet matches in a ratio of 10:1 or higher.
- The cases with a significant pattern of true snippet matches are usually correlated with a product team that has little or no understanding of open source licensing.
Based on this experience, we normally do not include snippet matching in our analysis unless the customer provides access to their preferred snippet-matching tool and their policies for identifying a true snippet match.
Risk Considerations
Regardless of the specific tools and techniques used for snippet-matching, it is important to also consider the risk perspective.
The specific compliance risk is that an undocumented or undisclosed code snippet
will be discovered by a third party without access to your source code is low. The riskiest use case is for languages where there is no compilation or similar process that would mask the source code. The most common current example of this risk area would be JavaScript that is not minified.
When you have limited compliance resources it is best to apply those resources to the higher risk use cases of re-using larger-scale open source (or other third-party) packages which are easier for a third party to detect.
The cost to address a compliance issue from an undocumented/undisclosed code snippet will also tend to be much lower than the cost to replace a larger-scale open source (or other third-party) package.
The primary high risk use case for code snippets is the rare case where there is a pervasive pattern of many code snippets copied from:
- An open source package under a Copyleft license or
- A third-party package under a proprietary/commercial license that does not permit the use of code snippets. This use case is extremely rare and is usually found with a product where the engineering team has no idea that this is a problem – i.e. the real issue is a lack of even basic OSS compliance training. This case can often be detected by software interaction analysis without snippet matching.
The risk profile will usually be highest in two cases:
- Acquisition of a software product or company where you may not have historical insight into the software development and licensing processes.
- A product where you know that the engineering team has not had even basic training for open source license compliance.
Recommended Role for Snippet Matching
Because of the significant resources required to analyze snippet scan data, nexB recommends a two-level process if you need to include snippet matching in your OSS compliance process:
- A primary process based on package- and file-level scanning and matching.
- A secondary process to apply snippet matching only to code whose origin has not been confirmed in the primary process.
Some details about this approach are:
- Start with a combination of scanning and matching at the package- or file-level:
- Scanning means collecting software provenance (origin and license) data from the codebase including copyright notices, license text and package metadata.
- Matching means identifying open source or other third-party packages by searching (using checksums, names or other clues) in a database. A simple example is “matching” a prebuilt Java library
(.jar)
with a copy of the same library available in the Maven central repository. The matching database would typically contain the checksum (e.g. SHA1) of every Java library contained in the Maven central repository.
- Run the snippet matching tool only on the subset of the codebase that does not yet have a confirmed origin and license. In most cases, this will primarily be “first-party” code written by the product team or third-party proprietary code.
This snippet analysis approach may still be expensive in terms of time and resources, but it limits that effort to code that does not have a reasonably concluded third-party origin.
Running snippet-matching on known OSS packages does not make sense in any use case. Running snippet matching on a popular open source package will likely result in many thousands of snippet matches. The snippet match count for a copy of the Linux Kernel will be in the millions.
It is hard to quantify the cost of snippet matching in general because so much depends on the languages and platforms of a product.
We estimate that using snippet matching as the primary analysis tool will require at least twice the resources and time compared to our recommended approach.
Summary
In summary, mandating snippet detection in your software compliance program without committing the corresponding resources to analyze snippet detection data is probably a mistake from the operational, business and legal perspectives.
The best approach is to focus any snippet matching on high risk use cases.