Musings on the OWASP Top 10 2017 RC1 Part 2: The Data
As I wrapped up my first round of analysis on the Top 10, I realized that this data set is a relative rarity and I could probably glean some interesting insights from it.
I fully expected contributions from product vendors and consulting organizations, because their data is a collection of other companies’ data. I was very pleased, though, to see the nine contributors who identified as internal assessment teams, and I want to thank them for being willing to share data about what they are finding during testing.
Sounds almost the same, doesn’t it? There is a pretty big difference between the two, though, especially in what types of vulnerabilities are found and how they are reported. With HAT-style (human-augmented tool) testing, you will typically see much higher vulnerability-per-app counts, usually because the tool’s generated report counts every instance as a unique finding. With TAH-style (tool-augmented human) testing, you would typically see no more than a single finding per vulnerability type in the report, with all known instances listed within that finding. Another common issue with HAT-style testing is a high rate of false positives, or findings that aren’t exploitable. This seems to stem from a lack of context and a fear of missing something, so almost anything that might be something gets reported.
I’ve tried to figure out what kind of testing was performed for each of the contributor-provided datasets. This is my best guess, based on how I interpreted the metadata provided: I took the type of testing and the average vulnerabilities per application to classify the style of testing I believe was most likely used. I also looked at the individual vulnerability counts. If they were an order of magnitude higher than the number of web apps tested, the base numbers most likely came from a tool-generated report.
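The heuristic above can be sketched in a few lines of code. This is a hypothetical reconstruction, not the actual classification script: the threshold, dataset names, and counts are all invented for illustration.

```python
# Hypothetical sketch of the classification heuristic described above.
# Dataset names and numbers below are invented for illustration.

def classify_testing_style(apps_tested, total_findings):
    """Guess HAT (human-augmented tool) vs TAH (tool-augmented human).

    Heuristic: tool-generated reports count every instance as a unique
    finding, so findings-per-app tends to run an order of magnitude (or
    more) higher than in human-curated reports, which collapse all
    instances of a vulnerability type into a single finding.
    """
    per_app = total_findings / apps_tested
    return "HAT" if per_app >= 10 else "TAH"  # threshold is an assumption

datasets = {
    "vendor_scan_feed": (5000, 120000),  # many findings per app -> HAT
    "internal_pentest": (45, 180),       # a handful per app -> TAH
}

for name, (apps, findings) in datasets.items():
    print(name, classify_testing_style(apps, findings))
```

The exact cutoff would need tuning against the real metadata; the point is only that the ratio of findings to applications is the telltale signal.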
Armed with this classification system, I decided to dig deeper into the datasets and attempt to answer a couple of questions related to the vulnerabilities that were found:
First, I set out to see what each testing style appeared to be good at finding, and the initial results were interesting. If one style had a significantly higher percentage of the findings for a particular category, I would color code the CWE gray for HAT and orange for TAH.
The human-augmented tools are good at finding what we would expect them to find: injections, vulnerable dependencies, and weak crypto, stuff that doesn’t typically require knowledge of context in an application. It also looked like a good mix of both common static and dynamic-style findings.
The tool-augmented humans found many tougher things, like improper access controls, insufficient anti-automation, and unrestricted uploads, among a few other categories. Again, not really surprising and, from my experience, also in line with general expectations.
Then I had a bit of an “Aha!” moment. The number of applications tested between the two styles was not balanced: the TAHs had tested only 9% of the web apps, whereas the HATs had tested 91%. So it was a bit of an unfair comparison, in my opinion. Out of curiosity, I scaled the TAH style up tenfold, pretending we had a bunch more humans (if only it were that easy…). The results were pretty interesting.
At the same scale, the TAHs demolish the tools in a significant number of categories. Unfortunately, humans don’t scale well, as many of us can attest. Quite honestly, the difference is probably even greater than depicted here: as mentioned earlier, TAHs usually count no more than one finding per category per web app, whereas HATs count every instance found.
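The scaling step itself is simple arithmetic, roughly like the following. The app counts mirror the 91%/9% split mentioned above, but the per-category findings are invented placeholders, not the real contributed data.

```python
# Toy illustration of scaling TAH findings to the HAT app count.
# The 910/90 split mirrors the ~91%/9% coverage described in the text;
# all per-category counts are invented for illustration.

hat_apps, tah_apps = 910, 90

hat_findings = {"XSS": 8300, "Injection": 900, "Access Control": 150}
tah_findings = {"XSS": 120, "Injection": 80, "Access Control": 300}

scale = hat_apps / tah_apps  # roughly tenfold
tah_scaled = {cwe: round(n * scale) for cwe, n in tah_findings.items()}

# Side-by-side comparison at the same scale
for cwe in sorted(set(hat_findings) | set(tah_scaled)):
    print(f"{cwe}: HAT={hat_findings.get(cwe, 0)} "
          f"TAH(scaled)={tah_scaled.get(cwe, 0)}")
```

Note that this naive multiplication still understates the humans’ advantage, for the instance-counting reason given above.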
I wanted to see what the Top 10 in prevalence would be for HAT vs TAH, so I dug back in to see what that would look like. This isn’t taking into account “risk,” “severity,” or “impact” as we don’t have that info on this data, but it is still useful to know.
The HAT results are heavily dominated by XSS findings. The Top 10 covers 99% of the total findings by human-augmented tools, but that’s not too surprising, as XSS on its own accounted for 83% of the overall findings via this testing style. While it is tempting to remove it because it overwhelms the data, I didn’t have any justifiable grounds to do so. We shouldn’t pick and choose what data to look at just because it’s easier and/or fits a model. However, it is also important to point out that the testing methodology that reported the vast majority of the XSS vulnerabilities was “Raw output of automated analysis tools, using rules tuned by earlier stage manual false positive analysis”, which does not appear to include a post-test validation step before reporting.
For the TAH testing style, we see a much more evenly distributed Top 10 by prevalence.
Misconfigurations, Authentication, and XSS are the top three, and the Top 10 represents approximately 76% of the total findings. But even looking at this, it’s hard to tell what risk these metrics represent. A Security Misconfiguration could lead to disclosure of some fairly benign data, or it could open up the entire application to compromise. What we can tell is that TAHs seem to find a broader range of vulnerabilities, as well as several types that tools haven’t been able to reliably detect.
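The “Top 10 by prevalence” tally used above is just a ranking of categories by their share of total findings, with no risk or severity weighting. A minimal sketch, using invented counts chosen to echo the HAT-style XSS dominance:

```python
# Minimal sketch of a prevalence-ranked Top 10: rank categories by share
# of total findings, ignoring risk/severity (which this data set doesn't
# carry). Counts are illustrative, not the actual contributed data.

from collections import Counter

findings = Counter({"XSS": 830, "Injection": 60, "Misconfiguration": 40,
                    "Authentication": 30, "Weak Crypto": 20, "Other": 20})

total = sum(findings.values())
top10 = findings.most_common(10)  # fewer than 10 categories here
coverage = sum(n for _, n in top10) / total  # share the Top 10 covers

for cwe, n in top10:
    print(f"{cwe}: {n / total:.0%}")
```

With real data the interesting number is `coverage`: how much of everything found the Top 10 actually accounts for (99% for HAT, ~76% for TAH in the figures above).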
From my perspective, there are a few important takeaways from this data:
If there are other conclusions you can draw that I haven’t touched on, let me know, because this stuff fascinates me.
Actually, there is a whole lot we don’t know about this data. Here are a few things I can think of off the top of my head:
There is some missing “bonus” data as well (at least missing from what I have access to). A few contributors mentioned that they sent separate emails listing additional vulnerabilities that were not part of the original list sent out in the call for data.
Metrics are hard; collecting good, statistically sound data is even harder. This was a good start, as there are several useful insights in this data, but we should be able to do better. The more guidance and structure you provide up front, the more consistent the resulting data should be. The struggle will be making it structured enough to provide useful insight without being so onerous that no one contributes.
Here are a few things that I think could be done for a future call for data (Top 10 or otherwise):
What question(s) are you trying to answer or what story are you trying to tell? Deciding this up front will help ensure that you define the right structure and are able to ask the right questions. Are we looking to see what vulnerabilities are being found by different types of testing? Are we looking for common vulnerability types by language/framework? What kind of vulnerabilities are caught before pushing to production? Has there been an improvement over time?
What is a “vulnerability” or “finding,” and do you typically report only one per type or all instances of that type? This can make a huge difference in how the numbers are reported and whether or not they are comparable. Do you want only new findings, or are retests OK? Tagging whether a finding came from a default scan, was verified after triage, or was found manually can make a difference as well.
I’m also working on this problem for the OWASP SAMM Benchmark. We are figuring out how to collect relatively sensitive data from companies, data that could build a powerful knowledge base about how security programs are working, without putting the contributors at risk. How do we achieve a balance of metadata and data that is useful, while running a process safe enough that companies will contribute? Right now it’s mostly product and service companies that contribute, because the data is so abstract and high level that no one can determine which client it originated from. We dearly need data that can convey risk, and companies won’t share if sharing is perceived as a risk to them.
All in all, more accurate data would be good. We need to ensure we understand what story it is telling us, and we also need to make sure that the data collected and reported will drive desired behaviors.