
Musings on the OWASP Top 10 2017 RC1 Part 2: The Data

As I wrapped up my first round of analysis on the Top 10, I realized that this data set is a relative rarity and I could probably glean some interesting insights from it.

In the call for data for the Top 10 2017 project, submitters were asked to follow a basic structure and answer a number of metadata questions about their contributions. This was meant to normalize the data so that the contributions could be compared. However, after looking at the data, I've come to the conclusion that more needs to be done up front. We'll touch on that point later, but first I want to share what I believe are interesting insights about the data.

The OWASP Top 10 Data Contributors

The raw published data sheet has one tab for "large data sets" and another for "small data sets," and it looks like most of the analysis was done on the 11 identified large data sets. I wanted to use all the data available for my purposes, so I used the entries in the small data set worksheet as well. There are a total of 24 different contributors, whose data ranges from three to 44,627 applications tested. The data covers 2014, 2015, or the 2014-2015 span. The contributions that span both years don't break out each year separately, so I'm treating everything as a single time period, since there isn't enough detail to split it with any accuracy.

[Figure: Submitting Organizations]

I fully expected contributions from product vendors and consulting organizations, since their data is an aggregation of other companies' data. I was very pleased to see the nine contributors who identified as internal assessment teams, and I want to thank them for being willing to share data about what they are finding during testing.

Human-Augmented Tools (HAT) vs. Tool-Augmented Humans (TAH)

Sounds almost the same, doesn't it? There is a pretty big difference between the two, especially with regard to what types of vulnerabilities are found and how they are reported. With HAT-style testing, you will typically see much higher vulnerability-per-app scores, usually because the tool's generated report is used directly and counts every instance as a unique finding. With TAH-style testing, you would typically see no more than a single finding per vulnerability type in the report, with all known instances listed within that finding. Another common issue with HAT-style testing is a high rate of false positives, or findings that aren't exploitable. This seems to stem from a lack of contextual understanding and the fear of missing something, so almost anything that might be an issue gets reported.
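To make the counting difference concrete, here's a toy example with made-up numbers (not from the dataset) showing how the same raw results produce very different per-app figures under the two conventions:

    # Toy example (made-up numbers) of how the two reporting conventions
    # change the vulnerability-per-app figure for the same raw results.
    apps_tested = 5
    xss_instances_found = 50   # individual XSS instances across those apps

    # HAT-style: the tool report counts every instance as its own finding.
    hat_findings = xss_instances_found
    print(hat_findings / apps_tested)   # 10.0 findings per app

    # TAH-style: at most one XSS finding per app, instances listed inside it.
    tah_findings = apps_tested
    print(tah_findings / apps_tested)   # 1.0 finding per app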

I've tried to figure out what kind of testing was performed for each of the contributor-provided datasets. This is my best guess based on how I interpreted the metadata provided: I used the type of testing and the average number of vulnerabilities per application to classify the style of testing most likely used. I also looked at the individual vulnerability numbers; if they were an order of magnitude higher than the number of web apps tested, the base numbers most likely came from a tool-generated report.
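Here's a rough sketch of that heuristic; the ten-to-one threshold is my own assumption, not something defined in the published metadata:

    def classify_testing_style(apps_tested: int, total_findings: int) -> str:
        """Rough guess at HAT vs. TAH based on findings reported per app.

        The 10x threshold is an assumption, not part of the dataset.
        """
        if apps_tested == 0:
            return "unknown"
        findings_per_app = total_findings / apps_tested
        # Findings an order of magnitude above the app count suggest a raw
        # tool-generated report: Human-Augmented Tools (HAT).
        if findings_per_app >= 10:
            return "HAT"
        # One-finding-per-type reporting looks more like Tool-Augmented Humans.
        return "TAH"

    print(classify_testing_style(apps_tested=300, total_findings=9_000))  # HAT
    print(classify_testing_style(apps_tested=300, total_findings=600))    # TAH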

[Figure: Testing Styles]

Armed with this classification system, I decided to dig deeper into the datasets and attempt to answer a couple questions related to vulnerabilities that were found:

  • What are humans better at finding versus what are tools better at finding?
  • What are the Top 10s in prevalence for tools vs humans?

First, I set out to see what each testing style appeared to be good at finding, and the initial results were interesting. If one style had a significantly higher percentage of the findings for a particular category, I would color code the CWE gray for HAT and orange for TAH.
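The comparison itself is simple to reproduce; something like the following sketch, where the CWE labels, counts, and 60% cutoff are placeholders and assumptions rather than the real numbers:

    import pandas as pd

    # Placeholder counts, one row per (style, CWE); not the real dataset.
    findings = pd.DataFrame({
        "style": ["HAT", "HAT", "TAH", "TAH"],
        "cwe":   ["CWE-79 (XSS)", "CWE-284 (Access Control)",
                  "CWE-79 (XSS)", "CWE-284 (Access Control)"],
        "count": [8300, 120, 450, 600],
    })

    # What share of each CWE's findings came from each testing style?
    shares = findings.pivot_table(index="cwe", columns="style",
                                  values="count", aggfunc="sum", fill_value=0)
    shares = shares.div(shares.sum(axis=1), axis=0)

    # Flag the style that clearly dominates a category (60% is an arbitrary cut).
    dominant = shares.idxmax(axis=1).where(shares.max(axis=1) > 0.6)
    print(shares.assign(better_at=dominant))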

[Figure: HAT vs TAH]

The human-augmented tools are good at finding what we would expect them to find: injections, vulnerable dependencies, and weak crypto, things that don't typically require an understanding of the application's context. It also looked like a good mix of common static and dynamic-style findings.

The tool-augmented humans found many tougher things, like improper access controls, insufficient anti-automation, and unrestricted uploads, among a few other categories. Again, not really surprising and, from my experience, also in line with general expectations.

Then I had a bit of an "Aha!" moment: the number of applications tested by the two styles was not balanced. The TAHs tested only 9% of the web apps, whereas the HATs tested 91%. So the comparison was a bit unfair, in my opinion. Out of curiosity, I scaled the TAH numbers up tenfold, pretending we had a bunch more humans (if only it were that easy…). The results were pretty interesting.
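The scaling itself is just a back-of-the-envelope multiplication, roughly like this (the category counts below are illustrative, not the published figures):

    # The app counts split roughly 91% HAT vs. 9% TAH, so scale TAH by ~10x.
    hat_app_share, tah_app_share = 0.91, 0.09
    scale = hat_app_share / tah_app_share          # ~10.1

    # Illustrative TAH category counts (not the real numbers).
    tah_counts = {"Improper Access Control": 600, "Unrestricted Upload": 150}
    tah_scaled = {category: count * scale for category, count in tah_counts.items()}
    print(tah_scaled)   # what TAH findings might look like at HAT's app volume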

[Figure: TAH vs HAT]

At the same scale, the TAHs demolish the tools in a significant number of categories. Unfortunately, humans don't scale well, as many of us can attest. Quite honestly, the difference is probably even greater than depicted here since, as mentioned earlier, TAHs usually report no more than one finding per category per web app, while HATs count every instance found.

Top 10 for HAT vs TAH

I wanted to see what the Top 10 by prevalence would be for HAT vs TAH, so I dug back into the data to see what that would look like. This doesn't take into account "risk," "severity," or "impact," since we don't have that information in this data, but it is still useful to know.
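A prevalence-only Top 10 is just a ranked count of findings per category; a minimal sketch (with placeholder numbers, not the real totals) looks like this:

    from collections import Counter

    # Placeholder category counts for one testing style; not the real data.
    findings_by_category = Counter({
        "XSS": 83_000,
        "SQL Injection": 6_000,
        "Vulnerable Dependencies": 4_000,
        "Weak Cryptography": 2_500,
    })

    total = sum(findings_by_category.values())
    for category, count in findings_by_category.most_common(10):
        print(f"{category}: {count} ({count / total:.1%} of all findings)")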

[Figure: HAT Top 10 by prevalence]

The HAT Top 10 is heavily dominated by XSS findings. The Top 10 covers 99% of the total findings by human-augmented tools, but that's not too surprising given that XSS on its own accounted for 83% of the overall findings via this testing style. While it is tempting to remove it because it overwhelms the data, I didn't have any justifiable grounds to do so; we shouldn't pick and choose what data to look at just because it's easier or fits a model. It is also important to point out that the testing methodology that reported the vast majority of the XSS vulnerabilities was "Raw output of automated analysis tools, using rules tuned by earlier stage manual false positive analysis," which does not appear to include a post-test validation step before reporting.

For the TAH testing style, we see a much more evenly distributed Top 10 by prevalence.

[Figure: TAH Top 10 by prevalence]

Misconfigurations, Authentication, and XSS are the top three, and the Top 10 represents approximately 76% of the total findings. Even so, it's hard to gauge from these metrics how much risk is actually present: a Security Misconfiguration could lead to disclosure of fairly benign data, or it could open the entire application up to compromise. What we can tell is that TAHs seem to find a broader range of vulnerabilities, as well as several types that tools haven't been able to reliably detect.

From my perspective, there are a few important takeaways from this data:

  1. Tools are nowhere near replacing humans in finding vulnerabilities. (Granted, this is a relatively small dataset considering the number of vulnerabilities being found each day, but if someone wants to build or offer a bigger dataset to play with, I'm in.)
  2. Humans need tools to scale.
  3. Humans need better tools.
  4. We need a combination of TAH and HAT testing styles for a good while yet: HATs for scale and speed (part of the build/integration cycle) and TAHs for coverage (prioritized, higher risk applications).
  5. Either we are not terribly good at preventing XSS or we need a more accurate way to test for it. I have similar thoughts for a number of injection types.

If there are other conclusions you can draw that I haven't touched on, let me know, because this stuff fascinates me.

What don't we know about this data?

Actually, there is a whole lot we don't know about this data. Here are a few things I can think of off the top of my head:

  1. Are all the identified web applications unique? Or are there duplicates?
  2. How are the vulnerabilities counted? Are they collected into one finding or are all instances found counted separately?
  3. How much massaging took place to map to CWEs? And was it consistent?
  4. Where in the process are the tools leveraged? A tool in the DevOps flow should be tuned to look for different things rather than one run out of band.
  5. Is there a distinction between vulnerabilities found in dev/testing vs production?
  6. No risk, severity, likelihood, exploitability, or other important factors are listed.
  7. No idea if any of these ever made it to production, were fixed, or marked "won't fix."
  8. For a specific finding, was it that the dev team missed one control point or was it a systemic issue?

There is some missing "bonus" data as well (at least missing from what I have access to). A few contributors mentioned that they sent separate emails listing additional vulnerabilities that were not part of the original list sent out in the call for data.

What can we learn from this dataset for the next time?

Metrics are hard. Collecting good, statistically sound data is even harder. This was a good start, as there are several useful insights in this data, but we should be able to do better. The more guidance and structure you can provide up front, the more consistent the resulting data should be. The struggle will be making it structured enough to provide useful insight without being so onerous that no one contributes.

Here are a few things that I think could be done for a future call for data (Top 10 or otherwise):

Understand what you are looking for.

What question(s) are you trying to answer or what story are you trying to tell? Deciding this up front will help ensure that you define the right structure and are able to ask the right questions. Are we looking to see what vulnerabilities are being found by different types of testing? Are we looking for common vulnerability types by language/framework? What kind of vulnerabilities are caught before pushing to production? Has there been an improvement over time?

Be very clear on your definitions.

What is a "vulnerability" or a "finding," and do you report only one finding per type or all instances of that type? This can make a huge difference in how the numbers are reported and whether or not they are comparable. Do you want only new findings, or are retests okay? Tagging whether a finding came from a default scanner run, was verified after triage, or was found manually can make a difference as well.
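As one possible shape for that kind of tagging in a submission record (the field names and categories below are my suggestion, not an OWASP-defined schema):

    from dataclasses import dataclass
    from enum import Enum

    class DetectionSource(Enum):
        DEFAULT_SCANNER = "raw scanner output"
        TRIAGED = "verified after triage"
        MANUAL = "manually found"

    @dataclass
    class FindingRecord:
        cwe: str                 # e.g. "CWE-79"
        instance_count: int      # every instance counted, so counting is explicit
        is_retest: bool          # new finding vs. retest of a known issue
        source: DetectionSource  # how the finding was detected/verified

    example = FindingRecord(cwe="CWE-79", instance_count=12,
                            is_retest=False, source=DetectionSource.TRIAGED)
    print(example)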

Define a data collection process that allows contributors to safely submit their own data.

I'm also working on this problem for the OWASP SAMM Benchmark. We are working out how to collect relatively sensitive data from companies, data that could build a powerful knowledge base about how security programs are working, without putting those companies at risk. How do we strike a balance between metadata and data that is useful and a process safe enough that companies will contribute? Right now, it's mostly product and service companies that contribute, because their data is abstract and high level enough that no one can determine which client it originated from. We dearly need data that can convey risk, and companies won't share if sharing is perceived as a risk to them.

The Final Takeaway

All in all, more accurate data would be good. We need to ensure we understand what story it is telling us, and we also need to make sure that the data collected and reported will drive desired behaviors.