Presenting Introduction to Machine Learning and Security at DEF CON China 1.0

By Gavin Stroy


From May to June 2019, Bishop Fox's Gavin Stroy led a machine learning workshop at DEF CON China 1.0. Below is his recap of this unique event.

Wait, DEF CON China?

Yes, you heard that right. This year was the official launch of DEF CON China 1.0, recently held in beautiful Beijing. The venue was located near the 798 Arts District and… Wow! This was the most DEF CON place I’ve ever seen. For starters, the main DEF CON speaking track was in an old silo. If you took away the decorations, it would be an unassuming building that could just as easily fade into the backdrop of the surrounding area. Inside, though, was a massive spectacle of lights and art.

The villages, on the other hand, really made the area pop! They were organized into colorful pods, made from refurbished shipping containers. These pods lined both sides of the walkway leading from the entrance to the main track, all the way to the chillout lounge.

[Image: DEF CON China was a pretty bustling event, as you can tell.]

Who Am I?

Oh right, intro: I’m a Senior Security Analyst at Bishop Fox who specializes in application and network penetration testing. Lately, I’ve been having fun finding new and creative applications of machine learning to computer network security. I went to DEF CON China this year to present an “Introduction to Machine Learning and Security” workshop, alongside the AI Village and MesaTEE.

[Image: Here's me talking to some folks at DEF CON China.]

Arriving in China

If you weren’t already aware, smartphone app and network access in China is a little complicated. So in preparation, I traded my Google Maps, Uber, and Facebook Chat for Baidu Maps, DiDi, and WeChat. I also downloaded the offline version of Google Translate for Simplified Chinese (but thankfully, DEF CON also provided excellent translators).

Most American attendees didn’t even bring their own smartphones and instead opted to bring a burner cellphone with Google Fi for international data. And I get it: if you can’t be completely paranoid at a global hacker conference in the middle of a foreign country, where can you be?

As for me, my fear of not having a mobile phone trumped my fear of potentially having people spying on me. According to the permissions and terms of service of some apps I needed, they definitely do have the ability to collect a lot of information about you. At the same time, so do Facebook and Google. After a bit of thinking, it felt like a moot point. I wiped my phone before leaving and wiped it again after coming back. A week later, nothing seems amiss, so I’d call this a safe option.

The Workshop

The purpose of the Introduction to Machine Learning and Security workshop was to introduce the security community to the wonderful world of AI and machine learning. While there’s been more recent talk in the community about deep neural networks or fooling self-driving cars, machine learning as a whole is still very misunderstood. Some think that it requires a PhD in mathematics to understand, while others think it is magic and will solve all our problems. In a nutshell, machine learning uses statistics to make future predictions based on past data. So, our workshop aimed to give a brief overview of what machine learning is and to walk the audience through how to build simple models with easy-to-use tools.

We started with a discussion around machine learning assumptions and what the attendees associate with phrases like machine learning, AI, or deep learning. We then dove into defining machine learning and the machine learning process, introduced the Multinomial Naïve Bayes algorithm (MNB) through a simple example, and finally built a series of models to be tested and evaluated.

The interactive part of the workshop walked the audience through building a simple spam filter using the scikit-learn machine learning library. At each step, we used an example email to show how email bodies are preprocessed into a form our algorithms can consume. We then trained a model on the preprocessed data and used it to make predictions on new emails that it had never seen before. We made minor tweaks to how the model learns and evaluated which of the models worked best. To conclude the workshop, participants tested the models they created on a modern spam email.
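For a rough idea of what that flow looks like in code, here is a minimal sketch using scikit-learn's `CountVectorizer` and `MultinomialNB`. The tiny corpus, the labels, and the variable names are made up for illustration; this is not the workshop notebook itself, which worked from real email data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set; a real filter needs far more data.
train_emails = [
    "buy my product now",
    "limited offer buy now and win a prize",
    "cheap product offer click now",
    "are we still meeting for lunch tomorrow",
    "please review the attached project report",
    "lunch meeting moved to friday",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# CountVectorizer lowercases and tokenizes each body into word counts
# (the preprocessing step); MultinomialNB learns from those counts.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_emails, train_labels)

# Emails the model has not seen during training.
new_emails = ["win a cheap prize now", "report for the friday meeting"]
predictions = model.predict(new_emails)

for email, label in zip(new_emails, predictions):
    print(f"{label}: {email}")
```

Swapping in a different vectorizer or changing the classifier's parameters inside `make_pipeline` is one simple way to try the kind of minor tweaks described above and compare the resulting models.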

The Math-y Bits

I think it makes sense here to talk a little bit about how the model works. At a high level, MNB uses Bayes’ theorem to answer the question, “What is the probability this piece of data belongs to a class?” For instance, a spam filter would ask, “What is the probability this email is spam, given the phrase ‘buy my product now’?” This is represented in notation as:

P(Spam | “buy my product now”)

In straight Naïve Bayes, we would answer this question as,

P(Spam | “buy my product now”) = P(“buy my product now” | Spam) × P(Spam) / P(“buy my product now”)

By counting words that appeared in past emails, we can easily determine the following:

  • P(Spam) = The number of spam emails we have, divided by the number of total emails we have
  • P(“buy my product now”) = The product of the individual probabilities of the words “buy”, “my”, “product”, and “now”, treating each word as independent; each word's probability is the number of times it occurs, divided by the total number of words in all of the emails (the result is usually a very small number)
  • P(“buy my product now” | Spam) = The same as above, but when only looking at spam emails

We perform the above calculations for each class, in this case Spam and Not Spam. Whichever probability yields the greater number becomes our predicted class.
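The counting above can be sketched in a few lines of plain Python. The toy corpus and the function names (`train`, `score`, `predict`) are hypothetical, chosen only for illustration. Because P(“buy my product now”) is the same for every class, the sketch leaves the denominator out when comparing classes; that does not change which class wins.

```python
from collections import Counter

# Hypothetical toy corpus; a real filter would count over many more emails.
emails = [
    ("buy my product now", "Spam"),
    ("buy this product now and save", "Spam"),
    ("now is a good time to meet", "Not Spam"),
    ("my notes from the meeting", "Not Spam"),
    ("see my product review notes", "Not Spam"),
]

def train(emails):
    """Count emails per class and word occurrences per class."""
    class_counts = Counter()
    word_counts = {}
    for body, label in emails:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(body.split())
    return class_counts, word_counts

def score(phrase, label, class_counts, word_counts):
    """P(label) times the product of P(word | label) for each word."""
    prior = class_counts[label] / sum(class_counts.values())
    total_words = sum(word_counts[label].values())
    likelihood = 1.0
    for word in phrase.split():
        likelihood *= word_counts[label][word] / total_words
    return prior * likelihood

def predict(phrase, class_counts, word_counts):
    """Return the class with the highest score."""
    return max(
        class_counts,
        key=lambda label: score(phrase, label, class_counts, word_counts),
    )

class_counts, word_counts = train(emails)
phrase = "buy my product now"
for label in class_counts:
    print(label, score(phrase, label, class_counts, word_counts))
print("Predicted:", predict(phrase, class_counts, word_counts))
```

In this toy data the word “buy” never appears in a Not Spam email, so the Not Spam score collapses to zero no matter what the other words contribute.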

Great, but where does MNB fit in? Naïve Bayes works fairly well, except when it comes across words that it has never seen before. So instead, we need to use a variant of Naïve Bayes, which is MNB.
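One common way implementations cope with never-before-seen words is additive (Laplace) smoothing, which scikit-learn's `MultinomialNB` exposes through its `alpha` parameter. The sketch below shows the idea in plain Python; the word counts and the helper name `smoothed_word_probability` are assumptions for illustration, not workshop code.

```python
from collections import Counter

def smoothed_word_probability(word, word_counts, vocabulary_size, alpha=1.0):
    """Additive smoothing: (count + alpha) / (total + alpha * vocabulary size)."""
    total = sum(word_counts.values())
    return (word_counts[word] + alpha) / (total + alpha * vocabulary_size)

# Hypothetical word counts for the Spam class.
spam_counts = Counter("buy my product now buy this product now".split())

# The vocabulary covers words seen in any class, not just Spam.
vocabulary = set(spam_counts) | {"meeting", "notes"}

seen = smoothed_word_probability("buy", spam_counts, len(vocabulary))
unseen = smoothed_word_probability("meeting", spam_counts, len(vocabulary))
print(seen, unseen)
```

Without smoothing, “meeting” would have probability zero in the Spam class and wipe out the whole product; with it, the word only gets a small, non-zero probability.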

Now, I could do an entire blog post on MNB alone but, to save you the time here, I encourage you to check out this blog post on Medium.

How It Went

We leveraged MNB because it is one of the easiest models to introduce to people who are new to machine learning, and it doesn’t require math beyond the high school level. It works great with text data and can be used as a stepping stone for understanding other similar algorithms, such as Logistic Regression. It also has the added benefit of being directly applicable to security issues. For example, GyoiThon was presented last year at Black Hat 2018 and DEF CON 26. GyoiThon is a machine learning-powered system scanner and automated exploit tool, which uses MNB as its core learning engine. On the defensive side, Fwaf is a machine learning-powered Web Application Firewall (WAF), which applies the same techniques with a Logistic Regression backend. Attendees were excited to learn how they could apply these kinds of tools in their own environments!

All in all, the workshop went off without a hitch. Originally, though, it was designed to use Google Colaboratory. Colaboratory is a free Jupyter notebook environment that runs in the cloud. It’s fantastic for writing and executing Python code in the browser and is geared towards machine learning research. Google will even provide you with a free GPU! The problem is, Google is not accessible in China.

Instead, I used Binder to host the workshop. Binder is also a free, cloud-based Jupyter notebook environment but is more geared towards general Python programming. The only thing to keep in mind is that Binder is a temporary environment, so the instance needed to be rebuilt on each iteration of the workshop. Which, for a workshop, worked out quite nicely.

Even though DEF CON China just finished, I can’t wait for next year. Thank you to DEF CON and Bishop Fox for allowing me the opportunity to have this amazing experience!


Until next time, China!