ChatGPT and Beyond: Generative AI in Security

The impact of generative AI, particularly models like ChatGPT, has captured the imagination of many in the security industry. However, generative AI encompasses a variety of techniques, such as large language models (LLMs), generative adversarial networks (GANs), diffusion models and autoencoders, each playing a unique role in enhancing security measures. In this article, we will explore the multifaceted applications of these technologies and their potential to revolutionize how we identify and combat threats.

Tackling Phishing With Autoencoders

Phishing attacks have become increasingly sophisticated, making them more challenging to detect using traditional security measures. This challenge has paved the way for AI models specifically trained to identify phishing patterns. These models scrutinize various attributes of emails, websites and online communications, honing their ability to differentiate between legitimate and malicious content.

A prime example of their application is in detecting Google login phishing scams. Scammers often mimic well-known login interfaces, such as those of Google, Microsoft and Facebook, to deceive users into entering their credentials. These counterfeit login screens closely resemble the authentic ones, enabling them to evade conventional image-matching algorithms. This scenario is where autoencoder-based deep neural network architectures prove advantageous.

An autoencoder is a type of artificial neural network designed to learn efficient data codings without supervision. Its defining feature is its ability to learn a compressed, low-dimensional representation of data and then reconstruct it as output.

The structure of an autoencoder comprises two main components: the encoder and the decoder. The encoder compresses the input into a latent-space representation, while the decoder reconstructs the input data from this encoded form as accurately as possible.
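To make this concrete, here is a minimal sketch of the encoder-decoder structure in PyTorch. The class name, layer sizes and input dimension are illustrative assumptions, not a reference implementation:

```python
import torch
import torch.nn as nn

class ScreenshotAutoencoder(nn.Module):
    """Minimal fully connected autoencoder for flattened screenshot vectors."""

    def __init__(self, input_dim=64 * 64, latent_dim=32):
        super().__init__()
        # Encoder: compress the input into a low-dimensional latent code.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: reconstruct the input from the latent code.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        latent = self.encoder(x)
        return self.decoder(latent), latent
```

Training typically minimizes a reconstruction loss, such as the mean squared error between the input and the decoder's output, which forces the latent code to capture the input's salient structure.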

In phishing detection, we feed a suspected phishing screen through the autoencoder and compare its compressed representation against those of known legitimate login interfaces. With adequate training, the autoencoder learns to identify phishing screens and correlate them with the legitimate brands they imitate.
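One plausible way to operationalize this, assuming an autoencoder like the sketch above has been trained on screenshots of legitimate login pages (the cosine-similarity metric and the threshold value are illustrative choices):

```python
import torch
import torch.nn.functional as F

def match_login_screen(model, suspect, legit_templates, threshold=0.9):
    """Compare a suspected screen's latent code against known legitimate logins.

    suspect: flattened screenshot tensor of shape (input_dim,)
    legit_templates: tensor of shape (n_brands, input_dim)
    Returns the index of the best-matching brand, or None if nothing matches.
    """
    model.eval()
    with torch.no_grad():
        _, z_suspect = model(suspect.unsqueeze(0))
        _, z_legit = model(legit_templates)
    sims = F.cosine_similarity(z_suspect, z_legit)  # one score per brand
    best = sims.argmax().item()
    # A screen that closely matches a known brand's login page while being
    # served from a domain that brand does not own is a strong phishing signal.
    return best if sims[best] >= threshold else None
```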

Detecting Spurious Domain Names With LLMs

LLMs have significantly enhanced a variety of language-related tasks, and their effectiveness is further amplified when they are fine-tuned for specific tasks. Consequently, the trend is shifting towards customizing foundational LLMs through fine-tuning. For example, fine-tuning an LLM for classification tasks tailors it for domain-specific predictions. There are multiple methods to fine-tune an LLM, which is a transformer-based model. Approach A employs a labeled dataset to generate outputs from the frozen LLM that subsequently train a separate classifier. Approach B adds dense layers on top of the transformer, with only those new weights refined during fine-tuning. Approach C retrains both the transformer and the dense layers. Moving from Approach A to Approach C generally improves performance but also increases computational demands.
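As a sketch of how Approaches B and C differ in practice, using the Hugging Face Transformers library; the BERT-style model name is an illustrative assumption, not a recommendation:

```python
from transformers import AutoModelForSequenceClassification

# A pretrained encoder with a freshly added classification head.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

# Approach B: freeze the transformer body and train only the added head.
for param in model.base_model.parameters():
    param.requires_grad = False

# Approach C: leave everything trainable and retrain end to end,
# which costs substantially more compute.
# for param in model.parameters():
#     param.requires_grad = True
```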

In practical scenarios, fine-tuning has significant applications in cybersecurity, particularly in identifying spurious domain names, which are often created using domain generation algorithms. Employing a labeled dataset to fine-tune an LLM yields performance that substantially surpasses traditional models. This method capitalizes on the deep language comprehension capabilities of LLMs, enabling them to make precise predictions in specialized fields. Furthermore, this strategy is highly scalable to other security-focused language tasks, requiring only a switch of the labeled dataset for adaptation.
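A minimal sketch of one such fine-tuning step, treating each domain name as a short text sequence; the model name, example domains, labels and hyperparameters are illustrative assumptions:

```python
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)
optimizer = AdamW(model.parameters(), lr=2e-5)

# Illustrative labeled examples: 0 = legitimate, 1 = algorithmically generated.
domains = ["accounts.google.com", "xjwqpzrk-login.info"]
labels = torch.tensor([0, 1])

# Tokenize the domain strings like any other short text.
batch = tokenizer(domains, padding=True, truncation=True, return_tensors="pt")

model.train()
optimizer.zero_grad()
outputs = model(**batch, labels=labels)  # HF models return a loss when labels are passed
outputs.loss.backward()
optimizer.step()  # a real run would loop over a full labeled corpus for several epochs
```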

Using GANs for Private Synthetic Data

GANs represent a class of neural networks renowned for their ability to learn and replicate the distribution of training data. This capability enables them to generate new data that closely mirrors the original. GANs have achieved significant acclaim in the realm of image generation, creating images so lifelike they are often indistinguishable from actual photographs. However, their potential extends well beyond image creation.

A basic GAN architecture consists of two main components: the generator and the discriminator. These two models engage in a competitive zero-sum game: the generator strives to produce increasingly realistic data, while the discriminator assesses the generator's output and tries to discern real data from synthetic. Each round of this iterative contest sharpens both models.
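A minimal sketch of this adversarial training loop in PyTorch, with toy network sizes and random stand-in data in place of a real dataset:

```python
import torch
import torch.nn as nn

latent_dim, data_dim, batch = 16, 8, 32  # toy sizes for illustration

generator = nn.Sequential(
    nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
discriminator = nn.Sequential(
    nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

g_opt = torch.optim.Adam(generator.parameters(), lr=1e-3)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
bce = nn.BCELoss()

real_batch = torch.randn(batch, data_dim)  # stand-in for real training data

for step in range(100):
    # Discriminator turn: push real samples toward 1, fakes toward 0.
    fake = generator(torch.randn(batch, latent_dim)).detach()
    d_loss = (bce(discriminator(real_batch), torch.ones(batch, 1))
              + bce(discriminator(fake), torch.zeros(batch, 1)))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator turn: try to make the discriminator label fakes as real.
    fake = generator(torch.randn(batch, latent_dim))
    g_loss = bce(discriminator(fake), torch.ones(batch, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```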

One of the groundbreaking applications of GANs is in the generation of tabular data that not only adheres to the original data distribution but also incorporates strategic perturbations to ensure privacy. This synthetic data can be invaluable for training new models, particularly in scenarios where original data is scarce or sensitive. This capability of GANs opens new doors for robust data analysis and model training, offering a blend of realism and privacy.
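Libraries such as CTGAN, from the Synthetic Data Vault project, package this idea for tabular data. A sketch of its typical usage with a hypothetical table of network-flow records; note that a plain GAN does not by itself guarantee formal privacy, and differentially private training variants exist where stronger guarantees are needed:

```python
import pandas as pd
from ctgan import CTGAN  # pip install ctgan

# Hypothetical sensitive table; a real run needs far more rows than this stub.
real_data = pd.DataFrame({
    "bytes_sent": [1200, 560, 98000],
    "protocol": ["tcp", "udp", "tcp"],
    "is_malicious": [0, 0, 1],
})

model = CTGAN(epochs=300)
model.fit(real_data, discrete_columns=["protocol", "is_malicious"])

# Synthetic rows that follow the learned distribution and can be used for
# model training or sharing without exposing the original records.
synthetic = model.sample(1000)
```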

Future Outlook

In conclusion, the application of generative AI in security is a game-changer, offering novel solutions to pressing challenges in cybersecurity. The future of cybersecurity is intertwined with advances in generative AI. The development of these technologies must be guided by ethical principles, but it is an exciting frontier that holds immense promise for a more secure tomorrow.

Special thanks to co-contributor Kumar Sharad, senior threat researcher.

Abhinav Mishra

Abhinav is a Principal Applied Scientist on the Splunk Machine Learning for Security (SMLS) team. Before Splunk, he worked at Yahoo! Research. He graduated with an MS from IIT Kanpur and joined the Georgia Tech PhD program researching graph algorithms and machine learning. In his free time, he enjoys hiking, biking, and nature photography.