The field of cybersecurity is evolving rapidly. Experts need to be informed
about past, current, and, in the best case, upcoming threats, because attacks
are becoming more advanced, targets larger, and systems more complex. As this
cannot be addressed manually, cybersecurity experts have to rely on machine
learning techniques. In the textual domain, pre-trained language models like
BERT have been shown to be helpful by providing a good baseline for further
fine-tuning. However, due to the domain knowledge and the many technical terms
in cybersecurity, general language models might miss the gist of textual
information and hence do more harm than good. For this reason, we create a
high-quality dataset and present a language model specifically tailored to the
cybersecurity domain, which can serve as a basic building block for
cybersecurity systems that deal with natural language. The model is compared
with other models on 15 different domain-dependent extrinsic and intrinsic
tasks as well as on general tasks from the SuperGLUE benchmark. On the one
hand, the results of the intrinsic tasks show that our model improves the
internal representation space of words compared to the other models. On the
other hand, the extrinsic, domain-dependent tasks, consisting of sequence
tagging and classification, show that the model outperforms the others in
specific application scenarios. Furthermore, we show that our approach against
catastrophic forgetting works, as the model retains the previously learned,
domain-independent knowledge. The dataset used and the trained model are made
publicly available.
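As a rough illustration of the kind of domain adaptation described here, the
following is a minimal sketch of domain-adaptive pretraining with the Hugging
Face transformers and datasets libraries: a general BERT checkpoint is further
trained with the masked-language-model objective on a cybersecurity corpus.
The corpus file cysec_corpus.txt, the output path cysec-bert, and all
hyperparameters are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch: continue masked-language-model (MLM) pretraining of a
# general BERT checkpoint on a cybersecurity corpus.
# Corpus file, output path, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    BertForMaskedLM,
    BertTokenizerFast,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

# Hypothetical corpus: one cybersecurity document (e.g., advisory, blog post,
# paper abstract) per line.
corpus = load_dataset("text", data_files={"train": "cysec_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Standard BERT objective: randomly mask 15% of tokens and predict them.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

# A conservative learning rate and few epochs are one common way to limit
# catastrophic forgetting; the paper's exact countermeasure may differ.
args = TrainingArguments(
    output_dir="cysec-bert",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
```

After training, the checkpoint written to cysec-bert can be loaded like any
other BERT model and fine-tuned on downstream tasks.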
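The extrinsic sequence-tagging evaluation mentioned above can likewise be
sketched. The snippet below, again an assumption-laden illustration rather
than the paper's setup, shows how such an adapted checkpoint would plug into a
token-classification pipeline; the label set is hypothetical, and the
classification head is randomly initialised until it is fine-tuned on labelled
data.

```python
# Minimal sketch: use the domain-adapted checkpoint for sequence tagging
# (token classification). Checkpoint path and tag set are hypothetical; the
# classification head still needs task-specific fine-tuning.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-MALWARE", "I-MALWARE"]  # hypothetical tag set

tokenizer = AutoTokenizer.from_pretrained("cysec-bert")
model = AutoModelForTokenClassification.from_pretrained(
    "cysec-bert", num_labels=len(labels)
)
model.eval()

text = "The Emotet campaign abused CVE-2017-11882 in malicious attachments."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, num_labels)

# Map each sub-word token to its highest-scoring tag.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, tag_id in zip(tokens, logits.argmax(dim=-1)[0]):
    print(f"{token}\t{labels[int(tag_id)]}")
```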
