CmdCaliper-small
[Dataset] [Code] [Paper]
The CmdCaliper models, developed by CyCraft AI Lab, are the first embedding models designed specifically for command lines. Our evaluation shows that even the smallest version of CmdCaliper, with roughly 30 million parameters, outperforms state-of-the-art sentence embedding models that have over 10 times more parameters (335 million) across a range of command-line-specific tasks.
CmdCaliper comes in three sizes, CmdCaliper-large, CmdCaliper-base, and CmdCaliper-small, providing flexible options for different hardware resource constraints.
CmdCaliper was introduced in the EMNLP 2024 paper titled "CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research".
Metric
| Methods | Model Parameters | MRR@3 | MRR@10 | Top@3 | Top@10 |
|---|---|---|---|---|---|
| Levenshtein distance | - | 71.23 | 72.45 | 74.99 | 81.83 |
| Word2Vec | - | 45.83 | 46.93 | 48.49 | 54.86 |
| E5-small | Small (0.03B) | 81.59 | 82.60 | 84.97 | 90.59 |
| GTE-small | Small (0.03B) | 82.35 | 83.28 | 85.39 | 90.84 |
| CmdCaliper-small | Small (0.03B) | 86.81 | 87.78 | 89.21 | 94.76 |
| BGE-en-base | Base (0.11B) | 79.49 | 80.41 | 82.33 | 87.39 |
| E5-base | Base (0.11B) | 83.16 | 84.07 | 86.14 | 91.56 |
| GTR-base | Base (0.11B) | 81.55 | 82.51 | 84.54 | 90.10 |
| GTE-base | Base (0.11B) | 78.20 | 79.07 | 81.22 | 86.14 |
| CmdCaliper-base | Base (0.11B) | 87.56 | 88.47 | 90.27 | 95.26 |
| BGE-en-large | Large (0.34B) | 84.11 | 84.92 | 86.64 | 91.09 |
| E5-large | Large (0.34B) | 84.12 | 85.04 | 87.32 | 92.59 |
| GTR-large | Large (0.34B) | 88.09 | 88.68 | 91.27 | 94.58 |
| GTE-large | Large (0.34B) | 84.26 | 85.03 | 87.14 | 91.41 |
| CmdCaliper-large | Large (0.34B) | 89.12 | 89.91 | 91.45 | 95.65 |
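
Here, MRR@k is the mean reciprocal rank of the first relevant command within the top k retrieved results, and Top@k is the fraction of queries whose relevant command appears anywhere in the top k. The following is an illustrative sketch of how such metrics are typically computed, not the paper's evaluation script; the names `ranked_lists` (per-query candidate IDs sorted by descending similarity) and `gold` (the single relevant ID per query) are hypothetical.

# Illustrative sketch: MRR@k and Top@k over ranked retrieval results.
def mrr_at_k(ranked_lists, gold, k):
    total = 0.0
    for ranking, target in zip(ranked_lists, gold):
        # Reciprocal rank of the relevant item if it appears in the top k, else 0
        for rank, candidate in enumerate(ranking[:k], start=1):
            if candidate == target:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

def top_at_k(ranked_lists, gold, k):
    # Fraction of queries whose relevant item appears in the top k
    hits = sum(target in ranking[:k] for ranking, target in zip(ranked_lists, gold))
    return hits / len(ranked_lists)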
Usage
HuggingFace Transformers
import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel


def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out padding positions before averaging over the sequence dimension
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]


# Raw strings keep the Windows backslashes intact
input_texts = [
    'cronjob schedule daily 00:00 ./program.exe',
    r'schtasks /create /tn "TaskName" /tr "C:\program.exe" /sc daily /st 00:00',
    r'xcopy C:\Program Files (x86) E:\Program Files /E /H /K /O /X',
]

tokenizer = AutoTokenizer.from_pretrained("CyCraftAI/CmdCaliper-base")
model = AutoModel.from_pretrained("CyCraftAI/CmdCaliper-base")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
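
Because the embeddings are L2-normalized, the printed scores are cosine similarities between the first command line and each of the remaining ones, scaled by 100. The semantically related pair (the cron-style scheduling command and the schtasks command) is expected to score noticeably higher than the unrelated xcopy command.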
Sentence Transformers
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("CyCraftAI/CmdCaliper-base")

# Run inference; raw strings keep the Windows backslashes intact
sentences = [
    'cronjob schedule daily 00:00 ./program.exe',
    r'schtasks /create /tn "TaskName" /tr "C:\program.exe" /sc daily /st 00:00',
    r'xcopy C:\Program Files (x86) E:\Program Files /E /H /K /O /X',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
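
Note that model.similarity computes the pairwise similarity matrix for the given embeddings (cosine similarity by default, unless the model configuration specifies a different similarity function), so the diagonal entries are 1 and each off-diagonal entry compares one command line against another.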
Limitation
This model focuses exclusively on Windows command lines. Additionally, inputs longer than 512 tokens are truncated to that maximum length.
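
If you work with very long command lines, you can check the token count before encoding to see whether the 512-token limit will truncate them. A minimal sketch using the Hugging Face tokenizer (the command string below is just the example from the usage section):

from transformers import AutoTokenizer

# Count tokens to see whether a command line would hit the 512-token limit
tokenizer = AutoTokenizer.from_pretrained("CyCraftAI/CmdCaliper-base")

cmd = r'xcopy C:\Program Files (x86) E:\Program Files /E /H /K /O /X'
num_tokens = len(tokenizer(cmd)['input_ids'])
if num_tokens > 512:
    print(f"Command line uses {num_tokens} tokens and will be truncated to 512.")
else:
    print(f"Command line uses {num_tokens} tokens; no truncation.")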
Citation
@inproceedings{huang2024cmdcaliper,
  title={CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research},
  author={SianYao Huang and ChengLin Yang and CheYu Lin and ChunYing Huang},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  year={2024}
}