CmdCaliper-small

[Dataset] [Code] [Paper]

The CmdCaliper models, developed by CyCraft AI Lab, are the first embedding models designed specifically for command lines. Our evaluation results demonstrate that even the smallest version of CmdCaliper, with approximately 30 million parameters, can outperform state-of-the-art sentence embedding models that have over 10 times more parameters (335 million) across various command-line-specific tasks.

CmdCaliper is available in three sizes: CmdCaliper-large, CmdCaliper-base, and CmdCaliper-small, providing flexible options for a range of hardware resource constraints.
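All three variants load the same way; a quick sketch (the CmdCaliper-large hub ID is assumed to follow the same CyCraftAI naming pattern as the small and base models used elsewhere in this card):

from sentence_transformers import SentenceTransformer

# Hub IDs per size; the large ID is assumed to follow the same
# CyCraftAI/ naming pattern as the small and base models.
MODEL_IDS = {
    'small': 'CyCraftAI/CmdCaliper-small',  # ~0.03B parameters
    'base': 'CyCraftAI/CmdCaliper-base',    # ~0.11B parameters
    'large': 'CyCraftAI/CmdCaliper-large',  # ~0.34B parameters
}

model = SentenceTransformer(MODEL_IDS['small'])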

CmdCaliper was introduced in the EMNLP 2024 paper titled "CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research".

Metric

| Method | Model Parameters | MRR@3 | MRR@10 | Top@3 | Top@10 |
|---|---|---|---|---|---|
| Levenshtein distance | - | 71.23 | 72.45 | 74.99 | 81.83 |
| Word2Vec | - | 45.83 | 46.93 | 48.49 | 54.86 |
| E5-small | Small (0.03B) | 81.59 | 82.60 | 84.97 | 90.59 |
| GTE-small | Small (0.03B) | 82.35 | 83.28 | 85.39 | 90.84 |
| CmdCaliper-small | Small (0.03B) | 86.81 | 87.78 | 89.21 | 94.76 |
| BGE-en-base | Base (0.11B) | 79.49 | 80.41 | 82.33 | 87.39 |
| E5-base | Base (0.11B) | 83.16 | 84.07 | 86.14 | 91.56 |
| GTR-base | Base (0.11B) | 81.55 | 82.51 | 84.54 | 90.10 |
| GTE-base | Base (0.11B) | 78.20 | 79.07 | 81.22 | 86.14 |
| CmdCaliper-base | Base (0.11B) | 87.56 | 88.47 | 90.27 | 95.26 |
| BGE-en-large | Large (0.34B) | 84.11 | 84.92 | 86.64 | 91.09 |
| E5-large | Large (0.34B) | 84.12 | 85.04 | 87.32 | 92.59 |
| GTR-large | Large (0.34B) | 88.09 | 88.68 | 91.27 | 94.58 |
| GTE-large | Large (0.34B) | 84.26 | 85.03 | 87.14 | 91.41 |
| CmdCaliper-large | Large (0.34B) | 89.12 | 89.91 | 91.45 | 95.65 |
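Here MRR@k is the mean reciprocal rank of the first correct match within the top k results, and Top@k is the fraction of queries with at least one correct match in the top k, both reported as percentages. A minimal sketch of these metrics, assuming the candidates for each query are already ranked (the exact evaluation protocol is described in the paper):

def mrr_at_k(ranked_hits, k):
    # ranked_hits[i][j] is True if the j-th ranked candidate
    # for query i is a correct match.
    total = 0.0
    for hits in ranked_hits:
        for rank, hit in enumerate(hits[:k], start=1):
            if hit:
                total += 1.0 / rank
                break
    return 100 * total / len(ranked_hits)

def top_at_k(ranked_hits, k):
    # Fraction of queries with at least one correct match in the top k.
    return 100 * sum(any(hits[:k]) for hits in ranked_hits) / len(ranked_hits)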

Usage

HuggingFace Transformers

import torch.nn.functional as F
from torch import Tensor
from transformers import AutoTokenizer, AutoModel

def average_pool(last_hidden_states: Tensor,
                 attention_mask: Tensor) -> Tensor:
    # Zero out hidden states at padding positions, then mean-pool over
    # the real tokens to get one fixed-size embedding per input.
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# Raw strings keep the Windows backslashes intact
input_texts = [
    'cronjob schedule daily 00:00 ./program.exe',
    r'schtasks /create /tn "TaskName" /tr "C:\program.exe" /sc daily /st 00:00',
    r'xcopy C:\Program Files (x86) E:\Program Files /E /H /K /O /X',
]

tokenizer = AutoTokenizer.from_pretrained("CyCraftAI/CmdCaliper-base")
model = AutoModel.from_pretrained("CyCraftAI/CmdCaliper-base")

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')

outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
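Because the embeddings are L2-normalized, the matrix product above yields cosine similarities scaled by 100. In this example, the schtasks command expresses the same daily scheduling intent as the cron-style query, so it should score noticeably higher than the unrelated xcopy command.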

Sentence Transformers

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("CyCraftAI/CmdCaliper-base")
# Run inference
sentences = [
    'cronjob schedule daily 00:00 ./program.exe',
    r'schtasks /create /tn "TaskName" /tr "C:\program.exe" /sc daily /st 00:00',
    r'xcopy C:\Program Files (x86) E:\Program Files /E /H /K /O /X',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 768]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
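For retrieval-style use, encode a query and a corpus of known commands separately, then rank the corpus by similarity. A minimal sketch reusing the model loaded above (the corpus commands are illustrative):

query_embedding = model.encode(['cronjob schedule daily 00:00 ./program.exe'])
corpus = [
    r'schtasks /create /tn "TaskName" /tr "C:\program.exe" /sc daily /st 00:00',
    r'xcopy C:\Program Files (x86) E:\Program Files /E /H /K /O /X',
]
corpus_embeddings = model.encode(corpus)

# Rank corpus commands by similarity to the query and take the best match
scores = model.similarity(query_embedding, corpus_embeddings)  # shape [1, 2]
print(corpus[scores.argmax().item()])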

Limitation

This model focuses exclusively on Windows command lines. Additionally, inputs longer than 512 tokens are truncated.
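Truncation happens without warning, so for unusually long command lines you may want to check the token count up front; a minimal sketch using the tokenizer from the Usage section:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CyCraftAI/CmdCaliper-base")

cmd = r'schtasks /create /tn "TaskName" /tr "C:\program.exe" /sc daily /st 00:00'
n_tokens = len(tokenizer(cmd)['input_ids'])
if n_tokens > 512:
    print(f'Command will be truncated ({n_tokens} tokens)')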

Citation

@inproceedings{huang2024cmdcaliper,
  title={CmdCaliper: A Semantic-Aware Command-Line Embedding Model and Dataset for Security Research},
  author={SianYao Huang and ChengLin Yang and CheYu Lin and ChunYing Huang},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing},
  year={2024}
}