Word2Bezbar: Word2Vec Models for French Rap Lyrics

Overview

Word2Bezbar is a family of Word2Vec models trained on French rap lyrics sourced from Genius. Tokenization was done with NLTK's French word_tokenize function, after a preprocessing step that removes French oral contractions. The dataset was 323 MB, corresponding to 77M tokens.
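
As a rough illustration of the preprocessing described above, the sketch below rewrites a few oral contractions and then splits the text into tokens. Both parts are assumptions made to keep the example dependency-free: the contraction table is hypothetical (the rules actually used are not documented here), and a simple regex tokenizer stands in for NLTK's word_tokenize, so its output will not match NLTK's exactly.

```python
import re

# Hypothetical contraction table; the rules actually used for Word2Bezbar
# are not documented in this README.
ORAL_CONTRACTIONS = {
    "j'suis": "je suis",
    "t'es": "tu es",
    "j'sais": "je sais",
}

def expand_oral_contractions(text):
    """Replace each known oral contraction with its full form (case-insensitive)."""
    for short, full in ORAL_CONTRACTIONS.items():
        pattern = r"\b" + re.escape(short) + r"\b"
        text = re.sub(pattern, full, text, flags=re.IGNORECASE)
    return text

def simple_tokenize(text):
    """Regex stand-in for NLTK's word_tokenize (an assumption, not the real tokenizer).

    Matches words, optionally joined by hyphens or apostrophes, or single
    punctuation marks.
    """
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", text)

tokens = simple_tokenize(expand_oral_contractions("J'suis dans le bendo, t'es où ?"))
print(tokens)
```

Normalizing contractions before tokenizing means that spoken and written forms of the same word end up sharing one vector instead of splitting their occurrences across several tokens.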

The model captures semantic relationships between words in the context of French rap, making it a useful tool for studying French slang and analyzing music lyrics.

Model Details

This is the small version of the model.

Parameter Value
Dimensionality 100
Window Size 5
Epochs 10
Algorithm CBOW
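
For readers who want to train a comparable model, the parameters above map onto Gensim 4.x Word2Vec arguments roughly as follows. This is a sketch, not the original training script: the train_word2vec helper name is hypothetical, and any parameter not listed in the table (for example min_count) is left at Gensim's default, which may differ from what was actually used.

```python
# Table values expressed as Gensim 4.x Word2Vec keyword arguments.
W2V_PARAMS = {
    "vector_size": 100,  # Dimensionality
    "window": 5,         # Window Size
    "epochs": 10,        # Epochs
    "sg": 0,             # Algorithm: 0 selects CBOW, 1 would select skip-gram
}

def train_word2vec(sentences):
    """Train a model on an iterable of token lists (hypothetical helper).

    Gensim is imported lazily so the parameter mapping above can be
    inspected even where Gensim is not installed.
    """
    from gensim.models import Word2Vec
    return Word2Vec(sentences=sentences, **W2V_PARAMS)
```

Here sentences would be the tokenized lyrics, one list of tokens per line or per song.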

Versions

This model was trained with the following software versions:

Requirement Version
Python 3.8.5
Gensim library 4.3.2
NLTK library 3.8.1

Installation

  1. Install Required Python Libraries:

    pip install gensim
    
  2. Clone the Repository:

    git clone https://github.com/rapminerz/Word2Bezbar-small.git
    
  3. Navigate to the Model Directory:

    cd Word2Bezbar-small
    

Loading the Model

To load the Word2Bezbar Word2Vec model, use the following Python code:

import gensim

# Load the Word2Vec model
model = gensim.models.Word2Vec.load("word2vec.model")

Using the Model

Once the model is loaded, you can use it as shown:

  1. To get the words most similar to a given word
model.wv.most_similar("bendo")
[('binks', 0.8920747637748718),
 ('bando', 0.8460732698440552),
 ('hood', 0.8299438953399658),
 ('tieks', 0.8264378309249878),
 ('hall', 0.817583441734314),
 ('secteur', 0.8145656585693359),
 ('barrio', 0.809047281742096),
 ('block', 0.793493390083313),
 ('bâtiment', 0.7826434969902039),
 ('bloc', 0.7753982543945312)]

model.wv.most_similar("kichta")
[('liasse', 0.878665566444397),
 ('sse-lia', 0.8552991151809692),
 ('kishta', 0.8535938262939453),
 ('kich', 0.7646669149398804),
 ('skalape', 0.7576569318771362),
 ('moula', 0.7466527223587036),
 ('valise', 0.7429592609405518),
 ('sacoche', 0.7324921488761902),
 ('mallette', 0.7247079014778137),
 ('re-pai', 0.7060815095901489)]
  2. To find the word that doesn't match in a list of words
model.wv.doesnt_match(["racli","gow","gadji","fimbi","boug"])
'boug'

model.wv.doesnt_match(["Zidane","Mbappé","Ronaldo","Messi","Jordan"])
'Jordan'
  3. To find the similarity between two words
model.wv.similarity("kichta", "moula")
0.7466528

model.wv.similarity("bonheur", "moula")
0.16985293
  4. Or even get the vector representation of a word
model.wv['ekip']
array([ 1.4757039e-01,  ... 1.1260221e+00],
      dtype=float32)
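
The similarity scores shown above are cosine similarities between word vectors. The sketch below reproduces that computation on made-up 3-dimensional vectors (the real vectors have 100 dimensions) and adds a simplified version of the odd-one-out idea: pick the word least similar to the mean of the normalized vectors. The toy vectors and the helper names are illustrative assumptions, not values or code taken from the model.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def odd_one_out(vectors):
    """Return the word whose unit vector is least similar to the mean unit vector."""
    units = {word: unit(vec) for word, vec in vectors.items()}
    dim = len(next(iter(units.values())))
    mean = [sum(u[i] for u in units.values()) / len(units) for i in range(dim)]
    return min(units, key=lambda word: cosine(units[word], mean))

# Made-up toy vectors, chosen so that the three money-related words point the same way.
toy = {
    "kichta": [0.9, 0.1, 0.0],
    "moula": [0.8, 0.2, 0.1],
    "liasse": [0.85, 0.15, 0.05],
    "bonheur": [0.0, 0.2, 0.9],
}

print(cosine(toy["kichta"], toy["moula"]))   # close to 1: similar directions
print(cosine(toy["bonheur"], toy["moula"]))  # much lower: different directions
print(odd_one_out(toy))
```

Because cosine similarity depends only on the angle between vectors, a frequent word and a rare word can still score as close neighbors if they appear in similar contexts.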

Purpose and Disclaimer

This model is designed for academic and research purposes only. It is not intended for commercial use. The creators of this model do not endorse or promote any specific views or opinions that may be represented in the dataset.

Please mention @RapMinerz if you use our models.

Contact

For any questions or issues, please contact the repository owner, RapMinerz, at [email protected].
