Turkish GPT2 Model Finetuned

Türkçe GPT2 Modeli

Model description

This is a GPT2-Small English based model finetuned and additionaly trainied with Wikipedia Articles in Turkish as of 28-10-2020

Live demo based on this work at : https://www.metayazar.com/

Fine tuned writer on this model: https://huggingface.co/gorkemgoknar/gpt2-turkish-writer

Work has been done on Pierre Guillou tutorial as on this page. (https://github.com/piegu/fastai-projects/blob/master/finetuning-English-GPT2-any-language-Portuguese-HuggingFace-fastaiv2.ipynb)

Code is converted to work with Fastai 2.X .

Using Google Colab for training.

Additional tutorial and source will be in https://github.com/gorkemgoknar in later stage.

Current accuracy 33 % , Perplexity : 51.88

Models are available:

Intended uses & limitations

How to use

Install

from transformers import AutoTokenizer, AutoModelWithLMHead
import torch

tokenizer = AutoTokenizer.from_pretrained("gorkemgoknar/gpt2-small-turkish")
model = AutoModelWithLMHead.from_pretrained("gorkemgoknar/gpt2-small-turkish")

# Get sequence length max of 1024
tokenizer.model_max_length=1024 

model.eval()  # disable dropout (or leave in train mode to finetune)

Generate 1 word

# input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")

# model output
outputs = model(**inputs, labels=inputs["input_ids"])
loss, logits = outputs[:2]
predicted_index = torch.argmax(logits[0, -1, :]).item()
predicted_text = tokenizer.decode([predicted_index])

# results
print('input text:', text)
print('predicted text:', predicted_text)

# input text: 
# predicted text:  

Generate Full Sequence

# input sequence
text = "Bu yazıyı bilgisayar yazdı."
inputs = tokenizer(text, return_tensors="pt")

# model output using Top-k sampling text generation method
sample_outputs = model.generate(inputs.input_ids,
                                pad_token_id=50256,
                                do_sample=True, 
                                max_length=50, # put the token number you want
                                top_k=40,
                                num_return_sequences=1)

# generated sequence
for i, sample_output in enumerate(sample_outputs):
    print(">> Generated text {}\\\\
\\\\
{}".format(i+1, tokenizer.decode(sample_output.tolist())))

# >> Generated text
#    

Limitations and bias

The training data used for this model come from Turkish Wikipedia. We know it contains a lot of unfiltered content from the internet, which is far from neutral.

Training data

Wikipedia Turkish article dump as of 28-10-2020

Training procedure

Eval results

epoch\\t train_loss\\t valid_loss\\t accuracy\\t perplexity\\t time
0\\t 4.777015\\t 4.621834\\t 0.292547\\t 101.680367\\t 2:42:05
1\\t 4.509412\\t 4.403999\\t 0.305574\\t 81.777267\\t 1:09:38
2\\t 4.169529\\t 4.120755\\t 0.324908\\t 61.605747\\t 1:07:45
3\\t 4.293973\\t 4.177899\\t 0.317211\\t 65.228653\\t 1:07:02
4\\t 4.049848\\t 3.949103\\t 0.338347\\t 51.888783\\t 1:05:53

#Epoch 0 on Tesla T4, others on V100


Downloads last month
190
Inference Examples
This model does not have enough activity to be deployed to Inference API (serverless) yet. Increase its social visibility and check back later, or deploy to Inference Endpoints (dedicated) instead.