How was the dataset from TowerBlocks read into the TowerInstruct
#3
by
alvations
- opened
Since the data from TowerBlocks varies depending on the task, is there a particular way the data is read for the supervised fine-tuning?
E.g. from the NER task, we have:
{'conversations': [{'from': 'human',
'value': 'Study this taxonomy for classifying named entities:\n- Product (Consumer products such as food, drinks, clothing, and vehicles)\n- Location (Location or physical facilities)\n- Group (Groups of people, organizations, corporations or other entities)\n- Medical (Entities from the medical domain, including diseases, symptoms, and medications)\n- Person (Names of people)\n- CreativeWorks (Titles of creative works like movie, song, and book titles). Identify all named entities in the following tokens:\n["el", "republicano", "emilio", "castelar", "manifestaría", "al", "respecto", ":"]\nAdditionally, you should add B- to the first token of a given entity and I- to subsequent ones if they exist. For tokens that are not named entities, mark them as O.\nAnswer: '},
{'from': 'gpt',
'value': '[("el", "O"), ("republicano", "O"), ("emilio", "B-Person"), ("castelar", "I-Person"), ("manifestaría", "O"), ("al", "O"), ("respecto", "O"), (":", "O")]'}],
'lang': 'es',
'split': 'dev',
'dataset': 'multiconer2023',
'task': 'named_entity_recognition'}
Would the input to the tokenizer be something like:
source = tokenizer( row['conversations']['from']['human']['value'] )
target = tokenizer( row['conversations']['from']['gpt']['value'] )
In the above example, does gpt
mean the system's output? It is not referring to any of the OpenAI's model right?
Thank you in advance for the clarification!
Regards,
Liling
P/S: Thank you for compiling and sharing the data collection for TowerLLM
Yes, something like
X = tokenizer( row['conversations'][0]['from']['human']['value'] )
Y = tokenizer( row['conversations'][0]['from']['gpt']['value'] )
is correct. Tower was trained for next-token prediction on the tokens of Y, given X.
gpt
is just what the model is trained to predict; there is no connection with OpenAI.
Thank you for the clarification!
jmprcp
changed discussion status to
closed