{ "cells": [ { "cell_type": "markdown", "id": "b3fc8862-0c2b-45f3-badf-e591c7b8f891", "metadata": {}, "source": [ "# Token Count Exploration\n", "It would be really useful for deployment to know our input/output token expectations. We know that our output is quite verbose relative to the input since the explanations are long. With a model like `mistralai/Mistral-7B-Instruct-v0.3` I'd expect that our real output with explanations will be shorter. That's perfect since our training data will give us a reliable upper bound, which is great to prevent truncation.\n", "\n", "Let's figure out how to split input and output tokens, and then we can build a histogram." ] }, { "cell_type": "markdown", "id": "3a501f2f-ba98-4c0f-aa30-f4768bd80dcb", "metadata": {}, "source": [ "## Config" ] }, { "cell_type": "code", "execution_count": 1, "id": "5d0bd22f-293e-4c15-9dfe-8070553f42b5", "metadata": { "tags": [] }, "outputs": [], "source": [ "INPUT_DATASET = 'derek-thomas/labeled-multiple-choice-explained-falcon-tokenized'\n", "BASE_MODEL = 'tiiuae/Falcon3-7B-Instruct'" ] }, { "cell_type": "markdown", "id": "c1c3b00c-17bf-4b00-9ee7-d10c598c53e9", "metadata": {}, "source": [ "## Setup" ] }, { "cell_type": "code", "execution_count": 2, "id": "af2330f3-403c-401c-8028-46ae4971546e", "metadata": {}, "outputs": [ { "data": { "application/vnd.jupyter.widget-view+json": { "model_id": "2c216b161c3340ada0223141da2cc441", "version_major": 2, "version_minor": 0 }, "text/plain": [ "VBox(children=(HTML(value='