davanstrien HF staff committed on
Commit
19c43aa
·
1 Parent(s): 6709ec5

feat: Add Hugging Face Hub integration for uploading database file

Files changed (1)
  1. dataset_search_client_notebook.ipynb +520 -0
dataset_search_client_notebook.ipynb ADDED
@@ -0,0 +1,520 @@
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "Kq8_kBUjxY3B"
+ },
+ "source": [
+ "# Dataset Search Client Documentation\n",
+ "\n",
+ "This notebook demonstrates how to use the [librarian-bots/dataset-column-search-api](https://huggingface.co/spaces/librarian-bots/dataset-column-search-api) API to search for Hugging Face datasets by their column names."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "ArdwzeQSxY3D"
+ },
+ "source": [
+ "## Introduction\n",
+ "\n",
+ "The Hugging Face Hub hosts a vast collection of datasets for various machine learning tasks. These datasets often have different structures and column names. The [librarian-bots/dataset-column-search-api](https://huggingface.co/spaces/librarian-bots/dataset-column-search-api) API allows you to find datasets that match specific column structures, which is useful for tasks like:\n",
+ "\n",
+ "1. Finding datasets suitable for specific machine learning tasks\n",
+ "2. Identifying datasets with compatible structures for transfer learning or data augmentation\n",
+ "3. Exploring the availability of datasets with certain features or labels\n",
+ "\n",
+ "By searching based on column names, you can quickly identify datasets that fit your needs without having to manually inspect each dataset's structure."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "5KeXd86UxY3D"
+ },
+ "source": [
+ "## Setup\n",
+ "\n",
+ "First, let's import the necessary libraries and define a `DatasetSearchClient` class, which we'll use to call the API (feel free to call the API directly if preferred)."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 94,
+ "metadata": {
+ "id": "EyvEz03KxY3D"
+ },
+ "outputs": [],
+ "source": [
+ "import requests\n",
+ "from typing import List, Dict, Any, Iterator\n",
+ "\n",
+ "class DatasetSearchClient:\n",
+ "    def __init__(self, base_url: str = \"https://librarian-bots-dataset-column-search-api.hf.space\"):\n",
+ "        self.base_url = base_url\n",
+ "\n",
+ "    def search(self,\n",
+ "               columns: List[str],\n",
+ "               match_all: bool = False,\n",
+ "               page_size: int = 100) -> Iterator[Dict[str, Any]]:\n",
+ "        \"\"\"\n",
+ "        Search datasets using the provided API, automatically handling pagination.\n",
+ "\n",
+ "        Args:\n",
+ "            columns (List[str]): List of column names to search for.\n",
+ "            match_all (bool, optional): If True, match all columns. If False, match any column. Defaults to False.\n",
+ "            page_size (int, optional): Number of results per page. Defaults to 100.\n",
+ "\n",
+ "        Yields:\n",
+ "            Dict[str, Any]: Each dataset result from all pages.\n",
+ "\n",
+ "        Raises:\n",
+ "            requests.RequestException: If there's an error with the HTTP request.\n",
+ "            ValueError: If the API returns an unexpected response format.\n",
+ "        \"\"\"\n",
+ "        page = 1\n",
+ "        total_results = None\n",
+ "\n",
+ "        while total_results is None or (page - 1) * page_size < total_results:\n",
+ "            params = {\n",
+ "                \"columns\": columns,\n",
+ "                \"match_all\": str(match_all).lower(),\n",
+ "                \"page\": page,\n",
+ "                \"page_size\": page_size\n",
+ "            }\n",
+ "\n",
+ "            try:\n",
+ "                response = requests.get(f\"{self.base_url}/search\", params=params)\n",
+ "                response.raise_for_status()\n",
+ "                data = response.json()\n",
+ "\n",
+ "                if not {\"total\", \"page\", \"page_size\", \"results\"}.issubset(data.keys()):\n",
+ "                    raise ValueError(\"Unexpected response format from the API\")\n",
+ "\n",
+ "                if total_results is None:\n",
+ "                    total_results = data['total']\n",
+ "\n",
+ "                for dataset in data['results']:\n",
+ "                    yield dataset\n",
+ "\n",
+ "                page += 1\n",
+ "\n",
+ "            except requests.RequestException as e:\n",
+ "                raise requests.RequestException(f\"Error connecting to the API: {str(e)}\") from e\n",
+ "            except ValueError as e:\n",
+ "                raise ValueError(f\"Error processing API response: {str(e)}\") from e\n",
+ "\n",
+ "# Create an instance of the client\n",
+ "client = DatasetSearchClient()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "mxVqxdCtxY3E"
+ },
+ "source": [
+ "## Example 1: Searching for Text Classification Datasets\n",
+ "\n",
+ "Let's start by searching for datasets that have both \"text\" and \"label\" columns, which are common in text classification tasks:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 95,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "T2wyABxrxY3E",
+ "outputId": "9541e61e-1e0d-4d8a-a5d7-1e2db117bf3c"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Datasets suitable for text classification (with 'text' and 'label' columns):\n",
+ "1. mteb/amazon_counterfactual: ['text', 'label', 'label_text']\n",
+ "2. dair-ai/emotion: ['text', 'label']\n",
+ "3. stanfordnlp/imdb: ['text', 'label']\n",
+ "4. 203427as321/articles: ['label', 'text', '__index_level_0__']\n",
+ "5. indonlp/NusaX-senti: ['id', 'text', 'lang', 'label']\n",
+ "\n",
+ "Total datasets found: 1866\n"
+ ]
+ }
+ ],
+ "source": [
+ "text_classification_columns = [\"text\", \"label\"]\n",
+ "results = client.search(text_classification_columns, match_all=True)\n",
+ "\n",
+ "print(\"Datasets suitable for text classification (with 'text' and 'label' columns):\")\n",
+ "for i, dataset in enumerate(results, 1):\n",
+ "    print(f\"{i}. {dataset['hub_id']}: {dataset['column_names']}\")\n",
+ "    if i >= 5: # Print only the first 5 as a sample\n",
+ "        break\n",
+ "\n",
+ "total_results = len(list(client.search(text_classification_columns, match_all=True)))\n",
+ "print(f\"\\nTotal datasets found: {total_results}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "al0oo4yBxY3E"
+ },
+ "source": [
+ "## Example 2: Searching for Question-Answering Datasets\n",
+ "\n",
+ "Now, let's search for datasets that could be used for question-answering tasks:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 97,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "WY9e3o0CxY3E",
+ "outputId": "f46cb86a-9df9-405a-bca9-17cac3fe5faa"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Datasets suitable for question-answering tasks (with 'question', 'answer', and 'context' columns):\n",
+ "1. hotpotqa/hotpot_qa: ['id', 'question', 'answer', 'type', 'level', 'supporting_facts', 'context']\n",
+ "2. neural-bridge/rag-dataset-12000: ['context', 'question', 'answer']\n",
+ "3. ryo0634/xquad-sampled: ['id', 'question', 'context', 'answer_sentence', 'answer']\n",
+ "4. lcw99/wikipedia-korean-20240501-1million-qna: ['question', 'answer', 'context']\n",
+ "5. virattt/financial-qa-10K: ['question', 'answer', 'context', 'ticker', 'filing']\n",
+ "\n",
+ "Total datasets found: 646\n"
+ ]
+ }
+ ],
+ "source": [
+ "qa_columns = [\"question\", \"answer\", \"context\"]\n",
+ "results = client.search(qa_columns, match_all=True)\n",
+ "\n",
+ "print(\"Datasets suitable for question-answering tasks (with 'question', 'answer', and 'context' columns):\")\n",
+ "for i, dataset in enumerate(results, 1):\n",
+ "    print(f\"{i}. {dataset['hub_id']}: {dataset['column_names']}\")\n",
+ "    if i >= 5: # Print only the first 5 as a sample\n",
+ "        break\n",
+ "\n",
+ "total_results = len(list(client.search(qa_columns, match_all=True)))\n",
+ "print(f\"\\nTotal datasets found: {total_results}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "id": "kiU3-f-OxY3E"
+ },
+ "source": [
+ "## Example 3: Searching for Instruction-Following Datasets\n",
+ "\n",
+ "Let's search for datasets that could be used for instruction-following tasks, which are common in training large language models:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 98,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "nt8SSWaRxY3F",
+ "outputId": "42460b4b-6dac-48f1-a3b2-b1504bd16686"
+ },
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "Datasets suitable for instruction-following tasks (with 'instruction', 'input', and 'output' columns):\n",
+ "1. garage-bAInd/Open-Platypus: ['input', 'output', 'instruction', 'data_source']\n",
+ "2. HuggingFaceH4/databricks_dolly_15k: ['category', 'instruction', 'input', 'output']\n",
+ "3. chargoddard/alpaca-gpt4-500: ['instruction', 'input', 'output', 'text', '__index_level_0__']\n",
+ "4. vicgalle/alpaca-gpt4: ['instruction', 'input', 'output', 'text']\n",
+ "5. llamafactory/alpaca_en: ['instruction', 'input', 'output']\n",
+ "\n",
+ "Total datasets found: 1937\n"
+ ]
+ }
+ ],
+ "source": [
+ "instruction_columns = [\"instruction\", \"input\", \"output\"]\n",
+ "results = client.search(instruction_columns, match_all=True)\n",
+ "\n",
+ "print(\"Datasets suitable for instruction-following tasks (with 'instruction', 'input', and 'output' columns):\")\n",
+ "for i, dataset in enumerate(results, 1):\n",
+ "    print(f\"{i}. {dataset['hub_id']}: {dataset['column_names']}\")\n",
+ "    if i >= 5: # Print only the first 5 as a sample\n",
+ "        break\n",
+ "\n",
+ "total_results = len(list(client.search(instruction_columns, match_all=True)))\n",
+ "print(f\"\\nTotal datasets found: {total_results}\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "# Creating collections for common dataset formats\n",
+ "\n",
+ "We can also use the API to create a Hugging Face Collection based on our search. Let's use Alpaca-formatted datasets as an example:\n",
+ "\n",
+ "alpaca\n",
+ "```\n",
+ "{\"instruction\": \"...\", \"input\": \"...\", \"output\": \"...\"}\n",
+ "```\n"
+ ],
+ "metadata": {
+ "id": "yRdaLtZ0AQlj"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "alpaca = ['instruction', 'input', 'output']"
+ ],
+ "metadata": {
+ "id": "kdB0wnEDDek8"
+ },
+ "execution_count": 99,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "results = list(client.search(alpaca, match_all=True))\n",
+ "len(results)"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "uh52VwKTQasR",
+ "outputId": "c16e50ce-6799-42b9-9ae4-e9016d767c6f"
+ },
+ "execution_count": 100,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "1937"
+ ]
+ },
+ "metadata": {},
+ "execution_count": 100
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We now import some functions from `huggingface_hub` to create a collection."
+ ],
+ "metadata": {
+ "id": "BZ6LNKg3FdYs"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from huggingface_hub import login, create_collection, add_collection_item"
+ ],
+ "metadata": {
+ "id": "eckH26s8w_U4"
+ },
+ "execution_count": 25,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "I have my `HF_TOKEN` stored as a Secret in Colab. You can also log in by calling `login()` directly."
+ ],
+ "metadata": {
+ "id": "nUIshM8bFhW3"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "from google.colab import userdata"
+ ],
+ "metadata": {
+ "id": "3ywhU4J7xGuE"
+ },
+ "execution_count": 102,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "login(userdata.get('HF_TOKEN'))"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/"
+ },
+ "id": "b0yRHNw0xCq7",
+ "outputId": "1bcdbda5-34d9-4848-f315-2fc81772df38"
+ },
+ "execution_count": 103,
+ "outputs": [
+ {
+ "output_type": "stream",
+ "name": "stdout",
+ "text": [
+ "The token has not been saved to the git credentials helper. Pass `add_to_git_credential=True` in this function directly or `--add-to-git-credential` if using via `huggingface-cli` if you want to set the git credential as well.\n",
+ "Token is valid (permission: write).\n",
+ "Your token has been saved to /root/.cache/huggingface/token\n",
+ "Login successful\n"
+ ]
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We create a collection using `create_collection`. Passing `exists_ok=True` means the call does not fail if the collection already exists."
+ ],
+ "metadata": {
+ "id": "krcmAIyNFshv"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "collection = create_collection(\"Probably Alpaca Style Datasets\", exists_ok=True)"
+ ],
+ "metadata": {
+ "id": "fGpAnGOPxEWp"
+ },
+ "execution_count": 108,
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "collection.title"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 36
+ },
+ "id": "Gt8rql39RC5R",
+ "outputId": "4af9a2f0-6c20-43a9-f46f-1dc38c2cb480"
+ },
+ "execution_count": 109,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "'Probably Alpaca Style Datasets'"
+ ],
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ }
+ },
+ "metadata": {},
+ "execution_count": 109
+ }
+ ]
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "collection.slug"
+ ],
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 36
+ },
+ "id": "0OC5U8VeF_Zq",
+ "outputId": "bf135fe4-cf65-4425-c541-eb285aaa86e6"
+ },
+ "execution_count": 110,
+ "outputs": [
+ {
+ "output_type": "execute_result",
+ "data": {
+ "text/plain": [
+ "'davanstrien/probably-alpaca-style-datasets-667eead1bad3a964ea580e04'"
+ ],
+ "application/vnd.google.colaboratory.intrinsic+json": {
+ "type": "string"
+ }
+ },
+ "metadata": {},
+ "execution_count": 110
+ }
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "We now loop through our results and add them to the Collection."
+ ],
+ "metadata": {
+ "id": "-GEpHrekGAx6"
+ }
+ },
+ {
+ "cell_type": "code",
+ "source": [
+ "for result in results:\n",
+ "    add_collection_item(collection.slug, result['hub_id'], item_type=\"dataset\", exists_ok=True)"
+ ],
+ "metadata": {
+ "id": "Vb3hgnRBxW4T"
+ },
+ "execution_count": null,
+ "outputs": []
+ },
+ {
+ "cell_type": "markdown",
+ "source": [
+ "Since the results include some key metadata about each dataset, you can also filter them further before creating a Collection."
+ ],
+ "metadata": {
+ "id": "vOdodAVcGI96"
+ }
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.5"
+ },
+ "colab": {
+ "provenance": []
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+ }
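
The notebook's final markdown cell notes that results can be filtered further before building a Collection, but it does not show how. Below is a minimal sketch of one possible filter, assuming only the `hub_id` and `column_names` keys that appear in the printed results. The helper name `filter_exact_columns` is hypothetical and not part of the notebook or the API; it keeps datasets whose columns are exactly the requested set, dropping those with extra columns such as `text` or `data_source`.

```python
from typing import Any, Dict, Iterable, List


def filter_exact_columns(
    results: Iterable[Dict[str, Any]], wanted: List[str]
) -> List[Dict[str, Any]]:
    """Keep only results whose columns are exactly `wanted`, in any order.

    Hypothetical helper: assumes each result has a 'column_names' list,
    as in the outputs printed by the notebook.
    """
    wanted_set = set(wanted)
    return [r for r in results if set(r["column_names"]) == wanted_set]


# Sample records copied from the Example 3 output in the notebook.
sample_results = [
    {"hub_id": "garage-bAInd/Open-Platypus",
     "column_names": ["input", "output", "instruction", "data_source"]},
    {"hub_id": "vicgalle/alpaca-gpt4",
     "column_names": ["instruction", "input", "output", "text"]},
    {"hub_id": "llamafactory/alpaca_en",
     "column_names": ["instruction", "input", "output"]},
]

strict = filter_exact_columns(sample_results, ["instruction", "input", "output"])
print([r["hub_id"] for r in strict])  # ['llamafactory/alpaca_en']
```

In the notebook, the filtered list could replace `results` in the `add_collection_item` loop, which would likely give a smaller and stricter Collection than the 1937 matches found with `match_all=True`.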