From f6b2d8a7b7ac9c3fff063dcb176d20acceb141e9 Mon Sep 17 00:00:00 2001
From: Thomas Gerald <gerald@lisn.fr>
Date: Tue, 29 Oct 2024 10:57:38 +0000
Subject: [PATCH] Upload New File

---
 BPE-Algorithm-correction.ipynb | 475 +++++++++++++++++++++++++++++++++
 1 file changed, 475 insertions(+)
 create mode 100644 BPE-Algorithm-correction.ipynb

diff --git a/BPE-Algorithm-correction.ipynb b/BPE-Algorithm-correction.ipynb
new file mode 100644
index 0000000..7058d50
--- /dev/null
+++ b/BPE-Algorithm-correction.ipynb
@@ -0,0 +1,475 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "f4bdb240-9099-411b-a859-daaf3269c900",
+   "metadata": {},
+   "source": [
+    "## Implementing the BPE algorithm\n",
+    "The goal of this first lab is to implement the BPE tokenization algorithm. As a reminder, the idea is to repeatedly merge the pair of \"tokens\" that most often appear next to each other.\n",
+    "\n",
+    "For example, consider a corpus containing the words of the following table (with the number of occurrences of each word):\n",
+    "\n",
+    "| word | occurrences |\n",
+    "|------|-------------|\n",
+    "| manger | 2 |\n",
+    "| voter | 3 |\n",
+    "| lent | 1 |\n",
+    "| lentement | 2 |\n",
+    "\n",
+    "If the initial \"tokens\" are the letters of the alphabet, then the suffix \"er\" is the first subword (token) added to the vocabulary.\n",
+    "\n",
+    "The steps of the BPE algorithm are the following:\n",
+    "1. Download a text corpus (here a Wikipedia page)\n",
+    "2. Split the text into words (using the space character) and count the number of occurrences of each word\n",
+    "3. Initialize the vocabulary with the initial tokens (the letters of the alphabet)\n",
+    "4. Run the BPE algorithm (learn the vocabulary)\n",
+    "5. Test the decomposition into tokens on a few chosen sentences (by applying the merge rules)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ee88b07e-4de8-436e-9dc7-f405c69c9042",
+   "metadata": {},
+   "source": [
+    "### Step 1: Download a corpus"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "5352d48f-27ca-41ad-9265-0e128578a5ec",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import urllib.request\n",
+    "import re\n",
+    "import numpy as np\n",
+    "import collections\n",
+    "import json\n",
+    "\n",
+    "# Fetch the plain-text extract of the article through the MediaWiki API\n",
+    "url_request = 'https://fr.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&explaintext&redirects=1&titles=Gr%C3%A8ce_antique'\n",
+    "raw_page = urllib.request.urlopen(url_request)\n",
+    "json_page = json.load(raw_page)\n",
+    "\n",
+    "# Keep a local copy of the API response\n",
+    "with open('wikitest.json', 'w') as f:\n",
+    "    json.dump(json_page, f)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f8b6b7a2-74b4-4ba2-aa4b-93885a286fb2",
+   "metadata": {},
+   "source": [
+    "### Step 2: Split the text into words\n",
+    "\n",
+    "To split the text into words we will use the regex ```r'(\\\\b[^\\\\s]+\\\\b)'```. To count the words we will use Python's Counter object.\n",
+    "1. Store in **count_words** each word together with its number of occurrences\n",
+    "2. Looking at the documentation, give the 10 most frequent words (store them in most_commons_words). A quick sanity check of the regex follows below."
+   ]
+  },
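+  {
+   "cell_type": "markdown",
+   "id": "regex-sanity-check-md",
+   "metadata": {},
+   "source": [
+    "A minimal sanity check of the splitting regex before applying it to the full corpus (the example sentence is an arbitrary illustration):"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "regex-sanity-check",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# \\b marks a word boundary, so punctuation around words is dropped\n",
+    "re.findall(r'(\\\\b[^\\\\s]+\\\\b)', \"La Grèce antique, c'est...\")\n",
+    "# expected: ['La', 'Grèce', 'antique', \"c'est\"]"
+   ]
+  },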
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "6d7341ff-61bf-4a5a-9d0d-4ae8fd8797ce",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['de', 'la', 'et', 'des', 'les', 'à ', 'le', 'en', 'qui', 'dans']"
+      ]
+     },
+     "execution_count": 3,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from collections import Counter\n",
+    "\n",
+    "# The page extract is the only entry in the 'pages' dictionary\n",
+    "corpus = list(json_page['query']['pages'].values())[0]['extract']\n",
+    "word_regex = re.compile(r'(\\\\b[^\\\\s]+\\\\b)')\n",
+    "words = word_regex.findall(corpus)\n",
+    "\n",
+    "count_words = Counter(words)\n",
+    "most_commons_words = [k for k, v in count_words.most_common(10)]\n",
+    "\n",
+    "most_commons_words"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "7a7a38b8-97f5-4851-b642-f7b40c023491",
+   "metadata": {},
+   "source": [
+    "### Step 3: Initialize the vocabulary with the initial tokens (the letters of the alphabet)\n",
+    "\n",
+    "Create the initial vocabulary in the variable vocab. How many initial tokens do you get?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "901f7ea5-9d71-4f8c-95bf-32b098c3241c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# The initial tokens are the distinct characters appearing in the corpus\n",
+    "vocab = list({char for word in count_words.keys() for char in word})\n",
+    "vocab.sort()\n",
+    "len(vocab)  # number of initial tokens"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "efd79ab8-b1b1-4802-ac80-39b12cb5917f",
+   "metadata": {},
+   "source": [
+    "### Step 4: Learn the tokenizer\n",
+    "To learn the tokenizer we need several functions:\n",
+    "1. A function that computes the frequency of each pair of \"tokens\"\n",
+    "2. A function that merges a pair\n",
+    "\n",
+    "Several variables are needed:\n",
+    "1. **vocab**, containing the current vocabulary\n",
+    "2. **merge_rules**, containing all the merge rules (a dictionary whose keys are the pairs of tokens to merge and whose values are the results of the merges). For example: {('e', 's'): 'es', ('en', 't'): 'ent'}\n",
+    "3. **splits**, a dictionary containing the current segmentation of the corpus, with the word as key and the list of its \"tokens\" as value\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "8357b6bd-9414-4263-9bf5-ef249614cb04",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{'La': ['L', 'a'],\n",
+       " 'Grèce': ['G', 'r', 'è', 'c', 'e'],\n",
+       " 'antique': ['a', 'n', 't', 'i', 'q', 'u', 'e'],\n",
+       " 'est': ['e', 's', 't'],\n",
+       " 'une': ['u', 'n', 'e']}"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# Initially, splits maps each word to its list of characters\n",
+    "splits = {word: [c for c in word] for word in count_words.keys()}\n",
+    "{k: splits[k] for k in list(splits.keys())[:5]}"
+   ]
+  },
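+  {
+   "cell_type": "markdown",
+   "id": "toy-pair-count-md",
+   "metadata": {},
+   "source": [
+    "Before writing the function on the real corpus, here is a minimal sketch of pair counting on the toy corpus of the introduction (the variable names are illustrative only). On these four words, ('e', 'r') is among the most frequent pairs with 2 + 3 = 5 occurrences (tied with ('e', 'n') and ('n', 't')), consistent with the suffix \"er\" being merged first:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "toy-pair-count",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Toy corpus from the introduction table: word -> number of occurrences\n",
+    "toy_freqs = {'manger': 2, 'voter': 3, 'lent': 1, 'lentement': 2}\n",
+    "\n",
+    "toy_pairs = Counter()\n",
+    "for word, freq in toy_freqs.items():\n",
+    "    # count each pair of adjacent characters, weighted by the word frequency\n",
+    "    for a, b in zip(word, word[1:]):\n",
+    "        toy_pairs[(a, b)] += freq\n",
+    "\n",
+    "toy_pairs.most_common(3)  # ('e', 'r') appears 2 + 3 = 5 times"
+   ]
+  },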
" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "cb995e01-74f4-432f-9ddb-5c676afc32c8", + "metadata": {}, + "outputs": [], + "source": [ + "def compute_pair_freqs(splits, word_freqs):\n", + " pair_freqs = {}\n", + " for word, freq in word_freqs.items():\n", + " split = splits[word]\n", + " if len(split) == 1:\n", + " continue\n", + " for i in range(len(split) - 1):\n", + " pair = (split[i], split[i + 1])\n", + " if(pair not in pair_freqs):\n", + " pair_freqs[pair] = 0\n", + " pair_freqs[pair] += freq\n", + " return pair_freqs" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "d806c606-8cd3-43ff-8e14-aea74f9d4172", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{('L', 'a'): 188,\n", + " ('G', 'r'): 266,\n", + " ('r', 'è'): 279,\n", + " ('è', 'c'): 316,\n", + " ('c', 'e'): 1342}" + ] + }, + "execution_count": 8, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "pair_freqs = compute_pair_freqs(splits, count_words)\n", + "{k: pair_freqs[k] for k in list(pair_freqs.keys())[:5]}" + ] + }, + { + "cell_type": "markdown", + "id": "72e05d19-df89-4b0b-a2ad-ba5af727b1e5", + "metadata": {}, + "source": [ + "#### Retrouver la paire la plus fréquente et fusionner une pair \n", + "1. Créer une fonction **most_frequent(pair_freqs)** retournant la paire de token la plus fréquente.\n", + "2. Créer une fonction **merge_pair()** qui étant donnée une paire, l'objet splits retourne la nouvelle séparation en token des données (splits))" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "c608944a", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "9b60759e-0bb4-4969-bc86-ee39fd3dc7bb", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(('e', 's'), 6895)" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "def most_frequent(pair_freqs):\n", + " return max(pair_freqs.items(), key=lambda x: x[1])\n", + "most_frequent(pair_freqs)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "dd44467a-326b-499c-9995-8661fd102bec", + "metadata": {}, + "outputs": [], + "source": [ + "def merge_pair(a, b, splits):\n", + " for word in splits.keys():\n", + " split = splits[word]\n", + " if len(split) == 1:\n", + " continue\n", + " i = 0\n", + " while i < len(split) - 1:\n", + " if split[i] == a and split[i + 1] == b:\n", + " split = split[:i] + [a + b] + split[i + 2 :]\n", + " else:\n", + " i += 1\n", + " splits[word] = split\n", + " return splits" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "462376c4-6abc-4d4c-8bb6-a7e4d5bb96bb", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "['g', 'r', 'e', 'c', 'q', 'u', 'es']\n" + ] + } + ], + "source": [ + "new_splits = merge_pair(*most_frequent(pair_freqs)[0], splits)\n", + "print(new_splits['grecques'])" + ] + }, + { + "cell_type": "markdown", + "id": "1a9cdfeb-ade8-4ac6-9cd5-9049090ca142", + "metadata": {}, + "source": [ + "#### Appliquer l'algorithme jusqu'a obtenir la taille du vocabulaire souhaitée\n", + "Créer un objet BPE qui prend en argument un corpus, un nombre de mots et applique l'algorithme BPE. L'algorithme stocke dans l'attribut vocab le vcocabulaire final et dans merge_rule les règles de fusion." 
+  {
+   "cell_type": "markdown",
+   "id": "1a9cdfeb-ade8-4ac6-9cd5-9049090ca142",
+   "metadata": {},
+   "source": [
+    "#### Apply the algorithm until the desired vocabulary size is reached\n",
+    "Create a BPE object that takes a corpus and a target vocabulary size as arguments and applies the BPE algorithm. The object stores the final vocabulary in the vocab attribute and the merge rules in merge_rules."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "id": "c0367f37-9b40-4ae7-8ec6-ca7ab86ae687",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "class BPE:\n",
+    "    def __init__(self, corpus, vocabulary_size=500):\n",
+    "        self.word_regex = re.compile(r'(\\\\b[^\\\\s]+\\\\b)')\n",
+    "        words = self.word_regex.findall(corpus)\n",
+    "\n",
+    "        # counting words\n",
+    "        count_words = collections.Counter(words)\n",
+    "        # create the initial vocabulary (the distinct characters)\n",
+    "        self.vocab = list({char for word in count_words.keys() for char in word})\n",
+    "        self.vocab.sort()\n",
+    "        # create the initial split (each word as its list of characters)\n",
+    "        splits = {word: [c for c in word] for word in count_words.keys()}\n",
+    "        # initialise merge_rules\n",
+    "        self.merge_rules = {}\n",
+    "        while len(self.vocab) < vocabulary_size:\n",
+    "            pair_freqs = compute_pair_freqs(splits, count_words)\n",
+    "            if not pair_freqs:  # every word is a single token: nothing left to merge\n",
+    "                break\n",
+    "            # best_pair has the format ((token_a, token_b), frequency)\n",
+    "            best_pair = most_frequent(pair_freqs)\n",
+    "\n",
+    "            splits = merge_pair(*best_pair[0], splits)\n",
+    "\n",
+    "            self.merge_rules[best_pair[0]] = best_pair[0][0] + best_pair[0][1]\n",
+    "            self.vocab.append(''.join(best_pair[0]))\n",
+    "\n",
+    "    def tokenize(self, text):\n",
+    "        words = self.word_regex.findall(text)\n",
+    "        splits = [[l for l in word] for word in words]\n",
+    "        # apply the merge rules in the order they were learned\n",
+    "        for pair, merge in self.merge_rules.items():\n",
+    "            for idx, split in enumerate(splits):\n",
+    "                i = 0\n",
+    "                while i < len(split) - 1:\n",
+    "                    if split[i] == pair[0] and split[i + 1] == pair[1]:\n",
+    "                        split = split[:i] + [merge] + split[i + 2 :]\n",
+    "                    else:\n",
+    "                        i += 1\n",
+    "                splits[idx] = split\n",
+    "\n",
+    "        # flatten the per-word token lists into a single sequence\n",
+    "        return sum(splits, [])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "id": "c282a2a1-f23c-4b22-b858-35a0049250c1",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "my_bpe = BPE(corpus)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "id": "b225b59b-6103-4b3f-b776-ea2b71714d64",
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['culture', 'grecques', 'dévelop', 'p', 'ée', 'en', 'Grèce']"
+      ]
+     },
+     "execution_count": 15,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "texte = '''culture grecques développée en Grèce '''\n",
+    "my_bpe.tokenize(texte)[:12]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "a6f478a3-5f52-43c4-9b48-0e912304985a",
+   "metadata": {},
+   "source": [
+    "#### Test with different parameters or corpora\n",
+    "Try the algorithm with different hyper-parameters or data; a sketch follows the merge-rules preview below."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 39,
+   "id": "e33bdef2-8392-44f8-a846-81249fd60ac9",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{('e', 's'): 'es',\n",
+       " ('n', 't'): 'nt',\n",
+       " ('q', 'u'): 'qu',\n",
+       " ('r', 'e'): 're',\n",
+       " ('o', 'n'): 'on',\n",
+       " ('d', 'e'): 'de',\n",
+       " ('l', 'e'): 'le',\n",
+       " ('l', 'a'): 'la',\n",
+       " ('t', 'i'): 'ti',\n",
+       " ('i', 's'): 'is',\n",
+       " ('e', 'nt'): 'ent',\n",
+       " ('e', 'n'): 'en'}"
+      ]
+     },
+     "execution_count": 39,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# Preview the first learned merge rules\n",
+    "{k: my_bpe.merge_rules[k] for k in list(my_bpe.merge_rules.keys())[:12]}"
+   ]
+  },
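+  {
+   "cell_type": "markdown",
+   "id": "hyperparam-demo-md",
+   "metadata": {},
+   "source": [
+    "For instance, a minimal sketch with a smaller target vocabulary (300 is an arbitrary choice; if the initial character vocabulary already exceeds it, no merge is learned). With fewer merge rules, the same sentence should split into more, shorter tokens:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "hyperparam-demo",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# A smaller vocabulary means fewer merge rules, hence a finer segmentation\n",
+    "small_bpe = BPE(corpus, vocabulary_size=300)\n",
+    "print(len(small_bpe.vocab), len(my_bpe.vocab))\n",
+    "print(small_bpe.tokenize(texte))\n",
+    "print(my_bpe.tokenize(texte))"
+   ]
+  }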
+ ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} -- GitLab