
Computer Code Tokenization

To create a specialized Large Language Model (LLM) that can translate between programming languages, the first step is to turn raw computer code into a tokenized dataset.


Data Collection:
Gather a large corpus of parallel code snippets in the source and target programming languages you want to translate between. Ensure that the code snippets are diverse, covering various programming concepts, data structures, algorithms, and real-world use cases. Aim for a balanced dataset with a sufficient number of examples for each language pair.
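
For concreteness, each parallel example can be stored as a simple record that pairs a source snippet with its translation; the field names below are an illustrative choice, not a required schema.

example = {
    "source_lang": "python",
    "target_lang": "javascript",
    "source_code": "def add(a, b):\n    return a + b\n",
    "target_code": "function add(a, b) {\n  return a + b;\n}\n",
}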

 

Data Preprocessing:
Clean the collected code snippets by removing any irrelevant information, such as comments, documentation, or metadata. Handle any encoding issues and ensure that the code is in a consistent character encoding format (e.g., UTF-8). Normalize the code by applying consistent formatting, such as indentation and whitespace handling. If necessary, perform any language-specific preprocessing, such as handling language-specific syntax or constructs.
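
A minimal preprocessing pass might look like the sketch below; it drops full-line comments and blank lines and converts tabs to four spaces. Handling trailing comments, docstrings, and encodings robustly would require a language-aware lexer, so treat this as an illustration rather than a complete cleaner.

def preprocess(code: str) -> str:
    """Drop full-line comments and blank lines, normalize tabs to spaces."""
    cleaned = []
    for line in code.expandtabs(4).splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue  # skip blank lines and full-line comments
        cleaned.append(line.rstrip())
    return "\n".join(cleaned) + "\n"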


Tokenization:
Tokenize the code snippets into smaller units called tokens, which break the code down into meaningful elements. Choose an appropriate tokenization strategy based on the characteristics of the programming languages involved. Common approaches include:

  • Whitespace-based tokenization: Split the code based on whitespace characters (spaces, tabs, newlines).

  • Lexical tokenization: Use language-specific lexical rules to identify tokens such as keywords, identifiers, literals, and operators.

  • Subword tokenization: Break down tokens into smaller subword units to handle out-of-vocabulary words and capture morphological patterns.

Whichever approach you choose, apply it consistently to both the source and target code snippets, and store the tokenized snippets along with their corresponding language labels. A minimal lexical tokenizer is sketched below.
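
The sketch keeps string literals, identifiers, numbers, operators, newlines, and leading indentation as separate tokens, matching the worked example at the end of this page; the pattern and function names are illustrative, and in practice a language's own lexer (for Python, the built-in tokenize module) is often the better choice.

import re

TOKEN_PATTERN = re.compile(r"""
      (?P<STRING> '[^']*'|"[^"]*")          # quoted string literals
    | (?P<NAME>   [A-Za-z_]\w*)             # keywords and identifiers
    | (?P<NUMBER> \d+(\.\d+)?)              # numeric literals
    | (?P<OP>     [()\[\]{}.,:=+\-*/<>])    # single-character operators and punctuation
""", re.VERBOSE)

def tokenize_source(code: str) -> list[str]:
    """Split source code into a flat list of string tokens."""
    tokens = []
    for line in code.splitlines():
        indent = line[: len(line) - len(line.lstrip(" "))]
        if indent:
            tokens.append(indent)          # keep leading indentation as its own token
        tokens.extend(m.group() for m in TOKEN_PATTERN.finditer(line))
        tokens.append("\n")                # explicit end-of-line token
    return tokens

print(tokenize_source("task1 = Task(description='Investigate trends', agent=researcher)"))
# ['task1', '=', 'Task', '(', 'description', '=', "'Investigate trends'", ',',
#  'agent', '=', 'researcher', ')', '\n']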

 

Vocabulary Creation:

  • Build a vocabulary from the tokenized code snippets.

  • Determine the size of the vocabulary based on the complexity and diversity of the programming languages involved.

  • Include the most frequent tokens in the vocabulary while considering a balance between coverage and model efficiency.

  • Assign unique integer IDs to each token in the vocabulary.

  • Handle special tokens, such as padding tokens, start-of-sequence tokens, and end-of-sequence tokens, which are used during model training and inference; a sketch of this step follows this list.
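
In the sketch, the special-token strings and the vocabulary-size cutoff are illustrative choices, not fixed conventions.

from collections import Counter

SPECIAL_TOKENS = ["<pad>", "<sos>", "<eos>", "<unk>"]

def build_vocab(tokenized_snippets, max_size=50_000):
    """Assign a unique integer ID to each token, most frequent tokens first."""
    counts = Counter(tok for snippet in tokenized_snippets for tok in snippet)
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}   # lowest IDs reserved
    for tok, _ in counts.most_common(max_size - len(SPECIAL_TOKENS)):
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

def encode(tokens, vocab):
    """Replace each token with its ID, mapping unknown tokens to <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in tokens]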


Data Splitting:

  • Split the preprocessed and tokenized dataset into training, validation, and test sets (a simple split is sketched after this list).

  • Ensure that the splits are representative of the overall dataset and maintain the distribution of programming concepts and language pairs.

  • Use the training set for model training, the validation set for hyperparameter tuning and model selection, and the test set for final evaluation.
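
The sketch below performs a plain 80/10/10 random split; it assumes each element of pairs is one (source, target) example, and a stratified split by language pair would be needed to preserve the distributions mentioned above.

import random

def split_dataset(pairs, train_frac=0.8, valid_frac=0.1, seed=42):
    """Shuffle the examples and split them into train/validation/test sets."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    n_train = int(len(pairs) * train_frac)
    n_valid = int(len(pairs) * valid_frac)
    return (pairs[:n_train],
            pairs[n_train:n_train + n_valid],
            pairs[n_train + n_valid:])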

 

Data Formatting:
Convert the tokenized code snippets into a suitable format for training the LLM.

 

Common formats include:
  • Sequence-to-sequence format: pair the source code sequence with the corresponding target code sequence.

  • Token-level format: align the source and target code tokens at each position.

Pad the sequences to a fixed length or use dynamic padding to handle variable-length sequences efficiently, and store the formatted dataset in a format compatible with the LLM framework you are using (e.g., TensorFlow records, PyTorch tensors).
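
Below is a sketch of sequence-to-sequence formatting with fixed-length padding; PAD_ID = 0 and max_len are illustrative values and assume 0 is the ID of the <pad> token from the vocabulary step.

PAD_ID = 0  # assumed ID of the <pad> token

def format_pair(src_ids, tgt_ids, max_len=256):
    """Truncate and pad one (source, target) example to a fixed length."""
    def pad(ids):
        ids = ids[:max_len]
        return ids + [PAD_ID] * (max_len - len(ids))
    return pad(src_ids), pad(tgt_ids)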


Data Augmentation (Optional):
Consider applying data augmentation techniques to increase the diversity and robustness of the training data. Techniques such as code snippet rotation, identifier renaming, or syntax tree manipulation can help generate additional training examples. Be cautious not to introduce errors or change the semantics of the code during augmentation.
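
As one example, identifier renaming can be applied directly to the token lists produced earlier; the sketch below leaves Python keywords and a few builtins untouched so the snippet's semantics are preserved, and the replacement names are arbitrary choices.

import keyword

PROTECTED = set(keyword.kwlist) | {"print", "True", "False", "None"}

def rename_identifiers(tokens, mapping):
    """Swap identifier tokens according to mapping, skipping protected names."""
    return [mapping.get(tok, tok) if tok not in PROTECTED else tok
            for tok in tokens]

rename_identifiers(["researcher", "=", "Agent", "(", ")"], {"researcher": "analyst"})
# ['analyst', '=', 'Agent', '(', ')']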


Model Training:

  • Use the prepared dataset to train the specialized LLM for code translation.

  • Choose an appropriate architecture, such as a Transformer-based model (e.g., BERT, GPT) or a sequence-to-sequence model (e.g., an encoder-decoder with attention); a minimal encoder-decoder sketch follows this list.

  • Experiment with different hyperparameters, such as model size, learning rate, batch size, and number of training epochs, to optimize the model's performance.

  • Monitor the model's performance on the validation set and apply techniques like early stopping or model checkpointing to prevent overfitting.
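
The sketch below instantiates a small encoder-decoder in PyTorch; the hyperparameters (d_model, number of layers, learning rate, and so on) are illustrative starting points rather than tuned values, and a production setup would add positional encodings, masking, and a full training loop.

import torch
import torch.nn as nn

class CodeTranslator(nn.Module):
    """Token-level encoder-decoder built on torch.nn.Transformer."""
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True,
        )
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, tgt_ids):
        hidden = self.transformer(self.embed(src_ids), self.embed(tgt_ids))
        return self.out(hidden)  # logits over the target vocabulary

model = CodeTranslator(vocab_size=50_000)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss(ignore_index=0)  # 0 assumed to be the <pad> ID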


Model Evaluation:
Evaluate the trained model on the test set to assess its performance in code translation.
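
Two simple metrics are sketched below, exact-match accuracy over whole snippets and token-level accuracy; both take lists of tokenized predictions and references, and sequence metrics such as BLEU are common additions for code translation.

def exact_match(predictions, references):
    """Fraction of predicted snippets that match the reference exactly."""
    return sum(p == r for p, r in zip(predictions, references)) / len(references)

def token_accuracy(predictions, references):
    """Fraction of reference tokens reproduced at the same position."""
    correct = total = 0
    for pred, ref in zip(predictions, references):
        total += len(ref)
        correct += sum(p == r for p, r in zip(pred, ref))
    return correct / total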

Worked example: process the following Python code, a short CrewAI script. (See Hugging Face Python dataset.)

from crewai import Agent, Task, Crew, Process
import os
researcher = Agent(
    role='Researcher',
    goal='Research Ai insights',
    backstory='You are the head research assistant',
    verbose=True,
    allow_delegation=True,
)
writer1 = Agent(
    role='Writer',
    goal="Write engaging blog post",
    backstory='You are the head blog poster',
    verbose=True,
    allow_delegation=True,
)
task1 = Task(description='Investigate trends', agent=researcher)
crew = Crew(
    agents=[researcher, writer1],
    tasks=[task1],
)
result = crew.kickoff()
print("#######################")
print(result)

 

////////////////////////////////////////////////////////////////////////////////////////////////

Tokenize the code:


['from', 'crewai', 'import', 'Agent', ',', 'Task', ',', 'Crew', ',', 'Process', '\n',
'import', 'os', '\n',
'researcher', '=', 'Agent', '(', '\n',
'    ', 'role', '=', "'Researcher'", ',', '\n',
'    ', 'goal', '=', "'Research Ai insights'", ',', '\n',
'    ', 'backstory', '=', "'You are the head research assistant'", ',', '\n',
'    ', 'verbose', '=', 'True', ',', '\n',
'    ', 'allow_delegation', '=', 'True', ',', '\n',
')', '\n',
'writer1', '=', 'Agent', '(', '\n',
'    ', 'role', '=', "'Writer'", ',', '\n',
'    ', 'goal', '=', '"Write engaging blog post"', ',', '\n',
'    ', 'backstory', '=', "'You are the head blog poster'", ',', '\n',
'    ', 'verbose', '=', 'True', ',', '\n',
'    ', 'allow_delegation', '=', 'True', ',', '\n',
')', '\n',
'task1', '=', 'Task', '(', 'description', '=', "'Investigate trends'", ',', 'agent', '=', 'researcher', ')', '\n',
'crew', '=', 'Crew', '(', '\n',
'    ', 'agents', '=', '[', 'researcher', ',', 'writer1', ']', ',', '\n',
'    ', 'tasks', '=', '[', 'task1', ']', ',', '\n',
')', '\n',
'result', '=', 'crew', '.', 'kickoff', '(', ')', '\n',
'print', '(', '"#######################"', ')', '\n',
'print', '(', 'result', ')', '\n']


////////////////////////////////////////////////////////////////////////////
Here's the numerical encoding for the tokenized Python code:


[17, 18, 19, 20, 7, 21, 7, 22, 7, 23, 1, 19, 24, 1, 25, 8, 20, 9, 1, 0, 26, 8, 27, 7, 1, 0, 28, 8, 29, 7, 1, 0, 30, 8, 31, 7, 1, 0, 32, 8, 33, 7, 1, 0, 34, 8, 33, 7, 1, 10, 1, 35, 8, 20, 9, 1, 0, 26, 8, 36, 7, 1, 0, 28, 8, 37, 7, 1, 0, 30, 8, 38, 7, 1, 0, 32, 8, 33, 7, 1, 0, 34, 8, 33, 7, 1, 10, 1, 39, 8, 21, 9, 40, 8, 41, 7, 42, 8, 25, 10, 1, 51, 8, 22, 9, 1, 0, 43, 8, 44, 25, 7, 35, 45, 7, 1, 0, 46, 8, 44, 39, 45, 7, 1, 10, 1, 47, 8, 51, 11, 48, 9, 10, 1, 49, 9, 50, 10, 1, 49, 9, 47, 10, 1]

 

Each token in the code has been replaced with a unique integer ID. The mapping between tokens and their IDs is as follows:


'    ': 0
'\n': 1
',': 7
'=': 8
'(': 9
')': 10
'.': 11
'from': 17
'crewai': 18
'import': 19
'Agent': 20
'Task': 21
'Crew': 22
'Process': 23
'os': 24
'researcher': 25
'role': 26
"'Researcher'": 27
'goal': 28
"'Research Ai insights'": 29
'backstory': 30
"'You are the head research assistant'": 31
'verbose': 32
'True': 33
'allow_delegation': 34
'writer1': 35
"'Writer'": 36
'"Write engaging blog post"': 37
"'You are the head blog poster'": 38
'task1': 39
'description': 40
"'Investigate trends'": 41
'agent': 42
'agents': 43
'[': 44
']': 45
'tasks': 46
'result': 47
'kickoff': 48
'print': 49
'"#######################"': 50
'crew': 51
