GPT-4o Finetuning Failed

Nas 20 Reputation points
2025-04-22T03:02:49.8666667+00:00

I'm receiving an error after preparing my data files following the guidelines at https://learn.microsoft.com/en-us/azure/ai-services/openai/tutorials/fine-tune?tabs=command-line.

During file preprocessing, the status shows:

status : Training file: Preprocessing Summary: The provided data failed validation due to: contains invalid schema (22). Please visit our docs to learn how to resolve these issues, and try again. Details - Samples of lines per error type: contains invalid schema: Line numbers --> 1, 3, 4, 5, 6, 7, 9, 11, 12, 13, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26

After forcing my files to UTF-8 encoding without BOM, I now get a simpler error instead:

File preprocessing failed

Any input will be appreciated!
Thanks.

Azure Machine Learning
An Azure machine learning service for building and deploying models.

Accepted answer
Sina Salam 20,101 Reputation points Moderator
2025-04-30T22:07:34.5833333+00:00

    Hello Nas,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that your GPT-4o fine-tuning job is failing during file preprocessing.

    To accurately resolve the issue, kindly follow these step-by-step recommendations:

    1. Validate the schema format. GPT-4o is a chat model, so each line in your .jsonl file must be a standalone JSON object in the chat format, with a "messages" array:
         {"messages": [{"role": "system", "content": "You are a translator."}, {"role": "user", "content": "Translate to French: Hello"}, {"role": "assistant", "content": "Bonjour"}]}
      
      • The legacy {"prompt": ..., "completion": ...} format is for older completion models and fails schema validation for chat models like GPT-4o.
      • Each line must be valid standalone JSON (newline-delimited), with no extra top-level keys.
      Use this reference for details - https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/fine-tuning?tabs=command-line#prepare-training-data
    2. Use the official data preparation tool. Instead of validating manually, run the CLI tool with this bash command: openai tools fine_tunes.prepare_data -f yourfile.jsonl (note: this command ships only with the legacy openai Python package, versions before 1.0, and it targets the older prompt/completion format, so treat its output as a rough check). It provides:
      • Exact line numbers with schema errors
      • Suggestions on how to fix issues
      https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset
    3. Check encoding and line endings to ensure:
      • The file is UTF-8 without BOM
      • Unix-style line endings (\n), not Windows (\r\n)
      • PowerShell one-liner for re-encoding and the line-ending fix (use PowerShell 7+; note that Windows PowerShell 5.1 writes a BOM with -Encoding utf8): (Get-Content -Raw yourfile.jsonl) -replace "`r`n", "`n" | Set-Content -NoNewline -Encoding utf8NoBOM yourfile_clean.jsonl
    4. Handle Unicode and HTML safely: only decode Unicode escape sequences if they are incorrectly double-escaped, i.e., they appear as \\ud83d\\udc49 instead of \ud83d\udc49. If unsure, leave them as-is; valid Unicode escape sequences are acceptable. To remove HTML tags safely (optional):
         import re

         def remove_html(text):
             # Naive tag stripper: fine for simple inline markup,
             # not for nested or malformed HTML
             return re.sub(r'<[^>]+>', '', text)
      
    5. Create a small test.jsonl with 3–5 rows in the same chat format and try fine-tuning on it:
       {"messages": [{"role": "user", "content": "What is 2+2?"}, {"role": "assistant", "content": "4"}]}
       {"messages": [{"role": "user", "content": "Translate to French: Apple"}, {"role": "assistant", "content": "Pomme"}]}
    
    • If this works, the issue is clearly in the formatting of your original file.
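
    The schema and encoding checks above can be sketched as a small validator you run locally before uploading. This is a minimal sketch, not an official tool: it assumes a chat-format file for GPT-4o, and the path "yourfile.jsonl" is a placeholder for your own file.

    ```python
    # Minimal sketch of a line-by-line validator for a chat-format
    # fine-tuning file (GPT-4o and other chat models).
    # The path passed in is a placeholder; substitute your own file.
    import json

    VALID_ROLES = {"system", "user", "assistant"}

    def validate_jsonl(path):
        errors = []
        with open(path, "rb") as f:
            raw = f.read()
        # Check for a UTF-8 BOM at the start of the file
        if raw.startswith(b"\xef\xbb\xbf"):
            errors.append("File starts with a UTF-8 BOM; re-save without it.")
            raw = raw[3:]
        for lineno, line in enumerate(raw.decode("utf-8").splitlines(), start=1):
            if not line.strip():
                errors.append(f"Line {lineno}: blank line (remove it)")
                continue
            # Each line must be a standalone JSON object
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"Line {lineno}: invalid JSON ({exc.msg})")
                continue
            # Chat models require a non-empty "messages" array
            messages = obj.get("messages")
            if not isinstance(messages, list) or not messages:
                errors.append(f"Line {lineno}: missing or empty 'messages' array")
                continue
            # Every message needs a valid role and string content
            for msg in messages:
                if (not isinstance(msg, dict)
                        or msg.get("role") not in VALID_ROLES
                        or not isinstance(msg.get("content"), str)):
                    errors.append(f"Line {lineno}: each message needs a valid "
                                  f"'role' and string 'content'")
                    break
        return errors
    ```

    Calling validate_jsonl("yourfile.jsonl") returns a list of per-line problems (empty if the file looks clean), which mirrors the kind of line-number report the service's preprocessing step gives you.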

    I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.


    Please don't forget to close out the thread by upvoting and accepting this as the answer if it helped.


0 additional answers