GPT-4o Finetuning Failed

Nas 20 Reputation points
2025-04-22T03:02:49.8666667+00:00

I'm receiving an error after preparing my data files following the guidelines at https://learn.microsoft.com/en-us/azure/ai-services/openai/tutorials/fine-tune?tabs=command-line.

During file preprocessing, the status shows:

status : Training file: Preprocessing Summary: The provided data failed validation due to: contains invalid schema (22). Please visit our docs to learn how to resolve these issues, and try again. Details - Samples of lines per error type: contains invalid schema: Line numbers --> 1, 3, 4, 5, 6, 7, 9, 11, 12, 13, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26

After forcing my files to UTF-8 encoding without BOM, I now get a simpler error instead:

File preprocessing failed

Any input will be appreciated!
Thanks.

Azure Machine Learning
An Azure machine learning service for building and deploying models.

Accepted answer
Sina Salam 20,101 Reputation points Moderator
2025-04-30T22:07:34.5833333+00:00

    Hello Nas,

    Welcome to the Microsoft Q&A and thank you for posting your questions here.

    I understand that your GPT-4o fine-tuning job is failing during file preprocessing.

    To accurately resolve the issue, kindly follow these step-by-step recommendations:

    1. Validate the schema format. GPT-4o is a chat model, so each line in your .jsonl file must be a standalone JSON object in the chat format, with a "messages" array:
         {"messages": [{"role": "system", "content": "You are a translator."}, {"role": "user", "content": "Translate to French: Hello"}, {"role": "assistant", "content": "Bonjour"}]}
      
      • The legacy {"prompt": ..., "completion": ...} format is for older completion models and fails schema validation for chat models like GPT-4o.
      • Each line must be valid standalone JSON (newline-delimited), with no extra top-level keys.
      Use this reference for details - https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/fine-tuning?tabs=command-line#prepare-training-data
    2. Use the official data preparation tool. Instead of validating manually, run the CLI tool with this bash command: openai tools fine_tunes.prepare_data -f yourfile.jsonl (note: this command ships only with the legacy openai Python package, versions before 1.0, and it targets the older prompt/completion format, so treat its output as a rough check). It provides:
      • Exact line numbers with schema errors
      • Suggestions on how to fix issues
      https://platform.openai.com/docs/guides/fine-tuning/preparing-your-dataset
    3. Check encoding and line endings to ensure:
      • The file is UTF-8 without BOM
      • Unix-style line endings (\n), not Windows (\r\n)
      • PowerShell one-liner for re-encoding and the line-ending fix (use PowerShell 7+; note that Windows PowerShell 5.1 writes a BOM with -Encoding utf8): (Get-Content -Raw yourfile.jsonl) -replace "`r`n", "`n" | Set-Content -NoNewline -Encoding utf8NoBOM yourfile_clean.jsonl
    4. Handle Unicode and HTML safely: only decode Unicode escape sequences if they are incorrectly double-escaped, i.e., they appear as \\ud83d\\udc49 instead of \ud83d\udc49. If unsure, leave them as-is; valid Unicode escape sequences are acceptable. To remove HTML tags safely (optional):
         import re

         def remove_html(text):
             # Naive tag stripper: fine for simple inline markup,
             # not for nested or malformed HTML
             return re.sub(r'<[^>]+>', '', text)
      
    5. Create a small test.jsonl with 3–5 rows in the same chat format and try fine-tuning on it:
       {"messages": [{"role": "user", "content": "What is 2+2?"}, {"role": "assistant", "content": "4"}]}
       {"messages": [{"role": "user", "content": "Translate to French: Apple"}, {"role": "assistant", "content": "Pomme"}]}
    
    • If this works, the issue is clearly in the formatting of your original file.
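
    The schema and encoding checks above can be sketched as a small validator you run locally before uploading. This is a minimal sketch, not an official tool: it assumes a chat-format file for GPT-4o, and the path "yourfile.jsonl" is a placeholder for your own file.

    ```python
    # Minimal sketch of a line-by-line validator for a chat-format
    # fine-tuning file (GPT-4o and other chat models).
    # The path passed in is a placeholder; substitute your own file.
    import json

    VALID_ROLES = {"system", "user", "assistant"}

    def validate_jsonl(path):
        errors = []
        with open(path, "rb") as f:
            raw = f.read()
        # Check for a UTF-8 BOM at the start of the file
        if raw.startswith(b"\xef\xbb\xbf"):
            errors.append("File starts with a UTF-8 BOM; re-save without it.")
            raw = raw[3:]
        for lineno, line in enumerate(raw.decode("utf-8").splitlines(), start=1):
            if not line.strip():
                errors.append(f"Line {lineno}: blank line (remove it)")
                continue
            # Each line must be a standalone JSON object
            try:
                obj = json.loads(line)
            except json.JSONDecodeError as exc:
                errors.append(f"Line {lineno}: invalid JSON ({exc.msg})")
                continue
            # Chat models require a non-empty "messages" array
            messages = obj.get("messages")
            if not isinstance(messages, list) or not messages:
                errors.append(f"Line {lineno}: missing or empty 'messages' array")
                continue
            # Every message needs a valid role and string content
            for msg in messages:
                if (not isinstance(msg, dict)
                        or msg.get("role") not in VALID_ROLES
                        or not isinstance(msg.get("content"), str)):
                    errors.append(f"Line {lineno}: each message needs a valid "
                                  f"'role' and string 'content'")
                    break
        return errors
    ```

    Calling validate_jsonl("yourfile.jsonl") returns a list of per-line problems (empty if the file looks clean), which mirrors the kind of line-number report the service's preprocessing step gives you.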

    I hope this is helpful! Do not hesitate to let me know if you have any other questions or clarifications.


    Please don't forget to close out the thread by upvoting and accepting this as the answer if it helped.


0 additional answers