Best Answer

UTF-8 is a variable-length text encoding scheme. Standard ASCII characters are represented by a single byte with a value in the range 0 to 127 (0x00 through 0x7F). Bit-7, the most significant bit, is never used in standard ASCII and is always zero when ASCII is stored in 8-bit bytes. In the extended ASCII character sets, bit-7 selects one of a further 128 characters; however, that set of characters depends on which code page is in use, and the total number of character representations is still limited to 256. To overcome this limit, UTF-8 uses bit-7 to signify that a byte is part of a multi-byte character.
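To illustrate the code page problem, here is a minimal Python sketch. The byte value 0xE9 and the two code pages (cp1252 and cp437) are just examples chosen for illustration; they are not taken from the question.

```python
# One byte above 0x7F, interpreted under two different code pages.
raw = bytes([0xE9])

as_cp1252 = raw.decode("cp1252")  # Windows Western European code page
as_cp437 = raw.decode("cp437")    # original IBM PC code page

# The two results are different characters, although the byte is identical.
print(repr(as_cp1252), repr(as_cp437))

# In UTF-8 the cp1252 character takes two bytes, and both have bit-7 set.
utf8_bytes = as_cp1252.encode("utf-8")
print(" ".join(f"{b:02x}" for b in utf8_bytes))
print(all(b & 0x80 for b in utf8_bytes))
```

The same single byte is not valid UTF-8 on its own, which is exactly the kind of byte a validator complains about.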

The most-significant bits are used to determine how many continuation bytes will follow the leading byte:

0xxxxxxx -- one-byte standard 7-bit ASCII character

110xxxxx -- two-byte UTF-8 character

1110xxxx -- three-byte UTF-8 character

11110xxx -- four-byte UTF-8 character

Each of the continuation bytes takes the form 10xxxxxx. This ensures a clear distinction between standard ASCII characters, leading bytes and continuation bytes. Note that the number of high-order 1 bits in the leading byte also tells you how many bytes are used to represent a UTF-8 character in total, including the leading byte (two for two-byte encodings, three for three-byte encodings and four for four-byte encodings).
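The bit patterns above can be checked with a few masks. This is a rough sketch; `sequence_length` is a name made up for this example, and it only looks at the bit pattern of the leading byte (it does not reject leading bytes such as 0xC0 or 0xF5, which match a pattern but are not allowed in current UTF-8).

```python
def sequence_length(lead: int) -> int:
    """Return the total number of bytes implied by a UTF-8 leading byte.

    Raises ValueError for continuation bytes and for 11111xxx bytes.
    """
    if (lead & 0x80) == 0x00:   # 0xxxxxxx: one-byte ASCII
        return 1
    if (lead & 0xE0) == 0xC0:   # 110xxxxx: two-byte sequence
        return 2
    if (lead & 0xF0) == 0xE0:   # 1110xxxx: three-byte sequence
        return 3
    if (lead & 0xF8) == 0xF0:   # 11110xxx: four-byte sequence
        return 4
    if (lead & 0xC0) == 0x80:   # 10xxxxxx: continuation byte
        raise ValueError("continuation byte, not a leading byte")
    raise ValueError("invalid leading byte")  # 11111xxx


# Compare the predicted length with what Python's encoder actually produces.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}", sequence_length(encoded[0]), len(encoded))
```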

Multi-byte UTF-8 encodings have to be decoded to produce a code point, which needs at most 21 bits (shown here as six hexadecimal digits):

2-bytes: 0x000080 through 0x0007FF

3-bytes: 0x000800 through 0x00FFFF

4-bytes: 0x010000 through 0x10FFFF

These values are code points in the Unicode character set (UTF-8 is often mistaken for Unicode itself for this reason).
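As a sketch of the decoding step, the payload bits (the x's) of the leading byte and of each continuation byte are concatenated, six bits at a time. `decode_code_point` is a hypothetical helper written for this answer; it checks only the bit patterns and does not reject overlong encodings, which a strict decoder would.

```python
def decode_code_point(seq: bytes) -> int:
    """Decode one complete UTF-8 sequence into its Unicode code point."""
    lead = seq[0]
    if (lead & 0x80) == 0x00:      # 0xxxxxxx
        value, needed = lead, 0
    elif (lead & 0xE0) == 0xC0:    # 110xxxxx: 5 payload bits
        value, needed = lead & 0x1F, 1
    elif (lead & 0xF0) == 0xE0:    # 1110xxxx: 4 payload bits
        value, needed = lead & 0x0F, 2
    elif (lead & 0xF8) == 0xF0:    # 11110xxx: 3 payload bits
        value, needed = lead & 0x07, 3
    else:
        raise ValueError("not a valid leading byte")

    if len(seq) != needed + 1:
        raise ValueError("wrong number of continuation bytes")

    for byte in seq[1:]:
        if (byte & 0xC0) != 0x80:  # every continuation byte is 10xxxxxx
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0x3F)  # append 6 payload bits
    return value


# The hand-decoded value should agree with Python's own ord().
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print(hex(decode_code_point(ch.encode("utf-8"))), hex(ord(ch)))
```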

As a result of the encoding method, certain bit sequences become invalid under UTF-8. For instance, any sequence with 11111xxx in the leading byte is invalid because it does not match any of the UTF-8 leading byte configurations. Similarly, any byte of the form 0xxxxxxx or 11xxxxxx is invalid where a continuation byte is expected. However, the original UTF-8 specification also defined five- and six-byte sequences (leading bytes 111110xx and 1111110x), so data encoded to an older specification may use some of these patterns.
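A structural check based on these rules might look like the following sketch. `first_invalid_offset` is an illustrative name, and the function only applies the bit-pattern rules described in this answer; a strict decoder such as Python's built-in one also rejects overlong encodings and code points above 0x10FFFF.

```python
from typing import Optional


def first_invalid_offset(data: bytes) -> Optional[int]:
    """Return the offset of the first structurally invalid sequence, or None."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:                # 0xxxxxxx
            length = 1
        elif (lead & 0xE0) == 0xC0:    # 110xxxxx
            length = 2
        elif (lead & 0xF0) == 0xE0:    # 1110xxxx
            length = 3
        elif (lead & 0xF8) == 0xF0:    # 11110xxx
            length = 4
        else:
            return i                   # stray 10xxxxxx or invalid 11111xxx

        tail = data[i + 1:i + length]
        if len(tail) != length - 1:
            return i                   # sequence is cut off by end of data
        if any((b & 0xC0) != 0x80 for b in tail):
            return i                   # a continuation byte is not 10xxxxxx
        i += length
    return None


print(first_invalid_offset("na\u00efve \u20ac".encode("utf-8")))
print(first_invalid_offset(b"caf\xe9"))
```

The second call passes a byte string that ends in a lone 0xE9, as you would get from a file saved in a Western European code page, so the function reports the offset of that byte.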

Some applications (including Windows Notepad) add a byte-order mark (BOM) at the start of the file, even though this is unnecessary in UTF-8, where the order of the bytes is fixed by the encoding itself. If a BOM is present, the first three bytes will be 0xEF, 0xBB and 0xBF. These show up as three garbage characters in applications that do not take account of the BOM. Many programmers mistakenly believe it is impossible to reliably detect UTF-8 without testing for a leading BOM, which is not the case at all. A simple pass over the multi-byte sequences to check their validity is enough to validate the encoding. Even in randomly generated data, the chance of finding seven valid multi-byte UTF-8 sequences is lower than the chance that the first three random bytes form a UTF-8 BOM.
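For what it's worth, here is a small Python sketch of BOM handling. `strip_utf8_bom` is a made-up helper name; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the Python standard library.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (EF BB BF) if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")

# The three BOM bytes, in hex.
print(with_bom[:3].hex())

# An application that ignores the BOM and assumes Latin-1 shows three
# garbage characters before the real text.
print(with_bom.decode("latin-1"))

# The "utf-8-sig" codec drops the BOM while decoding.
print(with_bom.decode("utf-8-sig"))
print(strip_utf8_bom(with_bom))
```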

To sum up, the validator was told (or assumed) that the page is UTF-8, but line 2 contains at least one byte sequence that breaks the rules above. Most likely the file was saved in a different encoding, such as an extended ASCII code page, or it was encoded using an older UTF-8 specification; failing that, the data is corrupt. A text editor that lets you reopen the file in a chosen encoding should help you determine which one was used. Web pages should declare their encoding, either in the HTTP Content-Type header or in a plain-text ASCII declaration near the top of the file (in HTML, a meta charset element), and the validator relies on that declaration. If there is no declaration, look for a BOM, which identifies UTF-8 or one of the UTF-16 encodings; a file that contains only bytes in the range 0x00 to 0x7F is standard ASCII, which is also valid UTF-8. If the data includes extended ASCII characters, you need to know which code page was used to generate the data in order to decode it correctly. It's not possible to determine this from the bytes alone, so the code page should be named in the declaration. If it is not, the file cannot be decoded reliably, and the practical fix is to re-save it as UTF-8 and declare that encoding.
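If you want to find the offending bytes yourself, something along these lines should work. It is only a sketch: `find_invalid_utf8_lines` is a hypothetical helper, the sample `page` bytes are invented to resemble the situation in the question, and only the first bad sequence on each line is reported.

```python
from typing import List, Tuple


def find_invalid_utf8_lines(data: bytes) -> List[Tuple[int, int, bytes]]:
    """Return (line number, byte offset in line, bad bytes) for each bad line."""
    problems = []
    # Splitting on 0x0A is safe: that byte never occurs inside a
    # multi-byte UTF-8 sequence.
    for line_no, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as err:
            problems.append((line_no, err.start, line[err.start:err.end]))
    return problems


# Invented sample: line 2 holds a single 0xE9 byte, as produced by saving
# the page in a Western European code page instead of UTF-8.
page = b"<!DOCTYPE html>\n<title>Caf\xe9</title>\n<p>ok</p>\n"

problems = find_invalid_utf8_lines(page)
print(problems)

# Guessing cp1252 as the original code page (an assumption), the page can
# be converted to real UTF-8.
fixed = page.decode("cp1252").encode("utf-8")
print(find_invalid_utf8_lines(fixed))
```

With a real file you would read the bytes using `open(path, "rb").read()` and pass them to the function; the code page used for the conversion has to be known or guessed, as explained above.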

Q: I am trying to validate a website, but it says it can't validate because line 2 contained one or more bytes that it cannot interpret as UTF-8. What does this mean?