Assn 5: UTF-8
Due: 5:00pm, Friday, September 26. Value: 40 pts.
In this assignment, you'll implement two functions to manipulate a Unicode string stored using the UTF-8 encoding. [UTF-8 notes from class]
In C, a char is a single byte (8 bits on virtually every modern system), and so a char array is most easily used for representing a string of ASCII characters. All of C's built-in string manipulation functions are built for ASCII. However, we can use a char array as a Unicode string using UTF-8, as long as we consistently use only functions designed specifically for UTF-8.
As an example of encoding a string into an array of chars, consider the Spanish word baño. The third letter ñ is not an ASCII character, but it has a Unicode codepoint of U+00F1. Since this does not fit into 7 bits but does fit into 11 bits, UTF-8 represents it as two separate bytes. Consequently, baño would be represented in memory as an array of six bytes, whose hexadecimal values are 0x62, 0x61, 0xC3, 0xB1, 0x6F, and the terminating 0x00.
letter:    b      a      ñ      o
codepoint: U+0062 U+0061 U+00F1 U+006F
bytes:     62     61     C3 B1  6F
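The two-byte case can be checked by hand: a codepoint of up to 11 bits is split into its top 5 bits and its low 6 bits, which are placed into the byte templates 110xxxxx and 10xxxxxx. The following is a minimal sketch of that arithmetic. The helper name encode2 is hypothetical and is not part of the handout code.

```c
#include <assert.h>

/* Hypothetical helper (not in the handout): encode a codepoint in the
   range U+0080..U+07FF as two UTF-8 bytes. */
static void encode2(int cp, unsigned char out[2]) {
    assert(cp >= 0x80 && cp <= 0x7FF);             /* two-byte range only */
    out[0] = (unsigned char)(0xC0 | (cp >> 6));    /* 110xxxxx: top 5 bits */
    out[1] = (unsigned char)(0x80 | (cp & 0x3F));  /* 10xxxxxx: low 6 bits */
}
```

Calling encode2(0x00F1, b) leaves b holding C3 B1, which matches the bytes shown for ñ.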
For this assignment, you will complete two utility functions for UTF-8.
int u8get(char *s, int index)
Returns the Unicode codepoint of the character at position index in the string s, counting characters rather than bytes. For example, if s is baño represented as above, u8get(s, 3) should return 0x006F, the Unicode codepoint for the o, and u8get(s, 2) should return 0x00F1.
You may assume that s contains only codepoints that fit into 16 bits. This means that you need only worry about the one-, two-, and three-byte cases for UTF-8. Also, you can assume that index is between 0 and 1 less than the number of Unicode characters represented in s.
int u8find(char *s, int ch)
Returns the index in the string s where the Unicode codepoint ch first occurs, or -1 if the codepoint does not occur in the string. For example, if s is baño represented as above, u8find(s, 0xF1) should return 2.
As before, you may assume that s and ch contain only codepoints that fit into 16 bits.
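Both functions depend on the difference between a byte offset and a character index. One way to see that difference is to classify each byte by its high bits: in UTF-8, continuation bytes always have the form 10xxxxxx, so every other byte begins a new character. The sketch below counts characters on that basis. The names u8_is_lead and u8_count are hypothetical, not part of the handout, and the code assumes s is valid, NUL-terminated UTF-8.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: nonzero if byte c begins a character, i.e. it is
   not a continuation byte of the form 10xxxxxx. The cast matters because
   plain char may be signed, which makes bytes >= 0x80 negative. */
static int u8_is_lead(char c) {
    return ((unsigned char)c & 0xC0) != 0x80;
}

/* Hypothetical helper: number of Unicode characters in s, assuming s is
   valid, NUL-terminated UTF-8. */
static int u8_count(const char *s) {
    int n = 0;
    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (u8_is_lead(*s)) {
            n++;
        }
    }
    return n;
}
```

With baño stored as the bytes 62 61 C3 B1 6F, u8_count returns 4 even though strlen reports 5, so the o is character 3 but byte 4.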
As in prior assignments, the handout code includes three files.
- u8.c
Defines utility functions for manipulating UTF-8 encoded strings. This is the only file you will modify for this assignment.
- u8.h
Contains prototypes for the functions found in u8.c.
- u8test.c
Contains a main function that runs through a battery of tests. If your program passes all the tests with no memory problems reported by valgrind, there is a good chance that your solution is correct. That is not a guarantee, though. The tests are all based on the following Unicode strings:
  - baño (Spanish for bathroom)
    letter:    b      a      ñ      o
    codepoint: U+0062 U+0061 U+00F1 U+006F
    bytes:     62     61     C3 B1  6F
  - Meßgröße (German for measured variable)
    letter:    M      e      ß      g      r      ö      ß      e
    codepoint: U+004D U+0065 U+00DF U+0067 U+0072 U+00F6 U+00DF U+0065
    bytes:     4D     65     C3 9F  67     72     C3 B6  C3 9F  65
  - εἰς (first word of Hendrix's motto as rendered in Ancient Greek on its seal)
    letter:    ε        ἰ        ς
    codepoint: U+03B5   U+1F30   U+03C2
    bytes:     CE B5    E1 BC B0 CF 82
  - سلام (Arabic word salaam, meaning peace and frequently used for greeting)
    letter:    س      ل      ا      م
    codepoint: U+0633 U+0644 U+0627 U+0645
    bytes:     D8 B3  D9 84  D8 A7  D9 85
  - ±√b²−4ac (discriminant portion of quadratic formula)
    letter:    ±        √        b        ²        −        4        a        c
    codepoint: U+00B1   U+221A   U+0062   U+00B2   U+2212   U+0034   U+0061   U+0063
    bytes:     C2 B1    E2 88 9A 62       C2 B2    E2 88 92 34       61       63
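To experiment with these strings outside of u8test.c, they can be written as C string literals using hex escapes. A hex escape consumes every hex digit that follows it, so a literal such as "\x9Fe" is read as the single out-of-range escape \x9FE; splitting the literal into adjacent pieces avoids this. The sketch below is an illustration only. The array names are hypothetical, and u8test.c may define its strings differently.

```c
/* Hypothetical names; the handout's u8test.c may store its strings
   differently. */

/* baño: the 'o' after \xB1 is not a hex digit, so one literal is safe. */
static const char BANO[] = "ba\xC3\xB1o";

/* Meßgröße: the final 'e' is a hex digit, so it goes in its own literal
   to keep it from being absorbed into the \x9F escape. */
static const char MESSGROESSE[] = "Me\xC3\x9Fgr\xC3\xB6\xC3\x9F" "e";

/* εἰς: a two-byte, a three-byte, and a two-byte sequence. */
static const char EIS[] = "\xCE\xB5\xE1\xBC\xB0\xCF\x82";
```

Here strlen(BANO) is 5, strlen(MESSGROESSE) is 11, and strlen(EIS) is 7, matching the byte rows listed for those strings, while the character counts are 4, 8, and 3.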
To submit your solution, include only the u8.c file.