Codepoints java offset (9/3/2023)

One of the most useful discussions at our working group F2F last week was the result of a question from Takeshi Kanai about how we calculate character offsets, such as those used by the Text Position Selector in the draft model.

> Specifically, if I have a selector such as […], to what do the numbers 478 and 512 refer? These numbers will likely be interpreted by other components specified by this WG (such as the RangeFinder API), not to mention external systems, and we need to make sure we are consistent in our definitions across these specifications.

> The atom of selection in the browser is the symbol, or grapheme. A symbol is not (necessarily) a codepoint. For example, "ą́" is composed of three codepoints, but is rendered as a single selectable symbol.

> I've reviewed what the model spec currently says, and I'm not sure it's particularly precise on this point. Even if I'm misreading it and it is clear, I'm not sure it makes a recommendation that is practical. In order to review this, I'm going to first lay out the possible points of ambiguity, and then review what the spec seems to say on these issues.

> TAG members: has the issue of dealing with symbols vs. characters/codepoints come up in TAG discussion? Any comment/suggestion welcome. (I've cross-posted intentionally; please remove recipients if not appropriate.)

> Please feel free to come back again here or contact the I18N WG.

> Here is what the Character Model has to say about this: the recommendation is clearly to use character strings (i.e. codepoints), unless a) there are performance considerations that would predicate the use of "code unit strings" (I presume interop with existing DOM APIs would also count), or b) "user interaction is a primary concern", in which case grapheme clusters apply. Unfortunately for us, both considerations apply in the annotation use case. The Character Model lays out the problems more clearly than I have. I'd suggest we schedule a discussion of this issue in an upcoming call.
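The grapheme-vs-codepoint gap above can be sketched in Python 3, where `len()` counts codepoints. This is a minimal illustration, not code from the thread; it builds the "ą́" symbol explicitly from combining marks, and also shows that NFC normalization changes the codepoint count of the *same* on-screen symbol. (True grapheme-cluster counting per UAX #29 needs a third-party library and is not shown.)

```python
import unicodedata

# "ą́" assembled explicitly from three codepoints:
# a (U+0061) + combining ogonek (U+0328) + combining acute (U+0301)
symbol = "a\u0328\u0301"
print(len(symbol))  # 3 codepoints, yet the browser renders one selectable symbol

# NFC composes a + ogonek into the precomposed U+0105 "ą",
# so the very same grapheme can also be 2 codepoints:
nfc = unicodedata.normalize("NFC", symbol)
print(len(nfc))  # 2
```

This is exactly why a raw offset is ambiguous: the same selectable symbol can legitimately contribute 3 or 2 to a codepoint count depending on normalization.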
> Thanks for this reference, Martin, and thanks for passing this to TAG.

> Regarding annotation, using code points or Character Strings is definitely the best practice. On the (serialized) web, UTF-8 is predominant, but that is really not the question here, as the choice between graphemes, code points and code units is orthogonal to encoding. Maybe some DOM parsers rely on UTF-16 internally too, but they still count code points. C/C++ has a datatype widechar using 16 bits, as it is easier to allocate memory for variables; this means that you can use byte offsets easily to jump to certain positions in the text. While UTF-8 has a variable length of one to four bytes per code point, UTF-16 and 32 have the advantage of a fixed length.

> Anyhow, I wouldn't know a single use case for using Code Units for annotation. Personally I think byte offsets for text are unnecessary, simply because code points are better. For the NLP2RDF project we converted these 30 million annotations to RDF; it was quite difficult to work with the byte offsets, given that the original formats were HTML, txt, PDF and docx.

> There is a problem with Unicode Normal Forms (NF): if people wish to annotate diacritics independently, in NFD you can annotate the code point for the diacritic separately. However, NFD is not in wide use, and the annotation of diacritics is probably out of scope.

> When transferring data, it is important that the other implementation counts offsets the same way. On my wishlist, I would hope that the new Annotation standard would include a normative list (SHOULD, not MUST) of string counting functions for all major programming languages and other standards like SPARQL, to tackle interoperability. For example, in Python, len() in combination with decode(): len("ä".decode("UTF-8")) == 1. Any deviation will lead to side effects such as "ä" having the length 2.

> See also the "definition of string" section in the NIF spec (yes, we consider moving to a W3C community group for further improvement).
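The `len("ä".decode("UTF-8"))` idiom quoted above is Python 2. A sketch of the same point in Python 3, where `str` is already a sequence of code points, might look like this (my example, not from the thread):

```python
s = "ä"  # U+00E4, a single code point

print(len(s))                         # 1  -> code points
print(len(s.encode("utf-8")))         # 2  -> UTF-8 bytes
print(len(s.encode("utf-16-le")) // 2)  # 1  -> UTF-16 code units

# Counting the encoded bytes instead of the decoded string is the
# "deviation" warned about above: it reports length 2 for "ä".
print(len(s.encode("utf-8")))  # 2
```

A hypothetical normative list of counting functions would pin down exactly this: for each language, which call yields the code-point count rather than a byte or code-unit count.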
To: Sebastian Hellmann, Public TAG List
CC: W3C Public Annotation List, nlp2rdf

> While UTF-8 has a variable length of one to four bytes per code point, UTF-16 and 32 have the advantage of a fixed length.

UTF-16 is **not** a fixed-length encoding. Like UTF-8, it can use up to 4 bytes per code point; for example, characters outside the Basic Multilingual Plane use 4 bytes. Cases where UTF-16 uses 4 bytes may be "pathological", but the assumption that you can calculate the offset of a character from a fixed width is not true. Here I show it in Python (note: the u'xxx' literal is UTF-16): […]

I am also a bit puzzled about renaming Unicode Code Points (a clearly defined thing) to "Character String". From my understanding, the example in […] is not good: UTF-16 is the encoding of the string, and the encoding is independent of code points, units and graphemes, i.e. you can encode the same code point in UTF-8, UTF-16 and UTF-32, which will definitely change the number of code units and bytes needed.