NAME

Unicode::Truncate - Unicode-aware efficient string truncation

SYNOPSIS

    use Unicode::Truncate;

    truncate_utf8("hello world", 7);      ## "hell…"
    truncate_utf8("hello world", 7, ''); ## "hello w"
    truncate_utf8('深圳', 7);             ## "深…"

DESCRIPTION

This module is for truncating UTF-8 encoded Unicode text to particular byte lengths while inflicting the least amount of data corruption possible. The resulting truncated string will be no longer than your specified number of bytes (after UTF-8 encoding).

With this module's "truncate_utf8", all truncated strings continue to be valid UTF-8: it won't cut in the middle of a UTF-8 encoded code-point. Furthermore, if your text contains combining diacritical marks, this module will not cut in between a diacritical mark and its base character.

RATIONALE

Why not just use "substr" on a string before UTF-8 encoding it? The main problem is that the number of bytes an encoded string will consume is not known until after you encode it. It depends on how many "high" code-points are in the string, how "high" those code-points are, the normalisation form chosen, and (relatedly) how many combining marks are used. Even before encoding, "substr" may also cut in front of combining marks.

Truncating post-encoding may result in invalid UTF-8 partials at the end of your string, as well as cutting in front of combining marks.

I knew I had to write this module after I asked Tom Christiansen about the best way to truncate unicode to fit in fixed-byte fields and he got angry at me and told me to never do that. :)

Of course in a perfect world we would only need to worry about the amount of space some text takes up on the screen; in the real world we often have to, or want to, make sure things fit within certain byte-size capacity limits. Many databases, network protocols, and file formats require honouring byte-length restrictions. Even if they automatically truncate for you, are they doing it properly and consistently? On many file-systems, file and directory names are subject to byte-size limits. Many APIs that use C structs have fixed limits as well. You may even wish to do things like guarantee that a collection of news headlines will fit in a single ethernet packet.

One interesting aspect of unicode's combining marks is that there is no specified limit to the number of combining marks that can be applied, so in some interpretations a single decomposed unicode character can take up an arbitrarily large number of bytes in its UTF-8 encoding. However, there are various recommendations such as the unicode standard UAX15-D3 "stream-safe" limit of 30. Reportedly the largest known "legitimate" use is a grapheme of 1 base character + 8 combining marks used in a Tibetan script.

ELLIPSIS

When a string is truncated, "truncate_utf8" indicates this by appending an ellipsis. By default this is the character U+2026 (…), but you can use any other string by passing it in as the third argument.

Note that in UTF-8 encoding the default ellipsis consumes 3 bytes (the same as 3 periods in a row). The length of the truncated content *including* the ellipsis is guaranteed to be no greater than the byte-size limit you specified.

IMPLEMENTATION

This module uses the ragel state machine compiler to parse/validate UTF-8 and to determine the presence of combining characters. Ragel is nice because we can determine the truncation location with a single pass through the data in an optimised C loop.
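For illustration, below is a rough pure-Perl sketch of the same single-pass idea. This is not the module's actual ragel/C code: the hypothetical truncate_utf8_sketch uses a simplified "one base character plus any trailing combining marks" rule instead of full grapheme cluster boundaries, and it may not reproduce the module's exact behaviour in every edge case.

    use strict;
    use warnings;
    use Encode qw(encode_utf8);

    ## Hedged sketch only: simplified boundary rule, not the generated
    ## ragel/C scanner that Unicode::Truncate actually uses.
    sub truncate_utf8_sketch {
        my ($str, $max, $ellipsis) = @_;
        $ellipsis = "\x{2026}" unless defined $ellipsis;

        my $ellipsis_len = length encode_utf8($ellipsis);
        die "byte limit too small to hold the ellipsis"
            if $max < $ellipsis_len;

        my $used = 0;  ## bytes consumed so far
        my $keep = 0;  ## char offset of the last safe cut point

        ## Step over one non-mark character plus any following combining
        ## marks at a time, tracking encoded byte counts as we go.
        while ($str =~ /\G(\P{M}\p{M}*)/gc) {
            $used += length encode_utf8($1);

            ## Stop scanning as soon as the limit is exceeded; the rest
            ## of a long string never needs to be examined.
            last if $used > $max;

            ## Remember the latest cut point that still leaves room for
            ## the ellipsis.
            $keep = pos($str) if $used + $ellipsis_len <= $max;
        }

        ## The whole string fit, so return it untouched.
        return $str if $used <= $max && (pos($str) // 0) == length($str);

        return substr($str, 0, $keep) . $ellipsis;
    }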
One feature of this design is that it will not scan any further than it needs to in order to determine the truncation location, so creating short truncations of really long strings doesn't even require traversing the long strings.

Another purpose of this module is to be a "proof of concept" for the Inline::Filters::Ragel source filter, as well as a demonstration of the really cool Inline::Module system.

SEE ALSO

Unicode-Truncate github repo

Although very efficient, as discussed above, "substr" cannot give you a guaranteed byte-length output (if done pre-encoding) and/or will potentially corrupt text (if done post-encoding). There are several similar modules such as Text::Truncate, String::Truncate, and Text::Elide, but they are all essentially wrappers around "substr" and are subject to its limitations.

A reasonable "99%" solution is to encode your string as UTF-8, truncate at the byte level with "substr", decode with "Encode::FB_QUIET", and then re-encode it to UTF-8. This ensures that the output is always valid UTF-8, but it still risks corrupting unicode text that contains combining marks.

Ricardo Signes described a very correct but inefficient algorithm: use Unicode::GCString and strip extended grapheme clusters off the end (or add them one-by-one from the start) until the UTF-8 encoded string fits within your byte limit.

BUGS

This module currently only implements a subset of unicode's grapheme cluster boundary rules. Eventually I plan to extend this so the module "does the right thing" in more cases. Of course I can't test this on all the writing systems of the world, so I don't know the severity of the corruption in all situations. It's possible that the corruption can be minimised in additional ways without sacrificing the simplicity or efficiency of the algorithm. If you have any ideas please let me know and I'll try to incorporate them. One obvious enhancement for languages that use white-space is to chop off the last (potentially partial) word up to the next whitespace block: "s/\S+$//" (note that you'll have to handle the ellipsis yourself in this case).

Currently building this module requires Inline::Filters::Ragel to be installed. I'd like to add an option to Inline::Module that runs ragel at dist time instead.

Perl internally supports characters outside the official unicode range. This module only works with the official UTF-8 range, so if you are using this perl extension (perhaps for some sort of non-unicode sentinel value) this module will throw an exception indicating invalid UTF-8 encoding.

AUTHOR

Doug Hoyte

COPYRIGHT & LICENSE

Copyright 2014 Doug Hoyte.

This module is licensed under the same terms as perl itself.