Unicode Script Detector
Detect the Unicode scripts used in a text.
Installation
Add this line to your application's Gemfile:
gem "unicode_script_detector"
Or install it globally:
$ gem install unicode_script_detector
Detect the script of each character in a string
UnicodeScriptDetector.detect_characters "Hel6б\t"
#Output:
[
#<UnicodeScriptDetector::Character:0x00007768fefdead8 @char="H", @name="Latin", @script=:Latin>,
#<UnicodeScriptDetector::Character:0x00007768fefdea10 @char="e", @name="Latin", @script=:Latin>,
#<UnicodeScriptDetector::Character:0x00007768fefde970 @char="l", @name="Latin", @script=:Latin>,
#<UnicodeScriptDetector::Character:0x00007768fefde8d0 @char="6", @name="Digit", @script=:Digit>,
#<UnicodeScriptDetector::Character:0x00007768fefde830 @char="б", @name="Cyrillic", @script=:Cyrillic>,
#<UnicodeScriptDetector::Character:0x00007768fefde830 @char="\t", @name="Tab", @script=:Tab>
]
Detect if a string contains certain scripts
# Returns true because the string contains both Latin and Cyrillic characters
UnicodeScriptDetector.contains? "Helб🔥", [:Latin, :Cyrillic]
Detect if a string contains only certain scripts
# Returns false because the string also contains an emoji
UnicodeScriptDetector.contains_only? "Helб🔥", [:Latin, :Cyrillic]
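To make the difference between the two checks concrete, here is a minimal sketch that uses only Ruby's built-in \p{...} script properties. It is not the gem's implementation: SCRIPT_PATTERNS, contains_scripts? and contains_only_scripts? are hypothetical names, and it assumes that "contains" means every listed script occurs at least once.

```ruby
# Hypothetical helpers for illustration only; they are not part of the gem.
# They rely solely on Ruby's built-in Unicode script properties.
SCRIPT_PATTERNS = {
  Latin: /\p{Latin}/,
  Cyrillic: /\p{Cyrillic}/,
  Greek: /\p{Greek}/
}.freeze

# True if every listed script occurs at least once in the text
# (assumed semantics, the gem may define this differently).
def contains_scripts?(text, scripts)
  scripts.all? { |script| text.match?(SCRIPT_PATTERNS.fetch(script)) }
end

# True if every character belongs to one of the listed scripts.
def contains_only_scripts?(text, scripts)
  text.each_char.all? do |char|
    scripts.any? { |script| char.match?(SCRIPT_PATTERNS.fetch(script)) }
  end
end

sample = "Hel\u0431\u{1F525}" # "Hel", Cyrillic "б", fire emoji

puts contains_scripts?(sample, [:Latin, :Cyrillic])      # prints true
puts contains_only_scripts?(sample, [:Latin, :Cyrillic]) # prints false
```

The emoji belongs to neither Latin nor Cyrillic, so the "only" check fails while the "contains" check still passes.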
Group the characters of a string by script
UnicodeScriptDetector.script_groups("Hel6б how are you?").each do |group|
puts "#{group.name}: #{group.text} (#{group.length} characters)"
end
#Output:
Latin: Hel (3 characters)
Digit: 6 (1 characters)
Cyrillic: б (1 characters)
Whitespace: (1 characters)
Latin: how (3 characters)
Whitespace: (1 characters)
Latin: are (3 characters)
Whitespace: (1 characters)
Latin: you (3 characters)
Punctuation: ? (1 characters)
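The grouping above can be approximated with Enumerable#chunk, which collects consecutive elements that share the same key. This is only a sketch under assumptions: script_label and group_by_script are hypothetical helpers, the labels are a coarse subset chosen for illustration, and the gem's own classification may differ.

```ruby
# Hypothetical classifier for illustration; not the gem's implementation.
# Maps a single character to a coarse label using built-in regex properties.
def script_label(char)
  case char
  when /\p{Latin}/    then :Latin
  when /\p{Cyrillic}/ then :Cyrillic
  when /\p{Nd}/       then :Digit
  when /\s/           then :Whitespace
  when /\p{P}/        then :Punctuation
  else :Other
  end
end

# Returns [label, run] pairs, one per run of consecutive characters
# that share the same label.
def group_by_script(text)
  text.each_char
      .chunk { |char| script_label(char) }
      .map { |label, chars| [label, chars.join] }
end

group_by_script("Hel6\u0431 how").each do |label, run|
  puts "#{label}: #{run.inspect} (#{run.length} characters)"
end
```

Printing the run with inspect keeps whitespace-only groups visible in the output.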
Get a homograph spoof analysis
UnicodeScriptDetector.spoof_analysis "Раypal"
=>
[
#<struct UnicodeScriptDetector::SpoofDetector::Detection
type=:confusable,
message="Found 2 character(s) from non-Latin scripts that visually resemble Latin letters",
characters=[
#<struct UnicodeScriptDetector::SpoofDetector::ConfusableChar char="Р", script="Cyrillic", looks_like="P", position=0>,
#<struct UnicodeScriptDetector::SpoofDetector::ConfusableChar char="а", script="Cyrillic", looks_like="a", position=1>
],
severity=:high>,
#<struct UnicodeScriptDetector::SpoofDetector::Detection
type=:mixed_scripts,
message="Text contains a mix of 2 scripts: Cyrillic, Latin",
characters=["Cyrillic", "Latin"],
severity=:medium>
]
Check whether a homograph spoof is detected
UnicodeScriptDetector.spoofed? "Раypal"
=> true
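The :mixed_scripts detection shown earlier can be illustrated with a simplified check that lists which letter scripts occur in a string. This sketch is an assumption about the general idea, not the gem's code: LETTER_SCRIPTS, scripts_in and mixed_scripts? are hypothetical names, only three scripts are considered, and the confusables lookup is not covered.

```ruby
# Hypothetical, simplified mixed-script check; not the gem's implementation.
LETTER_SCRIPTS = {
  Latin: /\p{Latin}/,
  Cyrillic: /\p{Cyrillic}/,
  Greek: /\p{Greek}/
}.freeze

# Names of the scripts (from the table above) that occur in the text.
def scripts_in(text)
  LETTER_SCRIPTS.select { |_name, pattern| text.match?(pattern) }.keys
end

# A string mixing more than one letter script is worth a closer look.
def mixed_scripts?(text)
  scripts_in(text).size > 1
end

spoofed = "\u0420\u0430ypal" # Cyrillic "Р" and "а" followed by Latin "ypal"

p scripts_in(spoofed)       # prints [:Latin, :Cyrillic]
p mixed_scripts?(spoofed)   # prints true
p mixed_scripts?("Paypal")  # prints false
```

A mixed-script result alone does not prove spoofing, since legitimate text can combine scripts, which may be why the analysis above rates it as medium rather than high severity.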
Development
- Start the console with bin/console.
- Run the tests with bin/test.
- Update the confusables list from unicode.org with rake update_confusables.
Contributing
You're welcome to contribute to this project. See https://github.com/davidarendsen/unicode_script_detector.
License
This software is released under the MIT license.