Paul E. Black, a scientist at the National Institute of Standards and Technology (NIST), has developed an algorithm that may guide applicants in proposing new “top-level domains,” such as .com, that people type when navigating the Web.
The algorithm was developed at the request of the Internet Corporation for Assigned Names and Numbers (ICANN). It checks whether a newly proposed name is confusingly similar to existing ones by looking for visual likenesses in its appearance.
Having visually distinct top-level domain names may help avoid confusion in navigating the ever-expanding Internet and combat fraud, by reducing the potential to create malicious look-alikes: .C0M with a zero instead of .COM, for instance.
ICANN is planning to launch the process for proposing a new round of “generic” top-level domains (gTLDs), strings such as .net, .gov and .org meant to indicate organizations or interests. In preparing for newly proposed gTLDs, ICANN engaged various algorithm developers, including Black, to “provide an open, objective, and predictable mechanism for assessing the degree of visual confusion” in gTLDs.
Black’s algorithm, which is available online, compares a proposed gTLD with other TLDs and generates a score based on their visual similarities. For example, the domain .C0M scores an 88 percent visual similarity with the familiar .COM. The resulting scores may help indicate whether the newly proposed domain name looks too much like existing ones.
The score is an enhanced Levenshtein distance that is adjusted for length and normalized. Other possible distance measures include Jaro-Winkler, Damerau-Levenshtein, and cosine distance, among many others.
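For readers unfamiliar with the underlying measure, here is a minimal sketch of plain Levenshtein distance turned into a percent similarity. It is not Black's code: the function names `levenshtein` and `percent_similarity` are hypothetical, every edit costs 1, and dividing by the longer string's length is only one assumed way to normalize.

```python
def levenshtein(a: str, b: str) -> int:
    """Count the single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def percent_similarity(a: str, b: str) -> float:
    """Assumed normalization: scale the distance by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0  # two empty strings are identical
    return 100.0 * (1.0 - levenshtein(a, b) / longest)


if __name__ == "__main__":
    print(levenshtein("COM", "C0M"))
    print(round(percent_similarity("COM", "C0M"), 1))
```

Because this unweighted version treats every substitution alike, swapping “O” for “0” costs as much as swapping “O” for “X”. The enhanced distance described in the article instead makes look-alike characters cheaper to exchange.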
The code is written in Python. The interface to the algorithm itself is a single function, howConfusableAre(). It takes two parameters: the two strings to be compared.
To make its assessments, the algorithm rates the degree of similarity between pairs of alpha-numeric characters. Some pairs, such as the numeral “1” and its dead-ringer, the lowercase letter “l,” are assigned the highest scores for visual similarity while other pairs, such as “h” and “n”, are given lower scores. The algorithm takes other considerations into account, for example how certain pairs of letters, like “c” and “l,” can join to look like a third letter (“d”), as in the case of “close” and “dose.” Employing these scores and considerations, the algorithm computes the “cost” of transforming one string of characters into another, such as “opel” into “apple.” Lower cost means higher visual similarity. The algorithm then adjusts for the relative lengths of the two strings (different lengths increase their distinctiveness) and converts the final cost into a percent similarity.
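The steps described above can be sketched roughly as a weighted edit distance. This is an illustration under stated assumptions, not Black's implementation: the names `visual_cost` and `visual_similarity`, every number in the two cost tables, and the length normalization are all hypothetical, so the output will not reproduce the 88 percent figure quoted for .C0M.

```python
# Hypothetical substitution costs between 0 and 1; lower means more alike.
SUBSTITUTION_COST = {
    frozenset("1l"): 0.1,  # numeral one vs. lowercase L
    frozenset("0o"): 0.1,  # zero vs. letter O
    frozenset("hn"): 0.6,  # somewhat alike
}

# Hypothetical costs for two characters that together resemble one.
MERGE_COST = {("cl", "d"): 0.3, ("rn", "m"): 0.3}


def sub_cost(x: str, y: str) -> float:
    """Cost of replacing x with y; unrelated characters cost 1."""
    if x == y:
        return 0.0
    return SUBSTITUTION_COST.get(frozenset((x, y)), 1.0)


def visual_cost(a: str, b: str) -> float:
    """Cheapest way to transform a into b using the tables above."""
    a, b = a.lower(), b.lower()  # assumption: case is ignored
    n, m = len(a), len(b)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = float(i)  # delete every character of a[:i]
    for j in range(1, m + 1):
        d[0][j] = float(j)  # insert every character of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = min(
                d[i - 1][j] + 1.0,  # deletion
                d[i][j - 1] + 1.0,  # insertion
                d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),
            )
            # Two characters of a that look like one character of b.
            if i >= 2 and (a[i - 2:i], b[j - 1]) in MERGE_COST:
                merged = MERGE_COST[(a[i - 2:i], b[j - 1])]
                best = min(best, d[i - 2][j - 1] + merged)
            # One character of a that looks like two characters of b.
            if j >= 2 and (b[j - 2:j], a[i - 1]) in MERGE_COST:
                merged = MERGE_COST[(b[j - 2:j], a[i - 1])]
                best = min(best, d[i - 1][j - 2] + merged)
            d[i][j] = best
    return d[n][m]


def visual_similarity(a: str, b: str) -> float:
    """Assumed length adjustment: divide the cost by the longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1.0 - visual_cost(a, b) / longest)


if __name__ == "__main__":
    for pair in [("close", "dose"), ("COM", "C0M"), ("opel", "apple")]:
        print(pair, round(visual_similarity(*pair), 1))
```

In this sketch “close” and “dose” come out far more alike than “opel” and “apple,” because the hypothetical “cl”-to-“d” merge is cheap, while turning “opel” into “apple” needs three full-cost edits.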