Natural Language and Computational Linguistics

A bit of cryptography, password strength, and sysadmins of savings banks

Posted in General, Natural Language and Computational Linguistics on August 14th, 2011 by Hector Zenil

This recent webcomic from xkcd, Password Strength, relates to something I’ve been investigating lately, drawing, as usual, on my knowledge of finite algorithmic information theory: the question of whether there need be a tradeoff between password strength and ease of recall. While my conclusions will be presented in a paper in the next few months (or years), the webcomic made me realize how defective even sophisticated things sometimes turn out to be. It draws attention to an unnecessary, even silly requirement for password generation.

Password strength is related to two main parameters: length and the likelihood of being guessed. Length is a simple parameter, the longer the better, but the likelihood of being guessed is a bit trickier. What makes a string like 0000000 more likely to be cracked than, say, 0110101? Classical probability tells us that both are equally likely to occur by (uniform) chance. But there is a good reason not to choose a word in any language as a password, viz. that someone trying to guess a password will assume that people wish to remember it and will therefore most likely use meaningful strings, such as words, anniversary dates, and the like. Hackers use a technique called a ‘dictionary attack’, in which every entry in a dictionary is treated as a potential password and tried as such. The results are often spectacular: despite the general awareness that words make poor passwords, these attacks usually turn up at least a few people with weak, word-based passwords. Since the dictionary attack is the only reason not to use words, sysadmins took it upon themselves to dissuade people from choosing words as passwords, and the way they set about this was to ask users, and now quite often to force them, to:

1. Observe fixed length requirements;
2. Sometimes require a combination of caps;
3. Sometimes require numbers and other symbols;
4. In the case of some sites, require frequent password changes.

While requirement number 1 makes some sense (people may argue, with reason, that a short but good password is superior to an easy-to-guess long one), requirements 2 to 4 are quite arbitrary, if not completely nonsensical.
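
For concreteness, requirements 1 to 3 typically amount to a check like the following (a hypothetical sketch in Mathematica; the function name, thresholds and character classes are my assumptions, and the exact rules vary from site to site):

policyCompliantQ[pw_String] :=
 StringLength[pw] >= 8 &&                                 (* 1. minimum length *)
  StringMatchQ[pw, RegularExpression[".*[A-Z].*"]] &&     (* 2. at least one capital *)
  StringMatchQ[pw, RegularExpression[".*[0-9].*"]] &&     (* 3. at least one digit... *)
  StringMatchQ[pw, RegularExpression[".*[^A-Za-z0-9].*"]] (* ...and at least one other symbol *)

policyCompliantQ /@ {"correcthorsebatterystaple", "Tr0ub4dor&3"}
(* {False, True}: the long, easy-to-remember passphrase fails, the hard-to-remember one passes *)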

If people didn’t pick word-based passwords, i.e. if every character were equally likely, only the length of the password would matter. If a password were generated randomly, the probability of hitting a word by random letter generation would be very low: the alphabet of letters is already larger than the set of one-digit numbers and reasonable symbols, and actual words make up only a vanishingly small fraction of all possible letter sequences. The problem is that words are not random: they are sequences of correlated letters, following the distribution of words in a language.
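
A quick way to see how unlikely hitting a word by chance is: compare the number of, say, 8-letter lowercase dictionary words with the number of all possible 8-letter lowercase strings (a rough sketch; the exact figure depends on the dictionary used):

words8 = Select[DictionaryLookup[], StringLength[#] == 8 && LowerCaseQ[#] &];
N[Length[words8]/26^8]
(* a tiny fraction, roughly of the order of one in ten million *)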

But under the circumstances, the way sysadmins decided to address this issue was to force people not to choose single words by arbitrarily enforcing requirements 1 to 3, with the result that people frequently either no longer remember their passwords or cannot keep a range of different passwords for use at different sites (recall that if a password is disclosed at one site, it is effectively disclosed for every other site where the same password is used). To fix this, sysadmins then introduced the practice of forcing people to change their passwords periodically, so that users can no longer synchronize their passwords across all the sites that need one. This leads them either to write down their nonsensical passwords because they cannot recall them, to store them where they are not secure, or to ask the system to retrieve the password every time they want to log in, often over unsecured lines (e.g. public wireless networks). These arbitrary requirements aren’t listed while you’re trying to log in, so you may not even remember whether you used a number or a symbol (it may seem a bad idea to make hackers privy to this, but they will be anyway if they have access to the registration process). The computational cost of retrieving or generating a new password may be low, but the human time cost may not be.

A longer password will always be better than a shorter one, even a shorter one containing all sorts of numbers and symbols. Numbers and symbols are only useful for preventing dictionary attacks, and if people use leeting (or chat slang) I’m afraid that does not solve the problem sysadmins are faced with, because hackers can just apply the translation tables to their dictionaries and try again, which I’m sure would be very successful. There are leetspeak translation sites on the web (e.g. www.computerhope.com/cgi-bin/convert.pl). The only truly effective way to solve the problem is to require longer passwords, which also prevents the use of single words: most common words have fewer than 8 characters, so asking for a 10-character password may be enough to preempt the choice of most dictionary words, and more characters can be required for greater safety.
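
The reason leetspeak adds so little is that the substitution tables are public and cheap to apply: an attacker can simply ‘leet’ the whole dictionary in one line and run the same attack again. A sketch (the substitution table here is a small, made-up sample; real tools use much larger ones):

leetRules = {"a" -> "4", "e" -> "3", "i" -> "1", "o" -> "0", "s" -> "5", "t" -> "7"};
leetedDictionary = StringReplace[DictionaryLookup[], leetRules];
RandomChoice[leetedDictionary, 3]   (* a 'leeted' dictionary, ready for the same attack *)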

Could this problem have been solved otherwise? It would certainly be too expensive–computationally speaking–to check every password against a dictionary to avoid every possible word in every language (and even harder to avoid employing anything having to do with the user’s life, such as his forename or anniversary date), but being asked to come up with silly characters became unbearable for many users.

The xkcd strip gives you a foretaste of a side effect of my algorithmic approach to cryptography. One does not, and should not have to, sacrifice ease of recall for strength. Something as simple as composing random words has greater combinatorial strength than a leetspeak-corrupted single word or a password made of a few random characters, with the added advantage that, even if longer, such sentences are much easier to remember.

There is no trick to this. Choosing a few words means a longer password, and putting them together circumvents dictionary attacks (to perform dictionary attacks over full sentences is computationally too expensive, so there is no way hackers can start doing so).
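
A rough counting argument makes the point: k words chosen at random from a vocabulary of n words give n^k possible passwords, i.e. about k log2(n) bits. Taking a modest 2,000-word vocabulary purely for illustration (the helper function below is mine, not from the post):

bits[n_, k_] := N[k Log2[n]]
bits[2000, 2]   (* about 22 bits for two random words *)
bits[2000, 4]   (* about 44 bits for four random words, as in the xkcd strip *)
bits[26, 8]     (* about 38 bits for 8 random lowercase letters *)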

One shouldn’t need to choose between a good password and one that’s easy to remember, because there are passwords that can be both. What one may not be able to have is a short password that is easy to remember and also strong enough. Thus requirement 1 is the only important one, because one can also control the number of possible dictionary words, e.g. by requiring a password longer than most commonly used words. Zipf’s law tells us that the required length will not be that great, because the longest words are less frequently used and therefore also less likely to be picked as passwords (besides, systems could actually make sure that no word in any major language has the stipulated length).
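
One can check directly how word length is distributed in the word list shipped with Mathematica (a sketch; the counts naturally depend on the dictionary used, and frequency of use concentrates even more mass on the short words):

lengths = StringLength /@ DictionaryLookup[];
Max[lengths]                    (* length of the longest entry in the built-in English word list *)
Count[lengths, l_ /; l >= 12]   (* how many entries are 12 characters or longer *)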

Compare the result of putting two random, short (but not too short) words together using Mathematica:

In[1]:= passwds = Table[
    StringJoin[
     RandomChoice[
      Select[DictionaryLookup[], (StringLength[#] < 6) && LowerCaseQ[#] &],
      2]], {100}];
In[2]:= RandomChoice[passwds, 5]
Out[2]= {"rimecrave", "stoneobeys", "mixedtromp", "whiffmutt", "letopted"}

with their leetspeak versions, of the kind sysadmins would probably require one to use:

In[3]:= ToLeetSpeak /@ %
Out[3]= {"r1m3cr4v3", "570n30b3y5", "m1x3d7r0mp", "wh1ffmu77", "l370p73d"}

even though the joined results do not themselves exist as words in a standard English dictionary:

In[4]:=Select[passwds, MemberQ[DictionaryLookup[], #] &]
Out[4]={}

For fun, you can also generate two-word passwords in other languages using Mathematica:

In Spanish:

Table[StringJoin[
RandomChoice[
Select[DictionaryLookup[{"Spanish", All}], StringLength[#] < 6 &],
2]], {100}]

or French:

Table[StringJoin[
RandomChoice[
Select[DictionaryLookup[{"French", All}], StringLength[#] < 6 &],
2]], {100}]

Or any other of these languages:

In[5]:= DictionaryLookup[All]
Out[5]= {"Arabic", "BrazilianPortuguese", "Breton", "BritishEnglish", "Catalan", "Croatian", "Danish", "Dutch", "English", "Esperanto", "Faroese", "Finnish", "French", "Galician", "German", "Hebrew", "Hindi", "Hungarian", "IrishGaelic", "Italian", "Latin", "Polish", "Portuguese", "Russian", "ScottishGaelic", "Spanish", "Swedish"}

Meaning is in the word net: cyclic self-referential definitions, dictionaries and found in translation

Posted in Complexity, Minds and Machines, Natural Language and Computational Linguistics, New Ideas on December 2nd, 2006 by Hector Zenil

In the “Period 1 Cycle English Dictionary” published by “No way, Inc.” (it’s been said to be the most accurate dictionary ever) one can read:

dog (Noun) : a dog is a dog.

The lazy creators of this dictionary appear to have forgotten what is broadly accepted by common sense. A definition would not be a definition if it were cyclical and self-referential, simply because one would, theoretically, fall into an infinite loop trying to define an unknown word that has as part of its definition the word itself.

Let’s leave aside the fictional Period 1 Cycle English Dictionary, which I just made up, and consider how things work in a regular dictionary. Assume you want to know the definition of a word. Looking it up, you’ll encounter an ordered set of words in the language in question (if the dictionary is not a bilingual one). Let’s prove that, in the end, any definition of a word w in a closed finite dictionary is cyclic with some period p_w:

Proof sketch:

We may hypothesize that the dictionary D is closed under the operation “the definition of” (i.e. all words appearing in the definition of a word have their own definition in the dictionary). One can also safely assume that every word in D has a non-empty definition (given the definition of “dictionary” itself!) and that the dictionary is finite. Now, if w is a word in D, consider the orbit of successive definitions d_1, d_2, …, d_n, … obtained from w (each listed as a set of words). The path d_w = {d_1, d_2, …, d_n, …} cannot go on forever without repeating elements of D: by the pigeonhole principle, the finiteness of D forces the chain of definitions to come back to a word already encountered after a certain number of steps p depending on the word, making every word w in the dictionary (ultimately) cyclic with period p_w.
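
The argument is easy to see on a toy model. Below, a made-up miniature dictionary is encoded as a directed graph whose edges point from each word to the words used in its definition; since the graph is finite, every definition chain eventually falls into a cycle:

(* A made-up miniature dictionary, closed under “the definition of”. *)
toyDict = <|"dog" -> {"animal", "domestic"}, "animal" -> {"living", "thing"},
   "domestic" -> {"animal", "home"}, "living" -> {"thing"},
   "thing" -> {"living"}, "home" -> {"domestic", "thing"}|>;

defGraph = Graph[Flatten[Thread[#1 -> #2] & @@@ Normal[toyDict]],
   VertexLabels -> "Name"];
FindCycle[defGraph, Infinity, All]   (* the cycles into which every definition chain eventually falls *)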

If all the definitions in a dictionary are then cyclic and enclosed in the dictionary itself, where does the analytic knowledge that a dictionary seems to provide come from? When one consults a dictionary in a given language, one already knows some words in that language, but assuming one doesn’t, would a dictionary be completely useless, as the issue of ultimately cyclic self-referential definitions seems to suggest?

One could conceive of a dictionary as a whole as a syntactical knowledge container with no actual knowledge in it, at least to someone who does not bring to it any knowledge of the relevant language. One may wonder how something so perfectly self-referential could actually be of use, as dictionaries seem to be. Is it because you always know some, indeed many, words of a language already when you consult a dictionary? For since every word is defined by other words in the same language, looking up the meaning of a given word would lead you to other words and yet others and eventually back to the first word, the word whose meaning you set out to learn. This would be the case even if the dictionary were bilingual and the meaning of the word you wished to check were given in a second language. Thus all dictionaries are perfectly circular, closed, self-referential sources.

However, the analytical knowledge in a dictionary does not come from the definitions as words, but from the word net underneath, in which the words are connected to each other in some, perhaps unique, fashion (modulo exact synonyms). That’s what quite successful projects such as WordNet and functions like WordData in Mathematica are about. The power of being able to analyze a language as a net in a computable way is priceless for the progress of computational linguistics, and of linguistics in general.

 

[Figure: 2-level depth wordnet for the word "chair"]

For example, if “chair” is connected to “table”, “office”, “dining room”, etc., it should be easy to map it to its equivalent in any other language, unless the target language doesn’t have the concept “chair” as referring to a seat placed at a table in a dining room (as was perhaps the case in some Asian countries before colonialism), together with an explicit cognitive representation of it.
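
A rough sketch of how a small net like the one in the figure can be grown with WordData, here using synonym links only (the properties and layout behind the original figure are not given in the post, so this is merely an approximation):

(* Collect the synonyms of a word across all of its senses. *)
neighbors[w_String] :=
  Union[Flatten[Cases[WordData[w, "Synonyms"], (_ -> syns_) :> syns]]];

word = "chair";
level1 = neighbors[word];
edges = Flatten[{Thread[word -> level1], Thread[# -> neighbors[#]] & /@ level1}];
Graph[DeleteDuplicates[edges], VertexLabels -> "Name"]   (* a 2-level net around "chair" *)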

Of course problems arise when considering the mappings between words having the same written form but different senses. The word “chair,” for example, is a seat, but may also mean the officer who presides at a meeting. Also posing a problem would be cases of words being mapped to or from words belonging to different parts of speech, such as “personne” in French, which maps onto two completely different words in Spanish: “persona” and “nadie”, a noun and a pronoun respectively, with completely different connections and different supporting nets. So even when it seems that most relations might be surjective, the general case is certainly not bijective, and that applies to homonyms too, which often creates ambiguities in translation. However, the supporting network would be able to uncover this fact and resolve a possible ambiguity based on context by extending the word network to encompass the ambiguity. In other words, if a subnet cannot be uniquely mapped, extending it should eventually resolve the ambiguity. What one would need is a corpus big enough to build such a network once and for all and then simply make comparisons at the network level. This could work even for completely new or unknown languages, whether dead, living or artificial, assuming that they share a part of our actual reality and hence some part of our mental representations. (In a sense this is what Champollion did when he deciphered the Rosetta stone: he discovered a partial mapping of a subnetwork of words from an unknown language, Egyptian, to a subnetwork of a known one, Greek.) In the final analysis, each language has a single unique network, changing slightly through time but remaining well connected and strong enough to make it unique and recognizable while being isomorphic with that of any other language. All languages could be identified by their fingerprints: their word nets. This kind of analysis would also reveal the lack of a word-net structure in hoax languages, such as perhaps the Voynich manuscript.
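
A toy illustration of the net-to-net idea (the two miniature “word nets” below are invented for the example): if two languages produce structurally matching nets, a vertex correspondence recovered from the structure alone already amounts to a word-for-word translation.

english = Graph[{"chair" -> "table", "table" -> "dining room",
    "chair" -> "office", "office" -> "desk"}];
spanish = Graph[{"silla" -> "mesa", "mesa" -> "comedor",
    "silla" -> "oficina", "oficina" -> "escritorio"}];
FindGraphIsomorphism[english, spanish]
(* returns a structure-preserving vertex mapping; with nets this small and
   symmetric the mapping is not unique, which is exactly the kind of
   ambiguity that extending the net is meant to resolve *)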

Having established that, what about mining the world of all possible meanings, the world of all possible translations, and the world of all possible ideas? We wouldn’t have the problem of distinguishing between a coherent idea and an incoherent one, since the network would provide some minimal coherence. Thus the net-into-the-net approach would give us a way of translating from word to word, from phrase to phrase and from idea to idea as faithfully as possible in most cases, since in the end all of us as human beings share a single reality, though we perhaps approach it from different points of view.

Again, the analytical knowledge in a dictionary comes from the net connecting the words, so even if someone does not know English at all, I would say that he would be able, albeit with considerable difficulty, to learn English just by deducing the net connecting objects, in other words, by mapping his own mental representations of objects onto words in the English dictionary. In the process he could encounter some ambiguities, but the further he goes, the more of these he would be able to resolve. On the other hand, speakers of those languages in which “chair” does not exist, both in the language itself and as a real object in the culture, would be able to deduce what a chair is by tracking its relations with the objects they know and for which they do have mental representations and the phonemes to externalize them. So the problem of translation, which began with the mapping of word onto word and then phrase onto phrase with statistical tools, becomes, with this approach, a matter of mapping net to net. Indeed this seems to be the approach adopted by Meaningful Machines.

These ideas could be carried to the limit by taking the sum total of human languages and enquiring into the mapping between such a network and our cognitive representations. Such a move would provide grounds for rebutting the Chinese room argument, since in the end it does not matter that the person inside the room has no knowledge at all of the language; he would be able to map what he is mechanically translating onto his own mental representations, generating what, according to the argument, could not be generated: understanding. Searle’s idea was, as I recall, to build a case against A.I. in terms of the meaninglessness of the Turing test and of true AI in general.

One may actually use a dictionary without knowing a single word in it! That is because there is a mapping between the word net in the dictionary and one’s own language, or, even better, a mapping (not necessarily injective or surjective) between the word net of the dictionary and one’s personal cognitive word net.

Oversimplifying, translation might be reduced to the search for homomorphisms between algebraic structures, with each structure G = (W, d) being a language dictionary, where W is the set of words in that language and d the closed operation “definition of”. One can then see each definition, in an oversimplified way, as a directed, ordered graph g = (w, s), with w the set of vertex words involved in the definition and s the ordered (since definitions may not be commutative) set of edges, for an ideally formal-enough language dictionary.
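
In that spirit, a single (made-up) definition can be written down as an ordered, directed graph, with the position of each word in the definition recorded on the edges:

(* Toy definition of "dog"; the edge labels encode the order s of the words. *)
defWords = {"domesticated", "carnivorous", "mammal"};
defEdges = DirectedEdge["dog", #] & /@ defWords;
Graph[defEdges, VertexLabels -> "Name",
 EdgeLabels -> Thread[defEdges -> Range[Length[defWords]]]]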

The bad practice of cyclic definitions in a dictionary is thus better described as the practice of period 1 cycles, i.e. words that have the word itself in their immediate definition.

This post is related to a previous post titled “Meaning against A.I.”

Human Readable Proofs Visualization

Posted in Foundations of Math, Natural Language and Computational Linguistics, New Ideas on November 12th, 2006 by Hector Zenil

- Symbolic Visualizations, University of Texas: http://cvcweb.ices.utexas.edu/ccv/projects/VisualEyes/SymbVis/index.php
- Proof nets and zero-knowledge proofs.

Meaning against A.I.

Posted in Computer Science, Foundations of Computation, Minds and Machines, Natural Language and Computational Linguistics on October 12th, 2006 by Hector Zenil

A significant number of researchers believe that there are sentences with semantic value that could never be understood by a machine. These researchers believe that the mind has a semantic component, unlike machines. Theirs is a Chinese Room type argument à la Searle. Consider Chomsky’s example of two books in a library with the same title, and two readers, each taking out one of the books. Do they get the same book? These researchers argue that machines would be unable to answer correctly on the basis of context, since the answer depends on a cognitive understanding of the situation. My claim is that all meaningful components of a situation are based on a hierarchy that can be artificially represented by categories capturing the essence and functionality of human mental operations. The answer to Chomsky’s question would be “yes” if one is referring to the information content of the books, and “no” if one is referring to the books as physical objects.