Theory of Letter Frequency

Letter frequency is important for many endeavors, not just keyboard design. But how do we calculate it? An exact figure is of course impossible, because that would require a record of every single thing that has ever been typed. But we can get close.

My goal is to produce an accurate representation of letter frequency by emphasizing quality over quantity. Some sources, such as the Brown Corpus, have a huge quantity of text but are heavily biased toward professional writing. I instead try to maximize the quality of the text, which is why my corpus draws on a broad range of categories.

I have noticed that different types of text have very different letter frequencies, and people differ in what they type: some do a lot of programming, while others write a lot of emails. So the letter frequency must be customizable. To allow for this, I sort text into five categories: prose, casual, programming, formal, and news. (The basic idea for categorization came from Arensito.)

Prose gets its own category because its style is more formalized than casual writing, but in different ways than formal writing; prose is also frequently older, so it contains unusual words and conventions. Casual requires no justification; it covers things such as email and blogs. Programming is significantly different from everything else because of its vastly different syntax. Formal writing, which includes scientific papers and the like, has its own style and frequently uses technical jargon. News, by which I mean anything from a newspaper, is similar to casual but somewhat more formal, and it follows certain syntax conventions; I include it mainly because I find that it closely matches the expected letter frequency (by "expected", I mean expected from the letter frequency statistics that I have found online).

Each of these categories is noticeably different. These are the letter frequencies I obtained for each category:

Prose:        e t a o n i h s r d l u m w c f g y p b v k x j q z
Casual:       e t a o i n s r h l d c u m g y f p w b v k x j q z
Programming:  e t a r i s n o l c d p u f m g h b v x y w k q j z
Formal:       e t a i o n s r h l d c u f m p g y w b v k x j q z
News:         e t a i o n s r h l d c u m p f g y w b v k x j z q

Or, the complete character frequency:

Prose:       e t a o i n h s r d l u m w c y f g , p b . v k ' " - ; ! ? x j q : z _ < > ) ( 1 2 0 4 3 5 9 8 6 7 * [ ] + & / } { % @ $ = ~
Casual:      e t a o i n s r h l d c u m g y f p w b . , v k 0 - ' x ) ( 1 j 2 : q " / 5 ! ? z 3 4 6 8 7 9 % [ ] * = + | _ ; \ > $ # ^ & @ < ~ { } `
Programming: e t a r i s n o l c d _ p u f m ( ) g h ; b , = . v x y * " k w - 0 / $ > { } 1 : ' \ 2 q [ ] j & + z < 3 | @ # 4 ! 8 5 6 9 7 % ? ~ ^ `
Formal:      e t a i o n s r h l d c u f m p g y w b , v . k - x " ; 1 j q 0 2 ' ) ( z : 9 [ ] 3 4 5 6 8 7 ? ` _ / ! & ^ + % = { * } | ~ > # < @ $
News:        e t a i o n s r h l d c u m p f g y w b , . v k " - 0 ' x j 1 z 2 q 9 5 3 8 4 7 : 6 ( ) $ ; | ? / ! & ] [ % @ _ > < * = + #

And for individual programming languages:

Composite: e t a r i s n o l c d _ p u f m ( ) g h ; b , = . v x y * " k w - 0 / $ > { } 1 : ' \ 2 q [ ] j & + z < 3 | @ # 4 ! 8 5 6 9 7 % ? ~ ^ `
C:         e t r i n s a o _ c l d u p f m ) ( h g ; , b * x = v k y - 0 w / . > 1 " { } 2 & \ q [ ] z + 3 : 8 4 # < 6 ! ' 5 9 j 7 | % @ ? ^ ` ~ $
Java:      e t a i r n s o l c d p u . ( ) g m f ; _ h b v = w x y / k " , { } j * 0 - + 1 q ] [ z 2 3 ! < 5 : & | 4 > \ ' 6 9 8 7 @ ? % # ^ ` $ ~
Perl:      e s t r $ a i n l o f d c u p _ m ; ( ) = { } " h , y > b ' g - : 0 x v \ @ k w / 1 q . | ] [ # 2 & + * ? % ~ z ! ^ < 3 5 j 4 8 6 9 7 `
Ruby:      e t n s a o r i l d c _ p u m f " . , = h ' ( : ) g b v > y w < [ ] / 1 x @ q k 0 \ 2 | ? { } 3 - j 5 4 z 6 7 % 9 8 + ! * & $ ; # ^ ~ `

They may look similar, but the differences are significant. (It is still worth noting that the differences here are smaller than some of the differences between supposedly comprehensive letter frequencies I have found online, which calls the reliability of those frequencies into question.) When characters other than letters are included, the differences are even greater. For example, programming uses far more semicolons than any other form of typing.
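
Rankings like the ones listed above can be produced by counting characters and sorting by count. The following is a minimal Python sketch of that idea, not the program used to generate these results. The function name char_ranking, the letters_only flag, and the sample text are made up for illustration. It assumes that upper and lower case are counted together and that whitespace is skipped.

```python
from collections import Counter
import string

def char_ranking(text, letters_only=True):
    """Return the characters in text ordered from most to least frequent."""
    lowered = text.lower()  # assumption: upper and lower case are counted together
    if letters_only:
        chars = (ch for ch in lowered if ch in string.ascii_lowercase)
    else:
        chars = (ch for ch in lowered if not ch.isspace())  # assumption: whitespace is skipped
    counts = Counter(chars)
    # Sort by descending count; ties are broken by character code so the order is stable.
    return sorted(counts, key=lambda ch: (-counts[ch], ch))

sample = "Hello, World! Hello again."  # made-up text, for illustration only
print(" ".join(char_ranking(sample)))
print(" ".join(char_ranking(sample, letters_only=False)))
```

Running the same function over the text of each category separately would give one ranking per category, which is the form of the listings above.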

I think these categories adequately cover the different styles of typing. The next question is how heavily each should be weighted. The answer will obviously differ from person to person, so the letter frequency calculation program I am writing, and will be releasing soon, leaves open the option to weight these categories differently. But I also want to create a single letter frequency that is the best for the most people, which makes the weighting trickier.

Each category gets a multiplier n: every occurrence of a character in that category is counted as n occurrences. For example, these multipliers

Prose = 1, Casual = 1, Programming = 1, Formal = 1, News = 1

mean that each category is weighted equally, and

Prose = 2, Casual = 1, Programming = 1, Formal = 1, News = 0

mean that prose counts twice as much as casual, programming, or formal, while news is completely ignored.
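
The multiplier rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual calculation program: the function name combine and the tiny counts are invented, and a category with no multiplier is treated as having a multiplier of 0.

```python
from collections import Counter

def combine(counts_by_category, multipliers):
    """Sum per-category character counts, scaling each category by its multiplier."""
    total = Counter()
    for category, counts in counts_by_category.items():
        n = multipliers.get(category, 0)  # assumption: a missing category counts as 0
        for ch, count in counts.items():
            total[ch] += n * count  # one occurrence is treated as n occurrences
    return total

# Tiny made-up counts, for illustration only.
counts_by_category = {
    "prose": Counter({"e": 10, "t": 6}),
    "news": Counter({"e": 4, "t": 9}),
}
multipliers = {"prose": 2, "news": 0}
combined = combine(counts_by_category, multipliers)
print(combined.most_common())
```

The sketch applies each multiplier to raw counts, exactly as the rule is worded. Whether counts should first be normalized by the amount of text in each category is not stated here, so the sketch leaves that step out.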

Since I am still trying to determine the best weightings, here are several examples.

Prose = 1, Casual = 1, Programming = 1, Formal = 1, News = 1:  e t a o i n s r h l d c u m f p g y w b , \ . v k _ " ( ) ' - ; = x $ 0 : 1 / q j > { } 2 [ ] z * ? < ! 3 5 @ | 4 9 8 + 6 7 & # % ^ ~ `

This result is definitely not accurate; for one thing, most people type far less formal text than casual text.

Prose = 6, Casual = 8, Programming = 4, Formal = 2, News = 7 (unweighted):  e t a o i n s r h l d c u m f p g y w b . , v k _ ( ) ; " = ' - $ x / 0 : { } 1 j * > q 2 [ ] z ! \ ? < + 3 @ | 5 4 # & 6 8 9 7 % ~ ^ `

Prose = 6, Casual = 8, Programming = 4, Formal = 2, News = 7 (weighted):  e t a i o n s r h l d c u m p f g y w b , . v k " - 0 ' x j 1 z 2 q 9 5 3 8 4 7 : 6 ) ( $ ; | ? / ! & [ ] % _ @ > = * < + # ` ^ { } ~ \

These proportions seem somewhat reasonable, and the resulting letter frequency is very close to that of letterfrequency.org. I think, though, that programming and prose may be overweighted.

I decided to take this a step further and subdivide programming by language. Some languages are more common than others, and my corpus heavily over-represents some of them. The corpus includes code in C, Java, Perl, and Ruby, which I weight according to language popularity with the following values (adjusted 1/3/2012):

C = 4, Java = 2, Perl = 1, Ruby = 1

C is weighted disproportionately heavily because (a) many modern languages are based on C and therefore have similar syntax, and (b) C++ is a very popular language that is very similar to C in most respects.

The weightings used above tend to over-represent programming text. I am a programmer myself, so that's fine for me, but most people are not, and they get no benefit at all from the programming text in the corpus. After considering this, I changed the weightings to the following:

Prose = 18, Casual = 25, C = 4, Java = 2, Perl = 1, Ruby = 1, Formal = 15, News = 20
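
To see how much of the total each category now receives, the multipliers can be turned into fractions of their sum. The short Python sketch below does only that arithmetic; the variable names are made up for illustration.

```python
from fractions import Fraction

# The multipliers listed above.
weights = {"Prose": 18, "Casual": 25, "C": 4, "Java": 2,
           "Perl": 1, "Ruby": 1, "Formal": 15, "News": 20}
programming = ("C", "Java", "Perl", "Ruby")

total = sum(weights.values())
shares = {name: Fraction(w, total) for name, w in weights.items()}
programming_share = sum(shares[name] for name in programming)

for name, share in shares.items():
    print(f"{name:<7} {float(share):6.1%}")
print(f"Programming combined: {float(programming_share):.1%}")
```

The four languages together hold 8 of 86 weight units, about 9%, compared with 4 of 27, about 15%, under the earlier example of Prose = 6, Casual = 8, Programming = 4, Formal = 2, News = 7. These are shares of the multipliers only; the share of characters in the final count also depends on how much text each category contains.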

This way, programming code is still taken into account but it's not nearly as significant.

I used these weightings to produce the final frequency you see on my letter frequency page.