Prose: e t a o n i h s r d l u m w c f g y p b v k x j q z
Casual: e t a o i n s r h l d c u m g y f p w b v k x j q z
Programming: e t a r i s n o l c d p u f m g h b v x y w k q j z
Formal: e t a i o n s r h l d c u f m p g y w b v k x j q z
News: e t a i o n s r h l d c u m p f g y w b v k x j z q
Prose: e t a o i n h s r d l u m w c y f g , p b . v k ' " - ; ! ? x j q : z _ < > ) ( 1 2 0 4 3 5 9 8 6 7 * [ ] + & / } { % @ $ = ~
Casual: e t a o i n s r h l d c u m g y f p w b . , v k 0 - ' x ) ( 1 j 2 : q " / 5 ! ? z 3 4 6 8 7 9 % [ ] * = + | _ ; \ > $ # ^ & @ < ~ { } `
Programming: e t a r i s n o l c d _ p u f m ( ) g h ; b , = . v x y * " k w - 0 / $ > { } 1 : ' \ 2 q [ ] j & + z < 3 | @ # 4 ! 8 5 6 9 7 % ? ~ ^ `
Formal: e t a i o n s r h l d c u f m p g y w b , v . k - x " ; 1 j q 0 2 ' ) ( z : 9 [ ] 3 4 5 6 8 7 ? ` _ / ! & ^ + % = { * } | ~ > # < @ $
News: e t a i o n s r h l d c u m p f g y w b , . v k " - 0 ' x j 1 z 2 q 9 5 3 8 4 7 : 6 ( ) $ ; | ? / ! & ] [ % @ _ > < * = + #
And for individual programming languages:
Composite: e t a r i s n o l c d _ p u f m ( ) g h ; b , = . v x y * " k w - 0 / $ > { } 1 : ' \ 2 q [ ] j & + z < 3 | @ # 4 ! 8 5 6 9 7 % ? ~ ^ `
C: e t r i n s a o _ c l d u p f m ) ( h g ; , b * x = v k y - 0 w / . > 1 " { } 2 & \ q [ ] z + 3 : 8 4 # < 6 ! ' 5 9 j 7 | % @ ? ^ ` ~ $
Java: e t a i r n s o l c d p u . ( ) g m f ; _ h b v = w x y / k " , { } j * 0 - + 1 q ] [ z 2 3 ! < 5 : & | 4 > \ ' 6 9 8 7 @ ? % # ^ ` $ ~
Perl: e s t r $ a i n l o f d c u p _ m ; ( ) = { } " h , y > b ' g - : 0 x v \ @ k w / 1 q . | ] [ # 2 & + * ? % ~ z ! ^ < 3 5 j 4 8 6 9 7 `
Ruby: e t n s a o r i l d c _ p u m f " . , = h ' ( : ) g b v > y w < [ ] / 1 x @ q k 0 \ 2 | ? { } 3 - j 5 4 z 6 7 % 9 8 + ! * & $ ; # ^ ~ `
Prose = 1, Casual = 1, Programming = 1, Formal = 1, News = 1
Prose = 2, Casual = 1, Programming = 1, Formal = 1, News = 0
Prose = 1, Casual = 1, Programming = 1, Formal = 1, News = 1: e t a o i n s r h l d c u m f p g y w b , \ . v k _ " ( ) ' - ; = x $ 0 : 1 / q j > { } 2 [ ] z * ? < ! 3 5 @ | 4 9 8 + 6 7 & # % ^ ~ `
Prose = 6, Casual = 8, Programming = 4, Formal = 2, News = 7 (unweighted): e t a o i n s r h l d c u m f p g y w b . , v k _ ( ) ; " = ' - $ x / 0 : { } 1 j * > q 2 [ ] z ! \ ? < + 3 @ | 5 4 # & 6 8 9 7 % ~ ^ `
Prose = 6, Casual = 8, Programming = 4, Formal = 2, News = 7 (weighted) : e t a i o n s r h l d c u m p f g y w b , . v k " - 0 ' x j 1 z 2 q 9 5 3 8 4 7 : 6 ) ( $ ; | ? / ! & [ ] % _ @ > = * < + # ` ^ { } ~ \
The proportions above seem reasonable, and the resulting letter frequency is very close to the one published at letterfrequency.org. Programming and prose may still be overstated, though.
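To make the weighting idea concrete, here is a minimal Python sketch of one way to blend several corpora into a single ranking. It is not the script used to produce the lists above. It assumes that each category is first normalized to relative frequencies so that a large corpus cannot dominate a small one, that text is lowercased, and that whitespace is ignored. The function names and the tiny sample texts are hypothetical.

```python
from collections import Counter

def relative_frequencies(text):
    """Return each non-whitespace character's share of the text (lowercased)."""
    counts = Counter(ch for ch in text.lower() if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ch: n / total for ch, n in counts.items()}

def combine(corpora, weights):
    """Blend per-category frequencies; each category counts in proportion to its weight."""
    total_weight = sum(weights.values())
    blended = Counter()
    for name, text in corpora.items():
        share = weights[name] / total_weight
        for ch, freq in relative_frequencies(text).items():
            blended[ch] += share * freq
    return blended

def ranking(blended):
    """Characters from most to least frequent; ties are broken by character code."""
    return [ch for ch, _ in sorted(blended.items(), key=lambda kv: (-kv[1], kv[0]))]

# Hypothetical stand-ins for real corpora, which would be far larger.
corpora = {"Prose": "the cat sat", "Programming": "x = f(x);"}
weights = {"Prose": 6, "Programming": 4}

print(" ".join(ranking(combine(corpora, weights))))
```

Setting a weight to 0 drops that category from the blend entirely, and equal weights treat every category the same regardless of how much text it contains.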
I decided to take this a step further and subdivide programming by language. Some languages are more widely used than others, and my corpus heavily over-represents a few of them. It includes code in C, Java, Perl, and Ruby, weighted by language popularity as follows (adjusted 1/3/2012):
C = 4, Java = 2, Perl = 1, Ruby = 1
C is given a disproportionately large weight because: (a) many modern languages borrow their syntax from C and therefore look similar; (b) C++ is a very popular language that shares most of its syntax with C.
The letter frequency used above tends to over-represent programming text. I am a programmer, so that works for me, but most people are not, and they get no benefit from the programming text in the corpus. With that in mind, I changed the weightings to the following:
Prose = 18, Casual = 25, C = 4, Java = 2, Perl = 1, Ruby = 1, Formal = 15, News = 20
This way, programming code is still taken into account but it's not nearly as significant.
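As a rough check on how much the change matters, the sketch below computes the share of total weight that goes to programming under each scheme. It assumes the earlier scheme is the Prose = 6, Casual = 8, Programming = 4, Formal = 2, News = 7 weighting, and that a category's influence is simply its weight divided by the sum of all weights. The helper name `programming_share` is hypothetical.

```python
def programming_share(weights, programming_keys):
    """Fraction of the total weight assigned to the programming categories."""
    total = sum(weights.values())
    return sum(weights[key] for key in programming_keys) / total

# Assumed baseline: the five-category weighting listed earlier.
earlier = {"Prose": 6, "Casual": 8, "Programming": 4, "Formal": 2, "News": 7}

# Final weighting, with programming split by language.
final = {"Prose": 18, "Casual": 25, "C": 4, "Java": 2, "Perl": 1, "Ruby": 1,
         "Formal": 15, "News": 20}

before = programming_share(earlier, ["Programming"])
after = programming_share(final, ["C", "Java", "Perl", "Ruby"])

print(f"earlier: {before:.1%}")  # 4 / 27 -> 14.8%
print(f"final:   {after:.1%}")   # 8 / 86 -> 9.3%
```

Under these assumptions, programming drops from about 15% of the blend to about 9%, while the 4 : 2 : 1 : 1 ratio between the four languages stays the same.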
I used these weightings to produce the final frequency you see at my letter frequency page.