Zipf's Law states that in any corpus of natural language utterances,
the frequency of usage of any word form is inversely proportional to its rank in the frequency table. So the most common word
occurs about twice as often as the second most common, three times as often as the third most common, and so on. Since the
Cantigas de Santa Maria can indeed be described (a bit drily) as a ‘corpus of natural language utterances’, I thought
it might be interesting to try the law out.
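In symbols, the law as stated above can be sketched as follows. The names f(r), for the frequency of the word form at rank r, and C, for a constant that depends on the corpus, are introduced here purely for illustration, and the exponent of exactly 1 is the idealised textbook form rather than anything measured on the Cantigas.

```latex
f(r) \approx \frac{C}{r}
\quad\Longrightarrow\quad
\log f(r) \approx \log C - \log r
```

Taking logarithms turns the inverse proportion into a line of slope -1, whichever base of logarithm is used.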
The simple test is to plot a graph of word rank against frequency. When both axes use a logarithmic
scale, the points should form a roughly straight line. This is in fact what we get for the Cantigas,
as demonstrated below. You can click on ‘Log 2’ to hide the blue circles (which are explained in the notes) and play around with the
other presets and individual settings, until you either understand the data thoroughly or decide that singing practice is more important.
[Interactive graph: frequency (y-axis) plotted against rank (x-axis), with circle size showing the number of words. Controls: Presets, Logarithmic, Axis scaling, Global scaling.]
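For anyone who wants to try the same test on another text, here is a minimal Python sketch. It is not the code behind the graph above: the function names `rank_frequency` and `loglog_slope` are made up for this example, the input is assumed to be a plain list of word tokens, and tied frequencies are simply given consecutive ranks rather than being grouped into one point.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, most frequent word form first.

    Simplification: tied frequencies get consecutive ranks.
    """
    counts = Counter(tokens)
    freqs = sorted(counts.values(), reverse=True)
    return [(rank, freq) for rank, freq in enumerate(freqs, start=1)]


def loglog_slope(points):
    """Least-squares slope of log10(frequency) against log10(rank)."""
    xs = [math.log10(rank) for rank, _ in points]
    ys = [math.log10(freq) for _, freq in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Invented, perfectly Zipfian data: frequency = 1200 / rank.
ideal = [(rank, 1200 / rank) for rank in range(1, 7)]
slope = loglog_slope(ideal)  # very close to -1 for this idealised input
```

A slope near -1 on real data would correspond to the roughly straight line described above. Real corpora usually deviate somewhat, especially at the two ends of the ranking.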
Notes
As you can see on the main Concordance page, there are a total
of 162243 words in this edition of the CSM (ignoring the 9 repeated cantigas) made up of
10220 distinct word forms.
There are 267 points on the graph, corresponding to the number of distinct frequencies (or ranks)
for the word forms.
The diameter of the circle drawn at each point indicates the number of distinct word forms that have the associated frequency and rank.
In the initial plot this is on a logarithmic scale like everything else, but I've reduced the diameters by a further
factor of 10 so as not to swamp the graph. (You can change 0.1 to 1 and see what happens...)
The first few points, which each represent just one word with a unique frequency, are labelled with the word itself, for orientation.
The top word e with rank 1 occurs 9613 times.
The last and biggest circle represents the set of 4725 word forms that only occur once, and have an equal statistical rank of 5496.
I'm using base-10 logarithms here simply because they look better on the graph than natural logarithms: they make the red scale lines coincide with the grey grid
at integer scaling factors.
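The tied ranks described in these notes can be reproduced with a short Python sketch. It assumes that words sharing a frequency all take the rank of the first of them, which is consistent with the figures quoted above: 10220 - 4725 + 1 = 5496. The function name `frequency_groups` and the toy counts are invented for illustration and are not the page's actual code or data.

```python
from collections import Counter


def frequency_groups(counts):
    """Return (rank, frequency, n_forms) for each distinct frequency.

    `counts` maps word form -> number of occurrences. The rank of a group
    is 1 plus the number of word forms with a strictly higher frequency.
    """
    forms_per_freq = Counter(counts.values())
    groups = []
    rank = 1
    for freq in sorted(forms_per_freq, reverse=True):
        n_forms = forms_per_freq[freq]
        groups.append((rank, freq, n_forms))
        rank += n_forms  # the next group starts after all the tied forms
    return groups


# Toy counts, not taken from the Cantigas.
toy = {"e": 5, "que": 3, "a": 3, "x": 1, "y": 1, "z": 1}
groups = frequency_groups(toy)
# groups == [(1, 5, 1), (2, 3, 2), (4, 1, 3)]
```

Each tuple corresponds to one point on the graph: its rank and frequency give the position, and the number of word forms sets the circle size.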