Zipf's Law graph for the Cantigas de Santa Maria

Zipf's Law states that in any corpus of natural language utterances, the frequency of usage of any word form is approximately inversely proportional to its rank in the frequency table. So the most common word occurs about twice as often as the second most common, three times as often as the third most common, and so on. Since the Cantigas de Santa Maria can indeed be described (a bit drily) as a ‘corpus of natural language utterances’, I thought it might be interesting to try the law out.
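As a rough illustration of that proportionality, here is a minimal Python sketch of the frequencies an idealised Zipfian corpus would show, given the count of its top word. The function name zipf_expected_frequency is made up for this example; the value 9611 is the count of the top-ranked word given in the notes further down. Real counts will only follow these figures approximately.

```python
def zipf_expected_frequency(top_frequency, rank):
    """Frequency an ideal Zipfian corpus would give the word at this rank."""
    return top_frequency / rank


# 9611 is the count of the top-ranked word reported in the notes on this page.
for rank in (1, 2, 3, 10):
    print(rank, round(zipf_expected_frequency(9611, rank)))
```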

The simple test is to plot a graph of word rank against frequency. When both axes use a logarithmic scale, the points should form a roughly straight line. This is in fact what we get for the Cantigas, as demonstrated below. You can click on ‘Log 2’ to hide the blue circles (which are explained in the notes) and play around with the other presets and individual settings, until you either understand the data thoroughly or decide that singing practice is more important.
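For anyone who wants to try the same test on their own text, the straight-line check can be sketched in Python roughly as follows. This is not the code behind this page: the function names and the toy word list are invented for illustration, and for simplicity it gives every word form its own rank rather than grouping forms with equal frequency. If frequency is proportional to 1/rank, then log(frequency) plotted against log(rank) is a straight line with slope close to -1, which is what the least-squares fit measures.

```python
import math
from collections import Counter


def rank_frequency(words):
    """Return (rank, frequency) pairs, most frequent word form first."""
    counts = Counter(words)
    frequencies = sorted(counts.values(), reverse=True)
    return list(enumerate(frequencies, start=1))


def loglog_slope(pairs):
    """Least-squares slope of log10(frequency) against log10(rank).

    Needs at least two pairs, otherwise there is no line to fit.
    """
    xs = [math.log10(rank) for rank, _ in pairs]
    ys = [math.log10(freq) for _, freq in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Toy data built to follow Zipf's Law exactly: frequency = 60 / rank.
toy_words = ["a"] * 60 + ["b"] * 30 + ["c"] * 20 + ["d"] * 15 + ["e"] * 12
pairs = rank_frequency(toy_words)
slope = loglog_slope(pairs)
print(pairs)
print(slope)
```

On real text the slope will only be roughly -1, and the fit is usually worst at the two ends of the ranking.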

[Interactive graph: frequency (y-axis) plotted against rank (x-axis). Controls: presets, plus logarithmic, axis scaling and global scaling settings for Rank (x-axis), Frequency (y-axis) and Words (circle size).]

Notes

  • As you can see on the main Concordance page, there are a total of 162244 words in this edition of the CSM (ignoring the 9 repeated cantigas) made up of 10224 distinct word forms.
  • There are 265 points on the graph, one for each distinct frequency (and hence each distinct rank) found among the word forms.
  • The diameter of the circle drawn at each point indicates the number of distinct word forms that have the associated frequency and rank. In the initial plot this is on a logarithmic scale like everything else, but I've reduced the diameters by a further factor of 10 so as not to swamp the graph. (You can change 0.1 to 1 and see what happens...)
  • The first few points, which each represent just one word with a unique frequency, are labelled with the word itself, for orientation. The top word e with rank 1 occurs 9611 times.
  • The last and biggest circle represents the set of 4724 word forms that occur only once, and which therefore share the same statistical rank of 5501.
  • I'm using base-10 logarithms here simply because they look better on the graph than natural logarithms: it makes the red scale lines coincide with the grey grid at integral scaling factors.
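The grouping described in these notes can be sketched in Python as below. This is an assumption about how the points are derived, not the page's actual code: word forms with the same frequency are merged into one point, and that point's rank is taken to be one more than the number of word forms with a higher frequency. That reading is at least consistent with the figures above, since 10224 - 4724 = 5500 forms occur more than once, which puts the once-only forms at rank 5501. The function name graph_points and the toy counts are invented for the example.

```python
from collections import Counter


def graph_points(word_counts):
    """Group word forms by frequency.

    Returns (rank, frequency, n_forms) triples, highest frequency first,
    where rank is 1 plus the number of word forms with a higher frequency
    and n_forms is the number of word forms sharing that frequency.
    """
    forms_per_frequency = Counter(word_counts.values())
    points = []
    forms_seen = 0
    for frequency in sorted(forms_per_frequency, reverse=True):
        n_forms = forms_per_frequency[frequency]
        points.append((forms_seen + 1, frequency, n_forms))
        forms_seen += n_forms
    return points


# Invented counts, purely for illustration (not taken from the concordance).
toy_counts = {"e": 7, "a": 4, "que": 4, "rey": 1, "santa": 1, "maria": 1}
points = graph_points(toy_counts)
print(points)
```

Here the two forms that occur four times share rank 2, and the three forms that occur once share rank 4, so six word forms yield only three points.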
