Analyzing Big Data with Benford’s Law: A Lesson for the Classroom

Most literature related to Benford’s Law discusses what the law can be used for and how it works but fails to address effective methods and procedures for teaching the law to students. This article examines existing information resources to determine the most effective methods and procedures for explaining the Law to those who have no experience with it. A contribution to knowledge is made by providing step-by-step instructional approaches for teaching Benford’s Law to students that are tied to existing literature. Benford’s Law is a fascinating lesson for students who have been exposed to statistical and mathematical concepts for as long as they can remember yet know nothing of the law’s existence. This lesson is suitable for any introductory statistics or mathematics course where students are learning about probability. The Law has a practical application in the field of business and can also be taught as part of a fraud examination, data analytics, or auditing course.


INTRODUCTION
Benford's Law states that the first, or leading, digits in a dataset of naturally occurring random numbers are not uniformly distributed. Instead, their frequencies follow a well-defined trend and can be easily computed (Benford, 1938). The creation of appropriate teaching materials for students is required for a complete conceptual and theoretical understanding of Benford's Law. Benford's Law is something students must experience and is not effectively learned through lecture alone. It is a fascinating lesson for students who have been exposed to mathematical concepts for as long as they can remember yet know nothing of the law's existence. It is called a mathematical phenomenon by some scholars (Chenavier, Massé, & Schneider, 2018; Phatarfod, 2013); however, it can certainly be explained well enough to make it useful. A good explanation requires the use of databases and procedures that are easy to understand and suitable for the classroom. Frank Benford discovered this mathematical pattern within our ten-digit numerical system in the 1930s after noticing that the earlier pages within a book of logarithms were more worn than the later pages (Mir & Ausloos, 2018). Why were people needing to look up lower digits more frequently than higher digits? The same observation had first been made by Simon Newcomb in 1881 (Mir & Ausloos, 2018).
Benford's Law is often taught in statistics or mathematics courses around discussions of probability. The Law has a practical application in the field of business and can also be taught as part of a fraud examination, data analytics, or auditing course. Bradley and Farnsworth argue, "Benford's Law is a useful vehicle for teaching statistics. Its surprising existence, coupled with numerous examples, can be used to entice students to the statistical material" (2009). It has even been argued that a student's knowledge of Benford's Law could help them choose correct answers in multiple-choice exams (Hoppe, 2016). Answers to end-of-chapter questions in textbooks were also discovered to conform to Benford's Law. Educators have been surprised to see this result. Slepkov, Ironside, and DiBattista point out that "Most people tend to believe that the leading digits of random values, and the numbers in arithmetic problems, would be uniformly distributed" (2015).
The purpose of this study is to identify, by examining existing information resources (secondary research), the most effective methods and procedures for explaining Benford's Law to those who have no experience with it. Step-by-step instructional approaches for teaching Benford's Law to students that are tied to existing literature are provided along with a simple explanation of how the law works. The proposed lesson involves activity-based learning, which has been shown to be effective in teaching probability (Wroughton & Nolan, 2012). It should be noted, however, that this lesson is meant for students in introductory statistics or mathematics courses and is not tied to any rigorous statistical exercise or theorem. This study should benefit educators and provide insight into the problem of the lack of published material regarding how to teach Benford's Law to students. These insights may assist future researchers in developing ideas or hypotheses for additional study into this area.

LITERATURE
There is no shortage of literature attempting to explain Benford's Law. Most of this literature, however, is written in a way that makes it difficult for the average reader to comprehend. Phatarfod argued, "The fact that most of the work on this topic concentrates on pure mathematics rather than statistics has made it somewhat inaccessible to most readers" (2013). Kruger and Yadavalli described Benford's Law as "counter-intuitive, difficult to explain in simple terms, and has suffered from being described variously as 'a numerical aberration', 'an oddity', 'a mystery' - but also as 'a mathematical gem'" (2017). Ross (2011) further stated, "The most successful attempt at understanding the phenomenon seems to be Theodore Hill's (Hill (1995a)) very nice and sophisticated analysis. I found this analysis challenging, and I am reasonably acquainted with probability." In addition to attempting his own explanation of Benford's Law, Theodore Hill listed classical explanations (1995b) including "the usual number-theoretic (or Cesaro) method for assigning densities to the sets in question; continuous analogs of the Cesaro method based on integration techniques; various probabilistic urn-schemes; demonstrations based on assumptions of continuity and scale invariance; and statistical descripting arguments." These explanations can be difficult to follow for the novice just wanting a conceptual understanding of Benford's Law.
Even though various aspects of Benford's Law remain an enigma, there are valuable explanations found within the literature one can use to put together an effective lesson for students. Bradley and Farnsworth (2009) provide an easy-to-follow description of the Law and recommend using a database published by the Bureau of Labor Statistics, which students can obtain on the Internet, to test the Law. The data within this database is well organized, and it is easy to understand what is being tested. Coman, Horga, Danila, and Coman (2018) provide examples of how Microsoft Excel can be used in testing a dataset for conformity to Benford's Law. Phatarfod (2013) provides an excellent statistical analysis of various aspects of the Law, taking care to write the article using language that is easy to understand. Stoessiger (2013) discusses how growth affects data and how this may be a link to Benford's Law. The descriptions and explanations provided within the literature are further discussed in the following sections, which attempt to present an engaging lesson for students on Benford's Law.

BENFORD'S LAW LESSON
When teaching complex material that students often find difficult to conceptualize and learn, using real-world examples and visualizations can be more effective than simply explaining the concepts. An instructor beginning a lesson on Benford's Law might ask his or her students to visualize themselves standing in front of a forest of 10,000 trees. Some of the trees are the tallest on the planet, while others are barely peeking through the dirt. The student has been asked to go out and measure, in millimeters, the height of every tree in the forest. What the student will have at the end of the task is a database with 10,000 naturally occurring random numbers, each number representing the height in millimeters of one tree. Some of these numbers will be very small, possibly one or two digits (for example, 3 millimeters, 27 millimeters, or 11 millimeters). Other numbers in the database will be very large with many digits (for example, 954,135 millimeters or 745,681 millimeters). A second database should be created from this first set of numbers that includes just the first digit in each number. The new database would still contain 10,000 numbers, but each number would be just one digit in length. Given the examples above of 3, 27, 11, 954,135, and 745,681, the new database would contain the numbers 3, 2, 1, 9, and 7.

Copyright by author(s); CC-BY
The Clute Institute

Every number in this new database will be a 1, 2, 3, 4, 5, 6, 7, 8, or 9. The question is: how often will a 2 be present versus a 9, or a 4 versus a 7? If the 10,000 numbers were divided by nine, would there be roughly 1,111 of each number 1 through 9? At first glance, it seems to make sense that this random dataset would have roughly the same number of 8s as 3s, with each number having an equal probability of occurring. This probability would be equal to one in nine, or 11.1%. It might also make sense that the probability of each number occurring would be unpredictable and random with every database. Benford's Law says no. Each number does not have an equal probability of occurring, and one can predict what the probability of each number occurring is. A 1 will be present, for example, 30.1% of the time and a 9 only 4.6% of the time. In the database of 10,000 numbers, roughly 3,010 should be a 1 and only about 460 should be a 9. Table 1 below details the percentage probabilities of each number occurring as the first digit according to Benford's Law (Bradley & Farnsworth, 2009; Kruger & Yadavalli, 2017; Stoessiger, 2013).
Data produced by chance, like lottery numbers, would not conform given that these figures are not naturally occurring random numbers, and each would have an equal probability of occurring. Databases containing telephone numbers or zip codes would not conform given that these numbers also are not naturally occurring and random. Students should also realize that Benford's Law does not work in a dataset of natural numbers (Phatarfod, 2013). If a dataset containing the natural numbers 1 through 99 were analyzed, for example, each digit 1 through 9 would occur as the leading digit in the database 11 times. If a dataset containing the natural numbers 1 through 99,999 were analyzed, each digit 1 through 9 would occur as the leading digit in the database 11,111 times.
If a dataset contained every natural number known, the frequencies of the occurrence of each digit, 1 through 9 as the leading digit, would be approximately 11.1%. This realization may help students appreciate just how phenomenal Benford's Law is.
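The contrast described above can be checked with a short script. This is only a sketch: it assumes the standard closed form for the Benford first-digit probability, log10(1 + 1/d), and the helper names (benford_probability, leading_digit_counts) are hypothetical and not taken from the cited literature.

```python
import math
from collections import Counter

def benford_probability(digit):
    """Benford first-digit probability for a digit 1..9 (assumed closed form)."""
    return math.log10(1 + 1 / digit)

def leading_digit_counts(numbers):
    """Count how often each digit 1..9 leads in a collection of positive integers."""
    return Counter(int(str(n)[0]) for n in numbers)

# Expected number of trees per leading digit in the 10,000-tree forest.
expected_in_forest = {d: round(10_000 * benford_probability(d)) for d in range(1, 10)}
print("Expected counts under Benford's Law:", expected_in_forest)

# Natural numbers 1 through 99: every leading digit appears equally often.
print("Leading digits of 1..99:", dict(sorted(leading_digit_counts(range(1, 100)).items())))
```

Running the same count over range(1, 100_000) gives the second uniform result mentioned in the text, which makes the difference from the Benford expectation easy to show side by side.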

Experience Benford's Law
To truly appreciate and understand Benford's Law, students must experience it. To do that, a database is needed that is easy to obtain and easy to understand. There are multiple sources of big data one can use to test Benford's Law. One such source, referenced by Bradley and Farnsworth (2009), is the SOEWE database published by the Bureau of Labor Statistics. Once the database has been cleaned, an analysis of the data can be performed to see if it conforms to Benford's Law.
Only the first digit of each total employed figure needs to be reviewed. The first step in a Benford's Law analysis, therefore, is to create another column to the right of the total employed column showing just the first digit of each total employed figure. Name this column First Digit. Then, in the second row of this column, enter the formula =LEFT(C2), where C2 represents the first numerical cell in the TOT_EMP column, or the cell containing the number 1,922,570 in the 2017 issue of SOEWE (Table 2). This formula needs to be carried down to the bottom of the spreadsheet. To do this, place the cursor on the lower right corner of the cell containing the formula. Double click the + sign when it appears. The formula will automatically be carried to the final row in the data set. The next step is to count the number of times each number 1 through 9 appears in the First Digit column. To do this, the COUNTIF function in Excel is used. The COUNTIF function asks Excel to count the cells containing a 1, for example, then those containing a 2, and so on. The sum of all the counts will also have to be determined. The COUNTIF and SUM formulas are shown in Table 3.
Once the total number of occurrences of each digit 1 through 9 is determined, the next step is to calculate the percentage frequency of each digit within the entire database. This is a very easy mathematical exercise: divide the total number of occurrences for each digit by the total number of digits in the database. For example, as shown in Table 5, the digit 1 occurred 30.02% of the time, or 10,657 divided by 35,505. The digit 9 occurred 4.57% of the time, or 1,623 divided by 35,505. Now that the percentage of the database for each digit 1 through 9 is calculated, these figures need to be compared to the Benford's Law first-digit probability figures. A simple Internet search of Benford's Law will provide multiple sources for Benford's Law figures (shown in Table 1). These are shown in Table 6 along with the calculations from the SOEWE database.
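For instructors who prefer a script to a spreadsheet, the same steps (extract the first digit, count each digit, convert to percentages, and compare with Benford's Law) might be sketched in Python as follows. The function names are hypothetical, the sample list is illustrative only (it reuses numbers quoted earlier in this article rather than the actual SOEWE file), and the expected percentages assume the log10(1 + 1/d) form of the Law.

```python
import math
from collections import Counter

def first_digit(value):
    """Return the leading nonzero digit of a positive number (the LEFT step)."""
    for ch in str(value):
        if ch in "123456789":
            return int(ch)
    raise ValueError(f"no nonzero digit found in {value!r}")

def digit_frequencies(values):
    """Share of values led by each digit 1..9 (the COUNTIF and SUM steps)."""
    counts = Counter(first_digit(v) for v in values)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

def compare_to_benford(values):
    """Rows of (digit, observed share, expected Benford share)."""
    observed = digit_frequencies(values)
    return [(d, observed[d], math.log10(1 + 1 / d)) for d in range(1, 10)]

# Illustrative values only; a real lesson would load the cleaned TOT_EMP column.
sample = [1922570, 3, 27, 11, 954135, 745681]
for digit, obs, exp in compare_to_benford(sample):
    print(f"{digit}: observed {obs:6.2%}   Benford {exp:6.2%}")
```

With only six values the observed column will not resemble the Benford column; the comparison becomes meaningful once the full column of employment figures is supplied.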

Highlight the percentage figures in both probability columns as shown in Table 6. Once highlighted, click on "insert" and "line chart" in Excel to create a line chart of both frequency distributions. As shown in Figure 1, the two frequency distributions of the digits 1 through 9 are almost identical. Benford's Law should work with any database containing naturally occurring random numbers and some databases containing nonrandom numbers (Phatarfod, 2013). Students should be permitted during class to find a database on their own to analyze (Bradley & Farnsworth, 2009). There are multiple sources of free databases available online (data.gov, bls.gov, fdic.gov). Should you or a student find a database that you believe is naturally occurring and random but that does not conform to Benford's Law, it is a fun and interesting exercise to try to figure out why it does not conform. Typically, those databases are not as random as one first thought.
It's also a fun in-class exercise to test the sensitivity of Benford's Law. The SOEWE database contained 35,505 valid numbers to test. Will the Benford's Law curve be evident when only 1,000 or 500 numbers are in the database? The answer is yes. As a rule, however, the more numbers you have in a database, the more accurate the Benford's Law curve will be.
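The sample-size exercise can also be simulated. The sketch below does not use the SOEWE data; as a stand-in it assumes synthetic values spread evenly on a logarithmic scale across six orders of magnitude, which is one kind of data known to follow Benford's Law closely. The function names and the fixed random seed are illustrative choices.

```python
import math
import random
from collections import Counter

BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def synthetic_sample(n, rng):
    """n values between 1 and 1,000,000, evenly spread on a log scale (assumption)."""
    return [10 ** rng.uniform(0, 6) for _ in range(n)]

def max_deviation(values):
    """Largest gap between an observed first-digit share and the Benford share."""
    counts = Counter(int(str(int(v))[0]) for v in values)
    total = len(values)
    return max(abs(counts.get(d, 0) / total - BENFORD[d]) for d in range(1, 10))

rng = random.Random(2024)  # fixed seed so a class sees the same numbers
for size in (500, 1_000, 35_505):
    gap = max_deviation(synthetic_sample(size, rng))
    print(f"{size:>6} values: largest deviation from Benford = {gap:.3%}")
```

Any single small sample can happen to land close to the curve, so students may want to repeat the draw several times before concluding how much the deviation shrinks as the database grows.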

Practical Applications of Benford's Law
Mark Nigrini was the first to introduce Benford's Law as a tool for auditors and forensic accountants in the investigation of fraud (1992, 1999). Benford's Law works with financial data given that this type of data is naturally occurring and, for the most part, random. There are times, however, when financial data is not random and would not conform to Benford's Law. Assume an analysis is made of an accounts payable database with 20,000 entries for a company that has a policy that every payment over $500 must be approved by a supervisor prior to payment. Testing the frequency distribution of the first digit in every payable against Benford's Law might show an unusually large number of payables beginning with the number 4 if fraud is present. The fraudster, who more than likely works in Accounts Payable, could be sending payments to fictitious vendors slightly under the $500 level to avoid supervisor review. Figure 2 shows what the frequency distribution for the Company's Accounts Payable might look like in this case compared to Benford's Law.
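The accounts payable scenario can be mocked up for students with made-up data. In the sketch below every figure is hypothetical: the "legitimate" payments are generated from a steady 1% growth series so that they roughly follow Benford's Law, 60 fictitious payments between $480 and $499 are added, and the 5-percentage-point tolerance is an arbitrary classroom threshold, not a formal statistical test.

```python
import math
from collections import Counter

BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def first_digit(amount):
    """Leading nonzero digit of a positive payment amount."""
    for ch in str(amount):
        if ch in "123456789":
            return int(ch)
    raise ValueError(f"no nonzero digit found in {amount!r}")

def flag_excess_digits(amounts, tolerance=0.05):
    """Digits whose observed share exceeds the Benford share by more than tolerance."""
    counts = Counter(first_digit(a) for a in amounts)
    total = len(amounts)
    return [d for d in range(1, 10) if counts.get(d, 0) / total - BENFORD[d] > tolerance]

# Hypothetical data: Benford-like legitimate payments plus payments just under $500.
legitimate = [round(100 * 1.01 ** k, 2) for k in range(232)]
fictitious = [480.0 + (i % 20) for i in range(60)]

print("Flagged without fraud:", flag_excess_digits(legitimate))
print("Flagged with fraud:   ", flag_excess_digits(legitimate + fictitious))
```

A flagged digit is only a prompt for further review; an auditor would still need to examine the underlying payments before drawing any conclusion.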

How Benford's Law Works
As discussed above, some still call Benford's Law a mathematical phenomenon (Chenavier, Massé, & Schneider, 2018;Phatarfod, 2013). The frequency distributions of the first digit 1 through 9 as defined by Benford's Law and detailed in Table 1 are simply one of many numerical patterns found within our ten-digit number system (Goudsmit & Furry, 1944;Weaver, 1963). When numbers grow, it takes longer to move from a 1 to a 2 than it does to move from an 8 to a 9.
Students should perform this simple exercise of exponential growth to see how Benford's Law works (Stoessiger, 2013). Open a new sheet in Excel. In the first cell of the sheet (cell A1), enter $1.00. Grow that dollar by 1% per period (per row) until row 233. The formula in cell A2 should be =A1*1.01. This formula is then carried down the spreadsheet to row 233. Notice that by row 233, the $1.00 has grown to $10.06. It took 70 rows to move from $1.00 to $2.00. It took 41 rows to move from $2.00 to $3.00. It took only 12 rows to move from $8.00 to $9.00. Each student should next test the frequency distribution of the first digits 1 through 9 against Benford's Law using the same steps detailed above when the SOEWE database was analyzed. The result should show that the frequency distribution of the 232 rows of data (row 233, where the value rolls over to $10.00, is not included) mirrors Benford's Law.
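The growth exercise translates directly into a few lines of Python, which may be useful for checking a student's spreadsheet. This is a minimal sketch of the same steps; the function names are hypothetical, and the Benford column again assumes the log10(1 + 1/d) form of the Law.

```python
import math
from collections import Counter

def growth_series(start=1.00, rate=0.01, rows=232):
    """Values in rows 1..rows when each row is the previous row times (1 + rate)."""
    values = [start]
    for _ in range(rows - 1):
        values.append(values[-1] * (1 + rate))
    return values

def first_digit_counts(values):
    """Count leading digits; every value here lies between 1 and 10."""
    return Counter(int(str(v)[0]) for v in values)

series = growth_series()            # rows 1 through 232
counts = first_digit_counts(series)
for d in range(1, 10):
    share = counts[d] / len(series)
    print(f"{d}: {counts[d]:3d} rows  {share:6.2%}   Benford {math.log10(1 + 1 / d):6.2%}")
```

The counts for the digits 1, 2, and 8 correspond to the 70, 41, and 12 rows described in the exercise.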
The Benford's Law numerical pattern is also present when using a base 10 logarithmic scale (Phatarfod, 2013). Instead of growing numbers, logarithmic scales can be used to scale numbers down to make them more meaningful. The logarithm of a number is the exponent to which the number 10 must be raised to get that number. The log of 100, for example, is 2 because 10^2, or 10*10, equals 100. The log of 10 is 1 because 10^1 equals 10. The log of a number less than 10 is a little more complicated to determine given that you are asking what power of 10 would produce the number 9, for example. The number 9 is close to 10, and given that the log of 10 is 1, it makes sense that the log of 9 would be close to 1, or .954. If you subtract .954 from 1, you get .046, the Benford's Law percent distribution for the number 9 (Benford, 1938). Table 7 details how the Benford's Law percentage distributions for each digit 1 through 9 are calculated from logarithms. Understanding the logarithmic scale in relation to the percentage distributions found within Benford's Law is an important exercise given that additional insight can be gained into this pattern of numbers within our ten-digit number system. Review the comparison of a logarithmic scale to a normal arithmetic scale in Figure 3 below. The numerical change within an arithmetic scale is equal. In a logarithmic scale, the change from number to number is not equal; however, equal distances represent equal percentage changes. Review Figure 4 below, which provides more detail within a logarithmic scale. Notice that the spaces between 1 and 2, 2 and 4, and 4 and 8 are all equal. Each space measures a 100% increase given that logarithmic units measure equal percentage change. Each space totals 30.1% of the total space between 1 and 10.
This means that within the SOEWE database discussed above, if the total number of 2s and 3s as the first digit were added together, there would be roughly the same total number of 1s in the database (Stoessiger, 2013). If the total number of 4s, 5s, 6s, and 7s were added together, they too would be roughly equal to the number of 1s in the database.
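The logarithmic relationships described above can be written out as a short worked derivation. The notation P(d) for the first-digit probability is introduced here only for illustration; the general formula is the standard statement of Benford's Law and reproduces the 1 minus .954 calculation for the digit 9.

```latex
% First-digit probability as the width of the interval [d, d+1) on a base-10 log scale
P(d) = \log_{10}(d+1) - \log_{10}(d) = \log_{10}\!\left(1 + \frac{1}{d}\right),
\qquad d = 1, 2, \ldots, 9

% Digit 9: the example from the text
P(9) = \log_{10}(10) - \log_{10}(9) \approx 1 - 0.954 = 0.046

% The spaces 1 to 2, 2 to 4, and 4 to 8 are equal because each is a doubling
P(1) = \log_{10}(2) \approx 0.301
P(2) + P(3) = \log_{10}(4) - \log_{10}(2) = \log_{10}(2) \approx 0.301
P(4) + P(5) + P(6) + P(7) = \log_{10}(8) - \log_{10}(4) = \log_{10}(2) \approx 0.301
```

Because the nine intervals tile the whole distance from 1 to 10 on the logarithmic scale, the nine probabilities sum to log10(10) = 1.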


CONCLUSION
There are several statistical concepts and laws that students find fascinating to learn. The instructor can see a difference in student engagement and overall interest when teaching these concepts. Benford's Law is one such lesson. This Law has been taught using the above directions multiple times by this author in fraud examination and data analytics courses. Students are always excited and have even gasped in class when they graph the frequency of the first digits in their dataset and see that they do in fact conform to Benford's Law. Many students are shocked that they didn't know of the Law's existence and will ask what other mathematical or statistical phenomena like this are out there. This leads to discussions on things like the Fibonacci Sequence, the Collatz Conjecture, and an endless number of others. In that moment, I know I have them, and there really is nothing better.

AUTHOR BIOGRAPHY
Susan Lanham has been working as a forensic accountant for the past twenty-four years. She joined Marshall University in 2014 as an assistant professor of accounting in the Brad D. Smith Schools of Business. Her teaching interests include forensic and financial accounting, and data analytics. E-mail: lanham53@marshall.edu