AI, NLG, AND MACHINE LEARNING
The Law of Large Numbers in AI and Big Data
The key concepts and framework behind data science and artificial intelligence (AI) have existed for many decades. The digital revolution has made torrents of data available, enabling emerging technologies such as AI to flourish.
By Dr. Deepak Kallepalli
January 16, 2020
Introduction
“AI is the new electricity,” stated Prof. Andrew Ng of Stanford University.
Electricity revolutionized the world, from home appliances to entire industries. In a similar manner, AI is revolutionizing almost every field in the current century.
The key concepts and framework behind AI have existed for decades, but AI has taken a leap only in the past few years thanks to the digital technology revolution. Through the widespread use of smartphones and social media platforms, such as Facebook and Twitter, torrents of data are available for analytics, powering smart products like Amazon Alexa, smart technologies like AI and bots, and services like Uber and Lyft.
The following sections demonstrate how big data plays an important role both in the accuracy of results and in enhancing the performance of the algorithms used.
Accuracy and big data
When an experiment is performed, an accurate and precise result is desired. Accuracy refers to how closely the result of a measurement conforms to the correct or expected value.
Precision refers to the degree to which repeated measurements agree, and thus reflects the repeatability of an outcome. The underlying principle for achieving both comes from the Law of Large Numbers (LLN), one of the basic principles of experimental physics and statistics. The law states that the average of the results of an experiment, repeated many times, converges to the true or expected value.
To illustrate with an example, consider the experiment of throwing a six-sided die. The measurement of interest is the average of the outcomes.
The sample space, the set of all possible outcomes, is S = {1, 2, 3, 4, 5, 6}. The expected value in this example is (1 + 2 + 3 + 4 + 5 + 6) / 6 = 3.5. The law states that when the die is thrown repeatedly, the running sample average converges to 3.5 as the number of throws grows large.
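In symbols (a standard statement of the law; the notation is mine, not the article's), for n independent throws X_1, ..., X_n:

    E[X] = \sum_{k=1}^{6} k \cdot \frac{1}{6} = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5

    \bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \longrightarrow E[X] = 3.5 \quad \text{as } n \to \infty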
Figure 1 shows a simulation of this experiment. (For more details regarding the Python code, please refer to my GitHub page.) In the simulation, the maximum number of trials was set to 100,000. The simulation clearly shows that once the number of trials exceeds 40,000, the sample average stays close to 3.5.
This example provides an understanding of how the size of the data affects accuracy and precision. Real AI applications (such as image recognition using convolutional neural networks [CNNs]) involve many more possible outcomes, so they require big data, with a size far beyond the 100,000 trials of the dice example, for the algorithms to converge to the true values.
Figure 1: Python simulation demonstrating the Law of Large Numbers (convergence of the sample mean to its expected value)
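A minimal sketch of such a simulation in Python, using NumPy (the code on the GitHub page mentioned above may differ in detail):

    import numpy as np

    rng = np.random.default_rng(seed=0)

    n_trials = 100_000                           # maximum number of throws, as in figure 1
    throws = rng.integers(1, 7, size=n_trials)   # fair six-sided die: integers 1..6

    # Running sample average after each throw
    running_average = np.cumsum(throws) / np.arange(1, n_trials + 1)

    print(running_average[99])    # average after 100 throws: still noisy
    print(running_average[-1])    # average after 100,000 throws: close to 3.5

Plotting running_average against the trial number reproduces the convergence behavior shown in figure 1.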
In AI applications, neural networks (deep learning) are used. These neural networks are first trained by working through a training set of the data. As figure 1 suggests, to obtain an accurate result, the training set should represent the entire sample space, and for this reason algorithms are trained over large data sets.
Through the increased usage of smartphones and other devices, we have gained access to large data sets that were not available in the past. Such data sets better represent the true distribution of the problem that we are trying to solve.
Once a suitable training data set is in hand, the next task is to train algorithms over it to estimate the hypothesis function (h) and its features (fitting parameters), as shown in the schematic (figure 2).
Once the hypothesis function is obtained from multiple iterations over the training data set, it can be verified with test data sets (fed as input to the hypothesis function, as shown in the schematic). When the hypothesis function does not give the right output, this means that either (i) the training set is not large enough to represent the problem in question, or (ii) the hypothesis function has too few features.
The latter case requires increasing the number of features or the size of the neural network. Both cases again emphasize the importance of the size of the data.
Figure 2: A schematic showing how large data sets are used in the training set and the role of large numbers behind the scenes.
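To make the train-and-verify loop concrete, here is a minimal Python sketch that estimates a one-feature hypothesis h(x) = theta0 + theta1 * x by gradient descent over a training set and then verifies it on held-out test data (the synthetic data and parameter names are illustrative, not taken from the article):

    import numpy as np

    rng = np.random.default_rng(seed=1)

    # Synthetic data: y = 2x + 1 plus noise, split into training and test sets
    x = rng.uniform(0, 10, size=1000)
    y = 2.0 * x + 1.0 + rng.normal(0.0, 1.0, size=1000)
    x_train, y_train = x[:800], y[:800]
    x_test, y_test = x[800:], y[800:]

    # Hypothesis h(x) = theta[0] + theta[1] * x; theta holds the fitting parameters
    theta = np.zeros(2)
    lr = 0.02

    # Multiple iterations (batch gradient descent) over the training set
    for _ in range(5000):
        error = theta[0] + theta[1] * x_train - y_train
        theta[0] -= lr * error.mean()
        theta[1] -= lr * (error * x_train).mean()

    # Verify the learned hypothesis on the test set
    h_test = theta[0] + theta[1] * x_test
    print("theta:", theta)                               # close to [1.0, 2.0]
    print("test MSE:", np.mean((h_test - y_test) ** 2))  # close to the noise variance, 1.0

If the test error stays high, the remedies are exactly the two discussed above: a larger or more representative training set, or a hypothesis with more features (a larger model).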
Performance of algorithms
In addition to accuracy, the performance of algorithms depends on the amount of data. As the data size increases, we force the algorithms to fit the data properly, thus minimizing the errors. (The more data there is, the less prone the result is to statistical error.) Figure 3 shows a schematic graph illustrating the performance of algorithms as a function of the amount of data.
Four curves are shown, representing data trained with (i) a traditional machine learning algorithm (black), (ii) a small neural network (NN) (blue), (iii) a medium NN (green), and (iv) a large NN (red). All curves show the performance of the algorithms increasing with the amount of data until it reaches a plateau.
Where the traditional machine learning algorithm (black curve) plateaus, the neural networks (blue, green, and red curves) continue to improve with more data; the larger the network, the greater the gain. In the schematic, both the x and y axes are given in arbitrary units because the values depend on the complexity of the problem under study.
Figure 3: Performance of algorithms (y axis) versus the amount of data (x axis). Performance improves with the amount of data and with the size of the neural network, shown in different colors.
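A rough way to reproduce curves of this shape, assuming scikit-learn is available (the data set, model sizes, and training-set sizes below are illustrative choices, not the article's):

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.neural_network import MLPClassifier

    # Synthetic classification problem; the first 5,000 samples form a fixed test set
    X, y = make_classification(n_samples=30_000, n_features=20,
                               n_informative=15, random_state=0)
    X_test, y_test = X[:5_000], y[:5_000]

    models = {
        "traditional ML": LogisticRegression(max_iter=1000),
        "small NN": MLPClassifier(hidden_layer_sizes=(8,), max_iter=500),
        "medium NN": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500),
        "large NN": MLPClassifier(hidden_layer_sizes=(256, 256), max_iter=500),
    }

    # Train each model on growing amounts of data and record test accuracy
    for name, model in models.items():
        for n in (500, 2_000, 8_000, 25_000):
            model.fit(X[5_000:5_000 + n], y[5_000:5_000 + n])
            print(f"{name:15s} n={n:6d} accuracy={model.score(X_test, y_test):.3f}")

Plotting accuracy against n for each model yields curves of the kind sketched in figure 3: every model improves with more data, and the larger networks typically plateau later and higher.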
Conclusion
Smart devices generate large data sets, which are processed by technologies like AI and bots to provide and improve the quality of services to customers in fields such as healthcare and medicine.
Precision medicine is one application area that uses big data analytics with large neural networks. Given how deeply bots augmented with neural networks are now woven into everyday services, the world would come to a standstill without AI.