A remarkable story is that of "big data." It coincides with the deterioration of the denominator, and with a shift toward workflows and algorithms we cannot inspect. Acting on such data demands a huge leap of faith, because the core assumptions of statistics often no longer hold when working with very large datasets.
With small data, that faith is not misplaced. Researchers working with small datasets can verify the results they get, provided they put extra care into the analysis. As datasets grow, data scientists come to rely on automated tools; the workflows and algorithms become more complex, and the scientist becomes detached from the data and from the tools underlying the work.
Much of this data rests on commercial toolkits and algorithms that offer zero visibility into their internals, covering everything from network construction and mining to geographic imputation and demographic estimation. Social media analysis is a good example: the sampling algorithms that govern what the platforms return are opaque. Querying the world today yields rough estimates where, in the past, findings could be verified and were far more likely to be correct. Now different tools return conflicting results. The differences come from how the index is distributed and from how many index servers respond within the query window; some tools also fold random seeds into their estimates. None of this is visible to the analysts who use these platforms.
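To see how hidden sampling and random seeds can produce conflicting counts, here is a minimal sketch. All the figures and the `estimate_topic_count` helper are hypothetical; the point is that two analysts querying the same corpus through a sample-based estimator get different "totals" purely because of the seed:

```python
import random

# Hypothetical corpus: 100,000 posts, roughly 3% mentioning a topic.
posts = ["topic" if i % 33 == 0 else "other" for i in range(100_000)]

def estimate_topic_count(posts, sample_rate=0.01, seed=None):
    """Estimate total topic mentions from a random sample,
    scaling up by the sampling rate, as sample-based APIs do."""
    rng = random.Random(seed)
    sample = [p for p in posts if rng.random() < sample_rate]
    hits = sum(1 for p in sample if p == "topic")
    return int(hits / sample_rate)

# Two analysts querying with different seeds get conflicting "counts",
# even though the underlying corpus is identical.
print(estimate_topic_count(posts, seed=1))
print(estimate_topic_count(posts, seed=2))
```

Neither analyst can tell, from the number alone, how much of the figure is signal and how much is sampling noise that the platform never discloses.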
Sentiment analysis tools rarely expose the models and code used to generate their scores, and those are exactly what an analyst needs to interpret a score. Only a few tools show histograms of which words drove the result, and the tools embed modeling constructs that shape the scores in ways the user cannot see.
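A transparent alternative is easy to sketch. The toy scorer below (the lexicon and its weights are hypothetical) returns not just a score but the per-word contributions, which is the kind of evidence most commercial tools withhold:

```python
from collections import Counter

# Toy sentiment lexicon -- hypothetical weights, for illustration only.
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "awful": -2.0}

def score_with_evidence(text):
    """Return a sentiment score plus the per-word contributions,
    so the analyst can see *why* the score came out as it did."""
    contributions = Counter()
    for word in text.lower().split():
        if word in LEXICON:
            contributions[word] += LEXICON[word]
    return sum(contributions.values()), dict(contributions)

score, evidence = score_with_evidence(
    "The service was good but the food was awful awful"
)
print(score)     # -3.0
print(evidence)  # {'good': 1.0, 'awful': -4.0}
```

With the evidence dictionary in hand, an analyst can spot when a score is dominated by a single mis-weighted word; a black-box score offers no such check.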
Scientists with no programming background are unfamiliar with how much an algorithm's implementation can affect its results. Even those with the background often lack the training in numerical methods and algorithmic implementation needed to assess how a given toolkit implements a particular algorithm. Many "big data" toolkits fail on rudimentary problems, such as the numerical underflow that occurs when multiplying large numbers of tiny numbers. And as analysis shifts from open-source to commercial software, we lose what visibility we had into how it works.
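The multiplication problem is concrete. In double precision, the product of a thousand small probabilities underflows to exactly zero, while the standard log-space workaround preserves the magnitude. A minimal illustration:

```python
import math

# 1,000 independent events, each with probability 0.01.
probs = [0.01] * 1000

# Naive product: underflows to exactly 0.0 in double precision,
# since 0.01 ** 1000 = 1e-2000 is far below the smallest float.
naive = 1.0
for p in probs:
    naive *= p
print(naive)     # 0.0

# Working in log space preserves the magnitude.
log_prob = sum(math.log(p) for p in probs)
print(log_prob)  # about -4605.2, i.e. 1000 * ln(0.01)
```

A toolkit that computes the naive product silently reports a probability of zero; one that works in log space gets the right answer. The analyst sees only the final number.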
We have drifted away from the basics of statistics, such as understanding the denominators and the algorithms we put to use. Data scientists skip these fundamentals when doing analysis, and many lack a solid statistical foundation. What does this mean in practice? It means not understanding why reporting raw counts from a changing dataset is dangerous: when the dataset changes underneath a static method and workflow, the findings can be wrong. More broadly, it means we have come to accept analyses of data we do not understand.
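The denominator problem fits in a few lines. In this hypothetical example, raw mention counts climb while the per-user rate falls, because the user base (the denominator) grows faster than the mentions:

```python
# Hypothetical daily figures from a growing platform.
mentions = [100, 150, 225]              # posts mentioning a topic
active_users = [10_000, 20_000, 45_000]  # the denominator, also growing

raw_trend = mentions
rate_trend = [m / u for m, u in zip(mentions, active_users)]

print(raw_trend)   # counts climb: [100, 150, 225]
print(rate_trend)  # the per-user rate falls: [0.01, 0.0075, 0.005]
```

A static workflow that reports only `raw_trend` would announce growing interest in the topic, when interest per user is actually declining.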
In the end, the uncomfortable realization is that many data scientists no longer care what the data tells them, and by extension, they no longer care about the results they get.