When the Future You Anticipate Never Arrives: Ensuring Your Expected Outcomes Are Accurate
Now that machines make decisions, the notion of the “right set of data” becomes much harder to pin down. Machines learn differently than we do, and the rationale for the output they produce is more difficult to reconstruct. They lack the intuition and critical reasoning that help a person elevate or discount one data point over another. Input data must therefore be as accurate as possible, representative and free from bias.
The best practices that follow center on ensuring the enterprise has data that meets these three requirements.
The Oxford English Dictionary defines accuracy as "the quality or state of being correct or precise". With respect to input data for machines, accurate data represents events as they actually occurred. In a predictive model for weather, for example, accuracy could mean correctly recording whether it rained at a particular time.
For predicting sales, accurate data might be the actual daily sales over a particular period. For an automation project, accurate data might be the correct actions taken by a person for a given process. Having accurate data is essential because a machine can learn from both accurate and inaccurate data, but only accurate data provides the desired result: a machine whose output is reliable.
For machine-based learning, having a representative sample set is crucial. A representative sample is a subgroup of a larger population that reflects the key characteristics of that population as a whole. Samples are used when working with the entire set of data is impossible or prohibitively expensive.
With process automation, for example, rather than record every action a loan officer performs for every loan processed over a year, we might take a subset of loans and their associated actions. This subset would include examples of each loan type and each set of conditions. The key is that the subset must accurately reflect the full body of work over that year. If auto loans made up a larger share of the loans processed, then proportionally more auto loan actions should be included in the subset. The machine learning system learns from what it is given. If we left auto loans out of the subset entirely, the system would not learn about auto loans and, as a result, would fail to process them adequately.
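The proportional subset described above is often built with stratified sampling. The following is a minimal sketch of that idea rather than a prescribed method: the loan records, the loan types, the counts and the helper name stratified_sample are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Draw roughly `fraction` of the records from each group.

    Every group keeps at least one record, so no loan type is left
    out of the training subset entirely.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)

    sample = []
    for members in groups.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical year of work: 600 auto, 300 mortgage and 100 personal loans.
loans = (
    [{"id": i, "type": "auto"} for i in range(600)]
    + [{"id": i, "type": "mortgage"} for i in range(600, 900)]
    + [{"id": i, "type": "personal"} for i in range(900, 1000)]
)

subset = stratified_sample(loans, key=lambda loan: loan["type"], fraction=0.1)
counts = Counter(loan["type"] for loan in subset)
print(counts)
```

Because each loan type is sampled at the same rate, the subset keeps the same mix of loan types as the full year of work. The at-least-one rule is a design choice made here so that a rare loan type is never dropped.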
Free of Bias
When it comes to machine learning, we generally define “bias” as an error in output based upon incorrect assumptions about the input data. The algorithms themselves can create bias because, consciously or unconsciously, people build in their own incorrect beliefs about how the algorithm should operate. Especially in machine learning, bias can also result from data inputs that, while accurate and representative, are not the right type of data for the problem being learned.
A classic example of this type of bias is the hypothesis that eating ice cream leads to more drownings. Suppose we use a data set of ice cream sales and drownings to predict when to staff more lifeguards at a pool. In that data, an increase in ice cream sales correlates with an increase in drownings. However, the data set is missing a key variable: daily temperature. In reality, a hot summer day increases both ice cream consumption and swimming. A machine trained on this biased data would learn that ice cream sales drive drownings, and its output would be incorrect whenever the two diverge.
Having input data free of bias requires subject matter expertise in the problem domain and plenty of analysis to ensure that the data accurately reflects the reasons for the outcomes. In many respects, having a representative data set can remove bias. However, including the wrong data can also introduce bias, for example including gender or race in the data for a loan approval process. Even if the data is representative, these attributes might reflect the inherent bias of the loan officers at a particular bank. A machine could then incorrectly treat them as relevant attributes when automating the loan approval decision.
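The effect of the missing variable can be illustrated with a partial correlation, which measures how strongly two variables move together once a third is held fixed. The sketch below is an assumption-laden illustration, not real data: the temperatures, sales figures and incident rates are synthetic, and the coefficients were picked only so that temperature drives both quantities.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys after controlling for zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Synthetic data (assumption): temperature drives both series, and the
# two series have no direct influence on each other.
rng = random.Random(42)
temperature = [rng.uniform(10, 35) for _ in range(2000)]
ice_cream_sales = [20 * t + rng.gauss(0, 50) for t in temperature]
drowning_rate = [0.3 * t + rng.gauss(0, 1) for t in temperature]

raw = pearson(ice_cream_sales, drowning_rate)
controlled = partial_correlation(ice_cream_sales, drowning_rate, temperature)
print(f"raw correlation: {raw:.2f}")
print(f"controlling for temperature: {controlled:.2f}")
```

In this synthetic setup the raw correlation between sales and drownings is strong, while the correlation after controlling for temperature is close to zero. A check like this only helps if someone with domain knowledge thinks to collect the temperature in the first place.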
Getting the Right Data from the Start
Organizations of all types try to make the best decisions, creating model processes that are intended to be as efficient and accurate as possible. As machine learning capabilities are incorporated into these efforts, the risk of adverse outcomes increases when teams lack experience with, or pay too little attention to, the input data sets used to train these new systems. Automation systems meant to reduce costs can instead increase costly manual effort. Predictive systems meant to help plan for the future can produce strategies for outcomes that never occur. Systems meant to optimize decision making can lead to the selection of the wrong path.
The answer to all of these problems lies with the input data that is used. It must represent the real situation in which the problem domain lives, it must be accurate and it must be free of hidden bias. The risk of not having your umbrella on a rainy day should be the least of your concerns.