Of all the jargon and buzzwords beloved of IT professionals – “the cloud”, “SaaS”, “web 2.0” and an infinity of others – “Big Data” is the most alluringly easy to misunderstand. Whilst big data systems do entail a large volume of data, the real benefits come from the speed (or ‘velocity’) of accumulation, and the array of different types of data (or ‘variety’) that are collected and analysed. Sources might be traditional databases, GPS logs, social media feeds, photos, video and other mixed media inputs. These first three ‘Vs’ – volume, velocity and variety – are then ideally processed by the system to give new insights, resulting in our fourth ‘V’ of Big Data systems: ‘veracity’.
Various studies have claimed to show that the volume of data in the world is increasing on an exponential curve. Whilst methods of measuring data, and what can properly be classed as data, are debatable, it is unquestionably the case that the growth rate is impressive. The rapid digitisation of our communications and production systems in the last two decades has led to an explosion in the amount of information available in digital form, and in the rate at which new digital content is being created. YouTube.com alone now sees more than 48 hours of video footage uploaded every minute of every day! On a more personal level we will all be aware from our everyday experience how many more pictures we take using our phones for upload to the Internet than was ever the case with old film cameras.
Mining new data sources: the Internet of Things
But direct human agency is not the only way that new data is created. Increasingly, a plethora of network-connected sensors creates streams of data. Our phones are constantly and automatically roaming from cell tower to cell tower, connecting from one Wi-Fi network to another, and recording our position to within a few metres with GPS. Our cars record every mile driven, and every engine contains hundreds of sensors. Goods in transit are marked with radio frequency ID tags. Our credit cards use the same technology to pay using contactless terminals. Add to this the new devices that link in to our existing personal networks, such as the Nest learning thermostat. This combination of hardware, smartphone app and web service uses GPS data from our smartphone and information about the weather to decide when, and to what temperature, to heat our homes, based on its knowledge of where we are and when we are likely to be at home.
These new network-connected devices are giving rise to an ‘Internet of Things’ or IoT, which both creates and relies upon data to provide added value and services. The idea that a device can use data gathered from its own sensors, together with data available from elsewhere (other sensors and devices, the wider Internet, public feeds etc.), and then process that combined set of data to deliver services efficiently and effectively is at the heart of this IoT revolution. But the combination of sensor data and ubiquitous network connectivity (via Wi-Fi or cellular phone networks) still requires something extra to deliver the ‘smart’ element – some form of intelligent processing.
Four bytes good, two bytes bad: all data are equal, some are just more equal than others
The issue – at least historically – is that computers cannot process all data equally usefully. Mathematical operations on numeric values are easy, but having a computer in any sense ‘understand’ information in the form of natural language commentary, or images or video, is much more difficult. Traditionally, overcoming this barrier of ‘understanding’ from a computer science perspective has been the domain of artificial intelligence, or AI.
Consider a typical natural language problem: looking for negative comments in the virtual fire hose of data that constitutes a typical social media feed. A computer programmed to look for certain words or phrases – “rubbish”; “poor service”; “very disappointed” etc. – might pick out a large number of complaint messages from a feed, but it might miss the subtlety of something like “unlikely to recommend this service to my friends”. What we really want is to elevate the computer from simple pattern matching (the word search approach) to something that evaluates the sense of the message.
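The limits of the word search approach can be seen in a minimal sketch. The phrase list and the second sample message come from the example above; the first sample message and the function name `flag_by_keywords` are hypothetical, and this illustrates naive pattern matching only, not any real monitoring product.

```python
# Naive keyword filter: a message is flagged only if it contains a listed phrase.
NEGATIVE_PHRASES = ["rubbish", "poor service", "very disappointed"]


def flag_by_keywords(message, phrases=NEGATIVE_PHRASES):
    """Return True if any listed phrase appears in the message (case-insensitive)."""
    text = message.lower()
    return any(phrase in text for phrase in phrases)


# Illustrative messages; the first is invented for this sketch.
messages = [
    "Very disappointed with the delivery.",
    "Unlikely to recommend this service to my friends.",
]

flagged = [m for m in messages if flag_by_keywords(m)]
missed = [m for m in messages if not flag_by_keywords(m)]

print("Flagged:", flagged)
print("Missed:", missed)
```

The second message is plainly a complaint to a human reader, but it contains none of the listed phrases and so is never flagged. Adding more phrases narrows the gap without closing it, which is why evaluating the sense of a message calls for a different technique.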
Similarly, teaching a computer to recognise any generic picture of a cat, or a dog, or to distinguish pictures of particular breeds of dog, requires something more than simple shape or pattern matching. Animals can be found in too many different poses for any straightforward shape matching to work well. Instead the computer needs to distinguish the real 3D shape of the animal in space, and its various joints and scope of movement. This moves the computer from the domain of image recognition to image comprehension.
Until recently, these types of natural language or image comprehension systems were the stuff of science fiction. Over the last decade or so, new AI machine learning systems based on ‘deep learning’ techniques have seen computer systems overtake human performance in key fields – from winning televised game shows to outperforming humans at the ImageNet Large Scale Visual Recognition Challenge. These systems rely on multiple layers of interconnected nodes that can learn based on examples, and improve themselves given a greater number of examples over time.
As a result, data that was previously not susceptible to processing can now be understood and processed as part of a big data system.
New insights, new connections and new services
The qualities of big data systems – the volume, velocity and variety – allow the data sets to be mined for connections and correlations not previously spotted or understood. In turn, these connections themselves become data points that add to the whole, allowing for that data to be mined for new insights.
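As a rough illustration of mining a data set for correlations, the sketch below compares every pair of columns and reports those that move closely together. Everything in it is an assumption made for illustration: the column names, the made-up readings, the helper names `pearson` and `strong_pairs`, and the 0.9 threshold. Real big data systems use far richer techniques than a pairwise Pearson coefficient.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Assumes neither sequence is constant (otherwise the denominator is zero).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strong_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for each pair of columns with |r| >= threshold."""
    found = []
    for (a, xs), (b, ys) in combinations(columns.items(), 2):
        r = pearson(xs, ys)
        if abs(r) >= threshold:
            found.append((a, b, r))
    return found


# Made-up illustrative readings, not real data.
columns = {
    "outdoor_temp": [1, 2, 3, 4, 5],
    "ice_cream_sales": [2, 4, 6, 8, 10],
    "heating_hours": [5, 3, 4, 1, 2],
}

pairs = strong_pairs(columns)
print(pairs)
```

Each pair that passes the threshold could itself be stored as a new data point, which is the feedback loop described above. A strong correlation found this way says nothing about cause, so any such finding still needs human interpretation.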
These insights can range from the most significant to the trivial – from new methods of identifying those at risk of developing life-threatening diseases, to smarter ways to recommend what you might want to watch next on Netflix. Because the machines are programmed to find and make new connections, no one knows in advance what connections or correlations they will find, nor on what basis. Once identified, the new connections and insights then potentially power new services – everything from ultra-personalised advertisements for products that are very likely to be of actual interest to the recipient, to more efficient ways to heat your home, pay your bills or shop for groceries. The systems can learn to anticipate need – allowing demand to be predicted more accurately, reducing waste and reducing the disappointment caused when products are not available.
The laws of data robotics
So as the combination of data, ubiquitously connected IoT devices and AI enables the provision of big data-driven products and services, the data itself, its sources and the processing technologies all increase in value. Both the law and the public discourse have recognised this shift in the value of data over time. In the EU, over the next two years or so the General Data Protection Regulation will impose a new, more stringent protection regime for personal data, backed by some of the most severe sanctions in the world for companies that fail to adhere to its requirements. In other parts of the world, everywhere from South-East Asia to the United States, new data breach and data security requirements are being enacted.
At the same time, as both consumers and companies become increasingly reliant upon data-driven services underpinned by AI data processing engines, the security and reliability of those systems will matter as never before. Whilst in the more laissez-faire parts of the world, providers may get away with extreme disclaimers of liability in their standard terms, this is unlikely to be the case in places with a stronger tradition of consumer protection. Indeed in the UK at least, in some cases, the quality of output from a digital service might already be the subject of legal protection under the Consumer Rights Act 2015.
For business-to-business contracts, the position is likely to be more complex. Just as we have seen cloud operators gradually move away from their own very supplier-friendly standard terms in order to attract business from major customers in regulated sectors (especially the banks, insurance companies, pharmaceutical companies etc.), we would expect the same to be true in ‘Big Data as a Service’ and ‘AI as a Service’ contracts. The spending power of these customers, and the realities of the regulatory environments within which they operate, make this kind of trend almost inevitable.
And as we have commented in other briefing notes, particularly where the machines themselves have been responsible for developing the correlations that underpin particular data processing, the operators need to be wary of the basis of that processing. If the machine is making decisions on a basis that would be considered discriminatory at law, even if it has come to those conclusions itself based on its own analysis of the input data, that would mean that the operator of the system is guilty of that discrimination. With so many data points available, system operators will need to check very carefully that their systems are not producing different outputs based on protected characteristics, such as sex, sexual orientation, ethnic origin or age.
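One simple first check of this kind is to compare how often a system produces a favourable output for each group. The sketch below is illustrative only: the record layout, the helper names `positive_rate_by_group` and `rate_ratio`, the sample decisions and the 0.8 review threshold are all assumptions, and the threshold in particular is an arbitrary cut-off rather than any legal test.

```python
from collections import defaultdict


def positive_rate_by_group(records, group_key, outcome_key):
    """Share of records with a truthy outcome, for each value of group_key."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for record in records:
        group = record[group_key]
        totals[group] += 1
        if record[outcome_key]:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}


def rate_ratio(rates):
    """Lowest group rate divided by the highest; 1.0 means identical rates."""
    highest = max(rates.values())
    return min(rates.values()) / highest if highest else 1.0


# Made-up decisions from a hypothetical system, for illustration only.
decisions = [
    {"sex": "F", "approved": True},
    {"sex": "F", "approved": False},
    {"sex": "F", "approved": False},
    {"sex": "F", "approved": False},
    {"sex": "M", "approved": True},
    {"sex": "M", "approved": True},
    {"sex": "M", "approved": True},
    {"sex": "M", "approved": False},
]

rates = positive_rate_by_group(decisions, "sex", "approved")
ratio = rate_ratio(rates)

REVIEW_THRESHOLD = 0.8  # arbitrary illustrative cut-off, not a legal test
needs_review = ratio < REVIEW_THRESHOLD

print(rates, round(ratio, 2), needs_review)
```

A gap in rates does not by itself establish discrimination, and equal rates do not rule it out, since a system may rely on other data points that stand in for a protected characteristic. A check like this is best treated as a prompt for closer review of the basis of the processing.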