Data Responsibility in Machine Learning

Recently I’ve been mapping out some trends around Machine Learning (ML) for a corporation. The main intention was to find the right platform, framework, toolset, etc.

It took endless interviews and clarifications of what Machine Learning is NOT, and that Artificial Intelligence is far more than just a few ML algorithms or Watson.

Some enablers were identified, such as custom hardware (ASICs) or ARM clusters with their own GPUs, as well as future enablers like neuromorphic computing, which simulates the morphology of individual neurons.

Or quantum computing. The story about QC and its similarity to the human brain will be told in another post 😉

Finally, the core finding was that for most purposes the specific platform, cloud service or framework does not matter. Usually it comes down to the personal preference of the ML experts.

With the right algorithms, the results can be achieved more or less elegantly with any “tool”, because it is (in the majority of cases!) about fancy statistics.

No magic. Here comes the “but”, and it is a big BUT.

The data is the most important and therefore most valuable part. Departments in big companies are siloed and hierarchical, which creates issues when it comes to sharing their data. Rarely is there a (real) Data Lake.

Instead of Big Data, we should call it Small Data.

So they start to collect everything, without a clue what exactly is needed: what the system has to “learn” to make data-driven predictions or decisions and to uncover “hidden insights”.

When using ML there are additional (organizational) findings that are important to keep in scope. Let’s call it data hygiene.

Quality of Data

Data makes machine learning work, so it becomes necessary to get the data right: not just the big data, but all the data that drives a business.

In most applications we use today, data is retrieved by the application’s source code and then used to make decisions. The application is ultimately affected by the data, but the source code determines how the application performs, how it does its work and how the data is used.

Today, in a world of AI and machine learning, data has a new role: it essentially becomes the source code for machine-driven insight. With AI and machine learning, the data is the core of what fuels the algorithm and drives results. Without a significant quantity of good-quality data related to the problem, it’s impossible to create a useful model.

Ethics around Data

With the collected datasets and AI, new responsibilities become inevitable.

Why? Because “shit in = shit out”.

With enough data, correlations between shoe size and crime can be visualized. Just because we can do this does not mean we should.

As AI becomes more ubiquitous in our lives, we should realize how important it is for the machines we created to treat people fairly.

The more ML and AI are put into practice, the more we see how easy it is for a machine to perpetuate the unfairness of the past.

Ethics has to be built into the ground floor of data engineering, and it needs to be reviewed consistently.

Ethics has to be fundamental, but that human refinement of what is right and fair also needs to happen on the deployment side, to keep AI from adversely affecting people.

Preventing Biased Data

Collected data is about the past. Unless you want your ML model to perpetuate the biases of the past, you need a values-based approach.

Every dataset is biased in some way. Incompatible datasets, questions with hidden prejudices, measurement errors, low-quality data and more can result in biased ML applications.

An ML system trained only on existing customer data may not be able to predict the needs of new customer groups that are not represented in the training data.
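One simple hygiene step is to check how well each group is actually represented before training. A minimal sketch (field names and the 5% threshold are my own assumptions, not a standard):

```python
from collections import Counter

def representation_report(records, group_key, threshold=0.05):
    """Flag groups that make up less than `threshold` of the training data."""
    counts = Counter(r[group_key] for r in records)
    total = sum(counts.values())
    return {group: count / total
            for group, count in counts.items()
            if count / total < threshold}

# Toy training set: the "new" customer segment is barely represented.
training = [{"segment": "existing"}] * 97 + [{"segment": "new"}] * 3
print(representation_report(training, "segment"))  # {'new': 0.03}
```

Flagged groups like this are a signal to collect more data for them, or at least to treat the model’s predictions for them with caution.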

When trained on man-made data, ML is likely to pick up the same institutional and unconscious biases already present in society.

Some examples:

- Machine learning systems used for criminal risk assessment have been found to be biased against black people.

- In 2015, Google Photos often tagged black people as gorillas. In 2018 this was still not resolved: Google still used a workaround that removed all gorillas from the training data, and therefore could not recognize real gorillas at all.

- In 2016, Microsoft tested a chatbot that learned from Twitter, and it quickly picked up racist and sexist language. Before using data, it has to be ensured that the results are not biased, by testing and refining them.
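Testing the results for bias can start very simply, for example by comparing positive-prediction rates across groups (a rough demographic-parity check). A sketch, with made-up toy data:

```python
def demographic_parity_gap(predictions, groups):
    """Largest difference in positive-prediction rates between groups.
    predictions: list of 0/1 model outputs; groups: parallel list of labels."""
    rates = {}
    for pred, group in zip(predictions, groups):
        n, pos = rates.get(group, (0, 0))
        rates[group] = (n + 1, pos + pred)
    ratios = {g: pos / n for g, (n, pos) in rates.items()}
    return max(ratios.values()) - min(ratios.values())

preds  = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
print(demographic_parity_gap(preds, groups))  # 0.5
```

A gap near 0 does not prove fairness, but a large gap like this is a clear signal to investigate the training data before deployment.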

Privacy becomes key

Who owns the data, who makes money with the data, and how is the data used?

Customers are becoming more aware of their data, and of the fact that other parties are also interested in their data and behavior patterns.

GDPR, scandals around Facebook, Quora, Cambridge Analytica or Palantir are just the beginning.

Every responsible data collector and user has to make transparent what happens to the data, and has to store and process the data securely.
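One basic building block for processing personal data responsibly is pseudonymization: replacing direct identifiers with keyed hashes, so records stay linkable for analysis without exposing the raw value. A minimal sketch (the key name and value are placeholders; in practice the key lives in a secrets store):

```python
import hmac
import hashlib

SECRET_KEY = b"rotate-me-regularly"  # placeholder; keep real keys out of the dataset

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier with a keyed SHA-256 hash.
    The same input always yields the same token, so joins still work."""
    return hmac.new(SECRET_KEY, identifier.encode(), hashlib.sha256).hexdigest()

token = pseudonymize("jane.doe@example.com")
assert token == pseudonymize("jane.doe@example.com")  # stable
assert token != pseudonymize("john.doe@example.com")  # distinct
```

This is not full anonymization (with the key, tokens can be linked back), but it is a sensible default for internal analytics pipelines.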

Preventing Adversarial Effects

ML is vulnerable to adversarial effects: patterns that are not visible to humans, but are picked up by the pattern recognition of ML algorithms.

One could best describe it as optical malware.

It will become absolutely essential to test algorithms against such vulnerabilities, or to implement secure procedures to prevent unwanted results.
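The core mechanism is easy to show on a toy linear classifier: nudge each input feature by a tiny amount in the direction that hurts the model most (the sign of the gradient), and the prediction flips even though the input barely changed. This is the idea behind the fast gradient sign method (FGSM); the weights and epsilon below are made-up toy values:

```python
def score(w, b, x):
    """Linear classifier: positive score -> class 1, negative -> class 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def adversarial_example(w, x, eps):
    """Shift each feature by eps against the sign of its weight,
    i.e. in the direction that lowers the score (FGSM for a linear model)."""
    return [xi - eps * (1 if wi > 0 else -1) for wi, xi in zip(w, x)]

w, b = [2.0, -1.0], 0.0
x = [0.5, 0.4]                         # score 0.6 -> classified positive
x_adv = adversarial_example(w, x, eps=0.5)
print(score(w, b, x), score(w, b, x_adv))  # 0.6 -0.9: prediction flipped
```

Real attacks do the same thing against deep networks in high-dimensional image space, which is why the perturbation can stay invisible to humans while flipping the label.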

Data governance

Every company needs to break down data silos, set up workflows for cleaning data, and establish a culture based on using data engineering and data science appropriately.
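A cleaning workflow does not have to start big. A minimal sketch of one pass (the record fields and rules are illustrative assumptions): reject incomplete records, normalize values, and deduplicate.

```python
def clean(records):
    """Minimal cleaning pass: drop records with missing fields,
    normalize email casing/whitespace, and deduplicate."""
    seen, out = set(), []
    for r in records:
        if not r.get("id") or not r.get("email"):
            continue                      # incomplete -> reject
        key = (r["id"], r["email"].strip().lower())
        if key in seen:
            continue                      # duplicate -> reject
        seen.add(key)
        out.append({"id": r["id"], "email": key[1]})
    return out

raw = [{"id": 1, "email": "A@x.com"},
       {"id": 1, "email": "a@x.com "},   # duplicate after normalization
       {"id": 2, "email": None}]         # incomplete
print(clean(raw))  # [{'id': 1, 'email': 'a@x.com'}]
```

The point is less the code than the habit: rules like these belong in a repeatable, reviewed pipeline, not in ad-hoc scripts run once before training.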

Data engineering becomes such an essential aspect of AI and ML that data needs a new significance and awareness, instead of being taken for granted.

Nowadays, software isn’t where the greatest value comes from anymore. For ML, good, clean, labelled datasets are where the real value lies.

Data is the new foundation of business and services. Maintaining this data hygiene is a recurring procedure that needs to become a fundamental imperative in every company.