As the technology market and other business segments have been suggesting for some years, we are living in a time when more and more data is being produced. This growth in data brings more business opportunities to solve long-standing problems more quickly and more efficiently.

We have been producing and consuming more digital data than ever, and the trend suggests this will keep increasing (see the graph below). On this trend, 2020 is set to break every previous record for digital data creation!

Riding the same wave, Machine Learning and Deep Learning applications will consume even more of the huge amount of data stored in Data Lakes and other storage systems. In other cases, algorithms are already extracting data from documents and building databases from data that was never worked on before, which opens up new data opportunities to explore.

Of course, it is very challenging for companies to reach a good level of maturity in this field. Some of the reasons are the variety of frameworks, languages, data formats and so on, and we can conclude that this variety is one of the reasons Machine Learning practice is not yet very mature. The graph below, from a survey published by O’Reilly in 2018 [4], gives a better picture of Machine Learning adoption within companies.

We see that there is a huge trend toward Machine Learning solutions, but where does data privacy fit into this subject? Are these companies able to run hybrid cloud solutions, moving data between infrastructure tiers while keeping it all safe? And when we say ‘security and privacy’, the sensitivity of the data comes straight to mind.

According to Gabriel Ghinita, “Governments in the European Union, United States, and Asia-Pacific have all started some form of initiative to mandate privacy requirements for service providers that handle individuals’ location data” [2]. As a result, laws such as GDPR (https://www.eugdpr.org/the-regulation.html) and CCPA (California Consumer Privacy Act) have come about to protect us and to establish standard practice for how companies use personal data.

When it comes to privacy, you might have heard:

“Privacy is preserved if, after the analysis, the analyser doesn’t know anything about the people in the dataset. They remain ‘unobserved’.” [3]


However, we need to bear in mind that some scientific research needs sensitive data to get the most accurate results and to benefit the population. For example, cancer research needs access to CT scan results to better predict and identify tumours. These benefits also exist in the corporate world, where, for example, multiple Machine Learning models need access to different kinds of data, mostly personal and behavioural, that normally belong to different data owners inside the corporation. But keeping these data operations private and secure is a painful task, as probably not all Data Scientists can have access to all the necessary datasets. So the key question is: how do we access and manipulate this very sensitive data for different purposes while keeping it private and secure?

Some answers have started to emerge in the form of privacy techniques and practices that big technology players such as Google, Microsoft and others are investing in, and they look set to become a huge trend in the Artificial Intelligence and Deep Learning market.

For this reason, techniques such as Differential Privacy, Federated Learning and Homomorphic Encryption are appearing more often in current debates and at data conferences, which clearly shows a global concern pushing this movement.

Differential Privacy

Differential Privacy adds random noise to each query result to preserve data privacy, allowing users to interact with the database only by means of statistical queries. In a very simple, high-level explanation, the amount of noise is defined by the sensitivity of the query, that is, how much a single person's record can change the result: the higher the sensitivity, the more noise is applied. This lets us quantify the level of privacy for each database and measure the likely impact of a possible data leak.
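As a rough illustration of the idea, the sketch below answers a counting query with Laplace noise, one common way to achieve differential privacy. The function names, the `epsilon` privacy parameter and the sample data are assumptions made for this example and do not come from any particular library.

```python
import random


def laplace_noise(scale, rng):
    # The difference of two independent exponential samples with mean `scale`
    # follows a Laplace distribution centred on 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records matching `predicate`.

    `epsilon` is the (assumed) privacy parameter: smaller values mean
    more noise and therefore stronger privacy.
    """
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    # Adding or removing one person changes a count by at most 1,
    # so the sensitivity of a counting query is 1.
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    ages = [23, 35, 41, 52, 29, 64, 38, 47]  # hypothetical data
    print(private_count(ages, lambda age: age >= 40, epsilon=0.5))
```

Each call returns a slightly different answer, so a single response reveals little about whether any one person is in the dataset, while the answers still centre on the true count.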

One of the techniques for applying differential privacy to neural networks is the PATE algorithm (Private Aggregation of Teacher Ensembles, by Papernot, Goodfellow et al.), where "The student learns to predict an output chosen by noisy voting among all of the teachers, and cannot directly access an individual teacher or the underlying data or parameters" [8]. You can find many blog posts explaining this approach, or take a look at the paper [8]. The image below represents the flow between Teachers and Students [9] according to the PATE approach.
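The "noisy voting" step can be sketched as follows: count the teachers' votes per class, add Laplace noise to each count, and return the class with the highest noisy count. This is a minimal sketch of the aggregation step only, not the full PATE training pipeline; the function names and the `gamma` parameter name are assumptions chosen for this example.

```python
import random


def laplace_noise(scale, rng):
    # Difference of two exponential samples gives Laplace noise centred on 0.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_vote(teacher_labels, num_classes, gamma, rng=None):
    """Aggregate teacher predictions for one input with noisy voting.

    `teacher_labels` holds one predicted class index per teacher.
    `gamma` controls privacy: smaller values add more noise to the counts.
    """
    rng = rng or random.Random()
    counts = [0] * num_classes
    for label in teacher_labels:
        counts[label] += 1
    noisy_counts = [c + laplace_noise(1.0 / gamma, rng) for c in counts]
    # The student only ever sees this single winning label.
    return max(range(num_classes), key=lambda k: noisy_counts[k])


if __name__ == "__main__":
    votes = [2] * 90 + [0] * 5 + [1] * 5  # hypothetical ensemble of 100 teachers
    print(noisy_vote(votes, num_classes=3, gamma=1.0))
```

When the teachers largely agree, the noise rarely changes the outcome; when they are split, the noise can flip the result, which is what hides any single teacher's influence.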

TensorFlow also has a library for differential privacy, TensorFlow Privacy, which implements Differentially Private Stochastic Gradient Descent (DP-SGD); take a look at its repository.
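To give an intuition for what DP-SGD does, here is a minimal sketch in plain Python of a single update step: each example's gradient is clipped to a maximum norm, the clipped gradients are summed, Gaussian noise is added, and the result is averaged. It is not the TensorFlow Privacy API; the function and parameter names (`max_norm`, `noise_multiplier`) are assumptions for illustration.

```python
import math
import random


def clip_gradient(grad, max_norm):
    # Scale the gradient down so its L2 norm is at most `max_norm`.
    norm = math.sqrt(sum(g * g for g in grad))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grad]


def dp_sgd_step(weights, per_example_grads, learning_rate,
                max_norm, noise_multiplier, rng=None):
    """Apply one DP-SGD style update and return the new weights."""
    rng = rng or random.Random()
    clipped = [clip_gradient(g, max_norm) for g in per_example_grads]
    summed = [sum(g[i] for g in clipped) for i in range(len(weights))]
    # Noise is calibrated to the clipping norm, which bounds how much
    # any single example can contribute to the sum.
    noisy = [s + rng.gauss(0.0, noise_multiplier * max_norm) for s in summed]
    batch_size = len(per_example_grads)
    return [w - learning_rate * n / batch_size for w, n in zip(weights, noisy)]


if __name__ == "__main__":
    grads = [[3.0, 4.0], [0.3, 0.4]]  # hypothetical per-example gradients
    print(dp_sgd_step([0.0, 0.0], grads, learning_rate=0.1,
                      max_norm=1.0, noise_multiplier=1.1))
```

Clipping limits the influence of any one training example, and the noise masks what remains of it.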


Federated Learning

Another practice worth highlighting is Federated Learning, which basically consists of bringing the model to the data. This is the opposite of the classical approach, which sends all the data to the model.

The most famous use case for federated learning is mobile phones, where it "enables mobile phones to collaboratively learn a shared prediction model while keeping all the training data on device" [7]; see [7] for more details.

To understand this flow a little better: "Your phone personalizes the model locally, based on your usage (A). Many users' updates are aggregated (B) to form a consensus change (C) to the shared model, after which the procedure is repeated" [7]. The point is that with federated learning we transfer ML models instead of data. TensorFlow has a dedicated website for this subject, TensorFlow Federated, where you can find more implementation details.
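The flow described above can be sketched with a toy example: every client trains a copy of the shared model on its own data, and the server only sees and averages the resulting weights. This is a minimal sketch assuming a one-parameter linear model (y = w * x) and a simple weighted average; it is not the TensorFlow Federated API, and all names and data are hypothetical.

```python
def local_update(weight, data, learning_rate=0.01, epochs=5):
    """Train on one client's private data and return the updated weight (A)."""
    w = weight
    for _ in range(epochs):
        for x, y in data:
            gradient = 2 * (w * x - y) * x  # derivative of the squared error
            w -= learning_rate * gradient
    return w


def federated_round(global_weight, clients):
    """Aggregate the client updates (B) into a new shared weight (C)."""
    total_examples = sum(len(data) for data in clients)
    # Only weights leave the clients; the raw (x, y) pairs never do.
    return sum(local_update(global_weight, data) * len(data)
               for data in clients) / total_examples


if __name__ == "__main__":
    # Hypothetical clients whose private data all follow y = 3 * x.
    clients = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (1.0, 3.0)]]
    weight = 0.0
    for _ in range(20):
        weight = federated_round(weight, clients)
    print(weight)
```

The shared weight moves toward the value that fits every client's data, even though no client ever reveals its examples to the server.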


Homomorphic Encryption

Homomorphic encryption is a well-known encryption approach with some known limitations, such as performance, some of which have been improved over time. Despite these limitations, it is a very good fit when we think about data solutions in the cloud. The first fully homomorphic scheme was constructed in 2009, and we can say that "it allows computation to be performed directly on encrypted data without requiring access to a secret key. The result of such a computation remains in encrypted form, and can at a later point be revealed by the owner of the secret key" [5].

In other words, homomorphic encryption preserves mathematical operations through the encryption: computing on the encrypted values and then decrypting gives the same result as computing on the original data. For example, "the cloud can directly operate on the encrypted data, and return only the encrypted result to the owner of the data." [10]
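To make the idea concrete, here is a toy sketch of the Paillier cryptosystem, a classic scheme that is only additively homomorphic: multiplying two ciphertexts yields an encryption of the sum of the plaintexts. It is a different, simpler scheme than the ones offered by libraries such as Microsoft SEAL, and the tiny primes used here are an assumption chosen for readability; they offer no real security.

```python
import math
import random


def generate_keys(p=293, q=433):
    """Return (public modulus n, private key) from two small demo primes."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # modular inverse, requires Python 3.8+
    return n, (lam, mu)


def encrypt(n, message, rng=None):
    rng = rng or random.Random()
    n_squared = n * n
    while True:
        r = rng.randrange(1, n)  # random blinding factor coprime to n
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, message, n_squared) * pow(r, n, n_squared)) % n_squared


def decrypt(n, private_key, ciphertext):
    lam, mu = private_key
    u = pow(ciphertext, lam, n * n)
    return ((u - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    # Multiplying ciphertexts adds the underlying plaintexts (modulo n).
    return (c1 * c2) % (n * n)


if __name__ == "__main__":
    n, private_key = generate_keys()
    c1, c2 = encrypt(n, 20), encrypt(n, 22)
    # The "cloud" adds the values without ever seeing 20 or 22.
    encrypted_sum = add_encrypted(n, c1, c2)
    print(decrypt(n, private_key, encrypted_sum))
```

Only the holder of the private key can read the result, while the party doing the addition works purely on ciphertexts.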

Microsoft has created Microsoft SEAL, a homomorphic encryption library, to make this technology available to its customers, and it works as shown below [10].

More information about homomorphic encryption can be found at the Homomorphic Encryption Standardization website [5], and to dig deeper into the code you can visit the HElib project repository.

Conclusions

Of course, there are limitations and challenges in implementing these techniques, but we need to bear in mind that some of them are already in use, and these topics are a natural trend for the near future of data privacy and data security in Machine Learning and Deep Learning.

References

  1.  Javier Luraschi, Kevin Kuo, Edgar Ruiz. Mastering Spark with R
  2.  Gabriel Ghinita. Privacy for Location-based Services
  3.  Introducing Differential Privacy: https://classroom.udacity.com/courses/ud185
  4.  Ben Lorica, Paco Nathan. The State of Machine Learning Adoption in the Enterprise
  5.  Homomorphic Encryption Standardisation: https://homomorphicencryption.org
  6.  How to build privacy and security into deep learning models - Yishay Carmiel: https://learning.oreilly.com/videos/how-to-build/0636920339366/0636920339366-video327673?autoplay=false
  7.  Google AI Blog: https://ai.googleblog.com/2017/04/federated-learning-collaborative.html
  8.  Semi-Supervised Knowledge Transfer for Deep Learning from Private Training Data: https://arxiv.org/pdf/1610.05755.pdf
  9.  Privacy and machine learning: two unexpected allies?: http://www.cleverhans.io/privacy/2018/04/29/privacy-and-machine-learning.html
  10. Microsoft SEAL: https://www.microsoft.com/en-us/research/project/microsoft-seal/
  11. AI Blues (Joke Image): https://www.jetferry.ai/images/joke-9.png
  12. Main image: https://kapost-files-prod.s3.amazonaws.com/uploads/asset/file/5970a12b14b1dc004e000126/cybersecurity_embed.jpg