Brief summary
A changing regulatory environment has made it more important than ever for organizations to embed privacy in their data infrastructure. Doing so, however, can be complicated, which means data scientists have a vital role to play in ensuring privacy is a key concern from both a technical and a commercial perspective.
Thoughtworker and data scientist Katharine Jarmul is eager to help fellow data scientists master privacy principles and techniques. Her new book, Practical Data Privacy, covers everything from the fundamentals of governance and anonymization through to advanced approaches to data privacy like federated learning and encrypted computation.
In this episode of the Technology Podcast, Katharine joins hosts Rebecca Parsons and Birgitta Böckeler to discuss the book and explain why data scientists need to be on the frontline in the fight for privacy.
Episode transcript
[MUSIC PLAYING]
Rebecca Parsons: Hello, everyone. My name is Rebecca Parsons. And I'd like to welcome you to another edition of the Thoughtworks Technology Podcast. And I am joined today by one of my co-hosts, Birgitta.
Birgitta Böckeler: Hi. This is Birgitta Böckeler. I'm a technical principal with Thoughtworks in Berlin, Germany.
Rebecca: And our special guest today is Katharine, who has a new book almost coming out, getting very close, on practical data privacy. So Katharine, would you please introduce yourself?
Katharine Jarmul: Yeah. No problem. Thank you so much for having me.
My name is Katharine Jarmul. I'm a principal data scientist at Thoughtworks Germany. And my focus for the past five, six years has been on data privacy in the area of machine learning, or what we would now call AI, and data science. And the upcoming book, which I'm excited to talk about, is coming out in May 2023 from O'Reilly and is called Practical Data Privacy. I believe the tagline is "enhancing privacy and security in data."
Rebecca: So why don't you start by telling us a little bit about the book, the approach that you took to the subject, and who you're trying to reach?
Katharine: Yeah. So the book itself was, I guess, first conceived by O'Reilly, who gave me a call because I had run some trainings on data privacy for them. I think what we're seeing in the world of generative AI, but also some of the trends we're seeing globally around data protection regulation, were starting to shift the amount of interest that O'Reilly was seeing around data privacy as a topic. They were aware of this, and they were aware that there was a bit of a gap in their library, which is why they really wanted a text that I was very, very happy to write: a book for data scientists by data scientists. So for somebody who's practically doing data science work every day, whether it involves machine learning or not: how do we actually build privacy into these systems? Because a lot of these systems currently are not built with privacy as a first-class citizen, so to speak, of the data science process.
And that's been a lot of my work for, again, the past five or six years: thinking about how we build privacy into the data science process, and how we as a data science and machine learning community learn to use these tools, because a lot of us don't come from a background where we were necessarily trained on principles like privacy by design. So I think technologists in general can learn from it, but it's specifically geared towards data scientists, teaching them about privacy technologies and data privacy as a concept so they can build them into real data science workflows.
Birgitta: So is it fair to say then that it's all about navigating that big tradeoff: as a data scientist, I need all of this data to get the best insights out of it, but there's also a lot of that data I actually don't even want to see, and don't want my model to see, because I want to preserve privacy? And then you constantly have to navigate that tradeoff somehow. Is that fair to say?
Katharine: Yeah. I mean, I think that's a general theme in the book, and I really like that you point it out. Sometimes we have this false mindset that privacy is either on or off. Early on in the book, I introduce the idea of a privacy-utility, or privacy-information, continuum: we can find ourselves anywhere along it, between the amount of privacy we'd like to offer the people in the data set and the amount of information we'd like to gather from the data. The book keeps coming back to this theme of figuring out how we choose a good point along that continuum and improve it over time, hopefully using technologies that work really well, so that it ends up being a bit of a win-win rather than a full tradeoff or compromise.
Rebecca: The title of the book really speaks to me with that focus on practical. Tell me a little bit more about how that informed your thinking about the kind of content you would put in, in addition to the way it informs your thinking about the field in general.
Katharine: Yeah. Another reason the book exists, or why I was really excited to write it, is that when I started getting into this topic, a lot of the content available was either very, very high level and general, and one was left thinking, OK, I get the idea, but how do I do it, how do I build it? Or it was extremely research-based and academic in nature. And I love reading research, don't get me wrong, and I love diving into it. But that also often made it hard for me to take something directly from the research.
A) do I understand it? Do I know all the words? Do I know all the concepts or do I have to do a bunch of extra reading to understand it and contextualize it? And B) even if I do understand it from the reading I've done thus far, how do I actually take it and use it?
And I think that's really what this book is aimed at: bridging that gap, toward a more general technologist or general data person's understanding of these concepts. So, A, making it easier, making the learning curve less steep to get into the topics, and then, B, every chapter has an implementation section. So every chapter has actual code.
There's a repository. You can run the stuff. There's example data that I use. So my goal is to get people out of being privacy-interested and into being privacy-active, where they can actively use the code from the book, or use the libraries and concepts from the book, know the base theory, and then apply it to their work.
Birgitta: And that's sometimes like me, for example: the way I learn is that I always need some examples of practical application, otherwise it often just doesn't stick for me. And I had a look at the repository that you have for the book, and I really like how, in the notebooks, you almost have a little storyline. I saw things like, oops, that didn't work, but now we still need to identify this person. So there's a nice little narrative even in the notebooks in the repository, which I really liked.
Katharine: Well, thanks. Yeah. I think, like a lot of us, we learn by doing, right? And so I think there's a non-trivial number of technologists who just really want to play with something. In the notebooks, too, there are a lot of challenges, like homework: can you figure this out? Can you do this? And I'm hoping that there will also be some reader contributions to the notebooks over time.
Rebecca: So one of the things that really struck me in the book was the way you tie privacy explicitly to business goals. You actually had a sentence, and I'm paraphrasing it: "If you think that privacy conflicts with your business goals, you need to rethink your business goals"! Can you talk a little bit more about that? Because, as with that continuum you were talking about earlier, I do think some people still see these things in conflict.
Katharine: Yeah. I mean, there have been decades now of us collecting potentially the most data the world has ever collected, with increasing acceleration. That's been the trend, and it's been broadly accepted by industry. We've made amazing inroads in high-performance networking, high-performance computing, distributed computing, and all these things. And now, I think, finally, with both the climate change and climate disaster questions, but also with privacy questions and usefulness questions, there's a bit of pushback starting: can we do more with less? Is there the ability to do so?
And one of the big reasons why I got into privacy is something I started to notice. I was working in natural language processing, so some of the same technologies that things like ChatGPT now use. And what I noticed is that there was a never-ending quest for more data and the idea of bigger and bigger models, which, of course, we see now with the large language models. But the more data we put into them, particularly the more readily available data, the more we started running into quite toxic issues in the language modeling, where there were whole regions of the language model that were really abhorrent in a lot of ways. And a lot of that had to do with figuring out: what data do I actually want to use, do I want to be picky about the data I use, and do I want to think about the use of this model when I'm training all of this and doing all of this?
And that actually drove me to privacy, because we were asking these questions of what a model should be able to learn when given private or sensitive data. And there, one of the connections we saw is that if I have a model and it makes a choice based on a sensitive attribute, let's say your gender, or your race or ethnicity, then I've probably built a model that I need to question or think about.
And when we think about the business goals of implementing these models, it's kind of like when we look at the engagement models or the recommendation models that we have now.
They've been running for quite some time. But is what we're really after more clicks, more shares, more comments? Because that's what we've tended to optimize, and yet we find that sometimes these systems essentially end up driving engagement with the most enraging content possible.
And so a lot of times when we build a business model and it's in contradiction to our definition of, let's say, ethical use of technology or our definition of privacy, we have to start questioning what it is we're actually trying to do with the technology that we're building. And I think that's, I guess, not only how I got into privacy but also this point of, if you can't build it with privacy built-in, maybe you need to wonder what's the goal of what you're building.
Birgitta: So these abhorrent parts of the model that you're talking about, those examples are maybe not directly privacy questions, but it's a similar kind of problem, right? You have to think about what you put into the model, and that led you all the way back to the root, to this privacy question, yeah?
Katharine: Indeed. Yeah. And when we think about the language modeling, it was very open. One of the reasons why I think privacy in language modeling would be really cool is that we wouldn't have to use terabytes of data scraped from the web. If, let's say, the three of us wanted to get together and build a large language model, and we trusted each other enough to build a model together, we could presumably use our personal texts, which are often off limits to people like OpenAI. Right?
We could use our personal texts and combine them to create language models that are closer to how we speak with one another and how we speak individually. Of course, we would also need massive compute power, and I'm skipping over some of the implementation details. But the goal that I see for privacy in machine learning, and in, quote unquote, "generative AI systems," is that humans feel more trusting towards smaller groups, and that communities can form and actually create the data and create these models, which maybe one day can be collectively owned or collectively used, and that these contributions can maybe also be tracked and revoked and so forth, with respect to privacy law.
And the goal there is less of “let's scrape the entire internet and push it into a language model,” and more “let's think about whose data we can use, with respect to the users' needs, and build a language model that's better for everything they're actually trying to do.” That's the end goal, I guess.
Birgitta: Yeah. But to come back to Rebecca's question about privacy and its relation to business goals, Thoughtworks just released, together with MIT, a study about responsible technology, and asked a lot of businesses about the different areas of responsible tech they're thinking about.
And Rebecca, you might remember the numbers better than I do; I read it a few weeks ago. But privacy concerns were actually high up on the list, because it's become much more of a topic among consumers, with all of the different fail stories in the media and so on, right?
Rebecca: Yeah. And also interestingly, and frankly I was a bit disappointed in this, while many businesses acknowledged that there were true business objectives around brand protection, brand recognition, and recognition as an employer in terms of being able to recruit talent, compliance was still considered a big driver for their interest.
And so I do think the regulatory frameworks are also helping to drive some of that focus from a business perspective. And if that's what it takes to get there, so be it, but I would far rather they look at this from the perspective of the business opportunities that are made available by being privacy-forward and being seen in the marketplace as respecting privacy. But who knows.
Katharine: Yeah. I think there's an interesting connection there, though, because the changes we're seeing in the regulatory environment are driven by democratic wishes for better privacy controls. Folks are talking about this and putting it forward as, hey, these issues are important to me. What's not yet clear is who's going to win the consumer-facing part of that.
And I would hope, or one of the goals that I have, especially in my work here at Thoughtworks, is to empower more companies to be both data-forward, so still leading on data strategy and developing forward-thinking data initiatives, but also privacy-aware. Because what we're seeing right now is that it's often the very large technology companies that have deployed these systems thoroughly.
So they're not worried about compliance, because they're kind of leading the compliance in terms of the technology advancements they're actively deploying into production systems. And what I hope is that there are more players in that space, that it's not just Apple that gets to brag about their differential privacy implementation, but that it can be every company, or at least lots of companies, that can actually use these tools and empower their users to make consensual choices about their data use and collection.
Birgitta: So then what are these tools, Katharine? What are roughly the areas that your book covers, the practical things companies can do?
Katharine: Yeah. I mean, we start with the very basics, which is really thinking through data governance. And in this case, you would also want to think through AI governance or machine learning governance as well. With that comes knowing what your data is, ensuring that your data is properly cataloged or organized, documented and so forth, and understanding the consent and the other rules of your data collection. So that means also working with legal professionals, and basically ensuring that you're set up with the most basic of privacy protections, things like making sure there are tokens, or masks, or pseudonymization of personally identifiable information, or PII.
That's the basic level of stuff. And there I think there are still a lot of opportunities for every company to take a look at what they're doing and to improve not only their data understanding, which is going to help with any data science initiatives they have, but also data quality, data literacy across the organization, as well as compliance, auditing, monitoring, and privacy initiatives.
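As a rough illustration of the pseudonymization Katharine mentions, here is a minimal Python sketch. It is not from the book's repository; the column names, data, and salt are made up, and a real pipeline would manage the salt or token vault as a secret.

```python
import hashlib

import pandas as pd

# Hypothetical example data; in practice the PII columns come from your data catalog.
df = pd.DataFrame({
    "email": ["ada@example.com", "grace@example.com"],
    "purchase_total": [42.0, 17.5],
})

SALT = "replace-with-a-secret-value"  # keep out of source control

def pseudonymize(value: str) -> str:
    """Replace a PII value with a salted hash: joins still work, but the raw value is hidden."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()[:16]

df["email"] = df["email"].map(pseudonymize)
print(df)
```

Note that this is pseudonymization, not anonymization: whoever holds the salt, or can enumerate likely inputs, can link records back to people, which is exactly why techniques like differential privacy are treated separately later in the conversation.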
Birgitta: So first you need to figure out what you even have to, or want to, keep private; so you need that transparency.
Katharine: Absolutely. Yeah. Yeah. It doesn't help to apply a technology to everything. You need to kind of know what you're doing. So with a lot of these technologies, where they're at right now is it's not a generic one-size-fits-all solution. So knowing your data, knowing your use cases, knowing the rights that you have for what data you're trying to use, this can go a long way in setting yourself up for not only good experimentation but actually good practical and deployed usage of any of the new technology.
Rebecca: And so that's really the first step for an organization overall. But for technologists who are intrigued by this notion of privacy engineering and privacy by design, what does it take to actually get started in doing this? I know you reference many different toolsets that are out there, but what's the way to get started?
Katharine: Yeah. So I mean I think that the data governance is a really good starting place. So ensuring that, again, the understanding of the data is high, that you also understand how data moves through the organization and what rights or what use cases should be given what access. And this is kind of where you're at the basic starting point.
If you want to then take it to the next level, let's say you've covered the bases on that initial step, you might say, OK, we have some new marketing use cases, we'd like to compare data with another company, and we want to figure out how to do that in the most privacy-respecting way. And this is not only good for privacy, it's also good for the proprietary information in your company, for things not leaking every time you analyze a new partnership or sign up with a new data sharing platform. This is also your data as a competitive advantage, which I think is generally overlooked as an advantage of deploying privacy technology but is clearly one.
And then the book goes through three major technologies that I'm really excited about and that I think are production-ready. The first is differential privacy as a way to anonymize data; this is the strictest form of what we would call anonymization. Then there's federated learning and federated data analysis, where we actually don't centralize the data at all. We leave the data wherever it is. And this could be great for partnerships: you're working with a new partner; you don't send them any data, they don't send you any data, and you still perform analysis together. You send each other just the results of these analyses. This can go a long way, and it can also be combined with differential privacy should you need it. And then a third technology that I'm really excited about and that's covered in the book is a variety of types of encrypted computation.
And this means that we can actually compute, we can do data processing on data, without ever decrypting it. So we can process it, derive insights, even run machine learning on encrypted data. That again goes a long way in these partnerships, in any type of data sharing situation, to say, you know what, let's only share the results of this analysis; let's not actually share all of the data that we have here. All of these can be, and are, used in production systems today. And so the book takes you from, OK, you have your basics covered, into the new, exciting, and usable types of advanced privacy technologies that have really come out of research labs and into production systems in the past, I don't know, five to ten years.
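To make the combination of federated-style analysis and differential privacy she describes a little more concrete, here is a minimal Python sketch. It is not from the book's repository: the data, clipping bounds, and epsilon are made up, and a real deployment would use a vetted library (for example OpenDP) and careful privacy-budget accounting rather than this hand-rolled noise.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical local datasets that never leave each party.
party_a_ages = np.array([34, 29, 41, 55, 23])
party_b_ages = np.array([61, 38, 27, 45])

LOWER, UPPER = 18, 90  # agreed bounds; clipping limits each person's influence (sensitivity)

def local_aggregate(values):
    """Each party computes only a clipped sum and a count locally and shares just those."""
    clipped = np.clip(values, LOWER, UPPER)
    return clipped.sum(), len(clipped)

def laplace_noise(sensitivity, epsilon):
    """Laplace mechanism: noise scaled to sensitivity / epsilon."""
    return rng.laplace(loc=0.0, scale=sensitivity / epsilon)

sum_a, count_a = local_aggregate(party_a_ages)
sum_b, count_b = local_aggregate(party_b_ages)

epsilon = 1.0  # illustrative privacy budget per released statistic
noisy_sum = sum_a + sum_b + laplace_noise(sensitivity=UPPER - LOWER, epsilon=epsilon)
noisy_count = count_a + count_b + laplace_noise(sensitivity=1, epsilon=epsilon)

print("Differentially private estimate of the mean age:", noisy_sum / noisy_count)
```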
Rebecca: And I can kind of get my head around both the differential privacy and the federated learning, conceptually. But my brain kind of goes into fits and starts of thinking about computation on encrypted data. Can you go that next level deep on how in the world does that work?
Katharine: Yeah. Yeah.
Birgitta: Yeah. Just before that, real quick: it's the same for me. Years ago, I think at the Chaos Communication Congress, I saw a talk about sending queries to a server where the server must not understand your query but can still return a result for you, and that similarly broke my brain a little bit.
Katharine: Yeah. Yeah!
Birgitta: So yeah. I would love to hear some like for dummies summary from you!
Katharine: Well, I don't think I have any dummies on this podcast call right now, but I'm happy to take it step by step! So encrypted computation: it's really the coolest field ever. If I have to choose the coolest field ever, it's encrypted computation. It's a subfield of cryptography as a whole.
A lot of the core math that we use in cryptography uses this concept of a ring or of a field. So you know how a ring works: you have a clock. A clock is a ring. It's really easy. It goes to 12, and then it comes back around and goes to 12 again. And when we operate in a ring, we can use modular arithmetic, so remainder arithmetic. Right? So when I go past 12:00, I wrap around to 1:00, and to 2:00, and so forth.
And so these properties are actually often used in cryptography in what's called a field. So, if instead of choosing 12:00, I choose let's say a super large prime number — and you might remember this from setting up cryptography systems, and you're like, OK, there's a prime that we use, and there's a prime generator and so forth — this is a lot of how some of these systems work.
So let's say that instead of choosing 12 as the end of your wraparound, you use a really huge prime number. Well, one of the cool things that does is it hides your data. When it wraps around and goes again, we can use the properties of this field to essentially encrypt a number, hide it with a variety of methods, and then decrypt it: we use the field to say, OK, I want to take these numbers away from it in order to get back my original value. We could do that. I'm simplifying a little bit for ease of understanding here.
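As a toy illustration of this "hide a value by wrapping around a big prime" idea, here is a small Python sketch. The prime, the values, and the functions are purely illustrative and not taken from the book or from any real cryptosystem.

```python
import secrets

# A toy "clock size": a Mersenne prime, chosen here purely for illustration.
# Real cryptosystems use carefully chosen, much larger parameters.
PRIME = 2**61 - 1

def mask(value: int) -> tuple[int, int]:
    """Hide a value by adding a random pad and wrapping around the prime."""
    pad = secrets.randbelow(PRIME)
    return (value + pad) % PRIME, pad

def unmask(masked: int, pad: int) -> int:
    """Subtract the pad (mod the prime) to recover the original value."""
    return (masked - pad) % PRIME

hidden, pad = mask(42)
print(hidden)               # looks like a random number in [0, PRIME)
print(unmask(hidden, pad))  # 42
```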
And so in encrypted computation, one of the things we do is use this property, but only with very special cryptosystems or very special protocols. There's a subfield, also talked about in the book, called multi-party computation, where you can essentially split a piece of data into a few different secrets and share them with a few different people, and only if those people combine their shares can they use the data to compute.
So you have these two methods: either a special cryptosystem or these special protocols like secret sharing. And then you can actually use the same properties of the field to make sure you get a correct result, which is kind of crazy when you think about it. At the end of the day, this is all ensuring that the math works out so that when you add two numbers in encrypted space, say an encrypted four and an encrypted eight, you actually get an encrypted 12 as the result. And then you can use that same decryption method to reveal the final output.
Of course, many of these things are much more complex than a simple addition, right? But it essentially uses these base properties to ensure that you get a manageable result. But it doesn't work with every single cryptosystem. We have to use ones with homomorphic properties or we have to use special protocols like secret sharing. So you can't just do it with, let's say, RSA or something like this.
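Here is a minimal Python sketch of additive secret sharing, one of the protocols she mentions, showing the encrypted-four-plus-encrypted-eight example. It is illustrative only (toy prime, no networking, no protections against misbehaving parties) and not code from the book.

```python
import secrets

PRIME = 2**61 - 1  # same toy field as in the sketch above

def share(value: int, n_parties: int = 3) -> list[int]:
    """Split a value into additive shares that sum to the value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares: list[int]) -> int:
    """Combine all shares to recover the hidden value."""
    return sum(shares) % PRIME

# Each individual share looks random and reveals nothing on its own...
shares_of_4 = share(4)
shares_of_8 = share(8)

# ...but each party can add the shares it holds locally, and reconstructing
# those local sums yields the sum of the original values.
shares_of_sum = [(a + b) % PRIME for a, b in zip(shares_of_4, shares_of_8)]
print(reconstruct(shares_of_sum))  # 12
```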
Birgitta: And you were talking about computing with numbers, so I'm guessing this is all based on data that has already been transformed into numbers, into features for example, and so on. Yeah?
Katharine: Absolutely. Yeah. That's an important point.
Birgitta: So there are no names in there anymore, or, I don't know, let's say colors of clothes or stuff like that; it's all been turned into numerical data. Yeah?
Katharine: Yeah, absolutely. So you need to do the same encoding that you would expect like for a typical data science problem or machine learning problem where you take, let's say, all of the colors you can represent and you decide red is one, and blue is two, and so on and so forth. And with these categorical structures, you can then still aggregate in encrypted space and so forth.
And there are a few code examples in that chapter of the book, obviously, and some notebooks that go through some of these basics of how fields function and how the basic crypto math works, in case people want to poke around and give it a try. It's pretty cool, when you start reading the mathematics of crypto, how much it actually gives us, and how much thought has gone into the ways we decide to obfuscate data when we encrypt it and why that works.
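Tying the two points together, here is a small Python sketch of encoding a categorical field as numbers and then counting categories with the additive secret-sharing idea from the sketch above. The categories and records are made up, and this is an illustration of the concept rather than anything from the book's notebooks.

```python
import secrets

PRIME = 2**61 - 1  # toy field, as in the earlier sketches

def share(value, n_parties=3):
    """Split a value into additive shares that sum to the value mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    return shares + [(value - sum(shares)) % PRIME]

def reconstruct(shares):
    return sum(shares) % PRIME

COLORS = ["red", "blue", "green"]  # hypothetical category list agreed up front
records = ["red", "blue", "red"]   # one party's hypothetical raw records

def one_hot(color):
    """Encode a category as 0/1 indicators so counts can be summed in shared space."""
    return [1 if color == c else 0 for c in COLORS]

counts = {}
for i, color in enumerate(COLORS):
    # Split each record's indicator for this color into shares...
    shared_indicators = [share(one_hot(record)[i]) for record in records]
    # ...let each party sum the shares it holds, then combine only those local sums.
    local_sums = [sum(party_shares) % PRIME for party_shares in zip(*shared_indicators)]
    counts[color] = reconstruct(local_sums)

print(counts)  # {'red': 2, 'blue': 1, 'green': 0}
```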
Rebecca: Well, Katharine, what's your favorite message in the book or what's the one thing that you would want to get across to people about your philosophy on data privacy, privacy engineering?
Katharine: Yeah. I mean, I think there are a few messages that are really core. One of them we already talked about a little bit, and Rebecca, I really liked you bringing this up: if privacy is in direct contention with what you're trying to build, spend some time pondering that. One of the sentences I have in the book that I liked a lot was “some models should never be built.” If we know that we're building a model that will harm people, if we know that we're building a model that's going to directly change or impact somebody's life in a severely negative way through no wrongdoing of their own, other than living in the wrong place or being the wrong person, then this is obviously something we need to question.
But a second one that I was excited about in this conversation, too, is how do we empower people with their data, how do we work better collectively and communally with data with one another, and how can we maybe even change the way the data landscape looks? If we were to really offer privacy-respecting alternatives, could we build really cool GPT models that have fewer toxic problems? Could we change the way that people decide to share data with one another? And this, I think, is something I'm really excited to see in the future.
Birgitta: So much maths magic these days in the industry...
Katharine: Indeed.
Rebecca: Well, thank you so much, Katharine, for joining us today and for writing the book. I think it's going to have an impact on lots of people who want to understand a bit more about how we can make our systems more privacy-aware, more privacy-forward, and therefore more responsible. So thank you, Katharine.
Katharine: Thank you, Rebecca. Thank you, Birgitta.
Rebecca: And thank you, Birgitta, as well for joining me. And I hope to see, or to have, everybody on the next edition of the Thoughtworks Technology Podcast.
Rebecca: Thank you.
Katharine: Thank you.
[MUSIC PLAYING]