Top Privacy Concerns in Adopting GenAI & Practical Tips to Mitigate Them

ChatGPT is powerful but your data is sensitive & critical.

“Despite the enthusiasm, enterprises are slow to adopt commercial LLMs, like GPT provided by OpenAI, as they share several concerns. In fact, less than a quarter of surveyed companies are comfortable using commercial LLMs in production. At a high level, data privacy concerns top the list. In our discussions, nearly 40% of companies voiced concerns about sharing proprietary or sensitive data with LLM vendors.”

Source: a survey conducted by Predibase of over 150 CXOs and leads adopting GenAI in their organisations.

From my discussions with multiple heads and data scientists across companies, one approach that everyone mentioned is to fine-tune an open-source model such as Llama 2 and host it on private infrastructure. This can help avoid sharing sensitive data with public vendors such as OpenAI (ChatGPT), Mistral, or Anthropic.

However, if a malicious actor gains access to your model weights, they could extract your organisation’s sensitive data from the model.

Your Generative AI model is an asset. Treat it like one.

Practical Tips & Necessary Guardrails to Mitigate Privacy Concerns

Zero-Data-Loss and Data Anonymisation

Personal, financial, health-related, or otherwise sensitive data must not be fed into:

Prompts
ChatGPT, OpenAI/Public LLM Endpoints
Internal LLM Models
APIs & 3rd Party Systems

Ensure data is anonymised to protect individual identifiers such as name, age, email, phone number, identity numbers, health information, and credit card details (number, expiry date, CVV), as well as organisation-sensitive information such as revenue numbers, business strategies, financials, and brand values.

Techniques such as data masking can prevent the disclosure of personal information.
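As a rough illustration of data masking, the sketch below replaces a few kinds of personal data with typed placeholders before the text reaches a prompt or an API. The patterns, the placeholder labels, and the function name mask_text are assumptions made for this example; they are deliberately simple and would miss many real-world formats.

```python
import re

# Illustrative patterns only (assumptions for this sketch); production
# systems need broader, locale-aware detection.
# Order matters: cards are masked before phones so that long digit runs
# are not partially consumed by the phone pattern.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "PHONE": re.compile(r"\+?\d[\d -]{7,14}\d"),
}


def mask_text(text: str) -> str:
    """Replace each detected value with a typed placeholder such as <EMAIL>."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text


if __name__ == "__main__":
    print(mask_text("Contact jane.doe@example.com or +65 9123 4567."))
```

Typed placeholders (rather than blanking the value) keep the sentence readable for the model while removing the identifying value itself.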

Synthetic Data

It is possible to extract training data, including sensitive and confidential information, from a pre-trained model using simple attack vectors.

Use synthetic data (pseudo-anonymisation) as a replacement for real sensitive data when fine-tuning an LLM. This way, even if your training data is extracted, it does not leak confidential information.

Ensure that the synthetic replacement data is contextually relevant and tailored to your business context (e.g. if you sell products in Singapore, use products and brands relevant to that region).
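One possible way to do this replacement is sketched below: each real value is mapped deterministically to a synthetic value from a region-relevant pool, so the same customer always becomes the same fake customer across the fine-tuning set. The pools, field names, salt, and function names are all hypothetical, chosen only for illustration.

```python
import hashlib

# Hypothetical pools of region-relevant synthetic values (assumptions).
SYNTHETIC_NAMES = ["Tan Wei Ling", "Arun Kumar", "Nur Aisyah", "Lim Jia Hui"]
SYNTHETIC_BRANDS = ["MerlionMart", "OrchidFresh", "KopiCo"]


def pick(pool: list, real_value: str, salt: str) -> str:
    """Deterministically map a real value to one synthetic value in the pool."""
    digest = hashlib.sha256((salt + real_value).encode("utf-8")).digest()
    return pool[int.from_bytes(digest[:4], "big") % len(pool)]


def pseudonymise(record: dict, salt: str = "keep-this-secret") -> dict:
    """Return a copy of the record with sensitive fields swapped for synthetic ones."""
    out = dict(record)  # do not mutate the caller's record
    out["customer"] = pick(SYNTHETIC_NAMES, record["customer"], salt)
    out["brand"] = pick(SYNTHETIC_BRANDS, record["brand"], salt)
    return out


if __name__ == "__main__":
    print(pseudonymise({"customer": "Real Person", "brand": "RealBrand", "basket_size": 3}))
```

With pools this small, different real values will collide on the same synthetic value; a real pipeline would likely use much larger pools or a generator, and would keep the salt secret so the mapping cannot be recomputed by an outsider.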

Data Moderation with Confidential Term Detection

Every business has a set of key terms that are sensitive and proprietary to the organisation. These could relate to your marketing strategy, revenue numbers, or high-premium customer segments.

Your chat prompts and model endpoints need a moderation layer that detects these sensitive terms and blocks them through pre-defined policies.

This privacy layer provides single-pass identification of sensitive and personal data, redaction, and replacement with fake data that is contextually relevant and coherent. Responses from the LLM must also be moderated, for example to detect malicious code.
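A minimal sketch of such a policy-driven moderation step might look like the following. The Policy class, the example terms ("Project Falcon", "Q3 revenue"), and the two actions are assumptions for illustration, not a description of any particular product.

```python
import re
from dataclasses import dataclass


@dataclass
class Policy:
    term: str    # confidential term to detect (case-insensitive)
    action: str  # "block" rejects the prompt, "redact" masks the term


# Hypothetical policies for this sketch.
POLICIES = [
    Policy("Project Falcon", "block"),
    Policy("Q3 revenue", "redact"),
]


def moderate(prompt: str, policies=POLICIES):
    """Return (allowed, text). Any matching 'block' policy rejects the prompt."""
    text = prompt
    for policy in policies:
        pattern = re.compile(re.escape(policy.term), re.IGNORECASE)
        # Detection runs against the original prompt so that an earlier
        # redaction cannot hide a term that should block.
        if pattern.search(prompt):
            if policy.action == "block":
                return False, ""
            text = pattern.sub("[REDACTED]", text)
    return True, text


if __name__ == "__main__":
    print(moderate("Summarise our Q3 revenue drivers"))
```

Exact-term matching is the simplest possible detector; it will not catch paraphrases or misspellings, which is where a classifier or embedding-based check would typically be layered on top.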

Data & Model Governance

Implement strict access controls through a governance framework to limit who can input data into the GenAI system and who can access the outputs. Ensuring that only authorised personnel have access can significantly reduce the risk of data breaches.

The moderation layer can also incorporate your organisation’s authorisations and privileges for accessing resources.
Treat your model, and the data that goes in and out of it, as another set of protected resources.
For example, your HR policies may be accessible only to a certain grade and above: compensation details for a grade like ‘G6’ cannot be made available to any grade below ‘G5’.
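The grade rule above could be expressed as a simple check that runs before a document is placed in a prompt or returned in a response. This is only a sketch: the grade ladder, the assumption that a higher number means a more senior grade, and the resource names are all hypothetical.

```python
# Assumption for this sketch: a higher number means a more senior grade.
GRADE_RANK = {"G1": 1, "G2": 2, "G3": 3, "G4": 4, "G5": 5, "G6": 6}

# Hypothetical resources mapped to the minimum grade allowed to read them.
RESOURCE_MIN_GRADE = {
    "hr_policy_general": "G1",
    "compensation_G6": "G5",
}


def can_access(user_grade: str, resource: str) -> bool:
    """Allow access only if the user's grade meets the resource's minimum grade."""
    required = RESOURCE_MIN_GRADE.get(resource)
    if required is None:
        return False  # deny by default for unknown resources
    # Unknown grades rank as 0, so they are denied everything.
    return GRADE_RANK.get(user_grade, 0) >= GRADE_RANK[required]


if __name__ == "__main__":
    print(can_access("G4", "compensation_G6"))
    print(can_access("G5", "compensation_G6"))
```

Denying by default for unknown resources and unknown grades means a missing policy entry fails closed rather than leaking data.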

Privacy-by-Design

Adopt a privacy-by-design approach at every stage of development, from initial design and model design through pre-training or fine-tuning to deployment, inference, and general availability (GA), so that privacy protection is baked into the technology.

Centralised Inventory and Catalog of GenAI Implementations

Have a repository of all GenAI implementations that tracks:

the models and their checkpoints
datasets used
productionised versions and their purposes (explainability)

This repository and catalog listing improves transparency & explainability of your models & their usage across the organisation.
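As a sketch of what one catalog entry could record, the example below keeps the model, its checkpoint, the datasets used, and the production version with its purpose. The class names, field names, and query methods are assumptions for illustration; a real inventory would more likely live in a database or a model registry.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelEntry:
    name: str
    checkpoint: str
    datasets: list = field(default_factory=list)
    production_version: Optional[str] = None  # None means not in production
    purpose: str = ""


class Catalog:
    """In-memory inventory keyed by (model name, checkpoint)."""

    def __init__(self) -> None:
        self._entries: dict = {}

    def register(self, entry: ModelEntry) -> None:
        key = (entry.name, entry.checkpoint)
        if key in self._entries:
            raise ValueError(f"duplicate entry: {key}")
        self._entries[key] = entry

    def in_production(self) -> list:
        """Entries that have a production version assigned."""
        return [e for e in self._entries.values() if e.production_version]

    def using_dataset(self, dataset: str) -> list:
        """Entries trained or fine-tuned on the given dataset."""
        return [e for e in self._entries.values() if dataset in e.datasets]
```

A query like using_dataset is what makes the catalog useful during an incident: if a dataset turns out to contain sensitive records, you can list every checkpoint that was exposed to it.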

Regular Audits and Compliance Checks

Conduct regular audits of your GenAI systems to ensure they comply with data protection laws such as GDPR, CCPA, DPDPA or any other relevant legislation.

This includes reviewing data handling practices and the model's outputs for any potential privacy violations.

Transparency and Consent

Be transparent with your stakeholders about your use of GenAI technologies and the data they process, and be able to explain both.

Obtain explicit consent from individuals whose data may be used, clearly explaining how their data will be handled and for what purposes.
Keep your “Data Principals” informed of the data used. These are the customers and users of your services.
Keep your Compliance & Risk Officer in the loop on all sources of information, including the data that you train the model on.

We are building a customisable privacy layer with the necessary guardrails and the policy-based, configurable detection and moderation capabilities needed to secure business adoption of GenAI.

Talk to our product

