Security and Privacy: Closed Source vs Open Source Battle

Daniel Hoyos
Blue Orange Digital
Jan 25, 2024
DALL-E generated image

The integration of AI across industries has raised serious concerns about privacy and security. The adoption of Generative AI solutions has stoked fears of insecure handling of private enterprise data, data leaks through third-party APIs, and the potential for exploiting an LLM to reveal critical information or make it behave in unintended ways. Notable events highlighting these issues include Samsung’s reported internal data leaks via ChatGPT, Apple restricting internal use of external AI tools, and the ChatGPT data breach that exposed some users’ information. The AI privacy paradox represents one of the most significant challenges of our time.

Regulators have taken notice. President Biden has issued sweeping artificial intelligence directives targeting safety, security, privacy, and trust, covering a wide range of issues for private and public entities, both domestically and internationally. Companies that develop AI systems have an important role to play here: they engage with regulators to share how their AI systems work, how they protect personal data, and the lessons they have learned while building privacy, security, and responsible AI governance programs.

The challenges AI poses to data privacy and security are substantial, but not insurmountable. As AI continues to evolve, so must our strategies for protecting and securing data. Collaboration between AI developers, cybersecurity experts, and policymakers is vital to ensure a future where AI benefits society without compromising privacy or security.

Closed Source during 2023

DALL-E generated image using the “closed source” concept

Although OpenAI had never been explicit about how data traveling through its API was used for training, in early 2023 the company clarified its stance, openly stating that it would not use customer data to train its models by default. Thereafter, building on Microsoft’s investment in OpenAI, Microsoft released the Azure OpenAI Service, which promised not only full access to Azure’s high-scaling capabilities but also a complete set of privacy and security assurances grounded in Microsoft’s Responsible AI Principles, turning Azure into a very appealing cloud option for companies that wanted to leverage OpenAI’s GPT LLMs. Moreover, in August 2023, OpenAI introduced its ChatGPT Enterprise offering. These steps have helped enterprises recover a measure of trust in OpenAI, and little by little they have started integrating these offerings into their workflows under that security and privacy umbrella.

Another player on the closed-source scene is Anthropic. The company appears to have prioritized data privacy and security from the start, taking a holistic approach that weighs worldwide privacy laws and regulations alongside customer needs. Anthropic has established a Privacy Policy, a Data Processing Addendum, and a Responsible Disclosure Policy to explain how it handles personal data and to provide clear guidelines for security researchers conducting vulnerability assessments. In addition, Anthropic has shared recommendations for government regulatory approaches to encourage the adoption of cybersecurity best practices for advanced AI models, emphasizing that securing these systems is a critical priority. The company’s efforts to distinguish itself on ethical grounds, along with partnerships with major organizations such as Amazon, have drawn attention and sparked discussion about the future of AI development and regulation.

Open Source during 2023

DALL-E generated image using the “open source” concept

On the open-source side, the community’s freely available LLMs on the Hugging Face platform, together with its capable tooling, made great strides during 2023, demonstrating that matching, or nearly matching, GPT’s level of performance on specific tasks is within reach. That makes these models appealing for the security and privacy benefits they intrinsically provide: your data and your LLM stay on your own platform, and you fully own the model. Above all, the ability to deploy within your enterprise’s own platforms or cloud, eliminating the need to send data to third-party APIs, has led many companies and individuals to invest in learning how to manage, train, fine-tune, and deploy custom LLMs.

The event that sparked the open-source wave of technological advancements in 2023 was Meta’s release of Llama in February 2023. Although this LLM was not released for commercial use, researchers could use it freely. That freedom sparked the development of powerful fine-tuned versions that closely matched GPT’s performance, such as Alpaca by Stanford in March 2023, Vicuna by a joint effort of UC Berkeley, CMU, Stanford, UC San Diego, and MBZUAI in March 2023, and Koala by UC Berkeley in April 2023.

The impact of these releases was threefold. First, they made fine-tuning far more affordable: for under $300 USD, the Vicuna team reported reaching roughly 90% of ChatGPT’s quality (as judged by GPT-4) with their dataset. Second, a good conversational dataset is key to fine-tuning, but creating one with human annotators is complicated and expensive, so techniques emerged that use GPT-4 to synthetically create, augment, and improve datasets specifically for fine-tuning. Third, these releases provided a strong foundational LLM to use as a basis for producing powerful fine-tuned LLMs with custom data.

Each of these outstanding results sparked new lines of research of its own. First, the open-source community made remarkable contributions to radically reducing the cost of fine-tuning an LLM. The advent of parameter-efficient fine-tuning via the PEFT library and its underlying approaches (LoRA, Prefix-Tuning, P-Tuning, Prompt Tuning; see the PEFT GitHub repository for the full list) made the community realize that you don’t need to adjust all of a model’s parameters to fine-tune it: a carefully selected subset will do the trick. PEFT reduced the time it took to fine-tune models in the cloud, and therefore the cost. Furthermore, improvements in quantization shrank the memory footprint of an LLM loaded on a GPU. LLM weights have usually needed 16 bits to run with good performance, but advances in this area showed that you can load a model in 8 bits, and now in 4 bits with a technique called QLoRA. In short, these advancements demonstrate that only a carefully selected set of weights still needs to be kept at 16 bits to avoid degrading performance, while the vast majority of the remaining weights can be stored in as few as 4 bits. These quantization steps made it possible to load big LLMs onto consumer hardware, something never before possible, spreading experimentation capability to anybody with even a gaming machine, especially for models under 13B parameters. By letting an LLM fit in roughly a quarter of the previously required GPU memory, cloud costs plummeted and experimentation became nearly free for everyone, thanks to tools such as Google Colab and its free GPU notebooks.
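To make this concrete, here is a minimal sketch of a QLoRA-style setup, combining 4-bit loading with LoRA adapters, using Hugging Face’s transformers, peft, and bitsandbytes libraries. The model name and the LoRA hyperparameters are illustrative assumptions, not values taken from any of the papers mentioned above:

```python
# Minimal QLoRA-style setup sketch (requires: transformers, peft, bitsandbytes).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"  # illustrative; any causal LM works

# Load the base model with 4-bit NF4 weights; compute stays in 16 bits.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach small trainable LoRA adapters; the quantized base weights stay frozen.
lora_config = LoraConfig(
    r=16,                                  # adapter rank (illustrative)
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```

Only the small adapter matrices are trained while the 4-bit base weights stay frozen, which is exactly what lets a 7B-parameter model fine-tune on a single consumer GPU.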

Second, very interesting advances in synthetic data generation and augmentation surfaced. Artificially augmenting data has historically yielded mixed results across the ML and AI fields, as is widely known. However, these methods surged in popularity after ChatGPT’s release, because there was finally a capable AI teacher able to generate meaningful and unique data samples for custom LLM training. Alpaca’s creators used a method called Self-Instruct to create a dataset of close to 52k samples and used it in their fine-tuning. The context here is that these newer LLMs were discovered to perform well across a wide variety of tasks, whereas earlier LLMs were trained as single-task specialists and could not extrapolate their knowledge to other tasks. A good instruction-following dataset, though, proved that you could fine-tune an LLM into a multi-tasker; the only caveat is that you need a set of good, varied instructions to teach it how to perform and respond. Hence, the Self-Instruct technique starts with a small pool of human-annotated instructions and, through an iterative process of instruction generation, revision, scoring, and selection, synthetically expands the pool into a meaningfully sized dataset, as sketched below. Throughout 2023, other techniques also appeared, such as Reinforced Self-Training and Scaling Self-Training. These ideas can be summed up by the concept of “distillation,” which uses a larger LLM as a teacher for a smaller LLM, making the latter learn from the former. Notice that all of the techniques mentioned above use GPT as the teacher; so, in a way, OpenAI inadvertently became a platform for open-source development.
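As an illustration, here is a heavily simplified sketch of that generate-and-filter loop. The prompt wording, the teacher model name, and the similarity filter are assumptions for brevity (the actual Self-Instruct pipeline uses ROUGE-L similarity and additional quality heuristics rather than difflib):

```python
# Simplified Self-Instruct-style loop: expand a seed pool of instructions
# with a teacher model, keeping only sufficiently novel candidates.
import random
from difflib import SequenceMatcher
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

seed_pool = [
    "Summarize the following article in two sentences.",
    "Translate this paragraph into French.",
    "Write a polite email declining a meeting invitation.",
]

def generate_candidates(pool, k=3):
    """Ask the teacher model for new instructions in the style of k examples."""
    examples = "\n".join(f"- {inst}" for inst in random.sample(pool, k))
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Here are example task instructions:\n{examples}\n"
                       "Write 5 new, diverse task instructions, one per line.",
        }],
    )
    text = response.choices[0].message.content
    return [line.strip("- ").strip() for line in text.splitlines() if line.strip()]

def is_novel(candidate, pool, threshold=0.7):
    """Reject candidates too similar to anything already kept."""
    return all(SequenceMatcher(None, candidate, kept).ratio() < threshold
               for kept in pool)

# Iterate; the real pipeline ran this until the pool reached ~52k samples.
for _ in range(10):
    for candidate in generate_candidates(seed_pool):
        if is_novel(candidate, seed_pool):
            seed_pool.append(candidate)
```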

Finally, access to a strong foundational LLM proved to be one of the most important gatekeepers for quality open-source LLMs. Meta’s Llama release let researchers discover what was possible: reaching GPT levels of performance with a custom LLM. So when Meta released Llama 2 in July 2023, free of charge for research AND commercial use, it took the world by storm. Many companies joined the fray in the LLM race, and an interesting shift happened in the second half of 2023: new techniques, innovations, and tools for inference optimization and deployment, making LLMs fast and viable in production environments, started popping up everywhere. Techniques like FlashAttention, packages such as vLLM for blazing-fast serving and inference through optimized algorithms, and open-source solutions for cheap multi-cloud deployments such as SkyPilot have made it easier than ever for businesses to run open-source LLMs in production.
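For a sense of how little code self-hosted serving now takes, here is a minimal sketch using vLLM’s offline inference API; the model name and prompt are illustrative, and the weights are assumed to be available locally or via the Hugging Face Hub:

```python
# Minimal vLLM offline inference sketch (requires: vllm and a GPU).
from vllm import LLM, SamplingParams

# Load an open-source chat model; vLLM handles batching and paged attention.
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")

sampling = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(
    ["Summarize our data-retention policy in one paragraph."], sampling
)
for out in outputs:
    print(out.outputs[0].text)
```

Because the model runs entirely inside your own environment, no prompt or response ever leaves your infrastructure, which is precisely the security and privacy argument for open-source deployment.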

The ability to host your own LLM, fine-tune it on your data for specific, narrow use cases, keep security and privacy within your environment, and serve it efficiently is transforming how businesses perceive the technology, making it more appealing than ever. Moreover, watching the open-source community close in on GPT’s level of performance day by day has eased the fear that choosing security and privacy means sacrificing quality and performance. With open-source LLMs, you can have the best of both worlds. As evidence, the open-source Mixtral 8x7B sits high on the leaderboards, behind the GPT-4 and Claude 2 offerings but ahead of Gemini Pro and GPT-3.5.

Conclusion

Reflecting on the journey AI has taken through 2023, it becomes clear that security and privacy have been at the forefront of this evolution, more important than ever before. The year was marked by a significant industry shift from a preference for closed-source models toward growing interest in open-source Large Language Models (LLMs), a trend anticipated to keep gaining momentum in 2024.

The proactive steps taken by companies like OpenAI and Anthropic in prioritizing user privacy and data security have been instrumental in rebuilding enterprise trust. OpenAI’s commitment to responsible AI usage, reinforced by offerings like the Azure OpenAI Service and ChatGPT Enterprise, sets a new standard in the industry. Similarly, Anthropic’s holistic approach to data privacy and security, aligned with global regulations, serves as a benchmark for responsible AI development.

Concurrently, the open-source community, led by a giant, Hugging Face, has demonstrated that owning your data and your LLMs does not require compromising on performance or quality. Advances in model fine-tuning, cost reduction, and synthetic data generation, exemplified by techniques like PEFT and QLoRA, have democratized AI, making it more accessible and reducing reliance on third-party APIs.

Meta’s Llama and Llama 2 have proven to be additional game-changers, paving the way for a new era in open-source AI development. These releases have not only bolstered the open-source community but also ignited a competitive spirit across the industry. This competition, far from being a zero-sum game, is propelling the industry forward, with each advancement benefiting the collective ecosystem.

Looking ahead to 2024, the challenges of ensuring data privacy, maintaining ethical AI practices, and navigating the complex landscape of global regulations are substantial. Yet they present unparalleled opportunities for innovation, collaboration, and growth. Joint work by AI developers, businesses, policymakers, and researchers, embracing these challenges, will enable the ecosystem to harness the full potential of AI technologies, ensuring they serve society’s needs without compromising privacy and security.

The AI landscape is not static, but dynamically evolving. As the performance gap between closed-source and open-source LLMs continues to be closed by the relentless innovation of the community, the future of AI looks promising. It’s a future where technology serves business needs while also upholding the highest standards of ethical responsibility and societal benefit.
