Workshop Event Summary: Embrace the Future with AI

Virtuoasis
7 min read · Jan 8, 2024


I organized a technology workshop last Saturday (January 6, 2024) in Pudong, Shanghai. The topic was: “Embrace the Future with AI: Discover How Large Language Models Can Empower Your Personal and Professional Life! 🌟”

This technology workshop covered the fundamental concepts of machine learning and deep learning, a summary of the large language models (LLMs) of 2023, and the current technological limitations along with strategies to address them. Starting from the perspective of AI as the engine of modern society, we explored how it learns and improves from experience, including supervised learning, unsupervised learning, reinforcement learning, and deep learning.

For the LLM pipeline, we delved into the full process from raw data collection to data pre-processing, constructing training data, training embeddings or representations, the pre-training phase, fine-tuning for specific uses, and finally deployment and continuous learning. Each stage is crucial to the performance of the model and to ensuring it meets specific needs and business applications.

The workshop also discussed the limitations of LLM technology, such as performance accuracy, domain knowledge, costs, hardware, storage, and inference speed, and proposed solutions like prompting, fine-tuning, and Retrieval-Augmented Generation (RAG) to tackle these challenges.

Lastly, we examined the ethical challenges facing AI today, highlighting the importance of issues such as privacy, bias, automation, accountability, transparency, and potential misuse.

The following introduces the main content of this event.

Machine Learning: The Engine of AI

Machine Learning is a subset of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.

Types of Machine Learning:

  • Supervised Learning: The model is trained on labeled data. It learns to map inputs to outputs based on input-output pairs.
  • Unsupervised Learning: The model works with unlabeled data and learns to find patterns and relationships within the data.
  • Reinforcement Learning: The model learns to make decisions by performing actions in an environment to achieve a goal.

Key Concepts:

  • Feature: An individual measurable property or characteristic of a phenomenon being observed.
  • Algorithm: A process or set of rules to be followed in calculations or other problem-solving operations.
  • Model: A representation (a set of patterns) learned from data.
  • Training: The process of teaching a model to make predictions or decisions, typically by learning patterns from data.
  • Overfitting and Underfitting: Overfitting is when a model learns the detail and noise in the training data to the extent that it negatively impacts the performance of the model on new data. Underfitting is when a model cannot capture the underlying trend of the data.
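
To make these ideas concrete, here is a minimal supervised-learning sketch using scikit-learn; the iris dataset and decision-tree model are illustrative choices rather than material from the event, and the train/validation comparison hints at how overfitting and underfitting show up in practice.

```python
# Minimal supervised-learning sketch: train on labeled data, then compare
# training vs. validation accuracy to spot underfitting or overfitting.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # features and labels (illustrative dataset)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

for depth in (1, 3, None):  # very shallow (underfit), moderate, unrestricted (may overfit)
    model = DecisionTreeClassifier(max_depth=depth, random_state=42)
    model.fit(X_train, y_train)  # "training": learning patterns from data
    print(f"max_depth={depth}: "
          f"train={model.score(X_train, y_train):.2f}, "
          f"val={model.score(X_val, y_val):.2f}")
```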

Deep Learning: The Brain of AI

Deep Learning is a subset of machine learning where neural networks — algorithms inspired by the human brain — learn from large amounts of data.

Neural Networks:

  • Neurons: Basic units of neural networks, similar to nerve cells in the human brain.
  • Layers: Neural networks consist of layers of neurons, typically an input layer, one or more hidden layers, and an output layer.
  • Deep Neural Networks: Neural networks with multiple hidden layers.

Key Concepts:

  • Activation Function: A function in a neural network that helps determine the output of a node (or neuron).
  • Backpropagation: A method used in training neural networks, where the error is propagated back through the network to adjust the weights.
  • Convolutional Neural Networks (CNNs): A type of deep neural network used mainly to process pixel data, popular in image recognition tasks.
  • Recurrent Neural Networks (RNNs): A type of neural network where connections between nodes form a directed graph along a temporal sequence, allowing it to exhibit temporal dynamic behavior.
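
To illustrate layers, activation functions, and backpropagation together, here is a small sketch of a feed-forward network in PyTorch; the architecture and the random data are assumptions for demonstration only, not something prescribed at the workshop.

```python
# A tiny neural network: input layer -> hidden layer (ReLU activation) -> output layer.
# One training step shows backpropagation adjusting the weights.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(4, 16),   # input layer -> hidden layer
    nn.ReLU(),          # activation function
    nn.Linear(16, 3),   # hidden layer -> output layer (3 classes)
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 4)          # a batch of 8 examples with 4 features each
y = torch.randint(0, 3, (8,))  # random class labels, for illustration only

logits = model(x)              # forward pass
loss = loss_fn(logits, y)
loss.backward()                # backpropagation: error propagated back through the network
optimizer.step()               # weights adjusted using the computed gradients
print(f"loss after one step: {loss.item():.3f}")
```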

Main LLMs of 2023

Pipeline of Large Language Model Techniques

1. Raw Data Collection

  • Overview: This stage involves gathering a vast and diverse dataset, which is the foundation of any LLM. It includes text from books, articles, websites, and other digital media.
  • Challenges and Considerations: Ensuring data diversity to avoid biases, and adhering to privacy and ethical guidelines during collection.
  • Example: Assembling a dataset from diverse sources for a multilingual LLM. This includes extracting text from English scientific journals, Chinese news websites, Spanish literature, and Hindi social media posts. Special emphasis is placed on ensuring that a wide range of topics and styles (academic, colloquial, formal) is represented; a toy sketch of this stage follows below.
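
As a toy illustration of this stage (the directory layout and language tags below are assumptions, not details from the workshop), the sketch gathers raw text files and records where each document came from so coverage can be audited before training.

```python
# Toy corpus-collection sketch: gather raw text files and keep provenance metadata
# so the diversity of sources and languages can be checked.
import json
from pathlib import Path

# Hypothetical layout: data/raw/<language>/<source>/*.txt
RAW_DIR = Path("data/raw")
corpus = []

for path in RAW_DIR.rglob("*.txt"):
    language, source = path.parts[-3], path.parts[-2]
    corpus.append({
        "text": path.read_text(encoding="utf-8"),
        "language": language,  # e.g. "en", "zh", "es", "hi"
        "source": source,      # e.g. "journals", "news", "social_media"
    })

# Quick diversity check: how many documents per language?
counts = {}
for doc in corpus:
    counts[doc["language"]] = counts.get(doc["language"], 0) + 1
print(json.dumps(counts, indent=2))
```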

2. Data Pre-processing into Structured Data

  • Process: Transforming raw data into a structured format suitable for machine learning models. This involves cleaning the data, removing irrelevant or sensitive information, and organizing it into a coherent structure.
  • Techniques Used: Tokenization (breaking text into words, phrases, symbols), normalization (standardizing text format), and part-of-speech tagging.
  • Example: Processing a dataset containing mixed-language news articles. The pre-processing includes removing non-textual elements like images and ads, normalizing different date formats to a standard form, and resolving inconsistencies in the use of punctuation and capitalization.
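
A minimal pre-processing sketch in Python covering the normalization and tokenization steps above; the cleaning rules are deliberately simplified assumptions rather than a production pipeline.

```python
# Simplified pre-processing: strip non-textual noise, normalize, and tokenize.
import re
import unicodedata

def preprocess(raw_text: str) -> list[str]:
    text = unicodedata.normalize("NFKC", raw_text)  # normalize unicode forms
    text = re.sub(r"<[^>]+>", " ", text)            # drop leftover HTML tags (ads, images)
    text = text.lower()                             # standardize capitalization
    text = re.sub(r"\s+", " ", text).strip()        # collapse whitespace
    return re.findall(r"\w+|[^\w\s]", text)         # tokenize into words and punctuation

print(preprocess("<p>Breaking News:  AI workshop held on 2024.1.6!</p>"))
# ['breaking', 'news', ':', 'ai', 'workshop', 'held', 'on', '2024', '.', '1', '.', '6', '!']
```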

3. Constructing Training Data

  • Objective: Prepare the processed data for training by creating a dataset that the model can learn from. This includes labeling the data if needed and splitting it into training and validation sets.
  • Approach: Employing techniques like sequence labeling and text classification in the dataset preparation.
  • Example: For a sentiment analysis task, manually annotating a set of product reviews with labels such as “positive,” “negative,” or “neutral.” This involves reading each review and assigning an appropriate sentiment label, ensuring a balanced representation of various product types.
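
A short sketch of preparing labeled data and a train/validation split for the sentiment-analysis example; the reviews and labels below are invented for illustration.

```python
# Sketch: labeled sentiment data split into training and validation sets.
import random

labeled_reviews = [
    ("The battery life is fantastic.", "positive"),
    ("Stopped working after two days.", "negative"),
    ("Does the job, nothing special.", "neutral"),
    ("Absolutely love this keyboard!", "positive"),
    ("Arrived late and scratched.", "negative"),
]  # in practice, thousands of manually annotated examples

random.seed(42)
random.shuffle(labeled_reviews)

split = int(0.8 * len(labeled_reviews))  # e.g. 80% training / 20% validation
train_set, val_set = labeled_reviews[:split], labeled_reviews[split:]
print(f"{len(train_set)} training examples, {len(val_set)} validation examples")
```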

4. Training Embeddings or Representations

  • Methodology: Develop embeddings, which are vector representations of words or phrases. These embeddings capture the contextual meanings of words.
  • Key Concepts: Utilizing models like Word2Vec or GloVe for initial embeddings, and understanding the importance of context in these representations.
  • Example: Utilizing a large corpus of legal documents to train embeddings specific to legal terminology using the GloVe model. These embeddings capture the nuanced meanings of legal terms based on their usage in legal contexts.
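
As a hedged sketch of this step, the snippet below trains domain-specific embeddings with gensim's Word2Vec (the workshop example used GloVe; Word2Vec is substituted here because gensim ≥ 4 supports it directly, and the tiny legal corpus is a stand-in for millions of tokenized sentences).

```python
# Training word embeddings on a tiny, illustrative domain corpus with gensim Word2Vec.
from gensim.models import Word2Vec

# In practice: millions of tokenized sentences from legal documents.
sentences = [
    ["the", "defendant", "breached", "the", "contract"],
    ["the", "plaintiff", "seeks", "damages", "for", "breach", "of", "contract"],
    ["the", "court", "granted", "the", "injunction"],
]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)
vector = model.wv["contract"]  # 100-dimensional embedding for "contract"
print(vector.shape)            # (100,)
print(model.wv.most_similar("contract", topn=3))
```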

5. Pre-training Stage

  • Process: The LLM is trained on a large, generalized dataset. This stage involves learning the language’s structure, grammar, and common usage patterns.
  • Techniques: Utilizing unsupervised learning methods like masked language modeling (MLM) and next sentence prediction (NSP).
  • Example: Training a model like RoBERTa on a dataset comprising millions of English sentences sourced from books and websites. The training involves randomly masking 15% of words in each sentence and training the model to predict these masked words, thereby learning contextual relationships.
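
A rough sketch of the masked-language-modeling idea: randomly hide about 15% of tokens and record which positions the model must predict. The function below only prepares the masked inputs and is not tied to any particular library.

```python
# Toy masked-language-modeling data preparation: hide ~15% of tokens behind [MASK].
import random

def mask_tokens(tokens: list[str], mask_rate: float = 0.15):
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            masked.append("[MASK]")
            targets[i] = tok  # the model is trained to predict these originals
        else:
            masked.append(tok)
    return masked, targets

random.seed(1)
tokens = "the model learns contextual relationships between words".split()
masked, targets = mask_tokens(tokens)
print(masked)   # tokens with roughly 15% replaced by [MASK]
print(targets)  # {position: original token} pairs the pre-training loss is computed on
```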

6. Fine-tuning Stage for Specific Usage

  • Customization: Adapting the pre-trained model to specific tasks or domains. This involves further training the model on a smaller, task-specific dataset.
  • Applications: Tailoring the model for tasks like sentiment analysis, question-answering, text summarization, or specific industry applications.
  • Example: Adapting a pre-trained LLM for medical diagnosis assistance. The fine-tuning involves training the model on a dataset of patient interviews, medical histories, and diagnosis records, enabling the model to assist doctors by suggesting potential diagnoses based on patient information.
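
A hedged sketch of the fine-tuning stage using the Hugging Face transformers library; the checkpoint name, CSV files, and label count are placeholders, and a real medical-assistance project would substitute its own de-identified, domain-specific data.

```python
# Sketch: fine-tuning a pre-trained model for a 3-class classification task.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "bert-base-uncased"  # placeholder pre-trained checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)

# Placeholder dataset with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "val.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
)
trainer.train()
```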

7. Deployment and Continuous Learning

  • Implementation: Integrating the fine-tuned model into applications or services.
  • Ongoing Improvement: Continuously collecting feedback and data to refine and update the model, ensuring its relevance and accuracy over time.
  • Example: Deploying a fine-tuned LLM in a virtual assistant app for personalized learning. The app uses the model to generate educational content and quizzes tailored to the user’s learning style and progress. The model continuously learns from the user’s interactions and feedback, adapting the content over time for optimal learning.
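
One possible deployment pattern (an assumption, not necessarily how the app in the example is built) is to serve the fine-tuned model behind a small HTTP API and log user feedback for later retraining, sketched below with FastAPI and transformers; the model path and endpoints are hypothetical.

```python
# Sketch: serving a fine-tuned model behind a FastAPI endpoint and logging feedback
# so it can be folded back into future fine-tuning (continuous learning).
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()
classifier = pipeline("text-classification", model="./fine_tuned_model")  # placeholder path

class TextRequest(BaseModel):
    text: str

@app.post("/predict")
def predict(req: TextRequest):
    return classifier(req.text)[0]  # e.g. {"label": ..., "score": ...}

@app.post("/feedback")
def feedback(req: TextRequest, correct_label: str):
    # Append user corrections to a file; a scheduled job would merge these
    # back into the fine-tuning dataset.
    with open("feedback.log", "a", encoding="utf-8") as f:
        f.write(f"{correct_label}\t{req.text}\n")
    return {"status": "recorded"}
```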

Limitations

The current limitations of Large Language Models (LLMs) are multifaceted and can impact their performance and applicability in various domains. These limitations include:

  • Performance Accuracy: Ensuring that LLMs understand and generate accurate and relevant content remains a challenge, particularly in complex or nuanced situations.
  • Domain Knowledge: LLMs often struggle with specialized knowledge areas, requiring additional fine-tuning and domain-specific data to perform well.
  • Cost: Developing and maintaining LLMs can be costly, involving significant investment in computational resources and data management.
  • Hardware: The hardware required to train and run LLMs can be expensive and energy-intensive, posing scalability issues.
  • Storage: LLMs require substantial storage for the vast amounts of training data and model parameters, which can be a constraint for many organizations.
  • Inference Speed: LLMs can be slow to process inputs and produce outputs, which limits their use in real-time applications.

Potential Solutions

  • Prompt Engineering: Crafting inputs to the model (prompts) in a way that guides the LLM to generate more accurate and relevant responses. This can help in improving performance accuracy without the need for extensive retraining.
  • Fine-tuning: Adjusting a pre-trained model on a specific dataset, usually smaller and domain-specific, to enhance its understanding and performance in that particular area. This addresses issues with domain knowledge and can also improve inference speed for specialized tasks.
  • Retrieval-Augmented Generation (RAG): Combining the retrieval of information from databases or documents with the generative capabilities of LLMs. This approach can generate more accurate and context-relevant responses, especially when dealing with questions that require expert knowledge.
  • Cost Management: Implementing more cost-effective training procedures, such as using model distillation to create smaller, more efficient models that retain the performance characteristics of larger ones, can reduce financial barriers.
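
To make the RAG idea concrete, here is a minimal sketch that retrieves the most relevant passages with TF-IDF similarity and folds them into a prompt; the documents, the prompt wording, and the retrieval method are illustrative assumptions (production systems typically use dense embeddings and a vector store).

```python
# Minimal Retrieval-Augmented Generation sketch: retrieve relevant context,
# then build a grounded prompt for the language model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = [
    "Fine-tuning adapts a pre-trained model to a smaller, domain-specific dataset.",
    "Retrieval-Augmented Generation combines document retrieval with text generation.",
    "Model distillation trains a smaller model to mimic a larger one.",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)

def retrieve(question: str, top_k: int = 2) -> list[str]:
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, doc_vectors)[0]
    best = scores.argsort()[::-1][:top_k]          # indices of the most similar documents
    return [documents[i] for i in best]

question = "How can an LLM answer questions that need expert knowledge?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
print(prompt)  # this prompt would then be sent to the LLM of your choice
```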

Ethical Challenges

AI technology faces several ethical challenges that require careful consideration and management. These include:

  • Privacy: Safeguarding personal information used in training and operationalizing AI models, ensuring that privacy is not compromised.
  • Bias: AI systems can inherit or even amplify biases present in their training data, leading to unfair or discriminatory outcomes.
  • Accountability: Determining who is responsible for the decisions made by AI systems, especially when they lead to negative outcomes.
  • Transparency: Ensuring that AI algorithms and their decision-making processes are transparent and understandable to users and stakeholders.
  • Potential Misuse: There is a risk that AI technology could be used for harmful purposes, necessitating frameworks to prevent misuse and promote responsible usage.

In summary, the technology workshop not only provided a comprehensive understanding of AI and LLM technologies but also emphasized the importance of continuing to study the technology’s development and its societal impacts in depth.

To encourage more creative thinking, we cordially invite more people to join this discussion and collectively advance technological innovation and ethical progress.
