Difference between revisions of "NLP"

From UFRC
 
(34 intermediate revisions by 4 users not shown)
Line 1: Line 1:
[[Category:Software]][[Category:Machine Learning]]
+
[[Category:Help]]
{|<!--CONFIGURATION: REQUIRED-->
+
{|align=right
|{{#vardefine:app|nlp}}
+
  |__TOC__
|{{#vardefine:url|}}
+
  |}
<!--CONFIGURATION: OPTIONAL (|1}} means it's ON)-->
+
This page describes the collection of Natural language processing software on HiperGator. Natural Language Processing (NLP) is a part of artificial intelligence (AI) that helps computers understand and respond to human language. It's used in things like voice assistants, chatbots, and translation apps. NLP combines language rules with machine learning to help computers grasp not just words, but also the intent and feelings behind them. NLP is improving how AI works in different areas. For example, in healthcare, it helps analyze medical records to aid patient care. Research Computing can help with language modeling for knowledge exploration, measurement, classification, summarization, conversational AI, or other uses via [https://support.rc.ufl.edu/ support requests] or [https://www.rc.ufl.edu/get-started/purchase-allocation/training--consultation-rates/ consulting].
|{{#vardefine:conf|}}          <!--CONFIGURATION-->
 
|{{#vardefine:exe|}}            <!--ADDITIONAL INFO-->
 
|{{#vardefine:job|}}            <!--JOB SCRIPTS-->
 
|{{#vardefine:policy|}}        <!--POLICY-->
 
|{{#vardefine:testing|}}      <!--PROFILING-->
 
|{{#vardefine:faq|}}            <!--FAQ-->
 
|{{#vardefine:citation|}}      <!--CITATION-->
 
|{{#vardefine:installation|}} <!--INSTALLATION-->
 
|}
 
<!--BODY-->
 
==Description==
 
{{#if: {{#var: url}}|
 
{{App_Description|app={{#var:app}}|url={{#var:url}}|name={{#var:app}}}}|}}
 
  
This page describes the collection of natural language processing software on HiperGator. NLP is involved in many other fields of AI, such as image recognition. Research computing will help with language modeling for conversational AI, measurement, classification tasks, etc. via [https://www.rc.ufl.edu/help/support-requests/ support requests] or [https://www.rc.ufl.edu/consultation-purchase-request-consultation/ consulting]. NVIDIA [https://github.com/NVIDIA/Megatron-LM Megatron] and [https://github.com/NVIDIA/NeMo NeMo] are open-source software using transformer neural networks that can scale to multiple nodes of GPUs. See the directory /data/ai for more information.
+
==Environment Modules for NLP==
 +
 
 +
*'''[[Nemo]]:''' <code>module load nemo</code> will provide a singularity container environment with Python and Nvidia NeMo. NeMo has NLP task training, plus speech-to-text and text-to-speech models, and the option to apply your own pretrained Megatron language models.
 +
**Use the following command to list the available versions on HiPerGator-AI:
 +
**<pre>module spider nemo</pre>
 +
 
 +
 
 +
*'''[[Bionemo]]''': <code>module load bionemo</code> will launch a singularity container environment equipped with Python and Nvidia BioNeMo. BioNeMo specializes in biomedical NLP tasks, featuring advanced models for tasks like medical text analysis, drug interaction extraction, and patient information processing. It also allows for the integration of your own pretrained Megatron language models, enhancing its versatility in the biomedical field.
 +
**Use the command below to list the available versions on HiPerGator-AI.  
 +
**<pre>module spider bionemo</pre>
 +
 
 +
 
 +
*'''[[Llama]]''': <code>module load llama</code> provides a scalable library for fine-tuning Meta Llama models, enabling users to quickly get started with various use cases, including domain adaptation and building LLM-based applications. It also showcases how to run Meta Llama locally, in the cloud, and on-premises.  
 +
**Use the command below to list the available versions on HiPerGator-AI.  
 +
**<pre>module spider llama</pre>
 +
 
 +
 
 +
*'''[[Mistral AI]]''': <code>module load mistral</code> provides a set of tools for working with Mistral models. The first release contains tokenization. The tokenizers go beyond the usual text <-> tokens conversion, adding parsing of tools and structured conversations. The validation and normalization code used in the API is also included.
 +
**Use the command below to list the available versions on HiPerGator-AI.
 +
**<pre>module spider mistral</pre>
 +
 
 +
 
 +
*'''[[Gemma LLMs]]''': <code>module load gemma_llm</code> provides a set of tools for working with Google Gemma models. Gemma models are compatible with frameworks such as PyTorch, Keras-NLP, NVIDIA NeMo, and Hugging Face Transformers, which streamlines model lifecycle management, serialization, and performance optimization.
 +
**Use the command below to list the available versions on HiPerGator-AI.  
 +
**<pre>module spider gemma_llm</pre>
  
<!--Modules-->
+
 
==Environment Modules for NLP==
+
*'''[[Pytorch]] or [[TensorFlow]]:''' Use <code>module load pytorch</code> or <code>module load tensorflow</code> to load the corresponding framework. If neither the nlp environments nor these environments have the libraries you require, you may need to create a Conda environment. See [[Conda]] and [[Managing_Python_environments_and_Jupyter_kernels]] for more details.
*'''nlp:''' module load nlp will provide a Python environment with pytorch, torchtext, nltk, spacy, transformers, sentence-transformers, Flair, BERTopic for topic modeling, sentencepiece, RAPIDSai for data processing and machine learning algorithms, gensim, scikit-learn, and more.
+
**Use the following command to list the available versions on HiPerGator-AI:
 +
**<pre>module spider pytorch</pre> or <pre>module spider tensorflow</pre>
 +
 
 +
 
 +
*'''Transformers:''' Transformer packages are potent natural language processing tools that leverage transformer architecture, enabling models such as BERT and GPT to accurately process and generate text with a deep understanding of context. These packages offer thousands of pretrained models capable of handling various tasks across different modalities, including text, vision, and audio. The Transformer Python module is available in <code>nlp/1.3</code> and <code>llama/3</code>.
 +
 
 +
 
 +
*'''LangChain:''' LangChain is a framework designed to simplify the creation of applications using large language models. As a language model integration framework, LangChain's use-cases largely overlap with those of language models in general, including document analysis and summarization, chatbots, and code analysis. The LangChain Python module is available in <code>nlp/1.3</code> and <code>llama/3</code>.
  
  
*'''nemo:''' module load nemo will provide a singularity container environment with Python and Nvidia NeMo. NeMo has NLP task training, plus speech-to-text and text-to-speech models.  
+
*'''LlamaIndex:''' LlamaIndex is a simple, flexible data framework for connecting custom data sources to large language models (LLMs). The LlamaIndex Python module is available in <code>nlp/1.3</code> and <code>llama/3</code>.
  
  
*'''pytorch:''' Note, use module spider pytorch to list the version we have available. Beyond stock pytorch versions, we have a Nvidia pytorch singularity container with the requirements for Megatron-LM. Use module load ngc-pytorch to access this container, and you can run Megatron from source code.  
+
*'''TensorRT-LLM:''' NVIDIA TensorRT, an SDK for high-performance deep learning inference, includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applications. TensorRT-LLM is an open-source library that accelerates and optimizes inference performance of the latest large language models (LLMs) on the NVIDIA AI platform. The TensorRT-LLM Python module is available in <code>llama/3</code>.
  
  
*'''spark-nlp:''' See our [[Spark]] help doc to start a Spark cluster. Spark-nlp Python module is available in tensorflow/2.4.1.
+
*'''[[nlp]]:''' <code>module load nlp</code> provides a Python environment with pytorch, torchtext, nltk, Spacy, transformers, sentence-transformers, Flair, BERTopic for topic modeling, sentencepiece, RAPIDSai for data processing and machine learning algorithms, gensim, scikit-learn, and more.
 +
**Use the following command to list the available versions on HiPerGator-AI:
 +
**<pre>module spider nlp</pre>
  
  
*'''parlai:''' Conversational AI framework by Facebook, includes a wide variety of models from 110M to 9B parameters.
+
*'''ngc-pytorch:''' <code>module load ngc-pytorch</code> will provide a singularity container Python environment with pytorch including the Nvidia Apex optimizers required for [https://github.com/NVIDIA/Megatron-LM Megatron-LM]. Research computing has pretrained, large parameter Megatron language models available to HiperGator users. See /data/ai/examples/nlp or [[AI_Examples]] for more information.
 +
**Use the following command to list the available versions on HiPerGator-AI:
 +
**<pre>module spider ngc-pytorch</pre>
  
==Examples==
 
Please see our /data/ai folder for examples, reference data, and pretrained language models. Notebooks and batch scripts cover everything from pretraining and inferencing to summarization, information extraction, and topic modeling. Some of the reference data is listed on the reference data page on this help site.
 
  
<!--Configuration-->
+
*'''spark-nlp:''' See our [[Spark]] help doc to start a Spark cluster. Spark-nlp Python module is available in <code>tensorflow/2.4.1</code>.
{{#if: {{#var: conf}}|==Configuration==
 
See the [[{{PAGENAME}}_Configuration]] page for {{#var: app}} configuration details.
 
|}}
 
<!--Run-->
 
{{#if: {{#var: exe}}|==Additional Information==
 
  
WRITE_ADDITIONAL_INSTRUCTIONS_ON_RUNNING_THE_SOFTWARE_IF_NECESSARY
 
  
|}}
+
*'''parlai:''' Conversational AI framework by Facebook, includes a wide variety of models from 110M to 9B parameters.
<!--Job Scripts-->
 
{{#if: {{#var: job}}|==Job Script Examples==
 
See the [[{{PAGENAME}}_Job_Scripts]] page for {{#var: app}} Job script examples.
 
|}}
 
<!--Policy-->
 
{{#if: {{#var: policy}}|==Usage Policy==
 
  
WRITE USAGE POLICY HERE (Licensing, usage, access).
 
  
|}}
+
*'''[[FlairNLP]]:''' See [[FlairNLP]] for more information.
<!--Performance-->
 
{{#if: {{#var: testing}}|==Performance==
 
  
WRITE_PERFORMANCE_TESTING_RESULTS_HERE
+
==Large Language Models==
  
|}}
+
A variety of large language models are accessible for open-source download, though they might need specific software frameworks or adhere to particular end-user license agreements. Examples include starter LLMs trained using Megatron-LM, Llama2, and Llama3 which are located in the examples and reference data folder. These models, such as the 20B parameter GPT and the 9B parameter BERT, can be used as they are, further trained, or fine-tuned to meet specific needs. For the latest LLMs, such as LLaMA, GEMMA, and Mistral AI models, which provide advanced features and enhanced performance, please submit a [https://support.rc.ufl.edu/enter_bug.cgi help ticket] for further details and support. You can also find more information on our AI Models page.
<!--Faq-->
 
{{#if: {{#var: faq}}|==FAQ==
 
*'''Q:''' **'''A:'''|}}
 
<!--Citation-->
 
{{#if: {{#var: citation}}|==Citation==
 
If you publish research that uses {{#var:app}} you have to cite it as follows:
 
  
WRITE_CITATION_HERE
+
==Examples and Reference Data==
  
|}}
+
Please see the <code>/data/ai/</code> folder, [[AI_Examples]], and [[AI_Reference_Datasets]] for helpful resources. Notebooks and batch scripts cover everything from pretraining and inferencing to summarization, information extraction, and topic modeling. Additional reference data, including benchmarks such as the popular [https://super.gluebenchmark.com/tasks SuperGLUE], are already available in <code>/data/ai/benchmarks/nlp</code>.
<!--Installation-->
 
{{#if: {{#var: installation}}|==Installation==
 
See the [[{{PAGENAME}}_Install]] page for {{#var: app}} installation notes.|}}
 
<!--Turn the Table of Contents and Edit paragraph links ON/OFF-->
 
__NOTOC____NOEDITSECTION__
 

Latest revision as of 18:24, 4 September 2024

This page describes the collection of natural language processing (NLP) software on HiPerGator. NLP is a branch of artificial intelligence (AI) that helps computers understand and respond to human language. It is used in applications such as voice assistants, chatbots, and translation apps. NLP combines language rules with machine learning to help computers grasp not just words, but also the intent and sentiment behind them. It also supports AI work in other fields; in healthcare, for example, it helps analyze medical records to aid patient care. Research Computing can help with language modeling for knowledge exploration, measurement, classification, summarization, conversational AI, or other uses via support requests or consulting.

Environment Modules for NLP

  • Nemo: module load nemo provides a Singularity container environment with Python and NVIDIA NeMo. NeMo supports training for NLP tasks, offers speech-to-text and text-to-speech models, and lets you apply your own pretrained Megatron language models.
    • Use the following command to list the available versions on HiPerGator-AI:
    • module spider nemo


  • Bionemo: module load bionemo launches a Singularity container environment with Python and NVIDIA BioNeMo. BioNeMo specializes in biomedical NLP tasks, with models for medical text analysis, drug interaction extraction, and patient information processing. It also supports integrating your own pretrained Megatron language models.
    • Use the command below to list the available versions on HiPerGator-AI:
    • module spider bionemo


  • Llama: module load llama provides a scalable library for fine-tuning Meta Llama models, enabling users to quickly get started with various use cases, including domain adaptation and building LLM-based applications. It also showcases how to run Meta Llama locally, in the cloud, and on-premises.
    • Use the command below to list the available versions on HiPerGator-AI:
    • module spider llama


  • Mistral AI: module load mistral provides a set of tools for working with Mistral models. The first release contains tokenization. The tokenizers go beyond the usual text <-> tokens conversion, adding parsing of tools and structured conversations. The validation and normalization code used in the API is also included.
    • Use the command below to list the available versions on HiPerGator-AI:
    • module spider mistral


  • Gemma LLMs: module load gemma_llm provides a set of tools for working with Google Gemma models. Gemma models are compatible with frameworks such as PyTorch, Keras-NLP, NVIDIA NeMo, and Hugging Face Transformers, which streamlines model lifecycle management, serialization, and performance optimization.
    • Use the command below to list the available versions on HiPerGator-AI:
    • module spider gemma_llm


  • Pytorch or TensorFlow: Use module load pytorch or module load tensorflow to load the corresponding framework. If neither the nlp environments nor these environments have the libraries you require, you may need to create a Conda environment. See Conda and Managing_Python_environments_and_Jupyter_kernels for more details.
    • Use the following command to list the available versions on HiPerGator-AI:
    • module spider pytorch
      or
      module spider tensorflow


  • Transformers: The transformers package provides NLP tools built on the transformer architecture, which lets models such as BERT and GPT process and generate text with a strong grasp of context. It offers thousands of pretrained models for tasks across several modalities, including text, vision, and audio. The transformers Python package is available in nlp/1.3 and llama/3.


  • LangChain: LangChain is a framework designed to simplify the creation of applications that use large language models. As a language model integration framework, its use cases largely overlap with those of language models in general, including document analysis and summarization, chatbots, and code analysis. The LangChain Python module is available in nlp/1.3 and llama/3.


  • LlamaIndex: LlamaIndex is a simple, flexible data framework for connecting custom data sources to large language models (LLMs). The LlamaIndex Python module is available in nlp/1.3 and llama/3.


  • TensorRT-LLM: NVIDIA TensorRT, an SDK for high-performance deep learning inference, includes a deep learning inference optimizer and runtime that delivers low latency and high throughput for inference applications. TensorRT-LLM is an open-source library that accelerates and optimizes inference performance of the latest large language models (LLMs) on the NVIDIA AI platform. The TensorRT-LLM Python module is available in llama/3.


  • nlp: module load nlp provides a Python environment with pytorch, torchtext, nltk, spaCy, transformers, sentence-transformers, Flair, BERTopic for topic modeling, sentencepiece, RAPIDS for data processing and machine learning algorithms, gensim, scikit-learn, and more.
    • Use the following command to list the available versions on HiPerGator-AI:
    • module spider nlp


  • ngc-pytorch: module load ngc-pytorch provides a Singularity container Python environment with PyTorch, including the NVIDIA Apex optimizers required for Megatron-LM. Research Computing has pretrained, large-parameter Megatron language models available to HiPerGator users. See /data/ai/examples/nlp or AI_Examples for more information.
    • Use the following command to list the available versions on HiPerGator-AI:
    • module spider ngc-pytorch


  • spark-nlp: See our Spark help doc to start a Spark cluster. The spark-nlp Python module is available in tensorflow/2.4.1.


  • parlai: A conversational AI framework by Facebook that includes a wide variety of models, from 110M to 9B parameters.
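
After loading one of the modules above, it can help to confirm which of the listed Python packages are actually importable in the active environment. The following is a minimal sketch rather than an official Research Computing tool: the helper name check_packages is hypothetical, and the import names (for example sklearn for scikit-learn and sentence_transformers for sentence-transformers) are assumed to be the usual ones for the packages listed under nlp.

```python
import importlib.util


def check_packages(names):
    """Return {import_name: True/False} by looking each top-level name up on the import path."""
    status = {}
    for name in names:
        try:
            status[name] = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for broken or partially initialised packages
            status[name] = False
    return status


# Import names assumed for the packages listed in the nlp module description.
NLP_PACKAGES = [
    "torch", "torchtext", "nltk", "spacy", "transformers",
    "sentence_transformers", "flair", "bertopic", "sentencepiece",
    "gensim", "sklearn",
]

if __name__ == "__main__":
    for name, found in check_packages(NLP_PACKAGES).items():
        print(f"{name:25s} {'available' if found else 'missing'}")
```

Run it after module load nlp (or another module from the list). If a package you need is reported as missing, a Conda environment, as described in the Pytorch or TensorFlow entry, may be the better route.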


Large Language Models

A variety of large language models are available as open-source downloads, though they may require specific software frameworks or be subject to particular end-user license agreements. Examples include starter LLMs trained using Megatron-LM, Llama 2, and Llama 3, which are located in the examples and reference data folder. These models, such as the 20B parameter GPT and the 9B parameter BERT, can be used as they are, further trained, or fine-tuned to meet specific needs. For the latest LLMs, such as Llama, Gemma, and Mistral AI models, please submit a help ticket for further details and support. You can also find more information on our AI Models page.
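
When planning a job around one of the models mentioned above, a rough estimate of the memory needed just to hold the weights is a useful first check. The sketch below is back-of-the-envelope arithmetic only: the function name weight_memory_gb is hypothetical, the parameter counts are the 20B and 9B figures from the text, and the estimate covers weights alone, not activations, optimizer state, or framework overhead.

```python
# Bytes needed to store one parameter at common numeric precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_gb(num_params, precision="fp16"):
    """Approximate size of the model weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for label, n in [("20B GPT", 20e9), ("9B BERT", 9e9)]:
        for precision in ("fp32", "fp16"):
            size = weight_memory_gb(n, precision)
            print(f"{label} weights in {precision}: ~{size:.0f} GB")
```

Under these assumptions, the weights of a 20B parameter model take about 40 GB in 16-bit precision and about 80 GB in 32-bit precision. Training or fine-tuning needs considerably more than this.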

Examples and Reference Data

Please see the /data/ai/ folder, AI_Examples, and AI_Reference_Datasets for helpful resources. Notebooks and batch scripts cover everything from pretraining and inferencing to summarization, information extraction, and topic modeling. Additional reference data, including benchmarks such as the popular SuperGLUE, are already available in /data/ai/benchmarks/nlp.
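
To see what is in these folders from a notebook or script, a short directory listing is enough. This is a minimal sketch: the helper name list_entries is hypothetical, and the two paths are the ones named on this page, which exist only on HiPerGator.

```python
from pathlib import Path


def list_entries(directory):
    """Return the sorted entry names in `directory`, or [] if it cannot be read."""
    path = Path(directory)
    if not path.is_dir():
        return []
    try:
        return sorted(entry.name for entry in path.iterdir())
    except OSError:
        # For example, a folder that exists but is not readable by this user
        return []


if __name__ == "__main__":
    # Paths taken from this page; elsewhere the listing is simply empty.
    for folder in ("/data/ai/examples/nlp", "/data/ai/benchmarks/nlp"):
        entries = list_entries(folder)
        print(f"{folder}: {len(entries)} entries")
        for name in entries[:20]:
            print("  ", name)
```

Returning an empty list for a missing or unreadable folder keeps the script usable on machines other than HiPerGator.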