
FDA Guidance AI CHATBOT

As a UX designer who works on medical device instructions, I frequently need to reference FDA standards and guidance documents to find very specific information, such as what font size the instructions should be written in. However, many of the FDA's publicly available guidance documents run over 50 pages, which makes this information difficult to find or return to. The search process is often frustrating and takes up valuable time. I have also found that asking FDA guidance-related questions of LLMs like ChatGPT can sometimes lead to incorrect outputs, and I wanted a way to search only specific documents that I could control. The goal of this project was therefore to build an FDA guidance chatbot that quickly searches through a collection of FDA guidance documents to provide answers to specific questions.

 Methods and Tools 

  • Python: Programming language; used to write the code that runs the chatbot and loads the PDF documents 

  • TextEdit: Native text editor on macOS; used to write and store the Python scripts 

  • Terminal (macOS): Native command-line program on macOS; used to run the Python scripts 

  • Sentence Transformers Python library: Transforms chunks of text into vector embeddings so that they can be searched semantically 

  • Pickle Python module: Stores the processed text chunks and embeddings from the FDA PDFs; lighter and faster to load than JSON files 

  • NumPy Python library: Used to calculate the confidence score for each response output 

  • GPT4All: Local LLM runtime that runs privately on my computer

    • Run with the Llama 3.2 3B Instruct Q4 model: Lightweight chat model with 3 billion parameters 

  • Streamlit: Serves my chatbot as a local web app 

  • ngrok: Creates a public link for my chatbot that can be shared 
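The confidence score mentioned above can be computed as the cosine similarity between the question embedding and each chunk embedding. A minimal NumPy sketch follows; the project's exact formula is not shown, so this is an assumption:

```python
import numpy as np

def confidence_scores(question_vec, chunk_vecs):
    """Cosine similarity between a question embedding and each chunk embedding."""
    q = question_vec / np.linalg.norm(question_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    return c @ q  # one score per chunk, in [-1, 1]

# Toy example: the first chunk matches the question exactly,
# the second is completely unrelated (orthogonal).
scores = confidence_scores(np.array([1.0, 0.0]),
                           np.array([[1.0, 0.0], [0.0, 1.0]]))
```

Higher scores indicate closer semantic matches, which is what makes the confidence values shown next to each retrieved chunk meaningful.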

Process

First, I installed GPT4All since I planned to run the chatbot locally on my own computer. This way, I could control which documents my LLM model would be searching through. Next, I made a project folder and uploaded four FDA guidance documents, each around 20-50 pages long, as a sample set to test. I used TextEdit to write my Python scripts and Terminal to run any commands or files saved to my local project folder. During this process, I learned how to correctly save TextEdit files with .py extensions so that my Terminal commands could read them.

Next, I researched the best way for LLMs to perform searches on documents, and I relied on ChatGPT for some clarification and advice on which methods made sense for my project. I started by installing the Sentence Transformers library, which I used to split my PDF text into chunks and convert them into vector embeddings that would be easier for my chatbot to search. This prepared my data to be usable by my Python tools and my LLM. I also learned that this method lets my LLM semantically interpret the question's context and meaning rather than simply using keywords to find exact matches.
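A minimal sketch of what the chunking step might look like; the chunk size and overlap values here are illustrative assumptions rather than the project's actual settings:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split extracted PDF text into overlapping character chunks so that
    sentences spanning a chunk boundary are not lost."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Embedding the chunks then looks roughly like (assuming the commonly
# used MiniLM model; the write-up does not name the model used):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("all-MiniLM-L6-v2")
#   embeddings = model.encode(chunks)   # one vector per chunk

sample = "Labeling must be legible. " * 100
chunks = chunk_text(sample)
```

The overlap means consecutive chunks share their boundary text, so a sentence split across two chunks still appears intact in at least one of them.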

During this process, I also learned about the pickle module, which let me save my text chunk embeddings to a file and load them later. Pickle files are also lightweight and load quickly, which will be helpful when I add more PDFs to my project folder.
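Saving and reloading the embeddings with pickle can be sketched like this; the file layout is an assumption based on the description above:

```python
import os
import pickle

def save_embeddings(chunks, embeddings, path="embeddings.pkl"):
    """Write text chunks and their embeddings to a single pickle file."""
    with open(path, "wb") as f:
        pickle.dump({"chunks": chunks, "embeddings": embeddings}, f)

def load_embeddings(path="embeddings.pkl"):
    """Read the chunks and embeddings back for use by the chatbot."""
    with open(path, "rb") as f:
        return pickle.load(f)

# Demo round-trip with toy data, then clean up the demo file.
save_embeddings(["chunk one", "chunk two"], [[0.1, 0.2], [0.3, 0.4]])
data = load_embeddings()
os.remove("embeddings.pkl")
```

Because the embeddings are precomputed and cached this way, the chatbot only has to embed the user's question at query time, not the whole document set.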

For my LLM, I downloaded the Llama 3.2 3B Instruct model in GPT4All. I installed Streamlit so that I could access my chatbot through a local web link. I also tried downloading ngrok so that I could create a public-facing web link; this involved creating an ngrok account to get a personal token and adding that token to my command.

I wrote my first Python script to extract text from the PDFs in my project folder (file: extract_pdfs.py). It uses Sentence Transformers to chunk and embed the PDF text and saves the result to a pickle file (file: embeddings.pkl). My final Python script brings everything together to run the chatbot (file: chatbot_app.py): it loads the text embeddings from the pickle file, imports and runs GPT4All with my LLM model (Llama 3.2 3B Instruct), sets up the chatbot input/output interaction, and imports Streamlit.
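The flow of chatbot_app.py might be sketched as follows; the model filename, prompt wording, and helper names are assumptions for illustration, not copied from the project:

```python
import numpy as np

def top_chunks(question_vec, chunks, embeddings, k=3):
    """Rank the stored chunks by cosine similarity to the question embedding."""
    q = question_vec / np.linalg.norm(question_vec)
    c = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    scores = c @ q
    order = np.argsort(scores)[::-1][:k]
    return [(chunks[i], float(scores[i])) for i in order]

# In the real app, roughly (names and prompt are assumptions):
#   data = pickle.load(open("embeddings.pkl", "rb"))           # from extract_pdfs.py
#   llm = GPT4All("Llama-3.2-3B-Instruct-Q4_0.gguf")           # gpt4all package
#   question = st.text_input("Ask an FDA guidance question")   # streamlit
#   q_vec = embed_model.encode([question])[0]
#   context = "\n\n".join(c for c, s in top_chunks(q_vec, data["chunks"], data["embeddings"]))
#   answer = llm.generate(f"Answer using only this context:\n{context}\n\nQuestion: {question}")
#   st.write(answer)   # shown above the matching text chunks

# Toy demo of the retrieval step:
results = top_chunks(np.array([1.0, 0.0]),
                     ["relevant chunk", "unrelated chunk"],
                     np.array([[0.9, 0.1], [0.0, 1.0]]))
```

Returning the matched chunks alongside the generated answer is what lets the output cite the document name and page number for each match.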

Finally, I made a few updates based on my peer review feedback. I adjusted the format of the output so that the chatbot response is shown above the text chunk response. I felt that this was a great suggestion to make the output information more digestible for the user. My peer reviewer also suggested that I add hyperlinks to the text chunks output, which I think is a great idea to incorporate. While I was unable to complete this within the scope of this project, I plan to add it to my next steps. 

STEP SUMMARY

● Open a Terminal window 

● Run cd ~/Desktop/FDA\ Chatbot 

● Run source project/bin/activate to activate the virtual environment 

● Run python3 -m pip install streamlit 

● Run python3 -m pip install sentence-transformers 

● Run streamlit run chatbot_app.py 

● Run ./ngrok http 8501 in a second Terminal window to expose the local app 

Files and Links

  • The Python code scripts referenced above can be seen in the Kaggle notebook here: Kaggle notebook.

    • Contains: extract_pdfs.py, embeddings.pkl, chatbot_app.py, chatbot_terminal.py 

  • A Zip file containing the full project and associated files can be accessed here: FDA Chatbot.zip 

  • YouTube link to lightning talk presentation and demonstration: YouTube link

FINDINGS

After successfully running my chatbot with Streamlit, I wanted to ask a series of questions to understand how effective the outputs would be. The “effectiveness” of the output was based on how valuable I interpreted the response to be and/or how useful I would find it in a work context. While this measure is subjective, I feel that it is appropriate for the scope of this project given my familiarity with the FDA guidance documents I used. I investigated different types of prompt questions, such as slight wording differences, context-heavy questions, single-word questions, and completely irrelevant questions. 

My overall findings from this exercise were mixed. While I found the chatbot to be effective in some cases, there were other instances in which I felt that the output was very underwhelming. I noted that changing a single word within my question input could have a major impact on the response. Additionally, I was surprised to see that adding more context to my prompt question didn’t seem to make a significant difference in the quality of the output. In fact, it seemed that in some cases adding too much information resulted in a more sporadic or generalized response from the chatbot, since it was trying to attend to each part of my input question. Overall, I think the chatbot that I built is promising and has a lot of potential for future applications. I plan to continue refining my code and exploring different ML models to improve the quality of the responses. Below, I’ve included an image of the chatbot UI itself along with a few examples of findings when asking my chatbot questions. 


View of Chatbot UI

Comparison of a longer question with more context vs. a few words 

I wanted to test the output when asking a longer question (“What section titles or headings should I include in my Instructions for Use?”) vs. inputting only a few similar keywords (“Section titles”). The chatbot answer for my second prompt (“Section titles”) contained not only more content but also more accurate content. I was again surprised to see such a significant difference between the two responses. The first two matching document chunks were not the same, and only the third chunk aligned between the two.


Example of effective response output 

I found this question (“When do I need an Instructions for Use”) to lead to one of the most effective responses from my chatbot. I assume that this was due to many of the documents having similar sections on this topic with overlapping information. 


Example with a completely irrelevant question 

Finally, I wanted to see what my chatbot would respond with when I asked it a question that I was sure was completely irrelevant to any of the FDA guidance documents. I asked “What is the best cat food for my cat?” The chatbot response was very strange. It began by saying that I should consult a veterinarian, followed by a question about the most efficient electric cars, and ended by analyzing its own response. While this was a very amusing output, it indicates to me that I should explore adding instructions into my code about what to do if there are no good matches across all of the guidance documents. Additionally, the top matching text chunks were also very weak, with irrelevant matches and low confidence scores. 
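One way to add that "no good match" handling is a simple confidence threshold checked before calling the LLM; a minimal sketch, where the 0.35 cutoff is an arbitrary illustration rather than a tuned value:

```python
def answer_or_decline(scores, threshold=0.35):
    """Decline to answer when no chunk is similar enough to the question."""
    if max(scores) < threshold:
        return ("I couldn't find anything relevant in the FDA guidance "
                "documents for that question.")
    return None  # proceed to build the prompt and call the LLM

# The cat-food question produced only weak matches, so the fallback fires:
msg = answer_or_decline([0.12, 0.08, 0.05])
```

This keeps the LLM from improvising an answer when the retrieved context is essentially noise, as happened with the cat food question.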


Discussion

Overall, I feel that I was very effective in meeting my goal of building a chatbot that searches through FDA documents and responds to user questions. This project taught me so much about machine learning, and how chatbots are built from the ground up, even if on a very small scale. I learned about how to integrate Sentence Transformers into my Python code, and how this ML tool creates text embeddings from the PDFs I provided. This method made it so that my chatbot would search through the text semantically, and would not just match keywords. As noted in my findings, after asking my chatbot several questions with varying levels of complexity, I found mixed results about the outputs. While some responses were very accurate, others appeared to be very obviously incorrect. I think that a potential next step for this chatbot would be to test out different Sentence Transformer models to see if they could improve the accuracy of the outputs.
 

Furthermore, this project would make an excellent use case for someone interested in the applications of ML to UX. It allowed me to think through the user interaction and consider what output information would be most valuable to the user. In this case, I considered myself the end user of the chatbot, so I was able to determine what information I deemed most valuable and measure the chatbot's effectiveness against that. Additionally, I think many users of LLMs have at some point wondered about the accuracy of a chatbot's output. I have personally received incorrect answers from an LLM, which is why I wanted to build the document name and page number into my chatbot's output; I feel that this increases my confidence and trust in the information it provides. This example can apply to wider ML applications in UX: by understanding how users think about interactions with ML, and LLMs specifically, we can adapt these systems to better serve users, whether that means ensuring higher confidence, building greater trust, or something else.

Appendix


Project folder and files 

Python scripts screenshots 


File “extract_pdfs.py” 


File “chatbot_app.py” 


File “chatbot_app.py” 

©2025 Jessica Wysor
