Happy Monday! I've been trying to gauge the community response to a "stance" I'm considering taking when it comes to recommending/purchasing AI tools that openly say they use your data and prompts (albeit de-identified) to train and improve their machine learning models. The few that I've come across lately are the Turnitin AI detection tool and the KnowBe4 PhishER/PhishML tool. ChatGPT also says they train their models on your inputs, but they offer a very simple opt-out, while the former two do not. In contrast, there are tools like Adobe CC Firefly that explicitly say they do not train on your content.
So, here are the questions I have for the community:
Thanks for your thoughts!
We have a similar stance and lack clear guardrails for our community. When we are engaged, we have advised against using tools that train on our data. We are working toward an AI governance group, composed of various faculty and staff, to help form guardrails going forward, but we see the same tidal wave coming, and keeping up with what each tool is doing may be a losing battle. I'd be interested in any ongoing conversations about this topic.
This is a great conversation and one that may eventually be decided by the FTC. Newly proposed changes to privacy laws would actually prohibit edtech software providers from using data collected for one purpose to train other, even similar, products. You can read more at FTC Proposes Strengthening Children's Privacy Rule to Further Limit Companies' Ability to Monetize Children's Data, or comment on the proposed changes before March 11, 2024.
You are spot on and are being an excellent steward for your school.
As for Turnitin's policy, this has always been their stated practice: they use submissions to train their tools, both the integrity/plagiarism detector and their purported AI detection tool.
Data governance is something we should all be concerned with, given increasing requirements from many states and the growing use of AI. It is better to get ahead of the curve now rather than play catch-up later.
Thank you @Brent Halsey and @Vinnie Vrotny! I'm so glad I'm not alone! I was getting worried! Do either of you have anything formal (or even casual) written up about this that you would be willing to share or collaborate on?
Hey @Nick Marchese, I agree with @Brent Halsey and @Vinnie Vrotny that data and AI governance are super important. There are still a lot of unintended and frankly unknown consequences to allowing large language models to be trained on sensitive or sensitive-adjacent data.
We have a formal AI policy that applies to the whole school, which I'm attaching to this post. While I think it addresses what you're mentioning, Nick, I don't think it fully sets out all the guidelines we've internally set for ourselves when it comes to using AI for analyzing or manipulating data.
In our internal PD, what we've discussed are the following:
1) Any AI tool has to pass the same vetting we use for any of our tools. As a recent leak of user credentials via ChatGPT indicates, our AI tools need to meet the same basic cybersecurity guidelines as all of our other systems/platforms. This means ensuring compliance with standards such as SOC 2, GDPR, and CCPA.
2) For now, our internal policy is not to use AI tools that train on our data or that allow for human review/reinforcement of sensitive data. This includes de-identified/redacted data, as there are examples of LLMs piecing data back together given enough context. We haven't gotten this far institutionally, but using synthetic data is one way around this issue, and it is perhaps something to look for in how companies train on data.
3) Again, we're not at the level of deciding to invest in enterprise features like OpenAI's Team or Enterprise plans (nor do we necessarily have the budget for them), but in my opinion that feels like a baseline before even considering entering sensitive data into an AI product. Either that or only using AI at the API level, which is not going to be accessible to most.
4) We still have ethical dilemmas around allowing models to train on student work, because of concerns that we might be giving away our students' voices, ideas, and creativity to an AI model. We also don't want to inadvertently reinforce biases present in these models.
5) There's been some discussion of open-source LLMs like Llama or Mistral, but there are questions about reliability, bias, deployment, logging, access control, etc.
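To make point 2 a bit more concrete: here is a minimal, hypothetical sketch of the kind of naive pattern-based redaction many tools rely on. The function name and regex patterns are my own illustration, not any vendor's actual method; the point is that masking the obvious identifiers still leaves contextual clues behind.

```python
import re

# Hypothetical, naive de-identification pass: masks obvious emails and
# phone numbers but leaves contextual clues (role, achievements, dates)
# that a model trained on enough related text could use to re-identify
# the person.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace matched PII with bracketed placeholder tokens."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

note = ("Email jdoe@example.org or call 518-555-0100; "
        "she is the only senior who won the robotics title.")
print(redact(note))
# The emitted text still says "the only senior who won the robotics
# title" -- enough context to re-identify the student.
```

This residual context is exactly why we treat "de-identified" data as still sensitive for the purposes of our internal policy.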
We'll continue to iterate, and we're very much interested in what other schools are doing, so I'd love to collaborate with you/anyone else in the ATLIS community on this issue.
------------------------------
Hudson Harper
The Downtown School
Seattle WA
------------------------------
------------------------------
Nick Marchese
Emma Willard School
Troy NY

Original Message:
Sent: 01-30-2024 09:49 AM
From: Vinnie Vrotny
Subject: AI tools that use data prompts and user data to train models
------------------------------
Vinnie Vrotny
The Kinkaid School
Houston TX
vinnie.email@example.com

Original Message:
Sent: 01-29-2024 09:49 AM
From: Nick Marchese
Subject: AI tools that use data prompts and user data to train models
#TeachingandLearning #CybersafetyandDataSecurity
------------------------------
Nick Marchese
Emma Willard School
Troy NY
------------------------------