Breaking through AI data bottlenecks
One of the most significant bottlenecks in training specialized AI models is
the scarcity of high-quality, domain-specific data. Building enterprise-grade
AI requires increasing amounts of diverse, highly contextualized data, which is in limited supply. This scarcity, sometimes known as the “cold start” problem, is only growing as companies license their data and further segment the internet. For startups and leading AI teams building state-of-the-art generative AI products for specialized use cases, public data sets also offer limited value, given their lack of specificity and timeliness.
... Synthesizing data not only increases the volume of training data but also
enhances its diversity and relevance to specific problems. For instance,
financial services companies are already using synthetic data to rapidly
augment and diversify real-world training sets for more robust fraud detection
— an effort that is supported by financial regulators like the UK’s Financial
Conduct Authority. By using synthetic data, these companies can generate
simulations of never-before-seen scenarios and gain safe access to proprietary
data via digital sandboxes.
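To make the augmentation idea concrete, here is a minimal Python sketch, assuming an invented three-feature transaction set: it fits a simple Gaussian model to a scarce set of fraud examples and samples new synthetic rows from it. Production synthetic-data platforms use far richer generative models; this is illustration only.

```python
# Minimal sketch: augmenting scarce fraud examples with synthetic samples.
# All features and values are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Pretend "real" fraud transactions: amount, hour of day, merchant risk score.
real_fraud = rng.normal(loc=[900.0, 3.0, 0.8],
                        scale=[300.0, 2.0, 0.1],
                        size=(50, 3))

# Fit a simple parametric generator (mean + covariance) to the scarce class...
mean = real_fraud.mean(axis=0)
cov = np.cov(real_fraud, rowvar=False)

# ...and sample never-before-seen scenarios from it.
synthetic_fraud = rng.multivariate_normal(mean, cov, size=500)

augmented = np.vstack([real_fraud, synthetic_fraud])
print(augmented.shape)  # (550, 3): ten times more fraud rows to train on
```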
Five Common Misconceptions About Event-Driven Architecture
Event sourcing is an approach to persisting data within a service. Instead of
writing the current state to the database, and updating that stored data when
the state changes, you store an event for every state change. The state can
then be restored by replaying the events. Event-driven architecture is about
communication between services. A service publishes any changes in its subdomain that it deems potentially interesting to others, and other services
subscribe to these updates. These events are carriers of state and triggers of
actions on the subscriber side. While these two patterns complement each other
well, you can have either without the other. ... Just as you can use Kafka
without being event-driven, you can build an event-driven architecture without
Kafka. And I’m not only talking about “Kafka replacements”, i.e. other
log-based message brokers. I don’t know why you’d want to, but you could use a
store-and-forward message queue (like ActiveMQ or RabbitMQ) for your eventing.
You could even do it without any messaging infrastructure at all, e.g. by
implementing HTTP feeds. Just because you could doesn’t mean you should! A
log-based message broker is most likely the best approach for you, too, if you
want an event-driven architecture.
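To make the distinction concrete, here is a minimal event-sourcing sketch in Python, using an invented bank-account example: state is never written directly; it is derived by replaying an append-only event log.

```python
# Event sourcing in miniature: persist events, not state; replay to rebuild.
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    kind: str     # "Deposited" or "Withdrawn"
    amount: int

event_log: list[Event] = []   # the append-only store

def deposit(amount: int) -> None:
    event_log.append(Event("Deposited", amount))

def withdraw(amount: int) -> None:
    event_log.append(Event("Withdrawn", amount))

def current_balance() -> int:
    # Current state is reconstructed by replaying every stored event.
    balance = 0
    for e in event_log:
        balance += e.amount if e.kind == "Deposited" else -e.amount
    return balance

deposit(100)
withdraw(30)
print(current_balance())  # 70
```

In an event-driven architecture, the same events could additionally be published to other services; the sketch covers only the persistence side, which is why the two patterns can exist independently of each other.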
Mostly AI’s synthetic text tool can unlock enterprise emails and conversations for AI training
Mostly AI provides enterprises with a platform to train their own AI
generators that can produce synthetic data on the fly. The company started off
by enabling the generation of structured tabular datasets, capturing nuances
of transaction records, patient journeys and customer relationship management
(CRM) databases. Now, as the next step, it is expanding to text data. While
proprietary text datasets – like emails, chatbot conversations and support transcripts – are collected at large scale, they are difficult to use because they include PII (such as customer information), have diversity gaps, and mix in structured data to varying degrees. With the new synthetic text functionality
on the Mostly AI platform, users can train an AI generator using any
proprietary text they have and then deploy it to produce a cleansed synthetic
version of the original data, free from PII or diversity gaps. ... The new feature, with its ability to unlock value from proprietary text without privacy concerns, makes the platform a compelling offering for enterprises looking to strengthen their AI training efforts. The company claims that training a text classifier on its platform’s synthetic text resulted in a 35% performance improvement compared with data generated by prompting GPT-4o-mini.
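To illustrate the PII problem the feature addresses, here is a deliberately naive Python sketch of regex-based redaction. This is not how Mostly AI’s generator works – its platform generates new synthetic text rather than masking the original – but it shows why raw emails and transcripts are hard to use as-is.

```python
# Naive PII redaction for illustration only; real synthetic-text pipelines
# generate new text rather than masking the original.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")

def scrub(text: str) -> str:
    text = EMAIL.sub("<EMAIL>", text)
    return PHONE.sub("<PHONE>", text)

print(scrub("Reach Jane at jane.doe@example.com or 555-123-4567."))
# -> Reach Jane at <EMAIL> or <PHONE>.
```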
Not Maintaining Data Quality Today Would Mean Garbage In, Disasters Out
Enterprises are increasingly data-driven and rely heavily on the collected data
to make decisions, says Choudhary. A decade ago, a single application typically stored all its data in a relational database for weekly reporting. Today, data
is scattered across various sources including relational databases, third-party
data stores, cloud environments, on-premise systems, and hybrid models, says
Choudhary. This shift has made data management much more complex, as all of
these sources need to be harmonized in one place. However, in the world of AI,
both structured and unstructured data need to be of high quality. Choudhary
states that not maintaining data quality in the AI age would lead to garbage in,
disasters out. Highlighting the relationship between AI and data observability in enterprise settings, he says that given the role both structured and unstructured data play, data observability will become more critical.
... However, AI also requires the unstructured business context, such as
documents from wikis, emails, design documents, and business requirement
documents (BRDs). He stresses that this unstructured data adds context to the
factual information on which business models are built.
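As a sketch of what such observability checks might look like in practice, the hypothetical Python example below validates a small dataframe before it feeds a model; the columns and rules are invented.

```python
# Hypothetical pre-training data quality gate; columns and rules are made up.
import pandas as pd

df = pd.DataFrame({
    "customer_id": [1, 2, 2, None],
    "revenue": [120.0, -5.0, 310.0, 99.0],
})

issues = []
if df["customer_id"].isna().any():
    issues.append("null customer_id values")
if df["customer_id"].duplicated().any():
    issues.append("duplicate customer_id values")
if (df["revenue"] < 0).any():
    issues.append("negative revenue values")

# Garbage in, disasters out: surface problems before the data trains a model.
print("quality check:", "FAILED - " + "; ".join(issues) if issues else "passed")
```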
Three Evolving Cybersecurity Attack Strategies in OT Environments
Attackers are increasingly targeting supply chains, capitalizing on the trust
between vendors and users to breach OT systems. This method offers a high
return on investment, as compromising a single supplier can result in
widespread breaches. The Dragonfly attacks, where attackers penetrated
hundreds of OT systems by replacing legitimate software with Trojanized
versions, exemplify this threat. ... Attack strategies are shifting from
immediate exploitation to establishing persistent footholds within OT
environments. Attackers now prefer to lie dormant, waiting for an opportune
moment to strike, such as during economic instability or geopolitical events.
This approach allows them to exploit unknown or unpatched vulnerabilities, as
demonstrated by the Log4j and Pipedream attacks. ... Attackers are
increasingly focused on collecting and storing encrypted data from OT
environments for future exploitation – a “harvest now, decrypt later” strategy – in anticipation of the advent of quantum computing. Quantum computing poses a significant risk to current encryption
methods, potentially allowing attackers to decrypt previously secure data.
Manufacturers must implement additional protective layers and consider
future-proofing their encryption strategies to safeguard data against these
emerging threats.
Mitigating Cybersecurity Risk in Open-Source Software
Unsurprisingly, open-source software's lineage is complex. Whereas commercial
software is typically designed, built and supported by one corporate entity,
open-source code could be written by a developer, a well-resourced
open-sourced community or a teenage whiz kid. Libraries containing all of this
open-source code, procedures and scripts are extensive. They can contain
libraries within libraries, each with its own family tree. A single
open-source project may have thousands of lines of code from hundreds of
authors, which can make line-by-line code analysis impractical and may result
in vulnerabilities slipping through the cracks. These challenges are further
exacerbated by the fact that many libraries are stored on public repositories
such as GitHub, which may be compromised by bad actors injecting malicious
code into a component. Vulnerabilities can also be accidentally introduced by
developers. Synopsys' OSSRA report found that 74% of the audited code bases
had high-risk vulnerabilities. And don't forget patching, updates and security
notifications that are standard practices from commercial suppliers but likely
lacking (or far slower) in the world of open-source software.
Will AI Middle Managers Be the Next Big Disruption?
Trust remains a critical barrier, with many companies double-checking AI
outputs, especially in sensitive areas such as compliance. But as the use of
explainable AI grows, offering transparent decision-making, companies may
begin to relax their guard and fully integrate AI as a trusted part of the
workforce. Yet despite its vast potential and transformative abilities,
autonomous AI is unlikely to work without human supervision. AI lacks the
emotional intelligence needed to navigate complex human relationships, and
companies are often skeptical of assigning decision-making to AI tools. ...
"One thing that won't change is that work is still centered around humans, so
that people can bring their creativity, which is such an important human
trait," said Fiona Cicconi, chief people officer, Google. Accenture's report
highlights just that. Technology alone will not drive AI-driven growth. ...
Having said that, managers will have to roll up their sleeves, upskill and
adapt to AI and emerging technologies that benefit their teams and align with
organizational objectives. To fully realize the potential of AI, businesses
will need to prioritize human-AI collaboration.
Managing Risk: Is Your Data Center Insurance up to the Test?
E&O policies generally protect against liability to third parties for
losses arising from the insured’s errors and omissions in performing
“professional services.” ... Cyber coverage typically protects against a broad
range of first-party losses and liability claims arising from various causes,
including data breaches and other disclosures of non-public information. A
data center that processes data owned by third parties plainly has liability
exposure to such parties if their non-public information is disclosed as a
result of the data center’s operations. But even if a data center is
processing only its own company’s data, it still has liability exposure,
including for disclosure of non-public information belonging to its customers
and employees. Given the often-substantial costs of defending data breach
claims, data center operators would be well-advised to (1) review their cyber
policies carefully for exclusions or limitations that potentially could apply
to their liability coverage under circumstances particular to their operations
and (2) purchase cyber liability limits commensurate with the amount and
sensitivity of non-public data in their possession.
Attribution as the foundation of developer trust
With the need for more trust in AI-generated content, it is critical to credit
the author/subject matter expert and the larger community who created and
curated the content shared by an LLM. This also ensures LLMs use the most
relevant and up-to-date information and content, ultimately providing the Rosetta Stone a model needs to build trust in its sources and the decisions that follow. All of our OverflowAPI partners have enabled attribution through
retrieval augmented generation (RAG). For those who may not be familiar with
it, retrieval augmented generation is an AI framework that combines generative
large language models (LLMs) with traditional information retrieval systems to
update answers with the latest knowledge in real time (without requiring
re-training models). This matters because generative AI technologies are powerful but limited by what they “know” – the data they have been trained on. RAG
helps solve this by pairing information retrieval with carefully designed
system prompts that enable LLMs to provide relevant, contextual, and
up-to-date information from an external source. In instances involving
domain-specific knowledge, RAG can drastically improve the accuracy of an
LLM's responses.
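Here is a minimal Python sketch of RAG with attribution; the corpus, URLs, and the `llm_complete` stub are invented placeholders, and the retriever is a toy word-overlap ranker rather than a real vector search.

```python
# Minimal RAG-with-attribution sketch; corpus, URLs and llm_complete are
# placeholders invented for illustration.

CORPUS = [
    ("https://example.com/generators",
     "The yield keyword turns a function into a generator."),
    ("https://example.com/main-guard",
     "A __main__ guard keeps code from running on import."),
]

def llm_complete(prompt: str) -> str:
    # Stand-in for a real LLM call; here we just echo the prompt.
    return f"[model response to]\n{prompt}"

def retrieve(query: str, k: int = 1) -> list[tuple[str, str]]:
    # Toy retriever: rank documents by word overlap with the query.
    words = set(query.lower().split())
    return sorted(CORPUS,
                  key=lambda doc: len(words & set(doc[1].lower().split())),
                  reverse=True)[:k]

def answer(query: str) -> str:
    context = "\n".join(f"[source: {url}] {text}"
                        for url, text in retrieve(query))
    prompt = ("Answer using ONLY the sources below and cite their URLs.\n"
              f"{context}\n\nQuestion: {query}")
    return llm_complete(prompt)

print(answer("what does the yield keyword do"))
```

The attribution step is simply carrying each snippet's source URL into the prompt, so the model's answer can credit where the knowledge came from.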
Measurement Challenges in AI Catastrophic Risk Governance and Safety Frameworks
The current definition of catastrophic events, focusing on "large-scale
devastation... directly caused by an AI model,” overlooks critical aspects:
indirect causation and salient contributing causes. Indirect causation refers
to cases where AI plays a pivotal but not immediately apparent role. For
instance, the development and deployment of advanced AI models could trigger
an international AI arms race, becoming a salient contributor to increased
geopolitical instability or conflict. A concrete example might be AI-enhanced
cyber warfare capabilities leading to critical infrastructure failures across
multiple countries. AI systems might also amplify existing systemic risks or
introduce new vulnerabilities that become salient contributing causes to a
catastrophic event. The current narrow scope of AI catastrophic events may
lead to underestimating the full range of potential catastrophic outcomes
associated with advanced AI models, particularly those arising from complex
interactions between AI and other sociotechnical systems. This could include
scenarios where AI exacerbates climate change through increased energy
consumption or where AI-powered misinformation campaigns gradually lead to the
breakdown of trust in democratic institutions and social order.
Quote for the day:
“Facing difficult circumstances does not determine who you are. They simply bring to light who you already were.” -- Chris Rollins