Daily Tech Digest - October 02, 2024

Breaking through AI data bottlenecks

One of the most significant bottlenecks in training specialized AI models is the scarcity of high-quality, domain-specific data. Building enterprise-grade AI requires ever-larger amounts of diverse, highly contextualized data, which is in limited supply. This scarcity, sometimes known as the “cold start” problem, is only growing as companies license their data and further segment the internet. For startups and leading AI teams building state-of-the-art generative AI products for specialized use cases, public data sets also offer limited value because they lack specificity and timeliness. ... Synthesizing data not only increases the volume of training data but also enhances its diversity and relevance to specific problems. For instance, financial services companies are already using synthetic data to rapidly augment and diversify real-world training sets for more robust fraud detection, an effort supported by financial regulators such as the UK’s Financial Conduct Authority. By using synthetic data, these companies can simulate never-before-seen scenarios and gain safe access to proprietary data via digital sandboxes.
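To make the augmentation idea concrete, here is a minimal sketch in Python: it rebalances a scarce fraud class by fitting a simple per-feature Gaussian to the real fraud records and sampling new ones. The class sizes and feature dimensions are invented for illustration; production systems use far richer generators (GANs, copulas, or commercial synthetic-data platforms).

```python
# Minimal sketch: rebalancing a scarce fraud class with synthetic samples.
# Class sizes and features are invented; real systems use richer generators.
import numpy as np

rng = np.random.default_rng(seed=42)

# Toy "real" data: 1,000 legitimate transactions, only 20 fraudulent ones.
legit = rng.normal(loc=0.0, scale=1.0, size=(1000, 4))
fraud = rng.normal(loc=2.5, scale=1.5, size=(20, 4))

def synthesize(real: np.ndarray, n: int) -> np.ndarray:
    """Fit a per-feature Gaussian to the real samples and draw n new ones."""
    return rng.normal(loc=real.mean(axis=0), scale=real.std(axis=0),
                      size=(n, real.shape[1]))

# Generate 500 synthetic fraud records to diversify the training set.
synthetic_fraud = synthesize(fraud, n=500)
X = np.vstack([legit, fraud, synthetic_fraud])
y = np.concatenate([np.zeros(len(legit)),
                    np.ones(len(fraud) + len(synthetic_fraud))])
print(X.shape, round(y.mean(), 2))  # (1520, 4) 0.34 -- fraud share after augmentation
```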


Five Common Misconceptions About Event-Driven Architecture

Event sourcing is an approach to persisting data within a service. Instead of writing the current state to the database and updating that stored data when the state changes, you store an event for every state change. The state can then be restored by replaying the events. Event-driven architecture, by contrast, is about communication between services. A service publishes any changes in its subdomain that it deems potentially interesting to others, and other services subscribe to these updates. These events are carriers of state and triggers of actions on the subscriber side. While the two patterns complement each other well, you can have either without the other. ... Just as you can use Kafka without being event-driven, you can build an event-driven architecture without Kafka. And I’m not only talking about “Kafka replacements”, i.e., other log-based message brokers. I don’t know why you’d want to, but you could use a store-and-forward message queue (like ActiveMQ or RabbitMQ) for your eventing. You could even do it without any messaging infrastructure at all, e.g., by implementing HTTP feeds. Just because you could doesn’t mean you should! If you want an event-driven architecture, a log-based message broker is most likely the best approach for you, too.
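To make the distinction concrete, here is a minimal Python sketch of event sourcing, the persistence-side pattern described above: state is never stored directly, only an append-only log of events, and the current state is rebuilt by replaying them. The account/balance domain and all names are illustrative, not from the article.

```python
# Minimal sketch of event sourcing: persist events, not state, and
# rebuild the current state by replaying them. Names are illustrative.
from dataclasses import dataclass

@dataclass
class Event:
    kind: str      # e.g. "Deposited", "Withdrawn"
    amount: int

class AccountEventStore:
    def __init__(self) -> None:
        self._events: list[Event] = []  # append-only log

    def append(self, event: Event) -> None:
        self._events.append(event)

    def replay(self) -> int:
        """Restore the current balance by folding over all stored events."""
        balance = 0
        for e in self._events:
            if e.kind == "Deposited":
                balance += e.amount
            elif e.kind == "Withdrawn":
                balance -= e.amount
        return balance

store = AccountEventStore()
store.append(Event("Deposited", 100))
store.append(Event("Withdrawn", 30))
print(store.replay())  # 70 -- state derived purely from the event log
```

Event-driven communication between services would be the separate step of publishing those events to subscribers; as the excerpt notes, you can do either without the other.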


Mostly AI’s synthetic text tool can unlock enterprise emails and conversations for AI training

Mostly AI provides enterprises with a platform to train their own AI generators that can produce synthetic data on the fly. The company started off by enabling the generation of structured tabular datasets, capturing the nuances of transaction records, patient journeys and customer relationship management (CRM) databases. Now, as the next step, it is expanding to text data. While proprietary text datasets – like emails, chatbot conversations and support transcriptions – are collected at scale, they are difficult to use because they contain PII (like customer information), suffer from diversity gaps and, to some degree, mix in structured data. With the new synthetic text functionality on the Mostly AI platform, users can train an AI generator on any proprietary text they have and then deploy it to produce a cleansed synthetic version of the original data, free from PII and diversity gaps. ... The new feature, and its ability to unlock value from proprietary text without privacy concerns, makes it a compelling offering for enterprises looking to strengthen their AI training efforts. The company claims that training a text classifier on its platform’s synthetic text resulted in a 35% performance improvement compared to data generated by prompting GPT-4o-mini.
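As a loose, toy-scale illustration of the train-then-generate workflow described above (emphatically not Mostly AI's actual SDK), the Python sketch below "trains" a word-bigram generator on a tiny private corpus and samples new text from it. A real platform would use a fine-tuned generative model with privacy guarantees; a bigram model this small can echo fragments of its inputs.

```python
# Toy sketch of the train-then-generate idea (NOT Mostly AI's actual SDK):
# fit a generator on private text, then sample synthetic text from it.
import random

def train_generator(corpus: list[str]) -> dict[str, list[str]]:
    """Build a word-bigram model mapping each word to its observed successors."""
    model: dict[str, list[str]] = {}
    for doc in corpus:
        words = doc.split()
        for a, b in zip(words, words[1:]):
            model.setdefault(a, []).append(b)
    return model

def generate_synthetic(model: dict[str, list[str]], start: str, length: int = 8) -> str:
    """Random-walk the bigram model to emit a new, statistically similar text."""
    out = [start]
    for _ in range(length):
        successors = model.get(out[-1])
        if not successors:
            break
        out.append(random.choice(successors))
    return " ".join(out)

emails = [
    "please reset my password for the billing portal",
    "please update my billing address before the next invoice",
]
model = train_generator(emails)
print(generate_synthetic(model, start="please"))
```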


Not Maintaining Data Quality Today Would Mean Garbage In, Disasters Out

Enterprises are increasingly data-driven and rely heavily on collected data to make decisions, says Choudhary. A decade ago, a single application typically stored all its data in a relational database for weekly reporting. Today, data is scattered across various sources, including relational databases, third-party data stores, cloud environments, on-premises systems, and hybrid models, he says. This shift has made data management much more complex, as all of these sources need to be harmonized in one place. In the world of AI, moreover, both structured and unstructured data need to be of high quality. Choudhary warns that not maintaining data quality in the AI age would lead to garbage in, disasters out. Highlighting the relationship between AI and data observability in enterprise settings, he says that given the role both kinds of data play in enterprises, data observability will become even more critical. ... However, AI also requires unstructured business context, such as documents from wikis, emails, design documents, and business requirement documents (BRDs). He stresses that this unstructured data adds context to the factual information on which business models are built.
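A hedged sketch of what baseline data-quality checks can look like in practice: simple completeness and validity rules applied to records before they feed a model. Field names, rules, and the toy records are invented for illustration; real data-observability tooling adds freshness, volume, and schema checks across many sources.

```python
# Minimal sketch of "data observability" style checks: validate records
# before they feed a model, so garbage is caught at the door.
records = [
    {"id": 1, "amount": 120.0, "country": "UK"},
    {"id": 2, "amount": None,  "country": "US"},
    {"id": 3, "amount": -5.0,  "country": ""},
]

def check_quality(rows: list[dict]) -> list[str]:
    """Apply completeness and validity rules; return one message per issue."""
    issues = []
    for row in rows:
        if row["amount"] is None:
            issues.append(f"id={row['id']}: missing amount (completeness)")
        elif row["amount"] < 0:
            issues.append(f"id={row['id']}: negative amount (validity)")
        if not row["country"]:
            issues.append(f"id={row['id']}: empty country (completeness)")
    return issues

for issue in check_quality(records):
    print(issue)
# id=2: missing amount (completeness)
# id=3: negative amount (validity)
# id=3: empty country (completeness)
```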


Three Evolving Cybersecurity Attack Strategies in OT Environments

Attackers are increasingly targeting supply chains, capitalizing on the trust between vendors and users to breach OT systems. This method offers a high return on investment, as compromising a single supplier can result in widespread breaches. The Dragonfly attacks, in which attackers penetrated hundreds of OT systems by replacing legitimate software with Trojanized versions, exemplify this threat. ... Attack strategies are shifting from immediate exploitation to establishing persistent footholds within OT environments. Attackers now prefer to lie dormant, waiting for an opportune moment to strike, such as during economic instability or geopolitical events. This approach allows them to exploit unknown or unpatched vulnerabilities, as demonstrated by the Log4j and Pipedream attacks. ... Attackers are increasingly focused on collecting and storing encrypted data from OT environments for future exploitation (a “harvest now, decrypt later” strategy), particularly with the impending advent of quantum computing. Quantum computing poses a significant risk to current encryption methods, potentially allowing attackers to decrypt previously secure data. Manufacturers must implement additional protective layers and consider future-proofing their encryption strategies to safeguard data against these emerging threats.


Mitigating Cybersecurity Risk in Open-Source Software

Unsurprisingly, open-source software's lineage is complex. Whereas commercial software is typically designed, built and supported by one corporate entity, open-source code could be written by a lone developer, a well-resourced open-source community or a teenage whiz kid. Libraries containing all of this open-source code, procedures and scripts are extensive. They can contain libraries within libraries, each with its own family tree. A single open-source project may have thousands of lines of code from hundreds of authors, which can make line-by-line code analysis impractical and may result in vulnerabilities slipping through the cracks. These challenges are further exacerbated by the fact that many libraries are stored on public repositories such as GitHub, which may be compromised by bad actors injecting malicious code into a component. Vulnerabilities can also be accidentally introduced by developers. Synopsys' OSSRA report found that 74% of the audited codebases had high-risk vulnerabilities. And don't forget patching, updates and security notifications, which are standard practice from commercial suppliers but likely lacking (or far slower) in the world of open-source software.
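The "libraries within libraries" point is easy to see with a toy dependency walk: a couple of direct dependencies fan out into a much larger transitive set, which is why line-by-line review does not scale. The package names and graph below are entirely made up for illustration.

```python
# Toy dependency graph (all names invented) showing how a few direct
# dependencies fan out into a larger transitive set.
deps = {
    "my-app": ["web-framework", "http-client"],
    "web-framework": ["templating", "http-client"],
    "http-client": ["tls-lib"],
    "templating": [],
    "tls-lib": ["math-lib"],
    "math-lib": [],
}

def transitive(pkg: str, seen: set[str] | None = None) -> set[str]:
    """Depth-first walk collecting every package reachable from pkg."""
    seen = set() if seen is None else seen
    for dep in deps.get(pkg, []):
        if dep not in seen:
            seen.add(dep)
            transitive(dep, seen)
    return seen

print(sorted(transitive("my-app")))
# ['http-client', 'math-lib', 'templating', 'tls-lib', 'web-framework']
```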


Will AI Middle Managers Be the Next Big Disruption?

Trust remains a critical barrier, with many companies double-checking AI outputs, especially in sensitive areas such as compliance. But as the use of explainable AI grows, offering transparent decision-making, companies may begin to relax their guard and fully integrate AI as a trusted part of the workforce. Yet despite its vast potential and transformative abilities, autonomous AI is unlikely to work without human supervision. AI lacks the emotional intelligence needed to navigate complex human relationships, and companies are often skeptical of assigning decision-making to AI tools. ... "One thing that won't change is that work is still centered around humans, so that people can bring their creativity, which is such an important human trait," said Fiona Cicconi, chief people officer at Google. Accenture's report highlights just that: technology alone will not drive AI-driven growth. ... That said, managers will have to roll up their sleeves, upskill, and adapt to AI and emerging technologies that benefit their teams and align with organizational objectives. To fully realize the potential of AI, businesses will need to prioritize human-AI collaboration.


Managing Risk: Is Your Data Center Insurance up to the Test?

E&O policies generally protect against liability to third parties for losses arising from the insured’s errors and omissions in performing “professional services.” ... Cyber coverage typically protects against a broad range of first-party losses and liability claims arising from various causes, including data breaches and other disclosures of non-public information. A data center that processes data owned by third parties plainly has liability exposure to such parties if their non-public information is disclosed as a result of the data center’s operations. But even if a data center is processing only its own company’s data, it still has liability exposure, including for disclosure of non-public information belonging to its customers and employees. Given the often-substantial costs of defending data breach claims, data center operators would be well-advised to (1) review their cyber policies carefully for exclusions or limitations that potentially could apply to their liability coverage under circumstances particular to their operations and (2) purchase cyber liability limits commensurate with the amount and sensitivity of non-public data in their possession.


Attribution as the foundation of developer trust

With the need for more trust in AI-generated content, it is critical to credit the author or subject matter expert and the larger community who created and curated the content surfaced by an LLM. This also ensures LLMs use the most relevant and up-to-date information and content, ultimately providing the Rosetta Stone a model needs to build trust in its sources and the resulting decisions. All of our OverflowAPI partners have enabled attribution through retrieval-augmented generation (RAG). For those who may not be familiar with it, retrieval-augmented generation is an AI framework that combines generative large language models (LLMs) with traditional information retrieval systems to update answers with the latest knowledge in real time, without requiring models to be re-trained. Generative AI technologies are powerful but limited by what they "know", that is, the data they have been trained on. RAG helps solve this by pairing information retrieval with carefully designed system prompts that enable LLMs to provide relevant, contextual, and up-to-date information from an external source. In instances involving domain-specific knowledge, RAG can drastically improve the accuracy of an LLM's responses.
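A minimal sketch of the RAG-with-attribution flow described above, with `search_index` and `call_llm` as hypothetical stand-ins for a real retrieval index and a real LLM API (this is not the OverflowAPI): retrieved sources are injected into the prompt, and their URLs are cited back to the reader.

```python
# Minimal RAG-with-attribution sketch. `search_index` and `call_llm` are
# hypothetical stand-ins for a real vector search and a real LLM API.
def search_index(query: str) -> list[dict]:
    # Placeholder retrieval: a real system queries a vector/keyword index.
    return [
        {"url": "https://stackoverflow.com/q/12345",
         "text": "Accepted answer text..."},
    ]

def call_llm(prompt: str) -> str:
    # Placeholder for an actual LLM API call.
    return "Generated answer grounded in the retrieved sources."

def answer_with_attribution(question: str) -> str:
    docs = search_index(question)  # retrieval step: fetch current, relevant sources
    context = "\n\n".join(f"[{i+1}] {d['text']}" for i, d in enumerate(docs))
    prompt = (
        "Answer using ONLY the sources below and cite them as [n].\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    answer = call_llm(prompt)  # generation step, grounded in retrieved context
    sources = "\n".join(f"[{i+1}] {d['url']}" for i, d in enumerate(docs))
    return f"{answer}\n\nSources:\n{sources}"

print(answer_with_attribution("How do I parse JSON in Python?"))
```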


Measurement Challenges in AI Catastrophic Risk Governance and Safety Frameworks

The current definition of catastrophic events, focusing on "large-scale devastation... directly caused by an AI model," overlooks critical aspects: indirect causation and salient contributing causes. Indirect causation refers to cases where AI plays a pivotal but not immediately apparent role. For instance, the development and deployment of advanced AI models could trigger an international AI arms race, becoming a salient contributor to increased geopolitical instability or conflict. A concrete example might be AI-enhanced cyber warfare capabilities leading to critical infrastructure failures across multiple countries. AI systems might also amplify existing systemic risks or introduce new vulnerabilities that become salient contributing causes of a catastrophic event. The current narrow scope of AI catastrophic events may lead to underestimating the full range of potential catastrophic outcomes associated with advanced AI models, particularly those arising from complex interactions between AI and other sociotechnical systems. This could include scenarios where AI exacerbates climate change through increased energy consumption, or where AI-powered misinformation campaigns gradually erode trust in democratic institutions and social order.



Quote for the day:

"Facing difficult circumstances does not determine who you are. They simply bring to light who you already were." -- Chris Rollins
