Tech Bytes - Daily Digest: Daily Tech Digest

Intel investigating breach after 20GB of internal documents leak online

US chipmaker Intel is investigating a security breach after 20 GB of internal documents, some marked "confidential" or "restricted secret," were uploaded earlier today to file-sharing site MEGA. The data was published by Till Kottmann, a Swiss software engineer, who said he received the files from an anonymous hacker who claimed to have breached Intel earlier this year. Kottmann received the Intel leaks because he manages a very popular Telegram channel where he regularly publishes data that accidentally leaked online from major tech companies through misconfigured Git repositories, cloud servers, and online web portals. The Swiss engineer said today's leak represents the first part of a multi-part series of Intel-related leaks. ZDNet reviewed the content of today's files with security researchers who have previously analyzed Intel CPUs, who deemed the leak authentic but didn't want to be named in this article due to ethical concerns about reviewing confidential data, and because of their ongoing relations with Intel. Per our analysis, the leaked files contained Intel intellectual property relating to the internal design of various chipsets.

Data Prep for Machine Learning: Normalization

Preparing data for use in a machine learning (ML) system is time-consuming, tedious, and error-prone. A reasonable rule of thumb is that data preparation requires at least 80 percent of the total time needed to create an ML system. There are three main phases of data preparation: cleaning; normalizing and encoding; and splitting. Each of the three phases has several steps. A good way to understand data normalization and see where this article is headed is to take a look at the screenshot of a demo program. The demo uses a small text file named people_clean.txt where each line represents one person. There are five fields/columns: sex, age, region, income, and political leaning. The "clean" in the file name indicates that the data has been standardized by removing missing values and editing bad data so that all lines have the same format, but numeric values have not yet been normalized. The ultimate goal of a hypothetical ML system is to use the demo data to create a neural network model that predicts political leaning from sex, age, region, and income. The demo analyzes the age and income predictor fields, then normalizes those two fields using a technique called min-max normalization. The results are saved as a new file named people_normalized.
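
As a rough illustration of min-max normalization, here is a minimal Python sketch. It assumes the usual formula x' = (x - min) / (max - min), which maps each column onto the range 0 to 1. The function name and the sample ages and incomes are made up for illustration; this is not the article's demo program and does not read people_clean.txt.

```python
def min_max_normalize(values):
    """Scale a list of numbers to the range [0, 1] with min-max normalization."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to 0.0 (an assumed convention).
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical sample columns, not taken from the article's data file.
ages = [25, 40, 55]
incomes = [30000.0, 60000.0, 90000.0]

print(min_max_normalize(ages))     # [0.0, 0.5, 1.0]
print(min_max_normalize(incomes))  # [0.0, 0.5, 1.0]
```

After scaling, age and income land on the same 0-to-1 range, so the much larger raw income values no longer dominate the age values when both are fed to a model.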

Microsoft Teams Patch Bypass Allows RCE

While Microsoft tried to cut off this vector as a conduit for remote code execution by restricting the ability to update Teams via a URL, it was not a complete fix, the researcher explained. “The updater allows local connections via a share or local folder for product updates,” Jayapaul said. “Initially, when I observed this finding, I figured it could still be used as a technique for lateral movement, however, I found the limitations added could be easily bypassed by pointing to an…SMB share.” Server Message Block (SMB) protocol is a network file sharing protocol. To exploit this, an attacker would need to drop a malicious file into an open shared folder – something that typically involves already having network access. However, to reduce this gating factor, an attacker can create a remote rather than local share. “This would allow them to download the remote payload and execute rather than trying to get the payload to a local share as an intermediary step,” Jayapaul said. Trustwave has published a proof-of-concept attack that uses Microsoft Teams Updater to download a payload – using known, common software called Samba to carry out remote downloading.

Federated learning improves how AI data is managed, thwarts data leakage

Researchers believe a shift in the way data is managed could allow more information to reach learning algorithms outside of a single institution, which could benefit the entire system. Penn Medicine researchers propose using a technique called federated learning that would allow users to train an algorithm across multiple decentralized data sources without having to actually exchange the data sets. Federated learning works by training an algorithm across many decentralized edge devices, as opposed to running an analysis on data uploaded to one server. "The more data the computational model sees, the better it learns the problem, and the better it can address the question that it was designed to answer," said Spyridon Bakas, an instructor in the Perelman School of Medicine at the University of Pennsylvania, in a press release. Bakas is lead author of a study on the use of federated learning in medicine that was published in the journal Scientific Reports. "Traditionally, machine learning has used data from a single institution, and then it became apparent that those models do not perform or generalize well on data from other institutions," Bakas said.
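
To make the idea concrete, here is a toy Python sketch of one common federated scheme, federated averaging. The excerpt does not say which algorithm the Penn Medicine study used, so treat this as an assumption: the one-parameter model y ≈ w * x, the function names, and the two "sites" with their data are all hypothetical. The point it shows is that only model weights travel to the coordinator, never the raw records.

```python
def local_update(w, data, lr=0.01, epochs=20):
    """Run gradient descent for y ≈ w * x on one site's private data."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(w_global, sites):
    """Average the locally trained weights, weighted by each site's sample count."""
    total = sum(len(data) for data in sites)
    return sum(local_update(w_global, data) * len(data) for data in sites) / total


# Two hypothetical institutions; each keeps its (x, y) records local.
sites = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0), (5.0, 10.0)],
]

w = 0.0
for _ in range(10):
    w = federated_round(w, sites)

print(round(w, 3))  # 2.0
```

Both sites follow y = 2x, so the shared weight settles near 2 even though neither site ever sees the other's data.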

10 Tools You Should Know As A Cybersecurity Engineer

Wireshark is the world’s best network analyzer tool. It is open-source software that enables you to inspect real-time data on a live network. Wireshark can dissect packets of data into frames and segments, giving you detailed information about the bits and bytes in a packet. Wireshark supports all major network protocols and media types. Wireshark can also be used as a packet sniffing tool if you are in a public network. Wireshark will have access to the entire network connected to a router. ... Netcat is a simple but powerful tool that can view and record data on TCP or UDP network connections. Netcat functions as a back-end listener that allows for port scanning and port listening. You can also transfer files through Netcat or use it as a backdoor to your victim machine. This makes it a popular post-exploitation tool to establish connections after successful attacks. Netcat is also extensible given its capability to add scripting for larger or redundant tasks. In spite of the popularity of Netcat, it was not maintained actively by its community. The Nmap team built an updated version of Netcat called Ncat with features including support for SSL, IPv6, SOCKS, and HTTP proxies.

Hey software developers, you’re approaching machine learning the wrong way

Unfortunately, lots of folks who set out to learn Machine Learning today have the same experience I had when I was first introduced to Java. They’re given all the low-level details up front — layer architecture, back-propagation, dropout, etc — and come to think ML is really complicated and that maybe they should take a linear algebra class first, and give up. That’s a shame, because in the very near future, most software developers effectively using Machine Learning aren’t going to have to think or know about any of that low-level stuff. Just as we (usually) don’t write assembly or implement our own TCP stacks or encryption libraries, we’ll come to use ML as a tool and leave the implementation details to a small set of experts. At that point — after Machine Learning is “democratized” — developers will need to understand not implementation details but instead best practices in deploying these smart algorithms in the world. ... What makes Machine Learning algorithms distinct from standard software is that they’re probabilistic. Even a highly accurate model will be wrong some of the time, which means it’s not the right solution for lots of problems, especially on its own. Take ML-powered speech-to-text algorithms: it might be okay if occasionally, when you ask Alexa to “Turn off the music,” she instead sets your alarm for 4 AM.

Garmin Reportedly Paid a Ransom

WastedLocker, a ransomware strain that reportedly shut down Garmin's operations for several days in July, is designed to avoid security tools within infected devices, according to a technical analysis from Sophos. In June and July, several research firms published reports on WastedLocker, noting that the ransomware appears connected to the Evil Corp cybercrime group, originally known for its use of the Dridex banking Trojan. "Because WastedLocker has no known security vulnerabilities in how it performs its encryption, it's unlikely that Garmin obtained a working decryption key that fast in any other way but by paying the ransom," Chris Clements, vice president of solutions architecture for Cerberus Sentinel, tells ISMG. Fausto Oliveira, principal security architect at the security firm Acceptto, adds: "What I believe happened is that Garmin was unable to recover their services in a timely manner. Four days of disruption is too long if they are using any reliable type of backup and restore mechanisms. That might have been because their disaster recovery backup strategy failed or the invasion was to the extent that backup sources were compromised as well."

Splicing a Pause Button into Cloud Machines

Splice Machine was born in the days of Hadoop, and uses some of the same underlying data processing engines that were distributed in that platform. But Splice Machine has surpassed the capabilities of that earlier platform by ensuring tight integration with those engines in support of its customers' enterprise AI initiatives, not to mention elastic scaling via Kubernetes. The way that Splice Machine engineered HBase (for storage) and Spark (for analytics), and its enablement of ACID capabilities for SQL transactions, are core differentiating factors that weigh in Splice Machine’s favor for being a platform on which to build real-time AI applications, according to Zweben. “Doing table scans as the basis of an analytical workload is abysmally slow in HBase, and so, in Splice Machine, we engineered at a very low level the access to the HBase storage with a wrapper of transactionality around it, so you’re only seeing what’s been committed in the database based on ACID semantics,” Zweben explained. “That goes under the cover at a very well-engineered level, looking at the HBase storage and grabbing that into Spark dataframes,” he continued. “We’ve engineered tightly integrated connectivity for performance. ...”

How Synthetic Data Accelerates Coronavirus Research

To access data at the speed required while also respecting the privacy and governance needs of patient data, Washington University at St. Louis, Jefferson Health in Philadelphia, and other healthcare organizations have opted for an alternative, using something called synthetic data. Gartner defines synthetic data as data that is "generated by applying a sampling technique to real-world data or by creating simulation scenarios where models and processes interact to create completely new data not directly taken from the real world." Here's how Payne describes it: "We can take a set of data from real world patients but then produce a synthetic derivative that statistically is identical to those patients' data. You can drill down to the individual row level and it will look like the data extracted from the EHR (electronic health record), but there's no mutual information that connects that data to the source data from which it is derived." Why is that so important? "From the legal and regulatory and technical standpoint, this is no longer potentially identifiable human subjects' data, so now our investigators can literally watch a training video and get access to the system," Payne said. "They can sign a data use agreement and immediately start iterating through their analysis."

Realtime APIs: Mike Amundsen on Designing for Speed and Observability

For systems to perform as required, data read and write patterns will frequently have to be reengineered. Amundsen suggested judicious use of caching results, which can remove the need to constantly query upstream services. Data may also need to be “staged” appropriately throughout the entire end-to-end request handling process. Examples include caching results and data in localized points of presence (PoPs) via content delivery networks (CDNs), caching in an API gateway, and replicating data stores across availability zones (local data centers) and globally. For some high transaction throughput use cases, writes may have to be streamed to meet demand, for example by writing data locally or via a high throughput distributed logging system like Apache Kafka, for writing to an external data store at a later point in time. Engineers may have to “rethink the network” (respecting the eight fallacies of distributed computing), and design their cloud infrastructure to follow best practices relevant to their cloud vendor and application architecture. Decreasing request and response size may also be required to meet demands. This may be engineered in tandem with the ability to increase the message volume.
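
The caching suggestion can be sketched as a small time-to-live (TTL) cache in Python. This is only an illustration under stated assumptions: the TTLCache class, the slow_upstream function, and the 30-second TTL are hypothetical and do not come from Amundsen's talk. In practice this role is often played by a CDN, an API gateway, or a dedicated cache service.

```python
import time


class TTLCache:
    """Cache upstream results for a fixed time so repeated reads skip the upstream call."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch      # function that queries the upstream service
        self._ttl = ttl_seconds  # how long a cached value stays fresh
        self._clock = clock      # injectable clock, useful for testing
        self._entries = {}       # key -> (expiry time, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]      # still fresh: serve from the cache
        value = self._fetch(key)  # missing or expired: query upstream
        self._entries[key] = (now + self._ttl, value)
        return value


calls = []


def slow_upstream(key):
    """Stand-in for an expensive upstream query (hypothetical)."""
    calls.append(key)
    return key.upper()


cache = TTLCache(slow_upstream, ttl_seconds=30.0)
cache.get("user")
cache.get("user")
print(len(calls))  # 1
```

The second read is served locally, so the upstream service is queried once. The TTL bounds how stale a response can be, which is the trade-off to tune for each endpoint.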

Quote for the day:

"The secret of leadership is simple: Do what you believe in. Paint a picture of the future. Go there. People will follow." -- Seth Godin

Daily Tech Digest - August 07, 2020