Section 1: Summarizing 2023 Bad Bot Report by Imperva

The Imperva Bad Bot Report offers an incisive analysis of daily automated attacks that evade traditional detection methods. Based on extensive data, including 6 trillion blocked bad bot requests across thousands of domains, the report is a testament to the evolving landscape of cybersecurity threats.
From the early days of the EarthLink Spammer discovered in 2000 [1], to today's multifarious bot threats, these automated programs have become tools of multinational criminal enterprises, threatening the security and functionality of diverse industries.
Bots, both benign and malicious, now form a significant portion of internet traffic. Good bots assist in indexing websites for search engines and monitoring website performance. In contrast, bad bots engage in harmful activities like data scraping, scalping, DDoS attacks, and fraud, and on social media platforms they often emulate human behavior.
The influence of bots on internet traffic is staggering: nearly half of all traffic in 2022 was attributed to bots. The sophistication of these bots, especially in evasion techniques, underscores the need for advanced detection and mitigation strategies.

Bot Influence Across Industries

According to Imperva's report, bots made up 47.4% of all internet traffic in 2022: bad bots accounted for 30.2% of all traffic and good bots for 17.3%, with humans making up the rest. There has been a consistent year-over-year increase in bad bot traffic, accompanied by a notable decrease in human traffic compared to previous years. The pervasive influence of bots extends beyond mere numbers. These automated entities are reshaping the digital landscape. From swaying public opinion and altering perceived popularity to disseminating misinformation, their impact is widespread and multifaceted. Consider these real-world scenarios that highlight the tangible, often costly, consequences of bad bot activities:
- Gaming: Bots disrupt online games through account takeovers and automated farming of virtual currency or experience points.
- Telecom and ISPs: Bots scrape sensitive data and overload networks, hampering companies' infrastructure.
- Community and Society: Spam bots spread misinformation and malicious links.
- Computing and IT: Bots launch DDoS attacks.
- Travel: Bots scrape travel websites, take over accounts, and abuse the business logic of booking engines.
- Retail: Bots scrape pricing and product information, and perform gift card abuse and credit card fraud.
- Financial Services: Bots execute account takeovers, credit card fraud and arbitrage operations in crypto exchanges.
- Healthcare: Bots breach sensitive data, including medical records and health information. Bots can also launch DDoS attacks, disrupting services and communication. Additionally, they spread misinformation, which can lead to improper treatment.

Safeguarding Businesses from Bots and Online Fraud

For initial steps in bot detection, solutions like Botometer and Bot Sentinel are notable. These tools rely on human-generated labels based on publicly available data, analyzing factors like account names, activity frequency, and social media interactions. However, this approach, as highlighted by Yoel Roth from Twitter [2], has limitations, including potential biases and a failure to capture the diverse ways humans interact with the internet. Combating advanced bots requires a combination of machine learning, knowledge of the product and of how users interact with it, and cybersecurity best practices. Imperva lists the following best practices as a starting point for combating bots:
- Risk Identification: Recognize potential bot hotspots. For instance, e-commerce sites launching limited-edition products or ticketing platforms with high-demand events can be prime targets. Key functionalities like login pages or checkout forms are susceptible to various bot attacks, including credential stuffing and card fraud.
- Vulnerability Reduction: Protect more than just your website. APIs and mobile apps are equally at risk. Sharing blocking information across systems is crucial for a robust defense.
- Threat Reduction through User-Agents: Many bots use outdated browser versions, unlike humans who regularly update theirs. Implementing blocks on browsers beyond their end-of-life can be an effective strategy.
- Combatting Proxy-Based Bots: Bots often use proxy services to mask their activities. Restricting access from known bulk IP providers can reduce bot traffic.
- Automation Tool Detection: Tools like Selenium and WebDriver are common in bot operations. Identifying and blocking these can curb bot activities.
- Traffic Evaluation: Analyze traffic patterns for irregularities, such as spikes in bounce rates, conversion rates, or specific URL requests, which can indicate bot activities.
- Monitoring Traffic: Establish baselines for login attempts and monitor deviations. For e-commerce, watch for unusual patterns in checkout and gift card validations, which might signal fraudulent activities.
- Stay Informed about Data Breaches: Awareness of global data breaches is vital. Compromised credentials are often exploited by bots in stuffing attacks and account takeovers.
- Evaluating and Implementing Bot Mitigation Solutions: Simple tweaks are no longer sufficient. The evolving sophistication of bots demands advanced, continuously updated defenses. A layered approach, including user profiling and fingerprinting, is essential to distinguish between beneficial and harmful bot activities.
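
A few of the practices above (blocking end-of-life browsers, spotting automation tools, and watching login baselines) can be sketched as simple rule-based checks. The snippet below is a minimal illustration, not Imperva's implementation: the version cutoffs in `MIN_SUPPORTED_MAJOR`, the marker list, and the z-score threshold are all assumed values chosen for the example, and User-Agent strings are trivially spoofable, so checks like these would only ever be one layer among several.

```python
import re
from statistics import mean, stdev

# Hypothetical minimum supported major versions; real cutoffs would come
# from each browser vendor's end-of-life schedule.
MIN_SUPPORTED_MAJOR = {"Chrome": 100, "Firefox": 100}

# Substrings that some automation tools send in the User-Agent by default.
# Illustrative only: a bot author can change these at will.
AUTOMATION_MARKERS = ("HeadlessChrome", "PhantomJS", "python-requests")

_BROWSER_RE = re.compile(r"(Chrome|Firefox)/(\d+)")


def is_outdated_browser(user_agent: str) -> bool:
    """True if the UA claims a browser version below the assumed minimum."""
    match = _BROWSER_RE.search(user_agent)
    if match is None:
        return False  # unknown client: leave the decision to other signals
    name, major = match.group(1), int(match.group(2))
    return major < MIN_SUPPORTED_MAJOR[name]


def has_automation_marker(user_agent: str) -> bool:
    """True if the UA contains a default marker of a known automation tool."""
    return any(marker in user_agent for marker in AUTOMATION_MARKERS)


def is_login_spike(baseline_counts, current_count, z_threshold=3.0) -> bool:
    """Flag a login-attempt count far above the historical baseline.

    baseline_counts: login attempts per interval during normal operation
    (at least two values). Uses a plain z-score, which is an assumption
    of this sketch rather than a method named in the report.
    """
    mu = mean(baseline_counts)
    sigma = stdev(baseline_counts)
    if sigma == 0:
        return current_count > mu
    return (current_count - mu) / sigma > z_threshold
```

In practice, each check would feed a score or a challenge (such as a CAPTCHA) rather than an outright block, since legitimate users sometimes run old browsers and traffic can spike for benign reasons.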

Section 2: Spotlight Research Papers on Bot Detection

In this section, we discuss some practical challenges and scientific advancements in bot detection. Here is a summary of the two research papers picked for this edition:
- TwiBot-22: Towards Graph-Based Twitter Bot Detection (NeurIPS 2022) [4]: This paper proposes a comprehensive graph-based Twitter bot detection benchmark called TwiBot-22, which the authors present as the largest dataset to date, with considerably better annotation quality than existing datasets. The authors re-implemented 35 representative Twitter bot detection baselines and evaluated them on 9 datasets, including TwiBot-22, to promote a fair comparison of model performance and a holistic understanding of research progress. The proposed benchmark provides diversified entities and relations on the Twitter network and is expected to facilitate further research in this area.
The authors of the paper benchmarked feature-based, text-based, and graph-based models for detecting Twitter bots. Graph-based models, which apply network and graph mining techniques to the structure of the Twitter network, are the most advanced: they achieve state-of-the-art performance and help tackle many of the challenges in bot detection, such as bot evolution and generalization. To generate labels for the training data, the authors combined 8 hand-crafted labeling functions with 7 models based on feature engineering and neural networks. For example, one labeling heuristic was the "presence of spam keywords in tweets".
- Real-Time Detection of Robotic Traffic in Online Advertising (IAAI 2023) [3]: This paper proposes a deep neural network model called SLIDR (SLIce-Level Detection of Robots) that detects invalid clicks on online ads in real time. The model is trained with weak supervision: instead of manually labeling each data point, the authors use a combination of heuristics and rules to generate labels for the training data. For example, ad clicks that lead to a high-monetary-value sale are labeled as human clicks and all others as bots. Such labeling approaches, despite being noisy, can work better than unsupervised models in many real-world scenarios. The key features employed in the model include the volume and rate of clicks from a user, distinct sessions from the same IP address by the same user, time of click, and a login flag. The novelty of this paper lies not in the model architecture but in slice-level detection. The authors choose thresholds that jointly optimize all slices of the ad click traffic across devices and ad placements: they fix an overall false positive rate budget and find the thresholds that give the biggest increase in click invalidation rates without raising any slice-level false positive rate too much.
Performance is compared against a click-frequency-based bot flagging approach and a logistic regression trained on the same features.
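
To make the idea of slice-level thresholds concrete, here is a deliberately simplified sketch. It is not the paper's joint optimization: it treats each slice independently and picks the lowest score threshold whose false positive rate, measured on weakly labeled human clicks, stays within a per-slice cap. The function name, the `(slice_name, score, is_human)` input format, and the default cap are hypothetical choices made for this illustration.

```python
from collections import defaultdict


def pick_slice_thresholds(clicks, max_slice_fpr=0.01):
    """Choose one invalidation threshold per traffic slice.

    clicks: iterable of (slice_name, score, is_human) tuples, where score is
    a hypothetical model output (higher = more bot-like) and is_human is a
    weak label. A click is invalidated when score >= threshold.
    Returns {slice_name: threshold}; float("inf") means "flag nothing".
    """
    by_slice = defaultdict(list)
    for slice_name, score, is_human in clicks:
        by_slice[slice_name].append((score, is_human))

    thresholds = {}
    for slice_name, rows in by_slice.items():
        human_scores = [s for s, h in rows if h]
        # Number of weakly labeled human clicks we may wrongly invalidate.
        allowed_fp = int(max_slice_fpr * len(human_scores))
        best = float("inf")
        # Walk candidate thresholds from strictest to loosest. False
        # positives only grow as the threshold drops, so stop at the
        # first candidate that exceeds the cap.
        for t in sorted({s for s, _ in rows}, reverse=True):
            false_pos = sum(1 for s in human_scores if s >= t)
            if false_pos > allowed_fp:
                break
            best = t
        thresholds[slice_name] = best
    return thresholds


# Toy data: two slices with different score distributions.
example_clicks = [
    ("desktop", 0.9, False), ("desktop", 0.8, False),
    ("desktop", 0.7, True), ("desktop", 0.2, True),
    ("mobile", 0.95, False), ("mobile", 0.6, False),
    ("mobile", 0.5, False), ("mobile", 0.4, True),
]
print(pick_slice_thresholds(example_clicks))
# → {'desktop': 0.8, 'mobile': 0.5}
```

Even in this toy case the two slices end up with different thresholds, which is the point of slice-level detection: a single global cutoff would either invalidate human clicks on desktop or miss bot clicks on mobile.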
Each of these studies contributes to a deeper understanding of bot behavior and offers innovative strategies for detection and mitigation. By integrating these scientific insights with practical guidance, such as that in the Imperva Bad Bot Report, we can develop more robust and effective solutions to the challenges posed by malicious bots in various industries.
As we wrap up this edition, I'm excited to announce a new direction for our upcoming newsletters. Starting from the next edition, we will broaden our scope beyond Trust, Safety and Fraud towards a series of real-world industry use cases of machine learning applications. This expansion will include exclusive interviews with industry experts who are at the forefront of applying ML in innovative and impactful ways. These conversations aim to provide you with firsthand insights and practical knowledge from the field, enriching your understanding of how machine learning is shaping various industries. Stay tuned for this exciting journey into the practical world of ML applications, where theory meets practice in the most enlightening ways.