Episode 22 — Data Source Assessment and Collection: decide what to collect and prioritize

In this episode, we’re going to get very practical about how a SOC (security operations center) decides what data to collect, because collecting the right signals is the first real step toward seeing threats instead of guessing at them. New learners often assume that the best approach is to gather everything from everywhere, but that mindset quickly runs into hard limits: cost, storage, performance, privacy, and sheer human attention. The more important idea is that data collection is a set of tradeoffs you can explain and defend, where you choose sources that answer the questions you will actually face when something suspicious happens. When you make those choices well, alerts become easier to trust, investigations become faster, and the SOC spends less time chasing shadows. This review-style lesson is meant to help you build a simple, repeatable way of thinking so you can prioritize data sources with confidence, even when you are not a specialist in every system you monitor.

A good starting point is to understand what a data source really is in a monitoring context, because that word can mean different things to different people. A data source can be a log stream from an operating system, a record of authentication activity from an identity service, network flow information from a router, or application events from a business system. What makes it useful is not the format, but the fact that it describes an action or a state change that matters to security, like a login attempt, a permission change, a new process starting, or a connection to an unfamiliar destination. Data sources also have different levels of detail, where some are high-level summaries and others are deep, granular events that reveal exactly what happened. That difference matters because deep data can be powerful for investigations, but it can also be expensive to store and difficult to analyze if you do not have a clear purpose for it. When you review data source assessment, keep the focus on what questions the data can answer and how reliably it answers them, because that keeps the process grounded.

The most beginner-friendly way to prioritize collection is to connect data to outcomes, which means deciding what you are trying to detect and what you are trying to prove. Detection is about noticing that something might be wrong, like a suspicious login pattern or an unusual sequence of access requests. Proof is about being able to reconstruct what happened well enough to make a confident decision, like confirming whether an account was actually abused or whether a system was truly altered. Some sources are great for detection but weak for proof, such as a simple alert that says something looked odd without showing the underlying activity. Other sources are great for proof but not ideal for detection, such as detailed system traces that you would only examine after you already suspect a problem. A strong collection plan balances these needs, because you want early warning without sacrificing the ability to investigate. This balance is also where many SOC teams struggle, because they collect large volumes of data that look impressive but do not align with how investigations actually unfold.

Another key concept is coverage, which is about whether your data sources reflect the systems that matter most to the organization. If you monitor a handful of less important systems very deeply but ignore the systems that handle identity, email, sensitive data, or external access, your coverage will be lopsided and your risk will remain high. Coverage also includes where attacks commonly start and where they commonly move, because many intrusions begin with identity abuse, phishing, or exposed services, and then expand through internal movement. For beginners, it helps to think of coverage as watching the doors, hallways, and valuables, rather than placing all your cameras in a quiet storage room. Identity activity, endpoint activity, and network activity tend to be foundational because they describe how access happens, how actions happen, and how systems communicate. Business-critical applications and cloud services often add the context that turns a technical event into a meaningful business impact. When you assess data sources, you are essentially asking whether you are collecting information from the places where a real incident would leave footprints.
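The doors-and-hallways idea can be made concrete with a minimal sketch. This is an illustration only, with entirely made-up system names: it compares the assets that matter most against the systems actually sending logs, which is the essence of a coverage gap check.

```python
# Hypothetical asset and source names, for illustration only.
critical_assets = {"idp-server", "mail-gateway", "customer-db", "vpn-edge"}
monitored_sources = {"idp-server", "file-share", "test-web-01"}

# Critical systems with no telemetry: places where a real incident
# would leave no footprints.
gaps = critical_assets - monitored_sources

# Monitored systems that are not critical: candidates to deprioritize.
low_value = monitored_sources - critical_assets

print(sorted(gaps))       # the lopsided part of the coverage picture
print(sorted(low_value))  # effort that may be better spent elsewhere
```

Even a two-set comparison like this is enough to surface a lopsided coverage picture before any tooling discussion starts.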

Data source assessment also includes understanding signal-to-noise, because not all logs are equally informative. Some sources produce huge volumes of routine events that rarely matter, and if you ingest them without filtering or purpose, you can flood your systems and your analysts with clutter. Other sources produce fewer events but have high investigative value, like administrative actions, authentication failures, or configuration changes. Noise is not just a matter of volume, because a low-volume source can still be noisy if it produces lots of confusing or inconsistent fields. Signal is also not just a matter of severity, because a low-severity event can become high-signal when it appears in a suspicious sequence, such as multiple small signs of probing or privilege discovery. The goal is not to eliminate noise completely, because some noise is unavoidable, but to avoid paying high costs for data that rarely improves decisions. A strong collection plan looks for sources that are both reliable and meaningful, and it is willing to treat some sources as optional until a specific use case justifies them.
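The idea of paying only for data that improves decisions can be sketched as a simple ingest filter. The event types and the policy below are illustrative assumptions, not a standard taxonomy; the point is that a deliberate, explainable rule sits between the source and your storage.

```python
# Toy ingest policy: keep event types with high investigative value,
# drop routine noise. These type names are made up for illustration.
HIGH_VALUE = {"admin_action", "auth_failure", "config_change"}

def is_high_signal(event: dict) -> bool:
    """Return True when an event is worth ingesting under this toy policy."""
    return event.get("type") in HIGH_VALUE

events = [
    {"type": "heartbeat"},                           # routine, rarely matters
    {"type": "auth_failure", "user": "alice"},       # high investigative value
    {"type": "dns_query"},                           # huge volume, low signal alone
    {"type": "config_change", "target": "firewall"}, # high investigative value
]
kept = [e for e in events if is_high_signal(e)]
```

A real pipeline would be far more nuanced, for example keeping low-severity events that appear in suspicious sequences, but the habit of writing the rule down is what makes the tradeoff defensible.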

A simple way to make collection decisions defensible is to use prioritization categories, even if you never write them down as formal labels. One category is must-have sources, which are the streams you need for basic visibility and investigation, like authentication, endpoint activity, and key network or gateway telemetry. Another category is high-value sources, which strongly improve detection or investigation for important systems, like administrative changes in cloud environments or access logs for sensitive data platforms. A third category is situational sources, which might be turned on for specific risk periods, specific projects, or specific incident types, such as deeper application logs during a major system migration. A final category is nice-to-have sources, which might be interesting but do not justify the cost or complexity yet, especially for a beginner program. Thinking in categories helps you avoid a flat list where everything seems equally important. It also supports a realistic growth plan, where you build a solid foundation first and expand based on actual needs rather than ambition.
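One way to make these categories repeatable is a crude value-versus-cost score. The source names, numbers, and thresholds below are assumptions for demonstration, not a recommended scale; the takeaway is that a simple scoring rule beats a flat list where everything looks equally important.

```python
# Toy prioritization: score each candidate source on investigative value
# (0-10) minus ingestion cost (0-10), then bucket it into the categories
# described above. All names, weights, and thresholds are illustrative.
def categorize(source: str, value: int, cost: int) -> tuple:
    score = value - cost  # crude, but easy to explain and defend
    if score >= 6:
        category = "must-have"
    elif score >= 3:
        category = "high-value"
    elif score >= 0:
        category = "situational"
    else:
        category = "nice-to-have"
    return source, category

plan = [categorize(*s) for s in [
    ("authentication logs", 9, 2),
    ("cloud admin changes", 8, 3),
    ("deep app traces", 7, 6),
    ("debug logs", 3, 5),
]]
```

Running this yields authentication logs as must-have and debug logs as nice-to-have, which matches the foundation-first growth plan the categories are meant to support.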

When you move from assessment to collection, a key idea is that data collection is a pipeline, not a single action. Data has to be generated, captured, transported, stored, and then made searchable, and each step can change the quality and trustworthiness of what you receive. For example, if a system generates logs but they are not enabled correctly, you might get partial events that miss key fields, which makes them less useful. If transport is unreliable or delayed, you might receive events too late to support timely response, which changes the nature of your monitoring from near-real-time to historical review. If storage settings compress or discard important fields, you might lose the details you need during an investigation. If parsing and normalization are inconsistent, you might not be able to correlate events across sources, which makes patterns hard to spot. Thinking of collection as a pipeline helps you ask the right questions, because you stop assuming that turning on a log source automatically gives you usable security visibility.
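A minimal sketch of per-event quality checks, assuming hypothetical field names and a five-minute freshness threshold, shows how "the log source is on" differs from "the data is usable":

```python
# Sketch of pipeline quality checks: an event that arrives late or with
# missing fields is less useful than it looks. Field names and the
# five-minute threshold are illustrative assumptions.
from datetime import datetime, timedelta, timezone

REQUIRED_FIELDS = {"timestamp", "host", "action"}
MAX_DELAY = timedelta(minutes=5)

def assess(event: dict, received_at: datetime) -> list:
    """Return a list of quality problems found; an empty list means usable."""
    problems = []
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        problems.append(f"missing fields: {sorted(missing)}")
    ts = event.get("timestamp")
    if ts and received_at - ts > MAX_DELAY:
        problems.append("arrived too late for near-real-time use")
    return problems

now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
ok = assess({"timestamp": now, "host": "web-01", "action": "login"}, now)
late = assess({"timestamp": now - timedelta(hours=2),
               "host": "web-01", "action": "login"}, now)
```

The `late` event still exists and is still searchable, but it has silently changed your monitoring from near-real-time to historical review, which is exactly the kind of pipeline question this paragraph is asking you to raise.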

A beginner-friendly review point is the difference between events, logs, and telemetry, because these words often get mixed together. An event is a single recorded occurrence, like a login success, a process start, or a permission change. A log is a record format that stores events, often grouped in files or streams, and it can include structured fields or plain text. Telemetry is a broader word that includes logs but can also include summaries, measurements, and flow records, such as network flow data that describes who talked to whom and how much. This matters because different telemetry types answer different questions, and they have different costs and benefits. Network flow data, for example, can be excellent for seeing broad communication patterns without capturing content, which can help with privacy and volume, while detailed packet content can be much heavier and more sensitive. Understanding these differences helps you make smarter collection tradeoffs because you can choose the least intrusive, least expensive telemetry that still answers your security questions. That is a strong habit for both exam thinking and real monitoring design.
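The contrast can be shown with two toy records: a granular event and a flow summary. The field names here are illustrative assumptions, though real flow formats such as NetFlow and IPFIX carry similar who-talked-to-whom fields without any content.

```python
# Illustration of two telemetry shapes, with made-up field names.
event = {                 # one recorded occurrence, rich in detail
    "type": "process_start",
    "host": "web-01",
    "process": "powershell.exe",
    "parent": "winword.exe",
}

flow = {                  # a summary record: cheap, content-free
    "src": "10.0.0.5",
    "dst": "203.0.113.9",
    "bytes_out": 48_200_000,
    "duration_s": 310,
}

def flow_rate_mbps(f: dict) -> float:
    """Average outbound rate in megabits per second for a flow record."""
    return f["bytes_out"] * 8 / f["duration_s"] / 1_000_000

rate = flow_rate_mbps(flow)
```

The flow record cannot tell you what was inside that transfer, but a sustained outbound rate to an unfamiliar destination is often enough to justify pulling the heavier, more sensitive telemetry afterward.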

Collection decisions should also account for trust, because not every source is equally hard to tamper with or equally reliable under attack. Some logs live on the same system that an attacker might compromise, which means an attacker might delete or alter them if they gain control. Other logs can be shipped off-system quickly, which improves integrity because it reduces the attacker’s ability to rewrite history. The timing of collection matters, because sending events frequently and buffering safely reduces the chance of losing critical evidence during an outage or compromise. The system that receives the logs must also be protected, because central collection becomes a high-value target and a single point of failure if it is not designed carefully. For beginners, it is helpful to think of logs like security camera footage, where keeping the recordings only on the camera itself is riskier than storing them somewhere protected. A good collection plan takes this into account by favoring sources and pipelines that preserve integrity and availability.
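The camera-footage analogy can be sketched with a toy buffer-and-ship model, where anything already flushed to a remote collector survives even if the host's local copy is wiped. This is a simplified illustration of the principle, not a real forwarder.

```python
# Toy model: events buffered on a host are lost if an attacker wipes the
# host, while copies already shipped to a remote collector survive.
local_buffer = ["ev1", "ev2", "ev3"]
remote_store = []

def flush(buffer: list, store: list) -> None:
    """Ship buffered events off-system, then clear the local buffer."""
    store.extend(buffer)
    buffer.clear()

flush(local_buffer, remote_store)  # shipped before the compromise
local_buffer.append("ev4")         # generated after the last flush
local_buffer.clear()               # attacker wipes on-host logs

# History up to the last flush is preserved off-system; only ev4 is lost.
```

Flushing more frequently shrinks the window of events an attacker can erase, which is why the timing of collection matters as much as the sources themselves.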

Prioritization also needs to include the human side, because a SOC is not collecting data for storage; it is collecting data to support decisions that people must make. If analysts cannot understand the data, or if the data lacks context like user identity, asset ownership, or system role, the investigation time grows and the chance of mistakes increases. Context often comes from enrichment sources, such as identity directories, asset inventories, or business ownership records, and those can be as important as the raw security logs themselves. It is common for beginners to focus on technical logs and forget that security decisions are often about impact and intent, which require business context. Even a simple event becomes more meaningful when you know whether the system is a test server or a critical database, and whether the account is a normal user or a privileged administrator. When you review collection, include the question of what context is needed to make the raw events interpretable. That focus helps you build a collection plan that produces usable information rather than a confusing pile of records.
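A sketch of enrichment, using hypothetical asset and account lookup tables, shows how context changes the reading of an otherwise plain event:

```python
# Hypothetical lookup tables; in practice these come from an asset
# inventory and an identity directory.
ASSETS = {
    "db-prod-01": {"role": "customer database", "criticality": "high"},
    "lab-07":     {"role": "test server",       "criticality": "low"},
}
ACCOUNTS = {
    "jsmith":  {"privileged": False},
    "svc-dba": {"privileged": True},
}

def enrich(event: dict) -> dict:
    """Attach asset and account context to a raw event."""
    out = dict(event)
    out["asset"] = ASSETS.get(event["host"],
                              {"role": "unknown", "criticality": "unknown"})
    out["account"] = ACCOUNTS.get(event["user"], {"privileged": None})
    return out

e = enrich({"host": "db-prod-01", "user": "svc-dba", "action": "login"})
```

The raw event is just a login; the enriched event is a privileged account touching a high-criticality database, which is a very different triage decision.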

A strong spaced-review takeaway is that prioritization should be risk-based, not convenience-based, because convenience tends to create blind spots. It is easy to collect what is already accessible, what produces clean logs, or what a vendor enables by default, but those are not always the sources tied to the highest risk. Risk-based prioritization looks at what would hurt the organization most if compromised, what is most exposed, and what is most likely to be abused, especially through identity and access pathways. It also considers where detection gives the most leverage, such as monitoring administrative actions, changes to security controls, and authentication anomalies. For beginners, it can help to remember that attackers often try to look normal, so the most valuable sources are the ones that reveal subtle deviations in access, privilege, and behavior. Convenience-based collection often creates a false sense of coverage, because it looks like you have many sources, but the sources do not align with the threats that matter. Risk-based thinking keeps your collection plan purposeful and defensible.
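Risk-based ordering can be illustrated with a toy score that multiplies impact, exposure, and likelihood of abuse. The weights and source names are assumptions for demonstration, not a methodology; the point is that the ranking comes from risk, not from how easy a source is to collect.

```python
# Toy risk weighting (each factor on a 1-5 scale, all values illustrative):
# prioritize sources by the risk of what they watch, not by convenience.
def risk_score(impact: int, exposure: int, abuse_likelihood: int) -> int:
    return impact * exposure * abuse_likelihood

candidates = {
    "identity provider": risk_score(5, 4, 5),  # high impact, exposed, abused
    "internal wiki":     risk_score(2, 2, 2),
    "legacy printer":    risk_score(1, 3, 1),  # easy to collect, low risk
}
ranked = sorted(candidates, key=candidates.get, reverse=True)
```

Note that the printer might have the cleanest, most accessible logs of the three, and it still lands last, which is the whole argument against convenience-based collection.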

Finally, remember that data collection is not a one-time decision, because environments change and threats change, and collection plans must be revisited with new knowledge. New systems get deployed, business processes shift, and attackers adapt, which means the SOC must periodically reassess what data is still valuable and what gaps have emerged. Feedback from investigations is one of the best drivers of improvement, because every incident or near-miss reveals which sources were helpful and which sources were missing or confusing. Changes should be deliberate, because adding new sources without understanding their impact can create cost spikes, performance issues, or analyst overload. This is why strong SOC teams treat collection as a living program with clear priorities and a habit of continuous improvement rather than a chaotic expansion. When you approach the exam, the most important idea is that deciding what to collect is an exercise in reasoning, where you connect threats, coverage, data quality, and human decision-making. If you can explain why a source matters, how it supports detection and investigation, and what tradeoffs it introduces, you are thinking like someone who can build reliable monitoring rather than someone who just gathers data.
