flowchart TD
D["Data Collection<br>Methods"]
D --> S["Surveys<br>Collect new data<br>from people"]
D --> W["Web Scraping<br>Extract data from<br>public web pages"]
D --> A["APIs<br>Request data from<br>systems built to serve it"]
D --> DB["Databases<br>Query data already<br>captured by systems"]
D --> Se["Sensors and IoT<br>Capture physical-world<br>measurements"]
D --> L["Logs and<br>Instrumentation<br>Records produced by<br>running systems"]
D --> Q["Qualitative Methods<br>Interviews, observation,<br>diary studies"]
style D fill:#e3f2fd,stroke:#1976D2
style S fill:#fce4ec,stroke:#AD1457
style W fill:#fff3e0,stroke:#EF6C00
style A fill:#fff8e1,stroke:#F9A825
style DB fill:#e8f5e9,stroke:#388E3C
style Se fill:#ede7f6,stroke:#4527A0
style L fill:#f3e5f5,stroke:#6A1B9A
style Q fill:#eceff1,stroke:#455A64
20 Data Collection Methods: Surveys, Web Scraping, APIs, and Databases
20.1 Why Data Collection Matters
Every analytical conclusion is only as good as the data it rests on, and every dataset bears the marks of the method by which it was collected.
The choice of data-collection method is a decision the analyst makes before any model is fit and before any chart is drawn. It determines what the data can support, where bias may have entered, what privacy and legal obligations apply, and what subsequent processing will be needed.
Four families of method dominate modern analytics work: surveys (collect new data directly from people), web scraping (extract data from public web pages), APIs (request data from systems designed to serve it), and databases (query data already captured by operational systems). Each has its own strengths, costs, and pitfalls. A working analyst chooses among them with intent.
20.2 Categories of Data Collection Methods
20.3 Surveys
A survey is a structured instrument for collecting new data directly from people. Surveys remain the standard way to measure attitudes, perceptions, intentions, and self-reported behaviour, and the empirical foundations of modern survey practice are set out in Survey Methodology by Robert M. Groves et al. (2009).
The principal forms of survey:
- Cross-sectional: A single point in time across a sample of respondents.
- Longitudinal (Panel): The same respondents surveyed repeatedly over time.
- Repeated Cross-Sectional: New respondents drawn from the same population at successive time points.
- Online, Telephone, In-Person, Postal: Different modes with different cost, reach, and bias profiles.
The four pillars of a defensible survey:
- Question design: Clear, unambiguous, single-barrelled questions; balanced response options; appropriate scales (Likert, semantic differential, ranking).
- Sampling: Random or stratified random sampling for population-level inference; convenience sampling for exploratory work; quota sampling where representativeness across known segments matters.
- Mode and channel: Online for cost and speed; in-person for sensitive subjects; mixed-mode for hard-to-reach populations.
- Non-response: A survey with 80 per cent non-response can be wholly unrepresentative, regardless of sample size. Plan for follow-up and weighting.
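The sampling pillar translates directly into code. A minimal stratified-sampling sketch in Python, drawing the same fraction from each stratum of a hypothetical customer frame tagged by region (all names and figures here are illustrative):

```python
import random

def stratified_sample(frame, strata_key, fraction, seed=42):
    """Draw the same sampling fraction from each stratum of the frame."""
    rng = random.Random(seed)
    strata = {}
    for record in frame:
        strata.setdefault(record[strata_key], []).append(record)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical sampling frame: 300 customers tagged by region.
frame = [{"id": i, "region": "South" if i % 3 == 0 else "North"}
         for i in range(300)]
sample = stratified_sample(frame, "region", fraction=0.10)
# 10 per cent of each stratum: 20 North, 10 South
```

Because each stratum contributes in proportion to its size, known segments cannot be accidentally under-represented, which is exactly the property quota and stratified designs are chosen for.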
Common tools include Google Forms, Microsoft Forms, SurveyMonkey, Qualtrics, Typeform, and SurveyCTO; for academic and field research, Open Data Kit (ODK) and KoboToolbox are widely used.
Common survey pitfalls:
- Leading questions that suggest an answer in their wording.
- Double-barrelled questions that ask two things at once (“Was the service fast and friendly?”).
- Ambiguous scales without clear anchors at each end.
- Selection bias when only motivated respondents reply.
- Acquiescence bias in agree-disagree formats; respondents tend to agree.
- Recall bias when asking about events more than a few weeks past.
- Order effects where the order of questions or options shifts answers.
20.4 Web Scraping
Web scraping is the automated extraction of data from public web pages. It is widely used to gather competitor prices, product catalogues, public listings, news articles, and social-media content where no API is offered.
The technical toolkit:
- HTML parsers: BeautifulSoup and lxml in Python; rvest in R; jsoup in Java.
- HTTP clients: requests, httpx, aiohttp in Python; httr in R.
- Headless browsers: Selenium, Playwright, and Puppeteer for sites whose content renders only through JavaScript.
- Crawling frameworks: Scrapy in Python; Apache Nutch and Heritrix at very large scale.
- Polite crawling: Honour robots.txt, throttle requests, identify the crawler in the User-Agent header, and respect site terms of service.
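The polite-crawling rules can be enforced in code. A minimal sketch using only Python's standard library, with an inlined robots.txt standing in for the fetched file so the example runs offline (the site, bot name, and rules are all hypothetical):

```python
import time
import urllib.robotparser

USER_AGENT = "ExampleAnalyticsBot/1.0 (analytics@example.com)"  # identify the crawler
CRAWL_DELAY = 1.0  # seconds between requests: throttle politely

# In real use this text would be fetched from https://example.com/robots.txt;
# it is inlined here so the sketch runs without a network call.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rules = urllib.robotparser.RobotFileParser()
rules.parse(robots_txt.splitlines())

def may_fetch(url):
    """Check a URL against the site's robots.txt rules before requesting it."""
    return rules.can_fetch(USER_AGENT, url)

for url in ["https://example.com/products", "https://example.com/private/admin"]:
    if may_fetch(url):
        # Issue the request here with an HTTP client,
        # sending USER_AGENT in the request headers.
        time.sleep(CRAWL_DELAY)  # wait between requests
```

The same check-throttle-identify pattern applies whatever HTTP client or framework does the actual fetching.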
Web scraping is technically simple but legally and ethically fraught:
- robots.txt: A site’s robots.txt file specifies which paths automated crawlers may visit. Honouring it is a courtesy and increasingly a legal expectation.
- Terms of Service: Many sites prohibit scraping in their terms; violation can produce contractual liability.
- Copyright and Database Rights: Extracted content may be protected by copyright, sui generis database rights (in the EU and UK), or by similar provisions in other jurisdictions.
- Personal Data: Scraped data containing personal information is subject to GDPR, India’s DPDPA, and similar privacy regulations.
- Server Load: Aggressive crawling can disrupt the site you depend on. Rate-limit, cache, and throttle.
- Detection and Blocking: Cloudflare, reCAPTCHA, and similar services aggressively detect and block scrapers; designing around these protections often crosses ethical lines.
The general rule for legitimate analytics work: prefer an API if one exists; if scraping, do so politely, lawfully, and with explicit acknowledgement that the practice carries risk.
20.5 APIs
An Application Programming Interface (API) is a contract a system offers for retrieving or modifying its data programmatically. Where an API exists, it is almost always preferable to scraping: it is faster, more reliable, more structured, and is the channel the source intends external systems to use.
The dominant API styles:
- REST: Resource-oriented, HTTP-based, JSON payloads. The standard pattern in modern web services.
- GraphQL: A single endpoint accepting structured queries; the client specifies the fields it wants. Avoids over-fetching.
- SOAP and XML-RPC: Older XML-based protocols, still common in regulated and legacy enterprise contexts.
- Webhooks: Server-pushed notifications when an event occurs, the inverse of polling.
- Streaming APIs: WebSockets, Server-Sent Events, and message brokers for continuous data feeds.
Practical concerns:
- Authentication: API keys, OAuth 2.0, JWT, mutual TLS, IP allow-lists. Each has its own setup and key-management implications.
- Rate Limits: Most APIs limit calls per second, per minute, or per day. Build in throttling and exponential back-off.
- Pagination: Most APIs paginate large result sets; the client must iterate through pages.
- Versioning: Consumed APIs change. Pin to a specific version and monitor deprecation notices.
- Error Handling: Network failures, partial responses, and rate-limit hits are routine; design retries and idempotent operations from the start.
- Cost: Many commercial APIs (Bloomberg, Refinitiv, Twilio, weather data) carry per-call or per-record charges.
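Pagination, rate limits, and back-off combine into a single retrieval loop. A sketch against a simulated endpoint: fetch_page below is a stand-in for a real HTTP call, and the page/status response shape is an assumption for illustration, not any particular API's contract:

```python
import time

def fetch_page(page):
    """Stand-in for an HTTP call; a real client would issue a GET here.
    Returns (status_code, payload) and simulates one rate-limit hit."""
    if page == 2 and not getattr(fetch_page, "retried", False):
        fetch_page.retried = True
        return 429, None  # Too Many Requests
    data = [{"id": page * 10 + i} for i in range(10)] if page <= 3 else []
    return 200, {"items": data, "has_more": page < 3}

def fetch_all(max_retries=5, base_delay=0.1):
    """Iterate through pages, retrying with exponential back-off on 429s."""
    items, page = [], 1
    while True:
        for attempt in range(max_retries):
            status, payload = fetch_page(page)
            if status == 200:
                break
            time.sleep(base_delay * 2 ** attempt)  # back off before retrying
        else:
            raise RuntimeError(f"page {page} failed after {max_retries} retries")
        items.extend(payload["items"])
        if not payload["has_more"]:
            return items
        page += 1

records = fetch_all()  # three pages of ten records each
```

Because the loop only ever re-requests the failed page, the retries are idempotent: running the job twice yields the same records, which is the property the error-handling bullet above asks for.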
Indian and global examples include public data portals (data.gov.in, the World Bank API, IMF, OECD), tax and identity APIs (GST, Aadhaar e-KYC, DigiLocker, Account Aggregator), payment platforms (Razorpay, PayU, Stripe), and SaaS platforms (Salesforce, HubSpot, Slack, Atlassian).
20.6 Databases and Data Warehouses
The most overlooked source of data is data the organisation already owns. Operational systems — point-of-sale, ERP, CRM, billing — capture the firm’s daily activity at very high granularity, and a competent analyst will reach for these before commissioning new data collection.
For analytical work, raw operational data is usually integrated into a data warehouse or data lake, structured for query rather than for transactions. The standard reference on dimensional modelling for analytical databases remains The Data Warehouse Toolkit by Ralph Kimball & Margy Ross (2013).
The principal database categories an analyst encounters:
- Operational (OLTP) Databases: Optimised for many small writes and reads; the systems of record. Oracle, SQL Server, MySQL, PostgreSQL.
- Analytical (OLAP) Warehouses: Optimised for large aggregations and complex queries. Snowflake, BigQuery, Redshift, Synapse, Databricks SQL.
- Data Lakes and Lakehouses: Schema-on-read storage of raw and semi-structured data. S3, Azure Data Lake, Google Cloud Storage, Databricks Delta Lake.
- NoSQL Stores: Document (MongoDB, Couchbase), key-value (Redis, DynamoDB), column-family (Cassandra, HBase), graph (Neo4j).
- Time-Series Databases: InfluxDB, TimescaleDB, kdb+ for high-frequency telemetry.
- Search Engines: Elasticsearch and OpenSearch for log and document retrieval.
The toolkit:
- SQL: The lingua franca of analytical databases. Indispensable for any working analyst.
- ETL and ELT Tools: Informatica, Talend, Fivetran, Airbyte, dbt, custom Python and Spark pipelines for moving data from operational sources to the analytical layer.
- Notebooks and Connectors: Jupyter, Databricks, Tableau, Power BI, ODBC, JDBC connectors that let the analyst query the warehouse in their preferred environment.
- Reverse ETL: Tools that move data back from the warehouse into operational systems (Hightouch, Census).
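SQL's central role is easiest to show end to end. A self-contained sketch using Python's built-in sqlite3 as a stand-in for a warehouse connection, with a hypothetical sales_monthly table; a production warehouse query has exactly the same shape:

```python
import sqlite3

# In-memory database standing in for an analytical warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales_monthly (
        store_id INTEGER, month TEXT, revenue REAL
    )
""")
conn.executemany(
    "INSERT INTO sales_monthly VALUES (?, ?, ?)",
    [(1, "2024-01", 100.0), (1, "2024-02", 120.0),
     (2, "2024-01", 80.0),  (2, "2024-02", 90.0)],
)

# A typical analytical aggregation: total revenue by store over the period.
rows = conn.execute("""
    SELECT store_id, SUM(revenue) AS total_revenue
    FROM sales_monthly
    GROUP BY store_id
    ORDER BY total_revenue DESC
""").fetchall()
# rows -> [(1, 220.0), (2, 170.0)]
```

Swapping the connection for an ODBC or JDBC one pointed at Snowflake, BigQuery, or Redshift leaves the query itself essentially unchanged, which is why SQL is described above as the lingua franca.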
20.7 Sensors, Logs, and Qualitative Methods
Three further families of data collection are increasingly common in mature analytics work:
Sensors and IoT: Industrial sensors, environmental monitors, fleet telematics, wearable devices, smart-city instrumentation. Typically high-velocity, semi-structured, with quality and drift concerns.
Logs and Application Instrumentation: Web-server logs, application event logs, mobile app telemetry, clickstream. Often the largest source of behavioural data the firm has, but it takes substantial engineering work to make analysable.
Qualitative Methods: Interviews, ethnographic observation, diary studies, focus groups, open-ended survey responses. Provide depth and explanation that quantitative methods cannot. Subject to a different set of methodological standards: thematic analysis, grounded theory, content analysis.
A complete analytical programme draws on all of these where relevant. The choice between qualitative and quantitative is rarely either-or; the strongest insights often come from combining both.
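The sensor-drift concern noted above can be monitored with a simple rolling comparison. A sketch, assuming a hypothetical temperature stream and fixed comparison windows (the threshold and window sizes are illustrative choices, not a standard):

```python
from statistics import mean

def drift_alert(readings, baseline_n=20, recent_n=20, threshold=2.0):
    """Flag drift when the mean of the most recent window departs from the
    mean of the baseline window by more than `threshold` units."""
    if len(readings) < baseline_n + recent_n:
        return False  # not enough history to compare yet
    baseline = mean(readings[:baseline_n])
    recent = mean(readings[-recent_n:])
    return abs(recent - baseline) > threshold

# Hypothetical telemetry: stable around 25 degrees, then drifting upward.
stable = [25.0 + (i % 3) * 0.1 for i in range(40)]
drifting = stable[:20] + [25.0 + 0.3 * i for i in range(20)]
```

A production version would compare against a calibrated reference or a peer sensor rather than the instrument's own history, but the principle of watching the distribution rather than trusting each reading is the same.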
20.8 Choosing the Right Method
| Method | Best For | Cost | Speed | Quality |
|---|---|---|---|---|
| Surveys | Attitudes, perceptions, self-reported behaviour | Medium to high | Slow | Variable; depends on design and response |
| Web Scraping | Public web data when no API exists | Low to medium | Fast initial, fragile over time | Variable; structure can change without notice |
| APIs | Structured data the source intends to share | Low to medium (often free, sometimes paid) | Fast | High when properly versioned |
| Operational Databases | Internal transactional data | Largely sunk | Fast | High if governed |
| Data Warehouses and Lakes | Integrated analytical data | Sunk in platform investment | Fast | High if integrated and curated |
| Sensors and IoT | Physical-world measurements | High up-front | Continuous | Variable; drift is common |
| Logs and Instrumentation | Behavioural and operational telemetry | Medium engineering effort | Continuous | Variable; volume is high |
| Qualitative | Depth and explanation, hypothesis generation | Medium to high | Slow | High when methodology is rigorous |
A pragmatic decision rule:
- For internal questions about the firm’s own activity, start with the operational and warehouse databases.
- For external structured questions, prefer an API.
- For external public data without an API, scrape carefully and lawfully.
- For attitudes and perceptions that no transactional system captures, run a survey.
- For physical-world questions, use sensors and IoT.
- For why questions that quantitative methods cannot answer, use qualitative methods alongside.
20.9 Common Pitfalls
New Survey When Existing Data Suffices: Commissioning a new survey when the answer is already in the operational database. Surveys are slow and subject to response bias; the existing data is faster to reach and already covers the full population of transactions.
Scraping When an API Exists: Building fragile scrapers against sites that publish a stable API. The API would have been faster, cheaper, and more lawful.
Ignoring Terms of Service: Treating technically possible as legally permitted. Scraping in particular carries real legal risk in many jurisdictions.
Survey Fatigue: Long surveys that drop completion rates and bias the sample toward the patient few who finish.
Convenience Sampling Treated as Representative: Drawing conclusions about a population from the people who happened to respond.
Question Bias: Leading or double-barrelled questions that produce the answer the designer expected.
No Rate Limit or Back-Off: Scrapers and API clients that hit servers as fast as the network allows, getting blocked or causing harm.
Hard-Coded API Keys: Credentials checked into source control or shared in scripts. Use secrets management.
No Versioning Discipline: Consuming APIs without pinning to a version, then breaking when the source changes.
Sensor Drift Unmonitored: Industrial sensor data treated as ground truth, when the underlying instruments age, are replaced, or are recalibrated.
Logs Treated as Forever: Application logs collected and stored forever without lifecycle policies; storage cost spirals.
No Privacy Review: Personal data collected from any of these methods without DPDPA/GDPR-style review, consent, or minimisation.
Single-Method Reliance: Treating one method as the entire data programme, when triangulation across methods would produce better evidence.
20.10 Illustrative Cases
The following short cases illustrate how a working analyst chooses among collection methods. They describe common situations and the reasoning behind the design.
Customer Satisfaction at a Retail Bank — Survey Plus Operational Data
A retail bank wants to understand customer satisfaction across branches. The operational database captures transactions, complaints, and call-centre records, but says nothing about perception. The team launches a short post-interaction satisfaction survey through SMS and the mobile app, joins the survey responses to the operational records, and uses both together to identify which operational issues actually drive perception. Neither method alone would have answered the question.
Competitor Pricing for an E-Commerce Firm — Scraping with Discipline
An Indian e-commerce firm tracks competitor prices for a thousand SKUs daily. No API is offered; the team writes a polite scraper that honours robots.txt, throttles to a few requests per minute per site, identifies itself in the User-Agent, caches aggressively, and rotates IP addresses through a legitimate proxy service. The legal team reviews the scraping practice quarterly to ensure compliance with terms of service and Indian computer-misuse law.
Macroeconomic Context for a Bank’s Risk Models — APIs
A bank’s credit-risk model needs macroeconomic indicators — inflation, unemployment, GDP growth, exchange rates — at monthly frequency. The team uses APIs from RBI’s data publication service, the World Bank, and the IMF, with each request authenticated by API key. Scheduled jobs pull the latest releases each month into the warehouse, and the model’s feature pipeline reads from there. The arrangement is reliable, lawful, and almost free of running cost.
Industrial Predictive Maintenance — Sensors and Logs
A manufacturing plant instruments a critical line with vibration and temperature sensors, plus the line’s existing programmable-logic-controller logs. Sensor telemetry streams into a time-series database; PLC events stream into Kafka and from there into the warehouse. The combined dataset feeds a predictive-maintenance model. A sensor-replacement procedure is built into operations to flag and recalibrate when sensor drift becomes detectable.
Why Customers Churn — Qualitative Plus Quantitative
A subscription business has a quantitative churn model that predicts which customers will leave but cannot say why. The team supplements the model with quarterly in-depth interviews of recently churned customers and a thematic analysis of the open-text reason given at cancellation. The qualitative findings reshape the next iteration of the quantitative model and also feed product-improvement priorities.
20.11 Hands-On Exercise: Multi-Source Data Collection in Power BI
Aim: Use Power BI’s Get Data feature to collect data from five different categories of source — a database, a CSV file, a public REST API, a web page (HTML table), and an external Excel file — and combine them into a single integrated report.
Scenario: An analyst at Yuvijen Stores Pvt Ltd wants to answer a cross-source question: how does the firm’s monthly online traffic correlate with state-level population and macroeconomic indicators? No single internal system carries all of this; the data must be collected from five sources.
Deliverable: A Power BI report that integrates the five sources and renders a single visualisation answering the question.
20.11.1 Step 1 — Internal Sales from a Database
For most analytical projects, the firm’s own operational database is the natural starting point. Power BI ships with connectors for nearly every major database engine.
In Power BI Desktop:
- Home → Get Data → SQL Server (or MySQL, PostgreSQL, SQLite, Oracle, Azure SQL, etc.).
- Enter the server and database name; choose DirectQuery for live operational data or Import for analytical extracts.
- Authenticate with Windows, database, or organisational credentials.
- Select the sales_monthly and store_master tables.
- Click Transform Data to refine the queries (filter columns, change types, rename) before loading.
If a database is not available for the exercise, the same workflow applies to SQLite or even an Access file. The principle is the same — connect, authenticate, select, transform.
20.11.2 Step 2 — Internal Store Master from a CSV File
Many internal systems still export to flat files. Power BI handles them natively:
- Home → Get Data → Text/CSV, select store_master.csv.
- Verify the inferred column types in the preview pane.
- Click Load for a quick load, or Transform Data to refine first.
This category covers virtually any tabular file that arrives on a shared drive — CSV, TSV, fixed-width text — and is the most common starting point in many small analytics teams.
20.11.3 Step 3 — Public REST API (JSON)
For a structured external API, Power BI’s Web connector handles JSON responses directly. A useful free example is the World Bank API — for instance, India’s annual GDP series:
https://api.worldbank.org/v2/country/IND/indicator/NY.GDP.MKTP.CD?format=json&date=2018:2025
In Power BI Desktop:
- Home → Get Data → Web, paste the URL.
- Power BI loads the response as a JSON list.
- Click Transform Data. In the Power Query editor, expand the JSON list to a table.
- Drill into the records, expand the nested objects, and select the date and value columns.
- Apply.
For data.gov.in or other government open-data APIs, the same workflow applies. The key Power BI moves are Get Data → Web and the JSON expansion in the Power Query editor.
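Outside Power BI, the same JSON expansion can be scripted. The World Bank v2 API returns a two-element array (paging metadata first, then the observation records), so a sketch over a hard-coded sample of that shape works offline; the figures below are placeholders, not real GDP values:

```python
import json

# Sample response mirroring the World Bank v2 JSON shape:
# element 0 is paging metadata, element 1 is the list of observations.
# Figures are placeholders, not real GDP values.
sample = json.loads("""
[
  {"page": 1, "pages": 1, "per_page": 50, "total": 2},
  [
    {"date": "2023", "value": 1230000000000.0},
    {"date": "2022", "value": null}
  ]
]
""")

metadata, records = sample
# Keep only observations with a reported value, keyed by year.
series = {rec["date"]: rec["value"]
          for rec in records if rec["value"] is not None}
```

Dropping null values mirrors the column selection done in the Power Query editor: recent years are often published with no value until the release lands.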
20.11.4 Step 4 — Web Page (HTML Table)
Power BI can extract HTML tables from a public web page without any external scraping tool:
- Home → Get Data → Web, paste a URL such as a Wikipedia page listing Indian states by population.
- Power BI’s Navigator displays each detected table on the page.
- Tick the table you want and click Transform Data.
- Clean the data: remove citation footnotes, change types, rename columns.
This is the practical answer to web scraping in a no-code BI workflow. It works for any well-formed HTML table; for less structured pages, Power BI’s Add Table Using Examples (under Get Data → Web) lets the analyst train an extractor with two or three example values.
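The same table extraction can be scripted where no BI tool is at hand. A minimal sketch using only Python's standard library, over a small inline table standing in for a fetched page (the figures are illustrative):

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the cell text of every <tr> into a list of rows."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.in_cell = [], None, False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.in_cell = True
    def handle_endtag(self, tag):
        if tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None
        elif tag in ("td", "th"):
            self.in_cell = False
    def handle_data(self, data):
        if self.in_cell:
            self.row.append(data.strip())

html = """
<table>
  <tr><th>State</th><th>Population</th></tr>
  <tr><td>Uttar Pradesh</td><td>199812341</td></tr>
  <tr><td>Maharashtra</td><td>112374333</td></tr>
</table>
"""
parser = TableExtractor()
parser.feed(html)
# parser.rows -> [['State', 'Population'], ['Uttar Pradesh', '199812341'], ...]
```

For real pages, pandas.read_html or BeautifulSoup does the same job with less code; the stdlib version is shown so the mechanics of the extraction are visible.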
20.11.5 Step 5 — External Industry Data (Excel)
Industry associations and consultancies publish reports as Excel workbooks. Power BI imports them directly:
- Home → Get Data → Excel Workbook, select industry_benchmarks.xlsx.
- Pick the relevant sheet from the Navigator.
- Transform Data to remove header rows, convert ranges to tables, and tidy types.
The discipline is the same as for any tabular import — verify the schema, document the source, and load only the columns the report actually needs.
20.11.6 Step 6 — Combine and Visualise
In the Power BI Model view:
- Define relationships across the five sources using common keys (state, month, store).
- Mark the date table as a date table.
- Build a single visual that exercises all five sources — for example, a scatter plot of online traffic by state (internal database + web HTML) coloured by industry growth rate (external Excel) and sized by GDP per capita (REST API).
The single chart now stands on five sources and demonstrates the full collection workflow in one artefact.
20.11.7 Step 7 — Connect to the Visualisation Layer
The hands-on illustrates a recurring pattern in real analytics work:
- Internal sources carry the firm’s own activity at high granularity but are inward-looking.
- External sources add the macro context — population, GDP, industry trends — that internal data cannot produce.
- The dashboard or report is what brings the two together visually.
A serious data-collection programme is therefore not about choosing one source over another; it is about combining them so the resulting visualisation tells the audience something neither source alone could say.
The Power BI file (yuvijen-multi-source.pbix), the source CSV, the JSON URL, the HTML page reference, and the Excel file, together with screen recordings of each Get Data workflow, will be embedded here.
Summary
| Concept | Description |
|---|---|
| Foundations | |
| Why Data Collection Matters | Every analytical conclusion is only as good as the data it rests on; method shapes what data can support |
| Categories of Methods | |
| Surveys | Structured instrument for collecting new data directly from people |
| Web Scraping | Automated extraction of data from public web pages |
| APIs | Contract a system offers for retrieving or modifying its data programmatically |
| Databases | Store of data already captured by operational systems and integrated for analysis |
| Sensors and IoT | Capture physical-world measurements at high velocity |
| Logs and Instrumentation | Records produced by running systems; behavioural and operational telemetry |
| Qualitative Methods | Interviews, observation, diary studies; provide depth and explanation |
| Survey Forms and Pillars | |
| Cross-Sectional Survey | Single point in time across a sample of respondents |
| Longitudinal Survey | Same respondents surveyed repeatedly over time |
| Repeated Cross-Sectional | New respondents drawn from the same population at successive time points |
| Online Mode | Online mode is fast and cheap but sample-biased toward digital users |
| Telephone Mode | Telephone mode reaches non-digital populations but is more costly |
| In-Person Mode | In-person mode is best for sensitive subjects but slowest and most expensive |
| Question Design | Clear unambiguous single-barrelled questions with appropriate scales |
| Sampling | Random or stratified random for inference; convenience for exploration; quota for representativeness |
| Mode and Channel | Online for cost, in-person for sensitivity, mixed-mode for hard-to-reach populations |
| Non-Response | An eighty per cent non-response can render a survey unrepresentative regardless of size |
| Survey Tools | Google Forms, SurveyMonkey, Qualtrics, Typeform; ODK and KoboToolbox for field research |
| Survey Pitfalls | |
| Leading Questions | Pitfall of wording that suggests an answer |
| Double-Barrelled Questions | Pitfall of asking two things at once such as fast-and-friendly |
| Acquiescence Bias | Pitfall of agree-disagree formats where respondents tend to agree |
| Recall Bias | Pitfall of asking about events more than a few weeks past |
| Order Effects | Pitfall of question or option order shifting answers |
| Web Scraping Toolkit | |
| HTML Parsers | BeautifulSoup, lxml, rvest, jsoup for parsing web HTML |
| HTTP Clients | requests, httpx, aiohttp, httr for fetching pages |
| Headless Browsers | Selenium, Playwright, Puppeteer for sites that render through JavaScript |
| Crawling Frameworks | Scrapy, Apache Nutch, Heritrix for crawling at scale |
| Polite Crawling | Honour robots.txt, throttle, identify the crawler, respect terms of service |
| Web Scraping Constraints | |
| robots.txt | Site file specifying which paths automated crawlers may visit |
| Terms of Service | Many sites prohibit scraping; violation can produce contractual liability |
| Copyright and Database Rights | Extracted content may be protected by copyright or sui generis database rights |
| Personal Data in Scraping | Scraped personal data is subject to GDPR, DPDPA, and similar privacy regulations |
| Server Load | Aggressive crawling can disrupt the site you depend on; throttle and cache |
| Scraper Detection and Blocking | Cloudflare and reCAPTCHA aggressively block scrapers; designing around protection often crosses ethical lines |
| API Styles | |
| REST | Resource-oriented HTTP-based JSON APIs; the standard pattern in modern web services |
| GraphQL | Single endpoint accepting structured queries; client specifies fields it wants |
| SOAP and XML-RPC | Older XML-based protocols, still common in regulated and legacy enterprise contexts |
| Webhooks | Server-pushed notifications when an event occurs; inverse of polling |
| Streaming APIs | WebSockets, Server-Sent Events, message brokers for continuous data feeds |
| API Practical Concerns | |
| API Authentication | API keys, OAuth 2.0, JWT, mutual TLS, IP allow-lists with their own setup |
| Rate Limits | Most APIs limit calls per second, per minute, per day; build in throttling and back-off |
| Pagination | Most APIs paginate large result sets; client must iterate through pages |
| API Versioning | Pin to a specific version and monitor deprecation notices |
| API Error Handling | Network failures and rate-limit hits are routine; design retries and idempotency |
| API Cost | Many commercial APIs charge per call or per record |
| Database Categories | |
| Operational Databases | OLTP systems optimised for many small writes and reads; systems of record |
| Analytical Warehouses | OLAP systems optimised for large aggregations and complex queries |
| Data Lakes and Lakehouses | Schema-on-read storage of raw and semi-structured data |
| NoSQL Stores | Document, key-value, column-family, and graph stores for flexible structures |
| Time-Series Databases | InfluxDB, TimescaleDB, kdb+ for high-frequency telemetry |
| Search Engines | Elasticsearch and OpenSearch for log and document retrieval |
| Database Toolkit | |
| SQL | Lingua franca of analytical databases; indispensable for any working analyst |
| ETL and ELT Tools | Informatica, Talend, Fivetran, Airbyte, dbt, custom pipelines for moving data |
| Notebooks and Connectors | Jupyter, Databricks, Tableau, Power BI, ODBC, JDBC connectors |
| Reverse ETL | Tools that move data back from the warehouse into operational systems |
| Other Methods | |
| Sensors and IoT Use | Industrial sensors, environmental monitors, fleet telematics, wearables, smart-city |
| Logs and Instrumentation Use | Web-server, application, mobile app telemetry, clickstream; largest behavioural source |
| Qualitative Methods Use | Provide depth and why-questions that quantitative methods cannot answer |
| Choosing the Method | |
| Decision Rule for Method | Internal first, then API, then scraping, then survey, then sensors, then qualitative |
| Common Pitfalls | |
| New Survey When Existing Data Suffices | Pitfall of commissioning a new survey when the answer is already in the operational database |
| Scraping When API Exists | Pitfall of building fragile scrapers against sites that publish a stable API |
| Ignoring Terms of Service | Pitfall of treating technically possible as legally permitted |
| Survey Fatigue | Pitfall of long surveys that drop completion rates and bias the sample |
| Convenience Sampling | Pitfall of drawing population conclusions from the people who happened to respond |
| Question Bias | Pitfall of leading or double-barrelled questions that produce the expected answer |
| No Rate Limit or Back-Off | Pitfall of scrapers and clients that hit servers as fast as the network allows |
| Hard-Coded API Keys | Pitfall of credentials checked into source control or shared in scripts |
| No Versioning Discipline | Pitfall of consuming APIs without pinning to a version and breaking on source changes |
| Sensor Drift Unmonitored | Pitfall of treating sensor data as ground truth when instruments age, fail, or are recalibrated |
| Logs Treated as Forever | Pitfall of application logs collected and stored forever without lifecycle policies |
| No Privacy Review | Pitfall of personal data collected from any method without privacy review or consent |
| Single-Method Reliance | Pitfall of treating one method as the entire data programme rather than triangulating |