flowchart TD
D["Data Collection<br>Methods"]
D --> S["Surveys<br>Collect new data<br>from people"]
D --> W["Web Scraping<br>Extract data from<br>public web pages"]
D --> A["APIs<br>Request data from<br>systems built to serve it"]
D --> DB["Databases<br>Query data already<br>captured by systems"]
D --> Se["Sensors and IoT<br>Capture physical-world<br>measurements"]
D --> L["Logs and<br>Instrumentation<br>Records produced by<br>running systems"]
D --> Q["Qualitative Methods<br>Interviews, observation,<br>diary studies"]
style D fill:#e3f2fd,stroke:#1976D2
style S fill:#fce4ec,stroke:#AD1457
style W fill:#fff3e0,stroke:#EF6C00
style A fill:#fff8e1,stroke:#F9A825
style DB fill:#e8f5e9,stroke:#388E3C
style Se fill:#ede7f6,stroke:#4527A0
style L fill:#f3e5f5,stroke:#6A1B9A
style Q fill:#eceff1,stroke:#455A64
20 Data Collection Methods: Surveys, Web Scraping, APIs, and Databases
20.1 Why Data Collection Matters
Every analytical conclusion is only as good as the data it rests on, and every dataset bears the marks of the method by which it was collected.
The choice of data-collection method is a decision the analyst makes before any model is fit and before any chart is drawn. It determines what the data can support, where bias may have entered, what privacy and legal obligations apply, and what subsequent processing will be needed.
Four families of method dominate modern analytics work: surveys (collect new data directly from people), web scraping (extract data from public web pages), APIs (request data from systems designed to serve it), and databases (query data already captured by operational systems). Each has its own strengths, costs, and pitfalls. A working analyst chooses among them with intent.
20.2 Categories of Data Collection Methods
20.3 Surveys
A survey is a structured instrument for collecting new data directly from people. Surveys remain the standard way to measure attitudes, perceptions, intentions, and self-reported behaviour, and the empirical foundations of modern survey practice are set out in Survey Methodology by Robert M. Groves et al. (2009).
The principal forms of survey:
- Cross-sectional: A single point in time across a sample of respondents.
- Longitudinal (Panel): The same respondents surveyed repeatedly over time.
- Repeated Cross-Sectional: New respondents drawn from the same population at successive time points.
- Online, Telephone, In-Person, Postal: Different modes with different cost, reach, and bias profiles.
The four pillars of a defensible survey:
- Question design: Clear, unambiguous, single-barrelled questions; balanced response options; appropriate scales (Likert, semantic differential, ranking).
- Sampling: Random or stratified random sampling for population-level inference; convenience sampling for exploratory work; quota sampling where representativeness across known segments matters.
- Mode and channel: Online for cost and speed; in-person for sensitive subjects; mixed-mode for hard-to-reach populations.
- Non-response: A survey with 80 per cent non-response can be wholly unrepresentative, regardless of sample size. Plan for follow-up and weighting.
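The sampling pillar translates directly into code. A minimal stratified-sampling sketch in Python, drawing the same fraction from each stratum of a hypothetical customer frame tagged by region (all names and figures here are illustrative):

```python
import random

def stratified_sample(frame, strata_key, fraction, seed=42):
    """Draw the same sampling fraction from each stratum of the frame."""
    rng = random.Random(seed)
    strata = {}
    for record in frame:
        strata.setdefault(record[strata_key], []).append(record)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical sampling frame: 300 customers tagged by region.
frame = [{"id": i, "region": "South" if i % 3 == 0 else "North"}
         for i in range(300)]
sample = stratified_sample(frame, "region", fraction=0.10)
# 10 per cent of each stratum: 20 North, 10 South
```

Because each stratum contributes in proportion to its size, known segments cannot be accidentally under-represented, which is exactly the property quota and stratified designs are chosen for.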
Common tools include Google Forms, Microsoft Forms, SurveyMonkey, Qualtrics, Typeform, and SurveyCTO; for academic and field research, Open Data Kit (ODK) and KoboToolbox are widely used.
Common survey pitfalls:
- Leading questions that suggest an answer in their wording.
- Double-barrelled questions that ask two things at once (“Was the service fast and friendly?”).
- Ambiguous scales without clear anchors at each end.
- Selection bias when only motivated respondents reply.
- Acquiescence bias in agree-disagree formats; respondents tend to agree.
- Recall bias when asking about events more than a few weeks past.
- Order effects where the order of questions or options shifts answers.
20.4 Web Scraping
Web scraping is the automated extraction of data from public web pages. It is widely used to gather competitor prices, product catalogues, public listings, news articles, and social-media content where no API is offered.
The technical toolkit:
- HTML parsers: BeautifulSoup and lxml in Python; rvest in R; jsoup in Java.
- HTTP clients: requests, httpx, aiohttp in Python; httr in R.
- Headless browsers: Selenium, Playwright, and Puppeteer for sites whose content renders only through JavaScript.
- Crawling frameworks: Scrapy in Python; Apache Nutch and Heritrix at very large scale.
- Polite crawling: Honour robots.txt, throttle requests, identify the crawler in the User-Agent header, and respect site terms of service.
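The polite-crawling rules can be enforced in code. A minimal sketch using only Python's standard library, with an inlined robots.txt standing in for the fetched file so the example runs offline (the site, bot name, and rules are all hypothetical):

```python
import time
import urllib.robotparser

USER_AGENT = "ExampleAnalyticsBot/1.0 (analytics@example.com)"  # identify the crawler
CRAWL_DELAY = 1.0  # seconds between requests: throttle politely

# In real use this text would be fetched from https://example.com/robots.txt;
# it is inlined here so the sketch runs without a network call.
robots_txt = """\
User-agent: *
Disallow: /private/
"""

rules = urllib.robotparser.RobotFileParser()
rules.parse(robots_txt.splitlines())

def may_fetch(url):
    """Check a URL against the site's robots.txt rules before requesting it."""
    return rules.can_fetch(USER_AGENT, url)

for url in ["https://example.com/products", "https://example.com/private/admin"]:
    if may_fetch(url):
        # Issue the request here with an HTTP client,
        # sending USER_AGENT in the request headers.
        time.sleep(CRAWL_DELAY)  # wait between requests
```

The same check-throttle-identify pattern applies whatever HTTP client or framework does the actual fetching.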
Web scraping is technically simple but legally and ethically fraught:
- robots.txt: A site’s robots.txt file specifies which paths automated crawlers may visit. Honouring it is a courtesy and increasingly a legal expectation.
- Terms of Service: Many sites prohibit scraping in their terms; violation can produce contractual liability.
- Copyright and Database Rights: Extracted content may be protected by copyright, sui generis database rights (in the EU and UK), or by similar provisions in other jurisdictions.
- Personal Data: Scraped data containing personal information is subject to GDPR, India’s DPDPA, and similar privacy regulations.
- Server Load: Aggressive crawling can disrupt the site you depend on. Rate-limit, cache, and throttle.
- Detection and Blocking: Cloudflare, reCAPTCHA, and similar services aggressively detect and block scrapers; designing around these protections often crosses ethical lines.
The general rule for legitimate analytics work: prefer an API if one exists; if scraping, do so politely, lawfully, and with explicit acknowledgement that the practice carries risk.
20.5 APIs
An Application Programming Interface (API) is a contract a system offers for retrieving or modifying its data programmatically. Where an API exists, it is almost always preferable to scraping: it is faster, more reliable, more structured, and is the channel the source intends external systems to use.
The dominant API styles:
- REST: Resource-oriented, HTTP-based, JSON payloads. The standard pattern in modern web services.
- GraphQL: A single endpoint accepting structured queries; the client specifies the fields it wants. Avoids over-fetching.
- SOAP and XML-RPC: Older XML-based protocols, still common in regulated and legacy enterprise contexts.
- Webhooks: Server-pushed notifications when an event occurs, the inverse of polling.
- Streaming APIs: WebSockets, Server-Sent Events, and message brokers for continuous data feeds.
Practical concerns:
- Authentication: API keys, OAuth 2.0, JWT, mutual TLS, IP allow-lists. Each has its own setup and key-management implications.
- Rate Limits: Most APIs limit calls per second, per minute, or per day. Build in throttling and exponential back-off.
- Pagination: Most APIs paginate large result sets; the client must iterate through pages.
- Versioning: Consumed APIs change. Pin to a specific version and monitor deprecation notices.
- Error Handling: Network failures, partial responses, and rate-limit hits are routine; design retries and idempotent operations from the start.
- Cost: Many commercial APIs (Bloomberg, Refinitiv, Twilio, weather data) carry per-call or per-record charges.
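Pagination, rate limits, and back-off combine into a single retrieval loop. A sketch against a simulated endpoint: fetch_page below is a stand-in for a real HTTP call, and the page/status response shape is an assumption for illustration, not any particular API's contract:

```python
import time

def fetch_page(page):
    """Stand-in for an HTTP call; a real client would issue a GET here.
    Returns (status_code, payload) and simulates one rate-limit hit."""
    if page == 2 and not getattr(fetch_page, "retried", False):
        fetch_page.retried = True
        return 429, None  # Too Many Requests
    data = [{"id": page * 10 + i} for i in range(10)] if page <= 3 else []
    return 200, {"items": data, "has_more": page < 3}

def fetch_all(max_retries=5, base_delay=0.1):
    """Iterate through pages, retrying with exponential back-off on 429s."""
    items, page = [], 1
    while True:
        for attempt in range(max_retries):
            status, payload = fetch_page(page)
            if status == 200:
                break
            time.sleep(base_delay * 2 ** attempt)  # back off before retrying
        else:
            raise RuntimeError(f"page {page} failed after {max_retries} retries")
        items.extend(payload["items"])
        if not payload["has_more"]:
            return items
        page += 1

records = fetch_all()  # three pages of ten records each
```

Because the loop only ever re-requests the failed page, the retries are idempotent: running the job twice yields the same records, which is the property the error-handling bullet above asks for.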
Indian and global examples include public data portals (data.gov.in, the World Bank API, IMF, OECD), tax and identity APIs (GST, Aadhaar e-KYC, DigiLocker, Account Aggregator), payment platforms (Razorpay, PayU, Stripe), and SaaS platforms (Salesforce, HubSpot, Slack, Atlassian).
20.6 Databases and Data Warehouses
The most overlooked source of data is data the organisation already owns. Operational systems — point-of-sale, ERP, CRM, billing — capture the firm’s daily activity at very high granularity, and a competent analyst will reach for these before commissioning new data collection.
For analytical work, raw operational data is usually integrated into a data warehouse or data lake, structured for query rather than for transactions. The standard reference on dimensional modelling for analytical databases remains The Data Warehouse Toolkit by Ralph Kimball & Margy Ross (2013).
The principal database categories an analyst encounters:
- Operational (OLTP) Databases: Optimised for many small writes and reads; the systems of record. Oracle, SQL Server, MySQL, PostgreSQL.
- Analytical (OLAP) Warehouses: Optimised for large aggregations and complex queries. Snowflake, BigQuery, Redshift, Synapse, Databricks SQL.
- Data Lakes and Lakehouses: Schema-on-read storage of raw and semi-structured data. S3, Azure Data Lake, Google Cloud Storage, Databricks Delta Lake.
- NoSQL Stores: Document (MongoDB, Couchbase), key-value (Redis, DynamoDB), column-family (Cassandra, HBase), graph (Neo4j).
- Time-Series Databases: InfluxDB, TimescaleDB, kdb+ for high-frequency telemetry.
- Search Engines: Elasticsearch and OpenSearch for log and document retrieval.
The toolkit:
- SQL: The lingua franca of analytical databases. Indispensable for any working analyst.
- ETL and ELT Tools: Informatica, Talend, Fivetran, Airbyte, dbt, custom Python and Spark pipelines for moving data from operational sources to the analytical layer.
- Notebooks and Connectors: Jupyter, Databricks, Tableau, Power BI, ODBC, JDBC connectors that let the analyst query the warehouse in their preferred environment.
- Reverse ETL: Tools that move data back from the warehouse into operational systems (Hightouch, Census).
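SQL's central role is easiest to show end to end. A self-contained sketch using Python's built-in sqlite3 as a stand-in for a warehouse connection, with a hypothetical sales_monthly table; a production warehouse query has exactly the same shape:

```python
import sqlite3

# In-memory database standing in for an analytical warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE sales_monthly (
        store_id INTEGER, month TEXT, revenue REAL
    )
""")
conn.executemany(
    "INSERT INTO sales_monthly VALUES (?, ?, ?)",
    [(1, "2024-01", 100.0), (1, "2024-02", 120.0),
     (2, "2024-01", 80.0),  (2, "2024-02", 90.0)],
)

# A typical analytical aggregation: total revenue by store over the period.
rows = conn.execute("""
    SELECT store_id, SUM(revenue) AS total_revenue
    FROM sales_monthly
    GROUP BY store_id
    ORDER BY total_revenue DESC
""").fetchall()
# rows -> [(1, 220.0), (2, 170.0)]
```

Swapping the connection for an ODBC or JDBC one pointed at Snowflake, BigQuery, or Redshift leaves the query itself essentially unchanged, which is why SQL is described above as the lingua franca.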
20.7 Sensors, Logs, and Qualitative Methods
Three further families of data collection are increasingly common in mature analytics work:
Sensors and IoT: Industrial sensors, environmental monitors, fleet telematics, wearable devices, smart-city instrumentation. Typically high-velocity, semi-structured, with quality and drift concerns.
Logs and Application Instrumentation: Web-server logs, application event logs, mobile app telemetry, clickstream. Often the largest source of behavioural data the firm has, but it takes substantial engineering work to make analysable.
Qualitative Methods: Interviews, ethnographic observation, diary studies, focus groups, open-ended survey responses. Provide depth and explanation that quantitative methods cannot. Subject to a different set of methodological standards: thematic analysis, grounded theory, content analysis.
A complete analytical programme draws on all of these where relevant. The choice between qualitative and quantitative is rarely either-or; the strongest insights often come from combining both.
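The sensor-drift concern noted above can be monitored with a simple rolling comparison. A sketch, assuming a hypothetical temperature stream and fixed comparison windows (the threshold and window sizes are illustrative choices, not a standard):

```python
from statistics import mean

def drift_alert(readings, baseline_n=20, recent_n=20, threshold=2.0):
    """Flag drift when the mean of the most recent window departs from the
    mean of the baseline window by more than `threshold` units."""
    if len(readings) < baseline_n + recent_n:
        return False  # not enough history to compare yet
    baseline = mean(readings[:baseline_n])
    recent = mean(readings[-recent_n:])
    return abs(recent - baseline) > threshold

# Hypothetical telemetry: stable around 25 degrees, then drifting upward.
stable = [25.0 + (i % 3) * 0.1 for i in range(40)]
drifting = stable[:20] + [25.0 + 0.3 * i for i in range(20)]
```

A production version would compare against a calibrated reference or a peer sensor rather than the instrument's own history, but the principle of watching the distribution rather than trusting each reading is the same.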
20.8 Choosing the Right Method
| Method | Best For | Cost | Speed | Quality |
|---|---|---|---|---|
| Surveys | Attitudes, perceptions, self-reported behaviour | Medium to high | Slow | Variable; depends on design and response |
| Web Scraping | Public web data when no API exists | Low to medium | Fast initial, fragile over time | Variable; structure can change without notice |
| APIs | Structured data the source intends to share | Low to medium (often free, sometimes paid) | Fast | High when properly versioned |
| Operational Databases | Internal transactional data | Largely sunk | Fast | High if governed |
| Data Warehouses and Lakes | Integrated analytical data | Sunk in platform investment | Fast | High if integrated and curated |
| Sensors and IoT | Physical-world measurements | High up-front | Continuous | Variable; drift is common |
| Logs and Instrumentation | Behavioural and operational telemetry | Medium engineering effort | Continuous | Variable; volume is high |
| Qualitative | Depth and explanation, hypothesis generation | Medium to high | Slow | High when methodology is rigorous |
A pragmatic decision rule:
- For internal questions about the firm’s own activity, start with the operational and warehouse databases.
- For external structured questions, prefer an API.
- For external public data without an API, scrape carefully and lawfully.
- For attitudes and perceptions that no transactional system captures, run a survey.
- For physical-world questions, use sensors and IoT.
- For why questions that quantitative methods cannot answer, use qualitative methods alongside.
20.9 Common Pitfalls
New Survey When Existing Data Suffices: Commissioning a new survey when the answer is already in the operational database. Surveys are slow and subject to response bias; the existing data is faster to reach and already covers the full population of transactions.
Scraping When an API Exists: Building fragile scrapers against sites that publish a stable API. The API would have been faster, cheaper, and more lawful.
Ignoring Terms of Service: Treating technically possible as legally permitted. Scraping in particular carries real legal risk in many jurisdictions.
Survey Fatigue: Long surveys that drop completion rates and bias the sample toward the patient few who finish.
Convenience Sampling Treated as Representative: Drawing conclusions about a population from the people who happened to respond.
Question Bias: Leading or double-barrelled questions that produce the answer the designer expected.
No Rate Limit or Back-Off: Scrapers and API clients that hit servers as fast as the network allows, getting blocked or causing harm.
Hard-Coded API Keys: Credentials checked into source control or shared in scripts. Use secrets management.
No Versioning Discipline: Consuming APIs without pinning to a version, then breaking when the source changes.
Sensor Drift Unmonitored: Industrial sensor data treated as ground truth, when the underlying instruments age, are replaced, or are recalibrated.
Logs Treated as Forever: Application logs collected and stored forever without lifecycle policies; storage cost spirals.
No Privacy Review: Personal data collected from any of these methods without DPDPA/GDPR-style review, consent, or minimisation.
Single-Method Reliance: Treating one method as the entire data programme, when triangulation across methods would produce better evidence.
20.10 Illustrative Cases
The following short cases illustrate how a working analyst chooses among collection methods. They describe common situations and the reasoning behind the design.
Customer Satisfaction at a Retail Bank — Survey Plus Operational Data
A retail bank wants to understand customer satisfaction across branches. The operational database captures transactions, complaints, and call-centre records, but says nothing about perception. The team launches a short post-interaction satisfaction survey through SMS and the mobile app, joins the survey responses to the operational records, and uses both together to identify which operational issues actually drive perception. Neither method alone would have answered the question.
Competitor Pricing for an E-Commerce Firm — Scraping with Discipline
An Indian e-commerce firm tracks competitor prices for a thousand SKUs daily. No API is offered; the team writes a polite scraper that honours robots.txt, throttles to a few requests per minute per site, identifies itself in the User-Agent, caches aggressively, and rotates IP addresses through a legitimate proxy service. The legal team reviews the scraping practice quarterly to ensure compliance with terms of service and Indian computer-misuse law.
Macroeconomic Context for a Bank’s Risk Models — APIs
A bank’s credit-risk model needs macroeconomic indicators — inflation, unemployment, GDP growth, exchange rates — at monthly frequency. The team uses APIs from RBI’s data publication service, the World Bank, and the IMF, with each request authenticated by API key. Scheduled jobs pull the latest releases each month into the warehouse, and the model’s feature pipeline reads from there. The arrangement is reliable, lawful, and almost free of running cost.
Industrial Predictive Maintenance — Sensors and Logs
A manufacturing plant instruments a critical line with vibration and temperature sensors, plus the line’s existing programmable-logic-controller logs. Sensor telemetry streams into a time-series database; PLC events stream into Kafka and from there into the warehouse. The combined dataset feeds a predictive-maintenance model. A sensor-replacement procedure is built into operations to flag and recalibrate when sensor drift becomes detectable.
Why Customers Churn — Qualitative Plus Quantitative
A subscription business has a quantitative churn model that predicts which customers will leave but cannot say why. The team supplements the model with quarterly in-depth interviews of recently churned customers and a thematic analysis of the open-text reason given at cancellation. The qualitative findings reshape the next iteration of the quantitative model and also feed product-improvement priorities.
20.11 Hands-On Exercise: Multi-Source Data Collection in Power BI
Aim: Use Power BI’s Get Data feature to collect data from five different categories of source — a database, a CSV file, a public REST API, a web page (HTML table), and an external Excel file — and combine them into a single integrated report.
Scenario: An analyst at Yuvijen Stores Pvt Ltd wants to answer a cross-source question: how does the firm’s monthly online traffic correlate with state-level population and macroeconomic indicators? No single internal system carries all of this; the data must be collected from five sources.
Deliverable: A Power BI report that integrates the five sources and renders a single visualisation answering the question.
20.11.1 Step 1 — Internal Sales from a Database
For most analytical projects, the firm’s own operational database is the natural starting point. Power BI ships with connectors for nearly every major database engine.
In Power BI Desktop:
- Home → Get Data → SQL Server (or MySQL, PostgreSQL, SQLite, Oracle, Azure SQL, etc.).
- Enter the server and database name; choose DirectQuery for live operational data or Import for analytical extracts.
- Authenticate with Windows, database, or organisational credentials.
- Select the sales_monthly and store_master tables.
- Click Transform Data to refine the queries (filter columns, change types, rename) before loading.
If a database is not available for the exercise, the same workflow applies to SQLite or even an Access file. The principle is the same — connect, authenticate, select, transform.
20.11.2 Step 2 — Internal Store Master from a CSV File
Many internal systems still export to flat files. Power BI handles them natively:
- Home → Get Data → Text/CSV, select store_master.csv.
- Verify the inferred column types in the preview pane.
- Click Load for a quick load, or Transform Data to refine first.
This category covers virtually any tabular file that arrives on a shared drive — CSV, TSV, fixed-width text — and is the most common starting point in many small analytics teams.
20.11.3 Step 3 — Public REST API (JSON)
For a structured external API, Power BI’s Web connector handles JSON responses directly. A useful free example is the World Bank API — for instance, India’s annual GDP series:
https://api.worldbank.org/v2/country/IND/indicator/NY.GDP.MKTP.CD?format=json&date=2018:2025
In Power BI Desktop:
- Home → Get Data → Web, paste the URL.
- Power BI loads the response as a JSON list.
- Click Transform Data. In the Power Query editor, expand the JSON list to a table.
- Drill into the records, expand the nested objects, and select the date and value columns.
- Apply.
For data.gov.in or other government open-data APIs, the same workflow applies. The key Power BI moves are Get Data → Web and the JSON expansion in the Power Query editor.
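Outside Power BI, the same JSON expansion can be scripted. The World Bank v2 API returns a two-element array (paging metadata first, then the observation records), so a sketch over a hard-coded sample of that shape works offline; the figures below are placeholders, not real GDP values:

```python
import json

# Sample response mirroring the World Bank v2 JSON shape:
# element 0 is paging metadata, element 1 is the list of observations.
# Figures are placeholders, not real GDP values.
sample = json.loads("""
[
  {"page": 1, "pages": 1, "per_page": 50, "total": 2},
  [
    {"date": "2023", "value": 1230000000000.0},
    {"date": "2022", "value": null}
  ]
]
""")

metadata, records = sample
# Keep only observations with a reported value, keyed by year.
series = {rec["date"]: rec["value"]
          for rec in records if rec["value"] is not None}
```

Dropping null values mirrors the column selection done in the Power Query editor: recent years are often published with no value until the release lands.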
20.11.4 Step 4 — Web Page (HTML Table)
Power BI can extract HTML tables from a public web page without any external scraping tool:
- Home → Get Data → Web, paste a URL such as a Wikipedia page listing Indian states by population.
- Power BI’s Navigator displays each detected table on the page.
- Tick the table you want and click Transform Data.
- Clean the data: remove citation footnotes, change types, rename columns.
This is the practical answer to web scraping in a no-code BI workflow. It works for any well-formed HTML table; for less structured pages, Power BI’s Add Table Using Examples (under Get Data → Web) lets the analyst train an extractor with two or three example values.
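The same table extraction can be scripted where no BI tool is at hand. A minimal sketch using only Python's standard library, over a small inline table standing in for a fetched page (the figures are illustrative):

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the cell text of every <tr> into a list of rows."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.in_cell = [], None, False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.in_cell = True
    def handle_endtag(self, tag):
        if tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None
        elif tag in ("td", "th"):
            self.in_cell = False
    def handle_data(self, data):
        if self.in_cell:
            self.row.append(data.strip())

html = """
<table>
  <tr><th>State</th><th>Population</th></tr>
  <tr><td>Uttar Pradesh</td><td>199812341</td></tr>
  <tr><td>Maharashtra</td><td>112374333</td></tr>
</table>
"""
parser = TableExtractor()
parser.feed(html)
# parser.rows -> [['State', 'Population'], ['Uttar Pradesh', '199812341'], ...]
```

For real pages, pandas.read_html or BeautifulSoup does the same job with less code; the stdlib version is shown so the mechanics of the extraction are visible.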
20.11.5 Step 5 — External Industry Data (Excel)
Industry associations and consultancies publish reports as Excel workbooks. Power BI imports them directly:
- Home → Get Data → Excel Workbook, select industry_benchmarks.xlsx.
- Pick the relevant sheet from the Navigator.
- Transform Data to remove header rows, convert ranges to tables, and tidy types.
The discipline is the same as for any tabular import — verify the schema, document the source, and load only the columns the report actually needs.
20.11.6 Step 6 — Combine and Visualise
In the Power BI Model view:
- Define relationships across the five sources using common keys (state, month, store).
- Mark the date table as a date table.
- Build a single visual that exercises all five sources — for example, a scatter plot of online traffic by state (internal database + web HTML) coloured by industry growth rate (external Excel) and sized by GDP per capita (REST API).
The single chart now stands on five sources and demonstrates the full collection workflow in one artefact.
20.11.7 Step 7 — Connect to the Visualisation Layer
The hands-on illustrates a recurring pattern in real analytics work:
- Internal sources carry the firm’s own activity at high granularity but are inward-looking.
- External sources add the macro context — population, GDP, industry trends — that internal data cannot produce.
- The dashboard or report is what brings the two together visually.
A serious data-collection programme is therefore not about choosing one source over another; it is about combining them so the resulting visualisation tells the audience something neither source alone could say.
The Power BI file (yuvijen-multi-source.pbix), the source CSV, the JSON URL, the HTML page reference, and the Excel file, together with screen recordings of each Get Data workflow, will be embedded here.
Summary
| Concept | Description |
|---|---|
| Foundations | |
| Why Data Collection Matters | Every analytical conclusion is only as good as the data it rests on; method shapes what data can support |
| Categories of Methods | |
| Surveys | Structured instrument for collecting new data directly from people |
| Web Scraping | Automated extraction of data from public web pages |
| APIs | Contract a system offers for retrieving or modifying its data programmatically |
| Databases | Store of data already captured by operational systems and integrated for analysis |
| Sensors and IoT | Capture physical-world measurements at high velocity |
| Logs and Instrumentation | Records produced by running systems; behavioural and operational telemetry |
| Qualitative Methods | Interviews, observation, diary studies; provide depth and explanation |
| Survey Forms and Pillars | |
| Cross-Sectional Survey | Single point in time across a sample of respondents |
| Longitudinal Survey | Same respondents surveyed repeatedly over time |
| Repeated Cross-Sectional | New respondents drawn from the same population at successive time points |
| Online Mode | Online mode is fast and cheap but sample-biased toward digital users |
| Telephone Mode | Telephone mode reaches non-digital populations but is more costly |
| In-Person Mode | In-person mode is best for sensitive subjects but slowest and most expensive |
| Question Design | Clear unambiguous single-barrelled questions with appropriate scales |
| Sampling | Random or stratified random for inference; convenience for exploration; quota for representativeness |
| Mode and Channel | Online for cost, in-person for sensitivity, mixed-mode for hard-to-reach populations |
| Non-Response | An eighty per cent non-response can render a survey unrepresentative regardless of size |
| Survey Tools | Google Forms, SurveyMonkey, Qualtrics, Typeform; ODK and KoboToolbox for field research |
| Survey Pitfalls | |
| Leading Questions | Pitfall of wording that suggests an answer |
| Double-Barrelled Questions | Pitfall of asking two things at once such as fast-and-friendly |
| Acquiescence Bias | Pitfall of agree-disagree formats where respondents tend to agree |
| Recall Bias | Pitfall of asking about events more than a few weeks past |
| Order Effects | Pitfall of question or option order shifting answers |
| Web Scraping Toolkit | |
| HTML Parsers | BeautifulSoup, lxml, rvest, jsoup for parsing web HTML |
| HTTP Clients | requests, httpx, aiohttp, httr for fetching pages |
| Headless Browsers | Selenium, Playwright, Puppeteer for sites that render through JavaScript |
| Crawling Frameworks | Scrapy, Apache Nutch, Heritrix for crawling at scale |
| Polite Crawling | Honour robots.txt, throttle, identify the crawler, respect terms of service |
| Web Scraping Constraints | |
| robots.txt | Site file specifying which paths automated crawlers may visit |
| Terms of Service | Many sites prohibit scraping; violation can produce contractual liability |
| Copyright and Database Rights | Extracted content may be protected by copyright or sui generis database rights |
| Personal Data in Scraping | Scraped personal data is subject to GDPR, DPDPA, and similar privacy regulations |
| Server Load | Aggressive crawling can disrupt the site you depend on; throttle and cache |
| Scraper Detection and Blocking | Cloudflare and reCAPTCHA aggressively block scrapers; designing around protection often crosses ethical lines |
| API Styles | |
| REST | Resource-oriented HTTP-based JSON APIs; the standard pattern in modern web services |
| GraphQL | Single endpoint accepting structured queries; client specifies fields it wants |
| SOAP and XML-RPC | Older XML-based protocols, still common in regulated and legacy enterprise contexts |
| Webhooks | Server-pushed notifications when an event occurs; inverse of polling |
| Streaming APIs | WebSockets, Server-Sent Events, message brokers for continuous data feeds |
| API Practical Concerns | |
| API Authentication | API keys, OAuth 2.0, JWT, mutual TLS, IP allow-lists with their own setup |
| Rate Limits | Most APIs limit calls per second, per minute, per day; build in throttling and back-off |
| Pagination | Most APIs paginate large result sets; client must iterate through pages |
| API Versioning | Pin to a specific version and monitor deprecation notices |
| API Error Handling | Network failures and rate-limit hits are routine; design retries and idempotency |
| API Cost | Many commercial APIs charge per call or per record |
| Database Categories | |
| Operational Databases | OLTP systems optimised for many small writes and reads; systems of record |
| Analytical Warehouses | OLAP systems optimised for large aggregations and complex queries |
| Data Lakes and Lakehouses | Schema-on-read storage of raw and semi-structured data |
| NoSQL Stores | Document, key-value, column-family, and graph stores for flexible structures |
| Time-Series Databases | InfluxDB, TimescaleDB, kdb+ for high-frequency telemetry |
| Search Engines | Elasticsearch and OpenSearch for log and document retrieval |
| Database Toolkit | |
| SQL | Lingua franca of analytical databases; indispensable for any working analyst |
| ETL and ELT Tools | Informatica, Talend, Fivetran, Airbyte, dbt, custom pipelines for moving data |
| Notebooks and Connectors | Jupyter, Databricks, Tableau, Power BI, ODBC, JDBC connectors |
| Reverse ETL | Tools that move data back from the warehouse into operational systems |
| Other Methods | |
| Sensors and IoT Use | Industrial sensors, environmental monitors, fleet telematics, wearables, smart-city |
| Logs and Instrumentation Use | Web-server, application, mobile app telemetry, clickstream; largest behavioural source |
| Qualitative Methods Use | Provide depth and why-questions that quantitative methods cannot answer |
| Choosing the Method | |
| Decision Rule for Method | Internal first, then API, then scraping, then survey, then sensors, then qualitative |
| Common Pitfalls | |
| New Survey When Existing Data Suffices | Pitfall of commissioning a new survey when the answer is already in the operational database |
| Scraping When API Exists | Pitfall of building fragile scrapers against sites that publish a stable API |
| Ignoring Terms of Service | Pitfall of treating technically possible as legally permitted |
| Survey Fatigue | Pitfall of long surveys that drop completion rates and bias the sample |
| Convenience Sampling | Pitfall of drawing population conclusions from the people who happened to respond |
| Question Bias | Pitfall of leading or double-barrelled questions that produce the expected answer |
| No Rate Limit or Back-Off | Pitfall of scrapers and clients that hit servers as fast as the network allows |
| Hard-Coded API Keys | Pitfall of credentials checked into source control or shared in scripts |
| No Versioning Discipline | Pitfall of consuming APIs without pinning to a version and breaking on source changes |
| Sensor Drift Unmonitored | Pitfall of treating sensor data as ground truth when instruments age, fail, or are recalibrated |
| Logs Treated as Forever | Pitfall of application logs collected and stored forever without lifecycle policies |
| No Privacy Review | Pitfall of personal data collected from any method without privacy review or consent |
| Single-Method Reliance | Pitfall of treating one method as the entire data programme rather than triangulating |