Which function can be used to read the entire contents of a text file line by line?

File Input/Output

File Input/Output (IO) requires 3 steps:

Open the file for read or write or both.
Read/Write data.
Close the file to free the resources.

Python provides built-in functions and modules to support these operations.

Opening/Closing a File

open(file, [mode='r']) -> fileObj: Open the file and return a file object. The available modes are: 'r' (read-only - default), 'w' (write - erase all contents for existing file), 'a' (append), 'r+' (read and write). You can also use 'rb', 'wb', 'ab', 'rb+' for binary mode (raw-byte) operations. You can optionally specify the text encoding via keyword parameter encoding, e.g., encoding="utf-8".
fileObj.close(): Flush and close the file stream.
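
The modes and the encoding parameter described above can be sketched as follows. This is a minimal illustration, not a definitive recipe: the file name greeting.txt and the temporary directory are assumptions made for the example.

```python
import os
import tempfile

# Hypothetical file inside a throwaway directory (an assumption for this sketch).
path = os.path.join(tempfile.mkdtemp(), 'greeting.txt')

# 'w' creates the file (or erases an existing one); the encoding is given explicitly.
f = open(path, 'w', encoding='utf-8')
f.write('café\n')
f.close()

# 'a' appends to the end of the file instead of erasing it.
f = open(path, 'a', encoding='utf-8')
f.write('thé\n')
f.close()

# 'r' is the default mode; the text is decoded using the given encoding.
f = open(path, encoding='utf-8')
content = f.read()
f.close()

# 'rb' returns raw bytes; no decoding is performed.
f = open(path, 'rb')
raw = f.read()
f.close()

print(len(content))   # Number of characters
print(len(raw))       # Number of bytes (larger, as each accented letter takes 2 bytes in UTF-8)
```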

Reading/Writing Text Files

The fileObj returned after the file is opened maintains a file pointer. It initially positions at the beginning of the file and advances whenever read/write operations are performed.

Reading Line/Lines from a Text File

fileObj.readline() -> str: (most commonly-used) Read the next line (up to and including the newline) and return it as a string (including the newline). It returns an empty string at the end-of-file (EOF).
fileObj.readlines() -> [str]: Read all lines into a list of strings.
fileObj.read() -> str: Read the entire file into a string.

Writing Line to a Text File

fileObj.write(str) -> int: Write the given string to the file and return the number of characters written. You need to explicitly terminate the str with a '\n', if needed. The '\n' will be translated to the platform-dependent newline ('\r\n' for Windows or '\n' for Unixes/Mac OS).

Examples

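>>> f = open('test.txt', 'w')   # Create (or truncate) the file for writing
>>> f.write('apple\n')          # Returns the number of characters written
6
>>> f.write('orange\n')
7
>>> f.write('pear\n')
5
>>> f.close()                   # Flush and close the file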



>>> f = open('test.txt', 'r')
>>> f.readline()         
'apple\n'
>>> f.readlines()        
['orange\n', 'pear\n']
>>> f.readline()         
''
>>> f.close()

 
>>> f = open('test.txt', 'r')
>>> f.read()              
'apple\norange\npear\n'
>>> f.close()


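>>> f = open('test.txt')
>>> line = f.readline()       # Read the first line
>>> while line:               # An empty string signals EOF
        line = line.rstrip()  # Strip the trailing newline (and spaces)
        print(line)
        line = f.readline()   # Read the next line
apple
orange
pear
>>> f.close()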

Processing Text File Line-by-Line

We can use a with-statement to open a file, which will be closed automatically upon exit, and a for-loop to read line-by-line as follows:

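with open('path/to/file.txt', 'r') as f:  # The file is closed automatically on exit
    for line in f:                        # Read line-by-line
        line = line.strip()               # Strip the leading/trailing whitespaces and newline
        # Process the line here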

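The with-statement is roughly equivalent to the following try-finally statement:

f = open('path/to/file.txt')   # Open before the try, so that f is always defined in finally
try:
    for line in f:
        line = line.strip()
        # Process the line here
finally:
    f.close()                  # Always close the file, even if an exception occurs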

Example: Line-by-line File Copy

The following script copies a file into another line-by-line, prepending each line with the line number.

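"""
Copy a text file line-by-line, prepending each line with its line number.

Usage: python3 <this_script> <fileIn> <fileOut>
"""
import sys
import os

def main():
    # Check and retrieve the command-line arguments
    if len(sys.argv) != 3:
        print(__doc__)
        sys.exit(1)   # A non-zero value indicates abnormal termination
    fileIn  = sys.argv[1]
    fileOut = sys.argv[2]

    # Verify that the source file exists
    if not os.path.isfile(fileIn):
        print("error: {} does not exist".format(fileIn))
        sys.exit(1)

    # Ask before overwriting an existing destination file
    if os.path.isfile(fileOut):
        print("{} exists. Overwrite (y/n)?".format(fileOut))
        reply = input().strip().lower()
        if not reply.startswith('y'):   # Also handles an empty reply
            sys.exit(1)

    # Process the file line-by-line
    with open(fileIn, 'r') as fpIn, open(fileOut, 'w') as fpOut:
        lineNumber = 0
        for line in fpIn:
            lineNumber += 1
            line = line.rstrip()   # Strip the trailing spaces and newline
            fpOut.write("{}: {}\n".format(lineNumber, line))
        print("Number of lines: {}\n".format(lineNumber))

if __name__ == '__main__':
    main()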

Binary File Operations

[TODO] Intro

fileObj.tell() -> int: returns the current stream position. The current stream position is the number of bytes from the beginning of the file in binary mode, and an opaque number in text mode.
fileObj.seek(offset): sets the current stream position to offset bytes from the beginning of the file.

For example [TODO]
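
A minimal sketch of tell() and seek() on a file opened in binary mode. The file name data.bin and its contents are assumptions made for the illustration.

```python
import os
import tempfile

# Hypothetical binary file in a throwaway directory (an assumption for this sketch).
path = os.path.join(tempfile.mkdtemp(), 'data.bin')

with open(path, 'wb') as f:
    f.write(b'0123456789')      # 10 bytes

with open(path, 'rb') as f:
    start = f.tell()            # 0: positioned at the beginning
    first = f.read(4)           # b'0123'
    after_read = f.tell()       # 4: advanced by the 4 bytes read
    f.seek(7)                   # Jump to byte offset 7 from the beginning
    tail = f.read()             # b'789': read to the end
    end = f.tell()              # 10: positioned at the end-of-file

print(start, first, after_read, tail, end)
```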

Directory and File Management

In Python, directory and file management are supported by modules os, os.path, shutil, ...

Path Operations Using Module os.path

In Python, a path could refer to:

a file,
a directory, or
a symlink (symbolic link).

A path could be absolute (beginning with root) or relative to the current working directory (CWD).

The path separator is platform-dependent (Windows uses '\', while Unixes/Mac OS use '/'). The os.path module supports platform-independent operations on paths, by handling the path separator intelligently.

Checking Path Existence and Type

os.path.exists(path) -> bool: Check if the given path exists.
os.path.isfile(file_path), os.path.isdir(dir_path), os.path.islink(link_path) -> bool: Check if the given path is a file, a directory, or a symlink.

For example,

>>> import os
>>> os.path.exists('/usr/bin')  
True
>>> os.path.isfile('/usr/bin')  
False
>>> os.path.isdir('/usr/bin')   
True

Forming a New Path

The path separator is platform-dependent (Windows uses '\', while Unixes/Mac OS use '/'). For portability, it is important NOT to hardcode the path separator. The os.path module supports platform-independent operations on paths, by handling the path separator intelligently.

os.path.sep: the path separator of the current system.
os.path.join(path, *paths): Form and return a path by joining one or more path components by inserting the platform-dependent path separator ('/' or '\'). To form an absolute path, you need to begin with an os.path.sep, as root.

For example,

>>> import os
>>> print(os.path.sep)    
/


>>> print(os.path.join(os.path.sep, 'etc', 'apache2', 'httpd.conf'))
/etc/apache2/httpd.conf


>>> print(os.path.join('..', 'apache2', 'httpd.conf'))
../apache2/httpd.conf

Manipulating Directory-name and Filename

os.path.dirname(path): Return the directory name of the given path (file, directory or symlink). The returned directory name could be absolute or relative, depending on the path given.
os.path.abspath(path): Return the absolute path name (starting from the root) of the given path. This could be an absolute filename, an absolute directory-name, or an absolute symlink.

For example, to form an absolute path of a file called out.txt in the same directory as in.txt, you may extract the absolute directory name of the in.txt, then join with out.txt, as follows:

os.path.join(os.path.dirname(os.path.abspath('in.txt')), 'out.txt')   # Absolute path

os.path.join(os.path.dirname('in.txt'), 'out.txt')                    # Relative path ('out.txt', as dirname is '' here)

For example,

import os

print('__file__:', __file__)     
print('dirname():', os.path.dirname(__file__))  

print('abspath():', os.path.abspath(__file__))  
print('dirname(abspath()):', os.path.dirname(os.path.abspath(__file__)))

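When a module is loaded from a file in Python, __file__ is set to the pathname of that file (not the module name). For a script, this is the path as given on the command line; since Python 3.9, it is converted to an absolute path. Try running this script with various __file__ references and study their output: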

$ python3 ./test_ospath.py
$ python3 test_ospath.py
$ python3 ../parent_dir/test_ospath.py  
$ python3 /path/to/test_ospath.py

Handling Symlink (Unixes/Mac OS)

os.path.realpath(path): (for symlinks) Similar to abspath(), but return the canonical path, eliminating any symlinks encountered.

For example,

import os

print('__file__:', __file__)
print('abspath():', os.path.abspath(__file__))  
print('realpath():', os.path.realpath(__file__))

$ python3 test_realpath.py
# Same output for abspath() and realpath() because there is no symlink


$ ln -s test_realpath.py test_realpath_link.py

$ python3 test_realpath_link.py
#abspath(): /path/to/test_realpath_link.py
#realpath(): /path/to/test_realpath.py (symlink resolved)

Directory & File Management Using Modules os and shutil

The modules os and shutil provide an interface to the Operating System and System Shell.

However,

If you just want to read or write a file, use built-in function open().
If you just want to manipulate paths (files, directories and symlinks), use os.path module.
If you want to read all the lines in all the files on the command-line, use fileinput module.
To create temporary files/directories, use tempfile module.

Directory Management

os.getcwd(): Return the current working directory (CWD).
os.chdir(dir_path): Change the CWD.
os.mkdir(dir_path, mode=0o777): Create a directory with the given mode expressed in octal (which will be further masked by the umask). mode is ignored in Windows. Take note that Python 3 writes octal literals as 0o777, not 0777.
os.makedirs(dir_path, mode=0o777): Similar to mkdir, but create the intermediate sub-directories, if needed.
os.rmdir(dir_path): Remove an empty directory. You could use os.path.isdir(path) to check if the path exists.
shutil.rmtree(dir_path): Remove a directory and all its contents.

File Management

os.rename(src_file, dest_file): Rename a file.
os.remove(file) or os.unlink(file): Remove the file. You could use os.path.isfile(file) to check if the file exists.

For example,

>>> import os
>>> dir(os)          
......
>>> help(os)         
......
>>> help(os.getcwd)  
......

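>>> os.getcwd()                      # Get the current working directory
... current working directory ...
>>> os.listdir()                     # List the contents of the current directory
... contents of current directory ...
>>> os.chdir('test-python')          # Change directory
>>> exec(open('hello.py').read())    # Run a Python script
>>> os.system('ls -l')               # Run a shell command
>>> os.name                          # Name of the OS-dependent module in use
'posix'
>>> os.mkdir('sub_dir')              # Create a sub-directory
>>> os.makedirs('/path/to/sub_dir')  # Create a directory and its intermediate directories
>>> os.remove('filename')            # Remove a file
>>> os.rename('oldFile', 'newFile')  # Rename a file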
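
The directory and file functions above can be tried safely inside a temporary directory. The following is a minimal sketch; the names a, b, c, old.txt and new.txt are assumptions made for the illustration.

```python
import os
import shutil
import tempfile

base = tempfile.mkdtemp()                 # Throwaway working directory

nested = os.path.join(base, 'a', 'b')
os.makedirs(nested)                       # Creates 'a' and 'a/b' in one call
single = os.path.join(base, 'c')
os.mkdir(single)                          # Creates one level only

old_file = os.path.join(nested, 'old.txt')
new_file = os.path.join(nested, 'new.txt')
with open(old_file, 'w') as f:
    f.write('hello\n')

os.rename(old_file, new_file)             # Rename the file
renamed = os.path.isfile(new_file) and not os.path.exists(old_file)

os.remove(new_file)                       # Remove the file
os.rmdir(single)                          # Works because 'c' is empty
shutil.rmtree(base)                       # Removes 'base' and everything below it
cleaned = not os.path.exists(base)

print(renamed, cleaned)
```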

List a Directory

os.listdir(path='.') -> [path]: list all the entries in a given directory (exclude '.' and '..'), default to the current directory.

For example,

>>> import os
>>> help(os.listdir)
......
>>> os.listdir()    
[..., ..., ...]


>>> for f in sorted(os.listdir('/usr')): print(f)
......
>>> for f in sorted(os.listdir('/usr')): print(os.path.join('/usr', f))  # abspath(f) would wrongly resolve against the CWD
......

List a Directory Recursively via os.walk()

os.walk(top, topdown=True, onerror=None, followlinks=False): recursively list all the entries starting from top.

For example,

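"""
List the given directory recursively (default to the current directory).

Usage: python3 <this_script> [dir]
"""
import sys
import os

def main():
    # Process the command-line argument
    if len(sys.argv) > 2:
        print(__doc__)
        sys.exit(1)        # A non-zero value indicates abnormal termination
    elif len(sys.argv) == 2:
        dir = sys.argv[1]  # Directory given on the command line
    else:
        dir = '.'          # Default to the current directory

    # Verify that the directory exists
    if not os.path.isdir(dir):
        print('error: {} does not exist'.format(dir))
        sys.exit(1)

    # Each iteration yields a directory, its sub-directories and its files
    for curr_dir, subdirs, files in os.walk(dir):
        abs_curr_dir = os.path.abspath(curr_dir)
        print('D:', abs_curr_dir)
        for subdir in sorted(subdirs):
            # Join with the current directory; abspath(subdir) alone would resolve against the CWD
            print('SD:', os.path.join(abs_curr_dir, subdir))
        for file in sorted(files):
            print(os.path.join(abs_curr_dir, file))

if __name__ == '__main__':
    main()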

List a Directory Recursively via Module glob (Python 3.5)

[TODO] Intro

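"""
List the given directory recursively using module glob (default to the current directory).

Usage: python3 <this_script> [dir]
"""
import sys
import os
import glob  # The recursive '**' pattern needs Python 3.5 or later

def main():
    # Process the command-line argument
    if len(sys.argv) > 2:
        print(__doc__)
        sys.exit(1)        # A non-zero value indicates abnormal termination
    elif len(sys.argv) == 2:
        dir = sys.argv[1]  # Directory given on the command line
    else:
        dir = '.'          # Default to the current directory

    # Verify that the directory exists
    if not os.path.isdir(dir):
        print('error: {} does not exist'.format(dir))
        sys.exit(1)

    # List all the *.txt files under dir, recursively
    for file in glob.glob(os.path.join(dir, '**', '*.txt'), recursive=True):
        print(file)

    print('----------------------------')

    # List all the entries (directories and files) under dir, recursively
    for file in glob.glob(os.path.join(dir, '**'), recursive=True):
        if os.path.isdir(file):
            print('D:', file)
        else:
            print(file)

if __name__ == '__main__':
    main()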

Copying File

shutil.copyfile(src, dest): Copy from src to dest.
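
A minimal sketch of shutil.copyfile(). The file names in.txt and out.txt and the temporary directory are assumptions made for the illustration.

```python
import os
import shutil
import tempfile

work_dir = tempfile.mkdtemp()             # Throwaway working directory
src = os.path.join(work_dir, 'in.txt')    # Hypothetical source file
dest = os.path.join(work_dir, 'out.txt')  # Hypothetical destination file

with open(src, 'w') as f:
    f.write('apple\norange\n')

# Copy the contents of src to dest; dest is overwritten if it exists.
# The call returns the destination path.
result = shutil.copyfile(src, dest)

with open(dest) as f:
    copied = f.read()

print(result == dest, repr(copied))
```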

Shell Command [TODO]

os.system(command_str): Run a shell command. (In Python 3, the subprocess module, e.g., subprocess.run(), is recommended instead.)
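
As a sketch of the subprocess alternative, the following runs a child Python interpreter and captures its output. Running sys.executable is a choice made only to keep the example portable; the capture_output and text parameters assume Python 3.7 or later.

```python
import subprocess
import sys

# Run a command given as a list of arguments (no shell is involved).
# check=True raises CalledProcessError if the exit status is non-zero.
result = subprocess.run(
    [sys.executable, '-c', 'print("hello from child")'],
    capture_output=True,   # Collect stdout and stderr instead of printing them
    text=True,             # Decode the output as str rather than bytes
    check=True,
)

output = result.stdout.strip()
print(result.returncode, output)
```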

Environment Variables [TODO]

os.getenv(varname, value=None): Returns the environment variable if it exists, or value if it doesn't, with default of None.
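os.putenv(varname, value): Set environment variable to value. Take note that this does not update os.environ; assigning to os.environ[varname] is generally preferable.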
os.unsetenv(varname): Delete (Unset) the environment variable.
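
A minimal sketch of reading and setting an environment variable via os.getenv() and the os.environ mapping. The variable name DEMO_SETTING is hypothetical.

```python
import os

name = 'DEMO_SETTING'                    # Hypothetical variable name
os.environ.pop(name, None)               # Make sure it is not set to begin with

missing = os.getenv(name)                # None, as the variable does not exist
fallback = os.getenv(name, 'default')    # 'default', the value given as fallback

os.environ[name] = '42'                  # Set (values are always strings)
present = os.getenv(name)                # '42'

del os.environ[name]                     # Unset
gone = os.getenv(name)                   # None again

print(missing, fallback, present, gone)
```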

fileinput Module

The fileinput module provides support for processing lines of input from one or more files given in the command-line arguments (sys.argv). For example, create the following script called "test_fileinput.py":

import fileinput

def main():
    
    lineNumber = 0
    for line in fileinput.input():
        line = line.rstrip()   
        lineNumber += 1
        print('{}: {}'.format(lineNumber, line))

if __name__ == '__main__':
    main()
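
The script above reads the files named on the command line, e.g., "python3 test_fileinput.py file1.txt file2.txt". As a self-contained sketch, the file list can also be passed explicitly via the files parameter; the file names a.txt and b.txt below are assumptions made for the illustration.

```python
import fileinput
import os
import tempfile

# Create two small hypothetical input files in a throwaway directory.
tmp_dir = tempfile.mkdtemp()
paths = []
for name, text in [('a.txt', 'apple\norange\n'), ('b.txt', 'pear\n')]:
    path = os.path.join(tmp_dir, name)
    with open(path, 'w') as f:
        f.write(text)
    paths.append(path)

# Iterate over the lines of all the files, as if they were one stream.
numbered = []
with fileinput.input(files=paths) as lines:
    for line in lines:
        # filename() is the file being read; lineno() is cumulative across files
        numbered.append('{}:{}: {}'.format(
            os.path.basename(lines.filename()), lines.lineno(), line.rstrip()))

for entry in numbered:
    print(entry)
```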

Text Processing

For simple text string operations such as string search and replacement, you can use the built-in string functions (e.g., str.replace(old, new)). For complex pattern search and replacement, you need to master regular expression (regex).

String Operations

The built-in class str provides many member functions for text string manipulation. Suppose that s is a str object.

Strip whitespaces (blank, tab and newline)

s.strip() -> str: Return a copy of the string s with leading and trailing whitespaces removed. Whitespaces include blank, tab and newline.
s.strip([chars]) -> str: Strip the leading/trailing characters given, instead of whitespaces.
s.rstrip(), s.lstrip() -> str: Strip the right (trailing) whitespaces and the left (leading) whitespaces, respectively.

s.rstrip() is the most commonly-used to strip the trailing spaces/newline. The leading whitespaces are usually significant.
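
A short sketch of the three strip functions; the sample strings are made up for the illustration.

```python
line = '    indented text \t\n'

right = line.rstrip()            # Trailing whitespaces removed, indentation kept
both = line.strip()              # Leading and trailing whitespaces removed
left = line.lstrip()             # Leading whitespaces removed only
custom = '--title--'.strip('-')  # Strip the given characters instead of whitespaces

print(repr(right))   # '    indented text'
print(repr(both))    # 'indented text'
print(repr(left))    # 'indented text \t\n'
print(repr(custom))  # 'title'
```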

Uppercase/Lowercase

s.upper(), s.lower() -> str: Return a copy of string s converted to uppercase and lowercase, respectively.
s.isupper(), s.islower() -> bool: Check if the string is uppercase/lowercase, respectively.

Find

s.find(key_str, [start], [end]) -> int|-1: Return the lowest index in slice s[start:end] (default to entire string); or -1 if not found.
s.index(key_str, [start], [end]) -> int|ValueError: Similar to find(), but raises ValueError if not found.
s.startswith(key_str, [start], [end]), s.endswith(key_str, [start], [end]) -> bool: Check if the string begins or ends with key_str.

For example,

>>> s = '/test/in.txt'
>>> s.find('in')
6
>>> s[0 : s.find('in')] + 'out.txt'
'/test/out.txt'

Find and Replace

s.replace(old, new, [count]) -> str: Return a copy with all occurrences of old replaced by new. The optional parameter count limits the number of occurrences to replace, default to all occurrences.

str.replace() is ideal for simple text string replacement, without the need for pattern matching.

For example,

>>> s = 'hello hello hello, world'
>>> help(s.replace)
>>> s.replace('ll', '**')
'he**o he**o he**o, world'
>>> s.replace('ll', '**', 2)
'he**o he**o hello, world'

Split into Tokens and Join

s.split([sep], [maxsplit=-1]) -> [str]: Return a list of words using sep as delimiter string. The default delimiter is whitespaces (blank, tab and newline). The maxsplit limits the maximum number of split operations; the default of -1 means no limit.
sep.join([str]) -> str: Reverse of split(). Join the list of strings with sep as separator.

For example,

>>> 'apple, orange, pear'.split()       
['apple,', 'orange,', 'pear']
>>> 'apple, orange, pear'.split(', ')   
['apple', 'orange', 'pear']
>>> 'apple, orange, pear'.split(', ', maxsplit=1)  
['apple', 'orange, pear']

>>> ', '.join(['apple', 'orange, pear'])
'apple, orange, pear'

Regular Expression in Module re

References:

Python's Regular Expression HOWTO @ https://docs.python.org/3/howto/regex.html.
Python's re - Regular expression operations @ https://docs.python.org/3/library/re.html.

I assume that you are familiar with regex, otherwise, you could read:

"Regex By Examples" for a summary of regex syntax and examples.
"Regular Expressions" for full coverage.

The re module provides support for regular expressions (regex).

>>> import re
>>> dir(re)   
......
>>> help(re)  
......

Backslash (\), Python Raw String r'...' vs Regular String

Regex's syntax uses backslash (\):

for metacharacters such as \d (digit), \D (non-digit), \s (space), \S (non-space), \w (word), \W (non-word)
to escape special regex characters, e.g., \. for ., \+ for +, \* for *, \? for ?. You also need to write \\ to match \.

On the other hand, Python's regular strings also use backslash for escape sequences, e.g., \n for newline, \t for tab. Again, you need to write \\ for \.

To write the regex pattern \d+ (one or more digits) in a Python regular string, you need to write '\\d+'. This is cumbersome and error-prone.

Python's solution is using raw string with a prefix r in the form of r'...'. It ignores interpretation of the Python's string escape sequence. For example, r'\n' is '\'+'n' (two characters) instead of newline (one character). Using raw string, you can write r'\d+' for regex pattern \d+ (instead of regular string '\\d+').
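
The difference between raw strings and regular strings can be checked directly. This is a small sketch; the sample text is made up.

```python
import re

newline_len = len('\n')     # 1: a regular string interprets \n as a newline
raw_len = len(r'\n')        # 2: a raw string keeps the backslash and the 'n'

# Both spellings produce the same three-character regex pattern \d+
same_pattern = ('\\d+' == r'\d+')

digits = re.findall(r'\d+', 'a1b22c333')

print(newline_len, raw_len, same_pattern, digits)
```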

Furthermore, Python denotes parenthesized back references (or capturing groups) as \1, \2, \3, ..., which can be written as raw strings r'\1', r'\2' instead of regular strings '\\1' and '\\2'. Take note that some languages use $1, $2, ... for the back references.

I suggest that you use raw strings for regex pattern strings and replacement strings.

Compiling (Creating) a Regex Pattern Object

re.compile(regexStr, [modifiers]) -> regexObj: Compile a regex pattern into a regex object, which can then be used for matching operations.

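For example,

>>> import re
>>> p1 = re.compile(r'[1-9][0-9]*|0')        # An integer without leading zeros
>>> type(p1)                                 # Shown as <class 're.Pattern'> in Python 3.7+
<class '_sre.SRE_Pattern'>

>>> p2 = re.compile(r'^\w{6,10}$')           # A string of 6 to 10 word characters

>>> p3 = re.compile(r'xy*', re.IGNORECASE)   # x followed by zero or more y's, case-insensitive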

Invoking Regex Operations

You can invoke most of the regex functions in two ways:

regexObj.func(str): Apply compiled regex object to str, via SRE_Pattern's member function func().
re.func(regexObj|regexStr, str): Apply regex object (compiled) or regexStr (uncompiled) to str, via re's module-level function func(). These module-level functions are shortcuts to the above that do not require you to compile a regex object first. To apply modifiers to a regexStr, pass them via the optional flags argument (e.g., flags=re.IGNORECASE).
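
The two ways of invoking a regex function can be compared side by side. In this sketch, the sample text is made up, and the modifier is supplied to the module-level function through its flags keyword argument.

```python
import re

text = 'xY XYY z'

# Way 1: compile once (with the modifier), then call the pattern's method
pattern = re.compile(r'xy*', re.IGNORECASE)
via_object = pattern.findall(text)

# Way 2: call the module-level function, passing the modifier via flags
via_module = re.findall(r'xy*', text, flags=re.IGNORECASE)

# Without the modifier, only the lowercase 'x' matches
no_flag = re.findall(r'xy*', text)

print(via_object, via_module, no_flag)
```

Compiling is worthwhile when the same pattern is applied many times; for a one-off match, the module-level function is shorter.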

Find using findall()

regexObj.findall(str) -> [str]: Return a list of all the matching substrings.
re.findall(regexObj|regexStr, str) -> [str]: same as above.

For example,

>>> p1 = re.compile(r'[1-9][0-9]*|0')   
>>> p1.findall('123 456')
['123', '456']
>>> p1.findall('abc')
[]
>>> p1.findall('abc123xyz456_7_00')
['123', '456', '7', '0', '0']


>>> re.findall(r'[1-9][0-9]*|0', '123 456')  
['123', '456']
>>> re.findall(r'[1-9][0-9]*|0', 'abc')
[]
>>> re.findall(r'[1-9][0-9]*|0', 'abc123xyz456_7_00')
['123', '456', '7', '0', '0']

Replace using sub() and subn()

regexObj.sub(replaceStr, inStr, [count=0]) -> outStr: Substitute (Replace) the matched substrings in the given inStr with the replaceStr, up to count occurrences, with default of all.
regexObj.subn(replaceStr, inStr, [count=0]) -> (outStr, count): Similar to sub(), but return a new string together with the number of replacements in a 2-tuple.
re.sub(regexObj|regexStr, replaceStr, inStr, [count=0]) -> outStr: same as above.
re.subn(regexObj|regexStr, replaceStr, inStr, [count=0]) -> (outStr, count): same as above.

For example,

>>> p1 = re.compile(r'[1-9][0-9]*|0')   
>>> p1.sub(r'**', 'abc123xyz456_7_00')
'abc**xyz**_**_****'
>>> p1.subn(r'**', 'abc123xyz456_7_00')
('abc**xyz**_**_****', 5)    
>>> p1.sub(r'**', 'abc123xyz456_7_00', count=3)
'abc**xyz**_**_00'


>>> re.sub(r'[1-9][0-9]*|0', r'**', 'abc123xyz456_7_00')  
'abc**xyz**_**_****'
>>> re.sub(p1, r'**', 'abc123xyz456_7_00')  
'abc**xyz**_**_****'
>>> re.subn(p1, r'**', 'abc123xyz456_7_00', count=3)
('abc**xyz**_**_00', 3)
>>> re.subn(p1, r'**', 'abc123xyz456_7_00', count=10)  
('abc**xyz**_**_****', 5)

Notes: For simple string replacement, use str.replace(old, new, [count]) -> str which is more efficient. See above section.

Using Parenthesized Back-References \1, \2, ... in Substitution and Pattern

In Python, regex parenthesized back-references (capturing groups) are denoted as \1, \2, .... You could use raw string (e.g., r'\1') to avoid escaping backslash in regular string (e.g., '\\1').

For example,

>>> re.sub(r'(\w+) (\w+)', r'\2 \1', 'aaa bbb ccc')
'bbb aaa ccc'
>>> re.sub(r'(\w+) (\w+)', r'\2 \1', 'aaa bbb ccc ddd')
'bbb aaa ddd ccc'
>>> re.subn(r'(\w+) (\w+)', r'\2 \1', 'aaa bbb ccc ddd eee')
('bbb aaa ddd ccc eee', 2)


>>> re.subn(r'(\w+) \1', r'\1', 'hello hello world again again')
('hello world again', 2)

Find using search() and Match Object

regexObj.search(inStr, [begin], [end]) -> matchObj:
re.search(regexObj|regexStr, inStr) -> matchObj: The module-level function does not accept begin and end.

The search() returns a special Match object encapsulating the first match (or None if there is no match). You can then use the following methods to process the resultant Match object:

matchObj.group(): Return the matched substring.
matchObj.start(): Return the starting matched position (inclusive).
matchObj.end(): Return the ending matched position (exclusive).
matchObj.span(): Return a tuple of (start, end) matched position.

For example,

>>> p1 = re.compile(r'[1-9][0-9]*|0')
>>> inStr = 'abc123xyz456_7_00'
>>> m = p1.search(inStr)
>>> m
<_sre.SRE_Match object; span=(3, 6), match='123'>
>>> m.group()
'123'
>>> m.span()
(3, 6)
>>> m.start()
3
>>> m.end()
6


>>> m = p1.search(inStr, m.end())   # Search again, from the end of the previous match
>>> m
<_sre.SRE_Match object; span=(9, 12), match='456'>


>>> m = p1.search(inStr)
>>> while m:
        print(m, m.group())
        m = p1.search(inStr, m.end())
<_sre.SRE_Match object; span=(3, 6), match='123'>
123
<_sre.SRE_Match object; span=(9, 12), match='456'>
456
<_sre.SRE_Match object; span=(13, 14), match='7'>
7
<_sre.SRE_Match object; span=(15, 16), match='0'>
0
<_sre.SRE_Match object; span=(16, 17), match='0'>
0

To retrieve the back-references (or capturing groups) inside the Match object:

matchObj.groups(): return a tuple of captured groups (or back-references)
matchObj.group(n): return the capturing group n, where n starts at 1.
matchObj.lastindex: last index of the capturing group

>>> p2 = re.compile(r'(A)(\w+)', re.IGNORECASE)  # Two capturing groups; raw string for the pattern
>>> inStr = 'This is an apple.'
>>> m = p2.search(inStr)
>>> while m:
        print(m)
        print(m.group())    
        print(m.groups())   
        for idx in range(1, m.lastindex + 1):  
            print(m.group(idx), end=',')   
        print()
        m = p2.search(inStr, m.end())
<_sre.SRE_Match object; span=(8, 10), match='an'>
an
('a', 'n')
a,n,
<_sre.SRE_Match object; span=(11, 16), match='apple'>
apple
('a', 'pple')
a,pple,

Find using match() and fullmatch()

regexObj.match(inStr, [begin], [end]) -> matchObj:
regexObj.fullmatch(inStr, [begin], [end]) -> matchObj:
re.match(regexObj|regexStr, inStr) -> matchObj:
re.fullmatch(regexObj|regexStr, inStr) -> matchObj:

The search() matches anywhere in the given inStr[begin:end]. On the other hand, the match() matches from the start of inStr[begin:end] (similar to regex pattern ^...); while the fullmatch() matches the entire inStr[begin:end] (similar to regex pattern ^...$). The module-level functions do not accept begin and end; they always work on the entire inStr.

For example,

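>>> p1 = re.compile(r'[1-9][0-9]*|0')
>>> m = p1.match('aaa123zzz456')   # No match: the string does not start with a number
>>> m                              # None (nothing is printed)

>>> m = p1.match('123zzz456')      # Match at the start of the string
>>> m
<_sre.SRE_Match object; span=(0, 3), match='123'>


>>> m = p1.fullmatch('123456')     # The entire string matches
>>> m
<_sre.SRE_Match object; span=(0, 6), match='123456'>
>>> m = p1.fullmatch('123456abc')  # No match: the trailing 'abc' is not covered
>>> m                              # None (nothing is printed)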

Find using finditer()

regexObj.finditer(inStr) -> matchIterator
re.finditer(regexObj|regexStr, inStr) -> matchIterator

The finditer() is similar to findall(). The findall() returns a list of matched substrings. The finditer() returns an iterator of Match objects. For example,

>>> p1 = re.compile(r'[1-9][0-9]*|0')
>>> inStr = 'abc123xyz456_7_00'
>>> p1.findall(inStr)   
['123', '456', '7', '0', '0']
>>> for s in p1.findall(inStr):  
        print(s, end=' ')
123 456 7 0 0


>>> for m in p1.finditer(inStr):  
        print(m)
<_sre.SRE_Match object; span=(3, 6), match='123'>
<_sre.SRE_Match object; span=(9, 12), match='456'>
<_sre.SRE_Match object; span=(13, 14), match='7'>
<_sre.SRE_Match object; span=(15, 16), match='0'>
<_sre.SRE_Match object; span=(16, 17), match='0'>

>>> for m in p1.finditer(inStr):
        print(m.group(), end=' ')
123 456 7 0 0

Splitting String into Tokens

regexObj.split(inStr) -> [str]:
re.split(regexObj|regexStr, inStr) -> [str]:

The split() splits the given inStr into a list, using the regex's Pattern as delimiter (separator). For example,

>>> p1 = re.compile(r'[1-9][0-9]*|0')
>>> p1.split('aaa123bbb456ccc')
['aaa', 'bbb', 'ccc']

>>> re.split(r'[1-9][0-9]*|0', 'aaa123bbb456ccc')
['aaa', 'bbb', 'ccc']

Notes: For simple delimiter, use str.split([sep]), which is more efficient. See above section.

Web Scraping

References:

Beautiful Soup Documentation @ https://www.crummy.com/software/BeautifulSoup/bs4/doc/.

Web Scraping (or web harvesting or web data extraction) refers to reading the raw HTML page to retrieve desired data. Needless to say, you need to master HTML, CSS and JavaScript.

Python supports web scraping via packages requests and BeautifulSoup (bs4).

Install Packages

You could install the relevant packages using pip as follows:

$ pip install requests
$ pip install beautifulsoup4

Step 0: Inspect the Target Webpage

Press F12 on the target webpage to turn on the "F12 debugger".
Choose "Inspector".
Click the "Select" (the left-most icon with an arrow) and point your mouse at the desired part of the HTML page. Study the code.

Step 1: Send an HTTP GET request to the target URL to retrieve the raw HTML page using module requests

>>> import requests

>>> url = "http://your_target_webpage"


>>> response = requests.get(url)

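>>> type(response)
<class 'requests.models.Response'>
>>> response                 # Status code 200 means the request succeeded
<Response [200]>
>>> help(response)
......
>>> print(response.text)     # The body decoded as text (str)
......
>>> print(response.content)  # The raw body (bytes)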
......

Step 2: Parse the HTML Text into a Tree-Structure using BeautifulSoup and Search the Desired Data

>>> from bs4 import BeautifulSoup


>>> soup = BeautifulSoup(response.text, "html.parser")


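>>> type(soup)
<class 'bs4.BeautifulSoup'>
>>> help(soup)
......

>>> img_tag = soup.find('img')          # The first matching tag, or None
>>> img_tag
<img ... />

>>> img_tags = soup.find_all('img')     # All the matching tags, as a list
>>> img_tags
[<img ... />, <img ... />, <img ... />, ...]

>>> soup.find('div', attrs = {'id':'test'})          # The div with id="test"

>>> soup.find_all('div', attrs = {'class':'error'})  # All the div's with class="error"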

You could write out the selected data to a file:

with open(filename, 'w') as fp:
    for row in rows:
        fp.write(row + '\n')

You could also use csv module to write out rows of data with a header:

>>> import csv
>>> with open(filename, 'w', newline='') as fp:   # newline='' lets the csv module control the line endings
        writer = csv.DictWriter(fp, ['colHeader1', 'colHeader2', 'colHeader3'])
        writer.writeheader()
        for row in rows:
            writer.writerow(row)
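
For csv.DictWriter, each row must be a dictionary keyed by the column headers. A minimal sketch follows; the column names, the sample rows and the file name are all assumptions made for the illustration.

```python
import csv
import os
import tempfile

# Hypothetical output file in a throwaway directory.
filename = os.path.join(tempfile.mkdtemp(), 'fruits.csv')

# Hypothetical scraped data: one dictionary per row, keyed by column header.
rows = [
    {'name': 'apple', 'price': '1.20'},
    {'name': 'orange', 'price': '0.80'},
]

with open(filename, 'w', newline='') as fp:
    writer = csv.DictWriter(fp, fieldnames=['name', 'price'])
    writer.writeheader()      # Writes the line: name,price
    writer.writerows(rows)    # Writes one line per dictionary

# Read the file back to verify the round trip.
with open(filename, newline='') as fp:
    read_back = list(csv.DictReader(fp))

print(read_back)
```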

Step 3: Download Selected Document Using urllib.request

You may want to download documents such as text files or images.

>>> import urllib.request

>>> download_url = '.....'   # URL of the document to download
>>> file = '......'          # Local filename to save to
>>> urllib.request.urlretrieve(download_url, file)
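
To try urlretrieve() without touching the network, the sketch below "downloads" from a file:// URL that points at a local temporary file. The file names are assumptions made for the illustration; in real use, download_url would be an http(s) URL.

```python
import os
import tempfile
import urllib.request
from pathlib import Path

work_dir = tempfile.mkdtemp()                     # Throwaway working directory

# Hypothetical "remote" document, stored locally for this sketch.
source = Path(work_dir) / 'source.txt'
source.write_text('sample document\n')

download_url = source.as_uri()                    # A file:// URL
target = os.path.join(work_dir, 'copy.txt')       # Where to save the download

# Returns a tuple of (local filename, response headers)
saved_path, headers = urllib.request.urlretrieve(download_url, target)

with open(target) as f:
    downloaded = f.read()

print(saved_path == target, repr(downloaded))
```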

Step 4: Delay

To avoid spamming a website with download requests (and being flagged as a spammer), you need to pause your code for a while between requests.

>>> import time

>>> time.sleep(1)
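
One way to apply the delay is to sleep between consecutive requests inside the download loop. In this sketch, fetch_all() and the stand-in fetcher are hypothetical names introduced for the illustration; in real use, fetch could be a function that calls requests.get().

```python
import time

def fetch_all(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)   # Pause before every request except the first
        results.append(fetch(url))
    return results

# Stand-in fetcher so that the sketch runs without network access.
def fake_fetch(url):
    return 'page for ' + url

pages = fetch_all(['http://example.com/a', 'http://example.com/b'],
                  fetch=fake_fetch, delay_seconds=0.01)
print(pages)
```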

Which function can be used to read an entire line of text data?

The readline() function reads one line from the file and returns it as a string. It accepts an optional parameter n, which caps the number of characters (bytes, in binary mode) to be read.

Which method is used to read the contents of a file line by line?

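In Java, the readLine() method of java.io.BufferedReader reads a file line by line into a String. In Python, the equivalent is to iterate over the file object with a for-loop, or to call readline() repeatedly.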

Which function will read entire content of file?

The read() function, called without an argument, returns the entire contents of the file (from the current position) as a single string.

Which function can be used for reading entire content of a file into a string while using a file object for reading from a file?

read(): This function reads the entire file and returns a string. readline(): This function reads a single line from the file and returns it as a string.
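
The answers above can be compared in one short sketch, which reads the same file in four ways. The file name and its contents are assumptions made for the illustration.

```python
import os
import tempfile

# Hypothetical text file in a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), 'test.txt')
with open(path, 'w') as f:
    f.write('apple\norange\npear\n')

# 1. Iterate over the file object (reads line by line, memory-friendly)
with open(path) as f:
    by_iteration = [line.rstrip('\n') for line in f]

# 2. Call readline() repeatedly until it returns an empty string (EOF)
by_readline = []
with open(path) as f:
    line = f.readline()
    while line:
        by_readline.append(line.rstrip('\n'))
        line = f.readline()

# 3. readlines() returns all the lines as a list (newlines kept)
with open(path) as f:
    by_readlines = f.readlines()

# 4. read() returns the entire file as one string
with open(path) as f:
    whole = f.read()

print(by_iteration, by_readline, by_readlines, repr(whole))
```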
